OAPEN Metadata

Last updated 1 month ago

The OAPEN Metadata telescope collects data from the OAPEN Metadata feed. OAPEN enables libraries and aggregators to use the metadata of all available titles in the OAPEN Library. OAPEN metadata is available in different formats, and this telescope harvests the data in the XML format using OAPEN's Memo tool. See the OAPEN Metadata webpage for more information.

Dataset Name: onix
Table Name: onix
Table Type: Sharded
Average Runtime: 10 min
Average Download Size: 1-200 MB
Harvest Type: URI
Run Schedule: Weekly
Catch-up Missed Runs: ❌
Each Run Includes All Data: ✅

Telescope Configuration

Telescope kwargs

Fields passed as keyword arguments to the telescope upon instantiation.

Metadata URI (metadata_uri)

This field holds the URI for the publisher/collection. This is available via the Memo tool. We can specify it in the telescope config via kwargs:

kwargs: 
    metadata_uri: "https://memo.oapen.org/file/my_file.xml"

Elevate Related Products (related_products_elevation)

A boolean value ("True" | "False") that determines whether the transform step should elevate the feed's related products. We can specify it in the telescope config via kwargs:

kwargs:
    related_product_elevation: True

Related Product Manipulation

Some steps of the transform process manipulate the Related Products of the metadata feed in a way that alters the state of the input data. These processes deserve an explanation as it's not obvious what they're doing and why.

Related Product Elevation

The OAPEN Metadata telescope has the option to manipulate the feed's related product entries. Each Related Product in each entry is turned into its own entry. For each of these fabricated entries, the reference to itself as a related product is removed and replaced with the product identifier of the original entry. This can lead to many more entries than the original.

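The elevation process will only apply to a Related Product if:

  • The related product has an ISBN (product identifier code 15, as described in the ONIX codelist)

  • The relation code is "06" (i.e. alternative format, as described in the ONIX codelist)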

For example, if we have a Product that looks like this:

<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>

The related product is elevated, creating a new product and keeping the original:

<!-- Original -->
<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>

<!-- Elevated Related Product -->
<Product>
    <ProductIdentifier>
        <IDValue>200</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>100</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>
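The elevation shown above can be sketched in Python with the standard library's ElementTree. This is a minimal illustration, not the telescope's actual code: the function name elevate_related_products is hypothetical, and it assumes the elevated product carries only its identifier and a back-reference to the original, as in the example.

```python
import copy
import xml.etree.ElementTree as ET

ISBN_ID_TYPE = "15"        # ProductIDType for an ISBN, as used in the example above
ALTERNATIVE_FORMAT = "06"  # ProductRelationCode for "alternative format"


def elevate_related_products(products):
    """Return the original products followed by one new product per eligible related product.

    Hypothetical helper: the elevated product is built from scratch and holds only
    the related identifier plus a RelatedProduct pointing back at the original.
    """
    elevated = []
    for product in products:
        parent_id = product.find("ProductIdentifier")
        if parent_id is None:
            continue
        for related in product.findall("RelatedProduct"):
            related_id = related.find("ProductIdentifier")
            if related_id is None:
                continue
            # Only ISBN-identified, "alternative format" related products qualify.
            if related.findtext("ProductRelationCode") != ALTERNATIVE_FORMAT:
                continue
            if related_id.findtext("ProductIDType") != ISBN_ID_TYPE:
                continue
            new_product = ET.Element("Product")
            new_product.append(copy.deepcopy(related_id))
            # Replace the self-reference with a reference to the original product.
            back_reference = ET.SubElement(new_product, "RelatedProduct")
            ET.SubElement(back_reference, "ProductRelationCode").text = ALTERNATIVE_FORMAT
            back_reference.append(copy.deepcopy(parent_id))
            elevated.append(new_product)
    return list(products) + elevated


feed = ET.fromstring("""
<Products>
  <Product>
    <ProductIdentifier><IDValue>100</IDValue><ProductIDType>15</ProductIDType></ProductIdentifier>
    <RelatedProduct>
      <ProductRelationCode>06</ProductRelationCode>
      <ProductIdentifier><IDValue>200</IDValue><ProductIDType>15</ProductIDType></ProductIdentifier>
    </RelatedProduct>
  </Product>
</Products>
""")
result = elevate_related_products(feed.findall("Product"))
```

Here result holds the original product (ID 100) followed by the elevated product (ID 200), whose only Related Product points back to ID 100.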

Normalise Related Products

Related products retrieved from the OAPEN feed tend to have a format that is not consistent with ONIX 3.0 standards. For example, the following is an invalid implementation:

<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>

The only exception is when the IDValue elements are identical and the ProductIDType elements differ, in which case the format above is accepted.

The correct format is the following:

<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>
<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>

This is an important distinction as the ONIX parser will ignore any consecutive ProductIdentifier tags that have the same ProductIDType.
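One way to picture the normalisation is the following Python sketch, which splits a RelatedProduct holding several ProductIdentifier elements into one RelatedProduct per identifier. The function name normalise_related_products is hypothetical, and the handling of the exception (identical IDValue, distinct ProductIDType) is an assumption based on the description above rather than the telescope's actual code.

```python
import copy
import xml.etree.ElementTree as ET


def normalise_related_products(product):
    """Split multi-identifier RelatedProduct elements in place and return the product."""
    for related in list(product.findall("RelatedProduct")):
        identifiers = related.findall("ProductIdentifier")
        if len(identifiers) < 2:
            continue  # already in the expected one-identifier form
        values = {i.findtext("IDValue") for i in identifiers}
        types = [i.findtext("ProductIDType") for i in identifiers]
        # Assumed exception: same IDValue under distinct ProductIDTypes is left alone.
        if len(values) == 1 and len(set(types)) == len(types):
            continue
        position = list(product).index(related)
        product.remove(related)
        for offset, identifier in enumerate(identifiers):
            replacement = ET.Element("RelatedProduct")
            # Carry over everything except the identifiers (e.g. ProductRelationCode).
            for child in related:
                if child.tag != "ProductIdentifier":
                    replacement.append(copy.deepcopy(child))
            replacement.append(copy.deepcopy(identifier))
            product.insert(position + offset, replacement)
    return product


sample = ET.fromstring("""
<Product>
  <RelatedProduct>
    <ProductRelationCode>06</ProductRelationCode>
    <ProductIdentifier><IDValue>200</IDValue><ProductIDType>15</ProductIDType></ProductIdentifier>
    <ProductIdentifier><IDValue>300</IDValue><ProductIDType>15</ProductIDType></ProductIdentifier>
  </RelatedProduct>
</Product>
""")
normalised = normalise_related_products(sample)
```

After the call, the sample product contains two RelatedProduct elements, each with relation code 06 and a single ProductIdentifier.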

Related Product De-duplication

Occasionally, a product lists the same Related Product more than once, or lists a Related Product with the same identifier as its parent, effectively referencing itself. In either case, the duplicated Related Products are removed.

For example:

<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>100</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>

The above product has a copy of the Product as a Related Product (ID 100). It also has two identical Related Products (with ID 200). After the deduplication process, the product will look like the following:

<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>
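The de-duplication above can be sketched as follows. This is an illustrative Python version, not the telescope's implementation: the helper names are hypothetical, and it assumes that two Related Products count as duplicates when their ProductIDType and IDValue match, regardless of relation code.

```python
import xml.etree.ElementTree as ET


def identifier_key(element):
    """Return (ProductIDType, IDValue) for an element's own ProductIdentifier, or None."""
    identifier = element.find("ProductIdentifier")
    if identifier is None:
        return None
    return (identifier.findtext("ProductIDType"), identifier.findtext("IDValue"))


def deduplicate_related_products(product):
    """Remove self-references and repeated Related Products in place."""
    # Seeding the set with the parent's identifier makes self-references count as duplicates.
    seen = {identifier_key(product)}
    for related in list(product.findall("RelatedProduct")):
        key = identifier_key(related)
        if key is None:
            continue  # nothing to compare against, leave untouched
        if key in seen:
            product.remove(related)
        else:
            seen.add(key)
    return product


def related(id_value):
    """Build a RelatedProduct snippet for the demo below."""
    return (
        "<RelatedProduct><ProductRelationCode>06</ProductRelationCode>"
        f"<ProductIdentifier><IDValue>{id_value}</IDValue>"
        "<ProductIDType>15</ProductIDType></ProductIdentifier></RelatedProduct>"
    )


demo = ET.fromstring(
    "<Product><ProductIdentifier><IDValue>100</IDValue>"
    "<ProductIDType>15</ProductIDType></ProductIdentifier>"
    + related("100") + related("200") + related("200")
    + "</Product>"
)
deduplicated = deduplicate_related_products(demo)
```

Applied to the example product, only a single Related Product with ID 200 remains.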

Telescope Tasks

Data Download

The XML file containing the metadata is downloaded using the Metadata URI.

Note that if the metadata file is part-way through an update (occurring daily at +0000GMT and taking upwards of one hour), the XML file will be incomplete and invalid. The telescope has a fail-safe that attempts to resolve this at runtime, which can lead to much longer than normal 'download' times.
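A fail-safe of this kind could look like the sketch below: download the file, check that it is well-formed XML, and wait before retrying if it is not. The function names, the number of attempts and the wait time are all assumptions for illustration; the telescope's real retry logic may differ.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET


def fetch_bytes(uri, timeout=60):
    """Download the raw bytes behind a URI."""
    with urllib.request.urlopen(uri, timeout=timeout) as response:
        return response.read()


def download_complete_xml(uri, fetcher=fetch_bytes, attempts=5, wait_seconds=600, sleep=time.sleep):
    """Return the payload once it parses as XML, retrying while the file is mid-update.

    attempts and wait_seconds are illustrative defaults, not values taken from the telescope.
    """
    for attempt in range(1, attempts + 1):
        payload = fetcher(uri)
        try:
            ET.fromstring(payload)  # a truncated file raises ParseError
            return payload
        except ET.ParseError:
            if attempt < attempts:
                sleep(wait_seconds)  # give the upstream update time to finish
    raise RuntimeError(f"{uri} did not yield well-formed XML after {attempts} attempts")
```

Passing the fetcher and sleep functions as parameters keeps the retry logic testable without network access, and it makes clear why a run that overlaps the daily update takes far longer than usual.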

Data Transform

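The data transform step modifies the downloaded metadata into a valid ONIX format. This is done in a few steps:

  1. The XML is loaded and all unnecessary fields are removed. The fields needed by the BAD workflows are described by a corresponding schema (.json) file.

  2. The PersonName and InvertedPersonName fields are created (where possible) from the KeyNames and NamesBeforeKey fields.

  3. The resulting XML is run through the Python onixcheck validator, which reveals any remaining invalid products. These products are removed from the file, saved to a separate file, and uploaded to the transform bucket for storage/archiving.

  4. Any incorrectly formatted Related Products are fixed through the normalisation process.

  5. Duplicated Related Products are removed via the deduplication process.

  6. Optionally, the Related Products can be elevated through the Related Product elevation process.

  7. The XML is then parsed through the Java ONIX Parser, which results in a .jsonl file.

  8. Subject fields are collapsed (converted to a single semicolon-separated string) to match our expected input.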
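Two of the steps above, building the name fields and collapsing subjects, are simple enough to sketch in Python. The sketch below works on plain dictionaries and lists for brevity, which is an assumption about the data shape; the function names, the "given family" and "family, given" name layouts, and the "; " separator are likewise illustrative choices rather than confirmed details of the telescope.

```python
def add_person_names(contributor):
    """Return a copy of the contributor with PersonName and InvertedPersonName filled in.

    Names are only created when both KeyNames and NamesBeforeKey are present,
    and existing values are never overwritten.
    """
    result = dict(contributor)
    key_names = (contributor.get("KeyNames") or "").strip()
    names_before_key = (contributor.get("NamesBeforeKey") or "").strip()
    if key_names and names_before_key:
        result.setdefault("PersonName", f"{names_before_key} {key_names}")
        result.setdefault("InvertedPersonName", f"{key_names}, {names_before_key}")
    return result


def collapse_subjects(subjects):
    """Join non-empty subject strings into a single semicolon-separated string."""
    return "; ".join(s.strip() for s in subjects if s and s.strip())


author = add_person_names({"KeyNames": "Lovelace", "NamesBeforeKey": "Ada"})
subject_string = collapse_subjects(["History", " Mathematics ", ""])
```

With these inputs, author gains the PersonName "Ada Lovelace" and the InvertedPersonName "Lovelace, Ada", and subject_string becomes "History; Mathematics".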

BigQuery Load

The valid ONIX feed can now be loaded from the transform bucket into a BigQuery date-sharded table in the onix dataset (which will be created if it does not yet exist). There will be multiple onix_YYYYMMDD tables.
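The date-sharded naming can be illustrated with a small helper. The function name sharded_table_id is hypothetical, and which date the telescope uses for the shard suffix is not stated here, so the date passed in is an assumption.

```python
from datetime import date


def sharded_table_id(dataset, table, shard_date):
    """Build a dataset-qualified, date-sharded table name such as onix.onix_YYYYMMDD."""
    return f"{dataset}.{table}_{shard_date:%Y%m%d}"


table_id = sharded_table_id("onix", "onix", date(2024, 1, 7))
```

For the date 7 January 2024, table_id is "onix.onix_20240107"; each run adds another table with its own date suffix.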

Table Schema

The OAPEN Metadata table uses the same schema as the ONIX Telescope's table. See the ONIX Telescope schema.
