OAPEN Metadata

The OAPEN Metadata telescope collects data from the OAPEN Metadata feed. OAPEN enables libraries and aggregators to use the metadata of all available titles in the OAPEN Library. OAPEN metadata is available in different formats and this telescope harvests the data in the XML format. See the OAPEN Metadata webpage for more information.

Dataset Name

onix

Table Name

onix

Table Type

Sharded

Average Runtime

10 min

Average Download Size

1-200 MB

Harvest Type

URI

Run Schedule

Weekly

Catch-up Missed Runs

Each Run Includes All Data

Telescope Configuration

Telescope kwargs

Fields passed as keyword arguments to the telescope upon instantiation.

Metadata URI (metadata_uri)

This field holds the URI of the metadata feed for the publisher or collection. For example, OAPEN's ONIX feed URI is as follows:

kwargs: 
    metadata_uri: "https://library.oapen.org/download-export?format=onix"

The URI must either be known internally, or in some cases can be retrieved from the OAPEN Metadata page.

kwargs:
    related_product_elevation: True

The related_product_elevation field is a boolean value (True or False) that determines whether the transform step should elevate the feed's related products.

Some steps of the transform process manipulate the Related Products of the metadata feed in ways that alter the input data. These steps are explained here because it is not obvious what they do or why.

The OAPEN Metadata telescope has the option to manipulate the feed's related product entries. Each Related Product in each entry is turned into its own entry. For each of these fabricated entries, the reference to itself as a related product is removed and replaced with the product identifier of the original entry. This can lead to many more entries than the original.

The elevation process will only apply to a Related Product if:

  • The related product has an ISBN (product identifier code 15 as described in the codelist)

  • The relation code is "06" (i.e., alternative format as described in the codelist)

For example, if we have a Product that looks like this:

<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>

The related product is elevated, creating a new product and keeping the original:

<!-- Original -->
<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>

<!-- Elevated Related Product -->
<Product>
    <ProductIdentifier>
        <IDValue>200</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>100</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>
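The elevation rules above can be sketched in Python with the standard library's ElementTree. This is a minimal illustration and not the telescope's actual implementation: the function names (isbn_of, set_isbn, elevate_related_products) are hypothetical, and the sketch assumes each new product is built by copying the original and swapping the two ISBNs.

```python
import copy
import xml.etree.ElementTree as ET

ISBN_TYPE = "15"   # ProductIDType for an ISBN
ALT_FORMAT = "06"  # ProductRelationCode for "alternative format"


def isbn_of(element):
    """Return the ISBN in a direct ProductIdentifier child, or None."""
    for ident in element.findall("ProductIdentifier"):
        if ident.findtext("ProductIDType") == ISBN_TYPE:
            return ident.findtext("IDValue")
    return None


def set_isbn(element, value):
    """Overwrite the IDValue of every direct ISBN ProductIdentifier child."""
    for ident in element.findall("ProductIdentifier"):
        id_value = ident.find("IDValue")
        if ident.findtext("ProductIDType") == ISBN_TYPE and id_value is not None:
            id_value.text = value


def elevate_related_products(products):
    """Return the originals followed by one new product per eligible RelatedProduct."""
    elevated = []
    for product in products:
        parent_isbn = isbn_of(product)
        if parent_isbn is None:
            continue
        for related in product.findall("RelatedProduct"):
            related_isbn = isbn_of(related)
            # Only ISBN-identified, alternative-format relations are elevated.
            if related.findtext("ProductRelationCode") != ALT_FORMAT or related_isbn is None:
                continue
            new_product = copy.deepcopy(product)  # assumption: copy the original entry
            set_isbn(new_product, related_isbn)
            # Replace the self-reference with a reference back to the original.
            for rel in new_product.findall("RelatedProduct"):
                if isbn_of(rel) == related_isbn:
                    set_isbn(rel, parent_isbn)
            elevated.append(new_product)
    return list(products) + elevated
```

Calling elevate_related_products(root.findall("Product")) on a parsed feed leaves the original elements untouched, because every elevated product is a deep copy.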

Related products retrieved from the OAPEN feed tend to have a format that is not consistent with ONIX 3.0 standards. For example, the following is an invalid implementation:

<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>

The only exception to this format is if the IDValue elements are identical and the ProductIDType elements are different.

The correct way is the following:

<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>
<RelatedProduct>
    <ProductRelationCode></ProductRelationCode>
    <ProductIdentifier>
        <IDValue></IDValue>
        <ProductIDType></ProductIDType>
    </ProductIdentifier>
</RelatedProduct>

This is an important distinction as the ONIX parser will ignore any consecutive ProductIdentifier tags that have the same ProductIDType.
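The normalisation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the telescope's code: normalise_related_products is a hypothetical name, and the sketch assumes that identifiers sharing one IDValue stay together (the exception noted above) while each distinct IDValue gets its own RelatedProduct.

```python
import copy
import xml.etree.ElementTree as ET


def normalise_related_products(product):
    """Split RelatedProducts that hold several distinct IDValues, in place."""
    for related in product.findall("RelatedProduct"):
        # Collect the distinct IDValues in document order.
        values = []
        for ident in related.findall("ProductIdentifier"):
            value = ident.findtext("IDValue")
            if value not in values:
                values.append(value)
        if len(values) <= 1:
            continue  # already valid, including the same-IDValue exception

        # Replace the invalid element with one copy per distinct IDValue.
        position = list(product).index(related)
        product.remove(related)
        for offset, value in enumerate(values):
            split = copy.deepcopy(related)
            for ident in split.findall("ProductIdentifier"):
                if ident.findtext("IDValue") != value:
                    split.remove(ident)
            product.insert(position + offset, split)
    return product
```

Each split copy keeps the original ProductRelationCode, so the relation type is preserved for every identifier.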

Occasionally, a product lists the same Related Product more than once, or it has a Related Product with the same identifier as its parent, effectively referencing itself as a Related Product. In either case, the duplicated Related Products are removed.

For example:

<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>100</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>

The above product has a copy of the Product as a Related Product (ID 100). It also has two identical Related Products (with ID 200). After the deduplication process, the product will look like the following:

<Product>
    <ProductIdentifier>
        <IDValue>100</IDValue>
        <ProductIDType>15</ProductIDType>
    </ProductIdentifier>
    <RelatedProduct>
        <ProductRelationCode>06</ProductRelationCode>
        <ProductIdentifier>
            <IDValue>200</IDValue>
            <ProductIDType>15</ProductIDType>
        </ProductIdentifier>
    </RelatedProduct>
</Product>
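A minimal Python sketch of this deduplication is shown here. The names identifiers and deduplicate_related_products are hypothetical, and the sketch assumes two Related Products count as duplicates when they share both the relation code and the full set of identifiers; the telescope's own criteria may differ.

```python
import xml.etree.ElementTree as ET


def identifiers(element):
    """Return the (ProductIDType, IDValue) pairs of direct ProductIdentifier children."""
    return frozenset(
        (ident.findtext("ProductIDType"), ident.findtext("IDValue"))
        for ident in element.findall("ProductIdentifier")
    )


def deduplicate_related_products(product):
    """Remove self-referencing and repeated RelatedProduct elements, in place."""
    parent_ids = identifiers(product)
    seen = set()
    for related in product.findall("RelatedProduct"):
        related_ids = identifiers(related)
        key = (related.findtext("ProductRelationCode"), related_ids)
        if related_ids & parent_ids or key in seen:
            # Either a reference to the parent itself or a repeat of an earlier entry.
            product.remove(related)
        else:
            seen.add(key)
    return product
```

Applied to the example above, the sketch drops the Related Product with ID 100 and the second copy of ID 200, leaving a single Related Product.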

Telescope Tasks

Data Download

This is where the metadata is downloaded. The XML file containing the metadata is downloaded using the Metadata URI.

Note that if the metadata file is part-way through an update (occurring daily at +0000GMT and taking upwards of one hour), the XML file will be incomplete and invalid. The telescope has a failsafe that attempts to resolve this at runtime, which can lead to 'download' times that are much longer than normal.
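One way such a failsafe could work is to check that the downloaded file is well-formed XML and retry after a wait if it is not. The sketch below is an assumption about the general approach, not the telescope's actual mechanism; is_well_formed, download_until_valid, and their parameters are hypothetical, and the fetch callable stands in for whatever performs the HTTP request.

```python
import time
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the payload parses as XML (a truncated file will not)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


def download_until_valid(fetch, attempts=5, wait_seconds=600, sleep=time.sleep):
    """Call fetch() until it returns well-formed XML, waiting between tries."""
    for attempt in range(1, attempts + 1):
        payload = fetch()
        if is_well_formed(payload):
            return payload
        if attempt < attempts:
            # The feed may be mid-update, so wait before trying again.
            sleep(wait_seconds)
    raise RuntimeError(f"No well-formed XML after {attempts} attempts")
```

Passing sleep as a parameter keeps the sketch testable without real delays; the default values for attempts and wait_seconds are placeholders.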

Data Transform

The transform step modifies the downloaded metadata into a valid ONIX format. This is done in a few steps:

  1. The XML is loaded and all unnecessary fields are removed. The necessary fields for the BAD workflows are described by a corresponding schema (.json) file.

  2. The resulting XML is validated with the Python onixcheck library. This reveals any remaining invalid products, which are removed from the file. The removed products are saved to a separate file and uploaded to the transform bucket for storage/archiving.

  3. Any Related Products that are incorrectly formatted will be fixed through the normalisation process.

  4. Duplicated Related Products are removed via the deduplication process.

  5. Optionally, the Related Products can be elevated through the Related Product elevation process.

  6. The XML is then parsed through the Java ONIX Parser, which results in a .jsonl file.

  7. The PersonName and InvertedPersonName fields are created (where possible) from the KeyNames and NamesBeforeKey fields.

  8. Subject fields are collapsed (converted to a single semicolon-separated string) to match our expected input.
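Steps 7 and 8 can be illustrated with a small Python sketch operating on parsed records. The record layout, the function names (add_person_names, collapse_subjects), the name formats, and the "; " separator are all assumptions made for illustration; the source only states that the name fields are built from KeyNames and NamesBeforeKey and that subjects become one semicolon-separated string.

```python
def add_person_names(contributor):
    """Fill PersonName and InvertedPersonName when both name parts are present."""
    key_names = contributor.get("KeyNames")
    names_before_key = contributor.get("NamesBeforeKey")
    if key_names and names_before_key:
        # Existing values are kept; only missing or empty fields are filled.
        if not contributor.get("PersonName"):
            contributor["PersonName"] = f"{names_before_key} {key_names}"
        if not contributor.get("InvertedPersonName"):
            contributor["InvertedPersonName"] = f"{key_names}, {names_before_key}"
    return contributor


def collapse_subjects(values):
    """Join non-empty subject strings into one semicolon-separated string."""
    return "; ".join(value.strip() for value in values if value and value.strip())


contributor = add_person_names({"KeyNames": "Curie", "NamesBeforeKey": "Marie"})
print(contributor["PersonName"])          # Marie Curie
print(contributor["InvertedPersonName"])  # Curie, Marie
print(collapse_subjects(["History", " Science ", ""]))  # History; Science
```

A contributor with only one of the two source fields is returned unchanged, which is how this sketch interprets "where possible".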

BigQuery Load

The valid ONIX feed can now be loaded from the transform bucket into a BigQuery date-sharded table in the onix dataset (which will be created if it does not yet exist). There will be multiple onix_YYYYMMDD tables.
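The onix_YYYYMMDD naming of the date-sharded tables can be sketched as below. The helper sharded_table_id is hypothetical, and which date the telescope uses for the shard suffix is an assumption here (shown as a generic release date).

```python
from datetime import date


def sharded_table_id(dataset, table, release_date):
    """Build a dataset-qualified table name with a YYYYMMDD shard suffix."""
    return f"{dataset}.{table}_{release_date.strftime('%Y%m%d')}"


print(sharded_table_id("onix", "onix", date(2024, 1, 7)))  # onix.onix_20240107
```

Because the schedule is weekly and each run includes all data, each shard would hold a complete snapshot of the feed for its date.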

Table Schema

The OAPEN Metadata table uses the same schema as the ONIX Telescope's table. See the ONIX Telescope schema.
