OAPEN Metadata
The OAPEN Metadata telescope collects data from the OAPEN Metadata feed. OAPEN enables libraries and aggregators to use the metadata of all available titles in the OAPEN Library. OAPEN metadata is available in different formats and this telescope harvests the data in the XML format using . See the for more information.
Dataset Name: onix
Table Name: onix
Table Type: Sharded
Average Runtime: 10 min
Average Download Size: 1-200 MB
Harvest Type: URI
Run Schedule: Weekly
Catch-up Missed Runs
Each Run Includes All Data
Fields passed as keyword arguments to the telescope upon instantiation.
This field holds the URI for the publisher/collection. This is available via the Memo tool. We can specify it in the telescope config via kwargs:
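As a rough illustration only, the keyword arguments could be expressed as a mapping like the one below. The key name `metadata_uri` and the URI are placeholders assumed for this sketch; substitute the field name used by your deployment and the URI obtained from the Memo tool.

```python
# Hypothetical kwargs mapping. Both the key name and the URI are
# placeholders chosen for illustration, not confirmed values.
telescope_kwargs = {
    # URI for the publisher/collection, as obtained from the Memo tool.
    "metadata_uri": "https://example.org/path/to/onix-feed",
}

print(telescope_kwargs["metadata_uri"])
```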
A boolean value ("True" | "False") that determines whether the transform step should elevate the feed's related products.
Some steps of the transform process manipulate the Related Products of the metadata feed in a way that alters the state of the input data. These processes deserve an explanation as it's not obvious what they're doing and why.
The OAPEN Metadata telescope has the option to manipulate the feed's related product entries. Each Related Product in each entry is turned into its own entry. For each of these fabricated entries, the reference to itself as a related product is removed and replaced with the product identifier of the original entry. This can lead to many more entries than the original.
The elevation process will only apply to a Related Product if:
For example, if we have a Product that looks like this:
The related product is elevated, creating a new product and keeping the original:
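The elevation logic can be sketched on plain dictionaries. This is a minimal sketch, not the telescope's implementation: the dictionary layout (`id`, `title`, `related_products`, `id_type`, `relation_code`) is invented for illustration, and the back-reference to the original product is assumed to use the same identifier type and relation code.

```python
from copy import deepcopy

ISBN_ID_TYPE = "15"        # product identifier code for an ISBN
ALTERNATIVE_FORMAT = "06"  # relation code for "alternative format"


def elevate_related_products(products):
    """Return the original products plus one new product per eligible related product."""
    elevated = []
    for product in products:
        for related in product["related_products"]:
            # Only ISBN-identified, alternative-format relations are elevated.
            if related["id_type"] != ISBN_ID_TYPE:
                continue
            if related["relation_code"] != ALTERNATIVE_FORMAT:
                continue
            new_product = deepcopy(product)
            new_product["id"] = related["id"]
            # Drop the self-reference and point back at the original product.
            others = [deepcopy(r) for r in product["related_products"] if r["id"] != related["id"]]
            back_reference = {
                "id": product["id"],
                "id_type": ISBN_ID_TYPE,  # assumption: the original is identified by ISBN
                "relation_code": ALTERNATIVE_FORMAT,
            }
            new_product["related_products"] = others + [back_reference]
            elevated.append(new_product)
    return products + elevated


feed = [
    {
        "id": "9780000000001",
        "title": "Example Title",
        "related_products": [
            {"id": "9780000000002", "id_type": "15", "relation_code": "06"},
        ],
    }
]
result = elevate_related_products(feed)
for entry in result:
    print(entry["id"], [r["id"] for r in entry["related_products"]])
```

The original entry is kept unchanged, and the new entry carries the related product's identifier with a reference back to the original.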
Related products retrieved from the OAPEN feed tend to have a format that is not consistent with ONIX 3.0 standards. For example, the following is an invalid implementation:
The only exception to this format is if the IDValue elements are identical and the ProductIDType elements are different.
The correct way is the following:
This is an important distinction as the ONIX parser will ignore any consecutive ProductIdentifier tags that have the same ProductIDType.
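One way to picture the fix is the sketch below, written with Python's standard `xml.etree.ElementTree`. It assumes the invalid form is a single RelatedProduct holding several ProductIdentifier elements with different IDValue contents, and that the fix is one RelatedProduct per distinct IDValue. The function name `split_related_product` is hypothetical and this is not the telescope's own code.

```python
import xml.etree.ElementTree as ET
from copy import deepcopy


def split_related_product(related):
    """Split one <RelatedProduct> into one element per distinct IDValue."""
    values = []
    for identifier in related.findall("ProductIdentifier"):
        value = identifier.findtext("IDValue")
        if value not in values:
            values.append(value)
    if len(values) <= 1:
        # Already valid. This covers the exception where the IDValue is
        # identical and only the ProductIDType differs.
        return [related]
    fixed = []
    for value in values:
        clone = deepcopy(related)
        # Keep only the identifiers that carry this IDValue.
        for identifier in clone.findall("ProductIdentifier"):
            if identifier.findtext("IDValue") != value:
                clone.remove(identifier)
        fixed.append(clone)
    return fixed


invalid = ET.fromstring(
    "<RelatedProduct>"
    "<ProductRelationCode>06</ProductRelationCode>"
    "<ProductIdentifier><ProductIDType>15</ProductIDType><IDValue>9780000000001</IDValue></ProductIdentifier>"
    "<ProductIdentifier><ProductIDType>15</ProductIDType><IDValue>9780000000002</IDValue></ProductIdentifier>"
    "</RelatedProduct>"
)
parts = split_related_product(invalid)
for part in parts:
    print(ET.tostring(part, encoding="unicode"))
```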
Occasionally, a product lists the same Related Product more than once, or it has a Related Product with the same identifier as its parent, effectively referencing itself as a Related Product. In either case, the duplicated Related Products are removed.
For example:
The above product has a copy of the Product as a Related Product (ID 100). It also has two identical Related Products (with ID 200). After the deduplication process, the product will look like the following:
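The same scenario (a product with ID 100, a self-reference, and two copies of related product 200) can be sketched with plain dictionaries. The dictionary layout and the function name are assumptions made for illustration.

```python
def deduplicate_related_products(product):
    """Drop self-references and repeated related products, keeping first occurrences."""
    seen = {product["id"]}  # seeding with the parent's ID removes self-references
    unique = []
    for related in product["related_products"]:
        if related["id"] in seen:
            continue
        seen.add(related["id"])
        unique.append(related)
    return {**product, "related_products": unique}


product = {
    "id": "100",
    "related_products": [{"id": "100"}, {"id": "200"}, {"id": "200"}],
}
cleaned = deduplicate_related_products(product)
print(cleaned)
```

Only a single related product with ID 200 remains after the call.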
Note that if the metadata file is part-way through an update (occurring daily at +0000GMT and taking upwards of one hour), the XML file will be incomplete and invalid. The telescope has a failsafe that attempts to resolve this at runtime, which can lead to much longer than normal 'download' times.
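A failsafe of this kind could be built around a retry loop that checks whether the downloaded XML is well-formed. The sketch below assumes that approach; the function name, the retry count, and the wait time are all hypothetical, and the actual telescope may resolve the problem differently.

```python
import time
import xml.etree.ElementTree as ET


def download_until_valid(fetch, max_attempts=5, wait_seconds=60.0, sleep=time.sleep):
    """Call fetch() until it returns well-formed XML, waiting between attempts."""
    for attempt in range(1, max_attempts + 1):
        content = fetch()
        try:
            ET.fromstring(content)  # raises ParseError on truncated XML
            return content
        except ET.ParseError:
            if attempt == max_attempts:
                raise
            sleep(wait_seconds)


# Simulated feed: the first response is cut off mid-update, the second is complete.
responses = iter(["<ONIXMessage><Product>", "<ONIXMessage><Product/></ONIXMessage>"])
waits = []
xml_text = download_until_valid(lambda: next(responses), wait_seconds=1.0, sleep=waits.append)
print(xml_text, waits)
```

Passing `fetch` and `sleep` as arguments keeps the sketch independent of any particular HTTP client and makes the waiting behaviour easy to exercise without real delays.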
The transform step modifies the downloaded metadata into a valid ONIX format. This is done in a few steps:
The PersonName and InvertedPersonName fields are created (where possible) from the KeyNames and NamesBeforeKey fields.
Subject fields are collapsed (converted to a single semicolon-separated string) to match our expected input.
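Both adjustments can be sketched as small helper functions. The function names, the dictionary layout, and the exact separator (`"; "`) are assumptions for illustration. The sketch only builds names when both KeyNames and NamesBeforeKey are present, which is one reading of "where possible".

```python
def add_person_names(contributor):
    """Add PersonName and InvertedPersonName when both name parts are available."""
    result = dict(contributor)
    key_names = contributor.get("KeyNames")
    names_before_key = contributor.get("NamesBeforeKey")
    if key_names and names_before_key:
        # Existing values are kept; only missing fields are filled in.
        result.setdefault("PersonName", f"{names_before_key} {key_names}")
        result.setdefault("InvertedPersonName", f"{key_names}, {names_before_key}")
    return result


def collapse_subjects(subjects):
    """Join subject strings into one semicolon-separated string, skipping blanks."""
    return "; ".join(s.strip() for s in subjects if s and s.strip())


contributor = add_person_names({"KeyNames": "Austen", "NamesBeforeKey": "Jane"})
subjects = collapse_subjects(["History", " Economics ", ""])
print(contributor)
print(subjects)
```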
The valid ONIX feed can now be loaded from the transform bucket into a BigQuery date-sharded table in the onix dataset (which will be created if it does not yet exist). There will be multiple onix_YYYYMMDD tables.
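The date-sharded naming pattern can be illustrated with a short helper. The function name is hypothetical, and which date a given run uses for its shard is not specified here.

```python
from datetime import date


def shard_table_id(table_name, shard_date):
    """Build a date-sharded table name of the form <table>_YYYYMMDD."""
    return f"{table_name}_{shard_date:%Y%m%d}"


table_id = shard_table_id("onix", date(2023, 1, 15))
print(table_id)
```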
The related product has an ISBN (product identifier code 15 as described in the )
The relation code is "06" (i.e. alternative format as described in the )
This is where the metadata is downloaded. The XML file containing metadata is downloaded using the .
The XML is loaded and all unnecessary fields are removed. The necessary fields for the BAD workflows are described by a (.json) file.
The resulting XML is parsed through the Python . This reveals any remaining invalid products. These products are removed from the file. The removed products are saved to a separate file and uploaded to the transform bucket for storage/archiving.
Any Related Products that are incorrectly formatted will be fixed through the .
Duplicated Related Products are removed via the .
Optionally, the Related Products can be elevated through the Related Product .
The XML is then parsed through the , which results in a .jsonl file.
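The field-removal step near the start of this list can be sketched as a recursive prune against an allow-list loaded from JSON. The allow-list layout used here (tag names mapped to nested dictionaries of permitted children) is invented for this example and may not match the actual .json file.

```python
import json
import xml.etree.ElementTree as ET


def prune(element, allowed):
    """Remove child elements whose tags are not in the allow-list, recursively."""
    for child in list(element):  # copy the list because children are removed while iterating
        if child.tag not in allowed:
            element.remove(child)
        else:
            prune(child, allowed[child.tag])


# Hypothetical allow-list: an empty dictionary keeps the element but no sub-elements.
allowed_fields = json.loads(
    '{"RecordReference": {}, "DescriptiveDetail": {"TitleDetail": {}}}'
)
product = ET.fromstring(
    "<Product>"
    "<RecordReference>r1</RecordReference>"
    "<Internal>x</Internal>"
    "<DescriptiveDetail><TitleDetail/><Extra/></DescriptiveDetail>"
    "</Product>"
)
prune(product, allowed_fields)
print(ET.tostring(product, encoding="unicode"))
```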
The OAPEN Metadata table uses the same schema as the ONIX Telescope's table. See the .