Thoth
The Thoth Telescope downloads, transforms and loads publisher ONIX feeds from Thoth into BigQuery. ONIX is a standard format that book publishers use to share information about the books that they have published.
Thoth is a free, open metadata service that publishers can choose to use for metadata storage. Thoth can provide metadata upon request in a number of formats. The Thoth Telescope uses the Thoth Export API to download metadata in an ONIX format. This API provides a snapshot of a specified publisher's metadata at the time of the request.
The Thoth Telescope downloads the ONIX metadata files and then uses the ONIX parser Java command line tool to transform the data into a format suitable for loading into BigQuery. This process is nearly identical to the ONIX telescope's data-transformation step. The transformed data is loaded into BigQuery, where it can be picked up and used by the ONIX Workflow.
Dataset Name | onix
Table Name | onix
Table Type | Sharded
Average Runtime | 5 min
Average Download Size | 1-200 MB
Harvest Type | API
Run Schedule | Weekly
Catch-up Missed Runs |
Each Run Includes All Data |
Telescope Configuration
Telescope kwargs
Fields passed as keyword arguments to the telescope upon instantiation.
Publisher ID (publisher_id)
This field holds the Thoth internal ID for the publisher. For example, Open Book Publishers' ID would be presented as follows:
Format Specification (format_specification)
Thoth can output the metadata feed in a number of different formats. Refer to the Thoth export API for the available specifications. The chosen format specification should be provided to the telescope.
Elevate Related Products (related_products_elevation)
A boolean value ("True" | "False") that determines whether the transform step should elevate the feed's related products. See Related Products Manipulation for more information.
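As a rough illustration of the three kwargs described above, the following Python sketch collects them in a dictionary and converts the "True" / "False" string into a real boolean. The publisher ID is a placeholder, the format specification value is only an assumed example, and `parse_bool_kwarg` is a hypothetical helper that is not part of the telescope.

```python
def parse_bool_kwarg(value):
    """Convert a "True"/"False" string (or an existing bool) into a bool."""
    if isinstance(value, bool):
        return value
    normalised = str(value).strip().lower()
    if normalised == "true":
        return True
    if normalised == "false":
        return False
    raise ValueError(f"Expected 'True' or 'False', got {value!r}")


# Placeholder values for illustration only.
telescope_kwargs = {
    "publisher_id": "<thoth-publisher-id>",     # placeholder, not a real ID
    "format_specification": "onix_3.0::oapen",  # assumed example specification
    "related_products_elevation": "False",
}

elevate = parse_bool_kwarg(telescope_kwargs["related_products_elevation"])
```

Parsing the string explicitly avoids the common pitfall that `bool("False")` evaluates to `True` in Python.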
Retrieving the Publisher ID
To get a publisher's internal identifier, navigate to Thoth's GraphiQL page and supply the following query:
The query outputs all available publishers with their names and internal IDs. Inputs to the query are available to narrow down the search. For example, the 'filter' input can be used to filter by a string.
Filtering on Press, for instance, will only output publishers with Press in their name.
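The two queries described above could be built as in the following Python sketch. The field names `publisherName` and `publisherId` are assumptions about the Thoth GraphQL schema and should be checked in GraphiQL. `build_publishers_query` is a hypothetical helper that only builds the query text and does not contact Thoth.

```python
import json


def build_publishers_query(name_filter=None):
    """Return a GraphQL query string that lists publishers.

    The field names (publisherName, publisherId) are assumed, not confirmed.
    """
    # json.dumps adds the surrounding double quotes and escapes the string.
    args = f"(filter: {json.dumps(name_filter)})" if name_filter else ""
    return f"{{ publishers{args} {{ publisherName publisherId }} }}"


all_publishers = build_publishers_query()
press_publishers = build_publishers_query("Press")
```

The resulting strings can be pasted into the GraphiQL page as they are.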
Telescope Tasks
Data Download
The download step is simple, thanks to Thoth's export API. The telescope uses the API to retrieve a publisher's metadata record. All that is required is to query the API with the proper URI:
The format_specification and publisher_id in the URI are those supplied in the telescope kwargs.
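A minimal sketch of how such a URI might be assembled from the two kwargs is shown below. The base URL and path layout are assumptions about the Thoth export API and should be confirmed against its documentation. `build_export_url` is a hypothetical helper, and the publisher ID is a placeholder.

```python
from urllib.parse import quote

# Assumed base URL of the Thoth export API.
THOTH_EXPORT_BASE = "https://export.thoth.pub"


def build_export_url(format_specification, publisher_id, base_url=THOTH_EXPORT_BASE):
    """Build an export URI for one publisher (path layout is an assumption)."""
    # Percent-encode each path segment, leaving ':' untouched.
    spec = quote(format_specification, safe=":")
    pub = quote(publisher_id, safe=":")
    return f"{base_url}/specifications/{spec}/publisher/{pub}"


url = build_export_url("onix_3.0::oapen", "00000000-0000-0000-0000-000000000000")
```

The returned URL can then be fetched with any HTTP client and the response body saved as the downloaded XML file.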
Data Transform
The transform task consists of the following steps:
The downloaded XML file is parsed through the Java ONIX Parser, which produces a .jsonl file.
The SubjectHeadingText field is collapsed into a single semicolon-separated string for downstream use.
The PersonName field is created (where possible) from the KeyNames and NamesBeforeKey fields.
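The last two steps above can be sketched in Python as follows. This is an illustration only: it assumes SubjectHeadingText arrives as a list of strings, that "; " is the separator, and that PersonName is the given names followed by the key name. Both helper functions are hypothetical and are not the telescope's own code.

```python
def collapse_subject_headings(value):
    """Join a list of subject headings into one semicolon-separated string."""
    if isinstance(value, list):
        return "; ".join(str(v) for v in value if v)
    return value  # already a string (or None): leave unchanged


def make_person_name(key_names, names_before_key):
    """Build a PersonName from NamesBeforeKey and KeyNames where possible."""
    parts = [p.strip() for p in (names_before_key, key_names) if p and p.strip()]
    return " ".join(parts) if parts else None


subjects = collapse_subject_headings(["Biology", "Ecology"])
name = make_person_name("Austen", "Jane")
```

Returning `None` when neither name part is present reflects the "where possible" condition: no PersonName is invented for contributors that lack both fields.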
BigQuery Load
The valid ONIX feed can now be loaded from the transform bucket into a BigQuery date-sharded table in the onix dataset, which will be created if it does not yet exist. There will be multiple onix_YYYYMMDD tables.
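To make the shard naming concrete, the short Python sketch below derives a table name of the onix_YYYYMMDD form from a run date. `sharded_table_name` is a hypothetical helper written for illustration, and the date used is arbitrary.

```python
from datetime import date


def sharded_table_name(table_name, shard_date):
    """Return a date-sharded table name such as onix_YYYYMMDD."""
    return f"{table_name}_{shard_date:%Y%m%d}"


table = sharded_table_name("onix", date(2022, 1, 9))
```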
Table Schema
The Thoth table uses the same schema as the ONIX Telescope's table. See the ONIX Telescope schema.