Thoth
The Thoth Telescope downloads, transforms and loads publisher ONIX feeds from Thoth into BigQuery. ONIX is a standard format that book publishers use to share information about the books that they have published.
Thoth is a free, open metadata service that publishers can choose to use for metadata storage. Thoth can provide metadata on request in a number of formats. The Thoth Telescope uses the Thoth Export API to download metadata in an ONIX format. This API provides a snapshot of a specified publisher's metadata at the time of the request.
The Thoth Telescope downloads the ONIX metadata files and then transforms the data into a format suitable for loading into BigQuery with the ONIX parser Java command line tool. This process is nearly identical to the ONIX Telescope's data-transformation step. The transformed data is loaded into BigQuery, where it can be picked up and used by the ONIX Workflow.
Fields passed as keyword arguments to the telescope upon instantiation.
This field holds the Thoth internal ID for the publisher. For example, Open Book Publishers' ID would be presented as follows:
Thoth can output the metadata feed in a number of different formats. Refer to the Thoth export API for more information. The format specification should be provided to the Telescope.
A boolean value ("True" | "False") that determines whether the transform step should elevate the feed's related products. See Related Products Manipulation for more information.
To get a publisher's internal identifier, navigate to Thoth's GraphiQL page and supply the following query:
This outputs all available publishers with their name and internal ID. Inputs to the query are available to narrow down the search. For example, the 'filter' input can be used to filter by a string.
With a filter of 'Press', the query only outputs publishers with "Press" in their name.
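As a rough illustration of the lookup described above, the following Python sketch builds such a query string and the JSON body that would be sent to a GraphQL endpoint. The field names publisherName and publisherId, and the shape of the filter argument, are assumptions made for this example and should be checked against the schema shown in Thoth's GraphiQL page. The helper names are hypothetical, and no network request is made.

```python
import json


def build_publisher_query(name_filter=None):
    """Return a GraphQL query string that lists publisher names and IDs.

    The field names (publishers, publisherName, publisherId) and the
    `filter` argument are assumptions; confirm them in the GraphiQL schema.
    """
    # json.dumps quotes and escapes the filter string for use as a literal.
    args = f"(filter: {json.dumps(name_filter)})" if name_filter else ""
    return f"{{ publishers{args} {{ publisherName publisherId }} }}"


def build_request_body(name_filter=None):
    """Wrap the query in the JSON body that GraphQL-over-HTTP servers expect."""
    return json.dumps({"query": build_publisher_query(name_filter)})


if __name__ == "__main__":
    print(build_publisher_query())         # all publishers
    print(build_publisher_query("Press"))  # only names containing "Press"
```

The query string can be pasted into the GraphiQL page directly, or the request body can be posted to the GraphQL endpoint with any HTTP client.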
The download step is simple, thanks to Thoth's export API. The telescope uses the API to retrieve a publisher's metadata record. All that is required is to query the API with the proper URI:
Here, format_specification and publisher_id are the values supplied in the telescope kwargs.
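A minimal sketch of how the download URI might be assembled from the two kwargs is shown below. The base URL and path layout are assumptions based on the public Thoth export service, not taken from the telescope's source. The specification name and publisher ID in the usage lines are placeholders, and build_export_url is a hypothetical helper.

```python
from urllib.parse import quote

# Assumed base URL of the Thoth export API; verify against the Thoth docs.
THOTH_EXPORT_URL = "https://export.thoth.pub"


def build_export_url(publisher_id, format_specification, base_url=THOTH_EXPORT_URL):
    """Build the URI for one publisher's metadata in the given format.

    The path layout (/specifications/<spec>/publisher/<id>) is an assumption.
    """
    # Keep ':' unescaped so specification names such as 'onix_3.0::oapen'
    # pass through unchanged; other unsafe characters are percent-encoded.
    spec = quote(format_specification, safe=":")
    pid = quote(publisher_id, safe="")
    return f"{base_url}/specifications/{spec}/publisher/{pid}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_export_url("my-publisher-id", "onix_3.0::oapen"))
```

The resulting URI can then be fetched with an HTTP GET, and the response body saved as the downloaded ONIX file.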
The transform step consists of the following stages:
The downloaded XML file is parsed through the Java ONIX Parser, which results in a .jsonl file
The SubjectHeadingText field is collapsed into a single semicolon-separated string (for downstream use)
The PersonName field is created (where possible) from the KeyNames and NamesBeforeKey fields.
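The last two stages can be sketched in Python as follows. This is an illustration under stated assumptions, not the telescope's actual code: it assumes SubjectHeadingText arrives as a list of strings, that a contributor is a flat dictionary, and that "; " is an acceptable separator. Both function names are hypothetical.

```python
def collapse_subject_headings(value, separator="; "):
    """Join a list of subject heading strings into one string.

    Non-list values are returned unchanged. The separator is an assumption.
    """
    if isinstance(value, list):
        return separator.join(v for v in value if v)
    return value


def add_person_name(contributor):
    """Return a copy of the contributor with PersonName filled in if possible.

    PersonName is built from NamesBeforeKey and KeyNames when it is missing
    and at least one of the two parts is present.
    """
    if contributor.get("PersonName"):
        return dict(contributor)
    parts = [contributor.get("NamesBeforeKey"), contributor.get("KeyNames")]
    parts = [p for p in parts if p]
    result = dict(contributor)
    if parts:
        result["PersonName"] = " ".join(parts)
    return result


if __name__ == "__main__":
    print(collapse_subject_headings(["History", "Art"]))
    print(add_person_name({"NamesBeforeKey": "Ada", "KeyNames": "Lovelace"}))
```

In practice these functions would be applied to each record read from the .jsonl file produced by the parser, at whatever nesting level the fields actually occur.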
The valid ONIX feed can now be loaded from the transform bucket into a BigQuery date-sharded table in the onix dataset, which will be created if it does not yet exist. There will be multiple onix_YYYYMMDD tables.
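To make the date-sharded naming concrete, the short sketch below derives a shard's table ID from a date. The helper name and the choice of which date is used for the shard are assumptions for illustration.

```python
from datetime import date


def sharded_table_id(dataset, table, shard_date):
    """Return '<dataset>.<table>_YYYYMMDD' for the given shard date."""
    return f"{dataset}.{table}_{shard_date.strftime('%Y%m%d')}"


if __name__ == "__main__":
    print(sharded_table_id("onix", "onix", date(2023, 1, 8)))
```

Each run therefore writes to its own table, and BigQuery groups tables that share a prefix and a YYYYMMDD suffix together in its console.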
The Thoth table uses the same schema as the ONIX Telescope's table. See the ONIX Telescope schema.
Dataset Name: onix
Table Name: onix
Table Type: Sharded
Average Runtime: 5 min
Average Download Size: 1-200 MB
Harvest Type: API
Run Schedule: Weekly
Catch-up Missed Runs
Each Run Includes All Data