Thoth

The Thoth Telescope downloads, transforms and loads publisher ONIX feeds from Thoth into BigQuery. ONIX is a standard format that book publishers use to share information about the books that they have published.

Thoth is a free, open metadata service that publishers can choose to use for storing their metadata. Thoth can provide metadata upon request in a number of formats. The Thoth Telescope uses the Thoth Export API to download metadata in an ONIX format. This API provides a snapshot of a specified publisher's metadata at the time of the request.

The Thoth Telescope downloads the ONIX metadata files and then transforms the data into a format suitable for loading into BigQuery with the ONIX parser Java command line tool. This process is nearly identical to the ONIX Telescope's data-transformation step. The transformed data is loaded into BigQuery, where it can be picked up and used by the ONIX Workflow.

Dataset Name

onix

Table Name

onix

Table Type

Sharded

Average Runtime

5 min

Average Download Size

1-200 MB

Harvest Type

API

Run Schedule

Weekly

Catch-up Missed Runs

Each Run Includes All Data

Telescope Configuration

Telescope kwargs

Fields passed as keyword arguments to the telescope upon instantiation.

Publisher ID (publisher_id)

This field holds Thoth's internal ID for the publisher. For example, the ID for Open Book Publishers would be supplied as follows:

kwargs: 
    publisher_id: "85fd969a-a16c-480b-b641-cb9adf979c3b"

Format Specification (format_specification)

Thoth can output the metadata feed in a number of different formats. Refer to the Thoth export API for the list of available formats. The chosen format specification must be provided to the telescope.

kwargs:
    format_specification: "onix_3.0::jstor"

Related Product Elevation (related_product_elevation)

A boolean value (True | False) that determines whether the transform step should elevate the feed's related products. See Related Products Manipulation for more information.

kwargs:
    related_product_elevation: True

Retrieving the Publisher ID

To get a publisher's internal identifier, navigate to Thoth's GraphiQL page and supply the following query:

{
  publishers{
    publisherName
    publisherId
  }
}

This query outputs all available publishers with their name and internal ID. The query accepts inputs that narrow down the search. For example, the filter input restricts the results to publishers matching a string:

{
  publishers(filter: "Press"){
    publisherName
    publisherId
  }
}

This will only output publishers with "Press" in their name.
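The same query can also be sent programmatically instead of through the GraphiQL page. The following is a minimal sketch, not the telescope's own code. It assumes that Thoth's GraphQL endpoint is reachable at https://api.thoth.pub/graphql and accepts a standard JSON POST body; check Thoth's documentation for the current endpoint. The function names are hypothetical.

```python
import json
from typing import Dict, List, Optional
from urllib.request import Request, urlopen

# Assumption: the GraphQL endpoint URL. Verify it against Thoth's documentation.
THOTH_GRAPHQL_URL = "https://api.thoth.pub/graphql"


def build_publisher_query(name_filter: Optional[str] = None) -> str:
    """Build the publishers query, optionally narrowed with the filter input."""
    # json.dumps produces a double-quoted, escaped string literal.
    args = f"(filter: {json.dumps(name_filter)})" if name_filter else ""
    return "{ publishers%s { publisherName publisherId } }" % args


def fetch_publishers(
    name_filter: Optional[str] = None,
    url: str = THOTH_GRAPHQL_URL,
    timeout: float = 30.0,
) -> List[Dict[str, str]]:
    """POST the query and return the list of publisher name/ID records."""
    payload = json.dumps({"query": build_publisher_query(name_filter)}).encode("utf-8")
    request = Request(url, data=payload, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["data"]["publishers"]
```

Calling fetch_publishers("Press") would then return the records whose publisherId values can be copied into the telescope kwargs.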

Telescope Tasks

Data Download

The download step is simple, thanks to Thoth's export API. The telescope uses the API to retrieve a publisher's metadata record. All that is required is to send a request to the API with the proper URI:

https://export.thoth.pub/specifications/{format_specification}/publisher/{publisher_id}

Here, format_specification and publisher_id are the values supplied in the telescope kwargs.
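As an illustration of the download step, the sketch below fills the URI template with the two kwargs and streams the response to a local file. It is a hedged example using only the Python standard library, not the telescope's actual implementation; the function names and the output path argument are hypothetical.

```python
import shutil
from urllib.parse import quote
from urllib.request import urlopen

# URI template from the section above.
THOTH_EXPORT_URL = (
    "https://export.thoth.pub/specifications/{format_specification}/publisher/{publisher_id}"
)


def build_export_url(format_specification: str, publisher_id: str) -> str:
    """Fill the export URI template with the telescope kwargs."""
    return THOTH_EXPORT_URL.format(
        # Keep the "::" in specifications such as "onix_3.0::jstor" unescaped.
        format_specification=quote(format_specification, safe=":"),
        publisher_id=quote(publisher_id, safe=""),
    )


def download_onix(
    format_specification: str,
    publisher_id: str,
    out_path: str,
    timeout: float = 60.0,
) -> str:
    """Download the publisher's metadata snapshot and return the local path."""
    url = build_export_url(format_specification, publisher_id)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urlopen(url, timeout=timeout) as response, open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return out_path
```

For example, download_onix("onix_3.0::jstor", "85fd969a-a16c-480b-b641-cb9adf979c3b", "thoth_onix.xml") would save the snapshot for the publisher ID shown earlier.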

Data Transform

The transform task consists of three steps:

  1. The downloaded XML file is parsed by the Java ONIX Parser, which produces a .jsonl file.

  2. The SubjectHeadingText field is collapsed into a single semicolon-separated string for downstream use.

  3. The PersonName field is created (where possible) from the KeyNames and NamesBeforeKey fields.
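Steps 2 and 3 can be sketched as two small functions applied to each parsed record. This is an illustrative sketch only: the record layout (a Subjects list holding SubjectHeadingText as a list of strings, and a Contributors list holding KeyNames and NamesBeforeKey), the "; " separator, and the name order are all assumptions, and the function names are hypothetical.

```python
from typing import Any, Dict


def collapse_subject_headings(record: Dict[str, Any]) -> Dict[str, Any]:
    """Join list-valued SubjectHeadingText fields into one string.

    Assumption: subjects live under record["Subjects"].
    """
    for subject in record.get("Subjects") or []:
        text = subject.get("SubjectHeadingText")
        if isinstance(text, list):
            subject["SubjectHeadingText"] = "; ".join(t for t in text if t)
    return record


def add_person_names(record: Dict[str, Any]) -> Dict[str, Any]:
    """Create PersonName from NamesBeforeKey and KeyNames where both exist.

    Assumption: contributors live under record["Contributors"].
    """
    for contributor in record.get("Contributors") or []:
        if contributor.get("PersonName"):
            continue  # Keep a name that the feed already supplies.
        before = contributor.get("NamesBeforeKey")
        key = contributor.get("KeyNames")
        if before and key:
            contributor["PersonName"] = f"{before} {key}"
    return record
```

Contributors that lack either name part are left without a PersonName, which matches the "where possible" qualification above.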

BigQuery Load

The transformed ONIX feed is then loaded from the transform bucket into a date-sharded BigQuery table in the onix dataset, which is created if it does not yet exist. Because each run writes its own shard, multiple onix_YYYYMMDD tables accumulate over time.
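To make the shard naming concrete, the following minimal sketch builds the table ID for a given shard date. The helper name is hypothetical, and which date the telescope uses for the shard suffix is not stated here, so the date argument is left to the caller.

```python
from datetime import date


def sharded_table_id(dataset_id: str, table_name: str, shard_date: date) -> str:
    """Return a dataset-qualified table ID with a YYYYMMDD shard suffix."""
    return f"{dataset_id}.{table_name}_{shard_date:%Y%m%d}"
```

For example, sharded_table_id("onix", "onix", date(2023, 1, 8)) gives "onix.onix_20230108".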

Table Schema

The Thoth table uses the same schema as the ONIX Telescope's table. See the ONIX Telescope schema.
