ONIX
Documentation for the ONIX telescope
The ONIX telescope downloads, transforms and loads publisher ONIX feeds into BigQuery. ONIX is a standard format that book publishers use to share information about the books that they have published.
Book publishers with ONIX feeds are given credentials and access to their own upload folder on the OAeBU SFTP server. They then configure ONIX Suite to upload their ONIX feeds to the SFTP server on a weekly basis. The ONIX feeds need to be full dumps every time, not incremental updates.
Telescope Configuration
Airflow connections
Telescope kwargs
Fields passed as keyword arguments to the telescope upon instantiation.
Date Regular Expression (date_regex)
This field is used to extract the date from the ONIX feed file name. For example, the regex \\d{8} will extract the date from the file name 20220301_CURTINPRESS_ONIX.xml.
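As a rough illustration of how such a pattern behaves, the sketch below applies the regex to the example file name and parses the result as a date. The function name extract_release_date and the YYYYMMDD date format are assumptions made for this example, not the telescope's actual code. It also assumes that the doubled backslash in the kwarg is string escaping, so the effective pattern is \d{8}.

```python
import re
from datetime import datetime


def extract_release_date(file_name: str, date_regex: str = r"\d{8}") -> datetime:
    """Hypothetical helper: find the first date_regex match and parse it as YYYYMMDD."""
    match = re.search(date_regex, file_name)
    if match is None:
        raise ValueError(f"No date found in file name: {file_name}")
    # Assumption: the matched text is a date in YYYYMMDD form.
    return datetime.strptime(match.group(0), "%Y%m%d")


release_date = extract_release_date("20220301_CURTINPRESS_ONIX.xml")
print(release_date.date())
```

A file name with no match raises an error in this sketch, which makes a misconfigured regex visible early.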
Telescope Tasks
Data Download
Discovers all files in the partner's SFTP server folder that match the supplied date regex pattern. These files are downloaded to the local file system for transforming.
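The discovery step can be pictured as a filter over a directory listing. The following is a minimal sketch under stated assumptions: select_feed_files is a hypothetical helper, and the listing is a plain list of file names standing in for whatever the SFTP client returns. It does not connect to any server.

```python
import re
from typing import List


def select_feed_files(file_names: List[str], date_regex: str) -> List[str]:
    """Hypothetical helper: keep only file names that contain a date_regex match."""
    pattern = re.compile(date_regex)
    return [name for name in file_names if pattern.search(name)]


# Stand-in for an SFTP directory listing (made-up file names).
listing = [
    "20220301_CURTINPRESS_ONIX.xml",
    "README.txt",
    "20220308_CURTINPRESS_ONIX.xml",
]
selected = select_feed_files(listing, r"\d{8}")
print(selected)
```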
Data Transform
To convert the .xml feed into a format suitable for loading into BigQuery, the ONIX telescope uses the Java ONIX parser. The parser is written in Java so that it can use the Jonix-onix3 library. Its output is a .jsonl file, which is straightforward to read and process in Python.
An additional step in the transform task collapses the subjects (Subjects.SubjectHeadingText) into a semicolon-separated string.
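To make the collapsing step concrete, here is a minimal sketch. It assumes that each parsed record has a Subjects list and that each subject's SubjectHeadingText is a list of strings. The function name collapse_subjects, the "; " separator (with a trailing space) and the sample record are assumptions for illustration and may differ from the telescope's real data and code.

```python
import json
from typing import List


def collapse_subjects(records: List[dict]) -> List[dict]:
    """Hypothetical helper: join each subject's SubjectHeadingText list into one string."""
    for record in records:
        for subject in record.get("Subjects", []):
            # Skip missing or empty entries before joining.
            texts = [text for text in subject.get("SubjectHeadingText", []) if text]
            subject["SubjectHeadingText"] = "; ".join(texts)
    return records


# One made-up line of parser output in JSON Lines form.
jsonl = '{"Subjects": [{"SubjectHeadingText": ["History", "Australia"]}]}\n'
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
collapsed = collapse_subjects(records)
print(collapsed[0]["Subjects"][0]["SubjectHeadingText"])
```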
BigQuery Load
The transformed data is loaded from the Google Cloud bucket into a date-sharded BigQuery table in the onix dataset, which will be created if it does not yet exist. Because the table is date-sharded, the dataset will contain multiple tables named onix_YYYYMMDD.
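The shard naming can be sketched as follows. The helper sharded_table_id is hypothetical, and the example assumes the shard suffix is a date formatted as YYYYMMDD. It only builds the table identifier and makes no BigQuery calls.

```python
from datetime import date


def sharded_table_id(dataset_id: str, table_name: str, shard_date: date) -> str:
    """Hypothetical helper: build a date-sharded table ID such as onix.onix_20220301."""
    return f"{dataset_id}.{table_name}_{shard_date.strftime('%Y%m%d')}"


table_id = sharded_table_id("onix", "onix", date(2022, 3, 1))
print(table_id)
```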
Table Schema