ONIX Workflow
The ONIX Workflow filters and aggregates the data ingested by the telescopes.
BAD has a single core workflow, the ONIX Workflow, which is responsible for filtering and aggregating all data for a publisher. The workflow prepares the data so that it can be easily accessed/imported by a dashboarding tool. The ONIX Workflow can broadly be broken into three parts:
Aggregating and Mapping book products into works and work families
Linking data from metric providers to book products
Creating export tables for visualisation in dashboards
A list of DAG IDs. Upon instantiation, the workflow will create a sensor task for each of the supplied DAGs. Each sensor will look back 7 days for a completed run of its DAG. If there are no runs in the last 7 days, the sensor will be marked as a success. Otherwise, the sensor will wait for the DAG to complete before marking itself as a success.
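As a rough illustration, the lookback behaviour of each sensor could be sketched as below. This is a minimal sketch only, not the workflow's actual sensor implementation; the helper name and window handling are assumptions.

```python
# Sketch only: approximates the described 7-day lookback behaviour of each DAG sensor.
from datetime import timedelta

import pendulum
from airflow.models import DagRun
from airflow.utils.state import DagRunState


def dag_completed_or_absent(dag_id: str, lookback_days: int = 7) -> bool:
    """Return True (sensor success) if the monitored DAG has no runs in the
    lookback window, or if all of its runs in that window have completed."""
    window_start = pendulum.now("UTC") - timedelta(days=lookback_days)
    recent_runs = [
        run for run in DagRun.find(dag_id=dag_id)
        if run.execution_date >= window_start
    ]
    if not recent_runs:
        return True  # No runs in the last 7 days: mark the sensor as a success
    return all(run.state == DagRunState.SUCCESS for run in recent_runs)
```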
A list of data partners that the workflow will use to aggregate and filter the data. The corresponding sensors for each data partner should be present. This list should contain one ONIX-type partner and at least one data-type partner.
The ONIX workflow uses the ONIX table created by an ONIX telescope (ONIX Telescope, Thoth, OAPEN Metadata) to do the following:
Aggregate book product records into works records. Works are equivalence classes of products, where the products in a class are manifestations of the same work. For example, a PDF and a paperback of the same work.
Aggregate work records into work family records. A work family is an equivalence class of works where each work in the class is just a different edition.
Produce intermediate lookup tables mapping ISBN13 -> WorkID and ISBN13 -> WorkFamilyID.
Produce intermediate tables that append work_id and work_family_id columns to different data tables with ISBN keys.
The Work ID will be an arbitrary ISBN representative from a product in the equivalence class.
The Work Family ID will be an arbitrary Work ID (ISBN) representative from a work in the equivalence class.
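A minimal sketch of the equivalence-class grouping is shown below. The union-find helper and input format are illustrative assumptions; the real workflow performs this aggregation in BigQuery rather than Python.

```python
# Illustrative sketch: group related ISBNs into works and pick a representative
# ISBN as the Work ID. The actual workflow does this in BigQuery.

def assign_work_ids(related_isbn_pairs, all_isbns):
    """related_isbn_pairs: pairs of ISBN13s that are manifestations of the same work
    (e.g. a PDF and a paperback). Returns a dict mapping ISBN13 -> WorkID."""
    parent = {isbn: isbn for isbn in all_isbns}

    def find(x):  # find the representative of x's equivalence class
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):  # merge the classes containing a and b
        parent[find(a)] = find(b)

    for a, b in related_isbn_pairs:
        union(a, b)

    # Each ISBN maps to an arbitrary representative ISBN from its class.
    return {isbn: find(isbn) for isbn in all_isbns}


# Example: a paperback and a PDF of the same title share a Work ID.
lookup = assign_work_ids(
    related_isbn_pairs=[("9780000000001", "9780000000002")],
    all_isbns=["9780000000001", "9780000000002", "9780000000003"],
)
```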
Crossref Metadata is required to proceed. The ISBNs for each work are obtained from the publisher's ONIX table. For each of these ISBNs, the Crossref Metadata table produced by the Academic Observatory workflows is queried. Refer to the Crossref Metadata task.
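For instance, the lookup might resemble the hedged sketch below; the project, dataset, table and ISBN field names are assumptions and will differ in practice.

```python
# Sketch only: query the Crossref Metadata table for a set of ISBNs.
# The table path and the ISBN field name here are assumed, not the real ones.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT *
    FROM `my-project.crossref_metadata.crossref_metadata20240101`  -- assumed table path
    WHERE EXISTS (SELECT 1 FROM UNNEST(ISBN) AS isbn WHERE isbn IN UNNEST(@isbns))
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("isbns", "STRING", ["9780000000001"])]
)
rows = list(client.query(query, job_config=job_config))
```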
Similarly to Crossref Metadata, Crossref Event Data is retrieved from Crossref's dedicated Events REST API by the Crossref Event Data telescope. The API accepts queries by DOI only; the DOIs are obtained by matching the appropriate ISBN13s from the metadata. Refer to the Crossref Events task.
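A minimal sketch of one such lookup, assuming the public Crossref Event Data REST endpoint and a DOI already resolved from an ISBN13 (pagination and retries omitted):

```python
# Sketch only: fetch Crossref events for a single DOI resolved from an ISBN13.
import requests

def fetch_events_for_doi(doi: str, mailto: str = "agent@example.org") -> list:
    response = requests.get(
        "https://api.eventdata.crossref.org/v1/events",
        params={"obj-id": doi, "mailto": mailto, "rows": 1000},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["message"]["events"]

events = fetch_events_for_doi("10.5555/12345678")  # placeholder DOI
```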
The book table (schema) is a collection of works and their relevant details for the respective publisher. The table accommodates a title's Crossref metadata, events and separate chapters.
For each of a data partner's tables containing an ISBN, create new matched tables that extend the original data with new work_id and work_family_id columns.
The schemas for these tables are identical to the raw Telescope's schemas, with the addition of work_ids and work_family_ids.
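Conceptually, the matching step is a left join from the data partner table onto the ISBN13 -> WorkID and ISBN13 -> WorkFamilyID lookups. A hedged Python sketch is below; the field names are illustrative and the real workflow performs this as a BigQuery query.

```python
# Sketch only: extend a data partner's rows with work_id and work_family_id
# by joining on the ISBN13 lookup tables. Field names are illustrative.

def build_matched_rows(partner_rows, workid_by_isbn, workfamilyid_by_isbn, isbn_field="ISBN13"):
    matched = []
    for row in partner_rows:
        isbn = row.get(isbn_field)
        matched.append({
            **row,
            "work_id": workid_by_isbn.get(isbn),              # None if unmatched
            "work_family_id": workfamilyid_by_isbn.get(isbn),
        })
    return matched
```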
The ONIX workflow takes the metrics fetched through various telescopes, then aggregates and joins them to the book records in the publisher's ONIX feed.
The output is the book product table (schema), containing one row per unique book, with a nested month field which groups all the metrics relating to that book for each calendar month. This table is the main output of the workflow and contains all of the aggregated and filtered data from a publisher's data partners/sources.
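Building the book product rows looks roughly like the sketch below, with one row per book and a repeated month field grouping that book's metrics. This is a conceptual sketch only; the field names are illustrative and the real table is built in BigQuery with nested/repeated fields.

```python
# Sketch only: collapse per-month metric rows into one book product row per ISBN,
# with a nested "months" field.
from collections import defaultdict

def build_book_products(metric_rows):
    """metric_rows: dicts with 'ISBN13', 'month' and metric fields (illustrative names)."""
    months_by_isbn = defaultdict(lambda: defaultdict(dict))
    for row in metric_rows:
        isbn, month = row["ISBN13"], row["month"]
        months_by_isbn[isbn][month].update(
            {k: v for k, v in row.items() if k not in ("ISBN13", "month")}
        )
    return [
        {"ISBN13": isbn, "months": [{"month": m, **metrics} for m, metrics in sorted(months.items())]}
        for isbn, months in months_by_isbn.items()
    ]
```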
For each data source, including the intermediate tables, we perform basic quality assurance checks on the data, and output the results to tables that are easy to export for analysis by the publisher (e.g. to CSV). For example, we verify whether the provided ISBNs are valid, or whether there are unmatched ISBNs indicating missing ONIX product records (a minimal validity check is sketched after the list below).
Details ISBN13s in the ONIX feed that are not valid.
Details ISBN13s in the data source that are not valid. An example schema is below, as data platforms may use different field names (e.g. 'ISBN', 'publication_id', 'Primary_ISBN').
Details ISBN13s in the data source that were not matched to ISBN13s in the ONIX feed.
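The check-digit rule behind the "invalid ISBN" tables can be sketched as follows. This is standard ISBN-13 validation, not necessarily the exact check the workflow runs.

```python
# Standard ISBN-13 check-digit validation (sketch; the workflow's own check may differ).
def is_valid_isbn13(isbn: str) -> bool:
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    # Alternating weights of 1 and 3; the weighted sum must be divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

assert is_valid_isbn13("9780306406157") is True
assert is_valid_isbn13("9780306406158") is False
```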
Step three of the ONIX workflow is to export the book_product table to a sequence of flattened data export tables. The data in these tables is not materially different to the book product table, just organised in a way better suited for dashboards in Looker Studio.
Since these are date-sharded tables, their names will be updated each time the workflow is run. When using Google's Looker Studio (previously Data Studio), it is preferable for us to use a static naming scheme. For this reason, after creating the (sharded) export and quality analysis tables, we also create/update a view for each table. These views have a static name. By referencing the views, we can keep the Looker Studio dashboards up-to-date without manual intervention.
*Table names prefixed with oaebu_{publisher}_book_product
This table is a list of each Book Product. It is primarily used for drop-down fields, or where a list of all the books independent of metrics is desired.
This table contains metrics, organised by month, that are linked to each book. The country, city, institution, events and referrals tables expand on this to provide further useful breakdowns of metrics.
This table contains metrics, organised by month and author, that are linked to each author.
This table contains metrics, organised by published year and month, that are linked to each book.
This table contains metrics, organised by month and crossref event type, that are linked to each book.
This table contains metrics, organised by month and city of measured usage, that are linked to each book.
This table contains metrics, organised by month and country of measured usage, that are linked to each book.
This table contains metrics, organised by month and institution with measured activity, that are linked to each book.
This table contains a summary of metrics, organised by month, that are linked to each publisher.
This table contains metrics, organised by month and BIC subject type, that are linked to each book.
This table contains metrics, organised by month and BISAC subject type, that are linked to each book.
This table contains metrics, organised by month and THEMA subject type, that are linked to each book.
This table contains metrics, organised by published year and month and currently just the BIC subject type, that are linked to each book.
This table is a list of each unique institution to which metrics are linked. It is primarily used for drop-down fields, or where a list of all the institutions independent of metrics is desired.
This dataset is helpful for understanding where metrics and books defined in the ONIX feed are not matched, helping to target data quality tasks upstream of this workflow.
Because the export tables are all sharded by date, once the workflow has run the export table names will be updated. This is an issue for Looker Studio, which looks for a specific table name to pull data from. For this reason, the final step of the workflow is to create/update a set of views for both the export tables and their QA counterparts. The first run will create the views; subsequent runs will update each view to point to the appropriate (latest) table.
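A hedged sketch of how such a view can be created or repointed with the BigQuery client library; the project, dataset and table names are placeholders, not the workflow's real identifiers.

```python
# Sketch only: create or update a view that always points at the latest date-sharded table.
from google.api_core.exceptions import Conflict
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("my-project.data_export_latest.book_product_metrics")  # placeholder name
view.view_query = (
    "SELECT * FROM `my-project.data_export.oaebu_publisher_book_product_metrics20240101`"
)

try:
    client.create_table(view)                   # first run: create the view
except Conflict:
    client.update_table(view, ["view_query"])   # later runs: repoint it at the latest shard
```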
*The table names are a copy of the tables created in the Create Export Tables (data_export_latest dataset) and the Create QA ISBN and Create QA Aggregate (oaebu_data_qa_latest dataset) tasks with the date shard removed.
Dataset Name: onix_workflow
Table Names: onix_workfamilyid_isbn, onix_workid_isbn, onix_workid_isbn_errors
Average Runtime: 1-5 min
Average Download Size: 0-1 MB
Table Type: Sharded
Dataset Name: oaebu
Table Name: book
Average Runtime: ~1 min
Average Download Size: 5-50 MB
Table Type: Sharded
Dataset Name: oaebu_intermediate
Table Names: {data_source}_matched
Average Runtime: 1-5 min
Average Download Size: 0-1 MB
Table Type: Sharded
Dataset Name: oaebu
Table Names: book_product
Average Runtime: 1 min
Average Download Size: 100-2000 MB
Table Type: Sharded
Dataset Name: oaebu_data_qa
Table Names: {data_source}_unmatched_{isbn}, {data_source}_invalid_{isbn}
Average Runtime: ~1 min
Average Download Size: 1 MB
Table Type: Sharded
Dataset Name: oaebu_data_qa
Table Names: onix_aggregate_metrics
Average Runtime: ~1 min
Average Download Size: 1 MB
Table Type: Sharded
Dataset Name: data_export
Table Names*: author_metrics, list, metrics, metrics_city, metrics_country, metrics_events, metrics_institution, metrics_referrer, publisher_metrics, subject_bic_metrics, subject_bisac_metrics, subject_thema_metrics, subject_year_metrics, year_metrics
Table Names: institution_list, unmatched_book_metrics
Average Runtime: ~1 min
Average Download Size: 1 MB
Table Type: Sharded
Dataset Names: data_export_latest, oaebu_data_qa_latest
Table Names*: (see footnote above)
Average Runtime: ~1 min
Average Download Size: 0 MB
Table Type: View
Average Runtime: 10-20 min
Run Schedule: Weekly
Catch-up Missed Runs
Each Run Includes All Data