ONIX Workflow

The ONIX Workflow filters and aggregates the data ingested by the telescopes.

BAD has a single core workflow, the ONIX Workflow, which is responsible for filtering and aggregating all data for a publisher. The workflow prepares the data so that it can be easily accessed or imported by a dashboarding tool. The ONIX Workflow can be broadly broken into three parts:

  1. Aggregating and mapping book products into works and work families

  2. Linking data from metric providers to book products

  3. Creating export tables for visualisation in dashboards

Average Runtime: 10-20 min

Run Schedule: Weekly

Catch-up Missed Runs

Each Run Includes All Data

Workflow kwargs

Sensor DAG IDs (sensor_dag_ids)

A list of DAG IDs. Upon instantiation, the workflow will create a sensor task for each of the supplied DAGs. Each of the sensors will look back 7 days for a completed DAG. If there are no runs in the last 7 days, the sensor will be marked as a success. Otherwise, the sensor will wait for the DAG to complete before marking itself as a success.
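The look-back rule described above can be sketched in plain Python. This is an illustration only, not the workflow's actual sensor code: the function name `sensor_outcome`, the `(run_date, state)` tuples and the state strings are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def sensor_outcome(
    runs: List[Tuple[datetime, str]],
    now: datetime,
    lookback_days: int = 7,
) -> str:
    """Decide what a sensor should do for one upstream DAG.

    `runs` holds (run_date, state) pairs for the upstream DAG. The state
    strings ("success", "running") are hypothetical labels for this sketch.
    """
    window_start = now - timedelta(days=lookback_days)
    recent = [(when, state) for when, state in runs if window_start <= when <= now]

    # No run inside the look-back window: there is nothing to wait for.
    if not recent:
        return "success"

    # Otherwise the sensor succeeds only once the latest run has completed.
    _, latest_state = max(recent, key=lambda run: run[0])
    return "success" if latest_state == "success" else "wait"
```

In this sketch a run that started 14 days ago is ignored, while a run from yesterday that is still in progress keeps the sensor waiting.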

Data Partners (data_partners)

A list of data partners that the workflow will use to aggregate and filter the data. The corresponding sensors for each data partner should be present. This should contain one onix type partner and at least one data type partner.
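As a rough sketch of the constraint above, the check below rejects a partner list that does not contain one onix type partner and at least one data type partner. The `DataPartner` class, its `kind` labels and the function name are hypothetical stand-ins, and "one onix type partner" is read here as "exactly one".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataPartner:
    """Hypothetical stand-in for one entry of the data_partners kwarg."""

    name: str
    kind: str  # "onix" or "data" (labels assumed for this sketch)


def check_data_partners(partners: List[DataPartner]) -> None:
    """Raise ValueError if the partner list breaks the documented rule."""
    onix = [p.name for p in partners if p.kind == "onix"]
    data = [p.name for p in partners if p.kind == "data"]
    if len(onix) != 1:
        raise ValueError(f"expected exactly one onix partner, found {len(onix)}")
    if not data:
        raise ValueError("expected at least one data partner")
```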

Workflow Tasks

Aggregate Works

Dataset Name: onix_workflow

Table Names: onix_workfamilyid_isbn, onix_workid_isbn, onix_workid_isbn_errors

Average Runtime: 1-5 min

Average Download Size: 0-1 MB

Table Type: Sharded

The ONIX workflow uses the ONIX table created by an ONIX telescope (ONIX Telescope, Thoth, OAPEN Metadata) to do the following:

  1. Aggregate book product records into work records. Works are equivalence classes of products, where the products in a class are all manifestations of the same work. For example, a PDF and a paperback of the same work.

  2. Aggregate work records into work family records. A work family is an equivalence class of works, where the works in a class are different editions of each other.

  3. Produce intermediate lookup tables mapping ISBN13 -> WorkID and ISBN13 -> WorkFamilyID.

  4. Produce intermediate tables that append work_id and work_family_id columns to different data tables with ISBN keys.

Definitions - Product, Work and Work Families
  • Product: A manifestation of a work, with its own ISBN. Several DOIs may be linked to a single product (or sometimes none at all).

  • Work: A collection of products that are different manifestations of the same work. Some datasets assign unique IDs to the concept of a work, but their usage is not as clear as that of the ISBN for a product.

  • Edition: A new work that is derived as a revision of an existing work, as opposed to being entirely new.

  • Work Family: A collection of works that are different editions of each other.

The Work ID will be an arbitrary ISBN representative from a product in the equivalence class.

The Work Family ID will be an arbitrary Work ID (ISBN) representative from a work in the equivalence class.
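The equivalence classes and their representatives can be illustrated with a small union-find sketch. It is not the workflow's implementation: the function name, the input shape (pairs of related identifiers) and the rule of keeping the smallest ISBN as the representative are assumptions made for the example, since the text only says the representative is arbitrary.

```python
from typing import Dict, Iterable, Tuple


def assign_class_ids(
    ids: Iterable[str], related: Iterable[Tuple[str, str]]
) -> Dict[str, str]:
    """Map every identifier to a representative of its equivalence class."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ISBN as root so the representative is stable.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {i: find(i) for i in parent}


# ISBN13 -> Work ID: the first two products are manifestations of one work.
work_ids = assign_class_ids(
    ["9780000000001", "9780000000002", "9780000000003"],
    [("9780000000002", "9780000000001")],
)

# Work ID -> Work Family ID: the same routine, applied to works that are
# editions of each other.
work_family_ids = assign_class_ids(
    set(work_ids.values()),
    [("9780000000001", "9780000000003")],
)
```

The ISBN-like strings are made-up placeholders, not real book identifiers.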

Create Crossref Metadata table

Crossref Metadata is required to proceed. The ISBNs for each work are obtained from the publisher's ONIX table. For each of these ISBNs, the Crossref Metadata table produced by the Academic Observatory workflows is queried. Refer to the Crossref Metadata task.

Create Crossref Events table

As with Crossref Metadata, Crossref Event Data is required. It is retrieved from Crossref's dedicated event REST API by the Crossref Event Data telescope. The API accepts queries based on DOI only, so the DOIs are retrieved by matching the appropriate ISBN13 in the metadata. Refer to the Crossref Events task.

Create Book table

The book table (schema) is a collection of works and their relevant details for the relevant publisher. The table accommodates a title's Crossref metadata, events and separate chapters.

Dataset Name: oaebu

Table Name: book

Average Runtime: ~1 min

Average Download Size: 5-50 MB

Table Type: Sharded

Create intermediate tables

For each of a data partner's tables that contains ISBNs, create a new matched table that extends the original data with new work_id and work_family_id columns.

The schemas for these tables are identical to the raw telescope schemas, with the addition of the work_id and work_family_id columns.
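A minimal sketch of the matching step, assuming rows are plain dictionaries and that the two lookups map ISBN13 to Work ID and ISBN13 to Work Family ID. The function name and the choice to leave unmatched rows with `None` are assumptions for illustration; the real workflow operates on database tables rather than Python lists.

```python
from typing import Dict, List, Optional


def add_work_columns(
    rows: List[dict],
    isbn_field: str,
    work_ids: Dict[str, str],
    work_family_ids: Dict[str, str],
) -> List[dict]:
    """Return copies of `rows` extended with work_id and work_family_id."""
    matched = []
    for row in rows:
        isbn: Optional[str] = row.get(isbn_field)
        matched.append(
            {
                **row,  # original columns are kept unchanged
                "work_id": work_ids.get(isbn),
                "work_family_id": work_family_ids.get(isbn),
            }
        )
    return matched
```

Passing `isbn_field` explicitly reflects that different data partners may store the ISBN under different column names.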

Dataset Name: oaebu_intermediate

Table Names: {data_source}_matched

Average Runtime: 1-5 min

Average Download Size: 0-1 MB

Table Type: Sharded

Create Book Product table

The ONIX workflow takes the metrics fetched through various telescopes, then aggregates and joins them to the book records in the publisher's ONIX feed.

The output is the book product table (schema), containing one row per unique book, with a nested month field that groups all the metrics relating to that book for each calendar month. This table is the main output of the workflow and contains all of the aggregated and filtered data from all of a publisher's data partners/sources.
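The "one row per book, nested by month" shape can be pictured with the sketch below. The field names (`isbn`, `month`, `source`, `value`, `months`, `metrics`) are invented for the example and do not reflect the real book product schema.

```python
from collections import defaultdict
from typing import Dict, List


def nest_by_month(metric_rows: List[dict]) -> List[dict]:
    """Group flat metric rows into one record per ISBN with nested months."""
    per_book: Dict[str, Dict[str, Dict[str, int]]] = defaultdict(
        lambda: defaultdict(dict)
    )
    for row in metric_rows:
        month_metrics = per_book[row["isbn"]][row["month"]]
        # Sum values from the same source within the same calendar month.
        month_metrics[row["source"]] = (
            month_metrics.get(row["source"], 0) + row["value"]
        )

    return [
        {
            "isbn": isbn,
            "months": [
                {"month": month, "metrics": dict(metrics)}
                for month, metrics in sorted(months.items())
            ],
        }
        for isbn, months in sorted(per_book.items())
    ]
```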

Dataset Name: oaebu

Table Names: book_product

Average Runtime: 1 min

Average Download Size: 100-2000 MB

Table Type: Sharded

Create QA ISBN tables

For each data source, including the intermediate tables, we perform basic quality assurance checks on the data and output the results to tables that are easy to export (e.g. to CSV) for analysis by the publisher. For example, we verify whether the provided ISBNs are valid, and whether there are unmatched ISBNs, which indicate missing ONIX product records.
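The two checks mentioned above can be sketched as follows. The ISBN-13 check digit rule (alternating weights of 1 and 3, total divisible by 10) is standard; the function names and the returned dictionary are assumptions for illustration, and the workflow's own checks may differ in detail.

```python
from typing import Dict, Iterable, List


def is_valid_isbn13(isbn: str) -> bool:
    """Check length, digits and the ISBN-13 check digit."""
    if len(isbn) != 13 or not (isbn.isascii() and isbn.isdigit()):
        return False
    # Digits are weighted 1, 3, 1, 3, ... and the total must end in 0.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0


def qa_isbns(
    source_isbns: Iterable[str], onix_isbns: Iterable[str]
) -> Dict[str, List[str]]:
    """Split a data source's ISBNs into invalid and unmatched lists."""
    source = list(source_isbns)
    onix = set(onix_isbns)
    return {
        "invalid": [i for i in source if not is_valid_isbn13(i)],
        "unmatched": [i for i in source if i not in onix],
    }
```

In this sketch an invalid ISBN also appears in the unmatched list whenever it is absent from the ONIX feed.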

Dataset Name: oaebu_data_qa

Table Names: {data_source}_unmatched_{isbn}, {data_source}_invalid_{isbn}

Average Runtime: ~1 min

Average Download Size: 1 MB

Table Type: Sharded

The QA ISBN tables detail the following:

  • ISBN13s in the ONIX feed that are not valid.

  • ISBN13s in the data source that are not valid. The name of the ISBN field varies between data platforms (e.g. 'ISBN', 'publication_id', 'Primary_ISBN').

  • ISBN13s in the data source that were not matched to ISBN13s in the ONIX feed.

Create QA Aggregate tables

Dataset Name: oaebu_data_qa

Table Names: onix_aggregate_metrics

Average Runtime: ~1 min

Average Download Size: 1 MB

Table Type: Sharded

Create Export tables

Step three of the ONIX workflow is to export the book_product table to a sequence of flattened data export tables. The data in these tables is not materially different from the book product table; it is only organised in a way better suited to dashboards in Looker Studio.

Since these are date-sharded tables, their names change each time the workflow is run. Looker Studio (previously Data Studio) works best with a static naming scheme. For this reason, after creating the (sharded) export and quality analysis tables, we also create or update a view for each table. These views have static names. By referencing the views, we can keep the Looker Studio dashboards up to date without manual intervention.

Dataset Name: data_export

Table Names*: author_metrics, list, metrics, metrics_city, metrics_country, metrics_events, metrics_institution, metrics_referrer, publisher_metrics, subject_bic_metrics, subject_bisac_metrics, subject_thema_metrics, subject_year_metrics, year_metrics

Table Names: institution_list, unmatched_book_metrics

Average Runtime: ~1 min

Average Download Size: 1 MB

Table Type: Sharded

*Table names are prefixed with oaebu_{publisher}_book_product

list: This table is a list of each book product. It is primarily used for drop-down fields, or where a list of all the books independent of metrics is desired.

metrics: This table contains metrics, organised by month, that are linked to each book. The country, city, institution, events and referrer tables expand on this to provide further useful breakdowns of the metrics.

author_metrics: This table contains metrics, organised by month and author, that are linked to each author.

year_metrics: This table contains metrics, organised by published year and month, that are linked to each book.

metrics_events: This table contains metrics, organised by month and Crossref event type, that are linked to each book.

metrics_city: This table contains metrics, organised by month and city of measured usage, that are linked to each book.

metrics_country: This table contains metrics, organised by month and country of measured usage, that are linked to each book.

metrics_institution: This table contains metrics, organised by month and the institution for which there is measured activity, that are linked to each book.

publisher_metrics: This table contains a summary of metrics, organised by month, that are linked to each publisher.

subject_bic_metrics: This table contains metrics, organised by month and BIC subject type, that are linked to each book.

subject_bisac_metrics: This table contains metrics, organised by month and BISAC subject type, that are linked to each book.

subject_thema_metrics: This table contains metrics, organised by month and Thema subject type, that are linked to each book.

subject_year_metrics: This table contains metrics, organised by published year, month and (currently) just the BIC subject type, that are linked to each book.

institution_list: This table is a list of each unique institution that metrics are linked to. It is primarily used for drop-down fields, or where a list of all the institutions independent of metrics is desired.

unmatched_book_metrics: This table shows where metrics and books defined in the ONIX feed are not matched, which helps target data quality tasks upstream of this workflow.

Create Latest Views

Because the export tables are all sharded by date, the export table names change every time the workflow runs. This is an issue for Looker Studio, which looks for a specific table name to pull data from. For this reason, the final step of the workflow is to create or update a set of views for both the export tables and their QA counterparts. The first run creates the views; subsequent runs update each view to point to the appropriate (latest) table.
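The view update can be sketched as a function that finds the latest shard of each table and emits one view statement per table. Several things here are assumptions rather than facts from this page: that the shard is a `YYYYMMDD` suffix appended directly to the table name, that BigQuery-style `CREATE OR REPLACE VIEW` SQL is used, and every identifier in the example (project, function and pattern names).

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed shard format: table name followed directly by YYYYMMDD.
SHARD_PATTERN = re.compile(r"^(?P<base>.*\D)(?P<shard>\d{8})$")


def latest_view_statements(
    project: str,
    source_dataset: str,
    view_dataset: str,
    table_names: Iterable[str],
) -> List[str]:
    """Build one view statement per table, pointing at its latest shard."""
    latest: Dict[str, Tuple[str, str]] = {}
    for name in table_names:
        match = SHARD_PATTERN.match(name)
        if match is None:
            continue  # not a date-sharded table
        base, shard = match.group("base"), match.group("shard")
        # YYYYMMDD strings sort chronologically, so string comparison works.
        if base not in latest or shard > latest[base][0]:
            latest[base] = (shard, name)

    return [
        f"CREATE OR REPLACE VIEW `{project}.{view_dataset}.{base}` AS "
        f"SELECT * FROM `{project}.{source_dataset}.{name}`"
        for base, (_, name) in sorted(latest.items())
    ]
```

Because the view name is the table name with the date shard removed, a dashboard that references the view never needs to change when a new shard is created.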

Dataset Name: data_export_latest, oaebu_data_qa_latest

Table Names*

Average Runtime: ~1 min

Average Download Size: 0 MB

Table Type: View

*The table names are a copy of the tables created in the Create Export Tables (data_export_latest dataset) and the Create QA ISBN and Create QA Aggregate (oaebu_data_qa_latest dataset) tasks with the date shard removed.
