ONIX Workflow
The ONIX Workflow filters and aggregates the data ingested by the telescopes.
BAD has a single core workflow: the ONIX Workflow, which is responsible for filtering and aggregating all data for a publisher. The workflow prepares the data so that it can be easily accessed and imported by a dashboarding tool. The ONIX Workflow can be broadly broken into three parts:
Aggregating and Mapping book products into works and work families
Linking data from metric providers to book products
Creating export tables for visualisation in dashboards
Average Runtime: 10-20 min
Run Schedule: Weekly and on the 5th of every month
Catch-up Missed Runs: Each Run Includes All Data
A list of DAG IDs. Upon instantiation, the workflow creates a sensor task for each of the supplied DAGs. Each sensor looks back 7 days for a completed DAG run. If there are no runs in the last 7 days, the sensor is marked as a success. Otherwise, the sensor waits for the DAG to complete before marking itself as a success.
A list of data partners that the workflow will use to aggregate and filter the data. A corresponding sensor for each data partner should be present. The list should contain one ONIX-type partner and at least one data-type partner.
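The partner constraint above can be sketched as a small configuration check. This is an illustrative sketch only: the parameter names (sensor_dag_ids, data_partners) and partner structure are assumptions, not the workflow's actual API.

```python
# Hypothetical configuration sketch for instantiating the ONIX Workflow.
# All names here are illustrative assumptions, not the real interface.
workflow_config = {
    # One sensor is created per DAG ID; each looks back 7 days for a
    # completed run before marking itself a success.
    "sensor_dag_ids": ["onix_telescope", "google_analytics", "jstor"],
    # Exactly one ONIX-type partner plus at least one data-type partner.
    "data_partners": [
        {"name": "onix", "type": "onix"},
        {"name": "google_analytics", "type": "data"},
        {"name": "jstor", "type": "data"},
    ],
}

def validate_config(config: dict) -> bool:
    """Check the one-ONIX / at-least-one-data-partner constraint."""
    types = [p["type"] for p in config["data_partners"]]
    return types.count("onix") == 1 and types.count("data") >= 1
```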
Dataset Name: onix_workflow
Table Names: onix_workfamilyid_isbn, onix_workid_isbn, onix_workid_isbn_errors
Average Runtime: 1-5 min
Average Download Size: 0-1 MB
Table Type: Sharded
The ONIX workflow uses the ONIX table created by an ONIX telescope (ONIX Telescope, Thoth, OAPEN Metadata) to do the following:
Aggregate book product records into works records. Works are equivalence classes of products, where each product in the class is a different manifestation of the same work; for example, a PDF and a paperback of the same title.
Aggregate work records into work family records. A work family is an equivalence class of works where each work in the class is just a different edition.
Produce intermediate lookup tables mapping ISBN13 -> WorkID and ISBN13 -> WorkFamilyID.
Produce intermediate tables that append work_id and work_family_id columns to different data tables with ISBN keys.
The Work ID will be an arbitrary ISBN representative from a product in the equivalence class.
The Work Family ID will be an arbitrary Work ID (ISBN) representative from a work in the equivalence class.
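The grouping described above can be sketched as a union-find over related ISBNs. This is a minimal sketch, assuming a simple input shape; the real workflow derives relationships from ONIX RelatedProduct records and its actual implementation may differ.

```python
from collections import defaultdict

def group_products(related_isbns: dict[str, set[str]]) -> dict[str, str]:
    """Group ISBN13 product records into equivalence classes (works).

    related_isbns maps each product ISBN to the ISBNs of its other
    manifestations (e.g. the PDF ISBN of a paperback). Returns a
    mapping ISBN13 -> work_id, where the work_id is an arbitrary
    representative ISBN from the class (here: the smallest).
    """
    parent = {}  # simple union-find over ISBNs

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for isbn, related in related_isbns.items():
        find(isbn)  # ensure isolated products form their own class
        for other in related:
            union(isbn, other)

    # Pick a representative ISBN in each class as the work_id.
    members = defaultdict(list)
    for isbn in parent:
        members[find(isbn)].append(isbn)
    return {i: min(group) for group in members.values() for i in group}
```

The same procedure applied one level up (works related by edition) yields the work family grouping.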
Crossref Metadata is required to proceed. The ISBNs for each work are obtained from the publisher's ONIX table. For each of these ISBNs, the Crossref Metadata table produced by the Academic Observatory workflows is queried. Refer to the Crossref Metadata task.
The book table (schema) is a collection of works and their relevant details for the relevant publisher. The table accommodates a title's Crossref metadata and separate chapters.
Dataset Name: oaebu
Table Name: book
Average Runtime: ~1 min
Average Download Size: 5-50 MB
Table Type: Sharded
For each data partner table containing an ISBN column, create a new matched table that extends the original data with work_id and work_family_id columns.
The schemas for these tables are identical to the raw telescope schemas, with the addition of the work_id and work_family_id columns.
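The matching step amounts to a left join of each partner table against the ISBN lookup tables. A minimal in-memory sketch, assuming dict-shaped rows and lookups (the real workflow does this in BigQuery SQL):

```python
def match_rows(partner_rows, workid_lookup, workfamilyid_lookup):
    """Extend partner rows (each keyed by ISBN13) with work_id and
    work_family_id columns, mirroring the {data_source}_matched tables.

    Assumption: rows whose ISBN is absent from the ONIX-derived lookups
    keep a None work_id/work_family_id rather than being dropped.
    """
    matched = []
    for row in partner_rows:
        isbn = row["ISBN13"]
        matched.append({
            **row,  # original partner columns are preserved unchanged
            "work_id": workid_lookup.get(isbn),
            "work_family_id": workfamilyid_lookup.get(isbn),
        })
    return matched
```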
Dataset Name: oaebu_intermediate
Table Names: {data_source}_matched
Average Runtime: 1-5 min
Average Download Size: 0-1 MB
Table Type: Sharded
The ONIX workflow takes the metrics fetched by the various telescopes, then aggregates and joins them to the book records in the publisher's ONIX feed.
The output is the book product table (schema), containing one row per unique book, with a nested month field that groups all of the metrics relating to that book for each calendar month. This table is the main output of the workflow and contains all of the aggregated and filtered data from a publisher's data partners/sources.
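The one-row-per-book, nested-month shape can be sketched as follows. Field names (work_id, month, downloads) are illustrative assumptions, not the real book_product schema:

```python
from collections import defaultdict

def build_book_product(matched_rows):
    """Sketch of the book_product shape: one record per book (work_id)
    with a nested 'months' field grouping metrics by calendar month.

    matched_rows are dicts with a work_id, a 'YYYY-MM' month and a
    metric value; the field names are illustrative only.
    """
    books = defaultdict(lambda: defaultdict(int))
    for row in matched_rows:
        # Sum the metric per (book, month) pair.
        books[row["work_id"]][row["month"]] += row["downloads"]
    return [
        {
            "work_id": work_id,
            "months": [
                {"month": m, "downloads": d}
                for m, d in sorted(by_month.items())
            ],
        }
        for work_id, by_month in books.items()
    ]
```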
Dataset Name: oaebu
Table Names: book_product
Average Runtime: 1 min
Average Download Size: 100-2000 MB
Table Type: Sharded
Step three of the ONIX workflow exports the book_product table to a sequence of flattened data export tables. The data in these tables is not materially different from the book product table; it is just organised in a way better suited to dashboards in Looker Studio.
Since these are date-sharded tables, their names change each time the workflow runs. When using Google's Looker Studio (previously Data Studio), it is preferable to use a static naming scheme. For this reason, after creating the (sharded) export and quality analysis tables, we also create/update a view for each table. These views have static names. By referencing the views, the Looker Studio dashboards stay up to date without manual intervention.
Dataset Name: data_export
Table Names*: author_metrics, list, metrics, metrics_city, metrics_country, metrics_institution, metrics_referrer, publisher_metrics, subject_bic_metrics, subject_bisac_metrics, subject_thema_metrics, subject_year_metrics, year_metrics
Table Names: institution_list, unmatched_book_metrics
Average Runtime: ~1 min
Average Download Size: 1 MB
Table Type: Sharded
*Table names prefixed with oaebu_{publisher}
This table contains metrics, organised by month, that are linked to each book. The country, city, institution and referrer tables expand on this to provide further useful breakdowns of the metrics.
This table contains metrics, organised by month and author, that are linked to each author.
This table contains metrics, organised by month and city of measured usage, that are linked to each book.
This table contains metrics, organised by month and country of measured usage, that are linked to each book.
This table contains metrics, organised by month and by the institution with measured activity, that are linked to each book.
This table contains metrics, organised by month and BIC subject type, that are linked to each book.
This table contains metrics, organised by month and BISAC subject type, that are linked to each book.
This table contains metrics, organised by month and THEMA subject type, that are linked to each book.
This table is a list of each unique institution to which metrics are linked. It is primarily used for drop-down fields, or where a list of all institutions independent of metrics is desired.
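The export step flattens the nested book_product records into one row per breakdown value, as a table like metrics_country would need. A minimal sketch, assuming the illustrative nested shape below rather than the real schema:

```python
def export_metrics_country(book_products):
    """Flatten nested book_product months into one row per
    (book, month, country): the shape of an export table such as
    metrics_country. Field names are illustrative assumptions.
    """
    rows = []
    for book in book_products:
        for month in book["months"]:
            for country in month.get("countries", []):
                rows.append({
                    "work_id": book["work_id"],
                    "month": month["month"],
                    "country": country["name"],
                    "downloads": country["downloads"],
                })
    return rows
```

The other export tables follow the same pattern, swapping country for city, institution, referrer, author or subject.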
Because the export tables are all sharded by date, the export table names change each time the workflow runs. This is an issue for Looker Studio, which looks for a specific table name to pull data from. For this reason, the final step of the workflow creates/updates a set of views for both the export tables and their QA counterparts. The first run creates the views; subsequent runs update each view to point to the appropriate (latest) table.
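The view update amounts to pointing a statically named view at the newest YYYYMMDD shard of each table. A sketch that only generates the DDL strings, with hypothetical project and dataset names (the real workflow issues these statements through the BigQuery client):

```python
def latest_view_ddl(project, dataset, table_shards):
    """Generate CREATE OR REPLACE VIEW statements that point a
    statically named view at the latest date shard of each table.

    table_shards maps a base table name to its available shard
    suffixes (YYYYMMDD strings). Names here are illustrative.
    """
    statements = []
    for base, shards in table_shards.items():
        # Lexicographic max equals the latest date for YYYYMMDD strings.
        latest = max(shards)
        statements.append(
            f"CREATE OR REPLACE VIEW `{project}.{dataset}_latest.{base}` AS "
            f"SELECT * FROM `{project}.{dataset}.{base}{latest}`"
        )
    return statements
```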
Dataset Name: data_export_latest, oaebu_data_qa_latest
Table Names*: see note below
Average Runtime: ~1 min
Average Download Size: 0 MB
Table Type: View
*The table names are a copy of the tables created in the Create Export Tables (data_export_latest dataset) and the Create QA ISBN and Create QA Aggregate (oaebu_data_qa_latest dataset) tasks with the date shard removed.