UCL Discovery
UCL Discovery is UCL's open access repository, showcasing and providing access to the full texts of UCL research publications.
Dataset Name | ucl |
Table Name | ucl_discovery |
Table Type | Partitioned |
Average Runtime | 5-10 min |
Average Download Size | ~1 MB |
Harvest Type | API |
Run Schedule | Monthly on the 4th |
Catch-up Missed Runs | |
Each Run Includes All Data |
The Google Sheet
UCL's titles are referenced by their identifier, the eprint ID. UCL's metadata maps the eprint ID to an ISBN13, but not consistently. For this reason, we forgo their metadata and instead use a semi-manual process to reliably map the two identifiers. The telescope references a Google Sheet that lists all of the titles available in the UCL Discovery repository under the following headings:
ISBN13 | The title's ISBN13 |
date | The date of publication |
title_list_title | The title of the publication |
discovery_eprint_id | The eprint ID of the publication |
Some notes:
These headings are hardcoded into the telescope. Changing any of them in the sheet will break the telescope unless the telescope is updated first.
Entries without a publication date or with a publication date in the future (where the current time is determined by the Airflow scheduler) will be ignored.
Entries missing either an ISBN13 or eprint ID will be ignored.
For these reasons, it is important that the Google Sheet remains up to date. Otherwise, the usage for a title may be missed and a rerun will be required.
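The filtering rules in the notes above can be sketched as follows. This is a minimal illustration, not the telescope's actual code: the function name `filter_sheet_rows`, the `YYYY-MM-DD` date format and the choice to drop unparseable dates are all assumptions. Only the column headings come from this page.

```python
from datetime import date, datetime
from typing import Dict, List

# Column headings as listed on this page; both identifiers must be present.
REQUIRED_IDENTIFIERS = ("ISBN13", "discovery_eprint_id")


def filter_sheet_rows(rows: List[Dict[str, str]], run_date: date) -> List[Dict[str, str]]:
    """Keep only rows the telescope would process (hypothetical helper)."""
    kept = []
    for row in rows:
        # Rule: entries missing either an ISBN13 or an eprint ID are ignored.
        if any(not (row.get(key) or "").strip() for key in REQUIRED_IDENTIFIERS):
            continue
        # Rule: entries without a publication date are ignored.
        raw_date = (row.get("date") or "").strip()
        if not raw_date:
            continue
        try:
            # Assumption: dates are stored as YYYY-MM-DD in the sheet.
            published = datetime.strptime(raw_date, "%Y-%m-%d").date()
        except ValueError:
            # Assumption: an unparseable date is treated like a missing one.
            continue
        # Rule: only titles published before the scheduled run date are kept.
        if published >= run_date:
            continue
        kept.append(row)
    return kept
```

Passing the scheduled run date in as an argument, rather than reading the system clock, mirrors the note that the current time is determined by the scheduler.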
Access
Access to the sheet can be granted using the sheet UI (Share at the top right of the page). The telescope will access the sheet via a service account, which will need to be given read access (Viewer) by supplying the account's email address.
Telescope kwargs
Sheet ID (sheet_id)
The ID of the Google Sheet. The ID can be found in its URI, which will have the form https://docs.google.com/spreadsheets/d/[SHEET_ID].
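As an illustration, the sheet ID can be pulled out of a full sheet URI with a small helper. This is a sketch only: `extract_sheet_id` is a hypothetical name, and the pattern assumes that IDs consist of letters, digits, hyphens and underscores.

```python
import re
from typing import Optional

# Assumption: sheet IDs contain only letters, digits, hyphens and underscores.
SHEET_URL_PATTERN = re.compile(r"docs\.google\.com/spreadsheets/d/([A-Za-z0-9_-]+)")


def extract_sheet_id(url: str) -> Optional[str]:
    """Return the sheet ID embedded in a Google Sheets URI, or None if absent."""
    match = SHEET_URL_PATTERN.search(url)
    return match.group(1) if match else None
```

Anything after the ID in the URI, such as a trailing `/edit` path, is ignored by the pattern.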
UCL Discovery Usage API
UCL Discovery provides free and open access to their usage REST API. Unfortunately, we could not find any documentation on its use or design. We utilise two endpoints:
Countries URI =
https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=countries&top=countries&view=Table&limit=all&export=JSON
Totals URI =
https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=downloads&graph_type=column&view=Google%3A%3AGraph&date_resolution=month&title=Download+activity+-+last+12+months&export=JSON
Here, from, to and set_value are set appropriately. The countries URI returns the number of downloads of the provided eprint ID broken down by country. The totals URI returns the number of downloads of the provided eprint ID aggregated over all regions. Note that the totals data is not necessarily a simple aggregation of the countries data, because country data is omitted for downloads that are not attributed to a region. It is therefore not uncommon for the total download count (derived from the totals URI) to be greater than the sum of downloads over all listed countries (from the countries URI).
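The two URIs can be assembled from a shared set of query parameters, as in the sketch below. The parameter names and values are taken from the templates above; the function names are hypothetical, and no request is made, so nothing here depends on the (undocumented) response format.

```python
from datetime import date
from typing import Dict
from urllib.parse import urlencode

BASE_URL = "https://discovery.ucl.ac.uk/cgi/stats/get"


def _common_params(eprint_id: str, start: date, end: date) -> Dict[str, str]:
    """Parameters shared by both endpoints, in the order used in the templates."""
    return {
        "from": start.strftime("%Y%m%d"),
        "to": end.strftime("%Y%m%d"),
        "irs2report": "eprint",
        "set_name": "eprint",
        "set_value": eprint_id,
    }


def countries_uri(eprint_id: str, start: date, end: date) -> str:
    """Build the per-country downloads URI for one eprint ID."""
    params = _common_params(eprint_id, start, end)
    params.update({"datatype": "countries", "top": "countries",
                   "view": "Table", "limit": "all", "export": "JSON"})
    return f"{BASE_URL}?{urlencode(params)}"


def totals_uri(eprint_id: str, start: date, end: date) -> str:
    """Build the aggregated downloads URI for one eprint ID."""
    params = _common_params(eprint_id, start, end)
    params.update({"datatype": "downloads", "graph_type": "column",
                   "view": "Google::Graph", "date_resolution": "month",
                   "title": "Download activity - last 12 months", "export": "JSON"})
    return f"{BASE_URL}?{urlencode(params)}"


def unattributed_downloads(total: int, country_counts: Dict[str, int]) -> int:
    """Downloads counted in the total but not attributed to any listed country."""
    return total - sum(country_counts.values())
```

`urlencode` percent-encodes `Google::Graph` and replaces the spaces in the title with `+`, which reproduces the encoded form shown in the totals URI template.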
Telescope Tasks
Data Download
Acquires the eprint IDs and publication dates from the Google Sheet. For each ID whose publication date is before the current scheduled run date, downloads the country and totals data using the API, then uploads the results to the GCS download bucket.
Data Transform
Acquires the eprint IDs, ISBN13s and titles from the Google Sheet. For each ID, loads the downloaded data (both countries and totals) into a single data structure and includes the title (it does not matter whether the title is empty; it exists for completeness only). Adds an additional field to each row: release_date, which is determined by the scheduled run time. Uploads the transformed structure to the GCS transform bucket.
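A rough sketch of the transform step for a single title is shown below. The field names follow the table schema on this page, and release_date is taken to be the last day of the release month as the schema states. The function names, the already-parsed inputs (`total_downloads`, `country_counts`) and the assumption that the release month is the month of the date passed in are illustrative only. The timescale and origin fields are omitted for brevity.

```python
import calendar
from datetime import date
from typing import Any, Dict, Optional


def release_date_for(run_date: date) -> date:
    """Last day of the month containing run_date (assumed to be the release month)."""
    last_day = calendar.monthrange(run_date.year, run_date.month)[1]
    return run_date.replace(day=last_day)


def build_row(
    isbn: str,
    eprint_id: str,
    title: Optional[str],
    total_downloads: int,
    country_counts: Dict[str, int],
    run_date: date,
) -> Dict[str, Any]:
    """Combine totals and per-country counts for one title into a single row."""
    return {
        "ISBN": isbn,
        "eprint_id": eprint_id,
        "title": title,  # may be empty; included for completeness only
        "total_downloads": total_downloads,
        "country": [
            {"value": code, "count": count} for code, count in country_counts.items()
        ],
        "release_date": release_date_for(run_date).isoformat(),
    }
```

Keeping the per-country counts as a repeated list of `value`/`count` records matches the `country` RECORD field in the schema, so each title produces exactly one row per release.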
BigQuery Load
The transformed data is loaded from the Google Cloud bucket into a partitioned BigQuery table. The table is in the ucl dataset (which will be created should it not exist yet). Since the data is partitioned on the release month, there will only be a single table named ucl_discovery.
Table Schema
name | type | mode | description |
---|---|---|---|
ISBN | STRING | REQUIRED | ISBN13 of the book. |
eprint_id | STRING | REQUIRED | eprint ID of the book. |
title | STRING | NULLABLE | Title of the book. |
timescale | RECORD | NULLABLE | Timescale of the statistics as reported by the origin. |
timescale.format | STRING | NULLABLE | Format of the 'to' and 'from' fields |
timescale.from | STRING | NULLABLE | Beginning of date range for the statistics |
timescale.to | STRING | NULLABLE | End of date range for the statistics |
origin | RECORD | NULLABLE | Origin of the statistics |
origin.url | STRING | NULLABLE | The URL of the origin |
origin.name | STRING | NULLABLE | The name of the origin |
total_downloads | INTEGER | NULLABLE | The aggregated statistics for the reported period |
country | RECORD | REPEATED | The aggregated statistics for each reported country |
country.value | STRING | NULLABLE | The two letter country code. |
country.count | INTEGER | NULLABLE | The total number of item downloads for the reported period from this country. |
release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column. |