UCL Discovery
UCL Discovery is UCL's open access repository, showcasing and providing access to the full texts of UCL research publications.
UCL's titles are referenced by their identifier, the eprint ID. Their metadata maps the eprint ID to an ISBN13, but not consistently. For this reason, we forgo their metadata and instead use a semi-manual process to reliably map the two identifiers. The telescope references a Google Sheet that lists all of the titles available in the UCL Discovery repository under the following headings: ISBN13, date, title_list_title and discovery_eprint_id.
Some notes:
- These headings are hardcoded into the telescope. Changing a heading in the sheet will break the telescope unless the telescope is updated first.
- Entries without a publication date, or with a publication date in the future (where the current time is determined by the Airflow scheduler), will be ignored.
- Entries missing either an ISBN13 or an eprint ID will be ignored.
For these reasons, it is important that the Google Sheet remains up to date. Otherwise, the usage for a title may be missed and require a rerun.
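The filtering rules above can be sketched in a few lines of Python. This is only an illustration, not the telescope's own code: the function name `usable_rows`, the list-of-rows input shape and the `%Y-%m-%d` date format are all assumptions, and whether a publication date equal to the run date is kept is a guess.

```python
from datetime import date, datetime

# Heading names as listed for the Google Sheet.
EXPECTED_HEADINGS = ["ISBN13", "date", "title_list_title", "discovery_eprint_id"]


def usable_rows(sheet_values, run_date, date_format="%Y-%m-%d"):
    """Return the rows that the notes above would keep (hypothetical helper).

    sheet_values: list of rows, the first row being the headings.
    run_date: the scheduled run date as a datetime.date.
    date_format: ASSUMPTION, the real sheet may store dates differently.
    """
    header, *rows = sheet_values
    missing = [h for h in EXPECTED_HEADINGS if h not in header]
    if missing:
        # A renamed heading is treated as a hard failure.
        raise ValueError(f"Sheet is missing expected headings: {missing}")

    kept = []
    for row in rows:
        # Pad short rows so that trailing empty cells become empty strings.
        padded = list(row) + [""] * (len(header) - len(row))
        record = dict(zip(header, (str(cell).strip() for cell in padded)))
        if not record["ISBN13"] or not record["discovery_eprint_id"]:
            continue  # missing identifier
        if not record["date"]:
            continue  # missing publication date
        try:
            published = datetime.strptime(record["date"], date_format).date()
        except ValueError:
            continue  # ASSUMPTION: an unparseable date is treated as missing
        if published > run_date:
            continue  # published in the future relative to the run
        kept.append(record)
    return kept


# Usage with made-up rows; only the first data row survives.
example = [
    EXPECTED_HEADINGS,
    ["9781787350000", "2020-01-15", "A title", "10001"],
    ["", "2020-01-15", "No ISBN", "10002"],
    ["9781787350002", "2999-01-01", "Future title", "10004"],
]
print(usable_rows(example, date(2022, 6, 4)))
```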
Access to the sheet can be granted using the sheet UI (Share at the top right of the page). The telescope will access the sheet via a service account, which will need to be given read access (Viewer) by supplying the account's email address.
The ID of the Google Sheet. The ID can be found in its URI, which has the form https://docs.google.com/spreadsheets/d/[SHEET_ID].
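If only the full sheet URI is at hand, the ID can be pulled out of it programmatically. The helper below is a small sketch under that URI form; the function name and the example ID are made up for illustration.

```python
import re


def sheet_id_from_uri(uri):
    """Extract [SHEET_ID] from a Google Sheets URI (hypothetical helper)."""
    # Assumes the ID consists of letters, digits, hyphens and underscores.
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", uri)
    if match is None:
        raise ValueError(f"No sheet ID found in: {uri}")
    return match.group(1)


# The ID below is a placeholder, not a real sheet.
print(sheet_id_from_uri("https://docs.google.com/spreadsheets/d/abc123_DEF-456/edit#gid=0"))
```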
UCL Discovery provides free and open access to their usage REST API. Unfortunately, we could not find any documentation on its use or design. We utilise two endpoints:
Countries URI = https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=countries&top=countries&view=Table&limit=all&export=JSON
Totals URI = https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=downloads&graph_type=column&view=Google%3A%3AGraph&date_resolution=month&title=Download+activity+-+last+12+months&export=JSON
Here from, to and set_value are set appropriately. The countries URI returns the number of downloads of the provided eprint ID broken down by country. The totals URI returns the number of downloads of the provided eprint ID aggregated over all regions. Note that the totals data is not necessarily a simple aggregation of the countries data, because country data is omitted for downloads that are not attributed to a region. It is therefore not uncommon for the total download count (derived from the totals URI) to be greater than the sum of downloads from all listed countries (from the countries URI).
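As a rough sketch, the two URIs can be assembled from an eprint ID and a date range with the standard library. The query parameters are copied from the URIs above; the function names, the example eprint ID and the example counts are hypothetical. The last helper shows the arithmetic behind the gap between the total and the per-country sum.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://discovery.ucl.ac.uk/cgi/stats/get"


def _common_params(eprint_id, start, end):
    # Parameters shared by both endpoints, dates formatted as YYYYMMDD.
    return {
        "from": start.strftime("%Y%m%d"),
        "to": end.strftime("%Y%m%d"),
        "irs2report": "eprint",
        "set_name": "eprint",
        "set_value": str(eprint_id),
    }


def countries_uri(eprint_id, start, end):
    params = _common_params(eprint_id, start, end)
    params.update(
        {"datatype": "countries", "top": "countries", "view": "Table",
         "limit": "all", "export": "JSON"}
    )
    return f"{BASE_URL}?{urlencode(params)}"


def totals_uri(eprint_id, start, end):
    params = _common_params(eprint_id, start, end)
    params.update(
        {"datatype": "downloads", "graph_type": "column",
         "view": "Google::Graph", "date_resolution": "month",
         "title": "Download activity - last 12 months", "export": "JSON"}
    )
    return f"{BASE_URL}?{urlencode(params)}"


def unattributed_downloads(total_downloads, downloads_by_country):
    # Downloads counted in the total but not attributed to any country.
    return total_downloads - sum(downloads_by_country.values())


# "10001" is a placeholder eprint ID; the counts are made up.
print(countries_uri("10001", date(2022, 1, 1), date(2022, 1, 31)))
print(totals_uri("10001", date(2022, 1, 1), date(2022, 1, 31)))
print(unattributed_downloads(120, {"GB": 70, "US": 40}))
```

Using `urlencode` keeps the percent-encoding of values such as `Google::Graph` consistent with the URIs listed above.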
Acquires the eprint IDs and publication dates from the Google Sheet. For each ID whose publication date is before the current scheduled run date, downloads the countries and totals data using the API, then uploads the results to the GCS download bucket.
Acquires the eprint IDs, ISBN13s and titles from the Google Sheet. For each ID, loads the downloaded data (both countries and totals) into a single data structure and includes the title (whether it is empty or not does not matter, as the title exists for completeness only). Adds an additional field to each row, the release_date, which is determined by the scheduled runtime. Uploads the transformed structure to the GCS transform bucket.
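The shape of one transformed row might look like the sketch below. The field names follow the table schema; everything else is an assumption: the helper names, the already-parsed inputs (the layout of the raw API response is not documented here), the placeholder timescale and origin values, and the choice to take the release month from the date passed in.

```python
import calendar
from datetime import date


def last_day_of_month(day):
    """Return the last calendar day of the month containing `day`."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    return day.replace(day=days_in_month)


def build_row(isbn13, eprint_id, title, timescale, origin,
              total_downloads, downloads_by_country, release_month):
    """Combine totals and countries data into one row (hypothetical helper)."""
    return {
        "ISBN": isbn13,
        "eprint_id": eprint_id,
        "title": title,  # may be empty, kept for completeness only
        "timescale": timescale,
        "origin": origin,
        "total_downloads": total_downloads,
        "country": [
            {"value": code, "count": count}
            for code, count in sorted(downloads_by_country.items())
        ],
        # ASSUMPTION: the release month is the month of `release_month`.
        "release_date": last_day_of_month(release_month).isoformat(),
    }


# All values below are placeholders for illustration.
row = build_row(
    isbn13="9781787350000",
    eprint_id="10001",
    title="",
    timescale={"format": "YYYYMMDD", "from": "20220601", "to": "20220630"},
    origin={"name": "placeholder name", "url": "placeholder url"},
    total_downloads=120,
    downloads_by_country={"US": 40, "GB": 70},
    release_month=date(2022, 6, 4),
)
print(row["release_date"], row["country"])
```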
The transformed data is loaded from the Google Cloud bucket into a partitioned BigQuery table. The table is in the ucl dataset (which will be created should it not exist yet). Since the data is partitioned on the release month, there will only be a single table named ucl_discovery.
name | type | mode | description |
---|---|---|---|
ISBN | STRING | REQUIRED | ISBN13 of the book. |
eprint_id | STRING | REQUIRED | eprint ID of the book. |
title | STRING | NULLABLE | Title of the book. |
timescale | RECORD | NULLABLE | Timescale of the statistics as reported by the origin. |
timescale.format | STRING | NULLABLE | Format of the 'to' and 'from' fields. |
timescale.from | STRING | NULLABLE | Beginning of date range for the statistics. |
timescale.to | STRING | NULLABLE | End of date range for the statistics. |
origin | RECORD | NULLABLE | Origin of the statistics. |
origin.url | STRING | NULLABLE | The URL of the origin. |
origin.name | STRING | NULLABLE | The name of the origin. |
total_downloads | INTEGER | NULLABLE | The aggregated statistics for the reported period. |
country | RECORD | REPEATED | The aggregated statistics for each reported country. |
country.value | STRING | NULLABLE | The two letter country code. |
country.count | INTEGER | NULLABLE | The total number of item downloads for the reported period from this country. |
release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column. |
property | value |
---|---|
Dataset Name | ucl |
Table Name | ucl_discovery |
Table Type | Partitioned |
Average Runtime | 5-10 min |
Average Download Size | ~1 MB |
Harvest Type | API |
Run Schedule | Monthly on the 4th |
Catch-up Missed Runs | |
Each Run Includes All Data | |
heading | description |
---|---|
ISBN13 | The title's ISBN13 |
date | The date of publication |
title_list_title | The title of the publication |
discovery_eprint_id | The eprint ID of the publication |