UCL Discovery

UCL Discovery is UCL's open access repository, showcasing and providing access to the full texts of UCL research publications.

  • Dataset Name: ucl

  • Table Name: ucl_discovery

  • Table Type: Partitioned

  • Average Runtime: 5-10 min

  • Average Download Size: ~1 MB

  • Harvest Type: API

  • Run Schedule: Monthly on the 4th

  • Catch-up Missed Runs:

  • Each Run Includes All Data:

The Google Sheet

UCL Discovery identifies each title by its eprint ID. UCL's metadata maps the eprint ID to an ISBN13, but not consistently. For this reason, we do not use their metadata and instead rely on a semi-manual process to map the two identifiers reliably. The telescope reads a Google Sheet that lists all of the titles available in the UCL Discovery repository under the following headings:

  • ISBN13: The title's ISBN13

  • date: The date of publication

  • title_list_title: The title of the publication

  • discovery_eprint_id: The eprint ID of the publication

Some notes:

  • These headings are hardcoded into the telescope. Changing them in the sheet will break the telescope unless the telescope is updated first.

  • Entries without a publication date, or with a publication date in the future (relative to the current time as determined by the Airflow scheduler), will be ignored.

  • Entries missing either an ISBN13 or eprint ID will be ignored.

For these reasons, it is important that the Google Sheet remains up to date. Otherwise, the usage for a title may be missed, and a rerun will be required to capture it.
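The filtering rules in the notes above can be sketched as follows. This is a minimal illustration, not the telescope's actual code: the function name `usable_rows`, the representation of each sheet row as a dict keyed by heading, and the `YYYY-MM-DD` date format are all assumptions. How an entry published exactly on the current date is treated is also an assumption (it is kept here).

```python
from datetime import date

# Headings as documented above; everything else in this sketch is hypothetical.
ISBN_KEY = "ISBN13"
DATE_KEY = "date"
EPRINT_KEY = "discovery_eprint_id"


def usable_rows(rows, today):
    """Return the sheet rows that would not be ignored under the documented rules."""
    kept = []
    for row in rows:
        isbn = (row.get(ISBN_KEY) or "").strip()
        eprint_id = (row.get(EPRINT_KEY) or "").strip()
        raw_date = (row.get(DATE_KEY) or "").strip()
        # Entries missing an ISBN13, an eprint ID or a publication date are ignored.
        if not isbn or not eprint_id or not raw_date:
            continue
        try:
            published = date.fromisoformat(raw_date)  # assumed YYYY-MM-DD format
        except ValueError:
            continue  # assumption: an unparseable date is treated as missing
        # Entries with a publication date in the future are ignored.
        if published > today:
            continue
        kept.append(row)
    return kept
```

In the telescope, the reference time comes from the Airflow scheduler; in this sketch it is passed in explicitly as `today` so the behaviour is easy to check.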

Access

Access to the sheet can be granted using the sheet UI (Share at the top right of the page). The telescope will access the sheet via a service account, which will need to be given read access (Viewer) by supplying the account's email address.

Telescope kwargs

Sheet ID (sheet_id)

The ID of the Google Sheet. The ID can be found in the sheet's URI, which has the form https://docs.google.com/spreadsheets/d/[SHEET_ID].
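As a small convenience, the sheet ID can be pulled out of a full URI programmatically. The helper below is a hypothetical sketch (`sheet_id_from_uri` is not part of the telescope); it assumes the URI follows the form shown above, optionally followed by a path, query string or fragment such as `/edit#gid=0`.

```python
import re


def sheet_id_from_uri(uri):
    """Return the [SHEET_ID] segment of a Google Sheets URI, or None if absent."""
    # The ID is the path segment immediately after "/spreadsheets/d/".
    match = re.search(r"/spreadsheets/d/([^/?#]+)", uri)
    return match.group(1) if match else None
```

The returned string is the value to pass as the `sheet_id` kwarg.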

UCL Discovery Usage API

UCL Discovery provides free and open access to their usage REST API. Unfortunately, we could not find any documentation on its use and design. We utilise two endpoints:

  • Countries URI = https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=countries&top=countries&view=Table&limit=all&export=JSON

  • Totals URI = https://discovery.ucl.ac.uk/cgi/stats/get?from=[YYYYMMDD]&to=[YYYYMMDD]&irs2report=eprint&set_name=eprint&set_value=[EPRINT_ID]&datatype=downloads&graph_type=column&view=Google%3A%3AGraph&date_resolution=month&title=Download+activity+-+last+12+months&export=JSON

Here, from, to and set_value are set to the start date, end date and eprint ID of interest. The countries URI returns the number of downloads of the provided eprint ID broken down by country. The totals URI returns the number of downloads of the provided eprint ID aggregated over all regions. Note that the totals data is not necessarily a simple aggregation of the countries data, because downloads that are not attributed to a region are omitted from the country data. It is therefore not uncommon for the total download count (from the totals URI) to be greater than the sum of the downloads from all listed countries (from the countries URI).
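To make the substitution concrete, the two URIs can be assembled from their query parameters. This is a sketch only: the function names are hypothetical, and the parameter values are copied from the URIs listed above rather than from any API documentation. It relies on `urllib.parse.urlencode`, which encodes `Google::Graph` and the title string into the same escaped forms that appear in the totals URI.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://discovery.ucl.ac.uk/cgi/stats/get"


def _common_params(eprint_id, start, end):
    """Parameters shared by both endpoints (dates formatted as YYYYMMDD)."""
    return {
        "from": start.strftime("%Y%m%d"),
        "to": end.strftime("%Y%m%d"),
        "irs2report": "eprint",
        "set_name": "eprint",
        "set_value": eprint_id,
    }


def countries_uri(eprint_id, start, end):
    """Build the countries URI for one eprint ID and date range."""
    params = _common_params(eprint_id, start, end)
    params.update({"datatype": "countries", "top": "countries",
                   "view": "Table", "limit": "all", "export": "JSON"})
    return f"{BASE_URL}?{urlencode(params)}"


def totals_uri(eprint_id, start, end):
    """Build the totals URI for one eprint ID and date range."""
    params = _common_params(eprint_id, start, end)
    params.update({"datatype": "downloads", "graph_type": "column",
                   "view": "Google::Graph", "date_resolution": "month",
                   "title": "Download activity - last 12 months",
                   "export": "JSON"})
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, `countries_uri("12345", date(2023, 1, 1), date(2023, 1, 31))` yields the countries URI with `from=20230101`, `to=20230131` and `set_value=12345`; the eprint ID here is a made-up value.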

Telescope Tasks

Data Download

Acquires the eprint IDs and publication dates from the Google Sheet. For each ID whose publication date is before the current scheduled run date, downloads the country and totals data using the API, then uploads the result to the GCS download bucket.

Data Transform

Acquires the eprint IDs, ISBN13s and titles from the Google Sheet. For each ID, loads the downloaded data (both countries and totals) into a single data structure and includes the title (it does not matter whether the title is empty; it exists for completeness only). Adds an additional field to each row, the release_date, which is determined by the scheduled runtime. Uploads the transformed structure to the GCS transform bucket.
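The shape of a transformed row can be sketched as below. This is an illustration under stated assumptions, not the telescope's implementation: the function names are hypothetical, the inputs are assumed to be already parsed from the API responses (whose JSON layout is not documented here), the release month is assumed to be the month of the date passed in, and an empty title is assumed to be stored as null. The timescale and origin fields are omitted for brevity.

```python
import calendar
from datetime import date


def release_date_for(scheduled):
    """Last day of the month containing `scheduled` (assumed to be the release month)."""
    last_day = calendar.monthrange(scheduled.year, scheduled.month)[1]
    return scheduled.replace(day=last_day)


def build_row(isbn13, eprint_id, title, country_counts, total_downloads, scheduled):
    """Combine countries and totals data for one title into a single row."""
    return {
        "ISBN": isbn13,
        "eprint_id": eprint_id,
        "title": title or None,  # assumption: empty titles become null
        "total_downloads": total_downloads,
        # One record per country code, sorted for a stable order.
        "country": [
            {"value": code, "count": count}
            for code, count in sorted(country_counts.items())
        ],
        "release_date": release_date_for(scheduled).isoformat(),
    }
```

Every row built for the same run shares one `release_date`, which is what lets the table be partitioned on that column.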

BigQuery Load

The transformed data is loaded from the Google Cloud bucket into a partitioned BigQuery table. The table is in the ucl dataset (which will be created should it not exist yet). Since the data is partitioned on the release month, there will only be a single table named ucl_discovery.

Table Schema

| name | type | mode | description |
| --- | --- | --- | --- |
| ISBN | STRING | REQUIRED | ISBN13 of the book. |
| eprint_id | STRING | REQUIRED | eprint ID of the book. |
| title | STRING | NULLABLE | Title of the book. |
| timescale | RECORD | NULLABLE | Timescale of the statistics as reported by the origin. |
| timescale.format | STRING | NULLABLE | Format of the 'to' and 'from' fields |
| timescale.from | STRING | NULLABLE | Beginning of date range for the statistics |
| timescale.to | STRING | NULLABLE | End of date range for the statistics |
| origin | RECORD | NULLABLE | Origin of the statistics |
| origin.url | STRING | NULLABLE | The URL of the origin |
| origin.name | STRING | NULLABLE | The name of the origin |
| total_downloads | INTEGER | NULLABLE | The aggregated statistics for the reported period |
| country | RECORD | REPEATED | The aggregated statistics for each reported country |
| country.value | STRING | NULLABLE | The two letter country code. |
| country.count | INTEGER | NULLABLE | The total number of item downloads for the reported period from this country. |
| release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column. |
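Because total_downloads can exceed the sum of the per-country counts, it can be useful to compute the difference for a row. The helper below is a hypothetical sketch that assumes a row has been read into a Python dict with the field names from the schema; it is not part of the telescope.

```python
def unattributed_downloads(row):
    """Downloads counted in the total but not attributed to any listed country."""
    total = row.get("total_downloads")
    if total is None:
        return None  # total_downloads is NULLABLE, so there may be nothing to compare
    # Sum the per-country counts, treating missing counts as zero.
    attributed = sum((c.get("count") or 0) for c in (row.get("country") or []))
    return total - attributed
```

For a row with a total of 10 downloads and two country records counting 4 and 3 (made-up values), the function returns 3.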
