IRUS OAPEN
Documentation for the IRUS OAPEN telescope
IRUS provides OAPEN COUNTER standard access reports. Almost all books on OAPEN are provided as a whole book PDF file. The reports show access figures for each month as well as the location of the access.
Since the location info includes an IP-address, the original data is handled only from within the OAPEN Google Cloud project.
Using a Cloud Function, the original data is downloaded and IP-addresses are replaced with geographical information, such as city and country. After this transformation, the data without IP-addresses is uploaded to a Google Cloud Storage Bucket.
This is all done from within the OAPEN Google Cloud project. The telescope creates and calls the Cloud Function. When the Cloud Function has finished, the data is copied from the Storage Bucket inside the OAPEN project to a bucket inside the main airflow project.
Name | Description |
---|---|
Fields passed as keyword arguments to the telescope upon instantiation.
The publisher_name_v4 can be found on the OAPEN page for manually creating reports. This page has a drop-down list of publisher names; to get the publisher name, URL-encode the name exactly as it appears in this list.
Note that occasionally there are multiple publisher names for one publisher.
For example, to get all data from Edinburgh University Press, you need data from both publisher names `Edinburgh University Press` and `Edinburgh University Press,` (the second ends with a comma). Multiple publisher names can be passed by delimiting them with a '|' character.
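As a rough illustration of the encoding described above, the sketch below URL-encodes each name with Python's standard `urllib.parse.quote` and joins the results with '|'. The helper name `encode_publisher_names` is hypothetical, and whether the telescope expects exactly this encoding (for example `%20` rather than `+` for spaces) is an assumption to verify against the report page.

```python
from urllib.parse import quote


def encode_publisher_names(names):
    """URL-encode each publisher name and join the results with '|'."""
    # safe="" makes quote() escape every reserved character, including '/'.
    return "|".join(quote(name, safe="") for name in names)


publisher_name_v4 = encode_publisher_names(
    ["Edinburgh University Press", "Edinburgh University Press,"]
)
print(publisher_name_v4)
# Edinburgh%20University%20Press|Edinburgh%20University%20Press%2C
```

Each name is encoded separately so that the '|' delimiter itself stays unescaped.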
The publisher_uuid_v5 can be found by querying the OAPEN API and creating a list of unique Publisher names and UUIDs.
This API request will return all items including their Publisher name and UUID: https://irus.jisc.ac.uk/api/oapen/reports/oapen_ir/?platform=215&requestor_id=<requestor_id>&api_key=<api_key>&granularity=totals&begin_date=2020-04&end_date=2021-11
To get a file with mappings between Publisher Name and UUID, use the following Python snippet:
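A minimal sketch of such a snippet follows. It assumes the API response is COUNTER R5 style JSON with a `Report_Items` list whose entries carry a `Publisher` name and a `Publisher_ID` list of `{"Type": ..., "Value": ...}` objects; these key names, the function names and the CSV layout are assumptions and should be checked against an actual response.

```python
import csv
from collections import defaultdict


def publisher_mapping(report):
    """Map each publisher name to the sorted list of UUIDs seen for it.

    `report` is the parsed JSON response. The "Report_Items", "Publisher"
    and "Publisher_ID" keys are assumed, not confirmed by this page.
    """
    mapping = defaultdict(set)
    for item in report.get("Report_Items", []):
        name = item.get("Publisher")
        for publisher_id in item.get("Publisher_ID", []):
            uuid = publisher_id.get("Value")
            if name and uuid:
                mapping[name].add(uuid)
    return {name: sorted(uuids) for name, uuids in sorted(mapping.items())}


def write_mapping(mapping, path):
    """Write one (publisher_name, publisher_uuid) row per pair to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["publisher_name", "publisher_uuid"])
        for name, uuids in mapping.items():
            for uuid in uuids:
                writer.writerow([name, uuid])
```

With the response saved to a file, `write_mapping(publisher_mapping(json.load(f)), "publishers.csv")` would produce the mapping file; a publisher that appears under several UUIDs gets one row per UUID.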
From this file look up the publisher UUIDs of interest. Similar to the publisher names described above, multiple publisher UUIDs can be passed on by delimiting them with a '|' character.
The IRUS OAPEN telescope makes use of a Google Cloud Function that resides in the OAPEN Google Cloud Platform project.
There is a specific airflow task that will create the Cloud Function if it does not exist yet, or update it if the source code has changed.
The source code for the Cloud Function can be found inside a separate repository that is part of the same organization (https://github.com/The-Academic-Observatory/oapen-irus-uk-cloud-function).
The Cloud Function downloads IRUS OAPEN access stats data for one month and for a single publisher. Usage data from April 2020 onwards is hosted on a new platform.
The newer data is obtained through the platform's API, which requires a `requestor_id` and an `api_key`.
Data before April 2020 is obtained from a URL, which requires an `email` and a `password`.
The values required for either download method are passed to the Cloud Function as a `username` and `password`. The `username` and `password` are obtained from an airflow connection, which should be set in the config file (see below).
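The choice between the two sets of credentials can be sketched as follows. This is an illustration only: the function name and argument layout are hypothetical, and the cutoff of 2020-04-01 (inclusive for the new platform) is an assumption based on the "since 2020-04-01" notes in the schema.

```python
from datetime import date

# Assumed first day covered by the new platform (see the schema notes).
NEW_PLATFORM_START = date(2020, 4, 1)


def select_credentials(release_month, login, api):
    """Return the (username, password) pair to pass to the Cloud Function.

    login: (email, password) for the legacy URL download
    api:   (requestor_id, api_key) for the newer API
    """
    if release_month >= NEW_PLATFORM_START:
        return api
    return login


username, password = select_credentials(
    date(2021, 11, 1),
    login=("user@example.org", "legacy-password"),
    api=("my-requestor-id", "my-api-key"),
)
print(username)  # my-requestor-id
```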
Once the data is downloaded, the IP addresses are replaced with geographical information (corresponding city and country).
This is done using the GeoIP database, which is downloaded from inside the Cloud Function. The license key for this database is also passed as a parameter, `geoip_license_key`.
The `geoip_license_key` is also obtained from an airflow connection, which should be set in the config file (see below).
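The replacement step can be pictured with the sketch below. It is not the Cloud Function's actual code: the record key `ip`, the function name and the `lookup` callable are hypothetical stand-ins for whatever GeoIP reader is used, and the output keys simply mirror the `locations` fields in the schema.

```python
def replace_ip(record, lookup):
    """Return a copy of `record` with the IP address swapped for geo fields.

    `lookup` is any callable that maps an IP string to a dict with optional
    'city', 'country_name', 'country_code', 'latitude' and 'longitude' keys.
    """
    # Copy everything except the IP address itself.
    cleaned = {key: value for key, value in record.items() if key != "ip"}
    geo = lookup(record["ip"])
    for field in ("city", "country_name", "country_code", "latitude", "longitude"):
        # Missing geo information is stored as None rather than dropped.
        cleaned[field] = geo.get(field)
    return cleaned
```

Passing the lookup in as a parameter keeps the sketch independent of any particular GeoIP client and makes it easy to check that no IP address survives the transformation.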
Next, the data without the IP addresses is uploaded to a bucket inside the OAPEN project. All files in this bucket are deleted after 1 day. In the next airflow task, the data is copied from this bucket to the appropriate bucket in the project where airflow is hosted.
The transformed data is loaded from the Google Cloud bucket into a partitioned BigQuery table in the irus dataset, which will be created if it does not exist. Since the data is partitioned on the release month, there will only be a single table named irus_oapen.
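The schema describes the partition column `release_date` as the last day of the release month. A small sketch of how that date could be computed with the standard library (the helper name `release_date_for` is hypothetical):

```python
import calendar
from datetime import date


def release_date_for(year, month):
    """Return the last day of the given month, used as the partition date."""
    # monthrange() returns (weekday of the first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


print(release_date_for(2021, 11))  # 2021-11-30
print(release_date_for(2020, 2))   # 2020-02-29
```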
To make use of the Cloud Function described above, you need to enable the APIs listed below and set up permissions for the Google service account that airflow is using.
See the Google support answer for info on how to enable an API. The APIs that need to be enabled are:
- Cloud Functions API
- Cloud Build API
- Cloud Run Admin API
- Artifact Registry API
Inside the OAPEN Google project, add the airflow Google service account (<airflow_project_id>@<airflow_project_id>.iam.gserviceaccount.com, where airflow_project_id is the project where airflow is hosted). This can be done from the 'IAM & Admin' menu and 'IAM' tab. Then, assign the following permissions to this account:
- Cloud Functions Developer (to create or update the Cloud Function)
- Cloud Functions Invoker (to call/invoke the Cloud Function)
- Storage Admin (to create a bucket)
- Storage Object Admin (to list and get a blob from the storage bucket)
Additionally, it is required to assign the role of service account user to the service account of the Cloud Function, with the airflow service account as a member. The Cloud SDK command for this is:
gcloud iam service-accounts add-iam-policy-binding <OAPEN_project_id>-compute@developer.gserviceaccount.com --member=serviceAccount:<airflow_project_id>@<airflow_project_id>.iam.gserviceaccount.com --role=roles/iam.serviceAccountUser
Alternatively, it can be done with the Google Cloud console, from the 'IAM & Admin' menu and 'Service Accounts' tab.
Click on the service account of the Cloud Function: `<OAPEN_project_id>-compute@developer.gserviceaccount.com`.
In the 'permissions' tab, click 'Grant Access', add the airflow service account `<airflow_project_id>@<airflow_project_id>.iam.gserviceaccount.com` as a member and assign the role 'Service Account User'.
To get the user_id and license_key, first sign up for GeoLite2. From your account, in the 'Services' section, click on 'Manage License Keys'. The user_id is displayed on this page. Then click on 'Generate new license key'; the generated key can be used for the license_key. Answer No to the question: "Old versions of our GeoIP Update program use a different license key format. Will this key be used for GeoIP Update?"
name | type | mode | description |
---|---|---|---|
proprietary_id | STRING | NULLABLE | Proprietary identifier of the book. |
URI | STRING | NULLABLE | URI of the book. Only available for data since 2020-04-01. |
DOI | STRING | NULLABLE | DOI of the book. |
ISBN | STRING | NULLABLE | ISBN of the book. |
book_title | STRING | NULLABLE | Title of the book |
grant | STRING | NULLABLE | Grant. Only available for data before 2020-04-01. |
grant_number | STRING | NULLABLE | Grant number. Only available for data before 2020-04-01. |
publisher | STRING | NULLABLE | The publisher |
begin_date | DATE | NULLABLE | The begin date of the investigated period. |
end_date | DATE | NULLABLE | The end date of the investigated period. |
title_requests | INTEGER | NULLABLE | The total number of title requests. Only available for data before 2020-04-01. |
total_item_investigations | INTEGER | NULLABLE | The total number of item investigations. Only available for data since 2020-04-01. |
total_item_requests | INTEGER | NULLABLE | The total number of item requests. Only available for data since 2020-04-01. |
unique_item_investigations | INTEGER | NULLABLE | The number of unique item investigations. Only available for data since 2020-04-01. |
unique_item_requests | INTEGER | NULLABLE | The number of unique item requests. Only available for data since 2020-04-01. |
country | RECORD | REPEATED | Record to store statistics on the country level. |
country.name | STRING | NULLABLE | The country name of the client registered by oapen irus uk. |
country.code | STRING | NULLABLE | The country code of the client registered by oapen irus uk. |
country.title_requests | INTEGER | NULLABLE | The total number of title requests. Only available for data before 2020-04-01. |
country.total_item_investigations | INTEGER | NULLABLE | The total number of item investigations. Only available for data since 2020-04-01. |
country.total_item_requests | INTEGER | NULLABLE | The total number of item requests. Only available for data since 2020-04-01. |
country.unique_item_investigations | INTEGER | NULLABLE | The number of unique item investigations. Only available for data since 2020-04-01. |
country.unique_item_requests | INTEGER | NULLABLE | The number of unique item requests. Only available for data since 2020-04-01. |
locations | RECORD | REPEATED | Record to store statistics on the location level. |
locations.latitude | FLOAT | NULLABLE | The latitude geolocated from the client's ip address. |
locations.longitude | FLOAT | NULLABLE | The longitude geolocated from the client's ip address. |
locations.city | STRING | NULLABLE | The city geolocated from the client's ip address. |
locations.country_name | STRING | NULLABLE | The country name geolocated from the client's ip address. |
locations.country_code | STRING | NULLABLE | The country code geolocated from the client's ip address. |
locations.title_requests | INTEGER | NULLABLE | The total number of title requests. Only available for data before 2020-04-01. |
locations.total_item_investigations | INTEGER | NULLABLE | The total number of item investigations. Only available for data since 2020-04-01. |
locations.total_item_requests | INTEGER | NULLABLE | The total number of item requests. Only available for data since 2020-04-01. |
locations.unique_item_investigations | INTEGER | NULLABLE | The number of unique item investigations. Only available for data since 2020-04-01. |
locations.unique_item_requests | INTEGER | NULLABLE | The number of unique item requests. Only available for data since 2020-04-01. |
version | STRING | REQUIRED | Version of the OAPEN IRUS UK API, corresponds to the COUNTER report version. |
release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column. |
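To show how the nested `country` records in this schema might be consumed, the sketch below sums `country.total_item_requests` per country code over rows represented as plain Python dicts. The row representation and the function name `requests_by_country` are assumptions for illustration; rows fetched through a BigQuery client may need converting to dicts first.

```python
from collections import Counter


def requests_by_country(rows):
    """Sum country.total_item_requests per country code across rows."""
    totals = Counter()
    for row in rows:
        # The REPEATED country field may be an empty list or None.
        for entry in row.get("country") or []:
            # NULLABLE counts may be None; treat them as zero.
            totals[entry.get("code")] += entry.get("total_item_requests") or 0
    return dict(totals)


rows = [
    {"book_title": "Book A", "country": [
        {"code": "NL", "total_item_requests": 3},
        {"code": "AU", "total_item_requests": None},
    ]},
    {"book_title": "Book B", "country": [{"code": "NL", "total_item_requests": 2}]},
]
print(requests_by_country(rows))  # {'NL': 5, 'AU': 0}
```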
Property | Value |
---|---|
Dataset Name | irus |
Table Name | oapen_irus |
Table Type | Partitioned |
Average Runtime | 10 min |
Average Download Size | 1-10 MB |
Harvest Type | API |
Run Schedule | Monthly on the 4th |
Catch-up Missed Runs | |
Each Run Includes All Data | |
Name | Description |
---|---|
irus_login | Login credentials for legacy C4 data |
irus_api | The IRUS requestor_id/api_key - required to access the IRUS platform |
geoip_license_key | Key for GeoIP services |