JSTOR

Documentation for the JSTOR telescope

JSTOR provides publisher usage reports that offer details about the use of journal or book content by title, institution, and country. Journal reports also include usage by issue and article. Usage is aligned with the COUNTER 5 standard of Item Requests (views + downloads). Reports can be run or scheduled weekly, monthly, or quarterly with custom date ranges.

To access the analytics data directly, a publisher needs to grant access to an account such as a Gmail account. This account can then be used to log in to the JSTOR portal and set up the scheduled reports (see below), which are mailed to a G Suite account. Alternatively, the publisher can set up a schedule themselves that creates reports and sends them to the G Suite account. The telescope then parses the Gmail inbox of the G Suite account for messages containing a download link to the JSTOR report.

The server running this telescope needs to be whitelisted by JSTOR to avoid bot detection.

Dataset Name: jstor
Table Name: jstor_country, jstor_institution
Table Type: Partitioned
Average Runtime: 10 min
Average Download Size: 5-10 MB
Harvest Type: Gmail
Run Schedule: Monthly on the 4th
Catch-up Missed Runs: Each Run Includes All Data

Airflow connections

The following Airflow connections are required:

Name | Description
gmail_api | The API credentials for access to the delegated Gmail account.

Telescope kwargs

This telescope is created using the Observatory API. There is one required 'extra' field for the corresponding Telescope object: the 'publisher_id'.

Publisher ID (publisher_id)

A mapping is required between the JSTOR publisher ID and the organisation name obtained from the Observatory API. The JSTOR publisher_id can be found in the original filename of a JSTOR report, for example: PUB_<publisher_id>_PUBBIU_20210501.tsv

It is possible to get the original filename by directly downloading a (previous) report from the JSTOR portal.
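
For illustration only, the publisher ID could be pulled out of such a filename with a small regular expression; extract_publisher_id is a hypothetical helper and not part of the telescope, and 'examplepress' is a made-up ID.

import re

# Hypothetical helper for illustration only; not part of the telescope.
# Assumes filenames like 'PUB_<publisher_id>_PUBBIU_20210501.tsv' (and the
# matching country variant, PUBBCU).
def extract_publisher_id(filename: str) -> str:
    match = re.match(r"PUB_(?P<publisher_id>.+)_PUB(?:BIU|BCU)_\d{8}\.(?:tsv|csv)$", filename)
    if match is None:
        raise ValueError(f"Unrecognised JSTOR report filename: {filename}")
    return match.group("publisher_id")

print(extract_publisher_id("PUB_examplepress_PUBBIU_20210501.tsv"))  # examplepress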

Entity Type (entity_type)

This is a special input that must be set to either publisher or collection. By default, this is set to publisher. This option determines what the telescope looks for in Gmail as well as how it transforms the data. For traditional publishers, this should remain untouched (or be explicitly set to publisher). For a collection of titles, this should be set to collection. If collection is set, the country_partner and institution_partner kwargs should be changed as described below. This is because the information supplied by collection reports differs from traditional publisher reports and therefore requires a slightly different schema.

Country Partner (country_partner)

The data_partner to use for the country report. Should be set to jstor_country when running with the publisher entity type, or jstor_country_collection when using the collection entity type.

Institution Partner (institution_partner)

The data_partner to use for the institution report. Should be set to jstor_institution when running with the publisher entity type, or jstor_institution_collection when using the collection entity type.
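
Put together, the 'extra' kwargs for the two entity types might look like the following; the publisher ID value is a placeholder and the surrounding Observatory API call is not shown.

# Illustrative 'extra' kwargs only; 'examplepress' is a placeholder publisher ID.

# Traditional publisher (the defaults described above).
publisher_kwargs = {
    "publisher_id": "examplepress",
    "entity_type": "publisher",
    "country_partner": "jstor_country",
    "institution_partner": "jstor_institution",
}

# Collection of titles: entity_type and both partners change together.
collection_kwargs = {
    "publisher_id": "examplepress",
    "entity_type": "collection",
    "country_partner": "jstor_country_collection",
    "institution_partner": "jstor_institution_collection",
}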

Retrieving Reports

JSTOR offers no API or other automated method for directly accessing reports from its website. Reports must instead be sent via email in either TSV or CSV format, either as a one-off or as part of a schedule. In either case, log in to the JSTOR portal and set up a report schedule to get started.

Scheduling Reports

It is easiest to set the report frequency to match the schedule interval of the telescope (monthly). This telescope only uses the 'Book Usage by Country' (PUB_BCU) and 'Book Usage by Institution' (PUB_BIU) reports.

The format needs to be set to 'TSV' and the recipient to the Gmail account that will be used with the Gmail API. The title of the report is not used in the telescope, so set this to anything you'd like (it does not show up in the email).

Downloading previous reports

The section above describes how to set up a report schedule. Unfortunately, a schedule can only start from the current date. To get previous reports (from before the start date of the schedule), it is possible to create a 'one-time' report and mail it to the relevant Gmail account; it will then still be processed by this telescope. The settings are the same as for the scheduled report.

Using the Gmail API

See the Google support answer for information on how to enable an API. Search for the Gmail API and enable it.

Creating the Gmail API connection and credentials

Currently, the telescope only works with a Gmail account that is an internal user (a G Suite account). It is possible to create credentials for an external user with a project status of 'Testing' in the OAuth consent screen; however, refresh tokens created in such a project expire after 7 days and the telescope does not handle expired refresh tokens. See the documentation for more information on OAuth refresh token expiration.

Create OAuth credentials

  • In the IAM section, add the G Suite account you would like to use as a user.

  • From the 'APIs & Services' section, click the 'Credentials' menu item.

  • Click 'Create Credentials' and choose OAuth client ID.

  • In the form, enter the following information:

    • Application type: Web application

    • Name: Can be anything, e.g. 'Gmail API'

    • Authorized redirect URIs: add the URI: http://localhost:8080/

    • Click 'Create'

  • Download the client secrets file for the newly created OAuth 2.0 client ID by clicking its download icon. The file will be named something like client_secret_token.apps.googleusercontent.com.json

  • Get the credentials info using the downloaded client secrets JSON file by executing the following Python code.

Note that there is currently a limit of 50 refresh tokens per client ID. If the limit is reached, creating a new refresh token automatically invalidates the oldest refresh token without warning. Additionally, tokens are invalidated whenever an account's password is reset.

import urllib.parse
from google_auth_oauthlib.flow import InstalledAppFlow

# If these scopes are modified, the credentials below have to be created again
SCOPES = ['https://www.googleapis.com/auth/gmail.readonly', 'https://www.googleapis.com/auth/gmail.modify']
flow = InstalledAppFlow.from_client_secrets_file('/path/to/client_secret_token.apps.googleusercontent.com.json', SCOPES)

# This will open a pop-up, authorize the Gmail account you want to use
creds = flow.run_local_server(access_type='offline', approval_prompt='force', port=8080)

# Get the necessary credentials info
token = urllib.parse.quote(creds.token, safe='')
refresh_token = urllib.parse.quote(creds.refresh_token, safe='')
client_id = urllib.parse.quote(creds.client_id, safe='')
client_secret = urllib.parse.quote(creds.client_secret, safe='')

# This connection can be used in the config file
gmail_api_conn = f'google-cloud-platform://?token={token}&refresh_token={refresh_token}&client_id={client_id}&client_secret={client_secret}'
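
As a quick sanity check (not part of the telescope), the creds object created above can be used to build a Gmail client and confirm that the delegated account is reachable.

from googleapiclient.discovery import build

# Reuses the 'creds' object from the snippet above to verify that the
# Gmail API is enabled and the delegated account can be reached.
service = build('gmail', 'v1', credentials=creds)
profile = service.users().getProfile(userId='me').execute()
print(profile['emailAddress'])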

Telescope Tasks

The telescope tasks revolve around retrieving and processing the correct JSTOR report for each telescope run. Note that the JSTOR report format has been slightly altered on more than one occasion, so depending on the date of the telescope run (and the corresponding date of the report) the report structure may vary. The telescope accounts for these different structures.

Data Download

All reports sent by JSTOR over email appear identical until they are opened, so it is necessary to download and open every report to see which ones pertain to the relevant partner and cover the expected date interval. To avoid an ever-growing number of unnecessary downloads, the telescope filters the emails (from which it extracts the reports) based on their label: any email that carries the processed-report label is ignored. This label is added to processed emails at the final (cleanup) step of the telescope.
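
The label-based filtering can be pictured with the Gmail API's search syntax. This is only a sketch: the label name 'processed_report' and the sender filter are assumptions, not the telescope's real values, and it reuses the creds object from the Gmail API section above.

from googleapiclient.discovery import build

# List candidate report emails, skipping any that already carry the
# processed-report label (label name and sender filter are assumptions).
service = build('gmail', 'v1', credentials=creds)
query = 'from:jstor.org -label:processed_report'
response = service.users().messages().list(userId='me', q=query).execute()
for ref in response.get('messages', []):
    message = service.users().messages().get(userId='me', id=ref['id']).execute()
    print(message['snippet'])  # inspect each message for a report download link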

Once all of the possible reports have been examined, the matching reports (there should be only two for each telescope run - country and institution) are moved to a permanent location in the local file system ready for transforming.

Data Transform

The transformation of the report data is quite simple. The data is read from its .tsv format into a Python dictionary, the keys are converted into BigQuery-compatible column names, and the result is saved as a gzipped .jsonl file.
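
A minimal sketch of this transform is shown below, assuming the column-name cleaning simply replaces characters BigQuery does not allow with underscores; the telescope's actual cleaning rules may differ.

import csv
import gzip
import json
import re

def to_bq_name(name: str) -> str:
    """Assumed cleaning rule: replace characters that BigQuery does not
    allow in column names with underscores."""
    return re.sub(r'[^A-Za-z0-9_]', '_', name)

def transform_report(tsv_path: str, jsonl_gz_path: str) -> None:
    """Read a JSTOR .tsv report and write it out as gzipped .jsonl."""
    with open(tsv_path, newline='') as tsv_file, gzip.open(jsonl_gz_path, 'wt') as out_file:
        reader = csv.DictReader(tsv_file, delimiter='\t')
        for row in reader:
            record = {to_bq_name(key): value for key, value in row.items()}
            out_file.write(json.dumps(record) + '\n')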

BigQuery Load

The transformed data is loaded from the Google Cloud Storage bucket. Each telescope run produces two transformed files, each of which is loaded into its own partitioned BigQuery table under the jstor dataset (which is created if it does not yet exist): the jstor_country and jstor_institution table partitions. Since the data is partitioned on the release month, there is only a single table for each of these report types.
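
Such a load can be expressed with the BigQuery client library roughly as follows; the project, bucket, and file names are placeholders, monthly partitioning on release_date is assumed, and the telescope's own load step may differ in detail.

from google.cloud import bigquery

# Placeholder project, dataset and bucket paths; not the real values.
client = bigquery.Client(project='my-project')
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    # Monthly partitioning on the release_date column is an assumption.
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.MONTH, field='release_date'
    ),
    autodetect=True,  # the telescope would normally supply an explicit schema
)
uri = 'gs://my-bucket/jstor/jstor_country.jsonl.gz'
load_job = client.load_table_from_uri(uri, 'my-project.jstor.jstor_country', job_config=job_config)
load_job.result()  # wait for the load job to finish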

Table Schemas

Publisher

Country

name | type | mode | description
Country_Name | STRING | NULLABLE | Country name.
Book_Title | STRING | REQUIRED | Title of the book.
Book_ID | STRING | NULLABLE | DOI of the book on JSTOR.
Authors | STRING | NULLABLE | Author of the book.
ISBN | STRING | NULLABLE | ISBN of the book (13 digits).
eISBN | STRING | NULLABLE | ISBN of the digital version of the book (13 digits).
Copyright_Year | INTEGER | NULLABLE | Publication year.
Disciplines | STRING | REQUIRED | Subject category of the book.
Usage_Type | STRING | NULLABLE | For our case it is Open Access.
Usage_Month | STRING | REQUIRED | Date (as month and year) of the request.
Total_Item_Requests | INTEGER | NULLABLE | Total number of requests made from that specific country.
release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column.

Institution

name | type | mode | description
Institution | STRING | NULLABLE | Institution name.
Book_Title | STRING | REQUIRED | Title of the book.
Book_ID | STRING | NULLABLE | DOI of the book on JSTOR.
Authors | STRING | NULLABLE | Author of the book.
ISBN | STRING | NULLABLE | ISBN of the book (13 digits).
eISBN | STRING | NULLABLE | ISBN of the digital version of the book (13 digits).
Copyright_Year | INTEGER | NULLABLE | Publication year.
Disciplines | STRING | REQUIRED | Subject category of the book.
Usage_Type | STRING | NULLABLE | For our case it is Open Access.
Usage_Month | STRING | REQUIRED | Date (as month and year) of the request.
Total_Item_Requests | INTEGER | NULLABLE | Total number of requests made from that specific institution.
release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column.

Collection

Country

name | type | mode | description
Country_Name | STRING | NULLABLE | Country name.
Book_Title | STRING | REQUIRED | Title of the book.
Book_ID | STRING | NULLABLE | DOI of the book on JSTOR.
Publisher | STRING | NULLABLE | The publisher of the book.
Authors | STRING | NULLABLE | Author of the book.
ISBN | STRING | NULLABLE | ISBN of the book (13 digits).
eISBN | STRING | NULLABLE | ISBN of the digital version of the book (13 digits).
Copyright_Year | INTEGER | NULLABLE | Publication year.
Disciplines | STRING | NULLABLE | Subject category of the book.
Usage_Type | STRING | NULLABLE | For our case it is Open Access.
Usage_Month | STRING | NULLABLE | Date (as month and year) of the request.
Total_Item_Requests | INTEGER | NULLABLE | Total number of requests made from that specific country.
release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column.

Institution

name | type | mode | description
Institution | STRING | NULLABLE | Institution name.
Book_Title | STRING | REQUIRED | Title of the book.
Book_ID | STRING | NULLABLE | DOI of the book on JSTOR.
Publisher | STRING | NULLABLE | The publisher of the book.
Authors | STRING | NULLABLE | Author of the book.
ISBN | STRING | NULLABLE | ISBN of the book (13 digits).
eISBN | STRING | NULLABLE | ISBN of the digital version of the book (13 digits).
Copyright_Year | INTEGER | NULLABLE | Publication year.
Disciplines | STRING | NULLABLE | Subject category of the book.
Usage_Type | STRING | NULLABLE | For our case it is Open Access.
Usage_Month | STRING | NULLABLE | Date (as month and year) of the request.
Total_Item_Requests | INTEGER | NULLABLE | Total number of requests made from that specific institution.
release_date | DATE | REQUIRED | Last day of the release month. Table is partitioned on this column.
