Google Books

Documentation for the Google Books telescope

The Google Books Partner program lets publishers sell books through the Google Play store and offer previews on Google Books, which makes their books discoverable to Google users around the world. When readers find a book on Google Books, they can preview a limited number of pages to decide whether they are interested in it. Readers can also follow links to buy the book or, where applicable, borrow or download it.

Publishers can download Google Books reports at https://play.google.com/books/publish/

Three report types are currently available:

  • Google Play sales summary report

  • Google Play sales transaction report

  • Google Books Traffic Report

The telescope collects data from the last two reports: the sales transaction report and the traffic report.

Airflow connections

Authentication

The reports are downloaded from https://play.google.com/books/publish/. To access the reports, the publisher needs to grant access to a Google service account. This service account can then be used to log in to this web page and download each report manually.

Downloading Reports Manually

No API is available for downloading the Google Books reports, and automating the Google login process with tools such as Selenium is difficult because Google's bot detection triggers a reCAPTCHA. Until this step can be automated, the reports need to be downloaded manually each month. For each publisher, and for both the sales transaction report and the traffic report:

  • A report should be created for exactly 1 month (e.g. starting 2021-01-01 and ending 2021-01-31).

  • All titles should be selected.

  • All countries should be selected.

  • The traffic report should be organised by 'Book'.

  • It is important to save the file with the right name. The name should follow one of these formats (<file_suffix> is optional):

    • GoogleSalesTransactionReport_<file_suffix>YYYY_MM.csv or

    • GoogleBooksTrafficReport_<file_suffix>YYYY_MM.csv

  • Upload each report to the SFTP server.

    • Add it to the folder /google_books_<publisher>/upload

    • Files are automatically moved between folders; do not move files between folders manually
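
The date range and naming rules above can be checked before uploading. The following is a minimal sketch, not part of the telescope itself: the function names month_bounds, report_filename and parse_report_filename are hypothetical, and the regular expression is only one possible reading of the documented format.

```python
import calendar
import re
from datetime import date

# The two report name prefixes given in the naming format above.
REPORT_PREFIXES = ("GoogleSalesTransactionReport", "GoogleBooksTrafficReport")

# Prefix, underscore, optional <file_suffix>, then YYYY_MM and ".csv".
FILENAME_PATTERN = re.compile(
    r"^(GoogleSalesTransactionReport|GoogleBooksTrafficReport)"
    r"_(.*?)(\d{4})_(\d{2})\.csv$"
)


def month_bounds(year, month):
    """Return the first and last day of a month, i.e. a report period of exactly 1 month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def report_filename(prefix, year, month, file_suffix=""):
    """Build a file name in the documented format (hypothetical helper)."""
    if prefix not in REPORT_PREFIXES:
        raise ValueError(f"Unknown report prefix: {prefix}")
    return f"{prefix}_{file_suffix}{year:04d}_{month:02d}.csv"


def parse_report_filename(name):
    """Return (prefix, file_suffix, year, month), or None if the name does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    prefix, file_suffix, year, month = match.groups()
    if not 1 <= int(month) <= 12:
        return None
    return prefix, file_suffix, int(year), int(month)
```

For example, month_bounds(2021, 1) gives the period 2021-01-01 to 2021-01-31, and report_filename("GoogleBooksTrafficReport", 2021, 1) gives GoogleBooksTrafficReport_2021_01.csv.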

Telescope Tasks

Data Download & Transform

The download step connects to the SFTP server. The telescope looks in the relevant publisher's upload folder for files whose names match the format specified above. Every telescope DAG run harvests all matching files, regardless of the month in their names. Before downloading, the files on the SFTP server are moved to the in_progress folder.
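
The order of operations in the download step (list, move to in_progress, then download) can be sketched as follows. This is an illustration under assumptions rather than the telescope's actual code: harvest_reports is a hypothetical name, the in_progress folder is assumed to sit next to the upload folder, and sftp is any object offering listdir, rename and get methods (paramiko's SFTPClient provides methods with these names).

```python
import os
import posixpath
import re

# Matches both report types, with or without the optional file suffix.
REPORT_PATTERN = re.compile(
    r"^(GoogleSalesTransactionReport|GoogleBooksTrafficReport)_.*\d{4}_\d{2}\.csv$"
)


def harvest_reports(sftp, publisher, local_dir):
    """Move every matching report to in_progress, then download it.

    Returns the list of local file paths that were downloaded.
    """
    base = f"/google_books_{publisher}"
    upload = posixpath.join(base, "upload")
    # Assumption: in_progress is a sibling of the upload folder.
    in_progress = posixpath.join(base, "in_progress")

    downloaded = []
    for name in sorted(sftp.listdir(upload)):
        if not REPORT_PATTERN.match(name):
            continue  # Ignore files that do not follow the naming format.
        remote_path = posixpath.join(in_progress, name)
        # Move first, so the file is marked as being processed before download.
        sftp.rename(posixpath.join(upload, name), remote_path)
        local_path = os.path.join(local_dir, name)
        sftp.get(remote_path, local_path)
        downloaded.append(local_path)
    return downloaded
```

Because the match ignores the month in the file name, a single call picks up reports for any number of months, which mirrors the behaviour described above.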

Once downloaded, each report is transformed. The transform process re-formats headings and dates so that they are consistent. It also performs an integrity check on the reported dates. The raw values are otherwise left unmodified. The partition date (the report's associated month) is appended to each row at the end of the transform step.
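
A rough sketch of such a transform is shown below. Several details are assumptions made for illustration only: the function names, the input date format ("%m/%d/%y"), the name of the appended column (release_date), and the choice of the last day of the month as the partition date. The integrity check here simply verifies that each reported date falls inside the report's month.

```python
import calendar
import re
from datetime import date, datetime


def normalise_heading(heading):
    """Convert a heading such as 'Transaction Date' to 'transaction_date'."""
    return re.sub(r"[^0-9a-z]+", "_", heading.strip().lower()).strip("_")


def transform_rows(rows, year, month, date_field, date_format="%m/%d/%y"):
    """Normalise headings and dates, check the dates, and append the partition date.

    rows is a list of dicts, for example as produced by csv.DictReader.
    date_field is the original heading of the column holding the reported date.
    """
    first = date(year, month, 1)
    last = date(year, month, calendar.monthrange(year, month)[1])
    date_key = normalise_heading(date_field)

    transformed = []
    for row in rows:
        clean = {normalise_heading(key): value for key, value in row.items()}
        reported = datetime.strptime(clean[date_key], date_format).date()
        # Integrity check: the reported date must lie in the report's month.
        if not first <= reported <= last:
            raise ValueError(f"Date {reported} is outside {year:04d}-{month:02d}")
        clean[date_key] = reported.isoformat()
        # Assumption: the partition date is the last day of the report month.
        clean["release_date"] = last.isoformat()
        transformed.append(clean)
    return transformed
```

All other values pass through untouched, so the output differs from the input only in its headings, its date formatting, and the extra partition column.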

BigQuery Load

The transformed data is loaded from the Google Cloud bucket. Each telescope run produces two sets of data, one per report type, and each is loaded into its own partitioned BigQuery table, google_books_sales or google_books_traffic, under the google dataset (which is created if it does not yet exist). Because the data is partitioned on the release month, there is only a single table for each report type.
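
BigQuery lets a single partition of a month-partitioned table be addressed with a partition decorator of the form table$YYYYMM. Whether the telescope uses decorators when loading is an assumption; the hypothetical helper below only illustrates how one table per report type can hold many monthly partitions.

```python
def partition_table_id(dataset, table, year, month):
    """Return 'dataset.table$YYYYMM' for one monthly partition (hypothetical helper)."""
    if not 1 <= month <= 12:
        raise ValueError(f"Invalid month: {month}")
    return f"{dataset}.{table}${year:04d}{month:02d}"
```

For the January 2021 sales report this yields google.google_books_sales$202101, and for the traffic report of the same month google.google_books_traffic$202101.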

Table Schema - Google Books Sales

Table Schema - Google Books Traffic
