With the closing of Temple University’s physical libraries due to covid-19, and thus the loss of access to our physical collections, library staff have been scrambling to get as much additional content online as possible. One avenue, among many, we have gone down is taking advantage of the HathiTrust Emergency Temporary Access Service (ETAS). HathiTrust offers “a collection of millions of titles digitized from libraries around the world.” The copyright restricted content, which is usually unavailable, has been made temporarily available for partner institutions via a digital controlled lending program. We sent Hathi a listing of our physical holdings, which they processed to calculate an overlap report used to determine which items in their collection our users should have access to, and how many digital “copies” of each item would available. They also sent the report back to us to use for our own purposes.
The rest of this blog post will detail the process we went through to understand that data and massage it into shape so we could include it in our ETL process that populates our library catalog, with the end goal of linking from catalog records to the HathiTrust record page where our users can access the digitized copy.
This blog post is also about the amazing set of unix-like tools provided by csvkit, because I used them almost exclusively to do this work. It was my first time using these tools, and they were perfect for the job and well documented, so they deserve some praise, which they’ll get below.
This post meanders a bit, as it mostly documents my organic process of understanding the data and the tools. If you’re impatient like I often am, you can skip to the summary that is much more concise, just laying out the steps you need to transform an overlap report into a usable lookup table.
Also, huge shout out to Frances Webb’s blog post on how Cornell worked with the Hathi overlap report to get links into their catalog, as it helped me skip a few steps in understanding the shape of data, and some pitfalls to avoid.
Understanding the data
While I was certainly aware of HathiTrust and their digitized collections before this, I had never dug into the details of how they model their data, what an overlap report would provide, etc. Their data documentation was helpful, but I needed a bit more hands on digging in our data to really understand it. Since our ultimate goal was links from our catalog, I started with the overlap report which references items in our collections.
The overlap report
The overlap file provided by Hathi is a five column TSV file, with columns:
- oclc : the OCLC number
- local_id : the item ID of the physical item held by Temple
- item_type : One of three values - mono(graph), multi(part?), or serial
- access: allow, deny or blank
- rights : many different values, none of which are important for our use case.
The file is about 100Mb, so not enormous, but big enough that some process were a bit slow. There wasn’t any documentation of the overlap file that I know of, but I was able to infer some stuff from browsing the data a bit, and from looking at Hathi’s data documentation.
During the ETAS, we will have access to all of the items with either
deny in the access column. If the access column is empty, it means Hathi does not have a copy of it, so there is nothing to link to, so we can effectively ignore those.
Since I need to filter out rows by value, I decided to use
csvgrep, which is a lot like unix grep, except it allows you to focus on the data in particular columns of a line. Without it I’d probably be piping
grep outputs to Cthulhu-level-terrible
awk commands that I’d almost certainly be ashamed to share with the world.
I also decided to to split the data by
item_type, as I wondered if monographs and multi-part or serials might require a different approaches when mapping from local identifiers to those held by Hathi. Also, having some smaller files to work with would almost certainly mean that the processing, and hence feedback on my assumptions, would be quicker.
My command looked something like
csvgrep -t -c 3 -m mono overlap_20200316_temple.tsv | csvgrep -c -4 -r ".+" > ht_overlap_mono.csv
Breaking that down:
csvgrep -t -c 3 -m mono
-t: indicates it is tab delimited instead of comma delimited data
-c 3: indicates which column we should search in (item_type)
-m mono: indicates the search term to use in the selected columns
| csvgrep -c -4 -r ".+": we pipe it to another csvgrep process for more refinement
-c 4: column 4 (access) this time
-r “.+”: regex search that requires at least one character. Since a lack of value indicates that HT does not have the item, we want to filter out empty columns.
This outputs a csv file that only includes the monographs we are going to have access to through the ETAS program, in a ~30MB file with 770,077 lines.
Then I did the same for multipart and serials:
cat overlap_20200316_temple.tsv | csvgrep -t -c 3 -m multi | csvgrep -c -4 -r ".+"
cat overlap_20200316_temple.tsv | csvgrep -t -c 3 -m mono | csvgrep -c -4 -r ".+"
Now, with some more consistent data to work with, I needed to understand a bit more about what was in the data. Poking through the files with my text editor gave me a bit of a sense, but I wanted a higher level view. I decided to try
csvstat on each output file to ensure, to see if it would help.
csvstat ht_overlap_mono.csv 1. "oclc" Type of data: Number Contains null values: False Unique values: 637894 Smallest value: 17 Largest value: 1,019,844,429 Sum: 20,814,034,834,803 Mean: 27,028,546.319 Median: 5,750,865.5 StDev: 77,658,357.791 Most common values: 1,175,417 (24x) 1,199,723 (20x) 250,896 (20x) 260,205 (18x) 259,219 (16x) 2. "local_id" Type of data: Number Contains null values: False Unique values: 764783 Smallest value: 23,237,130,010,003,811 Largest value: 23,427,799,980,003,811 Sum: 17,932,810,712,974,114,759,636 Mean: 23,287,066,098,637,166.669 Median: 23,286,117,055,003,811 StDev: 29,532,593,582,807.78 Most common values: 23,318,950,640,003,811 (3x) 23,308,001,580,003,811 (3x) 23,316,100,060,003,811 (3x) 23,323,166,490,003,811 (3x) 23,308,952,980,003,811 (3x) 3. "item_type" Type of data: Text Contains null values: False Unique values: 1 Longest value: 4 characters Most common values: mono (770076x) 4. "access" Type of data: Text Contains null values: False Unique values: 2 Longest value: 5 characters Most common values: deny (688662x) allow (81414x) 5. "rights" Type of data: Text Contains null values: False Unique values: 21 Longest value: 15 characters Most common values: ic (660454x) pd (61592x) und (27309x) pdus (18164x) ic-world (854x)
csvstat outputs a lot more analysis than I expected (and was a bit noisy with our numeric identifiers), but there are some particularly useful items in there, specifically the:
- reports for item_type, access, and rights report that they have no null values, indicating we have removed the rows for which Hathi does not have a copy.
- unique values for access, show us that the only two values present are allow and deny. The ETAS means we get access to deny as well as access, so that’s exactly what we were looking for.
Two pieces of data that I initially missed, but would have been useful to consider at the time were the “number of unique values” and the “most common values” for
local_id. In my initial mental model, the
oclc numbers and our
local_ids were unique per row, but the report values
- number of unique values: 637,894
- most common values: 1175417 (24)
- number of unique values: 764,783
- most common values: 23318950640003811 (3x)
show a different story. Much less than the 770,076 rows of data. Knowing that would have helped to correct my mental model, but I initially just glazed over it. So,
csvstat provided lots of assistance, even if I wasn’t yet ready to receive it.
Now that we have these three files, representing the physical items in our collection that Hathi knows we have and the OCLC number for each item, we can move on to figuring how to map those item records to Hathi URLs that will allow our users to access the digitized items.
Connecting to HathiTrust Records
So, while we could use the OCLC number we already have with the Hathi bibliographic api to dynamically look up the Hathi Record URL when our catalog display an item from the processed overlap report, that would mean we don’t know which items were available until page load, meaning it would not reflect in our existing affordances for “Online Resources” like facets. To integrate with existing affordances, we need to get the data into the Solr collection that powers our catalog.
Based on the Hathi docs on linking, we want to link to Record pages, as they provide the most context for our users. It seemed to me the best way to identify the needed record numbers was to use “Hathifiles”, which are “tab-delimited text files that describe every item in the HathiTrust collection.” These files include the OCLC number and record id for each item, which seemed to provide a path to link our overlap report to the records.
Specifically, I downloaded hathi_full_20200201.txt.gz and hathi_field_list.txt. I want the final file to have headers, so the first thing I did was save
hathi_field_list.txt to a new file called
hathi_with_headers and then append the full record listing into that. To avoid unzipping the full record file first, I used gzcat
gzcat hathi_full_20200201.txt.gz >> hathi_with_headers
This file is about 4.3GB, so totally unwieldy to work with, so first thing I’m going to want to do is shrink it using some filtering. As this is another tsv file, so csvkit still gonna be the right tool for the job.
Since I previously split up our overlap files by type (mono, serial, and multi), I did the same for the full record list, hoping to get it down to a set of manageable chunks to work with. What’s odd though, is that Hathi’s full record file uses a different set of terms of those types. The values are instead BK, SE, MP, which I interpret as Book, Serial, and MultiPart.
So, to get just the books from the Hathi full record list I run
cat hathi_with_header | csvgrep -t -c 20 -m BK -z=1310720 > hathi_with_header_books.csv
Breaking that down:
-t: tab delimited
-c 20: search in column 20, which in this case is bib_fmt
-m BK: use BK as the search term in the bib_fmt column
-z=1310720: csvkit has a default field size limit of 131072 characters, which apparently is too small for some Hathi records rows (I’m guessing the ones with unicode titles?). I multiplied the default by 10, just to be safe.
So, now filtered down to just the monographs, the file has shrunk to…3.4GB. :( NEEDS MORE FILTERING.
Since we’ve already filtered rows we don’t need, the next logical step is to filter columns we don’t need. I only really cared about a few fields from the Hathi Records.
- htid (column 1)
- bib_key_id (column 4)
- oclc_num (column 8)
This time I reached for csvcut, which:
filters and truncates CSV files. Like the Unix “cut” command, but for tabular data.
csvcut -z 1310720 -c 1,4,8 hathi_with_header_books.csv, and we have selected just those columns, which bring us down to a respectable 400M. A 85% reduction in size is pretty good for a one liner.
So…what to do now. We have:
- a list of all of our overlap items with:
- A few more fields that no longer seem as relevant
- A list of all the books from Hathi with:
- htid: the Hathi Trust ID
- bib_key_id: another identifier used by HT items
What we really want to get to is a list of all of the books from our overlap report with the Hathi Trust identifiers so we can link directly to them from our library catalog. The piece of data connecting these two lists of data is the oclc_num. If this were a SQL database we would join these two tables on oclc_num.
Luckily, once again, csvkit has your back.
csvjoin was built for exactly this kind of problem.
So I ran:
csvjoin -y 0 -c 1,3 ht_overlap_mono.csv hathi_with_header_books_minimal_columns.csv \ > joined_overlap_hathi.csv
Breaking that down:
-y 0: “Limit CSV dialect sniffing to the specified number of bytes.” I think this helped with the bigger file? I could be wrong.
-c 1,3: in csvjoin, the -c flag indicates the columns to join on respective to the files passed as the positional arguments. So in this case, use column 1 from
ht_overlap_mono.csvand column 3 from
Now, I’ll be honest,
csvjoin with big old files like this is not performant. It took probably 30 minutes for this to run on my five year old mac book pro. Definitely could have been faster, but the results were exactly what I thought I wanted; a new file with the headers
and a limited set of Hathi's IDs related to items in our overlap report. ## Mistaken Mental Model So, just to do some basic smoke testing, I wanted to make sure the number of rows in `ht_overlap_mono.csv` was equal to the number of rows in the new `joined_overlap_hathi.csv`. ```bash wc -l ht_overlap_mono.csv 77077 wc -l 11575650 wc -l joined_overlap_hathi.csv 1045708
How could a join on unique ids hand have more rows than the number of ids in the smallest dataset? Running csvstat on the join file produced some interesting results that made it quickly obvious:
## I removed some fields to reduce noise. 1. "oclc" Type of data: Number Unique values: 614893 Most common values: 1,175,417 (144x) <------- my emphasis 265,046 (90x) 711,022 (90x) 217,758 (84x) 259,219 (80x) 2. "local_id" Type of data: Number Unique values: 738527 Most common values: 23,304,874,110,003,811 (26x) <------- my emphasis 23,251,931,190,003,811 (26x) 23,322,074,210,003,811 (22x) 23,280,098,830,003,811 (22x) 23,293,578,800,003,811 (20x) 3. "htid" Type of data: Text Contains null values: False Unique values: 843988 Longest value: 25 characters Most common values: mdp.39015072135042 (24x) <------- my emphasis mdp.39015000687783 (24x) mdp.39015005304780 (24x) mdp.39015066017032 (24x) mdp.39015064422515 (24x) 4. "ht_bib_key" Type of data: Number Unique values: 621654 Most common values: 1,452,934 (144x) <------- my emphasis 3,838 (90x) 427,209 (84x) 1,109,812 (80x) 1,274,588 (77x)
Lots of identifiers were reused many times. Looking back at the orignal data files, I realized that the Hathi metadata file has many records with the same oclc_num, representing either difeerent digitized copies of the same item, or multiple volumes of the same item. Additionally, the overlap report contained many entries with duplicate OCLC numbers, because the report is at the physical item level, letting Hathi know how many copies of something we have so they know how many users to allow access to it.
Luckily, this is a problem that
uniq was built to solve.
Additionally, in reading a bit more about Hathi’s data and the links we’d want to be constructing, I realized that the best ID to use for linking is not the
htid, which links to an individual instance of an record, often without much context, but the
ht_bib_key which links to a landing page with all available copies of an item, plus useful metadata.
So, going back to
hathi_with_headers, to get just the OCLC number and ht_bib_key for all rows I ran:
csvcut -t -c 4,8 -z 1310720 \ hathi_with_header > hathi_with_header_all_types_oclc_and_bibkeyid.csv
and then, to remove lines without either an OCLC number or ht_bib_key:
csvgrep -c 1,2 -r ".+" hathi_with_header_all_types_oclc_and_bibkeyid.csv > hathi_with_header_all_types_oclc_and_bibkeyid_no_nulls.cs
and then finally, used the standard *nix
uniq command to remove duplicate lines.(Luckily, the Hathi file is already sorted, with rows with the same OCLC num being next to each other, which
cat hathi_with_header_all_types_oclc_and_bibkeyid_no_nulls.csv | uniq > hathi_with_header_all_types_oclc_and_bibkeyid_unique.csv
So now we have a file with all of the Hathi items with just an OCLC number and a ht_bib_key. Now what we need is an list of OCLC numbers to filter it by.
So, going back to our original overlap report, to get a list of items of all types but limit it to just the OCLC number we can run:
csvgrep -t -c -4 -r ".+" overlap_20200316_temple.tsv | csvcut -c 1 > overlap_all.csv
But that list contains some repeated OCLC numbers that we want to eliminate, so we use
csvsort to sort the row (not all OCLC numbers were in adjacent rows), while retaining the first row as the header.
cat overlap_all.csv | csvsort | uniq > overlap_all_unique.csv
Now we have the second file we need, which is just a big list of OCLC numbers. But how to filter the full list by the second?
My first thought was to use
csvjoin again, but it returned an empty csv file. And reflecting on it, there was nothing to actually “join” in the SQL sense, since one of the csv files was a one column set of id’s to be used for the join.
What I really wanted to do was filter the Hathi records csv by the overlap csv. And filtering is exactly what
csvgrep does. And even more impressively, reading the docs, it turns out csvgrep implements a
-f flag which uses the lines of a file as the filter criteria.
So…after all that…🥁🥁🥁
csvgrep -c 2 -f overlap_all_unique.csv hathi_with_header_all_types_oclc_and_bibkeyid_unique.csv > hathi_filtered_by_overlap.csv
got me the exact set of data I was looking for:
ht_bib_key,oclc_num 000000008,241 000000009,252 ...680K more lines 102691351,1474971 102691365,10498545
Essentially, we built a simple lookup table with OCLC numbers and the corresponding ht_bib_keys that Hathi knows that Temple holds. And it comes in at just 12MB. Small enough to quickly grep for OCLC numbers while we are indexing records to see what Hathi bib key we want to add. How we used that lookup table will have to wait for the next post.
Download the latest HathiTrust monthly file (i.e. hathi_full_20200401.txt.gz ) from the HathiFiles page.
Pare the large file down to just the needed data (OCLC number and Hathi Trust bib key) with:
- csvcut to limit to just the needed columns
- csvgrep to eliminate rows without required fields
- sort and uniq to eliminate duplicates
gunzip - c hathi_full_20200401.txt.gz| \ csvcut -t -c 4,8 -z 1310720 | \ csvgrep -c 1,2 -r ".+" | \ sort | uniq > hathi_full_dedupe.csv
Take the overlap report HathiTrust provides and extract the unique set of OCLC numbers:
csvgrep -t -c -4 -r ".+" overlap_20200316_temple.tsv | \ csvcut -c 1 > | csvsort | uniq \ > overlap_all_unique.csv
Then filter the pared down Hathi data using the overlap OCLC numbers as the filter input:
csvgrep -c 2 -f overlap_all_unique.csv \ hathi_full_dedupe.csv > hathi_filtered_by_overlap.csv
hathi_filtered_by_overlap.csv is a two column csv of related OCLC numbers and Hathi Bib Keys that Hathi knows are in your library, which you can use to construct links to Hathi Trust items based on OCLC numbers in catalog records.