Accessing Historical Data en masse
Ian Milligan, Assistant Professor
Hello!
• Who am I?
• Ian Milligan (Assistant Professor, University of Waterloo)
• Canadian, digital, youth, and web archives.
• @ianmilligan1
• Slides will all be available at http://ianmilligan.ca/getting-data/, along with links to tutorials and data
Why gather digitized historical data en masse?
• It can let you grab data from across the globe for minimal extra effort;
• When digitized, it can save time + effort (no more right clicking);
• It can let you explore extremely large datasets to find patterns, inferences, etc. in bodies of text that you couldn’t otherwise read!
Pitfalls?
• Digitization has proceeded unevenly: requires institutional money and support, so replicates holdings of elite + western institutions;
• We may not know how it works: Optical Character Recognition (OCR) for plain text, collection biases, etc.
Pitfalls?
[Chart: average number of appearances per year, 1997–2010, of the Globe, Star, Telegram, Gazette, and Citizen in ProQuest dissertations, showing the gap between appearance and usage and the impact of the Pages of the Past and Canada's Heritage Online digitization projects (pre- and post-digitization).]
Handle with Care
But still, these sources present considerable power when used by the right historians (you!)
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
The Dream Case
• A dream:
• for you to find on the websites you use;
• and for you to create for others if you make databases…
The Dream Case
• Examples
• http://edh-www.adw.uni-heidelberg.de/home
• http://www.cwgc.org/find-war-dead.aspx
• Lexis|Nexis
• Sometimes limited (e.g. CWGC to 50,000 records, Lexis|Nexis to a few hundred), which requires multiple searches
http://adamcrymble.blogspot.ca/2014/01/does-your-online-collection-need-api.html
Or maybe you just want a few documents?
Worth bookmarking
• Google Books Advanced Search: http://books.google.com/advanced_book_search
• Internet Archive Advanced Search: http://archive.org/advancedsearch.php
• Hathi Trust Advanced Search: http://babel.hathitrust.org/cgi/ls?a=page;page=advanced
• (Let’s Visit Each)
Google Books
• In ‘advanced search,’ select ‘Full view only’
• Do a search; pre-1923 content will be most fruitful
Internet Archive
Hathi Trust
• The world’s backup drive for libraries: 4.5+ billion pages!
Also…
• Sometimes a colleague might have compiled this data for you…
• Shawn Graham (Carleton) has compiled a great list: https://github.com/hist3907b-winter2015/module2-findingdata
But if the dream case doesn’t work out, it’s OK.
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
(this one is a bit difficult, but it helps us get some foundational concepts)
Application Programming Interfaces
• API: programs talking to each other
• In our context, it’s a way to send an HTTP request and get some responses
• (this is relatively complex, but will make more sense as we proceed through workshop)
APIs
• JSON is a machine-readable format (instead of a human-readable format like HTML).
• So if I owned 3 iPhones and an iPad (I don’t), I’d structure it like this:
{ "iphones" : "3", "ipads" : "1" }
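A machine can parse that structure directly; a minimal sketch using Python's standard json module (the record is the one from the slide):

```python
import json

# The record from the slide: JSON maps keys to values (here, counts as strings).
record = '{ "iphones" : "3", "ipads" : "1" }'

data = json.loads(record)  # parse the text into a Python dict
total = int(data["iphones"]) + int(data["ipads"])
print(total)  # 4 devices in all
```

The same call works on an API response of any size, which is what makes JSON useful for gathering data en masse.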
APIs
APIs (added &fmt=json)
Good intro - start studying URLs
http://search.canadiana.ca/search?df=1800&dt=1900&q=psycholog*&fmt=json
http://search.canadiana.ca/support/api [instructions]
URLs
• 939 pages of results (!)
• Each document in this case has a unique record key
• But we do figure out the URL formula
• http://search.canadiana.ca/search/X?df=1800&dt=1900&q=psycholog*&fmt=json
• And solve for X, where X is a value between 1 and 939
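Solving for X is easy to automate; a sketch in Python that builds all 939 result-page URLs (the page count comes from the slide above):

```python
# One URL per results page, with the page number substituted for X.
base = ("http://search.canadiana.ca/search/{page}"
        "?df=1800&dt=1900&q=psycholog*&fmt=json")

urls = [base.format(page=n) for n in range(1, 940)]  # X runs from 1 to 939

print(len(urls))  # 939
print(urls[0])    # http://search.canadiana.ca/search/1?df=1800&dt=1900&q=psycholog*&fmt=json
```

A downloader would then fetch each of these pages and parse the JSON it gets back.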
URLs
• http://search.canadiana.ca/search/1?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/2?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/3?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/4?df=1800&dt=1900&q=psycholog*&fmt=json
URLs
"contributor" : [ "oocihm", 3837, …
EACH item on these pages has a unique number that it, and only it, has. If we can get a list of those oocihm numbers, we could get EVERY full-text item in a database.
URLs
• How do we get those key values? (stay tuned)
• But once we have them, we’d see that we have a list of files like:
• http://eco.canadiana.ca/view/X/?r=0&s=1&fmt=json&api_text=1 (where X is the oocihm identifier)
• So a URL like http://eco.canadiana.ca/view/oocihm.16278/?r=0&s=1&fmt=json&api_text=1 would get the full text of an item.
• You’d have to automate this to get all full-text sources having to do with psychology. But how?
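One way to automate it, sketched in Python: substitute each record key into the view URL. Only oocihm.16278 appears on the slide; the second key below is a made-up placeholder.

```python
# Hypothetical list of record keys harvested from the search results;
# oocihm.16278 is the example from the slide, the second key is invented.
keys = ["oocihm.16278", "oocihm.12345"]

template = "http://eco.canadiana.ca/view/{key}/?r=0&s=1&fmt=json&api_text=1"

for key in keys:
    url = template.format(key=key)
    # Each such URL returns the item's full text as JSON; a real script
    # would fetch it here (e.g. with urllib.request) and save the result.
    print(url)
```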
Downloading all the files
• We can turn to some other resources - which are a useful demonstration of how DH involves code sharing
• http://ianmilligan.ca/2014/01/07/historians-love-json-or-one-quick-example-of-why-it-rocks/
• https://canzac.wordpress.com/2014/09/02/canadiana-in-context/
• I’ll explain the code and share it
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
Outwit Hub
• A free software suite that finds ‘structure’ in web pages and grabs the information that you’re looking for.
• Free in limited version.
• https://www.outwit.com/products/hub/
Outwit Hub
• A starting database to try this out on: Suda On Line, a 10th-century Byzantine Greek historical encyclopedia
• http://www.stoa.org/sol/
Outwit Hub
Adler Number
  Begins after: Adler number: </strong>
  Ends before: <br/>
Translation
  Begins after: <div class="translation">
  Ends before: </div>
Outwit Hub
• Step One: install Outwit Hub
• Step Two: paste URL into the bar at the top of the page
• Step Three: click ‘scrapers,’ then ‘new,’ give it a name.
• Step Four: Say no thanks to buying it (at least for now).
Outwit Hub
Adler Number
  Begins after: Adler number: </strong>
  Ends before: <br/>
Translation
  Begins after: <div class="translation">
  Ends before: </div>
Outwit Hub
• Press ‘Catch’ if you want to keep going with other websites
• ‘Catch’ moves the results into memory
• Or you can press ‘Export’ when you’re done to generate a spreadsheet
• (Do a second search for ‘rome’ and see it auto catch)
It’s a good introduction, but sometimes you need better tools…
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
The Dreaded Command Line
• Most of these programs are based in a UNIX environment
• Ian Milligan and James Baker (British Library), “Introduction to the Bash Command Line.” http://programminghistorian.org/lessons/intro-to-bash
The Dreaded Command Line
• Not so bad once you get into it!
• Allows you to run some pretty fine-tuned commands, and begin to rapidly move around your computer.
• Does have a learning curve, but it is worth it.
Basic Programming
• ProgrammingHistorian.org
• Basic programming techniques with an applied perspective
• Not general examples, but specific ones.
Wget
• A powerful tool for retrieving online material
• Command line only (!)
• Easy way to install on OS X:
• Install Homebrew (one line to install at brew.sh)
• and then ‘brew install wget’
To install on all platforms: http://programminghistorian.org/lessons/automated-downloading-with-wget
The Internet Archive
• 15 PB of awesome historical, cultural sources
• But occasionally cumbersome to access en masse
Wget and the Internet Archive
• http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
• Let’s grab all the files relating to a given collection
Finding a Collection
• The Boston Public Library Anti-Slavery Collection
• https://archive.org/details/bplscas
• (but there are many others)
Finding a Collection
• Everything in the Internet Archive has a unique URL, like this: http://archive.org/details/[IDENTIFIER]
• So an item might be: http://archive.org/details/lettertowilliaml00doug
• And the collection is: http://archive.org/details/bplscas/
Finding a Collection
• Create a directory to store all our files
• Visit the advanced search page (http://archive.org/advancedsearch.php)
• Click on ‘collection’; a big list loads. Click on ‘bplscas’ and then search
Finding a Collection
• 8,265 results. That’d be a lot of ‘right clicking’ to download.
• We confirm that this is indeed what we want.
• So we go back.
Finding a Collection
• So we do this:
• Scroll down and do a search for “collection:bplscas”; we can sort by “date asc” (ascending dates), and we select CSV format.
• Number of results: 7,971
• Click ‘search’ and download the file
Finding a Collection
• Looks like this. It’s one line per file.
• The first one is dialoguscreatura00nico
• Put that into the search bar, press enter… and voilà…
Finding a Collection
• We can now download every single entry in that list - in this case, everything within the Boston Public Library Anti-Slavery Collection.
• We can decide if we want every single format (probably not), or perhaps just the TXT files, or the PDFs, etc.
![Page 61: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/61.jpg)
Finding a Collection
• Step One: Open the CSV file and delete the first line, which reads ‘identifier’
• Step Two: Save it as a text file: itemlist.txt
• Step Three: use wget. Copy commands from the Internet Archive… :)
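Steps One and Two can also be scripted; a sketch in Python that strips the ‘identifier’ header from the downloaded CSV and writes itemlist.txt (the input filename is an assumption):

```python
import csv

def csv_to_itemlist(csv_path, out_path):
    """Drop the 'identifier' header row and write one IA identifier per line."""
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.reader(src):          # csv.reader also handles quoted values
            if row and row[0] != "identifier":
                dst.write(row[0] + "\n")

# Usage (hypothetical filename for the advanced-search download):
# csv_to_itemlist("search.csv", "itemlist.txt")
```

The resulting itemlist.txt is exactly what the wget commands below expect via -i.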
![Page 62: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/62.jpg)
Example Commands
• All files:
• wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
• Certain file formats
• wget -r -H -nc -np -nH --cut-dirs=2 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
![Page 63: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/63.jpg)
Our command
• We just want the TXT files
• wget -r -H -nc -np -nH --cut-dirs=2 -A .txt -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
![Page 64: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/64.jpg)
Exploring
• Now we have LOTS of text files. Or PDFs. Or EPUBs. Or whatever we want for whatever purposes.
![Page 65: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/65.jpg)
Programmatically Interacting
• Caleb McDaniel’s “Data Mining the Internet Archive Collection” at http://programminghistorian.org/lessons/data-mining-the-internet-archive
• Uses the Python programming language to download metadata (information about information)
![Page 66: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/66.jpg)
Programmatically Interacting
• It goes through and grabs the MARC data (library records) for everything in the Anti-Slavery Collection
• It is decently documented and we don’t have time today. However, we can steal his code.
![Page 67: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/67.jpg)
Stealing Code
#!/usr/bin/python
import internetarchive
import time

error_log = open('bpl-marcs-errors.log', 'a')
search = internetarchive.search_items('collection:bplscas')
for result in search:
    itemid = result['identifier']
    item = internetarchive.get_item(itemid)
    marc = item.get_file(itemid + '_marc.xml')
    try:
        marc.download()  # save the MARC record locally
    except Exception as e:
        error_log.write('Could not download ' + itemid + ': %s\n' % e)
    else:
        time.sleep(1)  # pause between requests to be polite to the servers
![Page 68: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/68.jpg)
Programmatically Interacting
• We save this file into a new directory (slavery-marc) and then run it.
• BORROWING CODE IS OK.
• On command line we could type:
• python ia-download.py
![Page 69: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/69.jpg)
Programmatically Interacting
• The results!
• Using his pymarc script to generate location data.
![Page 70: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/70.jpg)
Programmatically Interacting
• Other tools
• Adam Crymble, “Downloading Multiple Records Using Query Strings.” [http://programminghistorian.org/lessons/downloading-multiple-records-using-query-strings]
Two Main Programs
• obo.py
• (which contains definitions for several functions that you call)
• download-searches.py
• (where you can swap out your query and download the result files, all without visiting the site)
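The trick behind that second script is just URL construction: a search is a query string, so changing one parameter yields a new results page a script can fetch. A minimal sketch with the standard library; the base URL and parameter names here are placeholders, not the Old Bailey Online's real ones ('negro' and 'mulatto' are the sample queries from Crymble's lesson):

```python
# Sketch of the query-string idea: swap one parameter and you get a
# new results URL, no browser needed. BASE and the parameter names
# ('terms', 'start') are hypothetical placeholders, not the real site's.
from urllib.parse import urlencode

BASE = 'https://example.org/search'  # placeholder, not the real site

def search_url(query, start=0):
    """Build a results-page URL for a given query and result offset."""
    return BASE + '?' + urlencode({'terms': query, 'start': start})

# Swap the query without ever visiting the site's search form:
print(search_url('negro', 0))
print(search_url('mulatto', 10))
```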
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
HistoryCrawler
• Download link: http://ianmilligan.ca/historycrawler [link to repository at York University, Toronto]
• Instructions: http://williamjturkel.net/2014/09/09/creating-the-historycrawler-virtual-machine/
• Aims to solve problems of dependencies and reproducibility by giving scholars a ready-made virtual working environment
HistoryCrawler
• Step One: Download HistoryCrawler201407-32b.ova from the previous links
• Step Two: Install Oracle VM VirtualBox (https://www.virtualbox.org/)
• Step Three: File —> Import Appliance —> Select the .ova file to generate your machine
• Step Four: Press 'Start.' You may have to wait 1-2 minutes.
• Step Five: The password is 'go'
HistoryCrawler
Tutorials
• Mary Beth Start (PhD Candidate, Western University, Ontario): http://marybethstart.wordpress.com/2014/09/09/getting-started-virtualbox-and-historycrawler/
• William Turkel (Associate Professor, Western University, Ontario): http://williamjturkel.net/how-to/#virtualmachine
HistoryCrawler: A platform for teaching?
• Does require a decent computer to run it on
• But
• eliminates problems of dependencies;
• installation issues;
• gets everybody on same platform;
• allows for sharing and reproducibility of research inputs/outputs;
• Still in progress - would love any feedback.
Conclusions, Questions & Your Own Data?
Ian Milligan Assistant Professor