Accessing Historical Data en masse
Ian Milligan, Assistant Professor
Hello!
• Who am I?
• Ian Milligan (Assistant Professor, University of Waterloo)
• Canadian, digital, youth, and web archives.
• @ianmilligan1
• Slides will all be available at http://ianmilligan.ca/getting-data/, along with links to tutorials and data
Why gather digitized historical data en masse?
• It can let you grab data from across the globe for minimal extra effort;
• When digitized, it can save time + effort (no more right clicking);
• It can let you explore extremely large datasets to find patterns, inferences, etc. in bodies of text that you couldn’t otherwise read!
Pitfalls?
• Digitization has proceeded unevenly: requires institutional money and support, so replicates holdings of elite + western institutions;
• We may not know how it works: Optical Character Recognition (OCR) for plain text, collection biases, etc.
Pitfalls?
[Chart: average number of appearances per year, 1997–2010, of the Globe, Star, Telegram, Gazette, and Citizen in ProQuest dissertations, showing the gap between appearance and usage and the impact of the Pages of the Past and Canada's Heritage Online digitization projects (pre- and post-digitization).]
Handle with Care
But still, these sources present considerable power when used by the right historians (you!)
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
The Dream Case
• A dream:
• for you to find on the websites you use;
• and for you to create for others if you make databases…
The Dream Case
• Examples
• http://edh-www.adw.uni-heidelberg.de/home
• http://www.cwgc.org/find-war-dead.aspx
• Lexis|Nexis
• Sometimes limited (e.g. CWGC to 50,000 records, Lexis|Nexis to a few hundred), which requires multiple searches
http://adamcrymble.blogspot.ca/2014/01/does-your-online-collection-need-api.html
Or maybe you just want a few documents?
Worth bookmarking
• Google Books Advanced Search: http://books.google.com/advanced_book_search
• Internet Archive Advanced Search: http://archive.org/advancedsearch.php
• Hathi Trust Advanced Search: http://babel.hathitrust.org/cgi/ls?a=page;page=advanced
• (Let’s Visit Each)
Google Books
• In ‘advanced search,’ select ‘Full view only’
• Do a search; pre-1923 content will be most fruitful
Internet Archive
Hathi Trust
• The world’s backup drive for libraries: 4.5+ billion pages!
Also…
• Sometimes a colleague might have compiled this data for you…
• Shawn Graham (Carleton) has compiled a great list: https://github.com/hist3907b-winter2015/module2-findingdata
But if the dream case doesn’t work out, it’s OK.
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
(this one is a bit difficult, but it helps us get some foundational concepts)
Application Programming Interfaces
• API: programs talking to each other
• In our context, it’s a way to send an HTTP request and get some responses
• (this is relatively complex, but will make more sense as we proceed through workshop)
APIs
• JSON is a machine-readable format (instead of a human-readable format like HTML).
• So if I owned 3 iPhones and an iPad (I don’t), I’d structure it like this:
{ "iphones" : "3", "ipads" : "1" }
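A machine can parse that structure directly; a minimal sketch using Python's standard json module (the record is the one from the slide):

```python
import json

# The record from the slide: JSON maps keys to values (here, counts as strings).
record = '{ "iphones" : "3", "ipads" : "1" }'

data = json.loads(record)  # parse the text into a Python dict
total = int(data["iphones"]) + int(data["ipads"])
print(total)  # 4 devices in all
```

The same call works on an API response of any size, which is what makes JSON useful for gathering data en masse.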
APIs
APIs (added &fmt=json)
Good intro - start studying URLs
http://search.canadiana.ca/search?df=1800&dt=1900&q=psycholog*&fmt=json
http://search.canadiana.ca/support/api [instructions]
URLs
• 939 pages of results (!)
• Each document in this case has a unique record key
• But we do figure out the URL formula
• http://search.canadiana.ca/search/X?df=1800&dt=1900&q=psycholog*&fmt=json
• And solve for X, where X is a value between 1 and 939
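Solving for X is easy to automate; a sketch in Python that builds all 939 result-page URLs (the page count comes from the slide above):

```python
# One URL per results page, with the page number substituted for X.
base = ("http://search.canadiana.ca/search/{page}"
        "?df=1800&dt=1900&q=psycholog*&fmt=json")

urls = [base.format(page=n) for n in range(1, 940)]  # X runs from 1 to 939

print(len(urls))  # 939
print(urls[0])    # http://search.canadiana.ca/search/1?df=1800&dt=1900&q=psycholog*&fmt=json
```

A downloader would then fetch each of these pages and parse the JSON it gets back.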
URLs
• http://search.canadiana.ca/search/1?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/2?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/3?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/4?df=1800&dt=1900&q=psycholog*&fmt=json
URLs
"contributor" : [ "oocihm", 3837, …
EACH item on these pages has a unique number that it, and only it, has. If we can get a list of those oocihm numbers, we could get EVERY full-text item in a database.
URLs
• How do we get those key values? (stay tuned)
• But once we have them, we’d see that we have a list of files like:
• http://eco.canadiana.ca/view/X/?r=0&s=1&fmt=json&api_text=1 (where X is the oocihm identifier)
• So a URL like http://eco.canadiana.ca/view/oocihm.16278/?r=0&s=1&fmt=json&api_text=1 would get the full text of an item.
• You’d have to automate this to get all full-text sources having to do with psychology. But how?
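One way to automate it, sketched in Python: substitute each record key into the view URL. Only oocihm.16278 appears on the slide; the second key below is a made-up placeholder.

```python
# Hypothetical list of record keys harvested from the search results;
# oocihm.16278 is the example from the slide, the second key is invented.
keys = ["oocihm.16278", "oocihm.12345"]

template = "http://eco.canadiana.ca/view/{key}/?r=0&s=1&fmt=json&api_text=1"

for key in keys:
    url = template.format(key=key)
    # Each such URL returns the item's full text as JSON; a real script
    # would fetch it here (e.g. with urllib.request) and save the result.
    print(url)
```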
Downloading all the files
• We can turn to some other resources - which are a useful demonstration of how DH involves code sharing
• http://ianmilligan.ca/2014/01/07/historians-love-json-or-one-quick-example-of-why-it-rocks/
• https://canzac.wordpress.com/2014/09/02/canadiana-in-context/
• I’ll explain the code and share it
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
Outwit Hub
• A free software suite that finds ‘structure’ in web pages and grabs the information that you’re looking for.
• Free in limited version.
• https://www.outwit.com/products/hub/
Outwit Hub
• A starting database to try this out on: Suda On Line, a 10th-century Byzantine Greek historical encyclopedia
• http://www.stoa.org/sol/
Outwit Hub
Adler Number
  Begins after: Adler number: </strong>
  Ends before: <br/>
Translation
  Begins after: <div class="translation">
  Ends before: </div>
Outwit Hub
• Step One: install Outwit Hub
• Step Two: paste URL into the bar at the top of the page
• Step Three: click ‘scrapers,’ then ‘new,’ give it a name.
• Step Four: Say no thanks to buying it (at least for now).
Outwit Hub
Adler Number
  Begins after: Adler number: </strong>
  Ends before: <br/>
Translation
  Begins after: <div class="translation">
  Ends before: </div>
Outwit Hub
• Press ‘Catch’ if you want to keep going with other websites
• ‘Catch’ moves the results into memory
• Or you can press ‘Export’ when you’re done to generate a spreadsheet
• (Do a second search for ‘rome’ and see it auto catch)
It’s a good introduction, but sometimes you need better tools…
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
The Dreaded Command Line
• Most of these programs are based in a UNIX environment
• Ian Milligan and James Baker (British Library), “Introduction to the Bash Command Line.” http://programminghistorian.org/lessons/intro-to-bash
The Dreaded Command Line
• Not so bad once you get into it!
• Allows you to run some pretty fine-tuned commands, and begin to rapidly move around your computer.
• Does have a learning curve, but it is worth it.
Basic Programming
• ProgrammingHistorian.org
• Basic programming techniques with an applied perspective
• Not general examples, but specific ones.
Wget
• A powerful tool for retrieving online material
• Command line only (!)
• Easy way to install on OS X:
• Install Homebrew (one line to install at brew.sh)
• and then ‘brew install wget’
To install on all platforms: http://programminghistorian.org/lessons/automated-downloading-with-wget
The Internet Archive
• 15 PB of awesome historical, cultural sources
• But occasionally cumbersome to access en masse
Wget and the Internet Archive
• http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
• Let’s grab all the files relating to a given collection
Finding a Collection
• The Boston Public Library Anti-Slavery Collection
• https://archive.org/details/bplscas
• (but there are many others)
Finding a Collection
• Everything in the Internet Archive has a unique URL, like this: http://archive.org/details/[IDENTIFIER]
• So an item might be: http://archive.org/details/lettertowilliaml00doug
• And the collection is: http://archive.org/details/bplscas/
Finding a Collection
• Create a directory to store all our files
• Visit the advanced search page (http://archive.org/advancedsearch.php)
• Click on ‘collection’; a big list loads. Click on ‘bplscas’ and then search
Finding a Collection
• 8,265 results. That’d be a lot of ‘right clicking’ to download.
• We confirm that this is indeed what we want.
• So we go back.
Finding a Collection
• So we do this:
• Scroll down and do a search for “collection:bplscas”; we can sort by “date asc” (ascending dates), and we select CSV format.
• Number of results: 7,971
• Click ‘search’ and download the file
Finding a Collection
• Looks like this. It’s one line per file.
• The first one is dialoguscreatura00nico
• Put that into the search bar, press enter… and voilà…
Finding a Collection
• We can now download every single entry in that list - in this case, everything within the Boston Public Library Anti-Slavery Collection.
• We can decide if we want every single format (probably not), or perhaps just the TXT files, or the PDFs, etc.
![Page 61: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/61.jpg)
Finding a Collection
• Step One: Open the CSV file and delete the first line, which reads ‘identifier’
• Step Two: Save it as a text file: itemlist.txt
• Step Three: use wget. Copy commands from the Internet Archive… :)
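Steps One and Two can also be scripted; a sketch in Python that strips the ‘identifier’ header from the downloaded CSV and writes itemlist.txt (the input filename is an assumption):

```python
import csv

def csv_to_itemlist(csv_path, out_path):
    """Drop the 'identifier' header row and write one IA identifier per line."""
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.reader(src):          # csv.reader also handles quoted values
            if row and row[0] != "identifier":
                dst.write(row[0] + "\n")

# Usage (hypothetical filename for the advanced-search download):
# csv_to_itemlist("search.csv", "itemlist.txt")
```

The resulting itemlist.txt is exactly what the wget commands below expect via -i.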
![Page 62: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/62.jpg)
Example Commands
• All files:
• wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
• Certain file formats
• wget -r -H -nc -np -nH --cut-dirs=2 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
![Page 63: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/63.jpg)
Our command
• We just want the TXT files
• wget -r -H -nc -np -nH --cut-dirs=2 -A .txt -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
![Page 64: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/64.jpg)
Exploring
• Now we have LOTS of text files. Or PDFs. Or EPUBs. Or whatever we want for whatever purposes.
![Page 65: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/65.jpg)
Programmatically Interacting
• Caleb McDaniel’s “Data Mining the Internet Archive Collection” at http://programminghistorian.org/lessons/data-mining-the-internet-archive
• Uses the Python programming language to download metadata (information about information)
![Page 66: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/66.jpg)
Programmatically Interacting
• It goes through and grabs the MARC data (library records) for everything in the Anti-Slavery Collection
• It is decently documented and we don’t have time today. However, we can steal his code.
![Page 67: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/67.jpg)
Stealing Code
#!/usr/bin/python
import internetarchive
import time

error_log = open('bpl-marcs-errors.log', 'a')
search = internetarchive.search_items('collection:bplscas')
for result in search:
    itemid = result['identifier']
    item = internetarchive.get_item(itemid)
    marc = item.get_file(itemid + '_marc.xml')
    try:
        marc.download()  # save the MARC record locally
    except Exception as e:
        error_log.write('Could not download ' + itemid + ': %s\n' % e)
    else:
        time.sleep(1)  # pause between requests to be polite to the servers
![Page 68: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/68.jpg)
Programmatically Interacting
• We save this file into a new directory (slavery-marc) and then run it.
• BORROWING CODE IS OK.
• On command line we could type:
• python ia-download.py
![Page 69: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/69.jpg)
Programmatically Interacting
• The results!
• Using his pymarc script to generate location data.
![Page 70: Accessing Historical Data · Why gather digitized historical data en masse? • It can let you grab data from across the globe for minimal extra effort; • When digitized, it can](https://reader035.vdocuments.us/reader035/viewer/2022070904/5f7239e54bcd455fb0554129/html5/thumbnails/70.jpg)
Programmatically Interacting
• Other tools
• Adam Crymble, “Downloading Multiple Records Using Query Strings.” [http://programminghistorian.org/lessons/downloading-multiple-records-using-query-strings]
Two Main Programs
• obo.py
• (which contains definitions for several functions that you call)
• download-searches.py
• (where you can swap out your query and download the result files, all without visiting the site)
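The trick behind that second script is just URL construction: a search is a query string, so changing one parameter yields a new results page a script can fetch. A minimal sketch with the standard library; the base URL and parameter names here are placeholders, not the Old Bailey Online's real ones ('negro' and 'mulatto' are the sample queries from Crymble's lesson):

```python
# Sketch of the query-string idea: swap one parameter and you get a
# new results URL, no browser needed. BASE and the parameter names
# ('terms', 'start') are hypothetical placeholders, not the real site's.
from urllib.parse import urlencode

BASE = 'https://example.org/search'  # placeholder, not the real site

def search_url(query, start=0):
    """Build a results-page URL for a given query and result offset."""
    return BASE + '?' + urlencode({'terms': query, 'start': start})

# Swap the query without ever visiting the site's search form:
print(search_url('negro', 0))
print(search_url('mulatto', 10))
```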
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
HistoryCrawler
• Download link: http://ianmilligan.ca/historycrawler [link to repository at York University, Toronto]
• Instructions: http://williamjturkel.net/2014/09/09/creating-the-historycrawler-virtual-machine/
• Aims to solve problems of dependencies and reproducibility by giving scholars a ready-made virtual working environment
HistoryCrawler
• Step One: Download HistoryCrawler201407-32b.ova from the previous links
• Step Two: Install Oracle VM VirtualBox (https://www.virtualbox.org/)
• Step Three: File —> Import Appliance —> Select the .ova file to generate your machine
• Step Four: Press 'Start.' You may have to wait 1-2 minutes.
• Step Five: The password is 'go'
HistoryCrawler
Tutorials
• Mary Beth Start (PhD Candidate, Western University, Ontario): http://marybethstart.wordpress.com/2014/09/09/getting-started-virtualbox-and-historycrawler/
• William Turkel (Associate Professor, Western University, Ontario): http://williamjturkel.net/how-to/#virtualmachine
HistoryCrawler: A platform for teaching?
• Does require a decent computer to run it on
• But
• eliminates problems of dependencies;
• installation issues;
• gets everybody on same platform;
• allows for sharing and reproducibility of research inputs/outputs;
• Still in progress - would love any feedback.
Conclusions, Questions & Your Own Data?
Ian Milligan Assistant Professor