Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource by Gaurav Vaidya, based on a paper by Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom and Laura Russell. Presented November 8, 2012 (http://www.mcn.edu/2012/extracting-data-historical-documents-crowdsourcing-annotations-wikisource). Find out more at http://bit.ly/jhfnblog
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource
Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom, Laura Russell
GBIF (389 million records!)
http://data.gbif.org/
Where species are, where species aren’t
http://www.mappinglife.org/Sayornis_saya
The big picture (AKN)
Chronhorogram (Ariño & Otegui, 2010), extracted using BIDDSAT (Otegui & Ariño, 2012)
http://www.unav.es/unzyec/mzna/biddsat/recsperyear.php?prov=10&dataset=all&db=GBIF_201202
An expedition into the Rockies, 1904
http://commons.wikimedia.org/wiki/File:Tent_in_montane_field_site.tif
The Great Outdoors
http://commons.wikimedia.org/wiki/File:Step_Valley_Lake_near_Arapahoe_Glacier.tif
http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif
Exploration time
University of Colorado Museum of Natural History (CUMNH) -- founded 1909
http://pinterest.com/cumnh/
http://media-cache-ec3.pinterest.com/avatars/ucmnh-1346976471_600.jpg
Junius Henderson
CUMNH Curator, 1902-1933
http://commons.wikimedia.org/wiki/File:Junius_Henderson.jpg
http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif
Exploration time
Henderson’s notebooks
“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer
Wikisource: a transcription platform
Step 1: Scanning (1996)
The Process
1. Images on the Wikimedia Commons.
2. Images + text on Wikisource.
3. Images + text + annotations on Wikisource.
4. Data using the MediaWiki APIs.
• Full details: http://dx.doi.org/10.3897/zookeys.209.3247
• Short URL: http://bit.ly/henderson-paper
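Step 4 of the process can be sketched in Python using only the standard library. This is a minimal illustration, not the project's own script: it asks the English Wikisource MediaWiki API for the current wikitext of one page. The function names and the User-Agent string are made up for this example, and the response layout assumed in `extract_wikitext` is the one the API returns for `formatversion=2` with `rvslots=main`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikisource.org/w/api.php"


def build_query_url(page_title):
    """Build an API URL asking for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": page_title,
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)


def extract_wikitext(response):
    """Pull the wikitext out of a decoded response (formatversion=2 layout)."""
    page = response["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(page_title):
    """Download the wikitext of a page (requires network access)."""
    request = Request(
        build_query_url(page_title),
        headers={"User-Agent": "henderson-notes-example/0.1"},  # hypothetical
    )
    with urlopen(request) as reply:
        return extract_wikitext(json.load(reply))
```

Calling `fetch_wikitext("Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3")` would return the transcribed text of that notebook page, including any annotation templates it contains.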
#1. The Wikimedia Commons
Copyright?
http://commons.wikimedia.org/wiki/File:Licensing_tutorial_en.svg
Copyright!
http://commons.wikimedia.org/wiki/Template:PD-scan
http://commons.wikimedia.org/wiki/Template:PD-US-unpublished
Result #1: Images
http://commons.wikimedia.org/wiki/File:Field_Notes_of_Junius_Henderson,_Notebook_1.pdf
#2. Images + text
http://en.wikisource.org/wiki/Index:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu
Just like Wikipedia
http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371
Dr. Peter Robinson
http://cumuseum.colorado.edu/about/newsdetail.php?newsID=3
CUMNH Director, 1971-1982
Transcribed Henderson’s notebooks, 2000-02
Step 2: Transcription
Result #2: Images + text
http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371
Combining multiple pages
#3. Images + text + annotations
http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371
Wikipedia templates
Wikipedia templates are everywhere
The “Neutrality” template
Examples of templates
A template of our own
{{element|formal name of this element|element as written by Henderson}}
Examples:
{{taxon|Sayornis saya|Say Phoebe}}
{{taxon|Carduelis pinus|siskins}}
{{taxon|Siskin|siskins}}
A template of our own
{{element|formal name of this element|element as written by Henderson}}
Examples:
{{dated|1905-07-28|July 28, 1905}}
{{place|Boulder, Colorado|Boulder, Colo}}
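Because every annotation follows the same two-argument pattern, the templates are easy to pull back out of the wikitext. The sketch below is an illustration only, assuming annotations are never nested and never contain extra `|` characters; the `Annotation` type and `find_annotations` function are hypothetical names, and the sample sentence is invented from the examples above.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    kind: str      # "taxon", "dated" or "place"
    formal: str    # formal name of the element
    verbatim: str  # element as written by Henderson


# Matches {{kind|formal|verbatim}} for the three template names shown above.
ANNOTATION_RE = re.compile(r"\{\{(taxon|dated|place)\|([^|{}]*)\|([^|{}]*)\}\}")


def find_annotations(wikitext):
    """Return every annotation template in the text, in order of appearance."""
    return [
        Annotation(m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in ANNOTATION_RE.finditer(wikitext)
    ]


sample = (
    "{{dated|1905-07-28|July 28, 1905}} Saw a "
    "{{taxon|Sayornis saya|Say Phoebe}} near "
    "{{place|Boulder, Colorado|Boulder, Colo}}."
)
for annotation in find_annotations(sample):
    print(annotation.kind, "->", annotation.formal)
```

Keeping both arguments preserves Henderson's original wording alongside the standardized value, which is the point of the two-argument design.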
#3. Annotations
Calling all volunteers!
Result #3. Image + text + annotations!
Volunteers arrive
#4. Data
http://www.mappinglife.org/Sayornis_saya
Simple algorithm
Complicated script
Complicated, open source script
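The slide text does not spell out the algorithm, so the following is one plausible reading rather than the authors' open source script. It assumes that each taxon annotation is paired with the most recent date and place annotations that precede it in the text; the `occurrences` function and the record field names are hypothetical.

```python
import re

# Matches {{kind|formal|verbatim}} for the three annotation templates.
ANNOTATION_RE = re.compile(r"\{\{(taxon|dated|place)\|([^|{}]*)\|([^|{}]*)\}\}")


def occurrences(wikitext):
    """Turn annotated text into occurrence records.

    Assumption: a taxon belongs to the latest date and place seen before it.
    """
    current_date = None
    current_place = None
    records = []
    for match in ANNOTATION_RE.finditer(wikitext):
        kind, formal, verbatim = match.groups()
        if kind == "dated":
            current_date = formal
        elif kind == "place":
            current_place = formal
        else:  # kind == "taxon"
            records.append({
                "taxon": formal,
                "verbatim": verbatim,
                "date": current_date,
                "place": current_place,
            })
    return records
```

A taxon that appears before any date or place annotation gets `None` for those fields, which makes incomplete records easy to spot and review by hand.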
Result #4. (Text + Images + Annotation) = Data!
Where do we go from here?
http://commons.wikimedia.org/wiki/File:Bighorn_sheep_skull_at_Arapaho_glacier,_1904.tif
More books to upload
More books to transcribe
http://www.biodiversitylibrary.org/
A better Wikisource
https://commons.wikimedia.org/wiki/File:Wikisource_2012_-_Aubrey.pdf?page=19
“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer
The following slides were not used in my presentation
Museum collections
Museum records
240. Physa anatina Lea ... Identified by Bastsch
Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson
Problem: context
240. Physa anatina Lea ... Identified by Bastsch
Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson