Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource

Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom, Laura Russell

GBIF (389 million records!)

http://data.gbif.org/

http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif


Where species are, where species aren’t

http://www.mappinglife.org/Sayornis_saya



The big picture (AKN)

Chronhorogram (Ariño & Otegui, 2010), extracted using BIDDSAT (Otegui & Ariño, 2012)

http://www.unav.es/unzyec/mzna/biddsat/recsperyear.php?prov=10&dataset=all&db=GBIF_201202



An expedition into the Rockies, 1904

http://commons.wikimedia.org/wiki/File:Tent_in_montane_field_site.tif



The Great Outdoors

http://commons.wikimedia.org/wiki/File:Step_Valley_Lake_near_Arapahoe_Glacier.tif




Exploration time



University of Colorado Museum of Natural History (CUMNH) -- founded 1909

http://pinterest.com/cumnh/

http://media-cache-ec3.pinterest.com/avatars/ucmnh-1346976471_600.jpg





Junius Henderson

CUMNH Curator, 1902-1933

http://commons.wikimedia.org/wiki/File:Junius_Henderson.jpg




Exploration time



Henderson’s notebooks

“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer

Wikisource: a transcription platform

Step 1: Scanning (1996)

The Process

1. Images on the Wikimedia Commons.

2. Images + text on Wikisource.

3. Images + text + annotations on Wikisource.

4. Data using the MediaWiki APIs.

• Full details: http://dx.doi.org/10.3897/zookeys.209.3247

• Short URL: http://bit.ly/henderson-paper
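Step 4 can be illustrated with a minimal sketch, assuming the current MediaWiki Action API on English Wikisource. It is not the script used in the project. The helper names `build_query_url` and `extract_wikitext` are hypothetical, and the response body is a hand-written stand-in for what the API returns, so the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Action API endpoint for English Wikisource.
API_ENDPOINT = "https://en.wikisource.org/w/api.php"

# A page title taken from the Wikisource URL shown later in these slides.
PAGE_TITLE = "Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3"


def build_query_url(page_title: str) -> str:
    """Build a URL asking for the latest wikitext of one page (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": page_title,
        "format": "json",
        "formatversion": "2",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_wikitext(response_body: str) -> str:
    """Pull the wikitext out of a formatversion=2 JSON response (hypothetical helper)."""
    data = json.loads(response_body)
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


url = build_query_url(PAGE_TITLE)

# Hand-written stand-in for an API response; real responses carry more fields.
sample_response = json.dumps({
    "query": {
        "pages": [{
            "title": PAGE_TITLE,
            "revisions": [{
                "slots": {"main": {"content": "{{dated|1905-07-28|July 28, 1905}}"}}
            }],
        }]
    }
})
wikitext = extract_wikitext(sample_response)
```

In real use, the URL would be fetched with any HTTP client and the response body passed to `extract_wikitext`.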


#1. The Wikimedia Commons

Copyright?

http://commons.wikimedia.org/wiki/File:Licensing_tutorial_en.svg



Copyright!

http://commons.wikimedia.org/wiki/Template:PD-scan

http://commons.wikimedia.org/wiki/Template:PD-US-unpublished





Result #1: Images

http://commons.wikimedia.org/wiki/File:Field_Notes_of_Junius_Henderson,_Notebook_1.pdf



#2. Images + text

http://en.wikisource.org/wiki/Index:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu



Just like Wikipedia

http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371





Dr. Peter Robinson

http://cumuseum.colorado.edu/about/newsdetail.php?newsID=3

CUMNH Director, 1971-1982

Transcribed Henderson’s notebooks, 2000-02



Step 2: Transcription

Result #2: Images + text






Combining multiple pages

#3. Images + text + annotations






Wikipedia templates

Wikipedia templates are everywhere

The “Neutrality” template

Examples of templates

A template of our own

{{element|formal name of this element|element as written by Henderson}}

Examples:

{{taxon|Sayornis saya|Say Phoebe}}

{{taxon|Carduelis pinus|siskins}}

{{taxon|Siskin|siskins}}

A template of our own

{{element|formal name of this element|element as written by Henderson}}

Examples:

{{dated|1905-07-28|July 28, 1905}}

{{place|Boulder, Colorado|Boulder, Colo}}
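The template syntax is regular enough to read with a short pattern. The following is a minimal sketch, assuming only the three element names shown in these slides (taxon, dated, place) and no nested templates. The `Annotation` type and `find_annotations` function are hypothetical names, and the sample sentence is made up around the slide examples.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    element: str  # template name: "taxon", "dated" or "place"
    formal: str   # formal name of the element
    written: str  # the element as written by Henderson


# Matches {{element|formal|written}}; assumes no nested templates or extra pipes.
ANNOTATION_RE = re.compile(r"\{\{(taxon|dated|place)\|([^|{}]*)\|([^|{}]*)\}\}")


def find_annotations(wikitext: str) -> List[Annotation]:
    """Return every annotation in the text, in reading order."""
    return [
        Annotation(m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in ANNOTATION_RE.finditer(wikitext)
    ]


# Invented sentence that reuses the template examples from the slides.
sample = (
    "{{dated|1905-07-28|July 28, 1905}} Saw a "
    "{{taxon|Sayornis saya|Say Phoebe}} near "
    "{{place|Boulder, Colorado|Boulder, Colo}}."
)
annotations = find_annotations(sample)
```

Keeping both the formal name and the verbatim text means the original wording stays recoverable from the extracted data.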

#3. Annotations

Calling all volunteers!

Result #3. Image + text + annotations!

Volunteers arrive

#4. Data

http://www.mappinglife.org/Sayornis_saya



Simple algorithm

Complicated script

Complicated, open source script

Result #4. (Text + Images + Annotation) = Data!
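One way annotated text could become occurrence records is sketched below. This is an assumption for illustration, not the project's open source script: it walks the annotations in reading order and attaches the most recently seen date and place to each taxon. The function name `annotations_to_records`, the record field names, and the sample entry are all hypothetical.

```python
import re
from typing import Dict, List, Optional

# Matches {{element|formal|written}}; assumes no nested templates or extra pipes.
ANNOTATION_RE = re.compile(r"\{\{(taxon|dated|place)\|([^|{}]*)\|([^|{}]*)\}\}")


def annotations_to_records(wikitext: str) -> List[Dict[str, Optional[str]]]:
    """Emit one record per taxon, carrying forward the latest date and place."""
    current_date: Optional[str] = None
    current_place: Optional[str] = None
    records: List[Dict[str, Optional[str]]] = []
    for match in ANNOTATION_RE.finditer(wikitext):
        element, formal, written = (part.strip() for part in match.groups())
        if element == "dated":
            current_date = formal
        elif element == "place":
            current_place = formal
        else:  # element == "taxon"
            records.append({
                "taxon": formal,
                "verbatim_name": written,
                "date": current_date,
                "place": current_place,
            })
    return records


# Invented notebook entry that reuses the template examples from the slides.
sample_entry = (
    "{{dated|1905-07-28|July 28, 1905}} At "
    "{{place|Boulder, Colorado|Boulder, Colo}} saw a "
    "{{taxon|Sayornis saya|Say Phoebe}} and several "
    "{{taxon|Carduelis pinus|siskins}}."
)
records = annotations_to_records(sample_entry)
```

A taxon mentioned before any date or place gets `None` for those fields in this sketch, which makes gaps in the annotations easy to spot.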

Where do we go from here?

http://commons.wikimedia.org/wiki/File:Bighorn_sheep_skull_at_Arapaho_glacier,_1904.tif



More books to upload

More books to transcribe

More books to transcribe

http://www.biodiversitylibrary.org/



A better Wikisource

https://commons.wikimedia.org/wiki/File:Wikisource_2012_-_Aubrey.pdf?page=19



“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer

Thanks!

Find out more at http://bit.ly/jhfnblog


The following slides were not used in my presentation

Museum collections

Museum records

240. Physa anatina Lea

Identified by Bastsch

Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson

Problem: context

240. Physa anatina Lea

Identified by Bastsch

Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson

Technology