Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource

Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom, Laura Russell

Upload: gaurav-vaidya

Post on 01-Jul-2015

DESCRIPTION

Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource by Gaurav Vaidya, based on a paper by Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom and Laura Russell. Presented November 8, 2012 (http://www.mcn.edu/2012/extracting-data-historical-documents-crowdsourcing-annotations-wikisource). Find out more at http://bit.ly/jhfnblog

TRANSCRIPT

Page 1: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource

Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource

Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom, Laura Russell

Page 4

The big picture (AKN)

Chronhorogram (Ariño & Otegui, 2010), extracted using BIDDSAT (Otegui & Ariño, 2012)
http://www.unav.es/unzyec/mzna/biddsat/recsperyear.php?prov=10&dataset=all&db=GBIF_201202

Page 5

An expedition into the Rockies, 1904

http://commons.wikimedia.org/wiki/File:Tent_in_montane_field_site.tif

Page 6

The Great Outdoors

http://commons.wikimedia.org/wiki/File:Step_Valley_Lake_near_Arapahoe_Glacier.tif

Page 7

http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif

Exploration time

Page 9

Junius Henderson
CUMNH Curator, 1902-1933

http://commons.wikimedia.org/wiki/File:Junius_Henderson.jpg

Page 10

http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif

Exploration time

Page 11

Henderson’s notebooks

Page 14

“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer

Page 15

Wikisource: a transcription platform

Page 16

Step 1: Scanning (1996)

Page 17

The Process

1. Images on the Wikimedia Commons.

2. Images + text on Wikisource.

3. Images + text + annotations on Wikisource.

4. Data using the MediaWiki APIs.

• Full details: http://dx.doi.org/10.3897/zookeys.209.3247

• Short URL: http://bit.ly/henderson-paper
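Step 4 of the process above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own script: it builds a standard MediaWiki `action=query` request for the current wikitext of one page and pulls the text out of a decoded JSON response. The endpoint is the usual English Wikisource API URL, the page title is a made-up placeholder, and the network fetch itself is left out so the example stays self-contained.

```python
from urllib.parse import urlencode

# Standard MediaWiki API endpoint for English Wikisource.
API_ENDPOINT = "https://en.wikisource.org/w/api.php"


def build_wikitext_query(title: str) -> str:
    """Return an API URL asking for the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def wikitext_from_response(payload: dict) -> str:
    """Extract the wikitext from a decoded formatversion=2 JSON response."""
    page = payload["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


# Hypothetical page title, used only for illustration.
url = build_wikitext_query("Page:Example.djvu/5")

# Hand-written stand-in for what the API would return for that request.
sample_response = {
    "query": {
        "pages": [
            {
                "title": "Page:Example.djvu/5",
                "revisions": [
                    {"slots": {"main": {"content": "{{taxon|Sayornis saya|Say Phoebe}}"}}}
                ],
            }
        ]
    }
}
text = wikitext_from_response(sample_response)
```

Fetching `url` with any HTTP client and decoding the JSON body would give a payload of the same shape as `sample_response`.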

Page 18

#1. The Wikimedia Commons

Page 21

Result #1: Images

http://commons.wikimedia.org/wiki/File:Field_Notes_of_Junius_Henderson,_Notebook_1.pdf

Page 22

#2. Images + text

http://en.wikisource.org/wiki/Index:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu

Page 24

Dr. Peter Robinson

http://cumuseum.colorado.edu/about/newsdetail.php?newsID=3

CUMNH Director, 1971-1982
Transcribed Henderson’s notebooks, 2000-02

Page 25

Step 2: Transcription

Page 28

Combining multiple pages

Page 30

Wikipedia templates

Page 31

Wikipedia templates are everywhere

Page 32

The “Neutrality” template

Page 33

The “Neutrality” template

Page 34

Examples of templates

Page 35

Examples of templates

Page 36

Examples of templates

Page 37

A template of our own

{{element|formal name of this element|element as written by Henderson}}

Examples:

{{taxon|Sayornis saya|Say Phoebe}}
{{taxon|Carduelis pinus|siskins}}
{{taxon|Siskin|siskins}}

Page 38

A template of our own

{{element|formal name of this element|element as written by Henderson}}

Examples:

{{dated|1905-07-28|July 28, 1905}}
{{place|Boulder, Colorado|Boulder, Colo}}
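Because every annotation follows the same `{{element|formal name|text as written}}` pattern, it can be pulled back out of the wikitext mechanically. The sketch below is one way to do that with a regular expression. It assumes only the three element names shown on these slides (`taxon`, `dated`, `place`) and no nested templates. The sample sentence is invented for illustration and is not a quotation from the notebooks.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    element: str   # template name: taxon, dated or place
    formal: str    # normalized value supplied by the annotator
    verbatim: str  # text as written in the notebook


# Matches {{taxon|...|...}}, {{dated|...|...}} and {{place|...|...}}.
ANNOTATION_RE = re.compile(r"\{\{(taxon|dated|place)\|([^|{}]*)\|([^|{}]*)\}\}")


def extract_annotations(wikitext: str) -> List[Annotation]:
    """Return every annotation in the order it appears in the text."""
    return [
        Annotation(m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in ANNOTATION_RE.finditer(wikitext)
    ]


def plain_text(wikitext: str) -> str:
    """Replace each annotation with the verbatim text it wraps."""
    return ANNOTATION_RE.sub(lambda m: m.group(3), wikitext)


# Invented sample line that reuses the example templates from the slides.
sample = (
    "{{dated|1905-07-28|July 28, 1905}} Saw "
    "{{taxon|Carduelis pinus|siskins}} near "
    "{{place|Boulder, Colorado|Boulder, Colo}}."
)
annotations = extract_annotations(sample)
readable = plain_text(sample)
```

Keeping the verbatim text as the last template argument means the transcription stays readable once the markup is stripped, while the formal name carries the normalized value.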

Page 39

#3. Annotations

Page 40

#3. Annotations

Page 41

#3. Annotations

Page 42

Calling all volunteers!

Page 43

Calling all volunteers!

Page 44

Result #3. Image + text + annotations!

Page 45

Volunteers arrive

Page 46

Volunteers arrive

Page 48

Simple algorithm

Page 49

Simple algorithm

Page 50

Simple algorithm

Page 51

Simple algorithm

Page 52

Simple algorithm

Page 53

Complicated script

Page 54

Complicated, open source script

Page 55

Result #4. (Text + Images + Annotation) = Data!
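The transcript does not preserve the algorithm shown on the preceding slides, so the following is only one plausible reading and not the authors' script. It assumes annotations arrive in reading order as `(element, formal name, verbatim text)` tuples, carries the most recent date and place forward, and emits one record per taxon. The function name, record fields and sample values are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def annotations_to_records(
    annotations: Iterable[Tuple[str, str, str]]
) -> List[Dict[str, Optional[str]]]:
    """Pair each taxon with the last date and place seen before it."""
    current_date: Optional[str] = None
    current_place: Optional[str] = None
    records: List[Dict[str, Optional[str]]] = []
    for element, formal, verbatim in annotations:
        if element == "dated":
            current_date = formal
        elif element == "place":
            current_place = formal
        elif element == "taxon":
            records.append(
                {
                    "taxon": formal,
                    "verbatim": verbatim,
                    "date": current_date,
                    "place": current_place,
                }
            )
    return records


# Illustrative input built from the template examples shown earlier in the deck.
records = annotations_to_records(
    [
        ("dated", "1905-07-28", "July 28, 1905"),
        ("place", "Boulder, Colorado", "Boulder, Colo"),
        ("taxon", "Sayornis saya", "Say Phoebe"),
        ("taxon", "Carduelis pinus", "siskins"),
    ]
)
```

A taxon that appears before any date or place annotation gets `None` for the missing field, which makes gaps in the annotation easy to spot.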

Page 56

Where do we go from here?

http://commons.wikimedia.org/wiki/File:Bighorn_sheep_skull_at_Arapaho_glacier,_1904.tif

Page 57

More books to upload

Page 58

More books to transcribe

Page 60

A better Wikisource

https://commons.wikimedia.org/wiki/File:Wikisource_2012_-_Aubrey.pdf?page=19

Page 61

“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer

Page 62

Thanks!
Find out more at http://bit.ly/jhfnblog

Page 64

The following slides were not used in my presentation

Page 65

Museum collections

Page 66

Museum records

240. Physa anatina Lea
Identified by Bastsch
Creek 1 mile north of Loveland, Colo.
June 9, 1906. Junius Henderson

Page 67

Museum records

240. Physa anatina Lea
Identified by Bastsch
Creek 1 mile north of Loveland, Colo.
June 9, 1906. Junius Henderson

Page 68

Problem: context

240. Physa anatina Lea
Identified by Bastsch
Creek 1 mile north of Loveland, Colo.
June 9, 1906. Junius Henderson