
Page 1:

Kimberly A. Jameson1, Sean Tauber1, Prutha S. Deshpande2, Stephanie M. Chang3, and Sergio Gago3

1Institute for Mathematical Behavioral Sciences, 2Cognitive Sciences, and 3Calit2

University of California, Irvine

INSTITUTE FOR MATHEMATICAL BEHAVIORAL SCIENCES, UC IRVINE

Crowdsourcing the transcription of archival data

Page 2:

UCI ColCat Project Collaborators:

Funding and Support for the archive project: Calit2 at UCI. University of California Pacific Rim Research Program, 2010-2015 (K.A. Jameson, PI). National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI). UCI's UROP Program Awards. IRB Approvals HS#2013-9921 and 2015-9047.

Prutha Deshpande
Sean Tauber
Stephanie Chang
Sergio Gago
Nathan Benjamin
Yang Jiao
Brian Huynh
Han Ke
Ram Bhakta
Zhimin Xiang
Ian Harris

Page 3:

Prutha S. Deshpande (CogSci)
Sean Tauber (IMBS)
Sergio Gago (Calit2)
Stephanie M. Chang (Calit2)

UCI ColCat Project Collaborators:

Page 4:

Talk Overview

• Background on an important problem in Cognitive Science.

• The domain under consideration: color categorization.

• Creating a new database using internet-based procedures.

• Features of the internet-based research problem and solution approaches that may generalize elsewhere.

• Modeling the problem and developing appropriate analyses.

• Preliminary results from empirical tests.

• Summary.

Page 5:

Research on how concepts are represented across linguistic groups

✶ Individual concept formation and the sharing and transmission of concepts within and across groups.

E.g., kinship terminology …

Page 6:

Concept formation across language groups

E.g., Kinship terminology:

https://en.wikipedia.org/wiki/Kinship

Page 7:

Concept formation across language groups

https://en.wikipedia.org/wiki/Kinship

E.g., Kinship terminology:

Page 8:

In what ways are representations of concepts similar across individuals and language groups?

and

What are the various ways concepts vary across individuals and language groups?

Page 9:

How do the world’s languages map the color appearances we all see in our environments?

Page 10:

Basic Color Terms (1969)

Brent Berlin and Paul Kay

Basic color terms are described as "the smallest set of simple words with which the speaker can name any color."

Page 11:

Courtesy of Lindsey & Brown (2006). PNAS, 102.

Page 12:

Image Credit: Lindsey & Brown (2006). PNAS, 102.

Page 13:

Basic Color Terms (1969)

(1) Found all languages tested had systems including 11 or fewer basic color words (e.g., English: red, yellow, green, blue, orange, purple, pink, brown, grey, black, and white).

(2) Provided a sequence by which languages adopted subsets of the 11 basic color categories.

(Terms such as crimson, blonde, and royal blue are not considered to be basic.)

Page 14:

IMBS workshop, UC Irvine, 12/04/2015

Color concept universals like these were made popular by Berlin & Kay and by several other investigators; still, there are instances where different societies have evolved different conventions for color naming ...

Page 15:

Image Credit: Lindsey & Brown (2006). PNAS, 102.

Page 16:

Courtesy of Lindsey & Brown (2006). PNAS, 102.

Berinmo (5 words)

Image Credit: Kay & Regier (2007). Cognition, 102.

Pages 17-20:

Different numbers of color terms: n = 3, n = 4, n = 5, n = 6.

T. Regier et al. (2007). PNAS, 104.

Page 21:

The World Color Survey

✶ 110 languages; ~25 speakers each.

✶ Data collection ended in 1980.

✶ Digitizing the hand-coded data took more than 23 years.

✶ A very valuable site of unembellished ASCII data files: http://www.icsi.berkeley.edu/wcs/data.html

Page 22:

World Color Survey Data — Uses a Generic Format

Page 23:

The existing World Color Survey (WCS) database

✶ Beginning ~2003, the WCS database was made publicly available.

✶ It has been very widely cited in the last few years.

http://www.icsi.berkeley.edu/wcs/data.html

Page 24:

E.g., focus selection task: shown the chart, informants pinpoint the "best example" of each root term they volunteered while naming.

Page 25:

Datafile example "foci.txt": the color chip selected as a category's best exemplar. (WCS datafiles do not include headers.)

Fields, in order: Language number, Speaker number, Focus number, Term abbreviation, Coordinates of the focus selection.
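As a minimal sketch of how one such record might be read, assuming whitespace-separated columns in the field order listed above (the parser, field names, and the example row are illustrative, not the project's actual code; check the files at the ICSI site for the exact layout):

```python
from typing import NamedTuple

class FocusRecord(NamedTuple):
    """One best-exemplar (focus) selection from a WCS foci.txt row."""
    language: int  # WCS language number
    speaker: int   # speaker number within that language
    focus: int     # index of this speaker's focus selection
    term: str      # abbreviated color term volunteered while naming
    chip: str      # grid coordinates of the selected chip, e.g. "G3"

def parse_foci_line(line: str) -> FocusRecord:
    """Parse one whitespace-separated foci.txt row (the files have no header)."""
    lang, spkr, focus, term, chip = line.split()
    return FocusRecord(int(lang), int(spkr), int(focus), term, chip)

# Made-up row in the slide's field order:
rec = parse_foci_line("1 5 1 LF G3")
print(rec.term, rec.chip)  # -> LF G3
```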

Page 26:

Focus selections in two languages:

Deshpande, P.S. (under review). Investigating Color Categorization Behaviors in Korean-English Bilinguals. UCI Undergraduate Research Journal (submitted June, 2015).

English

Korean

Page 27:

The WCS data is awesome, but … a platform with a GUI for empirically investigating and analyzing such data would be even better, and a site with rigorous on-board research tools would also be a big plus.

We were given a chance to do this…

Jameson, K. A., Benjamin, N. A., Chang, S. M., Deshpande, P. S., Gago, S., Harris, I. G., Jiao, Y., and Tauber, S. (2015). Mesoamerican Color Survey Digital Archive. In Encyclopedia of Color Science and Technology (Ronnier Luo, Ed.). Springer: Berlin / Heidelberg. ISBN: 978-3-642-27851-8 (Online). DOI 10.1007/978-3-642-27851-8.

See Poster (Nathan Benjamin): An Affordance-Based Approach to Large Data-Set Navigation.

Page 28:

The Robert E. MacLaury Archive

✶ ~23,000 pages of raw color categorization data that includes:

✶ 116 dialects from indigenous Mesoamerican societies (261 surveys), and

✶ ~130 additional surveys from a variety of languages (across Africa, Asia, the Americas and Europe).

Page 29:

R. E. MacLaury's dissertation: Color in MesoAmerica, Vol. I: A Theory of Composite Categorization (1986).

Book: Color and Cognition in Mesoamerica: Constructing Categories as Vantages (1997).

Page 30:

The Mesoamerican portion of the REM archive:

37 within Oaxaca
30 within Guatemala
33 within Mexico City

Jameson et al. (2015). ECST.

Page 31:

Chinantec language diversity in the MCS

Page 32:

Chinantec language diversity in the MCS

Language vitality: Developing, Vigorous, Endangered.

Jameson et al. (2015). ECST.

Page 33:

Features of our transcription problem that may be general:

✶ The data has a constrained structure and format (unlike typical historical-records transcription tasks).

✶ It's a perceptual identification/reproduction problem: e.g., identify handwritten characters/symbols in a standardized template or form and reproduce them via keyboard input.

✶ Transcription of large blocks of data can be broken into small tasks and transcribed by OCR or crowdsourcing methods (see the sketch below).

See Poster (Yang Jiao): Optical Character Recognition of Handwritten Tabular Data.
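A hedged illustration of the third point: a hypothetical tiler that crops a scanned survey page into per-cell images, each of which could be posted as a separate crowdsourced micro-task. The geometry and filenames are invented for illustration (Pillow assumed installed); a real MCS scan would need measured cell boundaries:

```python
from PIL import Image  # pip install Pillow

def tile_scan(path: str, rows: int, cols: int) -> None:
    """Split a scanned tabular survey page into one image per cell so each
    cell can be transcribed independently by a crowd worker or OCR."""
    page = Image.open(path)
    w, h = page.size
    cw, ch = w // cols, h // rows
    for r in range(rows):
        for c in range(cols):
            cell = page.crop((c * cw, r * ch, (c + 1) * cw, (r + 1) * ch))
            cell.save(f"task_r{r:02d}_c{c:02d}.png")

# Hypothetical call; 10 x 33 = 330 cells, matching the 330 chips of a WCS-style chart.
tile_scan("survey_page.png", rows=10, cols=33)
```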

Page 34:

Focus selection task: shown the chart, informants pinpoint the "best example" of each root term they volunteered while naming.

Page 35:

Focus selection task: shown the chart, informants pinpoint the "best example" of each root term they volunteered while naming.

Problem: Convert THIS into a data-addressable file.

Page 36:

American English Data

Problem: Convert THIS into a data-addressable file.

Page 37:

DATA

... continues up to 330 ...

Page 38:

Challenges of our transcription job:

• Concepts, and how they apply everywhere.

• There's a classic example: color.

• There's an existing database.

• There's a chance to do better.

• Crowdsourcing can help greatly.

• Why OCR doesn't work: handwriting that is not prose.

• The reason is that it's a perceptual problem.

• Crowdsourcing lets us break the problem into pieces and solve it piecewise.

Pages 39-42:

Features of our problem and approach that may apply elsewhere:

✶ The perceptual nature of our tasks differs from general information surveys or opinion-poll data; e.g., response bias is likely to be item-based rather than the usual informant-based form, perhaps allowing more than one possible decision strategy.

✶ In large-scale efforts there's a need to automate quantification and evaluation of the "goodness" of the transcribed product.

✶ Minimizing response bias by partitioning larger tasks into smaller, distributed tasks that are answered by several subjects and reassembled into a whole lends itself to crowdsourced approaches.

✶ While crowdsourcing makes Big Data possible, an intelligent model of data aggregation (like CCT) may permit trading off "smarter" data for "bigger" data, giving a more economical approach to accurately deriving robust results using internet-based crowdsourcing methods.

National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI).

Page 43:

Cultural consensus analyses of a cognitive-perceptual task

✶ For tasks evaluating new characters designed to extend the 26 letters of the English alphabet, consensus analyses objectively identified expert typeface designers as having higher "competence" compared to college undergraduates.

Jameson & Romney (1990). Consensus on Semiotic Models of Alphabetic Systems. J. of Quant. Anthro.

Batchelder & Romney (1988). Test theory without an answer key. Psychometrika.

Page 44:

Automating archive transcription: Tasks and Judgments

Design 1: OCR verification (pattern recognition) - 2-AFC yes/no
Design 2: OCR verification (training data) - free response
Design 3: Crowdsource verification - 2-AFC "match/no-match"
Design 4: Naming ranges 1 - free response + confidence
Design 5: Naming ranges 2 - N-AFC + confidence
Design 6: Focus transcription 1 - free response + confidence
Design 7: Focus transcription 2 - free response

"Free response" = a reCAPTCHA-style task.

See Poster (Stephanie Chang): Designing Crowdsourcing Methods for the Transcription of Handwritten Documents.
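One way to picture the seven designs as data, e.g. for routing trials to the right task template server-side. This structure is purely illustrative; the Design class and its field names are invented here, not the ColCat project's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Design:
    number: int
    purpose: str
    response: str        # "2-AFC", "N-AFC", or "free" (a reCAPTCHA-style task)
    confidence: bool = False  # whether a confidence rating is also collected

DESIGNS = [
    Design(1, "OCR verification (pattern recognition)", "2-AFC"),
    Design(2, "OCR verification (training data)", "free"),
    Design(3, "Crowdsource verification (match/no-match)", "2-AFC"),
    Design(4, "Naming ranges 1", "free", confidence=True),
    Design(5, "Naming ranges 2", "N-AFC", confidence=True),
    Design(6, "Focus transcription 1", "free", confidence=True),
    Design(7, "Focus transcription 2", "free"),
]
```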

Page 45:

E.g., internet-based transcription task:

http://colcat.calit2.uci.edu

Page 46:

Cultural Consensus Theory (CCT) to aggregate the data

— Automate piece-wise crowdsourced transcription designs for analysis with CCT to derive the correct transcription.

— Enrich the model underlying the dichotomous Bayesian form of CCT (Oravecz et al., 2014) to handle N-alternative forced-choice data formats.

— As a result, employ smarter analyses of smaller samples, using CCT's formal process model, that produce solutions as robust as those from large amounts of "averaged" data.

Deshpande, Tauber, Chang, Gago & Jameson (in preparation). Digitizing a large corpus of handwritten documents using crowdsourcing and cultural consensus theory.

See Poster (Prutha Deshpande): A Cultural Consensus Theory Analysis of Crowdsourced Transcription Data.

Page 47:

Results:

Page 48:

Results: Task 4 (n=30)

Page 49:

Results: Task 4 (n=30)

"hi", "hl"

Page 50:

Results: Task 4 (n=30)

Page 51:

Inferring the true transcription

• Mode?

• (Bayesian) Cultural Consensus Theory (CCT) (Oravecz, Vandekerckhove & Batchelder, 2014; Batchelder & Romney, 1988)

Page 52:

Cultural Consensus Theory (CCT)

• "Test theory without an answer key" (Batchelder & Romney, 1988)

• Allows us to infer (see the sketch below):

  • shared latent cultural knowledge (the true transcription)

  • individual ability

  • item difficulty

  • response bias
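To make the aggregation idea concrete, here is a deliberately simplified, non-Bayesian stand-in for CCT: initialize the answer key with the modal answer per item, estimate each subject's competence as agreement with the current key, then re-estimate the key with competence-weighted (log-odds) votes. This Dawid-Skene-flavored loop is only a sketch; the analyses in this talk use the Bayesian CCT of Oravecz et al. (2014), not this code:

```python
import math
from collections import Counter, defaultdict

def weighted_consensus(responses, iters=10):
    """responses: dict mapping (subject, item) -> transcribed answer (str).
    Returns (answer_key, competence): the inferred key per item and each
    subject's estimated competence (agreement rate with the key)."""
    items = {i for _, i in responses}
    subjects = {s for s, _ in responses}
    # Start from the modal ("majority vote") answer for every item.
    key = {i: Counter(a for (_, it), a in responses.items() if it == i)
              .most_common(1)[0][0] for i in items}
    comp = {s: 0.5 for s in subjects}
    for _ in range(iters):
        # Competence = each subject's agreement rate with the current key.
        for s in subjects:
            answered = [(it, a) for (su, it), a in responses.items() if su == s]
            agree = sum(a == key[it] for it, a in answered)
            comp[s] = min(max(agree / len(answered), 0.01), 0.99)
        # Re-estimate the key with log-odds-weighted votes.
        for i in items:
            votes = defaultdict(float)
            for (s, it), a in responses.items():
                if it == i:
                    votes[a] += math.log(comp[s] / (1 - comp[s]))
            key[i] = max(votes, key=votes.get)
    return key, comp

# Tiny example using the confusable pair from the results slides:
data = {("s1", "q1"): "hi", ("s2", "q1"): "hi", ("s3", "q1"): "hl",
        ("s1", "q2"): "G3", ("s2", "q2"): "G3", ("s3", "q2"): "G3"}
answer_key, competence = weighted_consensus(data)
print(answer_key)  # {'q1': 'hi', 'q2': 'G3'} (key order may vary)
```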

Page 53:

Cultural Consensus Theory (CCT)

• Usually applied to dichotomous (true/false) data.

• Other response formats have been explored within the Bayesian framework, but not multiple choice / free response (to our knowledge).

• Not typically applied to perceptual identification (although, see Jameson & Romney, 1990).

Pages 54-61:

Dichotomous CCT vs. Multiple Choice CCT: diagrams building up each model's observed data and latent parameters (including subject-wise bias).
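A minimal sketch of the response rule such a multiple-choice model typically implies, assuming the common CCT form: with probability D (competence) the annotator knows and reports the key; otherwise they guess among the K alternatives according to a bias distribution g. Whether g belongs to the subject or to the item is exactly the modeling choice taken up on the next slides. The function and names are illustrative, not the authors' implementation:

```python
import numpy as np

def response_prob(answer: int, key: int, D: float, g: np.ndarray) -> float:
    """P(annotator reports `answer` | true key, competence D, guessing bias g).

    answer, key : alternative indices in {0, ..., K-1}
    D           : probability the annotator knows the true answer
    g           : length-K guessing distribution over the alternatives
    """
    p_guess = (1.0 - D) * g[answer]          # reached only when guessing
    return D + p_guess if answer == key else p_guess

# K = 4 alternatives, uniform guessing, competence 0.8:
g = np.full(4, 0.25)
print(response_prob(2, 2, 0.8, g))  # 0.85 = 0.8 + 0.2 * 0.25
print(response_prob(1, 2, 0.8, g))  # 0.05 = 0.2 * 0.25
```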

Page 62:

Examples of perceptually confusable stimuli

Pages 63-64:

Response bias: individuals or items? (subject-wise bias vs. item-wise bias)

Pages 65-69:

CCT Answer Key: Task 4.

Pages 70-71:

Subject-wise posteriors: Answer 4 (Z4), Answer 16 (Z16), Answer 125 (Z125), Subject 0 bias (g0).

Page 72:

Item-wise posteriors: Answer 4 (Z4) with Item 4 bias (g4); Answer 16 (Z16) with Item 16 bias (g16); Answer 125 (Z125) with Item 125 bias (g125).

Page 73:

Subject-wise model predictions: Task 4.

Page 74:

Subject-wise and item-wise model predictions: Task 4.

Page 75:

Subject-wise model predictions: Task 7.

Page 76:

Subject-wise and item-wise model predictions: Task 7.

Page 77:

✶ CCT was designed to work on small (6-10 subject) samples typical of anthropological studies.

Can we use fewer informants? Would the patterns of results reported for Task 4 be possible with a sample smaller than 30 participants?

Method                     Answer Key Estimate %-correct   Mean Competence   Mean Item Difficulty
Trial 1 - 8 participants   100%                            0.929             0.466
Trial 2 - 8 participants   100%                            0.937             0.460
Trial 3 - 8 participants   100%                            0.914             0.459
Trial 4 - 8 participants   100%                            0.942             0.464
Trial 5 - 8 participants   100%                            0.935             0.464
30 Participants            100%                            0.917             0.366

Preliminary trends suggest 8 participants may be as informative as 30.
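The five trials read as a resampling check. A hedged sketch of that procedure, reusing the illustrative weighted_consensus() function from the earlier sketch (again, the actual analyses used Bayesian CCT, and the agreement metric here is invented for illustration):

```python
import random

def subsample_check(responses, k=8, trials=5, seed=1):
    """Compare answer keys from random k-subject subsamples against the key
    derived from the full sample, mirroring the Trial 1-5 table above."""
    subjects = sorted({s for s, _ in responses})
    full_key, _ = weighted_consensus(responses)
    random.seed(seed)
    for t in range(1, trials + 1):
        chosen = set(random.sample(subjects, k))
        sub = {(s, i): a for (s, i), a in responses.items() if s in chosen}
        sub_key, _ = weighted_consensus(sub)
        match = sum(sub_key[i] == full_key[i] for i in sub_key) / len(sub_key)
        print(f"Trial {t} - {k} participants: {match:.0%} agreement with full key")
```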

Page 78:

Discussion points

• Two (or more) response-strategy "subcultures"?

• Confidence data can help CCT results.

• Quantitative model evaluation.

• Item + individual bias component?

• Automation and integration with other server-side processes (Python module vs. R, Matlab).

Page 79:

Results Summary:

✶ These preliminary results suggest that two novel approaches, piece-wise crowdsourcing and CCT data handling, can be used to accurately transcribe a large corpus of ethnographic data.

✶ By using internet-based methods, it appears we can avoid a 20+ year manual transcription job and derive an accurate and unbiased database of great value to investigations of concept formation across language groups.

✶ The economical way in which we modeled this perceptually based transcription problem seems likely to generalize to other internet-based tasks that require extraction and evaluation of targets embedded in distracting information, and our novel use of CCT analyses seems promising for intelligently aggregating smaller subsets of crowdsourced responses to address large data-handling problems.

Page 80:

Thanks for Listening!!

Funding and Support for the archive project: Calit2 at UCI. University of California Pacific Rim Research Program, 2010-2015 (K.A. Jameson, PI). National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI). UCI's UROP Program Awards. IRB Approvals HS#2013-9921 and 2015-9047.