The Landscape of Crowdsourcing and Transcription (delivered at Duke University Libraries, 2013-11-20)

Upload: benwbrum

Post on 02-Jun-2015

DESCRIPTION

One of the most popular applications of crowdsourcing to cultural heritage is transcription. Since OCR software doesn't recognize handwriting, human volunteers are converting letters, diaries, and log books into formats that can be read, mined, searched, and used to improve collection metadata. But cultural heritage institutions aren't the only organizations working with handwritten material, and many innovations are happening within investigative journalism, citizen science, and genealogy. This talk will present an overview of the landscape of crowdsourced transcription: where it came from, who's doing it, and the kinds of contributions their volunteers make, followed by a discussion of motivation, participation, recruitment, and quality controls. (Video available at http://www.youtube.com/watch?v=jNrTC4Y0_dk )

TRANSCRIPT

  • 1. The Landscape of Crowdsourcing and Transcription. Ben Brumfield, Duke University, November 20, 2013
  • 2. Methodological Origins: What is transcription?
  • 3. Indexing: Structured Data; Extracting from Text; Databases for Search and Analysis; Granular Quality Control; Gamification
  • 4. Editing: Books, Diaries, Letters, Articles; Representing Text; Traditional Editorial Workflow; Digital or Print Editions
  • 5. Community Origins: Libraries and Archives; Documentary & Scholarly Editing; Genealogy; Bioinformatics & Astronomy; Investigative Journalism; Free Culture
  • 6. Libraries and Archives. Material: Hand-written letters, OCRed newspaper articles. Goal: Findability. Format: Plaintext transcripts. Destination: Search engines, finding aids
  • 7. Documentary & Scholarly Editing. Material: Literary drafts, historic correspondence. Goal: High-quality editions. Format: TEI or other XML. Destination: Human-readable print or digital editions
  • 8. Genealogy. Material: Handwritten records. Goal: Findability. Format: Structured data, spreadsheets, proprietary databases. Destination: Searchable databases
  • 9. Bioinformatics. Material: Specimens. Goal: Analysis. Format: Custom databases. Destination: Analytic databases, scientific journals, museum collection databases
  • 10. Investigative Journalism. Material: Receipts, FOIA responses. Goal: Findability. Format: Custom databases. Destination: News articles
  • 11. Free Culture. Material: OCR and e-texts. Goal: Readability. Format: Plaintext, wiki mark-up. Destination: Digital editions
  • 12. How it works: Who are the volunteers?
  • 13. How it works: Who are the volunteers? Why do they volunteer?
  • 14. How it works: Who are the volunteers? Why do they volunteer? What about accuracy?
  • 15. OldWeather Participation: More than 1.6 million weather observations. 16,000 volunteers. 1 million log pages transcribed. Mean contribution of 100 transcriptions per user.
  • 16. OldWeather Participation: More than 1.6 million weather observations. 16,000 volunteers. 1 million log pages transcribed. Mean contribution of 100 transcriptions per user, but this statistic is worthless!
  • 17. Power-law Distribution: Most contributions are made by a core of well-informed enthusiasts. True regardless of project size. What are the implications?
  • 18. Objection! "Can we really believe there is a crowd out there that is capable of producing publishable translations? I suspect that for most medieval topics, the vulgus is just too indoctum to make the effort worthwhile. I cannot see where there is the expertise out there to make this work."
  • 19. "That's when it dawned on me: what you need for mass volunteer projects isn't actually crowd-sourcing, but nerd-sourcing. You need to find, among the vast number of vaguely interested, not very analytical people who look at web sites, the small number of tidy-minded obsessives who care deeply about the ethnic origins of Freddie Mercury or want to analyse statistical data for fun and no profit. And then you need to persuade these people to do as much work for you as you can. The success of mass volunteering, therefore is going to depend heavily on the number of well-informed enthusiasts 'out there'." (Rachel Stone)
  • 20. One Well-Informed Enthusiast. In 14 days: entire diary transcribed; 250 revisions to 43 pages; two dozen footnotes
  • 21. Quality Control
  • 22. Quality Control
  • 23. OldWeather Accuracy: Individual transcriptions are about 97% accurate. Of 1000 transcribed logbook entries: 3 will be lost because of transcription errors; 10 will be illegible; at least 3 will be errors in the logs
  • 24. Costs and Results: Harry Ransom Center Manuscript Fragments; Stadsarchief Leuven Itinera Nova
  • 25. HRC Manuscript Fragments: $0 capital budget. Images captured with a camera phone. Crowdsourcing platform was Flickr. Minimal staffing: 100 unpaid hours (July-October 2012); 10-20 paid hours/week (March-August 2013)
  • 26. HRC Manuscript Fragments [bar chart; vertical axis 0 to 60; categories: Staff, Volunteers, Other Scholars, Unidentified, Unidentifiable]
  • 27. HRC Manuscript Fragments: "What is 30,000+ views, 284 twitter followers, and 147 facebook followers worth? [...] I know for sure that it has suddenly put the HRC on the map for a lot of medievalists who assumed this institution was not all that interested in that area." (Micah Erwin)
  • 28. HRC Manuscript Fragments: "The biggest lesson for me is that you've got to engage your contributors more actively than I did." (Micah Erwin)
  • 29. Free as in puppy! http://www.flickr.com/photos/magnusbrath/7614518858/
  • 30. Itinera Nova
  • 31. Itinera Nova: 765 registers conserved by 2 volunteers; 486 registers photographed; 301,201 images processed (25% by volunteers); 12,033 acts transcribed; 35-40 volunteers, 1 full-time staff member
  • 32. Itinera Nova. Budget for first three years: 4 person-years staff salaries; professional-quality book scanner; software development; tutorial development; exhibitions; conferences
  • 33. More Resources. People: Mia Ridge; Chris Lintott (Zooniverse); Micah Erwin (HRC Manuscript Fragments); Melissa Terras (Transcribe Bentham); Dominic McDevitt-Parks (Wikisource); Paul Flemons (Biodiversity Volunteer Portal)
  • 34. Questions? Ben Brumfield [email protected] http://fromthepage.com/ http://tinyurl.com/TranscriptionToolGDoc http://manuscripttranscription.blogspot.com
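A note on the OldWeather slides: the mean of 100 transcriptions per user is simply 1.6 million observations divided by 16,000 volunteers. The talk calls that figure worthless because, under a power-law distribution, the mean says little about a typical volunteer. The sketch below illustrates the point with a made-up toy project. The `summarize` helper and the `toy` numbers are hypothetical, chosen only so that the mean works out to 100; they are not OldWeather data.

```python
import statistics

def summarize(contributions, top_percent=10):
    """Return the mean, the median, and the share of all work done by the top contributors."""
    ranked = sorted(contributions, reverse=True)
    # Number of volunteers in the top group (at least one).
    top_n = max(1, len(ranked) * top_percent // 100)
    total = sum(ranked)
    return {
        "mean": total / len(ranked),
        "median": statistics.median(ranked),
        "top_share": sum(ranked[:top_n]) / total,
    }

# Hypothetical project: 90 casual volunteers transcribe 1 page each,
# while 10 well-informed enthusiasts transcribe 991 pages each.
toy = [1] * 90 + [991] * 10
stats = summarize(toy)
print(stats)
```

In this toy case the mean is 100 pages per volunteer, yet the median volunteer contributes a single page and the top 10% of volunteers account for about 99% of the work. Median and top-contributor share describe such a project far better than the mean does.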