Digitizing Serialized Fiction
Kirk Hess, DH 2013 – July 17, 2013, kirkhess@illinois.edu


Page 1: Digitizing Serialized Fiction Kirk Hess DH 2013 – July 17, 2013 kirkhess@illinois.edu

Digitizing Serialized Fiction
Kirk Hess
DH 2013 – July 17, 2013
kirkhess@illinois.edu

Page 2:

Serialized Fiction in Farm Newspapers
• LibGuide for Serialized Fiction in the Farm, Field and Fireside collection
• “Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser known writers and even some long-time readers. The value of this publishing model enabled literature to be disseminated to rural communities and expand the bounds of American literary culture across geographic and socioeconomic lines.”

Page 3:

Serialized Fiction in The Farmer’s Wife
• The Farmer’s Wife was published from 1897–1939; April 1906–April 1939 digitized in FFF
• “Many of the stories could be characterized as romance fiction designed to appeal to farm wives”
• Previously indexed in a practicum project and stored in a spreadsheet (link). Intended as a database with a way to link to existing articles.

Page 4:

Newspaper Digitization
• Select newspaper
• Create page images
  • Microfilmed? If not, film; if the film is bad, fix the film
  • Scan film
  • TIFF image, cropped, deskewed
• Article segmentation
  • Process TIFF to Olive specs
  • OCR text; article/ad/image segmentation
• Load into access system (Olive ActivePaper / Veridian)

Page 5:

Finding Serialized Fiction

Software doesn’t make this easy to find:
• No metadata
• OCR problems with newsprint
• Articles span multiple issues, with no links between them

On the other hand…
• The text is there
• The images are there
• The articles are segmented

Page 6:

OCR Issues
• Only administrators can correct text
• A lot of errors, not a lot of people
• Manual process, not easily automatable
• Full text not visible
• Users expect correct text
• Demoed many solutions; coalesced around Omeka (http://omeka.org)
• Moving to Veridian in Fall 2014

Page 7:

Prototype: Omeka/Scripto
• http://uller.grainger.illinois.edu/omeka/
• Workflow: http://hpnl.pbworks.com/w/page/53056034/Omeka%20instructions
• PM/technical lead (Kirk), 4 part-time editors (Olivia, Matt, Shoshana, Carl)
• Completed project in ~4 months, 736 serials

Page 8:

Completed Story
• “The Mysterious McCorkles” by F. Roney Weir
• http://uller.grainger.uiuc.edu/omeka/items/show/20

Page 9:

TEI?
• Requires training and a manual process for full annotations, but lite TEI can be automatically generated from corrected text
• Has some advantages for scholars over plain text
• XTF example: http://uller.grainger.uiuc.edu:8080/xtf/search
• More McCorkles: http://uller.grainger.uiuc.edu:8080/xtf/view?docId=tei/TSF00013/TSF00013.xml&chunk.id=AR00300&toc.id=&brand=default
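The idea of generating lite TEI automatically from corrected text can be sketched as a simple wrapper. This is a hypothetical illustration, not the project's actual schema: the element choices below are just a minimal TEI skeleton, and the title and paragraphs are invented sample data.

```python
from xml.sax.saxutils import escape

def lite_tei(title, paragraphs):
    """Wrap corrected article text in a minimal TEI skeleton (illustrative only)."""
    body = "\n".join(f"      <p>{escape(p)}</p>" for p in paragraphs)
    return f"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>{escape(title)}</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
{body}
    </body>
  </text>
</TEI>"""

xml = lite_tei("The Mysterious McCorkles", ["Chapter I.", "It was a dark night."])
print(xml)
```

A real pipeline would also carry issue, date, and author metadata into the header; the point here is only that a usable lite encoding can be produced mechanically once the text itself is corrected.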

Page 10:

Beyond the Berry Farm
• How can we prioritize work so important text is corrected first?
  • Example: http://uller.grainger.uiuc.edu/omeka/items/show/6
  • Words: 2,876; spelling errors: 55; 98% accuracy
  • Predictive solutions
• How can we identify serialized fiction without having to find it manually and put it in a spreadsheet?
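The prioritization question above could be approached by estimating OCR accuracy from a dictionary check and ranking articles by it. This is a hedged sketch, not the project's method: the dictionary, articles, and ranking direction here are all invented for illustration.

```python
def estimated_accuracy(text, dictionary):
    """Fraction of tokens found in the dictionary: a rough proxy for OCR accuracy."""
    words = [w.strip(".,;!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)

# Toy word list and articles (stand-ins, not project data).
dictionary = {"the", "berry", "farm", "was", "quiet", "and", "green"}
articles = {
    "clean": "The berry farm was quiet and green",
    "noisy": "Tbe b3rry fa1m wns qu_et amd gr3en",
}

# Rank so the noisiest text surfaces first for correction.
ranked = sorted(articles, key=lambda k: estimated_accuracy(articles[k], dictionary))
print(ranked)
```

Whether to correct the worst text first, or the most-read text first, is a policy choice; the estimate only supplies the ordering signal.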

Page 11:

Identifying Serialized Fiction
• Building a feature set
• Common n-grams
  • Chapter (number/Roman numeral)
  • To Be Continued
  • The End
• Topic/genre/theme (romance, children’s stories, holidays, etc.)
• Named entity extraction
• Predictive solutions (Google API)
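The n-gram markers listed above lend themselves to simple pattern matching. A minimal sketch, assuming OCR text is available per article (the regexes and sample text are illustrative, not the project's code):

```python
import re

# Serialization cues from the feature set: chapter headings (Arabic or
# Roman numerals), "To Be Continued", and "The End".
MARKERS = [
    re.compile(r"\bCHAPTER\s+(?:[IVXLC]+|\d+)\b", re.IGNORECASE),
    re.compile(r"\bTO BE CONTINUED\b", re.IGNORECASE),
    re.compile(r"\bTHE END\b", re.IGNORECASE),
]

def serial_features(text):
    """Return the marker patterns that fire on an article's OCR text."""
    return [p.pattern for p in MARKERS if p.search(text)]

sample = "CHAPTER XII. She waited by the gate. [To be continued.]"
print(serial_features(sample))
```

OCR noise will make these patterns miss (e.g. "CIIAPTER"), so in practice they would be one feature among several rather than a standalone detector.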

Page 12:

Topics
• Topic analysis (Latent Dirichlet Allocation), David Blei et al.
• A document contains a finite number of topics, and each word can be assigned to a topic
• Used MALLET (http://mallet.cs.umass.edu/)
• Example output:
  Topic 10: Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread

Page 13:

Network Analysis
• Topics and documents are nodes; a document’s membership in a topic is an edge
• By generating a network graph (Gephi) we can see connections
• By using clustering algorithms, we can see clusters of documents around a topic
• Train a data mining algorithm?
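The graph structure described above can be sketched programmatically. Gephi was the actual tool; networkx is used here only to show how topic and document nodes connect, and the membership data is invented.

```python
import networkx as nx

# Invented document-to-topic assignments.
doc_topics = {
    "doc1": ["topic10"],
    "doc2": ["topic10"],
    "doc3": ["topic3"],
}

G = nx.Graph()
for doc, topics in doc_topics.items():
    G.add_node(doc, kind="document")
    for t in topics:
        G.add_node(t, kind="topic")
        G.add_edge(doc, t)

# Connected components approximate clusters of documents around a topic.
clusters = [sorted(c) for c in nx.connected_components(G)]
print(clusters)
```

On a real topic model every document loads on several topics, so the graph is denser and proper community-detection algorithms (as in Gephi) replace plain connected components.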

Page 14:

Named Entity Extraction
• Proper names interfere with LDA
• Manually generate stop word list: lots of names to find!
• Programmatically find names: Stanford NLP Named Entity Recognizer
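The slides point at Stanford's Named Entity Recognizer for this; as a crude, self-contained stand-in, the sketch below collects mid-sentence capitalized tokens as candidate proper names for a stop word list. Real NER does far better, and the sample sentence is invented.

```python
def candidate_names(text):
    """Naive heuristic: mid-sentence capitalized tokens are likely proper names."""
    names = set()
    for sentence in text.split("."):
        tokens = sentence.split()
        # Skip the first token: sentence-initial capitals are ambiguous.
        for tok in tokens[1:]:
            if tok[:1].isupper() and tok[1:].islower():
                names.add(tok)
    return sorted(names)

text = "Barney drove the wagon to town. He met Marigold and Anne there."
print(candidate_names(text))
```

Note the heuristic's weaknesses: it misses sentence-initial names ("Barney" above) and would catch capitalized pronouns mid-sentence, which is exactly why a trained recognizer is the better tool.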

Page 15:

NLTK
• Similar to the movie review sample: a small subset of articles, a Naive Bayes classifier built with NLTK, top 2,000 words
• >>> classifier.show_most_informative_features(5)
  contains(having) = True        fictio : nonfic = 1.9 : 1.0
  contains(plan) = True          fictio : nonfic = 1.9 : 1.0
  contains(growing) = True       fictio : nonfic = 1.9 : 1.0
  contains(entertaining) = True  fictio : nonfic = 1.9 : 1.0
  contains(home) = True          fictio : nonfic = 1.9 : 1.0
• High accuracy (> 0.95) but weak ratios
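The movie-review-style setup above can be reproduced in miniature: `contains(word)` feature dicts fed to NLTK's Naive Bayes classifier. The labeled snippets are invented stand-ins for the article subset, not project data.

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    """Bag-of-words features in the movie-review style: contains(word) = True."""
    return {f"contains({w.lower()})": True for w in text.split()}

# Toy training data standing in for the labeled article subset.
train = [
    (features("chapter romance continued heart"), "fiction"),
    (features("love story chapter wedding"), "fiction"),
    (features("butter prices market report"), "nonfiction"),
    (features("corn yield weather report"), "nonfiction"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("a new chapter of romance")))
classifier.show_most_informative_features(5)
```

With so little data the ratios are weak, which mirrors the slide's finding: high headline accuracy but individually uninformative features.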

Page 16:

Next Steps
• Implement Veridian
• Crowdsource OCR correction
• Direct access to index (Solr)
• Continue NLP research using NLTK