Digitizing Serialized Fiction
Kirk Hess
DH 2013 – July 17, 2013
kirkhess@illinois.edu

Serialized Fiction in Farm Newspapers
• Libguide for Serialized Fiction in the Farm, Field and Fireside collection
• “Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser known writers and even some long-time readers. The value of this publishing model enabled literature to be disseminated to rural communities and expand the bounds of American literary culture across geographic and socioeconomic lines.”

Serialized Fiction in the Farmer’s Wife
• Farmer’s Wife was published from 1897-1939; April 1906-April 1939 digitized in FFF
• “Many of the stories could be characterized as romance fiction designed to appeal to farm wives”
• Previously indexed in practicum project; stored in spreadsheet (link). Intended as a database with a way to link to existing articles.

Newspaper Digitization
• Select Newspaper
• Create page images
  • Microfilmed?
  • If not, film
  • If film bad, fix film
• Scan film
  • TIFF image, cropped, deskewed
• Article Segmentation
  • Process TIFF to Olive specs
  • OCR text, Article/Ad/Image segmentation
• Load to access system (Olive Active Paper/Veridian)

Finding Serialized Fiction
Software doesn’t make this easy to find
• No metadata
• OCR problems with newsprint
• Articles span multiple issues, no links between them
On the other hand…
• The text is there
• The images are there
• The articles are segmented

OCR issues
• Only administrators
• A lot of errors, not a lot of people
• Manual process, not easily automatable
• Full text not visible
• Users expect correct text
• Demo’d many solutions, coalesced around Omeka (http://omeka.org)
• Moving to Veridian Fall 2014

Prototype Omeka/Scripto
• http://uller.grainger.illinois.edu/omeka/
• Workflow: http://hpnl.pbworks.com/w/page/53056034/Omeka%20instructions
• PM/Technical Lead (Kirk), 4 part-time editors (Olivia, Matt, Shoshana, Carl)
• Completed project in ~4 months, 736 serials

Completed story
• THE MYSTERIOUS MCCORKLES by F. Roney Weir
• http://uller.grainger.uiuc.edu/omeka/items/show/20

TEI?
• Requires training; full annotation is a manual process, but lite TEI can be automatically generated from corrected text
• Has some advantages for scholars over plain text
• XTF Example
• http://uller.grainger.uiuc.edu:8080/xtf/search
• More McCorkles
• http://uller.grainger.uiuc.edu:8080/xtf/view?docId=tei/TSF00013/TSF00013.xml&chunk.id=AR00300&toc.id=&brand=default
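
As a rough illustration of generating lite TEI from corrected text, the sketch below wraps plain text in a minimal TEI skeleton. It is not the project's actual conversion: the function name `text_to_lite_tei`, the blank-line paragraph rule, and the placeholder header statements are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def text_to_lite_tei(title, author, corrected_text):
    """Wrap corrected plain text in a minimal TEI document (hypothetical helper).

    Assumption: paragraphs in the corrected text are separated by blank lines.
    """
    # Serialize TEI as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    title_stmt = el(file_desc, "titleStmt")
    el(title_stmt, "title", title)
    el(title_stmt, "author", author)
    # Placeholder statements; a real header would describe the source issue.
    el(el(file_desc, "publicationStmt"), "p", "Unpublished sketch.")
    el(el(file_desc, "sourceDesc"), "p", "Generated from corrected OCR text.")

    body = el(el(root, "text"), "body")
    for block in corrected_text.split("\n\n"):
        paragraph = " ".join(block.split())  # collapse line wraps inside a paragraph
        if paragraph:
            el(body, "p", paragraph)
    return ET.tostring(root, encoding="unicode")
```

Anything beyond paragraphs (chapter divisions, names, page breaks) would still need the manual annotation noted above.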

Beyond the Berry Farm
• How can we prioritize work so important text is corrected first?
Example:
• http://uller.grainger.uiuc.edu/omeka/items/show/6
• Words: 2876, spelling errors 55, 98% accuracy
• Predictive solutions
• How can we identify serialized fiction without having to find it manually and put it in a spreadsheet?
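
The accuracy figure in the example can be reproduced as 1 minus spelling errors divided by words. The sketch below does that arithmetic and shows one possible way to order a correction queue, least accurate first. The function names are hypothetical, and spell-check accuracy is only one signal; it says nothing about which text is important.

```python
def ocr_accuracy(words, spelling_errors):
    """Fraction of words that pass the spell check."""
    if words <= 0:
        raise ValueError("words must be positive")
    return 1 - spelling_errors / words


def rank_for_correction(items):
    """Order (item_id, words, spelling_errors) tuples, least accurate first."""
    return sorted(items, key=lambda item: ocr_accuracy(item[1], item[2]))


# Figures from the slide: 2876 words, 55 spelling errors.
print(f"{ocr_accuracy(2876, 55):.1%}")  # 98.1%
```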

Identifying Serialized Fiction
• Building a Feature set
• Common N-Grams
  • Chapter (number/roman numeral)
  • To Be Continued
  • The End
• Topic/Genre/Theme (Romance, children's stories, holidays, etc.)
• Named entity extraction
• Predictive solutions (Google API)
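
The common n-gram features listed above could be detected with a few regular expressions. This is a minimal sketch, not the feature extractor used in the project: the function and feature names are assumptions, and Roman numerals are only checked up to the low range a chapter heading is likely to use.

```python
import re

CHAPTER = re.compile(r"\bchapter\s+(\w+)", re.IGNORECASE)
# Roman numerals from I to roughly LXXXIX; enough for chapter headings.
ROMAN = re.compile(r"(xl|l?x{0,3})(ix|iv|v?i{0,3})", re.IGNORECASE)
TO_BE_CONTINUED = re.compile(r"\bto\s+be\s+continued\b", re.IGNORECASE)
THE_END = re.compile(r"\bthe\s+end\W*\Z", re.IGNORECASE)


def has_chapter_heading(text):
    """True if 'chapter' is followed by an Arabic or Roman numeral."""
    for match in CHAPTER.finditer(text):
        token = match.group(1)
        if token.isdigit() or ROMAN.fullmatch(token):
            return True
    return False


def serial_features(article_text):
    """Boolean features hinting that an article is an installment of a serial."""
    return {
        "has_chapter_heading": has_chapter_heading(article_text),
        "has_to_be_continued": bool(TO_BE_CONTINUED.search(article_text)),
        "ends_with_the_end": bool(THE_END.search(article_text)),
    }
```

On uncorrected OCR these patterns will miss misrecognized markers, so they are better treated as features for a classifier than as a filter on their own.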

Topics
• Topic Analysis (Latent Dirichlet Allocation), David Blei, et al.
• A document contains a finite number of topics, and each word can be assigned to a topic
• Used Mallet (http://mallet.cs.umass.edu/)
• Example output:
Topic 10
Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread

Network Analysis
• Topics and Documents are nodes, docs in topics are edges.
• By generating a network graph (Gephi) we can see connections
• By using clustering algorithms, we can see clusters of documents around a topic
• Train data mining algorithm?
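
One way to get from topic-model output to a graph like the one described above is to write a document-to-topic edge list that Gephi can import as a spreadsheet. In this sketch the input structure `doc_topics`, the 0.2 share threshold, and the function names are assumptions; only the `Source`, `Target`, `Weight` column headers follow Gephi's edge-table convention.

```python
import csv
import io


def topic_edges(doc_topics, threshold=0.2):
    """Yield (document, topic, weight) edges for topic shares at or above threshold.

    doc_topics maps a document id to a list of topic proportions (hypothetical format).
    """
    for doc_id, shares in doc_topics.items():
        for topic_id, share in enumerate(shares):
            if share >= threshold:
                yield doc_id, f"topic_{topic_id}", share


def edges_to_csv(edges):
    """Serialize edges as CSV text with Gephi's edge-table column names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()
```

Raising the threshold thins the graph to each document's dominant topics, which tends to make clusters easier to see.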

Named Entity Extraction
• Proper names interfere with LSA
• Manually generate stop word list
• Lots of names to find!
• Programmatically find names
• Stanford NLP Named Entity Recognizer
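
To illustrate the idea of building a name stop list programmatically, here is a crude capitalization heuristic. It is a stand-in for illustration only and is not the Stanford Named Entity Recognizer; the function name, the `min_count` parameter, and the rule itself (capitalized mid-sentence, never seen in lowercase) are assumptions.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z]+|[.!?]")
SENTENCE_END = {".", "!", "?"}


def candidate_names(text, min_count=2):
    """Guess proper names: words capitalized mid-sentence and never seen in lowercase."""
    counts = Counter()
    lowercase_seen = set()
    sentence_start = True
    for token in TOKEN.findall(text):
        if token in SENTENCE_END:
            sentence_start = True
            continue
        if token.islower():
            lowercase_seen.add(token)
        elif token[0].isupper() and token[1:].islower() and not sentence_start:
            counts[token] += 1
        sentence_start = False
    return {
        name
        for name, n in counts.items()
        if n >= min_count and name.lower() not in lowercase_seen
    }
```

The resulting set could be added to the stop word list before topic modeling, so that character names do not dominate the topics.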

NLTK
• Similar to Movie Review sample using a small subset of articles, Naïve Bayes Classifier using NLTK, top 2000 words
• >>> classifier.show_most_informative_features(5)
• contains(having) = True fictio : nonfic = 1.9 : 1.0
• contains(plan) = True fictio : nonfic = 1.9 : 1.0
• contains(growing) = True fictio : nonfic = 1.9 : 1.0
• contains(entertaining) = True fictio : nonfic = 1.9 : 1.0
• contains(home) = True fictio : nonfic = 1.9 : 1.0
• High accuracy (> .95) but weak ratios
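
Each ratio in the output above compares how likely a feature is under one label versus the other, so 1.9 : 1.0 means the word is less than twice as likely to appear in one class. The sketch below computes that kind of ratio by hand for a single `contains(word)` feature. It is a simplified illustration, not NLTK's implementation; the function name and the add-0.5 smoothing are assumptions of this example.

```python
def feature_ratio(labeled_docs, word, smoothing=0.5):
    """Return (label, ratio) for the feature contains(word) = True.

    labeled_docs is a list of (set_of_words, label) pairs. The ratio is
    P(word present | likeliest label) / P(word present | other label),
    with add-`smoothing` estimates so unseen words do not divide by zero.
    """
    present = {}
    totals = {}
    for words, label in labeled_docs:
        totals[label] = totals.get(label, 0) + 1
        present[label] = present.get(label, 0) + (word in words)
    probs = {
        label: (present[label] + smoothing) / (totals[label] + 2 * smoothing)
        for label in totals
    }
    top = max(probs, key=probs.get)
    low = min(probs, key=probs.get)
    return top, probs[top] / probs[low]
```

A strongly discriminating word would show a ratio well above 2, which is why uniform values of 1.9 suggest the classifier has not found distinctive vocabulary despite its high accuracy.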

Next Steps
• Implement Veridian
• Crowdsource OCR correction
• Direct access to index (Solr)
• Continue NLP research using NLTK