cultural text mining workshop
![Page 1: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/1.jpg)
Cultural text mining
Pim Huijnen
Utrecht University
DH Autumn School @ Uni Trier, October 1, 2015
![Page 2: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/2.jpg)
Translantis: goals & methods
Critical Digital Humanities
Cultural text mining: Workflows
Assignments: looking for the right words
![Page 3: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/3.jpg)
Translantis: goals & methods
Critical Digital Humanities
Cultural text mining: Workflows
Assignment: looking for the right words
![Page 4: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/4.jpg)
www.translantis.nl
![Page 5: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/5.jpg)
National Library The Hague
~9,000,000 digitized newspaper pages, 1618-1995
![Page 6: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/6.jpg)
Digital Humanities Approaches to Reference Cultures; The Emergence of the United States in Public Discourse in the Netherlands, 1890-1990
“…uses digital technologies to analyze the role of reference cultures in debates about social issues and collective identities, looking specifically at the emergence of the United States in public discourse in the Netherlands from the end of the nineteenth century to the end of the Cold War.”
![Page 7: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/7.jpg)
The United States as a reference culture
Business
Society
Consumption
Media
Crime
Health
![Page 8: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/8.jpg)
Translantis: goals & methods
Critical Digital Humanities
Cultural text mining: Workflows
Assignment: looking for the right words
![Page 9: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/9.jpg)
Critical Digital Humanities
Transparency: user has to understand tools
Flexibility: constant to-and-fro between close and distant reading
![Page 10: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/10.jpg)
Queries / title selection
Visibility of underlying data
Variable time frame
Export function (csv)
Linguistic and statistical settings
Histogram
Word cloud
![Page 11: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/11.jpg)
BILAND
Query: ‘Heredity’ (1876) (22/1465 hits)
![Page 12: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/12.jpg)
BILAND
Query: ‘Heredity’ (1935) (1465 hits)
![Page 13: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/13.jpg)
Critical Digital Humanities
Transparency: user has to understand tools
Flexibility: constant to-and-fro between close and distant reading
Greatest benefit of digital tools is in exploring data, not in providing evidence
![Page 14: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/14.jpg)
Exploratory text mining
“[R]igorous mathematics is not necessarily essential for using data efficiently and effectively. In particular, working with data can be playful and exploratory and deliberately without the mathematical rigor that social scientists must use to support their epistemological claims.”
Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).
![Page 15: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/15.jpg)
Exploratory text mining
“In other words, data does not always have to be used as evidence, but can be simply for discovering and framing research questions. […] [P]laying with data – in all its formats and forms – is more important than ever.”
Frederick W. Gibbs and Trevor J. Owens, ‘The Hermeneutics of Data and Historical Writing’, in: Kristen Nawrotzki and Jack Dougherty (eds.), Writing History in the Digital Age (Ann Arbor, MI: University of Michigan Press, 2013).
![Page 16: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/16.jpg)
Critical Digital Humanities
Transparency: user has to understand tools
Flexibility: constant to-and-fro between close and distant reading
Greatest benefit of digital tools is in exploring data, not in providing evidence
No one-size-fits-all solutions
![Page 17: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/17.jpg)
Translantis: goals & methods
Critical Digital Humanities
Cultural text mining: Workflows
Assignment: looking for the right words
![Page 18: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/18.jpg)
Digital workflows (1)
Export of subset as a csv file
Stripping file of redundant metadata
Upload files into Voyant for further analysis
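The stripping step above could be sketched in Python roughly as follows. The column names (`id`, `date`, `paper`, `url`, `text`) are assumptions made up for illustration; the slides do not show the actual layout of the exported csv file.

```python
import csv
import io

def strip_metadata(src, dst, keep=("date", "text")):
    """Copy a CSV export, keeping only the columns named in `keep`."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(keep), extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # columns not listed in `keep` are dropped

# Hypothetical export with three redundant metadata columns.
export = io.StringIO(
    "id,date,paper,url,text\n"
    "1,1924-11-20,Nieuwe Tilburgsche Courant,http://example.org/1,Moderne productie\n"
)
cleaned = io.StringIO()
strip_metadata(export, cleaned)
print(cleaned.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `newline=""`, and the cleaned file is what would then be uploaded into Voyant.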
![Page 19: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/19.jpg)
![Page 20: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/20.jpg)
Digital workflows (2)
Export of subset as a csv file
Stripping file of redundant metadata
Splitting csv into csv-per-year
Upload files into Voyant for further analysis
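Splitting the export into one csv per year might look like the sketch below. It assumes a hypothetical `date` column whose values start with a four-digit year; the real export may use a different column name or date format.

```python
import csv
import io
from collections import defaultdict

def split_by_year(src, date_field="date"):
    """Group the rows of a CSV export by the year at the start of the date column."""
    reader = csv.DictReader(src)
    by_year = defaultdict(list)
    for row in reader:
        year = row[date_field][:4]  # assumption: dates look like YYYY-MM-DD
        by_year[year].append(row)
    return reader.fieldnames, dict(by_year)

def write_year_files(fieldnames, by_year, prefix="subset_"):
    """Write one CSV file per year, e.g. subset_1924.csv (hypothetical naming)."""
    for year, rows in by_year.items():
        with open(f"{prefix}{year}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

# Invented sample rows, only to show the grouping.
sample = io.StringIO(
    "date,text\n"
    "1919-03-01,Taylor stelsel\n"
    "1919-07-15,wetenschappelijke bedrijfsleiding\n"
    "1924-11-20,moderne productie\n"
)
fields, groups = split_by_year(sample)
print({year: len(rows) for year, rows in groups.items()})
```

Calling `write_year_files(fields, groups)` would then produce the per-year files to upload into Voyant, so that change over time becomes visible as change between documents.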
![Page 21: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/21.jpg)
Digital workflows (2)
Subset “scientific management” (ca. 1200 articles 1919-1939): waning identification with Taylor(ism)
![Page 22: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/22.jpg)
Digital workflows (3)
Export of subset as csv file
Stripping file of redundant metadata
Getting concordances of a word in TextSTAT
Saving the concordance file and uploading it into Voyant for further analysis
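To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is not how TextSTAT works internally; the tokenization (lowercased word characters) and the sample sentence are assumptions for illustration.

```python
import re

def concordance(text, keyword, width=5):
    """Return keyword-in-context lines: `width` tokens on either side of each hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# Invented sample sentence.
sample = "In Amerika is het Taylor-stelsel een voorbeeld. Na den oorlog volgde Europa Amerika."
for line in concordance(sample, "Amerika", width=2):
    print(line)
```

Saving these lines to a text file gives a document that contains only the immediate context of the keyword, which is what makes a word cloud of it informative.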
![Page 23: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/23.jpg)
Digital workflows (3)
Word cloud of concordances of “Amerika” in “scientific management” subset (1918-1939): words referring to other countries, to Taylor(ism), to work. Also note: “voorbeeld” (“model”), “oorlog” (“war”)
![Page 24: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/24.jpg)
Constants of computational methods in historical research
Time factor
Comparative perspective
Focus on language
![Page 25: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/25.jpg)
Time factor
![Page 26: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/26.jpg)
Comparative perspective
![Page 27: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/27.jpg)
Focus on language
Capitalism ≠ “Capitalism”
![Page 28: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/28.jpg)
Focus on language
Historical spelling variations
Ambiguity between word and concept
Ambiguity in word meaning
![Page 29: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/29.jpg)
Overcoming language restrictions: dictionaries
![Page 30: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/30.jpg)
Searching with large queries
Productie OR wetenschappelijk OR loon OR arbeid OR leiding OR systeem OR wetenschap OR taak OR methode OR stelsel OR studie OR kennis OR geschikt OR winst OR resultaten OR snelheid OR vermeerdering OR geld
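A query like this is tedious to type by hand. A small helper, sketched here with a hypothetical function name, can assemble it from a word list; it assumes only the `OR` syntax visible in the query above.

```python
def build_or_query(words):
    """Join unique, non-empty words into one OR query, preserving their order."""
    seen = []
    for word in words:
        word = word.strip()
        # Skip blanks and case-insensitive duplicates.
        if word and word.lower() not in (w.lower() for w in seen):
            seen.append(word)
    return " OR ".join(seen)

terms = ["productie", "wetenschappelijk", "loon", "arbeid", "Loon", " "]
print(build_or_query(terms))
```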
![Page 31: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/31.jpg)
But how to find the right words?
![Page 32: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/32.jpg)
Translantis: goals & methods
Critical Digital Humanities
Cultural text mining: Workflows
Assignment: looking for the right words
http://bit.do/CTM_Trier
![Page 33: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/33.jpg)
Using topic models to find context-specific words
Topic modeling (sets of) texts
Use (interesting) output topic modeling as (combinations of) keywords for further exploration
![Page 34: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/34.jpg)
‘Moderne productie-beginselen’, Nieuwe Tilburgsche Courant, 20-11-1924
![Page 35: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/35.jpg)
Found for “product* machine* verspilling bedrijf goedkoop kwaliteit”, 1900-1909
Het nieuws van den dag voor Nederlandsch-Indië, 16-05-1903
Algemeen Handelsblad, 19-05-1906
![Page 36: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/36.jpg)
Sub-corpus topic modeling
![Page 37: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/37.jpg)
Topic modeling
Representing topics in collection of documents
Use statistics to find topics represented by groups of words
Document is a mix of topics
Topic is a mix of words
Documents and words can be directly observed, topics are latent
![Page 38: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/38.jpg)
Topic modeling
Given a collection of documents, the modeling process does two things:
Create word probability distribution for topics
Create topic probability distribution for documents
Both are purely based on frequency and co-occurrence of words
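The raw signal a topic model works from can be illustrated with simple counts. The sketch below, on three invented mini-documents, tallies word frequencies and document-level co-occurrences; it only shows the input statistics, not the modeling itself.

```python
from collections import Counter
from itertools import combinations

# Invented mini-corpus: each string stands for one document.
docs = [
    "arbeid loon productie",
    "productie machine arbeid",
    "oorlog vrede",
]

frequency = Counter()
cooccurrence = Counter()
for doc in docs:
    words = doc.split()
    frequency.update(words)
    # Count each unordered pair of distinct words once per document.
    for pair in combinations(sorted(set(words)), 2):
        cooccurrence[pair] += 1

print(frequency["arbeid"])
print(cooccurrence[("arbeid", "productie")])
print(cooccurrence[("arbeid", "oorlog")])
```

Words that keep appearing in the same documents ("arbeid" and "productie" here) are the ones a model will tend to place in the same topic; words that never co-occur ("arbeid" and "oorlog") will tend to be separated.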
![Page 39: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/39.jpg)
Mallet LDA
Mallet uses Latent Dirichlet Allocation
Reducing high-dimensional term vector space to low-dimensional 'latent' topic space
Iterative sampling to establish topics, word-topic distribution and topic-document distribution
After sufficiently many iterations, the distributions stabilize
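The iterative sampling can be illustrated with a toy collapsed Gibbs sampler. This is a teaching sketch, not Mallet's implementation: the hyperparameters `alpha` and `beta`, the iteration count and the four invented mini-documents are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    # Count tables: document-topic, topic-word, and tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to document fit and word fit.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Turn the final counts into smoothed probability distributions.
    word_given_topic = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]
    topic_given_doc = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return word_given_topic, topic_given_doc

# Invented mini-corpus with two obvious themes.
docs = [
    "arbeid loon productie arbeid loon".split(),
    "productie arbeid loon machine".split(),
    "oorlog vrede leger oorlog".split(),
    "leger vrede oorlog".split(),
]
word_given_topic, topic_given_doc = lda_gibbs(docs, num_topics=2)
for k, dist in enumerate(word_given_topic):
    print(k, sorted(dist, key=dist.get, reverse=True)[:3])
```

The two returned structures correspond to the word-topic and topic-document distributions named on the slide. On a corpus this small the result depends heavily on the random seed, which is one reason to treat topics as exploratory leads rather than evidence.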
![Page 40: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/40.jpg)
1) Gathering a corpus
Topic modeling (sets of) texts
Use (interesting) output topic modeling as (combinations of) keywords for further exploration
![Page 41: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/41.jpg)
Mallet LDA
Project Gutenberg
https://www.gutenberg.org
Hathi Trust Digital Library
https://www.hathitrust.org
![Page 42: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/42.jpg)
2) Topic modeling
Topic modeling (sets of) texts
Use (interesting) output topic modeling as (combinations of) keywords for further exploration
![Page 43: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/43.jpg)
Mallet LDA
Mallet command line
http://mallet.cs.umass.edu
Mallet GUI
https://code.google.com/p/topic-modeling-tool/
![Page 44: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/44.jpg)
Mallet GUI settings
Stopword list
No. of iterations
No. of topics
No. of words per topic
(Splitting of corpus)
![Page 45: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/45.jpg)
Mallet GUI output
![Page 46: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/46.jpg)
3) Further exploration
Topic modeling (sets of) texts
Use (interesting) output topic modeling as (combinations of) keywords for further exploration
![Page 47: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/47.jpg)
Use interesting words to build a dictionary
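One way to go from topics to a dictionary is to keep every topic that contains a word you already care about and pool its words. The sketch below assumes the topic-model output has already been read into plain lists of top words; the function name, the topics and the seed word are invented for illustration.

```python
def build_dictionary(topics, seed_words, exclude=()):
    """Collect words from every topic that contains at least one seed word."""
    seeds = {w.lower() for w in seed_words}
    excluded = {w.lower() for w in exclude}
    dictionary = set()
    for words in topics:
        lowered = [w.lower() for w in words]
        if seeds & set(lowered):
            dictionary.update(w for w in lowered if w not in excluded)
    return sorted(dictionary)

# Hypothetical topic-model output: one list of top words per topic.
topics = [
    ["arbeid", "loon", "productie", "stelsel"],
    ["oorlog", "vrede", "leger"],
    ["productie", "machine", "verspilling"],
]
print(build_dictionary(topics, seed_words=["productie"], exclude=["stelsel"]))
```

The `exclude` argument is there because selection still needs a human eye: not every word in an interesting topic belongs in the dictionary.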
![Page 48: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/48.jpg)
Word use in specific corpora
Chronicling America
http://chroniclingamerica.loc.gov
Europeana Newspaper Project
http://www.theeuropeanlibrary.org/tel4/access
Dutch historical newspapers
http://www.delpher.nl
http://corpus.byu.edu
![Page 49: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/49.jpg)
Word frequencies, collocations, etc.
AntConc
http://www.laurenceanthony.net/software.html
Taporware
http://taporware.ualberta.ca
TextSTAT
http://neon.niederlandistik.fu-berlin.de/en/textstat/
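What these tools compute for collocations can be approximated in a few lines. The sketch below counts the words that appear within a fixed window around a keyword; the window size, tokenization and sample text are assumptions, and the listed tools offer more refined statistics than raw counts.

```python
import re
from collections import Counter

def collocates(text, keyword, window=2):
    """Count words occurring within `window` tokens of each hit of `keyword`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample text.
sample = "Amerika als voorbeeld. Het voorbeeld van Amerika na den oorlog."
print(collocates(sample, "Amerika").most_common(3))
```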
![Page 50: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/50.jpg)
Word use over time
Google Books ngram viewer
https://books.google.com/ngrams
NY Times ngram viewer
http://chronicle.nytlabs.com
Chronicling America ngram viewer
http://bookworm.culturomics.org/ChronAm/
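For a corpus of your own, a comparable view of word use over time can be computed directly. The sketch below returns occurrences per 1,000 tokens for each year, so that years with more text do not dominate; the `(year, text)` input format and the sample texts are invented for illustration.

```python
import re
from collections import Counter

def relative_frequency_by_year(docs, word):
    """Return {year: occurrences of `word` per 1,000 tokens} for (year, text) pairs."""
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        tokens = re.findall(r"\w+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    # Skip years without any tokens to avoid dividing by zero.
    return {year: 1000 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}

# Invented sample texts, not real newspaper data.
docs = [
    (1919, "Taylor en het stelsel van Taylor"),
    (1939, "het stelsel van wetenschappelijke bedrijfsleiding zonder Taylor genoemd te worden"),
]
print(relative_frequency_by_year(docs, "Taylor"))
```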
![Page 51: Cultural text mining workshop](https://reader031.vdocuments.us/reader031/viewer/2022021922/58ed88b51a28ab723c8b465b/html5/thumbnails/51.jpg)
Word use over time