Zoss High-Level Text Analysis and Techniques


DESCRIPTION

2012 Oct 25 presentation by Angela Zoss (Duke University) for Duke University Libraries' Text > Data digital scholarship series

TRANSCRIPT

Page 1: Zoss High-Level Text Analysis and Techniques

HIGH-LEVEL TEXT ANALYSIS AND TECHNIQUES

Angela Zoss
Data Visualization Coordinator
226 Perkins Library
[email protected]

Duke University Libraries, Digital Scholarship
Text > Data, October 25

Page 2: Zoss High-Level Text Analysis and Techniques

DOCUMENTS AS CONTEXT

Page 3: Zoss High-Level Text Analysis and Techniques

But first,

ANGELA AS CONTEXT

Page 4: Zoss High-Level Text Analysis and Techniques

How I learned to love the document.

B.A. courses: Linguistics, Communication

M.S. courses: Communication, Human-Computer Interaction

Employment: arXiv.org Administrator

Ph.D. courses:

• Bibliometrics/Scientometrics

• Computer-Mediated Discourse Analysis

• Latent Structure Analysis

• Natural Language Processing

Page 5: Zoss High-Level Text Analysis and Techniques

Now,

DOCUMENTS AS CONTEXT

Page 6: Zoss High-Level Text Analysis and Techniques

Text analysis from…

• documents down to words (“low-level”)

• words up to documents (“high-level”)

Page 7: Zoss High-Level Text Analysis and Techniques

Using documents to learn about language (or other social phenomena)

Analyzing documents as records/proxies of language, social structures, events, etc.

Linguistic studies: morphology, word counts, syntax, etc. …

• over time (e.g., Google Ngram Viewer)

• language across corpora (e.g., political speeches)

Underwood, T. (2012). Where to start with text mining.

Page 8: Zoss High-Level Text Analysis and Techniques

Using documents to learn about language

Historical culturomics of pronoun frequencies
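As a rough illustration of counting pronoun frequencies over time, here is a minimal sketch. The toy corpus, the pronoun set, and the function names are all invented for this example and are not taken from the study shown on the slide. Counts are normalized by the total number of tokens per year so that years with more text do not dominate.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs invented for illustration.
CORPUS = [
    (1900, "He said that he would go. She stayed."),
    (1900, "They knew he was right."),
    (2000, "She said that she would go. He stayed."),
    (2000, "They knew she was right, and she knew it too."),
]

# Assumed pronoun set; a real study would choose this list deliberately.
PRONOUNS = {"he", "she", "they"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_rates(corpus, pronouns=PRONOUNS):
    """Return {year: {pronoun: count / total tokens in that year}}."""
    totals = Counter()
    counts = {}
    for year, text in corpus:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        year_counts = counts.setdefault(year, Counter())
        year_counts.update(t for t in tokens if t in pronouns)
    return {
        year: {p: counts[year][p] / totals[year] for p in sorted(pronouns)}
        for year in sorted(counts)
    }


rates = pronoun_rates(CORPUS)
for year, by_pronoun in rates.items():
    print(year, {p: round(r, 3) for p, r in by_pronoun.items()})
```

Relative rates, rather than raw counts, are what make periods of different size comparable.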

Page 9: Zoss High-Level Text Analysis and Techniques

Using documents to learn about language

Universal properties of mythological networks

Page 10: Zoss High-Level Text Analysis and Techniques

Using language to learn about documents

Analyzing documents as artifacts themselves, with their own properties and dynamics

Literary, documentary studies:

• Structural/rhetorical/stylistic analysis

• Document categorization, classification

• Detecting clusters of document features (topic modeling)

Underwood, T. (2012). Where to start with text mining.

Page 13: Zoss High-Level Text Analysis and Techniques

What are documents?

For this discussion, digital versions of works of spoken or written language

Examples: books, articles, transcripts, emails, tweets…

Page 14: Zoss High-Level Text Analysis and Techniques

Documents as context

Documents have:

• form(at)

• style

• provenance

• entities

• intentions

Page 15: Zoss High-Level Text Analysis and Techniques

STUDIES OF DOCUMENTS

Page 16: Zoss High-Level Text Analysis and Techniques

Why study documents?

• Describe a corpus

• Compare/organize documents

• Locate relevant information/filter out irrelevant information

Page 17: Zoss High-Level Text Analysis and Techniques

Describing a corpus

• Finding regularities/differences across groups of documents

• Developing theories of structure, style, etc. that can then be tested or applied

• May be manual (content analysis) or computer-assisted (statistical)
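A computer-assisted description of a corpus can start with a few summary statistics per group of documents. The sketch below is only one possible starting point: the groups, the texts, and the choice of statistics (token count, type-token ratio, mean sentence length) are assumptions made for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical document groups, invented for illustration.
GROUPS = {
    "abstracts": [
        "We study text. We report results.",
        "Methods vary. Results vary too.",
    ],
    "letters": [
        "Dear friend, I hope this long letter finds you well and happy.",
    ],
}


def describe(texts):
    """Return simple descriptive statistics for a list of document texts."""
    all_tokens = [
        tok for text in texts for tok in re.findall(r"[a-z']+", text.lower())
    ]
    # Naive sentence split on terminal punctuation; adequate for a sketch only.
    sentences = [
        s for text in texts for s in re.split(r"[.!?]+", text) if s.strip()
    ]
    return {
        "documents": len(texts),
        "tokens": len(all_tokens),
        "type_token_ratio": (
            len(set(all_tokens)) / len(all_tokens) if all_tokens else 0.0
        ),
        "mean_sentence_length": (
            len(all_tokens) / len(sentences) if sentences else 0.0
        ),
    }


summary = {name: describe(texts) for name, texts in GROUPS.items()}
for name, stats in summary.items():
    print(name, stats)
```

Comparing such figures across groups is one way to surface regularities or differences that can then be examined by hand.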

Page 18: Zoss High-Level Text Analysis and Techniques

Example: Storylines

http://xkcd.com/657/

Page 19: Zoss High-Level Text Analysis and Techniques

Differences of format, genre, participants…

• Articles may have sections, but these will vary by discipline and type of article

• Books may be fiction or non-fiction (or both)

• Transcripts may refer to multiple speakers, non-text content

• …ad infinitum

Page 20: Zoss High-Level Text Analysis and Techniques

Example: Literature Fingerprinting

Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122). doi: 10.1109/VAST.2007.4389004

Page 21: Zoss High-Level Text Analysis and Techniques

Organizing documents

Detect similarity between documents and a known category (or simply among themselves)

Supports browsing, sentiment analysis, authorship detection

Page 22: Zoss High-Level Text Analysis and Techniques

Example: Bohemian Bookshelf

Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.

Page 23: Zoss High-Level Text Analysis and Techniques

Similarity based on…

• common document attributes: authorship, genre

• common language patterns: topics, phrases

• common entity references: characters, citations
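Similarity based on common language patterns can be sketched by comparing word-count vectors. The example below uses cosine similarity, which is one common choice and not necessarily the measure behind any of the examples in this talk. The documents and function names are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical documents, invented for illustration.
DOCS = {
    "whale_1": "the whale and the sea and the ship",
    "whale_2": "the ship chased the whale across the sea",
    "garden": "roses grow in the quiet garden",
}


def word_counts(text):
    """Return a bag-of-words Counter for the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = {name: word_counts(text) for name, text in DOCS.items()}
sim_whales = cosine_similarity(vectors["whale_1"], vectors["whale_2"])
sim_mixed = cosine_similarity(vectors["whale_1"], vectors["garden"])
print(round(sim_whales, 3), round(sim_mixed, 3))
```

The same function applies to other features: replace the word counts with counts of entity references or citations to compare documents on those attributes instead.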

Page 24: Zoss High-Level Text Analysis and Techniques

Example: Quantitative Formalism

Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An experiment. Pamphlets of the Stanford Literary Lab (vol. 1).

Page 25: Zoss High-Level Text Analysis and Techniques

Example: Clinton’s DNC Speech

http://b.globe.com/TogUqq

Page 27: Zoss High-Level Text Analysis and Techniques

Classification

• assigning an object to a single class

• often supervised, using an existing classification scheme and a tagged corpus
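To make supervised classification concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, trained on a tiny tagged corpus. Naive Bayes is chosen here only as a familiar example; the slide does not name a particular algorithm. The training texts and labels are invented.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tagged corpus: (text, class label) pairs invented for illustration.
TAGGED = [
    ("the senate passed the budget bill", "politics"),
    ("voters elected a new senator", "politics"),
    ("the team won the final match", "sports"),
    ("the striker scored a late goal", "sports"),
]


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(tagged):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in tagged:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(text, model):
    """Assign the text to the single class with the highest log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TAGGED)
print(classify("the senator proposed a bill", model))
print(classify("the striker won the match", model))
```

Each document receives exactly one label, drawn from the classes present in the tagged corpus, which is the sense of classification used on this slide.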

Page 28: Zoss High-Level Text Analysis and Techniques

Example: Relative signatures

Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012 (pp. 103-112).
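The idea of working at the level of character n-grams can be sketched as follows. This is not the authors' implementation: the profile size, the n-gram length, and the dissimilarity formula (a sum of squared relative differences, of a kind often used with n-gram profiles) are assumptions made for illustration, as are the sample texts.

```python
from collections import Counter


def ngram_profile(text, n=3, size=50):
    """Return the `size` most frequent character n-grams with relative frequencies."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in grams.most_common(size)}


def profile_dissimilarity(p, q):
    """Sum of squared relative frequency differences over n-grams in either profile."""
    total = 0.0
    for gram in set(p) | set(q):
        fp, fq = p.get(gram, 0.0), q.get(gram, 0.0)
        total += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return total


# Hypothetical sample texts, invented for illustration.
a = ngram_profile("the cat sat on the mat")
b = ngram_profile("the cat sat on the hat")
c = ngram_profile("zzz qqq xxx www")

print(profile_dissimilarity(a, b) < profile_dissimilarity(a, c))
```

An n-gram that appears in only one profile contributes the maximum per-item value of 4, so texts with little shared character-level material end up far apart.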

Page 29: Zoss High-Level Text Analysis and Techniques

Categorization

• assigning documents to one or more categories

• suggestive of unsupervised clustering techniques

• design choices made to fit particular tasks or goals
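An unsupervised grouping of documents can be sketched without any predefined categories. The example below links documents whose word sets overlap strongly (Jaccard similarity at or above a threshold) and returns the connected groups. Note the design choices, all of which are assumptions for illustration: the similarity measure, the threshold of 0.3, and the fact that this simple scheme places each document in exactly one group, whereas categorization in general may assign more than one category.

```python
import re

# Hypothetical documents, invented for illustration.
DOCS = {
    "a": "stars and galaxies in deep space",
    "b": "galaxies and stars beyond our space",
    "c": "bread recipes with flour and yeast",
    "d": "yeast bread and flour for baking",
}


def word_set(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Group document ids into connected components of the similarity graph."""
    sets = {doc_id: word_set(text) for doc_id, text in docs.items()}
    unvisited = set(sets)
    clusters = []
    for start in sorted(sets):
        if start not in unvisited:
            continue
        unvisited.remove(start)
        component, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            linked = [
                other for other in sorted(unvisited)
                if jaccard(sets[current], sets[other]) >= threshold
            ]
            for other in linked:
                unvisited.remove(other)
                component.append(other)
                frontier.append(other)
        clusters.append(sorted(component))
    return clusters


groups = cluster(DOCS)
print(groups)
```

Raising or lowering the threshold changes how many groups emerge, which is one small instance of fitting design choices to the task at hand.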

Page 30: Zoss High-Level Text Analysis and Techniques

Example: UCSD Map of Science

Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., & Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS ONE, 7(7), e39464.

Page 32: Zoss High-Level Text Analysis and Techniques

Reference systems, infrastructure

What do we gain by adding structure?

What do we lose?

Page 33: Zoss High-Level Text Analysis and Techniques

SUMMARIZING DOCUMENTS

Page 34: Zoss High-Level Text Analysis and Techniques

Text is only one component of a document.

Research questions often push us to be creative with how we operationalize constructs.

The richness of language and documents is best preserved by using multiple, complementary approaches.