TRANSCRIPT
1/ 24
TopicExplorer - Qualitative Text Mining for Humanities and Social Sciences

Alexander Hinneburg (1), Christian Oberländer (2), Christian Papilloud (3), Anne Purschwitz (4)

(1) Computer Science, Martin-Luther-University Halle-Wittenberg
(2) Japanese Studies, Martin-Luther-University Halle-Wittenberg
(3) Social Science, Martin-Luther-University Halle-Wittenberg
(4) History, IZEA, Martin-Luther-University Halle-Wittenberg

May 2017
https://blogs.urz.uni-halle.de/topicexplorer/
Introduction 2/ 24
Outline
Introduction
TopicExplorer
  Visualization techniques and user interface
  TopicFrames
  Topics and Meta Data
OCR and Topic Models
Conclusion
Introduction 3/ 24
Introduction
- Qualitative text analysis is a methodological foundation for many disciplines in the humanities and social sciences
- Main challenges for software support
  - corroboration of results
  - semi-automatic, content-based preparation, indexing and summarization of large text collections
- Goal of the DFG grant application
  - build an infrastructure for the TopicExplorer system to offer qualitative text mining as a self-service
  - use case: media impact analysis
  - use case: deep interview analysis & history of sociology
  - use case: Halle journals of the 18th century; includes application of OCR software for historic printings
- TopicExplorer
  - visualization techniques and user interface
  - topics and meta data for context information
TopicExplorer 4/ 24
TopicExplorer 5/ 24
Short Overview of Topic Models [BNJ03, Ble12]
- Latent Dirichlet Allocation (LDA) assumes that a document is generated as follows
  - draw the topic mixture proportions θ of the document from a Dirichlet distribution
  - generate the words of the document as follows
    - draw the topic z of the word to be generated from a multinomial with parameters θ
    - draw the word w from the topic-specific multinomial over the vocabulary that is indicated by z and parameterized by ϕ
- LDA model
- Inference estimates the distributions over the hidden variables z, θ, ϕ ⇒ each word is assigned to a topic
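The generative story above can be sketched in a few lines of Python; the sizes, priors, and random seed below are toy assumptions for illustration, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and symmetric priors (illustrative assumptions).
K, V = 3, 8            # number of topics, vocabulary size
alpha, beta = 0.5, 0.1

# phi: one multinomial over the vocabulary per topic.
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(n_words):
    """Follow the LDA generative story for one document."""
    theta = rng.dirichlet(np.full(K, alpha))       # topic mixture proportions
    topics = rng.choice(K, size=n_words, p=theta)  # topic z per word
    words = [rng.choice(V, p=phi[z]) for z in topics]  # word w from phi_z
    return list(zip(words, topics))

doc = generate_document(20)
```

Inference runs this story backwards: given only the words, it estimates θ, ϕ and a topic z for every token, which is exactly the per-word assignment TopicExplorer visualizes.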
TopicExplorer Visualization techniques and user interface 6/ 24
TopicExplorer Visualization techniques and user interface 7/ 24
Overview of the TopicExplorer User Interface [HPS12, HRPO14]
https://blogs.urz.uni-halle.de/topicexplorer/
TopicExplorer Visualization techniques and user interface 8/ 24
Similarity-based Layout of Topics
- Algorithm
  - compute all pairwise topic similarities
  - compute a hierarchical clustering
  - lay out the dendrogram tree; each leaf is a topic
- A rainbow color mapping reflects the computed topic order
  - color acts like a visual hash function: similar colors indicate similar topics with high likelihood
- Color provides orientation for the user at three levels
  - topics
  - document browsing and ranking
  - word assignments of single documents
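The algorithm above can be sketched with SciPy; the toy topic-word distributions and the choice of Jensen-Shannon distance with average linkage are assumptions for illustration, not necessarily TopicExplorer's exact settings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
K, V = 6, 20
phi = rng.dirichlet(np.full(V, 0.1), size=K)  # toy topic-word distributions

# Pairwise topic similarities via Jensen-Shannon distance between
# topic-word distributions, then hierarchical clustering.
dist = pdist(phi, metric="jensenshannon")
tree = linkage(dist, method="average")

# The dendrogram's left-to-right leaf order places similar topics next
# to each other; mapping that order onto a rainbow gives the color hash.
order = leaves_list(tree)
hues = {int(topic): i / K for i, topic in enumerate(order)}  # hue in [0, 1)
```

Because similar topics end up adjacent in the leaf order, they receive nearby hues, which is what makes similar colors a likely sign of similar topics.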
TopicExplorer Visualization techniques and user interface 9/ 24
Document Browsing and Ranking
- Documents are ranked by decreasing topic affinity
- Colored circles indicate the four most important topics of each document
- Color acts as a visual hash function
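A minimal sketch of this ranking, assuming the document-topic proportions θ from the model are available (the toy sizes and seed are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 5, 6
theta = rng.dirichlet(np.full(K, 0.5), size=D)  # document-topic proportions

def rank_documents(theta, topic):
    """Indices of all documents, sorted by decreasing affinity to `topic`."""
    return np.argsort(-theta[:, topic])

def top_topics(theta, doc, n=4):
    """The n most important topics of one document (the colored circles)."""
    return np.argsort(-theta[doc])[:n]

ranking = rank_documents(theta, topic=0)
circles = top_topics(theta, doc=int(ranking[0]))
```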
TopicExplorer Visualization techniques and user interface 10/ 24
Document Inspection
- The document is displayed with the topic assignments of its words
- Color acts as a visual hash function
TopicExplorer Visualization techniques and user interface 11/ 24
Hierarchical Topics
- Initial situation
- Preview of topic merging according to the topic hierarchy
- Merged topic (interactive splitting allows the merge to be reversed)
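One way to realize such a reversible merge is to pool the token-topic counts of the two topics; the counts below are invented, and this is a sketch of the idea, not necessarily TopicExplorer's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
V = 10
# Hypothetical token counts of two topics over the vocabulary.
counts = rng.integers(1, 50, size=(2, V))

# Merging pools the token-topic counts; the merged word distribution
# is the count-weighted mixture of the two original topics.
merged_counts = counts.sum(axis=0)
merged_phi = merged_counts / merged_counts.sum()

# Keeping the original per-topic counts makes the merge reversible:
# splitting simply restores the two separate count vectors.
```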
TopicExplorer TopicFrames 12/ 24
TopicExplorer TopicFrames 13/ 24
Motivation of TopicFrames [HRPO14]
I TopicExplorer
- Problem
  - topic models give no guarantees on interpretability
  - not all topics are well interpretable
  - topics represented by word lists may leave room for several different interpretations
TopicExplorer TopicFrames 14/ 24
Towards interpretable Topic Representations
- The most probable words are not always the best representation of a topic
- More context for the top words is helpful
- Results on topic coherence
  - humans find a topic interpretable if pairs of its top words often appear close together in documents [Lau et al., EACL 2014]
- Observation
  - top word lists of topics are often dominated by nouns, while verbs are less prominent
- Frames
  - basic communication units for semantic content
  - noun-verb collocations
- Topic frames
  - topic-specific noun-verb collocations
- How to define and compute topic frames?
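The coherence finding cited above can be illustrated with a small document co-occurrence score; the toy corpus and the UMass-style formula below are illustrative assumptions, not the measure used by Lau et al.:

```python
import math
from itertools import combinations

# Toy corpus: each document reduced to its set of words.
docs = [
    {"bank", "money", "loan"},
    {"bank", "river"},
    {"money", "loan", "debt"},
    {"bank", "loan"},
]

def coherence(top_words, docs, eps=1.0):
    """UMass-style coherence: topics whose top-word pairs frequently
    appear together in the same document score higher."""
    score = 0.0
    for w1, w2 in combinations(top_words, 2):
        d2 = sum(w2 in d for d in docs)               # documents containing w2
        d12 = sum(w1 in d and w2 in d for d in docs)  # documents containing both
        score += math.log((d12 + eps) / d2)
    return score

score = coherence(["bank", "loan", "money"], docs)
```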
TopicExplorer TopicFrames 15/ 24
Topic Frames
- Definition
  - a topic frame occurs when a noun and a verb appear close together in a document and the topic model assigned both tokens to the same topic
- Topics with top word lists that are not clearly interpretable
- Topic frames make topics more clearly interpretable
TopicExplorer TopicFrames 16/ 24
Workflow and Implementation
- Linguistic data preparation
  - tokenization, important for Japanese documents
  - part-of-speech tagging
  - lemmatization
- Topic modeling
  - works with word lemmata
  - topic assignments are translated back to the original documents
- Topic frame detection
  - find noun-verb collocations with both tokens assigned to the same topic
  - filter out frame occurrences that cross sentence delimiters
  - count distinct topic frames and the number of frame occurrences
[Screenshots: topic frame occurrences; keywords with topics]
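The detection step can be sketched directly from the definition; the token tuples, window size, and sentence-delimiter encoding below are invented for illustration:

```python
# Hypothetical tokenized document: (lemma, part-of-speech, topic) per token,
# with None marking a sentence delimiter that blocks frames.
tokens = [
    ("government", "NOUN", 2), ("raise", "VERB", 2), ("tax", "NOUN", 2),
    None,  # sentence boundary
    ("river", "NOUN", 5), ("flow", "VERB", 5),
]

def topic_frames(tokens, window=3):
    """Count noun-verb pairs that appear close together, within one
    sentence, and were assigned to the same topic by the model."""
    frames = {}
    for i, tok in enumerate(tokens):
        if tok is None or tok[1] != "NOUN":
            continue
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            other = tokens[j]
            if other is None:  # never cross a sentence delimiter
                break
            if other[1] == "VERB" and other[2] == tok[2]:
                key = (tok[0], other[0], tok[2])  # (noun, verb, topic)
                frames[key] = frames.get(key, 0) + 1
    return frames

frames = topic_frames(tokens)
```

Note that ("tax", ...) yields no frame: the sentence boundary right after it blocks the search, which is exactly the delimiter filtering described above.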
TopicExplorer Topics and Meta Data 17/ 24
Outline
Introduction
TopicExplorerVisualization techniques and user interfaceTopicFramesTopics and Meta Data
OCR and Topic Models
Conclusion
TopicExplorer Topics and Meta Data 18/ 24
Topics and Meta Data: Time
- The document collection consists of blogs with time stamps
- Temporal information is not used for topic modeling
- Instead, the token assignments to topics are analyzed over time
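A minimal sketch of this post-hoc aggregation, with invented time stamps and topic assignments; the point is that time only enters after the model has been fit:

```python
from collections import Counter
from datetime import date

# Hypothetical (timestamp, topic) pairs, one per token-topic assignment.
assignments = [
    (date(2017, 1, 5), 0), (date(2017, 1, 9), 0), (date(2017, 1, 20), 1),
    (date(2017, 2, 2), 0), (date(2017, 2, 14), 1), (date(2017, 2, 15), 1),
]

# Aggregate token-topic assignments per month; the time stamps never
# enter the topic model itself, they are only joined in afterwards.
per_month = Counter(((d.year, d.month), topic) for d, topic in assignments)
```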
TopicExplorer Topics and Meta Data 19/ 24
Topics and Meta Data: General Case
- Documents come with all kinds of meta data
  - texts, the topic model and meta data are stored in a relational database [RH16]
  - derived linguistic meta data: POS, NER, sentiment, ...
  - geographic entities mentioned in documents
  - images, videos, links, ...
- Token assignments to topics can be analyzed in the context of meta-data entities using aggregation and/or intra-document proximity
- TopicExplorer uses automatically configured workflows to allow future extension to all kinds of meta data:
[Workflow diagram: automatically configured database steps such as DocumentCreate/Fill, TopicCreate/Fill, DocumentTopicCreate/Fill, TermCreate/Fill, DocumentTermTopicCreate/Fill, InFilePreparation, Mallet, TokenTopicAssociator, TopicMetaData, and the HierarchicalTopic_*, ColorTopic_* and Text_* variants]
OCR and Topic Models 20/ 24
OCR and Topic Models 21/ 24
OCR for historic printings
- Abbyy (commercial) and Tesseract (open source) need extensive training to handle historic printings
- New neural-network techniques for OCR are available: OCRopus
- Neural networks may require less and simpler training [Spr16]
- The DFG funds the OCR-D project [1] to enhance available OCR software

[1] http://ocr-d.de
OCR and Topic Models 22/ 24
Combining OCR and Topic Models
- Evaluating models of latent document semantics in the presence of OCR errors [WLR10]

  "Though model quality declines as errors increase, simple feature selection techniques enable the learning of relatively high quality models even as word error rates approach 50%."

- Towards noise-resilient document modeling [YL11]
- On handling textual errors in latent document modeling [YL13]

  "Using both real and synthetic data sets with varying degrees of errors, our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic)."
Conclusion 23/ 24
Conclusion 24/ 24
Conclusion
- Topic models are helpful for qualitative text analysis
- The TopicExplorer user interface helps to move from statistical models towards semantic interpretations
- Future work
  - evaluate new OCR methods for historic printings
  - combine OCR and topic modeling
David M. Blei.
Probabilistic topic models.
Commun. ACM, 55(4):77–84, April 2012.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, 2003.
Alexander Hinneburg, Rico Preiss, and René Schröder.
TopicExplorer: Exploring document collections with topic models.
In Machine Learning and Knowledge Discovery in Databases, pages 838–841. Springer, 2012.
Alexander Hinneburg, Frank Rosner, Stefan Pessler, and Christian Oberländer.
Exploring document collections with topic frames.
In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, pages 2084–2086, New York, NY, USA, 2014. ACM.
Frank Rosner and Alexander Hinneburg.
Translating bayesian networks into entity relationship models.
In Conceptual Modeling - 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings, pages 65–72, 2016.
Uwe Springmann.
OCR für alte Drucke.
Informatik-Spektrum, 39(6):459–462, 2016.
Daniel D. Walker, William B. Lund, and Eric K. Ringger.
Evaluating models of latent document semantics in the presence of OCR errors.
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 240–250, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
Tao Yang and Dongwon Lee.
Towards noise-resilient document modeling.
In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2345–2348, New York, NY, USA, 2011. ACM.
Tao Yang and Dongwon Lee.
On handling textual errors in latent document modeling.
In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 2089–2098, New York, NY, USA, 2013. ACM.