TopicExplorer - Qualitative Text Mining for Humanities and Social Sciences

Alexander Hinneburg¹, Christian Oberländer², Christian Papilloud³, Anne Purschwitz⁴

¹ Computer Science, Martin-Luther-University Halle-Wittenberg
² Japanese Studies, Martin-Luther-University Halle-Wittenberg
³ Social Science, Martin-Luther-University Halle-Wittenberg
⁴ History, IZEA, Martin-Luther-University Halle-Wittenberg

May 2017
https://blogs.urz.uni-halle.de/topicexplorer/


Outline

- Introduction
- TopicExplorer
  - Visualization techniques and user interface
  - TopicFrames
  - Topics and Meta Data
- OCR and Topic Models
- Conclusion


Introduction

- Qualitative text analysis is a methodological foundation for many disciplines in the humanities and social sciences
- Main challenges for software support
  - corroboration of results
  - semi-automatic, content-based preparation, indexing and summarization of large text collections
- Goal of the DFG grant application
  - build an infrastructure for the TopicExplorer system to offer qualitative text mining as a self-service
  - Use case: media impact analysis
  - Use case: deep interview analysis & history of sociology
  - Use case: Halle journals of the 18th century (includes application of OCR software for historic printings)
- TopicExplorer
  - Visualization techniques and user interface
  - Topics and meta data for context information


Short Overview of Topic Models [BNJ03, Ble12]

- Latent Dirichlet Allocation (LDA) assumes that a document is generated as follows
  - draw the topic mixture proportions θ of the document from a Dirichlet distribution
  - generate the words of the document as follows
    - draw the topic z of the word to be generated from a multinomial with parameters θ
    - draw the word w from the topic-specific multinomial over the vocabulary that is indicated by z and parameterized by ϕ
- LDA model (figure)
- Inference estimates distributions over the hidden variables z, θ, ϕ ⇒ each word is assigned to a topic
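
The generative process described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used by TopicExplorer. The function names, the toy vocabulary, the topic-word distributions `phi` and the Dirichlet parameters `alpha` are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, phi, vocabulary, rng):
    """Generate one document following the LDA generative process.

    alpha: Dirichlet parameters, one per topic (assumed hyperparameters)
    phi:   phi[k] is the word distribution of topic k over `vocabulary`
    """
    theta = sample_dirichlet(alpha, rng)      # topic mixture of the document
    topics = range(len(alpha))
    document = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]        # topic of this word
        w = rng.choices(vocabulary, weights=phi[z])[0]   # word drawn from topic z
        document.append((w, z))
    return theta, document

rng = random.Random(0)
vocabulary = ["ocr", "print", "topic", "word"]   # toy vocabulary (assumption)
phi = [[0.5, 0.5, 0.0, 0.0],   # toy topic 0
       [0.0, 0.0, 0.5, 0.5]]   # toy topic 1
theta, doc = generate_document(8, alpha=[0.5, 0.5], phi=phi,
                               vocabulary=vocabulary, rng=rng)
```

Inference runs in the opposite direction: given only the words, it estimates z, θ and ϕ, which is what yields a topic assignment for every word.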


Overview of TopicExplorer User Interface [HPS12, HRPO14]

https://blogs.urz.uni-halle.de/topicexplorer/


Similarity-based Layout of Topics

- Algorithm
  - Compute all pairwise topic similarities
  - Compute hierarchical clustering
  - Lay out the dendrogram tree; each leaf is a topic
- Rainbow color mapping reflects the computed topic order
  - Color acts like a visual hash function: similar color indicates similar topic with high likelihood.
- Color provides orientation for the user at three levels
  - Topics
  - Document browsing and ranking
  - Word assignments of single documents
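
The three algorithm steps and the color mapping can be sketched as follows. The slides do not say which similarity measure or linkage TopicExplorer uses, so cosine similarity, average linkage and the hue range are assumptions chosen for illustration. The function names and toy topic-word vectors are hypothetical as well.

```python
import colorsys
import math

def cosine(p, q):
    """Cosine similarity of two topic-word vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def leaf_order(topic_word, similarity=cosine):
    """Average-linkage agglomerative clustering; returns topics in dendrogram leaf order."""
    n = len(topic_word)
    sim = [[similarity(topic_word[i], topic_word[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]   # each cluster keeps its leaves in layout order
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                avg = sum(sim[i][j] for i, j in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]   # merge the two most similar clusters
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]

def rainbow_colors(order):
    """Map the position in the leaf order to a hue, so neighbours get similar colors."""
    n = len(order)
    # Hue range [0, 0.8] avoids wrapping from violet back to red.
    return {t: colorsys.hsv_to_rgb(0.8 * pos / max(n - 1, 1), 1.0, 1.0)
            for pos, t in enumerate(order)}

topic_word = [
    [0.7, 0.2, 0.1, 0.0],   # topic 0
    [0.0, 0.1, 0.2, 0.7],   # topic 1
    [0.6, 0.3, 0.1, 0.0],   # topic 2, similar to topic 0
    [0.1, 0.1, 0.3, 0.5],   # topic 3, similar to topic 1
]
order = leaf_order(topic_word)
colors = rainbow_colors(order)
```

Because similar topics end up as neighbouring leaves, they also receive neighbouring hues, which is the "visual hash" effect the slide describes.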


Document Browsing and Ranking

- Documents are ranked by decreasing topic affinity
- Colored circles indicate the four most important topics
- Color acts as visual hash function
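
A minimal sketch of this ranking, assuming that "topic affinity" means the proportion of a document assigned to the selected topic. The slides do not define the measure, so that reading, the function name `rank_documents` and the toy data are assumptions.

```python
def rank_documents(doc_topic, topic, top_n=4):
    """Rank documents by decreasing affinity to `topic`; list each document's top topics."""
    ranked = sorted(doc_topic, key=lambda d: doc_topic[d][topic], reverse=True)
    result = []
    for doc in ranked:
        weights = doc_topic[doc]
        # Indices of the top_n most important topics of this document.
        top = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)[:top_n]
        result.append((doc, top))
    return result

# Toy document-topic proportions for five topics (hypothetical values).
doc_topic = {
    "blog-1": [0.10, 0.60, 0.05, 0.05, 0.20],
    "blog-2": [0.50, 0.10, 0.20, 0.15, 0.05],
    "blog-3": [0.05, 0.30, 0.40, 0.05, 0.20],
}
ranking = rank_documents(doc_topic, topic=1)
```

Each entry pairs a document with the topics that would be drawn as its colored circles.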


Document Inspection

- Document is represented with topic assignments of words
- Color acts as visual hash function


Hierarchical Topics

- Initial situation

- Preview of topic merging according to topic hierarchy

- Merged topic (allows interactive splitting to reverse the merging)


Motivation of TopicFrames [HRPO14]

- TopicExplorer
- Problem
  - Topic models give no guarantees on interpretability
  - Not all topics are well interpretable
  - Topics represented by word lists may leave room for several different interpretations


Towards interpretable Topic Representations

- The most probable words are not always the best option
- More context of top words is helpful
  - Results on topic coherence
    - Humans find a topic interpretable if pairs of top words often appear close together in documents [Lau et al. EACL 2014]
- Observation
  - Top word lists of topics are often dominated by nouns, while verbs are less prominent
- Frames
  - Basic communication units for semantic contents
  - Noun-verb co-locations
- Topic Frames
  - Topic-specific noun-verb co-location
- How to define and compute topic frames?


Topic Frames

- Definition
  - A topic frame occurs when a noun and a verb appear close together in a document and the topic model assigned both tokens to the same topic.
- Topics with top word lists that are not clearly interpretable.
- Topic frames make topics more clearly interpretable.


Workflow and Implementation

- Linguistic data preparation
  - Tokenization, important for Japanese documents
  - Part-of-speech tagging
  - Lemmatization
- Topic modeling
  - work with word lemmata
  - translate topic assignments back to original documents
- Topic frame detection
  - find noun-verb co-locations with tokens assigned to same topic
  - filter frame occurrences with sentence delimiters
  - count different topic frames and number of frame occurrences

[Figure labels: Topic Frame Occurrence; Keywords with Topics]
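
The topic frame detection step can be sketched as below. The input is assumed to be a token list that already carries lemma, part-of-speech tag and topic assignment from the earlier workflow steps. The window size of three tokens, the tag names `NOUN`/`VERB`, the delimiter set and the function name are assumptions for illustration, since the slides only say that noun and verb appear "close together".

```python
from collections import Counter

SENTENCE_DELIMITERS = {".", "!", "?"}   # assumed delimiter set

def topic_frames(tokens, window=3):
    """Count noun-verb co-locations whose tokens share a topic.

    tokens: list of (lemma, pos, topic); topic is None for unassigned tokens
            such as punctuation. `window` is the assumed maximum token distance.
    """
    counts = Counter()
    for i, (lemma_i, pos_i, topic_i) in enumerate(tokens):
        if pos_i != "NOUN" or topic_i is None:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            lemma_j, pos_j, topic_j = tokens[j]
            if pos_j != "VERB" or topic_j != topic_i:
                continue
            between = tokens[min(i, j) + 1:max(i, j)]
            if any(lemma in SENTENCE_DELIMITERS for lemma, _, _ in between):
                continue   # frame would cross a sentence boundary
            counts[(topic_i, lemma_i, lemma_j)] += 1
    return counts

tokens = [
    ("government", "NOUN", 2), ("raise", "VERB", 2), ("tax", "NOUN", 2), (".", "PUNCT", None),
    ("reader", "NOUN", 5), ("discuss", "VERB", 2), ("tax", "NOUN", 2), (".", "PUNCT", None),
]
frames = topic_frames(tokens)
n_distinct_frames = len(frames)            # number of different topic frames
n_occurrences = sum(frames.values())       # number of frame occurrences
```

In the toy input, "reader" and "discuss" are adjacent but belong to different topics, so they do not form a topic frame, and "tax ... discuss" across the first period is filtered out by the sentence delimiter.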


Topics and Meta Data: Time

- Document collection consists of blogs with time stamps
- Temporal information is not used for topic modeling
- Instead: token assignments to topics are analyzed over time
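
Analyzing token assignments over time amounts to grouping them by a time bucket after the topic model has been fitted. The sketch below assumes monthly buckets and ISO date strings. Both choices, as well as the function name and the toy data, are assumptions and are not taken from the slides.

```python
from collections import Counter, defaultdict

def topic_counts_by_month(documents):
    """Aggregate token-topic assignments per month.

    documents: iterable of (timestamp, token_topics), where timestamp is an
    ISO date string 'YYYY-MM-DD' and token_topics lists the topic of each token.
    """
    series = defaultdict(Counter)
    for timestamp, token_topics in documents:
        month = timestamp[:7]               # 'YYYY-MM'
        series[month].update(token_topics)  # count tokens per topic
    return series

documents = [
    ("2011-03-12", [0, 0, 1, 2]),
    ("2011-03-20", [0, 1, 1]),
    ("2011-04-02", [2, 2, 2, 1]),
]
series = topic_counts_by_month(documents)
```

The time stamps never enter the topic model itself. They are only used afterwards to slice the assignments.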


Topics and Meta Data: General Case

- Documents come with all kinds of meta data
  - Texts, topic model and meta data are stored in a relational database [RH16]
  - Derived linguistic meta data: POS, NER, Sentiment, ...
  - Geographic entities mentioned in documents
  - Images, Videos, Links, ...
- Token assignments to topics can be analyzed in the context of meta-data entities using
  - Aggregation and/or Intra-Document Proximity
- TopicExplorer uses automatically configured workflows to allow future extension to all kinds of meta data:

[Figure: workflow graph with the nodes InFilePreparation, Mallet, TokenTopicAssociator, TopicMetaData, DocumentCreate, DocumentFill, DocumentTopicCreate, DocumentTopicFill, DocumentTermTopicCreate, DocumentTermTopicFill, TopicCreate, TopicFill, TermCreate, TermFill, TermTopicCreate, TermTopicFill, Text_DocumentCreate, Text_DocumentFill, Text_TopicCreate, Text_TopicFill, ColorTopic_TopicCreate, ColorTopic_TopicFill, HierarchicalTopic_TopicCreate, HierarchicalTopic_TopicFill, HierarchicalTopic_TermTopicFill, HierarchicalTopic_DocumentTopicFill, PruneWordType_DocumentTermTopicCreate]
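
One way to configure such a workflow automatically is to have every step declare its prerequisites and derive the execution order with a topological sort. The sketch below uses step names that appear in the workflow figure, but the dependency edges are hypothetical: the transcript does not preserve the arrows of the graph, and the slides do not state how TopicExplorer resolves the order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# step -> steps that must finish first (edges are assumptions for illustration)
dependencies = {
    "Mallet": {"InFilePreparation"},
    "DocumentFill": {"DocumentCreate"},
    "TopicFill": {"TopicCreate", "Mallet"},
    "DocumentTopicFill": {"DocumentTopicCreate", "DocumentFill", "TopicFill"},
}

def execution_order(deps):
    """Return one valid execution order in which prerequisites always run first."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
```

Under this design, supporting a new kind of meta data only requires adding its steps and their prerequisites to the mapping. The order is then recomputed without touching the existing entries.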


OCR for historic printings

- Abbyy (commercial) and Tesseract (open source) need extensive training to handle historic printings
- New neural network techniques for OCR are available: OCRopus
- Neural networks may require less and simpler training [Spr16]
- DFG funds the OCR-D project¹ for enhancing available OCR software

¹ http://ocr-d.de


Combining OCR and Topic Models

- Evaluating models of latent document semantics in the presence of OCR errors [WLR10]

  "Though model quality declines as errors increase, simple feature selection techniques enable the learning of relatively high quality models even as word error rates approach 50%."

- Towards Noise-resilient Document Modeling [YL11]

- On handling textual errors in latent document modeling [YL13]

  "Using both real and synthetic data sets with varying degrees of errors, our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic)."
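
The quote from [WLR10] mentions "simple feature selection techniques" without naming them on the slide. As one plausible illustration, and explicitly an assumption rather than the method of that paper, the sketch below prunes word types by document frequency before topic modeling. OCR errors tend to produce rare, misspelled word types, so a frequency cutoff removes many of them. The function name and the toy data are hypothetical.

```python
from collections import Counter

def prune_vocabulary(documents, min_doc_freq=2):
    """Keep only word types that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))   # count each word type once per document
    keep = {w for w, df in doc_freq.items() if df >= min_doc_freq}
    return [[w for w in tokens if w in keep] for tokens in documents]

documents = [
    ["journal", "halle", "joumal"],    # "joumal" is a simulated OCR error
    ["journal", "halle", "ha1le"],     # "ha1le" is a simulated OCR error
    ["journal", "printing", "halle"],
]
pruned = prune_vocabulary(documents)
```

The cutoff also drops rare but correct words such as "printing" in the toy data, which is the usual trade-off of this kind of filter.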


Conclusion

- Topic models are helpful for qualitative text analysis
- TopicExplorer user interface helps to go from statistical models towards semantic interpretations
- Future Work
  - evaluate new OCR methods for historic printings
  - combine OCR and topic modeling

References

[Ble12] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012.

[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[HPS12] Alexander Hinneburg, Rico Preiss, and René Schröder. TopicExplorer: Exploring document collections with topic models. In Machine Learning and Knowledge Discovery in Databases, pages 838–841. Springer, 2012.

[HRPO14] Alexander Hinneburg, Frank Rosner, Stefan Pessler, and Christian Oberländer. Exploring document collections with topic frames. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM '14, pages 2084–2086, New York, NY, USA, 2014. ACM.

[RH16] Frank Rosner and Alexander Hinneburg. Translating Bayesian networks into entity relationship models. In Conceptual Modeling - 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings, pages 65–72, 2016.

[Spr16] Uwe Springmann. OCR für alte Drucke. Informatik-Spektrum, 39(6):459–462, 2016.

[WLR10] Daniel D. Walker, William B. Lund, and Eric K. Ringger. Evaluating models of latent document semantics in the presence of OCR errors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 240–250, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[YL11] Tao Yang and Dongwon Lee. Towards noise-resilient document modeling. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2345–2348, New York, NY, USA, 2011. ACM.

[YL13] Tao Yang and Dongwon Lee. On handling textual errors in latent document modeling. In Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, CIKM '13, pages 2089–2098, New York, NY, USA, 2013. ACM.