TRANSCRIPT
1/ 24
TopicExplorer - Qualitative Text Mining for Humanities and Social Sciences

Alexander Hinneburg (1), Christian Oberländer (2), Christian Papilloud (3), Anne Purschwitz (4)

(1) Computer Science, Martin-Luther-University Halle-Wittenberg
(2) Japanese Studies, Martin-Luther-University Halle-Wittenberg
(3) Social Science, Martin-Luther-University Halle-Wittenberg
(4) History, IZEA, Martin-Luther-University Halle-Wittenberg

May 2017
https://blogs.urz.uni-halle.de/topicexplorer/
Introduction 2/ 24
Outline
Introduction
TopicExplorer
  Visualization techniques and user interface
  TopicFrames
  Topics and Meta Data
OCR and Topic Models
Conclusion
Introduction 3/ 24
Introduction
- Qualitative text analysis is a methodological foundation for many disciplines in the humanities and social sciences
- Main challenges for software support
  - corroboration of results
  - semi-automatic, content-based preparation, indexing and summarization of large text collections
- Goal of the DFG grant application
  - build an infrastructure for the TopicExplorer system to offer qualitative text mining as a self-service
  - use case: media impact analysis
  - use case: deep interview analysis & history of sociology
  - use case: Halle journals of the 18th century; includes application of OCR software for historic printings
- TopicExplorer
  - visualization techniques and user interface
  - topics and meta data for context information
TopicExplorer 4/ 24
TopicExplorer 5/ 24
Short Overview of Topic Models [BNJ03, Ble12]
- Latent Dirichlet Allocation (LDA) assumes that a document is generated as follows
  - draw the topic mixture proportions θ of the document from a Dirichlet distribution
  - generate the words of the document as follows
    - draw the topic z of the word to be generated from a multinomial with parameters θ
    - draw the word w from the topic-specific multinomial over the vocabulary that is indicated by z and parameterized by ϕ
- LDA model
- Inference estimates the distributions over the hidden variables z, θ, ϕ ⇒ each word is assigned to a topic
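The generative story above can be sketched in a few lines of Python; the sizes, priors, and random seed below are toy assumptions for illustration, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and symmetric priors (illustrative assumptions).
K, V = 3, 8            # number of topics, vocabulary size
alpha, beta = 0.5, 0.1

# phi: one multinomial over the vocabulary per topic.
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(n_words):
    """Follow the LDA generative story for one document."""
    theta = rng.dirichlet(np.full(K, alpha))       # topic mixture proportions
    topics = rng.choice(K, size=n_words, p=theta)  # topic z per word
    words = [rng.choice(V, p=phi[z]) for z in topics]  # word w from phi_z
    return list(zip(words, topics))

doc = generate_document(20)
```

Inference runs this story backwards: given only the words, it estimates θ, ϕ and a topic z for every token, which is exactly the per-word assignment TopicExplorer visualizes.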
TopicExplorer Visualization techniques and user interface 6/ 24
TopicExplorer Visualization techniques and user interface 7/ 24
Overview of the TopicExplorer User Interface [HPS12, HRPO14]
https://blogs.urz.uni-halle.de/topicexplorer/
TopicExplorer Visualization techniques and user interface 8/ 24
Similarity-based Layout of Topics
- Algorithm
  - compute all pairwise topic similarities
  - compute a hierarchical clustering
  - lay out the dendrogram tree; each leaf is a topic
- A rainbow color mapping reflects the computed topic order
  - color acts like a visual hash function: similar colors indicate similar topics with high likelihood
- Color provides orientation for the user at three levels
  - topics
  - document browsing and ranking
  - word assignments of single documents
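The algorithm above can be sketched with SciPy; the toy topic-word distributions and the choice of Jensen-Shannon distance with average linkage are assumptions for illustration, not necessarily TopicExplorer's exact settings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
K, V = 6, 20
phi = rng.dirichlet(np.full(V, 0.1), size=K)  # toy topic-word distributions

# Pairwise topic similarities via Jensen-Shannon distance between
# topic-word distributions, then hierarchical clustering.
dist = pdist(phi, metric="jensenshannon")
tree = linkage(dist, method="average")

# The dendrogram's left-to-right leaf order places similar topics next
# to each other; mapping that order onto a rainbow gives the color hash.
order = leaves_list(tree)
hues = {int(topic): i / K for i, topic in enumerate(order)}  # hue in [0, 1)
```

Because similar topics end up adjacent in the leaf order, they receive nearby hues, which is what makes similar colors a likely sign of similar topics.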
TopicExplorer Visualization techniques and user interface 9/ 24
Document Browsing and Ranking
- Documents are ranked by decreasing topic affinity
- Colored circles indicate the four most important topics of each document
- Color acts as a visual hash function
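A minimal sketch of this ranking, assuming the document-topic proportions θ from the model are available (the toy sizes and seed are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 5, 6
theta = rng.dirichlet(np.full(K, 0.5), size=D)  # document-topic proportions

def rank_documents(theta, topic):
    """Indices of all documents, sorted by decreasing affinity to `topic`."""
    return np.argsort(-theta[:, topic])

def top_topics(theta, doc, n=4):
    """The n most important topics of one document (the colored circles)."""
    return np.argsort(-theta[doc])[:n]

ranking = rank_documents(theta, topic=0)
circles = top_topics(theta, doc=int(ranking[0]))
```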
TopicExplorer Visualization techniques and user interface 10/ 24
Document Inspection
- The document is displayed with the topic assignments of its words
- Color acts as a visual hash function
TopicExplorer Visualization techniques and user interface 11/ 24
Hierarchical Topics
- Initial situation
- Preview of topic merging according to the topic hierarchy
- Merged topic (interactive splitting allows the merge to be reversed)
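One way to realize such a reversible merge is to pool the token-topic counts of the two topics; the counts below are invented, and this is a sketch of the idea, not necessarily TopicExplorer's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
V = 10
# Hypothetical token counts of two topics over the vocabulary.
counts = rng.integers(1, 50, size=(2, V))

# Merging pools the token-topic counts; the merged word distribution
# is the count-weighted mixture of the two original topics.
merged_counts = counts.sum(axis=0)
merged_phi = merged_counts / merged_counts.sum()

# Keeping the original per-topic counts makes the merge reversible:
# splitting simply restores the two separate count vectors.
```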
TopicExplorer TopicFrames 12/ 24
TopicExplorer TopicFrames 13/ 24
Motivation of TopicFrames [HRPO14]
I TopicExplorer
- Problem
  - topic models give no guarantees on interpretability
  - not all topics are well interpretable
  - topics represented by word lists may leave room for several different interpretations
TopicExplorer TopicFrames 14/ 24
Towards interpretable Topic Representations
- The most probable words are not always the best representation of a topic
- More context for the top words is helpful
- Results on topic coherence
  - humans find a topic interpretable if pairs of its top words often appear close together in documents [Lau et al., EACL 2014]
- Observation
  - top word lists of topics are often dominated by nouns, while verbs are less prominent
- Frames
  - basic communication units for semantic content
  - noun-verb collocations
- Topic frames
  - topic-specific noun-verb collocations
- How to define and compute topic frames?
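The coherence finding cited above can be illustrated with a small document co-occurrence score; the toy corpus and the UMass-style formula below are illustrative assumptions, not the measure used by Lau et al.:

```python
import math
from itertools import combinations

# Toy corpus: each document reduced to its set of words.
docs = [
    {"bank", "money", "loan"},
    {"bank", "river"},
    {"money", "loan", "debt"},
    {"bank", "loan"},
]

def coherence(top_words, docs, eps=1.0):
    """UMass-style coherence: topics whose top-word pairs frequently
    appear together in the same document score higher."""
    score = 0.0
    for w1, w2 in combinations(top_words, 2):
        d2 = sum(w2 in d for d in docs)               # documents containing w2
        d12 = sum(w1 in d and w2 in d for d in docs)  # documents containing both
        score += math.log((d12 + eps) / d2)
    return score

score = coherence(["bank", "loan", "money"], docs)
```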
TopicExplorer TopicFrames 15/ 24
Topic Frames
- Definition
  - a topic frame occurs when a noun and a verb appear close together in a document and the topic model assigned both tokens to the same topic
- Topics with top word lists that are not clearly interpretable
- Topic frames make topics more clearly interpretable
TopicExplorer TopicFrames 16/ 24
Workflow and Implementation
- Linguistic data preparation
  - tokenization, important for Japanese documents
  - part-of-speech tagging
  - lemmatization
- Topic modeling
  - works with word lemmata
  - topic assignments are translated back to the original documents
- Topic frame detection
  - find noun-verb collocations with both tokens assigned to the same topic
  - filter out frame occurrences that cross sentence delimiters
  - count distinct topic frames and the number of frame occurrences
[Screenshots: topic frame occurrences; keywords with topics]
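The detection step can be sketched directly from the definition; the token tuples, window size, and sentence-delimiter encoding below are invented for illustration:

```python
# Hypothetical tokenized document: (lemma, part-of-speech, topic) per token,
# with None marking a sentence delimiter that blocks frames.
tokens = [
    ("government", "NOUN", 2), ("raise", "VERB", 2), ("tax", "NOUN", 2),
    None,  # sentence boundary
    ("river", "NOUN", 5), ("flow", "VERB", 5),
]

def topic_frames(tokens, window=3):
    """Count noun-verb pairs that appear close together, within one
    sentence, and were assigned to the same topic by the model."""
    frames = {}
    for i, tok in enumerate(tokens):
        if tok is None or tok[1] != "NOUN":
            continue
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            other = tokens[j]
            if other is None:  # never cross a sentence delimiter
                break
            if other[1] == "VERB" and other[2] == tok[2]:
                key = (tok[0], other[0], tok[2])  # (noun, verb, topic)
                frames[key] = frames.get(key, 0) + 1
    return frames

frames = topic_frames(tokens)
```

Note that ("tax", ...) yields no frame: the sentence boundary right after it blocks the search, which is exactly the delimiter filtering described above.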
TopicExplorer Topics and Meta Data 17/ 24
Outline
Introduction
TopicExplorerVisualization techniques and user interfaceTopicFramesTopics and Meta Data
OCR and Topic Models
Conclusion
TopicExplorer Topics and Meta Data 18/ 24
Topics and Meta Data: Time
- The document collection consists of blogs with time stamps
- Temporal information is not used for topic modeling
- Instead, the token assignments to topics are analyzed over time
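A minimal sketch of this post-hoc aggregation, with invented time stamps and topic assignments; the point is that time only enters after the model has been fit:

```python
from collections import Counter
from datetime import date

# Hypothetical (timestamp, topic) pairs, one per token-topic assignment.
assignments = [
    (date(2017, 1, 5), 0), (date(2017, 1, 9), 0), (date(2017, 1, 20), 1),
    (date(2017, 2, 2), 0), (date(2017, 2, 14), 1), (date(2017, 2, 15), 1),
]

# Aggregate token-topic assignments per month; the time stamps never
# enter the topic model itself, they are only joined in afterwards.
per_month = Counter(((d.year, d.month), topic) for d, topic in assignments)
```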
TopicExplorer Topics and Meta Data 19/ 24
Topics and Meta Data: General Case
- Documents come with all kinds of meta data
  - texts, the topic model and meta data are stored in a relational database [RH16]
  - derived linguistic meta data: POS, NER, sentiment, ...
  - geographic entities mentioned in documents
  - images, videos, links, ...
- Token assignments to topics can be analyzed in the context of meta-data entities using aggregation and/or intra-document proximity
- TopicExplorer uses automatically configured workflows to allow future extension to all kinds of meta data:
[Workflow diagram: automatically configured database steps such as DocumentCreate/Fill, TopicCreate/Fill, DocumentTopicCreate/Fill, TermCreate/Fill, DocumentTermTopicCreate/Fill, InFilePreparation, Mallet, TokenTopicAssociator, TopicMetaData, and the HierarchicalTopic_*, ColorTopic_* and Text_* variants]
OCR and Topic Models 20/ 24
OCR and Topic Models 21/ 24
OCR for historic printings
- Abbyy (commercial) and Tesseract (open source) need extensive training to handle historic printings
- New neural-network techniques for OCR are available: OCRopus
- Neural networks may require less and simpler training [Spr16]
- The DFG funds the OCR-D project [1] to enhance available OCR software

[1] http://ocr-d.de
OCR and Topic Models 22/ 24
Combining OCR and Topic Models
- Evaluating models of latent document semantics in the presence of OCR errors [WLR10]

  "Though model quality declines as errors increase, simple feature selection techniques enable the learning of relatively high quality models even as word error rates approach 50%."

- Towards noise-resilient document modeling [YL11]
- On handling textual errors in latent document modeling [YL13]

  "Using both real and synthetic data sets with varying degrees of errors, our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic)."
Conclusion 23/ 24
Conclusion 24/ 24
Conclusion
- Topic models are helpful for qualitative text analysis
- The TopicExplorer user interface helps to move from statistical models towards semantic interpretations
- Future work
  - evaluate new OCR methods for historic printings
  - combine OCR and topic modeling
David M. Blei.
Probabilistic topic models.
Commun. ACM, 55(4):77–84, April 2012.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, 2003.
Alexander Hinneburg, Rico Preiss, and René Schröder.
TopicExplorer: Exploring document collections with topic models.
In Machine Learning and Knowledge Discovery in Databases, pages 838–841. Springer, 2012.
Alexander Hinneburg, Frank Rosner, Stefan Pessler, and Christian Oberländer.
Exploring document collections with topic frames.
In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, pages 2084–2086, New York, NY, USA, 2014. ACM.
Frank Rosner and Alexander Hinneburg.
Translating bayesian networks into entity relationship models.
In Conceptual Modeling - 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings, pages 65–72, 2016.
Uwe Springmann.
OCR für alte Drucke.
Informatik-Spektrum, 39(6):459–462, 2016.
Daniel D. Walker, William B. Lund, and Eric K. Ringger.
Evaluating models of latent document semantics in the presence of OCR errors.
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 240–250, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
Tao Yang and Dongwon Lee.
Towards noise-resilient document modeling.
In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2345–2348, New York, NY, USA, 2011. ACM.
Tao Yang and Dongwon Lee.
On handling textual errors in latent document modeling.
In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 2089–2098, New York, NY, USA, 2013. ACM.