Introduction to Text Mining and Visualization with Interactive Web Application
Introduction
ITMS
Preprocessing Data
Data Visualization
Cluster Analysis
Topic Modeling
Google Books API
Future Directions
References
Interactive Visual Data Analysis, Part Two
Interactive Text Mining Suite
Olga Scrivner
Indiana University
Workshop in Methods
Outline
1 Introduce a web application for text processing and mining
2 Learn about natural language processing techniques
3 Develop practical skills
Data Mining
“As our collective knowledge continues to be digitized and stored (...) it becomes more difficult to find and discover what we are looking for.” (Blei 2012)
New Ways of Exploring Data Collections
Word clouds (Vuillemot et al., 2009)
Visualization Methods
Social network graphs (Rydberg-Cox, 2011)
Tracking emotion and sentiment in fairy tales (Mohammad, 2012)
Topic Modeling
Discovering the underlying themes of a collection from Science magazine, 1990-2000 (Blei 2012)
Technological and Methodological Obstacles
Many tools require some programming skills (Mallet, Meta, R and Python libraries)
GUI tools are limited to certain formats and functions (Voyant, PaperMachine)
Lack of active control by users
Interactive Text Mining Suite
A user-friendly tool for quantitative analysis and visualization of unstructured data
Platform-independent
Interactive
ITMS Structure
1 File Uploads
Upload files (txt, pdf, rdf and Google Books API)
2 Data Preparation
Data preprocessing (stopwords, stemming, metadata)
3 Data Visualization
Word frequencies, cluster analysis and topic modeling
Workshop Files
Download 3 text files
http://ssrc.indiana.edu/seminars/wim.shtml
NY Times articles (3 documents in a plain text format)
ITMS Web site:
http://www.interactivetextminingsuite.com
Upload File
Preprocessing Data
Before performing data analysis we should preprocess data.
Preprocessing Options
Select preprocessing options and click apply.
Stopwords
Stopwords (e.g. the, and): select Default for English
Manual Removal of Stopwords
Depending on your needs, remove any additional stopwords that you consider noise, e.g. paper, shows, etc.
Select apply
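The two stopword steps above (a default English list plus manual additions) can be sketched in Python. This is only an illustration: ITMS itself is a Shiny (R) application, and the abbreviated DEFAULT_STOPWORDS set and the remove_stopwords helper below are hypothetical stand-ins, not the list or function ITMS uses.

```python
import re

# Hypothetical, heavily abbreviated stand-in for a default English stopword list.
DEFAULT_STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "that", "this"}


def remove_stopwords(text, extra_stopwords=()):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    stopwords = DEFAULT_STOPWORDS | {w.lower() for w in extra_stopwords}
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]


sentence = "The paper shows that the topic model is useful"
# Default list only: function words disappear, "paper" and "shows" remain.
default_only = remove_stopwords(sentence)
# Manual removal: add the corpus-specific noise words named on the slide.
with_manual = remove_stopwords(sentence, extra_stopwords=["paper", "shows"])
print(default_only)
print(with_manual)
```

Keeping the manual additions separate from the default list makes it easy to try different noise words without touching the base list.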
Stemming
To improve the analysis, you can stem all your tokens: e.g. instead of worked, works and working, you will have only one relevant stem, work.
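A minimal Python sketch of the idea, assuming a toy suffix-stripping rule. The crude_stem function is hypothetical and far simpler than a real stemmer such as Porter's; the slides do not say which stemmer ITMS applies.

```python
def crude_stem(token):
    """Strip one common English suffix; a toy rule, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "is" survive.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = ["worked", "works", "working", "work"]
stems = [crude_stem(t) for t in tokens]
print(stems)
```

All four forms collapse to the same stem, so their counts are pooled in later frequency and topic analyses.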
Metadata Extraction
You can extract or upload metadata. You will need date stamp (year) information for chronological topic modeling.
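One way to picture year extraction is a regular-expression search over each document. The sketch below is an assumption about how such metadata could be pulled from plain text, not a description of what ITMS does; the pattern, the extract_year helper and the sample headers are all hypothetical.

```python
import re

# Matches four-digit years from 1500 to 2099 that stand alone as a word.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def extract_year(text):
    """Return the first plausible year found in the text, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


# Hypothetical document headers used only for illustration.
docs = {
    "article1.txt": "Published March 3, 2015. Markets rallied on Tuesday.",
    "article2.txt": "No date information in this header.",
}
metadata = {name: extract_year(text) for name, text in docs.items()}
print(metadata)
```

Documents for which no year is found (None here) would need their metadata uploaded separately before a chronological view is possible.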
Visualization
Word Cloud Representation
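A word cloud sizes each word by how often it occurs, so the underlying data is a word-frequency table. A minimal Python sketch of that table follows; the word_frequencies helper and the sample sentence are hypothetical, and the rendering step is left out.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Count lowercase word tokens and return the top_n most frequent."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "text mining turns text into data and data into insight about text"
top_words = word_frequencies(sample)
print(top_words)
```

In practice the counts would be computed after stopword removal and stemming, so that the largest words in the cloud are content words.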
Customization
Cluster Analysis
You need to have at least three documents
Documents will be grouped based on their term similarity measures
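To make "term similarity" concrete, the Python sketch below compares three mini-documents by cosine similarity of their term counts. Cosine similarity is an assumption chosen for illustration; the slides do not name the measure or the clustering algorithm ITMS uses, and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Bag-of-words term counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Three hypothetical mini-documents (the slide asks for at least three).
docs = {
    "doc1": "stocks market trading stocks investors",
    "doc2": "market investors trading shares",
    "doc3": "football match goal football team",
}
vectors = {name: term_vector(text) for name, text in docs.items()}
similarities = {
    pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]])
    for pair in combinations(docs, 2)
}
# The most similar pair is the first one a clustering step would merge.
closest_pair = max(similarities, key=similarities.get)
print(closest_pair, round(similarities[closest_pair], 3))
```

A full cluster analysis would feed these pairwise similarities (or the corresponding distances) into a grouping procedure such as hierarchical clustering.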
Topic Modeling
LDA (Latent Dirichlet Allocation)
STM (Structural Topic Model)
Chronological topic visualization (LDA): requires metadata
Topic Modeling Tuning
Selection of topics (how many different themes)
Selection of words per theme (how many words per topic)
Selection of the number of iterations
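The three tuning choices above map onto the parameters of a topic model fit. The Python sketch below is a toy collapsed Gibbs sampler for LDA, written only to show where the number of topics, the words per topic and the iteration count enter. It is not the implementation ITMS uses (which presumably relies on R packages), and lda_gibbs, its alpha and beta defaults, and the sample documents are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, num_iterations, top_words,
              alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns the top words per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]  # tokens of doc d in topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics  # tokens assigned to topic k overall
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(num_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return [
        sorted(vocab, key=lambda w: topic_word[t][w], reverse=True)[:top_words]
        for t in range(num_topics)
    ]


docs = [
    "market stocks investors market".split(),
    "stocks trading market".split(),
    "football goal team football".split(),
    "team match goal".split(),
]
topics = lda_gibbs(docs, num_topics=2, num_iterations=200, top_words=3)
for number, words in enumerate(topics, start=1):
    print(number, words)
```

More topics give finer-grained themes, more words per topic give a fuller description of each theme, and more iterations give the sampler more time to settle at the cost of run time.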
Topic Model Selection
LDA Topic Model
STM Topic Model
Other Formats - Google Books
Before switching to other data formats, refresh your local browser.
Start with File Uploads and select Structured Data
Select your search terms and submit
Current limitation is 40 books
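The 40-book limit matches the per-request cap on maxResults in the public Google Books volumes API; that this is the reason for the ITMS limit is an assumption, since the slides do not say how ITMS queries the service. The Python sketch below only builds a search URL and sends no request; build_search_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Public Google Books volumes endpoint; maxResults may not exceed 40 per request.
GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"
MAX_RESULTS = 40


def build_search_url(search_terms, max_results=MAX_RESULTS):
    """Build a search URL, capping the number of requested books at 40."""
    params = {"q": search_terms, "maxResults": min(max_results, MAX_RESULTS)}
    return f"{GOOGLE_BOOKS_URL}?{urlencode(params)}"


# Asking for 100 books is silently reduced to the cap of 40.
url = build_search_url("text mining", max_results=100)
print(url)
```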
Future Options
Shiny Web Application is highly customizable
1 Part-of-speech tagging (tm package)
2 Network analysis (igraph package)
3 Named Entity Recognition (NLP package)
4 Twitter streaming (twitteR package): requires the user's own Twitter setup for streaming; instructions for setting it up will be provided
Open for other suggestions and collaboration - [email protected]
Acknowledgements
I would like to thank WIM for providing this opportunity.
Contributors: Jefferson Davis, Irina Trapido, Jay Lee
References I
[1] Many open source R packages: tm, shiny, NLP, stringi, stringr, topicmodels, lda and many more
[2] Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press
[3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber & Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press
[4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International Publishing, Cham
[5] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso
[6] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods for literature analysis. Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 35–44
Image credits: https://media.giphy.com/media/10zsjaH4g0GgmY/giphy.gif