DiscoverText: Tools for Text
DESCRIPTION
A talk prepared for the Digital Methods Initiative 2014 Winter School, held at the University of Amsterdam.
TRANSCRIPT
![Page 1: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/1.jpg)
Tools for Text
Dr. Stuart Shulman @stuartwshulman [email protected]
Prepared for the Digital Methods Initiative Winter School 2014
University of Amsterdam
![Page 2: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/2.jpg)
Acknowledgements
Richard Rogers
The National Science Foundation
Mark J. Hoy
![Page 3: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/3.jpg)
Plan of Attack
A few high-level thoughts
Five pillars of text analytics
Getting started on DiscoverText
A small collaborative project
The twittersifter.com beta release
![Page 4: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/4.jpg)
“A funny thing happened…”
A brief history of DiscoverText
![Page 5: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/5.jpg)
A Master Metaphor: Sifter
![Page 6: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/6.jpg)
An Open Source Kernel
![Page 7: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/7.jpg)
Three Primary Tasks in CAT
![Page 8: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/8.jpg)
Classification of Text
A 2,500-year-old problem
Plato argued it would be frustrating
It still is…
![Page 9: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/9.jpg)
Grimmer & Stewart, “Text as Data,” Political Analysis (2013)
Volume is a problem for scholars
Coders are expensive
Groups struggle to accurately label text at scale
Validation of both humans and machines is “essential”
Some models are easier to validate than others
All models are wrong
Automated models enhance/amplify, but don’t replace humans
There is no one right way to do this
“Validate, validate, validate”
“What should be avoided then, is the blind use of any method without a validation step.”
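One common way to act on "validate, validate, validate" is to score machine labels against human labels. The following is a minimal sketch, not a method taken from the paper or from DiscoverText; the function name and the toy labels are made up for illustration.

```python
def precision_recall_f1(human, machine, positive):
    """Score machine labels against human labels for one category."""
    pairs = list(zip(human, machine))
    tp = sum(1 for h, m in pairs if h == positive and m == positive)
    fp = sum(1 for h, m in pairs if h != positive and m == positive)
    fn = sum(1 for h, m in pairs if h == positive and m != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels (hypothetical): five items coded by a human and by a model.
human = ["relevant", "relevant", "other", "other", "relevant"]
machine = ["relevant", "other", "other", "relevant", "relevant"]
p, r, f = precision_recall_f1(human, machine, "relevant")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same function can be pointed at a second human coder instead of a model, which keeps the validation of humans and machines on the same footing.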
![Page 10: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/10.jpg)
(Patent Pending)
![Page 11: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/11.jpg)
Three Important Books
![Page 12: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/12.jpg)
One Particularly Important Idea
![Page 13: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/13.jpg)
Five Pillars of Text Analytics
Search
Filter
Code
Cluster
Classify
You can execute all five using DiscoverText (DT)
![Page 14: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/14.jpg)
Pillar #1: Search
![Page 15: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/15.jpg)
Search for Negative Cases
![Page 16: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/16.jpg)
Defined Search (Multi-term)
![Page 17: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/17.jpg)
Pillar #2: Filters
Remember this filter
![Page 18: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/18.jpg)
Another Common Filter
![Page 19: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/19.jpg)
![Page 20: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/20.jpg)
Pillar #3: Human Coding
![Page 21: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/21.jpg)
Keystroke Coding is Fast
![Page 22: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/22.jpg)
Coding Off a List is Faster
![Page 23: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/23.jpg)
Data Cleaning is Fundamental
![Page 24: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/24.jpg)
Pillar #4: Clustering
![Page 25: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/25.jpg)
![Page 26: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/26.jpg)
Latent Dirichlet Allocation (LDA) Topic Models
![Page 27: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/27.jpg)
LDA on the Christie Data
Data is still processing…
![Page 28: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/28.jpg)
Pillar #5: Machine Learning
![Page 29: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/29.jpg)
Getting Started on DiscoverText
![Page 30: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/30.jpg)
Use the Key in Your Email
![Page 31: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/31.jpg)
Note the Peer Visibility Setting
![Page 32: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/32.jpg)
Peers Make Collaboration Possible
![Page 33: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/33.jpg)
![Page 34: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/34.jpg)
![Page 35: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/35.jpg)
![Page 36: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/36.jpg)
Perhaps a Trending Topic
![Page 37: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/37.jpg)
![Page 38: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/38.jpg)
The Basics
Raw Data
Subsets of Data
Data Humans or Machines Classify
![Page 39: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/39.jpg)
![Page 40: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/40.jpg)
Grab Some Twitter Data
![Page 41: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/41.jpg)
Create an Empty Archive
![Page 42: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/42.jpg)
Log in to a Twitter Account
![Page 43: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/43.jpg)
Enable via OAuth
![Page 44: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/44.jpg)
Ready to Query Twitter
![Page 45: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/45.jpg)
Use Operators to Refine Queries
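The slide image is not recoverable here, so the exact operators DiscoverText accepts are not confirmed. As a rough sketch, assuming the query box follows the widely documented Twitter search conventions (`OR` between alternatives, quoted exact phrases, a leading minus to exclude a term), a query string could be assembled like this. The helper name and the example terms are hypothetical.

```python
def build_query(any_of=(), phrases=(), exclude=()):
    """Assemble a search string from alternatives, exact phrases, and exclusions."""
    parts = []
    if any_of:
        parts.append(" OR ".join(any_of))       # match any one of these terms
    parts.extend(f'"{p}"' for p in phrases)     # exact phrase match
    parts.extend(f"-{t}" for t in exclude)      # drop items containing this term
    return " ".join(parts)


# Hypothetical terms chosen only to show the shape of the string.
query = build_query(
    any_of=["christie", "chrischristie"],
    phrases=["new jersey"],
    exclude=["agatha"],
)
print(query)
```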
![Page 46: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/46.jpg)
Set the Frequency of Fetches
![Page 47: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/47.jpg)
Data Will Start Flowing
![Page 48: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/48.jpg)
Data List View
![Page 49: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/49.jpg)
Best List Settings for Twitter Data
![Page 50: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/50.jpg)
Use Buckets to Refine Lists
Search results go into buckets
“Defined search” is a multi-term filter
Metadata filters are also useful for buckets
Buckets focus the text analytic process
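The bucket idea can be sketched outside the tool as two list filters: a multi-term "defined search" over the text, then a metadata filter over the result. This is an illustration only, not DiscoverText's implementation; the function names, the `lang` field, and the sample items are assumptions.

```python
def defined_search(items, include_terms, exclude_terms=()):
    """Keep items whose text has any include term and no exclude term."""
    bucket = []
    for item in items:
        text = item["text"].lower()
        has_include = any(t.lower() in text for t in include_terms)
        has_exclude = any(t.lower() in text for t in exclude_terms)
        if has_include and not has_exclude:
            bucket.append(item)
    return bucket


def metadata_filter(items, field, value):
    """Keep items whose metadata field equals the given value."""
    return [item for item in items if item.get(field) == value]


# Made-up archive of four items with one metadata field.
archive = [
    {"text": "Christie speaks in Trenton", "lang": "en"},
    {"text": "Reading an Agatha Christie novel", "lang": "en"},
    {"text": "Christie habla hoy", "lang": "es"},
    {"text": "Nothing to see here", "lang": "en"},
]
bucket = defined_search(archive, ["christie"], ["agatha"])
english_bucket = metadata_filter(bucket, "lang", "en")
print(len(archive), len(bucket), len(english_bucket))
```

Each step narrows the list, which is the sense in which buckets focus the later coding and classification work.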
![Page 51: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/51.jpg)
![Page 52: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/52.jpg)
Create a Dataset to Code
Any archive or bucket
Use the random sampling tool
Standard: All coders get all items
Triage: Coders get next uncoded item
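A small sketch of the two assignment modes, under one reading of the slide: in standard mode every coder receives the full sample, while in triage mode coders pull from a shared queue so each item is handed to only one coder. The class and function names, coder names, and archive contents are hypothetical.

```python
import random
from collections import deque


def sample_dataset(items, size, seed=0):
    """Draw a random sample without replacement (seeded so it is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(items, size)


def standard_assignment(dataset, coders):
    """Standard mode: every coder gets every item."""
    return {coder: list(dataset) for coder in coders}


class TriageQueue:
    """Triage mode: each request hands out the next uncoded item."""

    def __init__(self, dataset):
        self._pending = deque(dataset)

    def next_item(self):
        return self._pending.popleft() if self._pending else None


archive = [f"item-{i}" for i in range(100)]
dataset = sample_dataset(archive, 10)

standard = standard_assignment(dataset, ["ann", "bob"])
queue = TriageQueue(dataset)
first_for_ann = queue.next_item()
first_for_bob = queue.next_item()
print(len(standard["ann"]), first_for_ann, first_for_bob)
```

Standard mode produces overlapping codes, which is what inter-rater reliability needs; triage mode trades that overlap for throughput.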
![Page 53: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/53.jpg)
![Page 54: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/54.jpg)
Select from Three Coding Styles
Default: Mutually Exclusive Codes
Option 1: Non-Mutually Exclusive Codes
Option 2: User-Defined Codes (Grounded Theory)
![Page 55: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/55.jpg)
![Page 56: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/56.jpg)
Assign Peers to Code a Dataset
How many coders?
How many items need to be coded?
How many test or training sets?
There are no cookbook answers
![Page 57: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/57.jpg)
Look at Inter-Rater Reliability
Highly reliable coding (easy tasks)
Unreliable coding (interesting tasks)
If humans can’t, neither can machines
Some tasks are better suited for machines
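The slides do not say which reliability statistic is used. Cohen's kappa for two coders is one standard choice, and a minimal sketch of it follows; the coder labels are invented for the example.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Share of items where the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    # Agreement expected if each coder labeled at random with their own rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented codes for six items.
ann = ["yes", "yes", "no", "no", "yes", "no"]
bob = ["yes", "no", "no", "no", "yes", "yes"]
kappa = cohens_kappa(ann, bob)
print(round(kappa, 3))
```

Here the coders agree on four of six items, but half of that agreement is expected by chance, so kappa lands well below the raw agreement rate.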
![Page 58: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/58.jpg)
Adjudication: The Secret Sauce
Expert review or consensus process
Invalidate false positives
Identify strong and weak coders
Exclude false positives from training sets
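As an illustration of the consensus route, the sketch below takes a majority vote per item, leaves ties for an expert, scores each coder against the settled labels, and keeps only settled items for training. It is an assumption about how such a step might look, not a description of DiscoverText's adjudication screen; all names and codes are made up.

```python
from collections import Counter


def adjudicate(codings):
    """Majority vote per item; ties return None and wait for expert review."""
    decisions = {}
    for item_id, by_coder in codings.items():
        ranked = Counter(by_coder.values()).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            decisions[item_id] = None
        else:
            decisions[item_id] = ranked[0][0]
    return decisions


def coder_accuracy(codings, decisions):
    """Share of each coder's codes that match the adjudicated label."""
    hits, totals = Counter(), Counter()
    for item_id, by_coder in codings.items():
        gold = decisions[item_id]
        if gold is None:
            continue  # unresolved items do not count for or against anyone
        for coder, label in by_coder.items():
            totals[coder] += 1
            hits[coder] += int(label == gold)
    return {coder: hits[coder] / totals[coder] for coder in totals}


# Made-up codes from three coders.
codings = {
    "t1": {"ann": "on-topic", "bob": "on-topic", "cy": "on-topic"},
    "t2": {"ann": "on-topic", "bob": "off-topic", "cy": "on-topic"},
    "t3": {"ann": "off-topic", "bob": "on-topic", "cy": "off-topic"},
    "t4": {"ann": "on-topic", "bob": "off-topic"},
}
decisions = adjudicate(codings)
accuracy = coder_accuracy(codings, decisions)
training_set = {i: label for i, label in decisions.items() if label is not None}
print(decisions, accuracy, sorted(training_set))
```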
![Page 59: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/59.jpg)
![Page 60: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/60.jpg)
![Page 61: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/61.jpg)
Use Classification Scores as Filters
Iteration plays a critical role
Train, classify, filter
Repeat until the model is trusted
Each round weeds out false positives
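The train, classify, filter loop can be sketched with a small multinomial naive Bayes model. Naive Bayes is swapped in here purely for illustration; the slides do not state which algorithm DiscoverText uses. The training texts, labels, and the 0.9 threshold are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return a score between 0 and 1 for each label (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                count = word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(vocab)))
        log_scores[label] = score
    peak = max(log_scores.values())
    scaled = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(scaled.values())
    return {label: v / norm for label, v in scaled.items()}


def filter_by_score(model, texts, label, threshold):
    """Keep texts whose score for the label meets the threshold."""
    return [t for t in texts if classify(model, t)[label] >= threshold]


# Invented, human-coded training items.
training = [
    ("governor christie budget speech", "christie"),
    ("christie press conference new jersey", "christie"),
    ("agatha christie mystery novel", "other"),
    ("reading a classic mystery novel", "other"),
]
unlabeled = ["christie budget press conference", "a mystery novel by agatha"]

model = train(training)
kept = filter_by_score(model, unlabeled, "christie", 0.9)
print(kept)
```

In a real round, the items that survive the filter would be sampled and coded again, and the corrected codes fed back into `train`, repeating until the scores can be trusted.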
![Page 62: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/62.jpg)
Classifier Histograms: More Filtering
![Page 63: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/63.jpg)
Track Your Progress
![Page 64: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/64.jpg)
![Page 65: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/65.jpg)
![Page 66: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/66.jpg)
![Page 67: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/67.jpg)
Running the Classifier
![Page 68: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/68.jpg)
![Page 69: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/69.jpg)
Filter by Classification
![Page 70: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/70.jpg)
Filtered List >95% Not Chris Christie
![Page 71: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/71.jpg)
http://beta.twittersifter.com
![Page 72: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/72.jpg)
![Page 73: DiscoverText: Tools for Text](https://reader033.vdocuments.us/reader033/viewer/2022052906/558dfecc1a28aba90d8b4705/html5/thumbnails/73.jpg)