Small Data Classification for NLP
TRANSCRIPT
Small Data Classification for Natural Language Processing
Michael Thorne
Head of Data Science, CaliberMind
2 | ©2016 CaliberMind
Goals
• Intro
• What Makes NLP Different
• Solutions
• Questions
Obligatory Speaker Bio
Michael Thorne
Head of Data Science, CaliberMind
MS Data Science Program, GalvanizeU
B.S. Physics, Fordham University
NSA Analytic Lead
US Navy Digital Network Intelligence Analyst / Cryptolinguist
CaliberMind
• B2B marketing SaaS
• Persona modeling and personality insights
• Content matching across buyer journey for high-value, complex purchase decisions
• Our core competency is natural language processing
What’s So Special About NLP?
• Not random (Zipf’s Law)
• Huge feature space
• Subjective Criteria
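The "not random" point can be seen even in a toy corpus. The sketch below (an illustration, not from the talk) counts word frequencies and shows the Zipf-style skew: one word dominates while most words appear only once.

```python
from collections import Counter

# A toy corpus; real text shows the same Zipfian shape at scale.
text = ("the quick brown fox jumps over the lazy dog and the dog "
        "barks at the fox while the fox runs into the woods and the dog follows")

# Rank words by frequency.
ranked = Counter(text.split()).most_common()

# Under Zipf's Law, frequency falls off roughly as 1/rank:
# a handful of words dominate while most appear only once.
top_word, top_freq = ranked[0]
singletons = sum(1 for _, freq in ranked if freq == 1)
print(top_word, top_freq, singletons)
```

This heavy skew is why word-frequency features behave nothing like the well-behaved random samples most statistical tools assume.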
Small Data NLP
Persona Status Quo
• Assumptive Personas
• Qualitative Criteria
• Subjective Labels
• Static Output
Starting Point
Demographics
Psychographics
Firmographics
Let’s Validate the Status Quo
CaliberMind’s Data Challenge
• We match the right message to the right person at the right time
• We operate at the upper limit of human-scale problems (hundreds to tens of thousands of documents)
• Our results weren’t as accurate as we expected
Our Friend: The Central Limit Theorem
• This is the theorem that lets us treat our data as well behaved: given enough samples, the distribution of their means approaches a normal distribution
• Let’s look at a classic example: coin tosses
Coin Flip Distribution
1 Trial
100 Trials
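The plots on these slides aren't reproduced in the transcript, but the demonstration is easy to sketch (this is an illustration of the same idea, not the speaker's code): a single flip is just 0 or 1, yet the means of many 100-flip trials cluster tightly and symmetrically around 0.5.

```python
import random
import statistics

random.seed(0)  # deterministic for the sketch

def sample_mean(n_flips):
    """Mean of n fair coin flips (1 = heads, 0 = tails)."""
    return sum(random.randint(0, 1) for _ in range(n_flips)) / n_flips

# One trial of a single flip is 0 or 1 -- nothing bell-shaped about it.
# But the means of 1,000 trials of 100 flips each pile up around 0.5,
# with spread close to the CLT prediction of 0.5 / sqrt(100) = 0.05.
means = [sample_mean(100) for _ in range(1000)]

center = statistics.mean(means)
spread = statistics.stdev(means)
print(center, spread)
```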
Example: K-Means
• K-means is a workhorse algorithm for unsupervised learning
• What assumptions do we make when we use k-means?
• Spherical clusters
• Same variance in every cluster
• Same prior probability for every cluster
• It turns out NLP data satisfies none of these assumptions
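A minimal Lloyd's-iteration k-means makes the assumptions concrete (this is a generic sketch, not CaliberMind's code): assigning points by plain Euclidean distance is exactly where the spherical, equal-variance assumption is baked in. On two round, well-separated blobs it works beautifully; on elongated, skewed NLP feature vectors that same distance rule carves up the data badly.

```python
import math
import random

def kmeans(points, k, centroids, iters=10):
    """Minimal Lloyd's k-means over 2-D points.
    The nearest-centroid rule implicitly assumes spherical,
    equal-variance clusters."""
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

# The happy case from the slides: two round, equal-variance blobs.
random.seed(1)
blob_a = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)]
blob_b = [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(50)]
points = blob_a + blob_b

# Deterministic init for the sketch: one seed point from each blob.
final = kmeans(points, k=2, centroids=[points[0], points[-1]])
print(final)
```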
Happy K-Means
NLP K-Means
But Wait, It Gets Better
• Our documents tend to be of vastly different sizes within the same corpus
• Unbalanced Classes
• Qualitative Criteria
• Unlabeled data
• Human labeling is time-intensive
Our Solution
Dimensionality Reduction
• Dimensionality was the first thing we tackled
• Manual dictionaries to collapse similar terms, e.g. mark = [‘growth hacker’, ‘marketer’, ‘demand gen’]
• LSA to remove low-information terms
• Automating the process using word2vec, DBpedia, and skip-gram similarities
• As we aggregate more data, we’re able to do this process more effectively
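The manual-dictionary step can be sketched as a phrase-to-canonical-token mapping. The table and the `collapse_terms` helper below are hypothetical (the slide only shows the `mark` list of synonymous titles; mapping them onto one canonical token is an interpretation of that example):

```python
# Hypothetical synonym table in the spirit of the slide's `mark` example:
# several job-title phrases collapse onto one canonical token, shrinking
# the feature space before any modeling happens.
SYNONYMS = {
    "growth hacker": "marketer",
    "demand gen": "marketer",
    "marketer": "marketer",
    "sysops": "security",
    "tech guru": "security",
}

def collapse_terms(text, table=SYNONYMS):
    """Replace each known phrase with its canonical token, longest
    phrase first, so 'growth hacker' is collapsed as a unit."""
    out = text.lower()
    for phrase in sorted(table, key=len, reverse=True):
        out = out.replace(phrase, table[phrase])
    return out

print(collapse_terms("Growth Hacker focused on demand gen"))
```

Every collapsed phrase removes a dimension from the document-term matrix, which matters enormously when the corpus is only hundreds of documents.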
Spiky Data
Metrics Over Raw Scores
• Especially important when comparing documents of different sizes
• Measuring how many standard deviations a score sits from the mean works better than a raw similarity score
• Pick the similarity measure that fits your data (for our NLP data, it wasn’t cosine)
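The standard-deviations-from-the-mean idea is just a z-score over each corpus's similarity scores. The numbers below are hypothetical, but they show why the metric beats the raw score: the raw values from two corpora aren't comparable, while each corpus's standout document is obvious in z-score terms.

```python
import statistics

def z_scores(scores):
    """How many standard deviations each score sits from its corpus mean."""
    mu = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mu) / sd for s in scores]

# Hypothetical raw similarity scores from two corpora of different sizes;
# corpus_a's documents score high across the board, corpus_b's score low.
corpus_a = [0.82, 0.79, 0.85, 0.80, 0.95]
corpus_b = [0.10, 0.12, 0.09, 0.11, 0.30]

# Raw scores say 0.95 >> 0.30, but each corpus's final document is the
# standout of its own corpus by well over one standard deviation.
print(z_scores(corpus_a)[-1], z_scores(corpus_b)[-1])
```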
Pretend We Have Labeled Data
• Run a rules-based scoring algorithm as a first pass
• Take a small subset of high-scoring people as exemplars
• Build a template from a latent semantic analysis of these exemplars
• Compare the remaining data rows against each exemplar cluster
• Assign each row to its highest-scoring exemplar cluster, broadening that persona’s definition
• Continue until all data rows are assigned
• Any row whose similarity stays below a set threshold is labeled ‘Unknown’; these rows indicate additional, underlying personas
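The bootstrapping loop above can be sketched as follows. This is a stand-in, not CaliberMind's implementation: token-overlap (Jaccard) similarity replaces the LSA comparison, the rules pass is a bare keyword match, and all names, titles, and the 0.2 threshold are made up for illustration.

```python
# Hypothetical rules pass: exact title matches seed the personas.
RULES = {"vp marketing": "Value", "security engineer": "Security"}
THRESHOLD = 0.2  # illustrative cutoff; below this a row stays unlabeled

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Token-overlap similarity, standing in for the LSA comparison."""
    return len(a & b) / len(a | b) if a | b else 0.0

rows = ["VP Marketing", "Marketing Founder", "Growth Ninja",
        "Security Engineer", "Security Analyst"]

# Round 1: rules-based first pass seeds the exemplar clusters.
clusters = {}  # persona -> vocabulary of its exemplars so far
labels = {}
for row in rows:
    persona = RULES.get(row.lower())
    if persona:
        labels[row] = persona
        clusters.setdefault(persona, set()).update(tokens(row))

# Later rounds: assign each remaining row to its best cluster, folding its
# tokens in (broadening the persona), until no row clears the threshold.
changed = True
while changed:
    changed = False
    for row in rows:
        if row in labels:
            continue
        persona, score = max(
            ((p, jaccard(tokens(row), vocab)) for p, vocab in clusters.items()),
            key=lambda x: x[1])
        if score >= THRESHOLD:
            labels[row] = persona
            clusters[persona].update(tokens(row))
            changed = True

# Anything that never clears the threshold hints at an unmodeled persona.
for row in rows:
    labels.setdefault(row, "Unknown")
print(labels)
```

As in the rounds on the next slides, each pass broadens a persona's definition, which lets previously unmatched rows clear the threshold in a later pass.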
Round 1 (Rules)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder
Lucas M    Growth Ninja
Bec G      Tech Guru
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker
Art L      Data Analyst
Round 2 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.45
Lucas M    Growth Ninja    0.11
Bec G      Tech Guru       0.71               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.87               Value
Art L      Data Analyst    0.41
Round 3 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.68               Security
Lucas M    Growth Ninja    0.18
Bec G      Tech Guru       0.86               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.89               Value
Art L      Data Analyst    0.72               Security
Round 4 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.71               Security
Lucas M    Growth Ninja    0.16               Unknown
Bec G      Tech Guru       0.88               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.91               Value
Art L      Data Analyst    0.78               Security
Example
Takeaways
• Human-generated data is never really random
• Small data models are hyper-sensitive
• Validate assumptions
Questions?
Michael Thorne
[email protected]