Small Data Classification for NLP
TRANSCRIPT
Small Data Classification for Natural Language Processing
Michael Thorne
Head of Data Science, CaliberMind
2 | ©2016 CaliberMind
Goals
• Intro
• What Makes NLP Different
• Solutions
• Questions
Obligatory Speaker Bio
Michael Thorne
Head of Data Science, CaliberMind
MS Data Science Program, GalvanizeU
B.S. Physics, Fordham University
NSA Analytic Lead
US Navy Digital Network Intelligence Analyst / Cryptolinguist
CaliberMind
• B2B marketing SaaS
• Persona modeling and personality insights
• Content matching across buyer journey for high-value, complex purchase decisions
• Our core competency is natural language processing
What’s So Special About NLP?
• Not random (Zipf’s Law)
• Huge feature space
• Subjective Criteria
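The "not random" point can be seen even in a toy corpus. The sketch below (an illustration, not from the talk) counts word frequencies and shows the Zipf-style skew: one word dominates while most words appear only once.

```python
from collections import Counter

# A toy corpus; real text shows the same Zipfian shape at scale.
text = ("the quick brown fox jumps over the lazy dog and the dog "
        "barks at the fox while the fox runs into the woods and the dog follows")

# Rank words by frequency.
ranked = Counter(text.split()).most_common()

# Under Zipf's Law, frequency falls off roughly as 1/rank:
# a handful of words dominate while most appear only once.
top_word, top_freq = ranked[0]
singletons = sum(1 for _, freq in ranked if freq == 1)
print(top_word, top_freq, singletons)
```

This heavy skew is why word-frequency features behave nothing like the well-behaved random samples most statistical tools assume.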
Small Data NLP
Persona Status Quo
• Assumptive Personas
• Qualitative Criteria
• Subjective Labels
• Static Output
Starting Point
Demographics
Psychographics
Firmographics
Let’s Validate the Status Quo
CaliberMind’s Data Challenge
• We match the right message to the right person at the right time
• We operate at the upper limit of human-scale problems (hundreds to tens of thousands of documents)
• Our results weren’t as accurate as we expected
Our Friend: The Central Limit Theorem
• This is the theorem that lets us treat our data as well behaved: given enough samples, the distribution of their means approaches a normal distribution
• Let’s look at a classic example: coin tosses
Coin Flip Distribution
1 Trial
100 Trials
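The plots on these slides aren't reproduced in the transcript, but the demonstration is easy to sketch (this is an illustration of the same idea, not the speaker's code): a single flip is just 0 or 1, yet the means of many 100-flip trials cluster tightly and symmetrically around 0.5.

```python
import random
import statistics

random.seed(0)  # deterministic for the sketch

def sample_mean(n_flips):
    """Mean of n fair coin flips (1 = heads, 0 = tails)."""
    return sum(random.randint(0, 1) for _ in range(n_flips)) / n_flips

# One trial of a single flip is 0 or 1 -- nothing bell-shaped about it.
# But the means of 1,000 trials of 100 flips each pile up around 0.5,
# with spread close to the CLT prediction of 0.5 / sqrt(100) = 0.05.
means = [sample_mean(100) for _ in range(1000)]

center = statistics.mean(means)
spread = statistics.stdev(means)
print(center, spread)
```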
Example: K-Means
• K-means is a workhorse algorithm for unsupervised learning
• What assumptions do we make when we use k-means?
• Spherical clusters
• Same variance in every cluster
• Same prior probability for every cluster
• It turns out NLP data satisfies none of these assumptions
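A minimal Lloyd's-iteration k-means makes the assumptions concrete (this is a generic sketch, not CaliberMind's code): assigning points by plain Euclidean distance is exactly where the spherical, equal-variance assumption is baked in. On two round, well-separated blobs it works beautifully; on elongated, skewed NLP feature vectors that same distance rule carves up the data badly.

```python
import math
import random

def kmeans(points, k, centroids, iters=10):
    """Minimal Lloyd's k-means over 2-D points.
    The nearest-centroid rule implicitly assumes spherical,
    equal-variance clusters."""
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

# The happy case from the slides: two round, equal-variance blobs.
random.seed(1)
blob_a = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)]
blob_b = [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(50)]
points = blob_a + blob_b

# Deterministic init for the sketch: one seed point from each blob.
final = kmeans(points, k=2, centroids=[points[0], points[-1]])
print(final)
```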
Happy K-Means
NLP K-Means
But Wait, It Gets Better
• Our documents tend to be of vastly different sizes within the same corpus
• Unbalanced Classes
• Qualitative Criteria
• Unlabeled data
• Human labeling is time-intensive
Our Solution
Dimensionality Reduction
• Dimensionality was the first thing we tackled
• Manual dictionaries to collapse similar terms, e.g. mark = [‘growth hacker’, ‘marketer’, ‘demand gen’]
• LSA to remove low-information terms
• Automating the process using word2vec, DBpedia, and skip-gram similarities
• As we aggregate more data, we’re able to do this process more effectively
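The manual-dictionary step can be sketched as a phrase-to-canonical-token mapping. The table and the `collapse_terms` helper below are hypothetical (the slide only shows the `mark` list of synonymous titles; mapping them onto one canonical token is an interpretation of that example):

```python
# Hypothetical synonym table in the spirit of the slide's `mark` example:
# several job-title phrases collapse onto one canonical token, shrinking
# the feature space before any modeling happens.
SYNONYMS = {
    "growth hacker": "marketer",
    "demand gen": "marketer",
    "marketer": "marketer",
    "sysops": "security",
    "tech guru": "security",
}

def collapse_terms(text, table=SYNONYMS):
    """Replace each known phrase with its canonical token, longest
    phrase first, so 'growth hacker' is collapsed as a unit."""
    out = text.lower()
    for phrase in sorted(table, key=len, reverse=True):
        out = out.replace(phrase, table[phrase])
    return out

print(collapse_terms("Growth Hacker focused on demand gen"))
```

Every collapsed phrase removes a dimension from the document-term matrix, which matters enormously when the corpus is only hundreds of documents.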
Spiky Data
Metrics Over Raw Scores
• Especially important when comparing documents of different sizes
• Measuring how many standard deviations a score sits from the mean works better than a raw similarity score
• Pick the similarity measure that fits your data (for our NLP data, it wasn’t cosine)
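The standard-deviations-from-the-mean idea is just a z-score over each corpus's similarity scores. The numbers below are hypothetical, but they show why the metric beats the raw score: the raw values from two corpora aren't comparable, while each corpus's standout document is obvious in z-score terms.

```python
import statistics

def z_scores(scores):
    """How many standard deviations each score sits from its corpus mean."""
    mu = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mu) / sd for s in scores]

# Hypothetical raw similarity scores from two corpora of different sizes;
# corpus_a's documents score high across the board, corpus_b's score low.
corpus_a = [0.82, 0.79, 0.85, 0.80, 0.95]
corpus_b = [0.10, 0.12, 0.09, 0.11, 0.30]

# Raw scores say 0.95 >> 0.30, but each corpus's final document is the
# standout of its own corpus by well over one standard deviation.
print(z_scores(corpus_a)[-1], z_scores(corpus_b)[-1])
```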
Pretend We Have Labeled Data
• Run a rules-based scoring algorithm as a first pass
• Take a small subset of high-scoring people as exemplars
• Build a template from a latent semantic analysis of these exemplars
• Compare the remaining data rows against each exemplar cluster
• Assign each row to its highest-scoring exemplar cluster, broadening that persona’s definition
• Continue until all data rows are assigned
• Any row whose similarity stays below a set threshold is labeled ‘Unknown’; these rows indicate additional, underlying personas
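The bootstrapping loop above can be sketched as follows. This is a stand-in, not CaliberMind's implementation: token-overlap (Jaccard) similarity replaces the LSA comparison, the rules pass is a bare keyword match, and all names, titles, and the 0.2 threshold are made up for illustration.

```python
# Hypothetical rules pass: exact title matches seed the personas.
RULES = {"vp marketing": "Value", "security engineer": "Security"}
THRESHOLD = 0.2  # illustrative cutoff; below this a row stays unlabeled

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Token-overlap similarity, standing in for the LSA comparison."""
    return len(a & b) / len(a | b) if a | b else 0.0

rows = ["VP Marketing", "Marketing Founder", "Growth Ninja",
        "Security Engineer", "Security Analyst"]

# Round 1: rules-based first pass seeds the exemplar clusters.
clusters = {}  # persona -> vocabulary of its exemplars so far
labels = {}
for row in rows:
    persona = RULES.get(row.lower())
    if persona:
        labels[row] = persona
        clusters.setdefault(persona, set()).update(tokens(row))

# Later rounds: assign each remaining row to its best cluster, folding its
# tokens in (broadening the persona), until no row clears the threshold.
changed = True
while changed:
    changed = False
    for row in rows:
        if row in labels:
            continue
        persona, score = max(
            ((p, jaccard(tokens(row), vocab)) for p, vocab in clusters.items()),
            key=lambda x: x[1])
        if score >= THRESHOLD:
            labels[row] = persona
            clusters[persona].update(tokens(row))
            changed = True

# Anything that never clears the threshold hints at an unmodeled persona.
for row in rows:
    labels.setdefault(row, "Unknown")
print(labels)
```

As in the rounds on the next slides, each pass broadens a persona's definition, which lets previously unmatched rows clear the threshold in a later pass.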
Round 1 (Rules)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder
Lucas M    Growth Ninja
Bec G      Tech Guru
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker
Art L      Data Analyst
Round 2 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.45
Lucas M    Growth Ninja    0.11
Bec G      Tech Guru       0.71               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.87               Value
Art L      Data Analyst    0.41
Round 3 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.68               Security
Lucas M    Growth Ninja    0.18
Bec G      Tech Guru       0.86               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.89               Value
Art L      Data Analyst    0.72               Security
Round 4 (LSA)
Name       Title           Similarity Score   Persona
Luke J     VP Marketing    1.0                Value
Randy P    Founder         0.71               Security
Lucas M    Growth Ninja    0.16               Unknown
Bec G      Tech Guru       0.88               Security
Fiona F    Sysops          1.0                Security
Claude S   Growth Hacker   0.91               Value
Art L      Data Analyst    0.78               Security
Example
Takeaways
• Human-generated data is never really random
• Small data models are hyper-sensitive
• Validate assumptions
Questions?
Michael Thorne
[email protected]