""into the wild" ... with natural language processing and text classification",...

24
…with Natural Language Processing and Text Classification Data Natives 2015 19.11.2015 - Peter Grosskopf


TRANSCRIPT

Page 1: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

…with Natural Language

Processing and Text Classification

Data Natives 2015 19.11.2015 - Peter Grosskopf

Page 2: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Hey, I’m Peter.

Developer (mostly Ruby), Founder (of Zweitag), Chief Development Officer @ HitFox Group

Department "Tech & Development" (TechDev)

Page 3: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Company Builder with 500+ employees in AdTech, FinTech and Big Data

Page 4: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Company Builder =

💡 Ideas + 👥 People

Page 5: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

How do we select the best people out of more than 1000 applications every month in a consistent way?

Machine Learning?

Page 6: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Yeah!

I found a solution

Not really

💩

Page 7: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Our Goal

Add a sort-by-relevance to lower screening costs and invite people faster

Page 8: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Let’s Go!

Page 9: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Action Steps

1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results

Page 10: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

1. Prepare

Load data
Kick out outliers
Clean out stopwords (language detection + stemming with NLTK); a sketch follows below
Define classes for workflow states
Link data
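A minimal sketch of this preparation step, assuming English and German resumes; NLTK supplies the stopword lists and the stemmer, while langdetect and the whitespace tokenization are assumptions the slide does not spell out:

    # Minimal sketch of the preparation step: detect the language of a resume,
    # drop stopwords and stem the remaining tokens with NLTK.
    # langdetect and the English/German language set are assumptions, not named in the talk.
    import nltk
    from langdetect import detect
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    nltk.download("stopwords", quiet=True)

    LANGS = {"en": "english", "de": "german"}          # assumed resume languages

    def preprocess(text):
        lang = LANGS.get(detect(text), "english")      # language detection, default to English
        stops = set(stopwords.words(lang))
        stemmer = SnowballStemmer(lang)
        tokens = text.lower().split()                  # naive whitespace tokenization
        return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]

    print(preprocess("I am a nice little text about Ruby development"))
    # -> ['nice', 'littl', 'text', 'rubi', 'develop']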

Page 11: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

2. Build a model

tf-idf / bag of words

tf: term frequency
idf: inverse document frequency

Page 12: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Transform / Quantization

from textual form to a numerical vector

I am a nice little text

-> v(i, am, a, nice, little, text)

-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)

Page 13: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

term-frequency (tf)

Count how often each term occurs in the document

I am a nice little text

-> v(i, am, a, nice, little, text)

-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)

Page 14: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

inverse document frequency (idf)

Count how often a term occurs in the whole document set and invert it with the logarithm

d1(I play a fun game) -> v1(i, play, a, fun, game)

d2(I am a nice little text) -> v2(i, am, a, nice, little, text)

-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …)
-> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
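The worked example above can be reproduced in a few lines; note that the slide's numbers fall out of a plain, unsmoothed idf with a base-10 logarithm (scikit-learn's default idf is smoothed and uses the natural logarithm, so its values differ slightly):

    # Reproduce the slide's tf-idf example by hand (unsmoothed idf, base-10 logarithm).
    import math

    docs = [["i", "play", "a", "fun", "game"],
            ["i", "am", "a", "nice", "little", "text"]]

    def tf_idf(term, doc, docs):
        tf = doc.count(term)                             # term frequency in this document
        df = sum(1 for d in docs if term in d)           # number of documents containing the term
        return tf * math.log10(len(docs) / df)           # idf = log(N / df)

    d2 = docs[1]
    print([round(tf_idf(t, d2, docs), 1) for t in d2])
    # -> [0.0, 0.3, 0.0, 0.3, 0.3, 0.3]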

Page 15: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

bag of words

Simple approach to calculate the frequency of relevant terms

Ignores contextual information 😢

better: n-grams

Page 16: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

n-grams

Generate new tokens by concatenating neighbouring tokens

example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
-> From three tokens we just generated five tokens.

example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
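A small helper that reproduces both examples, joining neighbouring tokens with an underscore as on the slide:

    # Generate 1- to max_n-grams by concatenating neighbouring tokens with "_".
    def ngrams(tokens, max_n=2):
        out = []
        for i in range(len(tokens)):
            for n in range(1, max_n + 1):
                if i + n <= len(tokens):
                    out.append("_".join(tokens[i:i + n]))
        return out

    print(ngrams(["nice", "little", "text"]))
    # -> ['nice', 'nice_little', 'little', 'little_text', 'text']
    print(ngrams(["new", "york", "is", "a", "nice", "city"]))
    # -> ['new', 'new_york', 'york', 'york_is', 'is', 'is_a', 'a', 'a_nice', 'nice', 'nice_city', 'city']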

Page 17: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

vectorize the resumes

Build 1- to 4-grams with scikit-learn's (sklearn) TfidfVectorizer
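In scikit-learn this is essentially a one-liner; the only parameter the slide fixes is the 1-to-4 n-gram range, everything else below stays at the library's defaults and the two documents are placeholders:

    # Vectorize the resumes with 1- to 4-grams using sklearn's TfidfVectorizer.
    from sklearn.feature_extraction.text import TfidfVectorizer

    resumes = ["I am a nice little text", "New York is a nice city"]   # placeholder documents

    vectorizer = TfidfVectorizer(ngram_range=(1, 4))   # word 1-grams up to 4-grams, as on the slide
    X = vectorizer.fit_transform(resumes)              # sparse matrix: documents x n-gram features
    print(X.shape)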

Page 18: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Define runtime

Train-test-split by date (80/20)

Approach: Randomly pick CVs out of the test group
Count how many CVs have to be screened to find all the good CVs
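The slide only names the ingredients of the evaluation; one way to read it is sketched below, with made-up field names: sort by application date, cut at 80/20, then count, for a given ordering of the test group (randomly picked CVs as the baseline), how many CVs must be screened before all good ones are found.

    # Sketch of the evaluation setup: date-based 80/20 split plus the screening-cost count.
    # The dict keys "date" and "good" are hypothetical field names, not from the talk.
    def split_by_date(cvs, train_fraction=0.8):
        cvs = sorted(cvs, key=lambda cv: cv["date"])    # oldest applications first
        cut = int(len(cvs) * train_fraction)
        return cvs[:cut], cvs[cut:]                     # train on the past, test on the newest 20%

    def cvs_to_screen(ordered_test_cvs):
        """Number of CVs that have to be screened, in the given order, until every good CV is found."""
        last_good = max(i for i, cv in enumerate(ordered_test_cvs) if cv["good"])
        return last_good + 1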

Page 19: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

3. Run it!

After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial naive Bayes, a stochastic gradient descent classifier, logistic regression or a random forest)
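Each model named here maps to an off-the-shelf scikit-learn estimator; a minimal sketch, assuming X_train, X_test, y_train and y_test come from the vectorization and the date-based split above:

    # Fit the classifiers named on the slide on the vectorized resumes.
    # X_train/X_test and the labels are assumed from the previous steps.
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import SGDClassifier, LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    models = {
        "multinomial naive Bayes": MultinomialNB(),
        "SGD classifier": SGDClassifier(loss="log_loss"),        # logistic loss so predict_proba works
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(n_estimators=100),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        scores = model.predict_proba(X_test)[:, 1]               # probability of the "good CV" class
        print(name, roc_auc_score(y_test, scores))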

Page 20: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

4. Results

Generated with a combination of a stochastic gradient descent classifier and logistic regression, using the Python machine-learning library scikit-learn

AUC: 73.0615 %
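The slide does not say how the two models were combined; averaging their predicted probabilities is one plausible reading, shown here as an assumption rather than the talk's exact recipe:

    # One plausible blend of the two models: average the predicted probabilities
    # and evaluate with ROC AUC. X_train/X_test and labels are assumed as above.
    from sklearn.linear_model import SGDClassifier, LogisticRegression
    from sklearn.metrics import roc_auc_score

    sgd = SGDClassifier(loss="log_loss").fit(X_train, y_train)
    logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    blend = (sgd.predict_proba(X_test)[:, 1] + logreg.predict_proba(X_test)[:, 1]) / 2
    print("AUC: {:.2%}".format(roc_auc_score(y_test, blend)))    # the talk reports an AUC of about 73%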

Page 21: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Wrap Up

1. Prepare: import data, clean data
2. Build Model: vectorize the CVs with 1 to 4 n_grams, define train-test-split
3. Run: choose Machine Learning model, run it!
4. Interpret: visualize results, Area under curve (AUC)

Page 22: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Conclusion

After trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words is still the best

Explanation: CV documents do not contain much semantic structure

Page 23: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Outlook

Build a better database
Experiment with new approaches and tune models
Build a continuous learning model

Page 24: ""Into the Wild" ... with Natural Language Processing and Text Classification", Peter Grosskopf, Chief Development Officer at HitFox

Happy End. Thanks :-)