""into the wild" ... with natural language processing and text classification",...
TRANSCRIPT
…with Natural Language
Processing and Text Classification
Data Natives 2015 19.11.2015 - Peter Grosskopf
Hey, I’m Peter.
Developer (mostly Ruby), Founder (of Zweitag), Chief Development Officer @ HitFox Group
Department "Tech & Development" (TechDev)
Company Builder with 500+ employees
in AdTech, FinTech and Big Data
Company Builder =
💡Ideas + 👥People
How do we select the best people out of more than 1000 applications every month in a consistent way?
? ? ?
Machine Learning?
Yeah!
I found a solution
Not really
💩
Our Goal
Add a sort-by-relevance to lower the screening costs and invite people faster
Let’s Go!
Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results
1. Prepare
Load data
Kick out outliers
Clean out stopwords (language detection + stemming with NLTK; see the sketch below)
Define classes for workflow states
Link data
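A minimal sketch of this cleaning step in Python, assuming NLTK's stopword lists and SnowballStemmer (the "punkt" and "stopwords" corpora must be downloaded first); the stopword-overlap language check is one common heuristic, not necessarily the exact pipeline from the talk:

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

LANGS = ["english", "german"]  # candidate languages, illustrative

def detect_language(tokens):
    # Heuristic: the language whose stopword list overlaps the tokens most.
    return max(LANGS, key=lambda lang: len(set(tokens) & set(stopwords.words(lang))))

def prepare(text):
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    lang = detect_language(tokens)
    stop = set(stopwords.words(lang))
    stemmer = SnowballStemmer(lang)
    # Drop stopwords, stem the rest.
    return [stemmer.stem(t) for t in tokens if t not in stop]

print(prepare("I am a nice little text about machine learning"))
# -> ['nice', 'littl', 'text', 'machin', 'learn']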
2. Build a model
tf-idf / bag of words
tf: term frequency, idf: inverse document frequency
Transform / Quantization
from textual form to a numerical vector form
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
term-frequency (tf)
Count occurrences in document
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
inverse document frequency (idf)
Count how often a term occurs in the whole document set and invert
with the logarithm
d1(I play a fun game) -> v1(i, play, a, fun, game)
d2(I am a nice little text) -> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …) -> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
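The same worked example recomputed in plain Python; the base-10 logarithm is an assumption, chosen because log10(2/1) ≈ 0.3 matches the values above:

from math import log10

docs = [
    ["i", "play", "a", "fun", "game"],           # d1
    ["i", "am", "a", "nice", "little", "text"],  # d2
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                    # occurrences in this document
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * log10(len(docs) / df)       # idf = log(N / df)

print([round(tf_idf(t, docs[1], docs), 1) for t in docs[1]])
# -> [0.0, 0.3, 0.0, 0.3, 0.3, 0.3]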
bag of words
Simple approach to calculate the frequency of relevant terms
Ignores contextual information 😢
better: n-grams
n-grams
Generate new tokens by concatenating neighbouring tokens
Example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
From three tokens we just generated five tokens.
Example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
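A small sketch of this token generation; the underscore join and the interleaved output order mirror the examples above:

def ngrams(tokens, n_max=2):
    # For each position, emit the 1-gram up to the n_max-gram starting there.
    out = []
    for i in range(len(tokens)):
        for n in range(1, n_max + 1):
            if i + n <= len(tokens):
                out.append("_".join(tokens[i:i + n]))
    return out

print(ngrams(["nice", "little", "text"]))
# -> ['nice', 'nice_little', 'little', 'little_text', 'text']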
Vectorize the resumes
Build 1- to 4-grams with the scikit-learn (sklearn) TfidfVectorizer, as sketched below
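A minimal sketch of that vectorization with scikit-learn's TfidfVectorizer; the two short documents stand in for the real CV texts:

from sklearn.feature_extraction.text import TfidfVectorizer

resumes = ["I am a nice little text", "new york is a nice city"]

vectorizer = TfidfVectorizer(ngram_range=(1, 4))  # unigrams up to 4-grams
X = vectorizer.fit_transform(resumes)             # sparse document-term matrix
print(X.shape)  # -> (2, number of distinct n-gram features)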
Define runtime
Train-test split by date (80/20), as sketched below
Evaluation approach: pick CVs at random out of the test group and count how many CVs have to be screened to find all the good CVs
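A sketch of the 80/20 split by date, assuming every application record carries a date; the pandas usage and column names are illustrative:

import pandas as pd

df = pd.DataFrame({
    "text": ["cv one", "cv two", "cv three", "cv four", "cv five"],
    "date": pd.to_datetime(["2015-01-10", "2015-02-01", "2015-03-15",
                            "2015-04-20", "2015-05-05"]),
})

df = df.sort_values("date")   # oldest applications first
cut = int(len(df) * 0.8)      # 80/20 boundary
train, test = df.iloc[:cut], df.iloc[cut:]
print(len(train), len(test))  # -> 4 1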
3. Run it!
After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial naive Bayes, a stochastic gradient descent classifier, logistic regression, or a random forest), as sketched below.
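A minimal end-to-end sketch of this step with scikit-learn; the texts and labels are placeholders, and SGDClassifier with logistic loss (named "log_loss" in recent scikit-learn releases, "log" in older ones) stands in for the models listed above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

texts = ["strong cv text ...", "weak cv text ...", "strong cv text again ..."]
labels = [1, 0, 1]  # illustrative workflow classes: 1 = invited, 0 = rejected

X = TfidfVectorizer(ngram_range=(1, 4)).fit_transform(texts)
clf = SGDClassifier(loss="log_loss")  # logistic loss enables predict_proba
clf.fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # relevance scores usable for sorting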
4. Results
Generated with a combination of a stochastic gradient descent classifier and logistic regression, using the Python machine learning library scikit-learn
AUC: 73.0615 %
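For reference, this metric can be computed with scikit-learn's roc_auc_score; the labels and scores below are made up:

from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0]             # actual screening outcomes
y_score = [0.9, 0.4, 0.7, 0.2, 0.3]  # model relevance scores
print(roc_auc_score(y_true, y_score))  # -> ~0.67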
Wrap Up
1. Prepare: import data, clean data
2. Build Model: vectorize the CVs with 1 to 4 n-grams, define train-test split
3. Run: choose machine learning model, run it!
4. Interpret: visualize results, area under curve (AUC)
Conclusion
After trying many different approaches (doc2vec, Recurrent Neural Networks, feature hashing), bag of words is still the best.
Explanation: CV documents do not contain much semantic structure.
Outlook
Build a better database
Experiment with new approaches and tune models
Build a continuous learning model
Happy End. Thanks :-)