""into the wild" ... with natural language processing and text classification",...
TRANSCRIPT
…with Natural Language
Processing and Text Classification
Data Natives 2015 19.11.2015 - Peter Grosskopf
Hey, I’m Peter.
Developer (mostly Ruby), Founder (of Zweitag), Chief Development Officer @ HitFox Group
Department "Tech & Development" (TechDev)
Company Builder with 500+ employees
in AdTech, FinTech and Big Data
Company Builder =
💡Ideas + 👥People
How do we select the best people out of more than 1000 applications every month in a consistent way?
? ? ?
Machine Learning?
Yeah!
I found a solution
Not really
💩
Our Goal
Add a sort-by-relevance to lower the screening costs and invite people faster
Let’s Go!
Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results
1. Prepare
Load data
Kick out outliers
Clean out stopwords (language detection + stemming with NLTK; see the sketch below)
Define classes for workflow states
Link data
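A minimal sketch of this cleaning step in Python, assuming NLTK's stopword lists and SnowballStemmer (the "punkt" and "stopwords" corpora must be downloaded first); the stopword-overlap language check is one common heuristic, not necessarily the exact pipeline from the talk:

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

LANGS = ["english", "german"]  # candidate languages, illustrative

def detect_language(tokens):
    # Heuristic: the language whose stopword list overlaps the tokens most.
    return max(LANGS, key=lambda lang: len(set(tokens) & set(stopwords.words(lang))))

def prepare(text):
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    lang = detect_language(tokens)
    stop = set(stopwords.words(lang))
    stemmer = SnowballStemmer(lang)
    # Drop stopwords, stem the rest.
    return [stemmer.stem(t) for t in tokens if t not in stop]

print(prepare("I am a nice little text about machine learning"))
# -> ['nice', 'littl', 'text', 'machin', 'learn']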
2. Build a model
tf-idf / bag of words
tf: term frequency, idf: inverse document frequency
Transform / Quantization
from textual form to a numerical vector form
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
term-frequency (tf)
Count occurrences in document
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
inverse document frequency (idf)
Count how often a term occurs in the whole document set and invert
with the logarithm
d1(I play a fun game) -> v1(i, play, a, fun, game)
d2(I am a nice little text) -> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …) -> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
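The same worked example recomputed in plain Python; the base-10 logarithm is an assumption, chosen because log10(2/1) ≈ 0.3 matches the values above:

from math import log10

docs = [
    ["i", "play", "a", "fun", "game"],           # d1
    ["i", "am", "a", "nice", "little", "text"],  # d2
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                    # occurrences in this document
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * log10(len(docs) / df)       # idf = log(N / df)

print([round(tf_idf(t, docs[1], docs), 1) for t in docs[1]])
# -> [0.0, 0.3, 0.0, 0.3, 0.3, 0.3]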
bag of words
Simple approach to calculate the frequency of relevant terms
Ignores contextual information 😢
better: n-grams
n-grams
Generate new tokens by concatenating neighbouring tokens
Example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
From three tokens we just generated five tokens.
Example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
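A small sketch of this token generation; the underscore join and the interleaved output order mirror the examples above:

def ngrams(tokens, n_max=2):
    # For each position, emit the 1-gram up to the n_max-gram starting there.
    out = []
    for i in range(len(tokens)):
        for n in range(1, n_max + 1):
            if i + n <= len(tokens):
                out.append("_".join(tokens[i:i + n]))
    return out

print(ngrams(["nice", "little", "text"]))
# -> ['nice', 'nice_little', 'little', 'little_text', 'text']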
Vectorize the resumes
Build 1- to 4-grams with the scikit-learn (sklearn) TfidfVectorizer, as sketched below
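A minimal sketch of that vectorization with scikit-learn's TfidfVectorizer; the two short documents stand in for the real CV texts:

from sklearn.feature_extraction.text import TfidfVectorizer

resumes = ["I am a nice little text", "new york is a nice city"]

vectorizer = TfidfVectorizer(ngram_range=(1, 4))  # unigrams up to 4-grams
X = vectorizer.fit_transform(resumes)             # sparse document-term matrix
print(X.shape)  # -> (2, number of distinct n-gram features)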
Define runtime
Train-test split by date (80/20), as sketched below
Evaluation approach: pick CVs at random out of the test group and count how many CVs have to be screened to find all the good CVs
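A sketch of the 80/20 split by date, assuming every application record carries a date; the pandas usage and column names are illustrative:

import pandas as pd

df = pd.DataFrame({
    "text": ["cv one", "cv two", "cv three", "cv four", "cv five"],
    "date": pd.to_datetime(["2015-01-10", "2015-02-01", "2015-03-15",
                            "2015-04-20", "2015-05-05"]),
})

df = df.sort_values("date")   # oldest applications first
cut = int(len(df) * 0.8)      # 80/20 boundary
train, test = df.iloc[:cut], df.iloc[cut:]
print(len(train), len(test))  # -> 4 1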
3. Run it!
After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial naive Bayes, a stochastic gradient descent classifier, logistic regression, or a random forest), as sketched below.
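A minimal end-to-end sketch of this step with scikit-learn; the texts and labels are placeholders, and SGDClassifier with logistic loss (named "log_loss" in recent scikit-learn releases, "log" in older ones) stands in for the models listed above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

texts = ["strong cv text ...", "weak cv text ...", "strong cv text again ..."]
labels = [1, 0, 1]  # illustrative workflow classes: 1 = invited, 0 = rejected

X = TfidfVectorizer(ngram_range=(1, 4)).fit_transform(texts)
clf = SGDClassifier(loss="log_loss")  # logistic loss enables predict_proba
clf.fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # relevance scores usable for sorting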
4. Results
Generated with a combination of a stochastic gradient descent classifier and logistic regression, using the Python machine learning library scikit-learn
AUC: 73.0615 %
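For reference, this metric can be computed with scikit-learn's roc_auc_score; the labels and scores below are made up:

from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0]             # actual screening outcomes
y_score = [0.9, 0.4, 0.7, 0.2, 0.3]  # model relevance scores
print(roc_auc_score(y_true, y_score))  # -> ~0.67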
Wrap Up
1. Prepare: import data, clean data
2. Build Model: vectorize the CVs with 1 to 4 n-grams, define train-test split
3. Run: choose machine learning model, run it!
4. Interpret: visualize results, area under curve (AUC)
Conclusion
After trying many different approaches (doc2vec, Recurrent Neural Networks, feature hashing), bag of words is still the best.
Explanation: CV documents do not contain much semantic structure.
Outlook
Build a better database
Experiment with new approaches and tune models
Build a continuous learning model
Happy End. Thanks :-)