extracting keywords from text - mit csail · extracting keywords from text lavanya sharan insight...
Extracting keywords from text
Lavanya Sharan Insight Data Science Fellow
with
matching ads to content
keywords
matching ad
?
A machine learning model for keyword extraction
Content → Candidate keywords → Feature extraction → Keyword classifier → Keywords
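The first stage of the pipeline, turning content into candidate keywords, could be sketched as below. The slides do not say how candidates were generated, so plain word n-grams are an assumption here, and `candidate_keywords` and `max_n` are hypothetical names. Stop-word filtering and noun-phrase detection are left out.

```python
import re

def candidate_keywords(text, max_n=3):
    """Return unique word n-grams (n = 1..max_n) as candidate keywords."""
    # Split into sentences first so that n-grams never span a sentence boundary.
    sentences = re.split(r"[.!?]+", text)
    seen, candidates = set(), []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = " ".join(words[i:i + n])
                # De-duplicate case-insensitively, keeping the first spelling seen.
                if gram.lower() not in seen:
                    seen.add(gram.lower())
                    candidates.append(gram)
    return candidates

cands = candidate_keywords("Brooklyn is a borough. Brooklyn has bridges.", max_n=2)
```

Every candidate produced this way would then be passed on to feature extraction.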
Brooklyn
Term length, Wikipedia freq, TF-IDF score, ...
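A minimal sketch of two of these features, term length and TF-IDF score, for a candidate such as "Brooklyn". The smoothed IDF formula, the whitespace tokenizer, and the toy corpus are assumptions for illustration, not the talk's actual implementation. Wikipedia frequency is omitted because it needs an external lookup table.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption, not the talk's method).
    return text.lower().split()

def tf_idf(term, doc, corpus):
    """TF-IDF of a single-word term in `doc`, using a smoothed IDF."""
    tokens = tokenize(doc)
    tf = Counter(tokens)[term.lower()] / max(len(tokens), 1)
    # Document frequency: number of corpus documents containing the term.
    df = sum(1 for d in corpus if term.lower() in tokenize(d))
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

def features(term, doc, corpus):
    return {"term_length": len(term), "tf_idf": tf_idf(term, doc, corpus)}

corpus = [
    "brooklyn has a famous bridge",
    "the bridge is old",
    "the city is large",
]
f_brooklyn = features("Brooklyn", corpus[0], corpus)
f_bridge = features("bridge", corpus[0], corpus)
```

In this toy corpus "Brooklyn" appears in fewer documents than "bridge", so it gets the higher TF-IDF score.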
Logistic regression
Crowd500: 500 news articles
Human-annotated keywords
9:1 training-test split
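A 9:1 split of 500 articles gives 450 for training and 50 for testing. The sketch below assumes the split is made at the article level with a fixed random seed; the slides do not state either detail, and `train_test_split` here is a hypothetical helper, not a library call.

```python
import random

def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle `items` reproducibly and hold out `test_fraction` of them."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

article_ids = list(range(500))  # stand-ins for the 500 Crowd500 articles
train_ids, test_ids = train_test_split(article_ids)
```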
P(keyword) = 0.81
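A logistic regression classifier turns a candidate's feature values into a probability like the one shown here. The sketch below applies the logistic function to a weighted sum of features. The weights, bias, and feature values are made up for illustration and are not the trained model from the talk, so the result will not match 0.81.

```python
import math

def p_keyword(features, weights, bias):
    """Logistic regression score: sigmoid of bias plus weighted feature sum."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights; a real model would learn these from labeled data.
weights = {"term_length": 0.1, "tf_idf": 2.0}
bias = -1.0

p = p_keyword({"term_length": 8, "tf_idf": 0.34}, weights, bias)
is_keyword = p >= 0.5  # the 0.5 threshold is also an assumption
```

Raising or lowering the threshold trades precision against recall, which is one way to obtain results "for a range of parameters".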
nltk
State-of-the-art performance on Crowd500
About me
glass
Keywords: Lavanya, image recognition, foodie
Keyword classifier
“Brooklyn”
Term frequency, TF-IDF score, Wikipedia frequency
Term length, Capitalized?
Position in page, Spread in page
Named entity?, Noun phrase?, Ngram?
Logistic regression model
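Three of the simpler features listed above (capitalized, position in page, spread in page) could be computed roughly as follows. The exact definitions are assumptions: position is taken as the relative offset of the first occurrence, and spread as the distance between the first and last occurrence divided by the page length. `heuristic_features` is a hypothetical helper, and it uses plain substring matching, which would also match a term inside longer words.

```python
def heuristic_features(term, text):
    """Capitalization, relative first position, and spread of `term` in `text`."""
    lowered, needle = text.lower(), term.lower()
    # Collect the character offsets of every (case-insensitive) occurrence.
    positions = []
    start = lowered.find(needle)
    while start != -1:
        positions.append(start)
        start = lowered.find(needle, start + 1)
    capitalized = term[:1].isupper()
    if not positions:
        # Term absent: treat it as appearing at the very end with no spread.
        return {"capitalized": capitalized, "position": 1.0, "spread": 0.0}
    length = max(len(text), 1)
    return {
        "capitalized": capitalized,
        "position": positions[0] / length,
        "spread": (positions[-1] - positions[0]) / length,
    }

text = ("Brooklyn is a borough of New York. "
        "Many visitors walk across the bridge to Brooklyn.")
f = heuristic_features("Brooklyn", text)
```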
In-sample: 65%, out-of-sample: 65%, chance: 50%
Simple heuristics beat complex features
All stages of model required for performance
Beats AlchemyAPI for a range of parameters