[smartnews] globally scalable web document classification using word2vec
TRANSCRIPT
Globally Scalable Web Document Classification Using Word2Vec
Kohei Nakaji (SmartNews)
keyword: machine learning for discovery
SmartNews Demo
About SmartNews
Japan: launched 2013; 4M+ monthly active users; 50% DAU/MAU; 100+ publishers; 2013 App of the Year
US: launched Oct 2014; 1M+ monthly active users; same engagement; 80+ publishers; top app in the News category
International: launched Feb 2015; 10M downloads worldwide; same engagement; English beta; featured app
Funding: $50M
Outline of Our Algorithm
Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → 1000+ articles/day
Web Document Classification
Task definition: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
Categories: ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD, …
Web Document Classification
There are roughly two steps:
① Main Content Extraction
② Text Classification
Main Content Extraction
Two approaches:
・Extract after rendering the whole page (easier, but takes time)
・Extract from HTML (difficult, but fast)
Our Approach
Main Content Extraction from HTML
Example (the <p> blocks are main content; the surrounding link text is not):
<html> <body><div>click <a>here</a> for </div> <div><a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p><a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p> </div> </body> </html>
Main Content Extraction from HTML
A rule-based extraction algorithm is possible.
English:
Rule 1: a div with text length > 200 and fewer than 3 ‘a’ tags is main content
Rule 2: a div with text length < 100 and more than 4 ‘p’ tags is main content
…
Rule N: …
But this is not scalable: Japanese (and every other language) would need its own rule set.
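Rule 1 above can be sketched in a few lines. This is a minimal illustration, not SmartNews's production code: the regex-based tag stripping and the exact thresholds are only stand-ins for the rule as stated on the slide.

```python
import re

def rule1_is_main_content(div_html: str) -> bool:
    """Illustrative Rule 1: a <div> whose text is longer than 200
    characters and which contains fewer than 3 <a> tags is treated
    as main content."""
    text = re.sub(r"<[^>]+>", "", div_html)            # strip tags
    num_anchors = len(re.findall(r"<a[\s>]", div_html))
    return len(text) > 200 and num_anchors < 3

# A long article paragraph with no links passes the rule:
article = "<div><p>" + "Robert Bates was a volunteer deputy. " * 10 + "</p></div>"
nav = "<div>click <a>here</a> for <a>more</a> <a>links</a></div>"
print(rule1_is_main_content(article))  # True
print(rule1_is_main_content(nav))      # False
```

Each additional rule would be another such predicate, which is exactly why the approach stops scaling across languages.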
Main Content Extraction from HTML
We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
① training: block separation & feature extraction turns labeled pages into blocks (block1: (features, main), block2: (features, not main), block3: (features, main), …), which train a decision tree
② live data: block separation & feature extraction yields (features) per block (block1, block2, block3, …), and the decision tree labels each block
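A minimal sketch of the per-block decision step. A hand-written stump stands in for the trained decision tree; the feature names and thresholds are illustrative, not the learned model:

```python
# Toy decision tree (a single stump here) over per-block features.
# Feature names and thresholds are illustrative only.

def classify_block(features: dict) -> str:
    """Mimics a trained tree: long, link-poor blocks are 'main'."""
    if features["word_count"] > 30:
        if features["num_a_tags"] < 3:
            return "main"
        return "not main"
    return "not main"

blocks = [
    {"word_count": 4,  "num_a_tags": 1},   # "click here for"
    {"word_count": 36, "num_a_tags": 0},   # article paragraph
]
print([classify_block(b) for b in blocks])  # ['not main', 'main']
```

In practice the tree is learned from labeled (features, main / not main) pairs rather than written by hand.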
Feature Extraction from HTML
<html> <body><div>click <a>here</a> for </div><div><a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office. </p><a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html>
Step 1: Separate the HTML into ‘text blocks’
Step 2: Extract local features for every text block
ex: word count = 36, num of <a> = 0
Step 3: Define the feature of each text block as a combination of local features
ex: word count (current block): 36, num of <a> (current block): 0, word count (previous block): 4, num of <a> (previous block): 1, …
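The three steps can be sketched with Python's standard html.parser. The block-boundary rule (split on <div>/<p>) and the two feature names are simplifications of the Kohlschütter-style shallow features, chosen for illustration:

```python
from html.parser import HTMLParser

class BlockExtractor(HTMLParser):
    """Step 1: split HTML into text blocks; count words and <a> tags."""
    def __init__(self):
        super().__init__()
        self.blocks, self._words, self._anchors = [], 0, 0
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchors += 1
        if tag in ("div", "p"):          # block boundary (simplified)
            self._flush()
    def handle_data(self, data):
        self._words += len(data.split())
    def _flush(self):
        if self._words:
            self.blocks.append({"word_count": self._words,
                                "num_a_tags": self._anchors})
        self._words = self._anchors = 0
    def close(self):
        super().close()
        self._flush()

def block_features(html: str):
    """Steps 2-3: local features plus the previous block's features."""
    p = BlockExtractor(); p.feed(html); p.close()
    prev = {"word_count": 0, "num_a_tags": 0}
    out = []
    for b in p.blocks:
        out.append({**b,
                    "prev_word_count": prev["word_count"],
                    "prev_num_a_tags": prev["num_a_tags"]})
        prev = b
    return out

feats = block_features("<div>click <a>here</a> for</div><p>one two three four five</p>")
```

Each dict in `feats` is the feature vector for one text block, ready for the decision tree.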
Making Main Content Using the Decision Tree
block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main
Text Classification
Ordinary text classification architecture:
① training: feature extraction turns labeled documents into (features, entertainment), (features, sports), (features, politics), …, and a training algorithm produces a classifier
② live data: feature extraction yields (features) for an unlabeled document, and the classifier outputs a category, e.g. sports
Feature Extraction in Text Classification
Example: “Will LeBron James deliver an NBA championship to Cleveland?”
‘Bag-of-words’ is commonly used as a feature vector, with some feature engineering:
・stop words (Will, an, to, …) are dropped
・a sports-players dictionary maps “LeBron James” to NBA_PLAYER
・the remaining counts are weighted by tf-idf
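A minimal sketch of this pipeline (stop-word removal, dictionary substitution, tf-idf weighting). The stop-word list and the player dictionary below are toy stand-ins for the real resources:

```python
import math
from collections import Counter

STOP_WORDS = {"will", "an", "to", "the", "a"}   # illustrative
PLAYER_DICT = {"lebron james"}                   # illustrative

def tokenize(text: str):
    tokens = text.lower().replace("?", "").split()
    # dictionary lookup: collapse known player names to one token
    joined, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and f"{tokens[i]} {tokens[i+1]}" in PLAYER_DICT:
            joined.append("NBA_PLAYER"); i += 2
        else:
            joined.append(tokens[i]); i += 1
    return [t for t in joined if t not in STOP_WORDS]

def tf_idf(docs):
    """Bag-of-words with tf-idf weighting over a tiny corpus."""
    bags = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(w for bag in bags for w in bag)
    return [{w: c / sum(bag.values()) * math.log(n / df[w])
             for w, c in bag.items()} for bag in bags]

vectors = tf_idf(["Will LeBron James deliver an NBA championship to Cleveland?",
                  "Cleveland will host the NBA finals"])
```

Words that appear in every document (here “cleveland”, “nba”) get zero weight, which is the point of the idf term.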
Feature Extraction in Text Classification
The same is done in Japanese.
Example: “私は中路です。よろしくお願いします。” (“I am Nakaji. Nice to meet you.”)
The text is tokenized (私 / は / 中路 / です / よろしくお願い / し / ます), stop words are dropped, a person dictionary maps 中路 to PERSON, and tf-idf weighting is applied.
Another Option: Paragraph Vector
Example:
“私は中路です。よろしくお願いします。” → [0.2, 0.3, …, 0.2]
“Will LeBron James deliver an NBA championship to Cleveland?” → [0.1, 0.4, …, 0.1]
Each document maps to a Paragraph Vector (dimension ~ several hundred).
Outline of Distributed Representation
・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/)
・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053)
Word Vector in the word2vec Model
Every word is mapped to a unique word vector with good properties, e.g.:
v_Germany = [0.1, 0.2, …, 0.2]
v_Berlin = [0.1, 0.1, …, -0.1]
v_France = [0.3, 0.4, …, 0]
v_Paris = [0.3, 0.3, …, 0.3]
…
and these vectors satisfy analogies such as “Germany - Berlin = France - Paris”.
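The analogy arithmetic can be demonstrated with hand-set toy vectors. The values below are invented for illustration; real word2vec vectors have hundreds of learned dimensions:

```python
import math

# Toy 3-d vectors chosen so the capital-of direction is shared.
vec = {
    "Germany": [0.9, 0.1, 0.3], "Berlin": [0.9, 0.8, 0.3],
    "France":  [0.2, 0.1, 0.7], "Paris":  [0.2, 0.8, 0.7],
}

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Germany - Berlin + Paris should land near France:
query = [g - b + p for g, b, p in zip(vec["Germany"], vec["Berlin"], vec["Paris"])]
best = max((w for w in vec if w != "Paris"), key=lambda w: cosine(query, vec[w]))
print(best)  # France
```

With real embeddings the same nearest-neighbor query over the whole vocabulary recovers the famous analogy results.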
Procedure to Create Word Vectors
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)
Training corpus example: “A cat sat on the street.” … “I love cat very much.” “He comes from Japan.” …
Word vectors are trained so that each becomes a good feature for predicting the surrounding words.
Objective function (CBOW case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c})
Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \quad v = \sum_{t' \neq t,\, -c \leq t'-t \leq c} v_{w_{t'}}
Procedure:
① Maximize L for u_w and v_w.
② v_w is the word vector for w.
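The sum-case model above can be computed directly. The vocabulary and vectors here are hand-set toys, and tying u = v is a simplification for brevity (training learns separate input and output vectors):

```python
import math

# Hand-set 2-d vectors for a toy vocabulary; real word2vec learns
# these by maximizing L with SGD over a large corpus.
v = {"a": [0.1, 0.2], "cat": [0.4, 0.1], "sat": [0.3, 0.3],
     "on": [0.0, 0.2], "the": [0.1, 0.0], "street": [0.2, 0.4]}
u = {w: vec[:] for w, vec in v.items()}   # tied u = v, for brevity

def p_center(center, context):
    """P(w_t | context) = exp(u_{w_t} . v) / sum_W exp(u_W . v),
    where v is the sum of the context word vectors (sum case)."""
    ctx = [sum(v[w][k] for w in context) for k in range(2)]
    scores = {w: math.exp(sum(uk * ck for uk, ck in zip(u[w], ctx)))
              for w in u}
    return scores[center] / sum(scores.values())

# Probability of each vocabulary word being the center of this window:
probs = {w: p_center(w, ["a", "cat", "on", "the"]) for w in v}
```

Because the denominator sums over the whole vocabulary, the probabilities form a proper softmax distribution; real implementations replace it with hierarchical softmax or negative sampling for speed.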
Procedure to Create Paragraph Vectors
Add a vector to the model for each document:
doc_1: “A cat sat on the street.” …
doc_2: “I love cat very much.” “He comes from Japan.” …
Objective function (dbow case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i)
Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \quad v = \sum_{t' \neq t,\, -c \leq t'-t \leq c} v_{w_{t'}} + d_i
where doc_i is the document that contains w_t and d_i is its document vector.
Procedure:
① Maximize L for u_w, v_w, and d_i.
② Preserve u_w and v_w.
Procedure to Create Paragraph Vectors (new document)
After training, we can get a good paragraph vector as a feature for a new document:
doc: “We love SmartNews.” “I love SmartNews very much.” …
Objective function (dbow case):
L_doc = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc})
Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}, \mathrm{doc}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \quad v = \sum_{t' \neq t,\, -c \leq t'-t \leq c} v_{w_{t'}} + d
Procedure:
① Maximize L for u_w and v_w (training).
② Preserve u_w and v_w.
③ Maximize L_doc for d, keeping u_w and v_w fixed (live data).
④ Use d as the paragraph vector.
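Steps ③-④ (inference for a new document) can be sketched as gradient ascent on L_doc with u and v frozen. The vectors, window size, and learning rate below are illustrative toys:

```python
import math

# Frozen word vectors from "training"; hand-set, 2-d, illustrative.
u = {"love": [0.5, 0.1], "smartnews": [0.2, 0.6], "we": [0.1, 0.1]}
v = {w: vec[:] for w, vec in u.items()}

def infer_d(doc, steps=200, lr=0.1, c=1):
    """Step ③: gradient ascent on L_doc; u and v stay fixed, only d moves."""
    d = [0.0, 0.0]
    for _ in range(steps):
        for t, center in enumerate(doc):
            ctx = [w for i, w in enumerate(doc) if i != t and abs(i - t) <= c]
            # v in the model: sum of context word vectors plus d
            h = [sum(v[w][k] for w in ctx) + d[k] for k in range(2)]
            exp = {w: math.exp(sum(u[w][k] * h[k] for k in range(2))) for w in u}
            z = sum(exp.values())
            # gradient of log P(center | context, doc) w.r.t. d:
            # u_center - sum_W P(W) u_W
            for k in range(2):
                grad = u[center][k] - sum(exp[w] / z * u[w][k] for w in u)
                d[k] += lr * grad
    return d

d = infer_d(["we", "love", "smartnews"])  # step ④: d is the paragraph vector
```

Since L_doc is concave in d (a sum of log-softmax terms), this ascent converges, and the resulting d is the feature vector handed to the classifier.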
Procedure to Create Paragraph Vectors (summary)
training: maximize L → u_w, v_w (the Feature Extractor)
live data: maximize L_doc → Paragraph Vector d = [0.2, 0.3, …, 0.2]
Text Classification
Ordinary text classification architecture, now with paragraph vectors as the features:
① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … feed a training algorithm, which produces a classifier
② live data: feature extraction yields ([0.1, -0.1, …]), and the classifier outputs a category, e.g. sports
Benefits of Using Paragraph Vector
Good:
・High scalability: we don’t need to work hard on feature engineering for each language.
・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data sets (labeled: several tens of thousands of documents; unlabeled: ~100,000).
Bad:
・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.
Benefits of Using Paragraph Vector
It is also important that the Paragraph Vector has a different nature than Bag-of-Words. Reason: we can get a better classifier by combining two different types of classifiers.
Our use cases:
・Validation: use one to validate the other.
・Combination: use the more reliable result of the two classifiers, Bag-of-Words-based vs. Paragraph Vector-based.
Our use case (future):
・Multilingual localization: use only the Paragraph Vector-based classifier, without any feature engineering.
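The “combination” use case might look like the following sketch, assuming each classifier reports a (label, confidence) pair and that the two confidence scores are comparable (e.g. calibrated probabilities):

```python
def combine(bow_result, pv_result):
    """Use the more reliable of the two classifiers' results.
    Each result is a (label, confidence) tuple; pick the label
    whose classifier is more confident."""
    return max([bow_result, pv_result], key=lambda r: r[1])[0]

# The Paragraph Vector classifier is more confident here, so it wins:
print(combine(("sports", 0.55), ("entertainment", 0.91)))  # entertainment
```

More elaborate schemes (weighted voting, stacking) follow the same idea of exploiting the two feature types' different natures.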
The Challenge
News is uncertainty-seeking, for long-term value.
Exploitation vs. Exploration
What Big Data firms typically do: preference estimation and risk quantification (exploitation).
What SmartNews does: uncertainty-seeking discovery (exploration).
What if parents don’t feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
The Challenge
We are searching for a form of exploration that is not optimal, but acceptable.
Why? Humans are not rational enough to simply accept the optimum; without acceptance, users would never read SmartNews.
We are developing:
・topic extraction
・image extraction
・multi-armed-bandit-based scoring model
① for better feature vectors of users and articles (user interests as a feature vector for 10 million users; real-time feature vectors for articles)
② for human-acceptable exploration
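A multi-armed-bandit scoring model could, for instance, use an epsilon-greedy policy. This is a generic illustration of the exploration/exploitation trade-off, not SmartNews's actual model:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1, rng=random):
    """Multi-armed-bandit sketch: mostly exploit the article with the
    best estimated value, but with probability epsilon pick a random
    article so that new content gets a chance to be discovered."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))        # exploration
    return max(estimates, key=estimates.get)      # exploitation

scores = {"article_a": 0.12, "article_b": 0.34, "article_c": 0.05}
pick = epsilon_greedy(scores, epsilon=0.0)  # pure exploitation -> article_b
```

Tuning epsilon (or switching to UCB/Thompson-style policies) controls how aggressively the system seeks uncertainty.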
We are building our engineering team in SF - please join us!
We’re hiring:
・ML/NLP Engineer
・Data Science Engineer
…
References
Main Content Extraction
・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, “Boilerplate Detection using Shallow Text Features”
・BoilerPipe (Google Code)
Text Classification
・Quoc V. Le, Tomas Mikolov, “Distributed Representations of Sentences and Documents”
・Word2Vec (Google Code)
References
Articles about SmartNews
・“Japan’s SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S.”
・“SmartNews, The Minimalist News App That’s A Hit In Japan, Sets Its Sights On The U.S.”
・“Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M”
・About our Company SmartNews