[smartnews] globally scalable web document classification using word2vec

Globally Scalable Web Document Classification Using Word2Vec Kohei Nakaji (SmartNews)

Upload: kohei-nakaji

Post on 15-Jul-2015


TRANSCRIPT

Page 1: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Globally Scalable Web Document Classification Using Word2Vec

Kohei Nakaji (SmartNews)

Page 2: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec
Page 3: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

keyword: machine learning for discovery

Page 4: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

SmartNews Demo

Page 5: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

About SmartNews

Japan (launched 2013)
・4M+ Monthly Active Users
・50% DAU/MAU
・100+ Publishers
・2013 App of The Year

US (launched Oct 2014)
・1M+ Monthly Active Users
・Same engagement
・80+ Publishers
・Top News Category App

International (launched Feb 2015)
・10M Downloads worldwide
・Same engagement
・English beta
・Featured App

Funding: $50M

Page 6: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Outline of our algorithm

[Diagram: a pipeline that starts from Signals on the Internet and URLs Found (10 million/day), passes through Structure Analysis, Semantics Analysis, Importance Estimation, and Diversification, and ends with 1000+/day.]

Page 7: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Outline of our algorithm

[Same pipeline diagram: Signals on the Internet, URLs Found (10 million/day), Structure Analysis, Semantics Analysis, Importance Estimation, Diversification, 1000+/day, with Web Document Classification marked as a part (⊂) of the pipeline.]

Page 8: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Web Document Classification

Task definition: When an arbitrary web document arrives, choose exactly one category from a predetermined category set.

Example categories: ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD

Page 9: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Web Document Classification

There are roughly two steps:

① Main Content Extraction
② Text Classification

Example: web document → ① → ② → ENTERTAINMENT


Page 11: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Main Content Extraction

Two approaches:

・Extract after rendering whole page: easier, but takes time
・Extract from HTML: difficult, but fast

Page 12: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Main Content Extraction

Two approaches:

・Extract after rendering whole page: easier, but takes time
・Extract from HTML: difficult, but fast (Our Approach)

Page 13: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Main Content Extraction from HTML

Example:

<html>
<body>
<div>click <a>here</a> for </div>
<div>
<a>tweet</a><a>share</a>
<p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p>
<a>you also like this</a>
<p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p>
</div>
</body>
</html>

Legend: main content / not main content

Page 14: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Main Content Extraction from HTML

Rule-based extraction algorithm is possible.

English:
Rule1: a div which has text length > 200 and num of ‘a’ tag < 3 is Main Content
Rule2: a div which has text length < 100 and num of ‘p’ tag > 4 is Main Content
…
RuleN:

Page 15: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Main Content Extraction from HTML

Rule-based extraction algorithm is possible. But not scalable.

English: Rule1, Rule2, …, RuleN
Japanese: ……
…

Page 16: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Main Content Extraction from HTML

We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)

① training
HTML → block separation & feature extraction → block1: (features, main), block2: (features, not main), block3: (features, main) → decision tree

② live data
HTML → block separation & feature extraction → block1: (features), block2: (features), block3: (features) → decision tree



Page 20: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Feature Extraction from HTML

<html>
<body>
<div>click <a>here</a> for </div>
<div>
<a>tweet</a><a>share</a>
<p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office. </p>
<a>you also like this</a>
<p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p>
</div>
</body>
</html>

Step1: Separate HTML into ‘text block’s

Step2: Extract local features for every text block
ex: word count = 36, num of <a> = 0

Step3: Define the feature of each text block as a combination of local features
ex: word count (current block): 36, num of <a> (current block): 0, word count (previous block): 4, num of <a> (previous block): 1
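The three steps can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used at SmartNews: it assumes that every tag except <a> starts a new text block, and the names `BlockSplitter`, `block_features`, and the feature keys are hypothetical. On the example HTML it reproduces the numbers shown on the slide for the last paragraph.

```python
from html.parser import HTMLParser

HTML = """<html> <body><div>click <a>here</a> for </div><div><a>tweet</a><a>share</a>
<p> Robert Bates was a volunteer deputy who'd never led an arrest for the
Tulsa County Sheriff's Office. </p><a>you also like this</a>
<p> So how did the 73-year-old insurance company CEO end up joining a sting
operation this month that ended when he pulled out his handgun and killed
suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html>"""


class BlockSplitter(HTMLParser):
    """Step 1 and 2: split HTML into text blocks and count local features.

    Assumption: any tag other than <a> closes the current block.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []   # one dict of local features per non-empty block
        self._text = []    # text pieces of the block being built
        self._num_a = 0    # <a> tags seen in the block being built

    def _flush(self):
        words = " ".join(self._text).split()
        if words:
            self.blocks.append({"word_count": len(words), "num_a": self._num_a})
        self._text, self._num_a = [], 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._num_a += 1
        else:
            self._flush()

    def handle_endtag(self, tag):
        if tag != "a":
            self._flush()

    def handle_data(self, data):
        self._text.append(data)

    def close(self):
        super().close()
        self._flush()


def block_features(html):
    """Step 3: combine each block's local features with the previous block's."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    features = []
    previous = {"word_count": 0, "num_a": 0}  # padding for the first block
    for block in splitter.blocks:
        features.append({
            "word_count": block["word_count"],
            "num_a": block["num_a"],
            "prev_word_count": previous["word_count"],
            "prev_num_a": previous["num_a"],
        })
        previous = block
    return features


features = block_features(HTML)
for i, f in enumerate(features, start=1):
    print(f"block{i}: {f}")
```

Using the previous block as context lets a classifier tell a long paragraph that follows a short link apart from one that follows another paragraph; the following block could be added in the same way.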


Page 23: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Making Main Content Using Decision Tree

block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main

The blocks classified as main (block3 and block5) make up the main content.
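As a rough sketch of this step, the snippet below labels five blocks and joins the ones labeled main. The tree here is written by hand and its thresholds are invented for illustration; in the approach described in the talk the tree is learned from labeled blocks. The block texts and counts come from the example HTML in the earlier slides.

```python
def classify_block(features):
    """Hand-written stand-in for a trained decision tree.

    Assumption: both thresholds are made up for this example.
    """
    if features["num_a"] >= 1:          # blocks containing links look like navigation
        return "not main"
    if features["word_count"] >= 10:    # long link-free blocks look like article text
        return "main"
    return "not main"


blocks = [
    ("click here for", {"word_count": 3, "num_a": 1}),
    ("tweet share", {"word_count": 2, "num_a": 2}),
    ("Robert Bates was a volunteer deputy who'd never led an arrest "
     "for the Tulsa County Sheriff's Office.", {"word_count": 17, "num_a": 0}),
    ("you also like this", {"word_count": 4, "num_a": 1}),
    ("So how did the 73-year-old insurance company CEO end up joining a "
     "sting operation this month that ended when he pulled out his handgun "
     "and killed suspect Eric Harris instead of stunning him with a Taser?",
     {"word_count": 36, "num_a": 0}),
]

labels = [classify_block(features) for _, features in blocks]

# The main content is the concatenation of the blocks labeled "main".
main_content = " ".join(
    text for (text, _), label in zip(blocks, labels) if label == "main"
)

for i, label in enumerate(labels, start=1):
    print(f"block{i}: {label}")
print(main_content)
```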


Page 25: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Web Document Classification

There are roughly two steps:

① Main Content Extraction
② Text Classification

Example: web document → ① → ② → ENTERTAINMENT

Page 26: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Text Classification

Ordinary text classification architecture:

① training
labeled documents (entertainment, sports, …) → feature extraction → (features, entertainment), (features, sports), (features, entertainment), (features, politics), … → training algorithm → classifier

② live data
unlabeled document (?) → feature extraction → (features) → classifier → sports


Page 28: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Feature Extraction in Text Classification

‘Bag-of-words’ is commonly used as a feature vector.

Will LeBron James deliver an NBA championship to Cleveland?

Bag of words: Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland

Page 29: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Feature Extraction in Text Classification

‘Bag-of-words’ is commonly used as a feature vector, with some feature engineering.

Will LeBron James deliver an NBA championship to Cleveland?

Bag of words: Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland

Feature engineering:
・stop words
・sports players dictionary (LeBron James → NBA_PLAYER)
・tf-idf
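A small sketch of this kind of feature extraction is shown below. The stop-word list, the player dictionary, the two extra sentences, and the plain `count * log(N / df)` weighting are all assumptions chosen for the example, not the lists or formula used in production.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"will", "an", "to", "the", "a"}        # hypothetical stop-word list
PLAYER_DICTIONARY = {"lebron james": "NBA_PLAYER"}   # hypothetical dictionary entry


def bag_of_words(text):
    """Lowercase, map dictionary entries to a tag, drop stop words, count terms."""
    text = text.lower()
    for name, tag in PLAYER_DICTIONARY.items():
        text = text.replace(name, tag)
    tokens = re.findall(r"[A-Za-z_]+", text)
    return Counter(t for t in tokens if t not in STOP_WORDS)


def tf_idf(bags):
    """Weight each term count by log(N / document frequency)."""
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)
    return [
        {term: count * math.log(n / df[term]) for term, count in bag.items()}
        for bag in bags
    ]


documents = [
    "Will LeBron James deliver an NBA championship to Cleveland?",
    "The Warriors won the NBA championship.",   # made-up sentence
    "Apple will release a new phone.",          # made-up sentence
]
bags = [bag_of_words(d) for d in documents]
weights = tf_idf(bags)

print(bags[0])
print(weights[0])
```

Terms that appear in fewer documents (here "cleveland") receive a larger weight than terms shared across documents (here "nba").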

Page 30: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Feature Extraction in Text Classification

Similarly used in Japanese.

私は中路です。 よろしくお願いします。 ("I am Nakaji. Nice to meet you.")

Bag of words: 私, は, 中路, です, よろしく, お願い, し, ます

Feature engineering:
・stop words
・person dictionary (中路 → PERSON)
・tf-idf

Page 31: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Another Option: Paragraph Vector

Page 32: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Paragraph Vector (dimension ~ several hundred)

Example:

私は中路です。 よろしくお願いします。 → [0.2, 0.3, ……0.2]

Will LeBron James deliver an NBA championship to Cleveland? → [0.1, 0.4, ……0.1]

Page 33: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Outline of Distributed Representation

・word2vec (https://code.google.com/p/word2vec/)
Every word is mapped to a unique word vector.

・paragraph vector (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053)
Every document is mapped to a unique vector.


Page 35: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Word Vector in word2vec Model

Every word is mapped to a unique word vector with good properties.

Words such as Germany, Berlin, Paris, and France each map to a vector, e.g. [0.1, 0.2, ……0.2], [0.1, 0.1, ……-0.1], [0.3, 0.4, ……0], [0.3, 0.3, ……0.3].

“Germany - Berlin = France - Paris”

$$v_{\text{Germany}} - v_{\text{Berlin}} = v_{\text{France}} - v_{\text{Paris}}$$
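The property can be illustrated with a few lines of Python. The vectors below are hypothetical toy values built so that the country-minus-capital offset is identical for every pair; real word2vec vectors have a few hundred dimensions and satisfy the relation only approximately.

```python
import math

# Hypothetical 3-dimensional toy vectors (not trained word2vec output).
VECTORS = {
    "Germany": [1.0, 2.0, 0.0],
    "Berlin":  [1.0, 1.0, 0.0],
    "France":  [3.0, 2.0, 0.5],
    "Paris":   [3.0, 1.0, 0.5],
    "Japan":   [5.0, 2.0, -1.0],
    "Tokyo":   [5.0, 1.0, -1.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Answer "a is to b as c is to ?" with the word nearest to v_b - v_a + v_c."""
    target = [vb - va + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


answer = analogy("Germany", "Berlin", "France")
print(answer)
```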

Page 36: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

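Procedure to Create Word Vectors

Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Word vectors are trained so that they become good features for predicting a word from its surrounding words.

Training text (words indexed …, $w_{220}$, $w_{221}$, …):
A cat sat on the street. I love cat very much. He comes from Japan.

[Diagram: in "A cat sat on the street.", the surrounding words cat, sat, the, street are used to predict on.]

Objective Function (cbow-case)

$$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c})$$

Model (sum-case)

$$P(w_t \mid w_{t-c}, \dots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \qquad v = \sum_{t' \neq t,\; t-c \le t' \le t+c} v_{w_{t'}}$$

$v_w$ is the word vector for $w$.

Procedure
① Maximize $L$ for $u_w$ and $v_w$.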
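To make the sum-case model concrete, the sketch below evaluates P(w_t | context) for one position of "A cat sat on the street." with window c = 2. The vectors are random stand-ins rather than trained values, and the names `v_in`, `u_out`, and `cbow_probabilities` are chosen for this example only.

```python
import math
import random

random.seed(0)
VOCAB = ["a", "cat", "sat", "on", "the", "street"]
DIM = 4

# Random stand-ins for the two vector tables of the model (untrained).
v_in = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in VOCAB}   # v_w
u_out = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in VOCAB}  # u_w


def cbow_probabilities(context):
    """Softmax over the vocabulary, where v is the sum of the context vectors."""
    v = [sum(v_in[w][k] for w in context) for k in range(DIM)]
    scores = {w: sum(uk * vk for uk, vk in zip(u_out[w], v)) for w in VOCAB}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}


# Position t is the word "on"; its context with c = 2 is cat, sat, the, street.
probs = cbow_probabilities(["cat", "sat", "the", "street"])
log_p = math.log(probs["on"])   # this position's contribution to L
print(probs)
print(log_p)
```

Training adjusts u_w and v_w so that the sum of such log-probabilities over all positions becomes as large as possible.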


Page 39: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

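Procedure to Create Paragraph Vectors

Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Add a vector to the model for each document.

Training text split into documents (doc_1, doc_2, …; words indexed …, $w_{220}$, $w_{221}$, …):
A cat sat on the street. I love cat very much. He comes from Japan.

[Diagram: the surrounding words cat, sat, the, street together with doc_1 are used to predict on.]

Objective Function (dbow-case)

$$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i)$$

where $\mathrm{doc}_i$ is the document in which $w_t$ is included.

Model (sum-case)

$$P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \qquad v = \sum_{t' \neq t,\; t-c \le t' \le t+c} v_{w_{t'}} + d_i$$

Procedure
① Maximize $L$ for $u_w$, $v_w$, and $d_i$.
② Preserve the trained $u_w$, $v_w$ as fixed values.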

Page 40: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Procedure to Create Paragraph Vector

After training, we can get a good paragraph vector as a feature for a new document.

New document (doc):
We love SmartNews. I love SmartNews very much.

Objective Function (dbow-case)

$$L_{\mathrm{doc}} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc})$$

Model (sum-case)

$$P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \qquad v = \sum_{t' \neq t,\; t-c \le t' \le t+c} v_{w_{t'}} + d$$

Procedure

training
① Maximize $L$ for $u_w$, $v_w$, and $d_i$.
② Preserve the trained $u_w$, $v_w$ as fixed values.

live data
③ Maximize $L_{\mathrm{doc}}$ for $d$.
④ Use $d$ as a paragraph vector.
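Steps ③ and ④ can be sketched as plain gradient ascent on d while the word vectors stay fixed. Everything concrete here is an assumption made for illustration: the word vectors are random stand-ins instead of trained values, and the dimension, window size, learning rate, and step count are arbitrary choices.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_vector(doc, t, V, d, c):
    """v = d + sum of v_w over the words within c positions of t (excluding t)."""
    v = list(d)
    for s in range(max(0, t - c), min(len(doc), t + c + 1)):
        if s != t:
            v = [vk + xk for vk, xk in zip(v, V[doc[s]])]
    return v


def softmax_probs(U, v):
    scores = [dot(u, v) for u in U]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]


def log_likelihood(doc, U, V, d, c):
    """L_doc: sum over positions of log P(w_t | context, doc)."""
    return sum(
        math.log(softmax_probs(U, context_vector(doc, t, V, d, c))[w])
        for t, w in enumerate(doc)
    )


def infer_paragraph_vector(doc, U, V, c=2, lr=0.05, steps=100):
    """Step 3: maximize L_doc for d only; U and V are never modified."""
    d = [0.0] * len(V[0])
    for _ in range(steps):
        grad = [0.0] * len(d)
        for t, w in enumerate(doc):
            probs = softmax_probs(U, context_vector(doc, t, V, d, c))
            for k in range(len(d)):
                expected = sum(p * u[k] for p, u in zip(probs, U))
                grad[k] += U[w][k] - expected   # d/dv of log-softmax
        d = [dk + lr * gk for dk, gk in zip(d, grad)]
    return d


random.seed(0)
VOCAB = ["we", "love", "smartnews", "i", "very", "much"]
DIM = 4
# Stand-ins for the preserved u_w and v_w (random, not trained).
U = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]
V = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]

words = "we love smartnews i love smartnews very much".split()
doc = [VOCAB.index(w) for w in words]

before = log_likelihood(doc, U, V, [0.0] * DIM, 2)
d = infer_paragraph_vector(doc, U, V)          # step 4: d is the feature vector
after = log_likelihood(doc, U, V, d, 2)
print(d, before, after)
```

Because only d is updated, inferring the vector of a new document does not change the trained model.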

Page 41: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Procedure to Create Paragraph Vector

[Diagram: Feature Extractor. Training: maximize $L$ to obtain $u_w$, $v_w$. New document: maximize $L_{\mathrm{doc}}$ to obtain $d$.]

Paragraph Vector: $d$ = [0.2, 0.3, ……0.2]

Page 42: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Text Classification

Ordinary text classification architecture:

① training
labeled documents (entertainment, sports, …) → feature extraction → ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier

② live data
unlabeled document (?) → feature extraction → ([0.1, -0.1, …]) → classifier → sports
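The slide does not say which training algorithm sits on top of the paragraph vectors. As one simple possibility, the sketch below uses a nearest-centroid classifier; this choice, the 3-dimensional toy vectors, and the function names are all assumptions for illustration.

```python
def train_centroids(examples):
    """Training: average the vectors of each category."""
    sums, counts = {}, {}
    for vector, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for k, x in enumerate(vector):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def classify(centroids, vector):
    """Live data: pick the category whose centroid is closest."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vector))
    return min(centroids, key=squared_distance)


# Hypothetical toy paragraph vectors with their categories.
TRAINING = [
    ([0.9, 0.1, 0.0], "sports"),
    ([0.8, 0.2, 0.1], "sports"),
    ([0.1, 0.9, 0.0], "entertainment"),
    ([0.2, 0.8, 0.1], "entertainment"),
    ([0.0, 0.1, 0.9], "politics"),
]

centroids = train_centroids(TRAINING)
prediction = classify(centroids, [0.7, 0.2, 0.1])
print(prediction)
```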

Page 43: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Benefits of Using Paragraph Vector

Good

・High Precision in Text Classification
Several percent better than using Bag-of-Words with feature engineering in our Japanese/English data set (labeled: several tens of thousands of documents; unlabeled: ~100,000).

・High Scalability
We don’t need to work hard on feature engineering in each language.

Bad

・Difficulty in analyzing errors
It is hard to understand the meaning of each component of a paragraph vector.

Page 44: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Benefits of Using Paragraph Vector

It is important that Paragraph Vector has a different nature than Bag-of-Words

Reason: We can get a better classifier by combining two different types of classifiers.

Page 45: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Our Use Case

Two classifiers: Bag-of-Words-based classifier vs. Paragraph Vector-based classifier

・Validation: Use one to validate the other.
・Combination: Use the more reliable result of the two classifiers.
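A minimal sketch of both uses is given below. It assumes each classifier returns a (label, confidence) pair and treats the higher confidence as "more reliable"; the talk does not specify how reliability is measured, so that criterion and the function name are hypothetical.

```python
def combine(bow_result, pv_result):
    """Return (chosen label, whether the two classifiers agree).

    bow_result, pv_result: (label, confidence) from the Bag-of-Words-based
    and the Paragraph Vector-based classifier respectively.
    Assumption: the result with the higher confidence is the more reliable one.
    """
    chosen = max(bow_result, pv_result, key=lambda result: result[1])
    agree = bow_result[0] == pv_result[0]   # validation: disagreement can be flagged
    return chosen[0], agree


label, agree = combine(("sports", 0.9), ("entertainment", 0.6))
print(label, agree)
```

Documents on which the two classifiers disagree are natural candidates for manual review.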

Page 46: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Our Use Case (future)

In multilingual localization: Use only the Paragraph Vector-based classifier, without any feature engineering.

Page 47: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Web Document Classification

There are roughly two steps:

① Main Content Extraction
② Text Classification

Example: web document → ① → ② → ENTERTAINMENT

Page 48: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

The Challenge

Page 49: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

The Challenge

News is uncertainty seeking for long-term values.

Exploitation vs. Exploration

・What Big Data Firms typically do (exploitation): preference estimation and risk quantification
・What SmartNews does (exploration): uncertainty seeking discovery

What if parents don't feed vegetables to children who only like meat?
What if you keep hearing only opinions that match yours?

Page 50: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

The Challenge

Searching for a not optimal, but acceptable, form of exploration.

Why? Humans are not rational enough to simply accept the optimum. Without acceptance, users will never read SmartNews.

We are developing:

① For better Feature Vector of users and articles
・topic extraction
・image extraction

② For Human-Acceptable Exploration
・multi-arm bandit based scoring model

[Diagram: user interests, from a feature vector for 10 million users × a real-time feature vector for articles]

Page 51: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

We are building our engineering team in SF - please join us!

We're hiring:

・ML/NLP Engineer

・Data Science Engineer

Page 53: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

References

Main Content Extraction

・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl: Boilerplate Detection using Shallow Text Features

・BoilerPipe (GoogleCode)

Text Classification

・Quoc V. Le, Tomas Mikolov: Distributed Representations of Sentences and Documents

・Word2Vec (GoogleCode)

Page 54: [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

References

About SmartNews

・About our Company SmartNews

Articles about SmartNews

・Japan’s SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S.

・SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S.

・Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M