[SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Post on 15-Jul-2015


TRANSCRIPT

Globally Scalable Web Document Classification Using Word2Vec

Kohei Nakaji (SmartNews)

keyword: machine learning for discovery

SmartNews Demo

About SmartNews

Japan

Launched 2013

4M+ Monthly Active Users 50% DAU/MAU 100+ Publishers

2013 App of The Year

US

Launched Oct 2014

1M+ Monthly Active Users Same engagement

80+ Publishers Top News Category App

International

Launched Feb 2015

10M Downloads WW Same engagement

English beta Featured App

Funding: $50M

Outline of our algorithm

Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → 1000+/day

Web Document Classification is one sub-task within this pipeline.

Web Document Classification

Categories: ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD

Task definition: when an arbitrary web document arrives, choose exactly one category from a pre-determined category set.

Web Document Classification

There are roughly two steps:

① Main Content Extraction

② Text Classification

ex: a web page goes through ① and ② and comes out labeled ENTERTAINMENT.

Main Content Extraction

Two approaches:

・Extract after rendering the whole page: easier, but takes time

・Extract from HTML: difficult, but fast

Our Approach

Main Content Extraction from HTML

Example:

<html>
 <body>
  <div>click <a>here</a> for </div>
  <div>
   <a>tweet</a><a>share</a>
   <p>Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p>
   <a>you also like this</a>
   <p>So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p>
  </div>
 </body>
</html>

(The two <p> paragraphs are the main content; the link texts around them are not.)

Main Content Extraction from HTML

Rule1: div which hastext length > 200 num of ‘a’ tag < 3 is Main Content

Rule-based extraction algorithm is possible.

English:Rule2: div which hastext length < 100 num of ‘p’ tag > 4 is Main Content

RuleN:

Main Content Extraction from HTML

Rule1: div which hastext length > 200 num of ‘a’ tag < 3 is Main Content

Rule-based extraction algorithm is possible.

English:Rule2: div which hastext length < 100 num of ‘p’ tag > 4 is Main Content

RuleN:

But not scalable.

Japanese:…… …
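The rule-based approach can be sketched with Python's standard html.parser. The thresholds (200 characters, fewer than 3 links) follow Rule 1 above; the class and function names are illustrative, not SmartNews's code, and nested divs are handled only crudely.

```python
from html.parser import HTMLParser

class RuleBasedExtractor(HTMLParser):
    """Collect (text, link count) for each div in the page."""
    def __init__(self):
        super().__init__()
        self.divs = []    # finished divs as (text, a_count)
        self._stack = []  # open div frames: [text parts, a_count]

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append([[], 0])
        elif tag == "a" and self._stack:
            self._stack[-1][1] += 1  # count links in the innermost div

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            parts, a_count = self._stack.pop()
            self.divs.append(("".join(parts), a_count))

def extract_main_content(html, min_len=200, max_links=3):
    # Rule 1: a div with text length > min_len and < max_links 'a' tags
    parser = RuleBasedExtractor()
    parser.feed(html)
    return [text.strip() for text, links in parser.divs
            if len(text) > min_len and links < max_links]

demo_html = ("<div>" + "news text " * 30 + "</div>"
             "<div><a>tweet</a><a>share</a><a>more</a>hi</div>")
blocks = extract_main_content(demo_html)
```

Only the long, link-poor div survives the rule; the share-button div is dropped.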

Main Content Extraction from HTML

We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)

① training: block separation & feature extraction turns labeled pages into block1: (features, main), block2: (features, not main), block3: (features, main), …, which train a decision tree.

② live data: a new page is separated into block1: (features), block2: (features), block3: (features), …, and the decision tree labels each block.
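The training and labeling steps above can be sketched with scikit-learn's DecisionTreeClassifier (the slide does not name a library, and the feature layout and toy data here are assumptions for illustration).

```python
from sklearn.tree import DecisionTreeClassifier

# features per block:
# [word count, <a> count, previous word count, previous <a> count]
X_train = [
    [36, 0, 4, 1],   # long block after a short linky block -> main
    [40, 1, 30, 0],  # long block after a long block        -> main
    [3, 1, 0, 0],    # "click here"-style block             -> not main
    [2, 2, 3, 1],    # share-button block                   -> not main
]
y_train = ["main", "main", "not_main", "not_main"]

# ① training: fit a decision tree on labeled blocks
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ② live data: label unseen blocks, then keep only the "main" ones
live_blocks = [[45, 0, 36, 0], [2, 3, 45, 0]]
labels = clf.predict(live_blocks)
main_blocks = [b for b, lab in zip(live_blocks, labels) if lab == "main"]
```

A real deployment would train on thousands of annotated blocks; the point is only the (features, label) → tree → predictions shape of the pipeline.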

Feature Extraction from HTML

Step 1: Separate the HTML into 'text blocks'.

Step 2: Extract local features for every text block.

ex: word count = 36, num of <a> = 0

Step 3: Define the feature of each text block as a combination of local features.

ex: word count (current block): 36, num of <a> (current block): 0, word count (previous block): 4, num of <a> (previous block): 1
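Steps 2 and 3 can be sketched in a few lines: compute local features per block, then combine each block's features with the previous block's. The input format (text, link count) and the feature choice are illustrative assumptions.

```python
def block_features(blocks):
    feats = []
    prev_wc, prev_a = 0, 0  # no previous block before the first one
    for text, num_a in blocks:
        wc = len(text.split())  # Step 2: local features
        # Step 3: combine current and previous blocks' local features
        # [word count (cur), <a> count (cur), word count (prev), <a> count (prev)]
        feats.append([wc, num_a, prev_wc, prev_a])
        prev_wc, prev_a = wc, num_a
    return feats

feats = block_features([("click here for", 1),
                        ("Robert Bates was a volunteer deputy", 0)])
```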


Making Main Content Using Decision Tree

block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main

The blocks labeled main are kept as the main content.


Web Document Classification (recap)

There are roughly two steps:

① Main Content Extraction

② Text Classification

Text Classification

Ordinary text classification architecture:

① training: feature extraction turns labeled documents into (features, entertainment), (features, sports), (features, entertainment), (features, politics), …, and a training algorithm turns them into a classifier.

② live data: a new document is turned into (features), and the classifier predicts its category, e.g. sports.

Feature Extraction in Text Classification

'Bag-of-words' is commonly used as a feature vector, with some feature engineering.

ex: "Will LeBron James deliver an NBA championship to Cleveland?"
→ { Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland }

Feature engineering: stop words (Will, an, to), a sports-players dictionary (LeBron James → NBA_PLAYER), tf-idf weighting.
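A minimal sketch of bag-of-words with the feature engineering named above, stop-word removal and a player dictionary. The word lists are tiny illustrative stand-ins; real stop-word lists and dictionaries are far larger, and the raw counts would normally be reweighted with tf-idf.

```python
from collections import Counter

STOP_WORDS = {"will", "an", "to", "a", "the"}
PLAYER_DICT = {"lebron james": "nba_player"}  # dictionary-based replacement

def bag_of_words(text):
    t = text.lower()
    for name, token in PLAYER_DICT.items():
        t = t.replace(name, token)  # "LeBron James" -> NBA_PLAYER token
    words = [w.strip("?.,!") for w in t.split()]
    # count every surviving token: this Counter is the feature vector
    return Counter(w for w in words if w and w not in STOP_WORDS)

bow = bag_of_words("Will LeBron James deliver an NBA championship to Cleveland?")
```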

Feature Extraction in Text Classification

The same approach is used in Japanese.

ex: 私は中路です。よろしくお願いします。 ("I am Nakaji. Nice to meet you.")
→ tokens: 私 / は / 中路 / です / よろしく / お願い / し / ます

Feature engineering: stop words (は, です, し, ます), a person dictionary (中路 → PERSON), tf-idf weighting.

Another Option: Paragraph Vector

A whole document is mapped to a single dense vector (dimension ~ several hundred).

Example:

私は中路です。よろしくお願いします。 → [0.2, 0.3, …, 0.2]

"Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1]

Outline of Distributed Representation

・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/)

・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053)

Word Vector in the word2vec Model

Every word is mapped to a unique word vector with good properties.

v_Germany = [0.1, 0.2, …, 0.2]
v_Berlin = [0.1, 0.1, …, -0.1]
v_France = [0.3, 0.4, …, 0]
v_Paris = [0.3, 0.3, …, 0.3]

"v_Germany − v_Berlin ≈ v_France − v_Paris"
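A toy illustration of the analogy property above, with hand-set 2-d vectors chosen so that one axis encodes "country vs. capital" and the other separates the country pairs. Real word2vec vectors satisfy the relation only approximately; here it holds exactly by construction.

```python
vecs = {
    "germany": [1.0, 0.0], "berlin": [1.0, 1.0],
    "france":  [0.0, 0.0], "paris":  [0.0, 1.0],
    "japan":   [2.0, 0.0], "tokyo":  [2.0, 1.0],
}

def nearest(target, exclude):
    # nearest word by squared Euclidean distance, skipping the query words
    def dist(w):
        return sum((a - b) ** 2 for a, b in zip(vecs[w], target))
    return min((w for w in vecs if w not in exclude), key=dist)

# "Berlin - Germany + France" should land on "Paris"
query = [b - g + f for b, g, f in
         zip(vecs["berlin"], vecs["germany"], vecs["france"])]
answer = nearest(query, exclude={"berlin", "germany", "france"})
```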

Procedure to Create Word Vectors

Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Training corpus: a stream of words $w_1, \dots, w_T$, e.g. "A cat sat on the street. I love cats very much. … He comes from Japan. …"

Objective function (cbow case):

$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c})$

Model (sum case):

$P(w_t \mid w_{t-c}, \dots, w_{t+c}) = \dfrac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}$, where $v = \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}}$

with parameters $u_w$ and $v_w$ for each word $w$; $v_w$ is the word vector for $w$.

Procedure: ① Maximize $L$ with respect to $u_w$ and $v_w$.

Word vectors are trained so that they become good features for predicting surrounding words.
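A numeric illustration of the sum-case model above: the context vector v is the sum of the context words' input vectors, and P(w_t | context) is a softmax over dot products with the output vectors u_W. All vectors here are tiny made-up 2-d examples, not trained values.

```python
import math

v_w = {"the": [0.1, 0.0], "cat": [0.5, 0.2], "on": [0.0, 0.1]}   # input vectors
u_w = {"sat": [0.4, 0.3], "ran": [0.1, 0.0], "the": [0.0, 0.1]}  # output vectors

def p_center(center, context):
    # v = sum of the context words' input vectors (the "sum case")
    v = [sum(col) for col in zip(*(v_w[w] for w in context))]
    # softmax over u_W . v across the (toy) output vocabulary
    scores = {w: math.exp(sum(a * b for a, b in zip(u, v)))
              for w, u in u_w.items()}
    return scores[center] / sum(scores.values())

# probability of the center word "sat" given its context window
p_sat = p_center("sat", ["the", "cat", "on"])
```

Training step ① would adjust u_w and v_w to push probabilities like p_sat up across the whole corpus.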


Procedure to Create Paragraph Vectors

Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Add a vector $d_i$ to the model for each document:

doc_1: "A cat sat on the street. I love cats very much." doc_2: "… He comes from Japan. …"

Objective function (dbow case):

$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i)$

Model (sum case):

$P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i) = \dfrac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}$, where $v = \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}} + d_i$

and $\mathrm{doc}_i$ is the document where $w_t$ is included.

Procedure: ① Maximize $L$ with respect to $u_w$, $v_w$, and $d_i$. ② Preserve $u_w$ and $v_w$.

Procedure to Create Paragraph Vector (new document)

After training, we can get a good paragraph vector as a feature for a new document.

doc: "We love SmartNews. I love SmartNews very much."

Objective function (dbow case):

$L_{\mathrm{doc}} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc})$

Model (sum case):

$P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}) = \dfrac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}$, where $v = \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}} + d$

Procedure: ③ Maximize $L_{\mathrm{doc}}$ with respect to $d$ (with $u_w$ and $v_w$ fixed). ④ Use $d$ as the paragraph vector.

Procedure to Create Paragraph Vector (summary)

training: maximize $L$ to obtain the feature extractor ($u_w$, $v_w$)

live data: maximize $L_{\mathrm{doc}}$ to obtain the paragraph vector $d$, e.g. [0.2, 0.3, …, 0.2]
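Inference steps ③ and ④ can be sketched in pure Python: the word matrices u_w and v_w are frozen (tiny made-up values here; a real model would have trained them first) and only the document vector d is optimized. A numerical gradient keeps the sketch short; real implementations use analytic gradients.

```python
import math

v_w = {"we": [0.2, 0.1], "love": [0.4, 0.0], "smartnews": [0.1, 0.5]}
u_w = {"we": [0.1, 0.2], "love": [0.3, 0.1], "smartnews": [0.2, 0.4]}

def log_p(center, context, d):
    # v = sum of context word vectors + document vector d
    v = [sum(vals) + di for vals, di in
         zip(zip(*(v_w[w] for w in context)), d)]
    scores = {w: math.exp(sum(a * b for a, b in zip(u, v)))
              for w, u in u_w.items()}
    return math.log(scores[center] / sum(scores.values()))

def doc_log_likelihood(words, d):
    # L_doc: each word in turn is the center, the rest are its context
    return sum(log_p(words[t], [w for i, w in enumerate(words) if i != t], d)
               for t in range(len(words)))

def infer_doc_vector(words, dim=2, lr=0.1, epochs=50, eps=1e-4):
    d = [0.0] * dim
    for _ in range(epochs):
        for k in range(dim):  # step ③: gradient ascent on d only
            hi, lo = d[:], d[:]
            hi[k] += eps
            lo[k] -= eps
            grad = (doc_log_likelihood(words, hi) -
                    doc_log_likelihood(words, lo)) / (2 * eps)
            d[k] += lr * grad
    return d  # step ④: d is the paragraph vector

d = infer_doc_vector(["we", "love", "smartnews"])
```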

Text Classification

Ordinary text classification architecture, now with paragraph vectors as features:

① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier

② live data: ([0.1, -0.1, …]) → classifier → sports
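The same architecture can be sketched with scikit-learn's LogisticRegression (an assumed choice of classifier), with toy 3-d vectors standing in for real inferred paragraph vectors.

```python
from sklearn.linear_model import LogisticRegression

# ① training: (paragraph vector, label) pairs; values are made up
X_train = [
    [0.1, 0.3, 0.0],   # entertainment
    [0.1, 0.1, 0.1],   # entertainment
    [0.2, -0.3, 0.9],  # sports
    [0.1, -0.2, 0.8],  # sports
]
y_train = ["entertainment", "entertainment", "sports", "sports"]

clf = LogisticRegression().fit(X_train, y_train)

# ② live data: infer a paragraph vector for a new document, then classify it
label = clf.predict([[0.15, -0.25, 0.85]])[0]
```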

Benefits of Using Paragraph Vector

Good:

・High scalability: we don't need to work hard on feature engineering for each language.

・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data sets (labeled: tens of thousands of documents; unlabeled: ~100,000).

Bad:

・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.

Benefits of Using Paragraph Vector

It is also important that Paragraph Vector has a different nature than Bag-of-Words: we can get a better classifier by combining two different types of classifiers.

Our Use Case

Validation: use one classifier to validate the other.

Combination: use the more reliable result of the two classifiers, the Bag-of-Words-based classifier vs. the Paragraph-Vector-based classifier.

Our Use Case (future)

In multilingual localization: use only the Paragraph-Vector-based classifier, without any feature engineering.
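The Combination use case can be sketched as picking whichever classifier is more confident. The (label, confidence) interface is an assumption for illustration; the actual reliability signal SmartNews uses is not specified on the slide.

```python
def combine(bow_pred, pv_pred):
    # each prediction is a (label, confidence) pair, confidence in [0, 1];
    # keep the label from the more confident of the two classifiers
    return max([bow_pred, pv_pred], key=lambda pred: pred[1])[0]

# Bag-of-Words classifier is unsure, Paragraph Vector classifier is confident
label = combine(("sports", 0.55), ("entertainment", 0.90))
```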


The Challenge

Exploitation vs. Exploration

What Big Data firms typically do: preference estimation and risk quantification (exploitation).

What SmartNews does: uncertainty-seeking discovery (exploration). News calls for uncertainty seeking for the sake of long-term value.

What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?

The Challenge

We search for a form of exploration that is not optimal, but acceptable.

Why? Humans are not rational enough to simply accept the optimum, and without acceptance, users will never read SmartNews.

We are developing:

・topic extraction

・image extraction

・a multi-armed-bandit-based scoring model

① for better feature vectors of users and articles: a feature vector of interests for each of 10 million users × a real-time feature vector for articles

② for human-acceptable exploration
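One simple instance of a multi-armed-bandit scoring model is epsilon-greedy: mostly exploit the best-scoring arm (e.g. an article slot), but explore a random one with probability epsilon, so users still discover new topics. This is a generic textbook sketch, not SmartNews's actual model.

```python
import random

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # exploit: arm with the highest estimated reward
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental running mean of observed rewards
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedy(n_arms=2, epsilon=0.0)
bandit.update(0, 0.0)  # arm 0: user skipped the article
bandit.update(1, 1.0)  # arm 1: user read the article
```

With epsilon > 0, occasional random picks keep gathering signal on underexposed articles instead of always showing the current best.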

We are building our engineering team in SF - please join us!

We're hiring!

・ML/NLP Engineer

・Data Science Engineer

kohei.nakaji@smartnews.com

References

Main Content Extraction

・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl. Boilerplate Detection Using Shallow Text Features.

・BoilerPipe (Google Code)

Text Classification

・Quoc V. Le, Tomas Mikolov. Distributed Representations of Sentences and Documents.

・Word2Vec (Google Code)

About SmartNews

・Japan's SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S.

・SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S.

・Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M

・About our Company SmartNews
