[SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Post on 15-Jul-2015


TRANSCRIPT

Globally Scalable Web Document Classification Using Word2Vec

Kohei Nakaji (SmartNews)

keyword: machine learning for discovery

SmartNews Demo

About SmartNews

Japan

Launched 2013

4M+ Monthly Active Users 50% DAU/MAU 100+ Publishers

2013 App of The Year

US

Launched Oct 2014

1M+ Monthly Active Users Same engagement

80+ Publishers Top News Category App

International

Launched Feb 2015

10M Downloads WW Same engagement

English beta Featured App

Funding: $50M

Outline of our algorithm

Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → 1000+/day

Web Document Classification is one sub-task within this pipeline.

Web Document Classification

Categories: ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD

Task definition: when an arbitrary web document arrives, choose exactly one category from a pre-determined category set.

Web Document Classification

There are roughly two steps:

① Main Content Extraction

② Text Classification

ex: a web page goes through ① and ② and comes out labeled ENTERTAINMENT.

Main Content Extraction

Two approaches:

・Extract after rendering the whole page: easier, but takes time

・Extract from HTML: difficult, but fast

Our Approach

Main Content Extraction from HTML

Example:

<html>
 <body>
  <div>click <a>here</a> for </div>
  <div>
   <a>tweet</a><a>share</a>
   <p>Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.</p>
   <a>you also like this</a>
   <p>So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p>
  </div>
 </body>
</html>

(The two <p> paragraphs are the main content; the link texts around them are not.)

Main Content Extraction from HTML

Rule1: div which hastext length > 200 num of ‘a’ tag < 3 is Main Content

Rule-based extraction algorithm is possible.

English:Rule2: div which hastext length < 100 num of ‘p’ tag > 4 is Main Content

RuleN:

Main Content Extraction from HTML

Rule1: div which hastext length > 200 num of ‘a’ tag < 3 is Main Content

Rule-based extraction algorithm is possible.

English:Rule2: div which hastext length < 100 num of ‘p’ tag > 4 is Main Content

RuleN:

But not scalable.

Japanese:…… …
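The rule-based approach can be sketched with Python's standard html.parser. The thresholds (200 characters, fewer than 3 links) follow Rule 1 above; the class and function names are illustrative, not SmartNews's code, and nested divs are handled only crudely.

```python
from html.parser import HTMLParser

class RuleBasedExtractor(HTMLParser):
    """Collect (text, link count) for each div in the page."""
    def __init__(self):
        super().__init__()
        self.divs = []    # finished divs as (text, a_count)
        self._stack = []  # open div frames: [text parts, a_count]

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append([[], 0])
        elif tag == "a" and self._stack:
            self._stack[-1][1] += 1  # count links in the innermost div

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            parts, a_count = self._stack.pop()
            self.divs.append(("".join(parts), a_count))

def extract_main_content(html, min_len=200, max_links=3):
    # Rule 1: a div with text length > min_len and < max_links 'a' tags
    parser = RuleBasedExtractor()
    parser.feed(html)
    return [text.strip() for text, links in parser.divs
            if len(text) > min_len and links < max_links]

demo_html = ("<div>" + "news text " * 30 + "</div>"
             "<div><a>tweet</a><a>share</a><a>more</a>hi</div>")
blocks = extract_main_content(demo_html)
```

Only the long, link-poor div survives the rule; the share-button div is dropped.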

Main Content Extraction from HTML

We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)

① training: block separation & feature extraction turns labeled pages into block1: (features, main), block2: (features, not main), block3: (features, main), …, which train a decision tree.

② live data: a new page is separated into block1: (features), block2: (features), block3: (features), …, and the decision tree labels each block.
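The training and labeling steps above can be sketched with scikit-learn's DecisionTreeClassifier (the slide does not name a library, and the feature layout and toy data here are assumptions for illustration).

```python
from sklearn.tree import DecisionTreeClassifier

# features per block:
# [word count, <a> count, previous word count, previous <a> count]
X_train = [
    [36, 0, 4, 1],   # long block after a short linky block -> main
    [40, 1, 30, 0],  # long block after a long block        -> main
    [3, 1, 0, 0],    # "click here"-style block             -> not main
    [2, 2, 3, 1],    # share-button block                   -> not main
]
y_train = ["main", "main", "not_main", "not_main"]

# ① training: fit a decision tree on labeled blocks
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ② live data: label unseen blocks, then keep only the "main" ones
live_blocks = [[45, 0, 36, 0], [2, 3, 45, 0]]
labels = clf.predict(live_blocks)
main_blocks = [b for b, lab in zip(live_blocks, labels) if lab == "main"]
```

A real deployment would train on thousands of annotated blocks; the point is only the (features, label) → tree → predictions shape of the pipeline.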

Feature Extraction from HTML

Step 1: Separate the HTML into 'text blocks'.

Step 2: Extract local features for every text block.

ex: word count = 36, num of <a> = 0

Step 3: Define the feature of each text block as a combination of local features.

ex: word count (current block): 36, num of <a> (current block): 0, word count (previous block): 4, num of <a> (previous block): 1
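Steps 2 and 3 can be sketched in a few lines: compute local features per block, then combine each block's features with the previous block's. The input format (text, link count) and the feature choice are illustrative assumptions.

```python
def block_features(blocks):
    feats = []
    prev_wc, prev_a = 0, 0  # no previous block before the first one
    for text, num_a in blocks:
        wc = len(text.split())  # Step 2: local features
        # Step 3: combine current and previous blocks' local features
        # [word count (cur), <a> count (cur), word count (prev), <a> count (prev)]
        feats.append([wc, num_a, prev_wc, prev_a])
        prev_wc, prev_a = wc, num_a
    return feats

feats = block_features([("click here for", 1),
                        ("Robert Bates was a volunteer deputy", 0)])
```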


Making Main Content Using Decision Tree

block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main

The blocks labeled main are kept as the main content.


Web Document Classification (recap)

There are roughly two steps:

① Main Content Extraction

② Text Classification

Text Classification

Ordinary text classification architecture:

① training: feature extraction turns labeled documents into (features, entertainment), (features, sports), (features, entertainment), (features, politics), …, and a training algorithm turns them into a classifier.

② live data: a new document is turned into (features), and the classifier predicts its category, e.g. sports.

Feature Extraction in Text Classification

'Bag-of-words' is commonly used as a feature vector, with some feature engineering.

ex: "Will LeBron James deliver an NBA championship to Cleveland?"
→ { Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland }

Feature engineering: stop words (Will, an, to), a sports-players dictionary (LeBron James → NBA_PLAYER), tf-idf weighting.
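A minimal sketch of bag-of-words with the feature engineering named above, stop-word removal and a player dictionary. The word lists are tiny illustrative stand-ins; real stop-word lists and dictionaries are far larger, and the raw counts would normally be reweighted with tf-idf.

```python
from collections import Counter

STOP_WORDS = {"will", "an", "to", "a", "the"}
PLAYER_DICT = {"lebron james": "nba_player"}  # dictionary-based replacement

def bag_of_words(text):
    t = text.lower()
    for name, token in PLAYER_DICT.items():
        t = t.replace(name, token)  # "LeBron James" -> NBA_PLAYER token
    words = [w.strip("?.,!") for w in t.split()]
    # count every surviving token: this Counter is the feature vector
    return Counter(w for w in words if w and w not in STOP_WORDS)

bow = bag_of_words("Will LeBron James deliver an NBA championship to Cleveland?")
```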

Feature Extraction in Text Classification

The same approach is used in Japanese.

ex: 私は中路です。よろしくお願いします。 ("I am Nakaji. Nice to meet you.")
→ tokens: 私 / は / 中路 / です / よろしく / お願い / し / ます

Feature engineering: stop words (は, です, し, ます), a person dictionary (中路 → PERSON), tf-idf weighting.

Another Option: Paragraph Vector

A whole document is mapped to a single dense vector (dimension ~ several hundred).

Example:

私は中路です。よろしくお願いします。 → [0.2, 0.3, …, 0.2]

"Will LeBron James deliver an NBA championship to Cleveland?" → [0.1, 0.4, …, 0.1]

Outline of Distributed Representation

・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/)

・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053)

Word Vector in the word2vec Model

Every word is mapped to a unique word vector with good properties.

v_Germany = [0.1, 0.2, …, 0.2]
v_Berlin = [0.1, 0.1, …, -0.1]
v_France = [0.3, 0.4, …, 0]
v_Paris = [0.3, 0.3, …, 0.3]

"v_Germany − v_Berlin ≈ v_France − v_Paris"
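A toy illustration of the analogy property above, with hand-set 2-d vectors chosen so that one axis encodes "country vs. capital" and the other separates the country pairs. Real word2vec vectors satisfy the relation only approximately; here it holds exactly by construction.

```python
vecs = {
    "germany": [1.0, 0.0], "berlin": [1.0, 1.0],
    "france":  [0.0, 0.0], "paris":  [0.0, 1.0],
    "japan":   [2.0, 0.0], "tokyo":  [2.0, 1.0],
}

def nearest(target, exclude):
    # nearest word by squared Euclidean distance, skipping the query words
    def dist(w):
        return sum((a - b) ** 2 for a, b in zip(vecs[w], target))
    return min((w for w in vecs if w not in exclude), key=dist)

# "Berlin - Germany + France" should land on "Paris"
query = [b - g + f for b, g, f in
         zip(vecs["berlin"], vecs["germany"], vecs["france"])]
answer = nearest(query, exclude={"berlin", "germany", "france"})
```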

Procedure to Create Word Vectors

Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Training corpus: a stream of words $w_1, \dots, w_T$, e.g. "A cat sat on the street. I love cats very much. … He comes from Japan. …"

Objective function (cbow case):

$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c})$

Model (sum case):

$P(w_t \mid w_{t-c}, \dots, w_{t+c}) = \dfrac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}$, where $v = \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}}$

with parameters $u_w$ and $v_w$ for each word $w$; $v_w$ is the word vector for $w$.

Procedure: ① Maximize $L$ with respect to $u_w$ and $v_w$.

Word vectors are trained so that they become good features for predicting surrounding words.
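A numeric illustration of the sum-case model above: the context vector v is the sum of the context words' input vectors, and P(w_t | context) is a softmax over dot products with the output vectors u_W. All vectors here are tiny made-up 2-d examples, not trained values.

```python
import math

v_w = {"the": [0.1, 0.0], "cat": [0.5, 0.2], "on": [0.0, 0.1]}   # input vectors
u_w = {"sat": [0.4, 0.3], "ran": [0.1, 0.0], "the": [0.0, 0.1]}  # output vectors

def p_center(center, context):
    # v = sum of the context words' input vectors (the "sum case")
    v = [sum(col) for col in zip(*(v_w[w] for w in context))]
    # softmax over u_W . v across the (toy) output vocabulary
    scores = {w: math.exp(sum(a * b for a, b in zip(u, v)))
              for w, u in u_w.items()}
    return scores[center] / sum(scores.values())

# probability of the center word "sat" given its context window
p_sat = p_center("sat", ["the", "cat", "on"])
```

Training step ① would adjust u_w and v_w to push probabilities like p_sat up across the whole corpus.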


Procedure to Create Paragraph Vectors

Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Add a vector $d_i$ to the model for each document:

doc_1: "A cat sat on the street. I love cats very much." doc_2: "… He comes from Japan. …"

Objective function (dbow case):

$L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i)$

Model (sum case):

$P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}_i) = \dfrac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}$, where $v = \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}} + d_i$

and $\mathrm{doc}_i$ is the document where $w_t$ is included.

Procedure: ① Maximize $L$ with respect to $u_w$, $v_w$, and $d_i$. ② Preserve $u_w$ and $v_w$.

Procedure to Create Paragraph Vector (new document)

After training, we can get a good paragraph vector as a feature for a new document.

doc: "We love SmartNews. I love SmartNews very much."

Objective function (dbow case):

$L_{\mathrm{doc}} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc})$

Model (sum case):

$P(w_t \mid w_{t-c}, \dots, w_{t+c}, \mathrm{doc}) = \dfrac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}$, where $v = \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}} + d$

Procedure: ③ Maximize $L_{\mathrm{doc}}$ with respect to $d$ (with $u_w$ and $v_w$ fixed). ④ Use $d$ as the paragraph vector.

Procedure to Create Paragraph Vector (summary)

training: maximize $L$ to obtain the feature extractor ($u_w$, $v_w$)

live data: maximize $L_{\mathrm{doc}}$ to obtain the paragraph vector $d$, e.g. [0.2, 0.3, …, 0.2]
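Inference steps ③ and ④ can be sketched in pure Python: the word matrices u_w and v_w are frozen (tiny made-up values here; a real model would have trained them first) and only the document vector d is optimized. A numerical gradient keeps the sketch short; real implementations use analytic gradients.

```python
import math

v_w = {"we": [0.2, 0.1], "love": [0.4, 0.0], "smartnews": [0.1, 0.5]}
u_w = {"we": [0.1, 0.2], "love": [0.3, 0.1], "smartnews": [0.2, 0.4]}

def log_p(center, context, d):
    # v = sum of context word vectors + document vector d
    v = [sum(vals) + di for vals, di in
         zip(zip(*(v_w[w] for w in context)), d)]
    scores = {w: math.exp(sum(a * b for a, b in zip(u, v)))
              for w, u in u_w.items()}
    return math.log(scores[center] / sum(scores.values()))

def doc_log_likelihood(words, d):
    # L_doc: each word in turn is the center, the rest are its context
    return sum(log_p(words[t], [w for i, w in enumerate(words) if i != t], d)
               for t in range(len(words)))

def infer_doc_vector(words, dim=2, lr=0.1, epochs=50, eps=1e-4):
    d = [0.0] * dim
    for _ in range(epochs):
        for k in range(dim):  # step ③: gradient ascent on d only
            hi, lo = d[:], d[:]
            hi[k] += eps
            lo[k] -= eps
            grad = (doc_log_likelihood(words, hi) -
                    doc_log_likelihood(words, lo)) / (2 * eps)
            d[k] += lr * grad
    return d  # step ④: d is the paragraph vector

d = infer_doc_vector(["we", "love", "smartnews"])
```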

Text Classification

Ordinary text classification architecture, now with paragraph vectors as features:

① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier

② live data: ([0.1, -0.1, …]) → classifier → sports
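The same architecture can be sketched with scikit-learn's LogisticRegression (an assumed choice of classifier), with toy 3-d vectors standing in for real inferred paragraph vectors.

```python
from sklearn.linear_model import LogisticRegression

# ① training: (paragraph vector, label) pairs; values are made up
X_train = [
    [0.1, 0.3, 0.0],   # entertainment
    [0.1, 0.1, 0.1],   # entertainment
    [0.2, -0.3, 0.9],  # sports
    [0.1, -0.2, 0.8],  # sports
]
y_train = ["entertainment", "entertainment", "sports", "sports"]

clf = LogisticRegression().fit(X_train, y_train)

# ② live data: infer a paragraph vector for a new document, then classify it
label = clf.predict([[0.15, -0.25, 0.85]])[0]
```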

Benefits of Using Paragraph Vector

Good:

・High scalability: we don't need to work hard on feature engineering for each language.

・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data sets (labeled: tens of thousands of documents; unlabeled: ~100,000).

Bad:

・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.

Benefits of Using Paragraph Vector

It is also important that Paragraph Vector has a different nature than Bag-of-Words: we can get a better classifier by combining two different types of classifiers.

Our Use Case

Validation: use one classifier to validate the other.

Combination: use the more reliable result of the two classifiers, the Bag-of-Words-based classifier vs. the Paragraph-Vector-based classifier.

Our Use Case (future)

In multilingual localization: use only the Paragraph-Vector-based classifier, without any feature engineering.
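The Combination use case can be sketched as picking whichever classifier is more confident. The (label, confidence) interface is an assumption for illustration; the actual reliability signal SmartNews uses is not specified on the slide.

```python
def combine(bow_pred, pv_pred):
    # each prediction is a (label, confidence) pair, confidence in [0, 1];
    # keep the label from the more confident of the two classifiers
    return max([bow_pred, pv_pred], key=lambda pred: pred[1])[0]

# Bag-of-Words classifier is unsure, Paragraph Vector classifier is confident
label = combine(("sports", 0.55), ("entertainment", 0.90))
```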


The Challenge

Exploitation vs. Exploration

What Big Data firms typically do: preference estimation and risk quantification (exploitation).

What SmartNews does: uncertainty-seeking discovery (exploration). News calls for uncertainty seeking for the sake of long-term value.

What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?

The Challenge

We search for a form of exploration that is not optimal, but acceptable.

Why? Humans are not rational enough to simply accept the optimum, and without acceptance, users will never read SmartNews.

We are developing:

・topic extraction

・image extraction

・a multi-armed-bandit-based scoring model

① for better feature vectors of users and articles: a feature vector of interests for each of 10 million users × a real-time feature vector for articles

② for human-acceptable exploration
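One simple instance of a multi-armed-bandit scoring model is epsilon-greedy: mostly exploit the best-scoring arm (e.g. an article slot), but explore a random one with probability epsilon, so users still discover new topics. This is a generic textbook sketch, not SmartNews's actual model.

```python
import random

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # exploit: arm with the highest estimated reward
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental running mean of observed rewards
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedy(n_arms=2, epsilon=0.0)
bandit.update(0, 0.0)  # arm 0: user skipped the article
bandit.update(1, 1.0)  # arm 1: user read the article
```

With epsilon > 0, occasional random picks keep gathering signal on underexposed articles instead of always showing the current best.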

We are building our engineering team in SF - please join us!

We're hiring!

・ML/NLP Engineer

・Data Science Engineer

kohei.nakaji@smartnews.com

References

Main Content Extraction

・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl. Boilerplate Detection Using Shallow Text Features.

・BoilerPipe (Google Code)

Text Classification

・Quoc V. Le, Tomas Mikolov. Distributed Representations of Sentences and Documents.

・Word2Vec (Google Code)

About SmartNews

・Japan's SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S.

・SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S.

・Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M

・About our Company SmartNews
