“scalable distributed online machine learning framework ... · •we need machine learning to...

7
© 2012 NTT Software Innovation Center “Scalable Distributed Online Machine Learning Framework for Realtime Analysis of Big Data” XLDB2012 Sep. 12, 2012 Hiroyuki MAKINO, NTT Software Innovation Center

Upload: others

Post on 03-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: “Scalable Distributed Online Machine Learning Framework ... · •We need machine learning to classify tweets automatically in realtime. Use Case: Social stream analysis Twitter

© 2012 NTT Software Innovation Center

“Scalable Distributed Online Machine Learning Framework for

Realtime Analysis of Big Data”

XLDB2012 Sep. 12, 2012

Hiroyuki MAKINO, NTT Software Innovation Center

Page 2: “Scalable Distributed Online Machine Learning Framework ... · •We need machine learning to classify tweets automatically in realtime. Use Case: Social stream analysis Twitter

© 2012 NTT Software Innovation Center

• To satisfy “Scalable”, “Realtime”, and “Profound

analysis” for Big Data.

Objective of Jubatus

SVMlight

Profound

Analysis

Scalability

For velocity

For volume

For variety

Page 3: “Scalable Distributed Online Machine Learning Framework ... · •We need machine learning to classify tweets automatically in realtime. Use Case: Social stream analysis Twitter

© 2012 NTT Software Innovation Center

• Social stream analysis for marketing research • We want to know reputation or mood

from their voice.

• Too hard to go over each tweet • We need machine learning to

classify tweets automatically in realtime.

Use Case: Social stream analysis

Twitter

Realtime twitter

analysis with

Jubatus

Analysis

result

More than

8000 tweets/sec

Demonstration See you in the poster session

Classifies into

1600 companies

Page 4: “Scalable Distributed Online Machine Learning Framework ... · •We need machine learning to classify tweets automatically in realtime. Use Case: Social stream analysis Twitter

© 2012 NTT Software Innovation Center

Performance The number of servers and throughput

Throughput and linear scalability

Classification Tests: Classifying tweets into 1600

companies automatically

- Throughput: Classifies all the Japanese

tweets (6700 tweets/sec) with 1 server

Accuracy and learning time

The number of

servers

Accuracy

- Achieving the accuracy as good as batch

processing based learning.

- 90% accuracy in 1 sec. with 16 servers

Accuracy of batch learning

0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

0 2 4 6 8 10 12 14 16 18

The number of servers

[million]

Accura

cy[%

]

90% accuracy

in 1 sec.

*SPAM classification task (Dataset: LIBSVM webspam)

Th

roug

hput[

featu

res/s

ec.]

6700 tweets/sec.

with 1 server

100 thousand

tweets/sec.

with 16 servers

* 200,000 features / approx. 30 features (per tweet) = 6,700 tweets/sec.

Page 5: “Scalable Distributed Online Machine Learning Framework ... · •We need machine learning to classify tweets automatically in realtime. Use Case: Social stream analysis Twitter

© 2012 NTT Software Innovation Center

• User process • Implemented using Jubatus client API (C++/Java/Python/PHP/Ruby clients)

• Acquires input data and sends requests to servers via proxies

• Proxy process (JubaKeeper) • Relays clients’ requests to servers

• Server process (JubaServer) • Performs the training and prediction processing, and learning model synchronization

• Increases performance linearly with the number of servers

• ZooKeeper process • Manages distributed coordination, such as process alive check and leader selection

KEY POINT 1: Jubatus Architecture

User processes Proxy processes Server processes

ZooKeeper processes

API

API

API

MIX

(learning model

synchronization)

Big Data

RPC RPC

Page 6: “Scalable Distributed Online Machine Learning Framework ... · •We need machine learning to classify tweets automatically in realtime. Use Case: Social stream analysis Twitter

© 2012 NTT Software Innovation Center

• Deterioration in the consistency of data synchronization is considered to be a loss of

data to be trained, and it results in a decrease in accuracy.

• MIX technique can increase robustness by exchanging the intermediate results

among servers loosely.

KEY POINT 2: MIX technique

Data

Analysis Analysis Analysis Analysis

Intermediate

Results r0

Intermediate

Results r1

Intermediate

Results r2

Intermediate

Results rn

Aggregated

Result R

Aggregated

Result R

Aggregated

Result R

1. Leader election

Server failures:

Stop sharing

aggregated result

Server addition:

Start sharing

aggregated result

Aggregated

Result R R = f (r0, r1, r2, …, rn)

MIX

calculation 2. Aggregation

4. Sharing the aggregated result

Membership management

MIX technique

MIX calculator [Calculation for aggregated

result based on intermediate

results]

MIX Protocol

controller [Regulating aggregated

result]

Membership

manager [Leader election, members

addition or deletion]

3. Calculating

aggregated

result from

intermediate

results

addition

Page 7: “Scalable Distributed Online Machine Learning Framework ... · •We need machine learning to classify tweets automatically in realtime. Use Case: Social stream analysis Twitter

© 2012 NTT Software Innovation Center

• Jubatus OSS website

• http://jubat.us

• Github https://github.com/jubatus/jubatus

Open Source Software

Supported machine learning

engines Description Algorithms Usecases

Linear classification Classify input data into

given categories

Perceptron, Passive

Aggressive (PA), Confidence

Weighted Learning (CW),

AROW,

Normal HERD (NHERD)

Spam mail

Twitter classification

Recommendation Recommend similar data

as input data Inverted Index, LSH, MinHash

Item recommendation

Advertisement

Regression Estimate output value for

input data SVR using PA

Power consumption

estimation

Stock price estimation

Statistics

Calculate statistics data,

such as sum, max, min,

average, standard

deviation, etc

Sensor monitoring

Data anomaly detection

Graph mining Extract a centrality and

shortest path of the given

graph structure

Centrality

computation(PageRank),

Shortest path search

Social community

analysis

Network traffic analysis