
“Scalable Distributed Online Machine Learning Framework for Realtime Analysis of Big Data”

XLDB2012 Sep. 12, 2012

Hiroyuki MAKINO, NTT Software Innovation Center

Objective of Jubatus

• To satisfy “Scalable”, “Realtime”, and “Profound analysis” for Big Data.

[Figure: diagram with the labels "Profound Analysis", "Scalability", "SVMlight", "For velocity", "For volume", and "For variety"]

Use Case: Social stream analysis

• Social stream analysis for marketing research
  - We want to know the reputation or mood that people express in their own words.
• It is too hard to read through each tweet
  - We need machine learning to classify tweets automatically in realtime.

[Figure: Twitter -> realtime Twitter analysis with Jubatus -> analysis result. More than 8000 tweets/sec, classified into 1600 companies.]

Demonstration: see you in the poster session.

Performance

Accuracy and learning time
- Achieves accuracy as good as batch-processing-based learning.
- 90% accuracy in 1 sec. with 16 servers

* SPAM classification task (Dataset: LIBSVM webspam)

Throughput and linear scalability

Classification test: classifying tweets into 1600 companies automatically
- Throughput: classifies all the Japanese tweets (6700 tweets/sec.) with 1 server
- 100 thousand tweets/sec. with 16 servers

[Figures: the number of servers and throughput (axis ticks 0 to 18, scale in [million]); accuracy chart with the reference level "Accuracy of batch learning" and the annotation "90% accuracy in 1 sec."]

* 200,000 features / approx. 30 features (per tweet) = 6,700 tweets/sec.
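The throughput footnote (200,000 features / approx. 30 features per tweet) can be checked with a short calculation. This is only a sketch: it assumes the 200,000 figure is a per-second feature rate for one server, which the footnote implies but does not state, and the variable names are illustrative.

```python
# Worked check of the slide's throughput estimate.
# Assumption: 200,000 is a per-second feature-processing rate on one server.
features_per_sec = 200_000
features_per_tweet = 30          # "approx. 30 features (per tweet)"

tweets_per_sec = features_per_sec / features_per_tweet
print(round(tweets_per_sec))     # 6667, which the slide rounds to 6,700

# If throughput scaled perfectly linearly, 16 servers would give:
ideal_16 = 16 * 6_700
print(ideal_16)                  # 107200, in line with "100 thousand tweets/sec."
```

The ideal 16-server figure is slightly above the reported 100 thousand tweets/sec., which is what near-linear (not perfectly linear) scaling would look like.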

KEY POINT 1: Jubatus Architecture

• User process
  - Implemented using the Jubatus client API (C++/Java/Python/PHP/Ruby clients)
  - Acquires input data and sends requests to servers via proxies
• Proxy process (JubaKeeper)
  - Relays clients’ requests to servers
• Server process (JubaServer)
  - Performs training and prediction processing, and learning model synchronization
  - Increases performance linearly with the number of servers
• ZooKeeper process
  - Manages distributed coordination, such as process liveness checks and leader election

[Figure: Big Data -> user processes -> (RPC) -> proxy processes -> (RPC) -> server processes (learning model synchronization), with ZooKeeper processes coordinating]
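The request path in this architecture (user process -> proxy -> server) can be illustrated with a toy, in-process simulation. This is not the Jubatus client API: `ToyServer`, `ToyProxy`, and the round-robin relay policy are all hypothetical names and assumptions made for the sketch, and plain method calls stand in for RPC.

```python
from itertools import cycle

class ToyServer:
    """Stands in for a server process: it only counts the requests it handles."""
    def __init__(self, name):
        self.name = name
        self.handled = 0

    def handle(self, request):
        self.handled += 1
        return f"{self.name} handled {request}"

class ToyProxy:
    """Stands in for a proxy process: relays each request to one server."""
    def __init__(self, servers):
        # Round-robin selection is an assumption of this sketch.
        self._next_server = cycle(servers)

    def relay(self, request):
        return next(self._next_server).handle(request)

servers = [ToyServer(f"server-{i}") for i in range(4)]
proxy = ToyProxy(servers)

# The user process talks only to the proxy, never to a server directly.
replies = [proxy.relay(f"tweet-{i}") for i in range(8)]
print([s.handled for s in servers])   # [2, 2, 2, 2]
```

Because the user process depends only on the proxy, servers can be added behind it without changing client code, which is the property that makes scaling with the number of servers possible.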

KEY POINT 2: MIX technique

• Deterioration in the consistency of data synchronization is treated as a loss of training data, and it results in decreased accuracy.
• The MIX technique increases robustness by loosely exchanging intermediate results among servers.

[Figure: each server runs its own analysis and produces intermediate results r0, r1, r2, …, rn; these are combined into an aggregated result R, which is shared back to the servers.]

MIX steps:
1. Leader election
2. Aggregation
3. Calculating the aggregated result from the intermediate results: R = f(r0, r1, r2, …, rn)
4. Sharing the aggregated result

Membership management:
- Server failure: stop sharing the aggregated result
- Server addition: start sharing the aggregated result

Components:
- MIX calculator [calculation of the aggregated result based on intermediate results]
- MIX protocol controller [regulating the aggregated result]
- Membership manager [leader election, member addition or deletion]
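The four MIX steps can be sketched in a few lines. This is a minimal illustration, not Jubatus code: the slide leaves f unspecified, so the element-wise mean used here is an assumption, as are the "smallest id wins" leader election and the `mix` function name.

```python
def mix(local_models):
    """Run one MIX round over {server_id: weight_vector}.

    Returns the elected leader id and the aggregated result R.
    """
    # 1. Leader election (assumption: the smallest server id wins).
    leader = min(local_models)
    # 2. Aggregation: the leader gathers the intermediate results r0..rn.
    gathered = list(local_models.values())
    # 3. Calculation: R = f(r0, ..., rn), with f assumed to be the mean.
    n = len(gathered)
    R = [sum(ws) / n for ws in zip(*gathered)]
    # 4. Sharing: every current member replaces its model with R.
    for server_id in local_models:
        local_models[server_id] = list(R)
    return leader, R

models = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
leader, R = mix(models)
print(leader, R)   # 0 [1.0, 1.0]

# Server failure: remove it from membership so it no longer takes part.
del models[2]
models[0] = [3.0, 1.0]   # server 0 keeps learning locally between rounds
leader, R = mix(models)
print(R)           # [2.0, 1.0]
```

Between rounds each server trains independently, so a missed round costs some training signal rather than correctness, which matches the slide's point about loose synchronization.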

Open Source Software

• Jubatus OSS website
  - http://jubat.us
• GitHub
  - https://github.com/jubatus/jubatus

| Supported machine learning engines | Description | Algorithms | Use cases |
|---|---|---|---|
| Linear classification | Classify input data into given categories | Perceptron, Passive Aggressive (PA), Confidence Weighted Learning (CW), Normal HERD (NHERD) | Spam mail, Twitter classification |
| Recommendation | Recommend data similar to the input data | Inverted Index, LSH, MinHash | Item recommendation |
| Regression | Estimate the output value for input data | SVR using PA | Power consumption estimation, stock price estimation |
| Statistics | Calculate statistics such as sum, max, min, average, standard deviation, etc. | | Sensor monitoring, data anomaly detection |
| Graph mining | Extract the centrality and shortest paths of a given graph structure | Centrality computation (PageRank), shortest path search | Social community analysis, network traffic analysis |
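Among the linear classification algorithms listed, Passive Aggressive (PA) is a compact way to see what "online" learning means: the model is updated one example at a time, with no batch pass over the data. The sketch below is the basic binary PA update written from its textbook form, not code taken from Jubatus; the function names and the toy features are illustrative assumptions.

```python
def pa_update(w, x, y):
    """Apply one Passive Aggressive step for a binary label y in {-1, +1}.

    w and x are sparse vectors stored as {feature: value} dicts;
    x is assumed to be non-empty.
    """
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - margin)                      # hinge loss
    if loss > 0.0:
        tau = loss / sum(v * v for v in x.values())    # step size
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + tau * y * v
    return w

def predict(w, x):
    """Return +1 or -1 from the sign of the linear score."""
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1 if score >= 0 else -1

w = {}
stream = [
    ({"free": 1.0, "offer": 1.0}, +1),     # toy "spam" example
    ({"meeting": 1.0, "notes": 1.0}, -1),  # toy "not spam" example
]
for x, y in stream:        # a single pass, one example at a time
    pa_update(w, x, y)

print(predict(w, {"free": 1.0}))      # 1
print(predict(w, {"meeting": 1.0}))   # -1
```

The update is "passive" when the example is already classified with enough margin (no change) and "aggressive" otherwise (the smallest change that fixes the margin), so each example costs time proportional to its number of features.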
