“scalable distributed online machine learning framework ... · •we need machine learning to...
Post on 03-Jul-2020
2 Views
Preview:
TRANSCRIPT
© 2012 NTT Software Innovation Center
“Scalable Distributed Online Machine Learning Framework for
Realtime Analysis of Big Data”
XLDB2012 Sep. 12, 2012
Hiroyuki MAKINO, NTT Software Innovation Center
© 2012 NTT Software Innovation Center
• To satisfy “Scalable”, “Realtime”, and “Profound
analysis” for Big Data.
Objective of Jubatus
SVMlight
Profound
Analysis
Scalability
For velocity
For volume
For variety
© 2012 NTT Software Innovation Center
• Social stream analysis for marketing research • We want to know reputation or mood
from their voice.
• Too hard to go over each tweet • We need machine learning to
classify tweets automatically in realtime.
Use Case: Social stream analysis
Realtime twitter
analysis with
Jubatus
Analysis
result
More than
8000 tweets/sec
Demonstration See you in the poster session
Classifies into
1600 companies
© 2012 NTT Software Innovation Center
Performance The number of servers and throughput
Throughput and linear scalability
Classification Tests: Classifying tweets into 1600
companies automatically
- Throughput: Classifies all the Japanese
tweets (6700 tweets/sec) with 1 server
Accuracy and learning time
The number of
servers
Accuracy
- Achieving the accuracy as good as batch
processing based learning.
- 90% accuracy in 1 sec. with 16 servers
Accuracy of batch learning
0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
0 2 4 6 8 10 12 14 16 18
The number of servers
[million]
Accura
cy[%
]
90% accuracy
in 1 sec.
*SPAM classification task (Dataset: LIBSVM webspam)
Th
roug
hput[
featu
res/s
ec.]
6700 tweets/sec.
with 1 server
100 thousand
tweets/sec.
with 16 servers
* 200,000 features / approx. 30 features (per tweet) = 6,700 tweets/sec.
© 2012 NTT Software Innovation Center
• User process • Implemented using Jubatus client API (C++/Java/Python/PHP/Ruby clients)
• Acquires input data and sends requests to servers via proxies
• Proxy process (JubaKeeper) • Relays clients’ requests to servers
• Server process (JubaServer) • Performs the training and prediction processing, and learning model synchronization
• Increases performance linearly with the number of servers
• ZooKeeper process • Manages distributed coordination, such as process alive check and leader selection
KEY POINT 1: Jubatus Architecture
User processes Proxy processes Server processes
ZooKeeper processes
API
API
API
MIX
(learning model
synchronization)
Big Data
RPC RPC
© 2012 NTT Software Innovation Center
• Deterioration in the consistency of data synchronization is considered to be a loss of
data to be trained, and it results in a decrease in accuracy.
• MIX technique can increase robustness by exchanging the intermediate results
among servers loosely.
KEY POINT 2: MIX technique
Data
Analysis Analysis Analysis Analysis
Intermediate
Results r0
Intermediate
Results r1
Intermediate
Results r2
Intermediate
Results rn
Aggregated
Result R
Aggregated
Result R
Aggregated
Result R
1. Leader election
Server failures:
Stop sharing
aggregated result
Server addition:
Start sharing
aggregated result
Aggregated
Result R R = f (r0, r1, r2, …, rn)
MIX
calculation 2. Aggregation
4. Sharing the aggregated result
Membership management
MIX technique
MIX calculator [Calculation for aggregated
result based on intermediate
results]
MIX Protocol
controller [Regulating aggregated
result]
Membership
manager [Leader election, members
addition or deletion]
3. Calculating
aggregated
result from
intermediate
results
addition
© 2012 NTT Software Innovation Center
• Jubatus OSS website
• http://jubat.us
• Github https://github.com/jubatus/jubatus
Open Source Software
Supported machine learning
engines Description Algorithms Usecases
Linear classification Classify input data into
given categories
Perceptron, Passive
Aggressive (PA), Confidence
Weighted Learning (CW),
AROW,
Normal HERD (NHERD)
Spam mail
Twitter classification
Recommendation Recommend similar data
as input data Inverted Index, LSH, MinHash
Item recommendation
Advertisement
Regression Estimate output value for
input data SVR using PA
Power consumption
estimation
Stock price estimation
Statistics
Calculate statistics data,
such as sum, max, min,
average, standard
deviation, etc
Sensor monitoring
Data anomaly detection
Graph mining Extract a centrality and
shortest path of the given
graph structure
Centrality
computation(PageRank),
Shortest path search
Social community
analysis
Network traffic analysis
top related