SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2016)
TRANSCRIPT
SAMOA: A Platform for Mining Big Data Streams
Nicolas Kourtellis
Associate Researcher
Telefonica I+D, Barcelona
@kourtellis
@ApacheSAMOA
SAMOA: Scalable Advanced Massive Online Analysis, ApacheCon BigData, NA, 2016
What is Big Data?
Search queries
Facebook posts
Emails
Tweets
Photo shares
Clicks on ads
…
How BIG is your data?
Volume (+ Variety)
  Too large for RAM of single commodity server
Velocity
  Too fast for CPU of single commodity server
What is the Streaming Paradigm?
High amount of data, high speed of arrival
Updated models in “real” time
Potentially infinite sequence of data
Change over time (concept drift)
Mining Big Data Streams
Approximation algorithms:
  Single pass, one data item at a time
  Sub-linear space and time per data item
  Small error with high probability
A platform solution:
  Support different algorithms & processing engines
  Distributed
  Scalable
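To make the three properties of approximation algorithms concrete, here is a minimal sketch of a Count-Min sketch, a well-known approximate frequency counter. It is not taken from the slides or from SAMOA; the class name, the parameters `epsilon` and `delta`, and the use of salted BLAKE2 hashes (in place of the pairwise-independent hash family the textbook analysis assumes) are all illustrative assumptions.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counter: single pass, fixed memory (illustrative)."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Standard sizing: the estimate exceeds the true count by at most
        # epsilon * total with probability at least 1 - delta.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _index(self, item, row):
        # Assumption: a salted hash per row stands in for independent hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item):
        # One data item at a time; work per item is proportional to depth only.
        self.total += 1
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum never undercounts.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Memory stays at `width * depth` counters no matter how long the stream runs, which is the sub-linear space property; the price is a small, bounded overestimate.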
What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines
Taxonomy
SAMOA Architecture
Machine Learning Algorithms
Distributed Stream Processing Engines
Flink
Why is SAMOA important?
Program once, run everywhere
  Reuse existing infrastructure
Avoid deploy cycles
  No system downtime
  No complex backup/update process
  No need to select update frequency
ML Developer API
ML Developer API
Deployment
Easy to get!
Easy to get!
Easy to get!
Easy to test!
bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k) -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
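The command above runs a prequential (test-then-train) evaluation. As a rough illustration of that loop, the sketch below tests each instance before training on it and records accuracy at a fixed interval. Everything here is a hypothetical stand-in rather than SAMOA code: the majority-class learner, the toy stream, and the names `max_instances` and `sample_frequency`, which are assumed to play the roles of the `-i` and `-f` options.

```python
import random
from collections import Counter


class MajorityClassLearner:
    """Trivial stand-in learner: always predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return 0  # arbitrary default before any training
        return self.counts.most_common(1)[0][0]

    def train(self, x, y):
        self.counts[y] += 1


def prequential_evaluation(stream, learner, max_instances, sample_frequency):
    """Test on each instance first, then train on it; sample accuracy periodically."""
    correct = 0
    seen = 0
    report = []
    for x, y in stream:
        if seen >= max_instances:
            break
        if learner.predict(x) == y:  # test ...
            correct += 1
        learner.train(x, y)          # ... then train
        seen += 1
        if seen % sample_frequency == 0:
            report.append((seen, correct / seen))
    return report


def toy_stream(seed=1):
    """Hypothetical endless stream: 4 random attributes, label 1 about 70% of the time."""
    rng = random.Random(seed)
    while True:
        x = [rng.random() for _ in range(4)]
        y = 1 if rng.random() < 0.7 else 0
        yield x, y


if __name__ == "__main__":
    for seen, accuracy in prequential_evaluation(
        toy_stream(), MajorityClassLearner(), 1000, 100
    ):
        print(seen, round(accuracy, 3))
```

Because every instance is used for testing before the model sees its label, no separate hold-out set is needed, which suits potentially infinite streams.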
Case study: Decision Trees
VHT: Vertical Hoeffding Tree*
Task parallelism
*VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.
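For background, a Hoeffding tree decides when a leaf has seen enough instances to split by comparing the gap between the two best split candidates against the Hoeffding bound. The sketch below shows that standard test in isolation; the function names and default values are illustrative assumptions, and the slides do not say how VHT configures it.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean of a variable with the given
    range lies within this distance of the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, n, delta=1e-7, num_classes=2):
    """Split once the best attribute beats the runner-up by more than the bound.

    Assumption: the split criterion is information gain, whose range is
    log2(num_classes).
    """
    value_range = math.log2(num_classes)
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as `n` grows, so a leaf that cannot separate two close candidates early on may still split after more instances arrive.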
Case study: VHT
Horizontal Parallelism
Case study: VHT
Vertical Parallelism
Benefits of Vertical Parallelism
High number of attributes:
  high level parallelism (e.g., documents)
vs. task parallelism:
  obvious parallelism observed
vs. horizontal parallelism:
  reduced memory usage (no model replication)
  parallelized split computation
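The idea of vertical parallelism can be sketched as follows: each instance is sliced by attribute, each slice is routed to a worker that keeps statistics only for its own attributes, and the best split is found by combining the workers' local answers. This is a simplified single-leaf illustration under stated assumptions, not the VHT implementation: the class names, the modulo routing, and the restriction to categorical attributes with information gain are all hypothetical choices.

```python
import math
from collections import defaultdict


def entropy(class_counts):
    """Shannon entropy (bits) of a label -> count mapping."""
    total = sum(class_counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in class_counts.values() if c > 0
    )


class LocalStatistics:
    """One worker: holds counters only for the attributes routed to it."""

    def __init__(self):
        # counts[attribute][value][label] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, attribute, value, label):
        self.counts[attribute][value][label] += 1

    def best_local_split(self):
        """Return (attribute, information gain) for this worker's best attribute."""
        best = (None, -1.0)
        for attribute, by_value in self.counts.items():
            class_totals = defaultdict(int)
            for label_counts in by_value.values():
                for label, c in label_counts.items():
                    class_totals[label] += c
            n = sum(class_totals.values())
            remainder = sum(
                sum(lc.values()) / n * entropy(lc) for lc in by_value.values()
            )
            gain = entropy(class_totals) - remainder
            if gain > best[1]:
                best = (attribute, gain)
        return best


class ModelAggregator:
    """Slices instances by attribute and combines the workers' local results."""

    def __init__(self, parallelism):
        self.workers = [LocalStatistics() for _ in range(parallelism)]

    def train(self, instance, label):
        # Assumption: attribute i is always routed to worker i mod parallelism.
        for attribute, value in enumerate(instance):
            worker = self.workers[attribute % len(self.workers)]
            worker.update(attribute, value, label)

    def best_split(self):
        # Each worker evaluates its own attributes; only small results are merged.
        return max((w.best_local_split() for w in self.workers), key=lambda s: s[1])
```

No worker holds a copy of the whole model or of all attribute counters, which is where the reduced memory usage comes from, and the per-attribute gain computation is spread across workers.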
Vertical Hoeffding Tree
Preliminary results: Dense instances
Random decision tree
Mixed categorical and numerical attributes
  10-10, 100-100, 1k-1k, 10k-10k
Instances: 1,000,000
2 balanced classes
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT
Results: Accuracy
Results: Accuracy
Results: Accuracy Evolution
Results: Speedup
Results: Speedup
Preliminary results: Artificial Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative
Gaussian random variable
10 different seeded runs
Test every 100k instances
MOA HT vs. Local VHT vs. Storm cluster VHT
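A rough sketch of how such artificial tweets could be generated is shown below. It is not the generator used in the experiments. The slide does not say what the Gaussian random variable controls, so this sketch assumes it sets the tweet length (mean 15, with a made-up standard deviation of 2), and it assigns labels uniformly at random purely for illustration; the function and parameter names are hypothetical.

```python
import random


def make_tweet_generator(vocabulary_size, skew=1.5, mean_length=15, seed=1):
    """Return a function producing (bag_of_words, label) pairs.

    Word ranks follow a Zipf-like law: P(rank r) is proportional to 1 / r**skew.
    """
    rng = random.Random(seed)
    words = list(range(vocabulary_size))
    weights = [1.0 / (rank ** skew) for rank in range(1, vocabulary_size + 1)]

    def next_tweet():
        # Assumption: tweet length is Gaussian around mean_length, at least 1 word.
        length = max(1, round(rng.gauss(mean_length, 2.0)))
        bag = [0] * vocabulary_size  # one attribute per vocabulary word
        for word in rng.choices(words, weights=weights, k=length):
            bag[word] += 1
        # Assumption: labels are uniform; a real generator would tie them to words.
        label = rng.choice(["positive", "negative"])
        return bag, label

    return next_tweet
```

With a vocabulary of 10000 words and only about 15 words per tweet, almost every attribute is zero in any given instance, which is what distinguishes this setup from the dense instances used earlier.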
Results: Accuracy
Results: Accuracy
Results: Accuracy Evolution
Results: Speedup
Results: Speedup
Is SAMOA for you?
Are you dealing with:
  Big fast data?
  Possibly endless streams of data?
  Evolving data?
Do you need updated models in real time?
Do you want to test an algorithm on different DSPEs?
SAMOA Team
Albert Bifet
Gianmarco De Francisci Morales
Nicolas Kourtellis
Matthieu Morel
Arinto Murdopo
Olivier Van Laere
Status
Apache Incubator
Released version 0.3.0 in July
Execution Engines
Input: Local FS, HDFS, Avro, Kafka [pending]
Heron?
Apache Beam?
Algorithms in SAMOA
Existing:
  Vertical Hoeffding Tree (classification)
  CluStream (clustering)
  Adaptive Model Rules (regression)
Pending:
  Distributed Naïve Bayes
  Stochastic Gradient Descent
  Adaptive + Boosting VHT
  Parallelized Gradient Boosted Decision Tree
  PARMA (frequent pattern mining)
  …
Check Samoa Roadmap for more
Looking for contributors!
SAMOA: A Platform for Mining Big Data Streams
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis