big data beer apr2014
DESCRIPTION
Talk held at Big Data Beers at TU Berlin, April 15, 2014.Machine Learning on Streams: examples and challenges.Map Reduce and Apache Hadoop for batch processing.Apache Storm and Spark for stream processing.Approximate instead of scale with stream mining.Streamdrill.TRANSCRIPT
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Machine Learning on Streams
Dr. Mikio Braun@mikiobraun
TU Berlin & streamdrill
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Where does the data come from?
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
FinanceFinance GamingGaming MonitoringMonitoring
Social MediaSocial MediaSensor NetworksSensor NetworksAdvertismentAdvertisment
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Static Data vs. Streaming
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Mostly it's not even ML Turn event stream into some statistics
(events/second, trends, counting) Churn prediction (but really just training
on data) Clustering of news stories (ok) Outlier detection for monitoring (ok,
yeah)
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Still it's hard! Many Events
100 events / second
360k per hour 8.6M per day 260M per month 3.2B per year
Many Objects
http://www.flickr.com/photos/arenamontanus/269158554/
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Time for response Weekly reports: a day Recommendations: a few hours Web Analytics: seconds to a few minutes Ad Optimization: milliseconds
It's only real-time if you can react in real-time.
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Approaches to stream mining
streaming feature extraction
learning: update iteratively local ops only nothing beyond O(n^2)
query is often ok (cache)
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Isn't Machine Learning easily Online?
Stochastic gradient descent
converges, e.g. if
http://leon.bottou.org/research/stochastic
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Time horizons vs. Learning rate
You can't just do online learning on event data!
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Batch: Store and Crunch Just like you would
with static data, in a BIG FOR LOOP
data management? huge latency!
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Map Reduce Process
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Map Reduce and ML
Covers Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminative Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines
Neural InformationProcessing SystemsConference, 2006
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Apache Hadoop Industry Standard for Map Reduce Originally developed at Yahoo A faithful embodiment of the Java
Enterprise Mindset, bindings & tools for everything
Core components: HDFS + MapReduce
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Hadoop Facts You shouldn't be afraid of Java HDFS like a Poor Mans FS (read/write, no seek) to
distribute data Map splits essentially on file level (some file formats
are understood) Job startup takes quite some time (minutes) Map/Reduce jobs can be scripts => interface to
every language in principle Very extensible (in Java): User Defined Functions, etc.
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Parallelize: Stream Processing
Split work into small pieces of code handling a single event
Parallelize Good latency No Query Layer!
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Actor based concurrency Classical multi-threaded programming
considered harmful (locking, concurrent access)
Interacting actors (each running single-threaded)
Emphasis on functional programming & immutable data structures
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Actor based concurrency
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Apache Storm Basically scale the actor based model
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Parallelize: Micro-Batches Event-oriented
processing can be tedious
Apache Spark: Micro-Batches
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Apache Spark Stream processing not just map/reduce but more complete functional collection style API, also for
streams in memory hyped Hadoop competitor developed with support by UC Berkeley
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Approximate: Stream Mining
Scaling is nice, but WE ALL KNOW: Data is noisy Not every data point is important Methods are noisy, too Absolute numbers are often not important,
too
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Probabilistic Data Structures
Probabilistic Algorithm: With n space, perform task with error e = f(n) with e 0 as n
Mother of all algorithms: Bloomfilter
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Heavy Hitters (a.k.a. Top-k) Count activities over large item sets (millions,
even more, e.g. IP addresses, Twitter users) Interested in most active elements only.
Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conference on Database Theory, 2005
132142432553712023
15128532
Fixed tables of counts
Case 1: element already in data base
Case 2: new element
142 142 12 13
713 023 2
713 3
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Count-Min Sketches Summarize histograms over large feature sets Like bloom filters, but better
Query: Take minimum over all hash functions
0 0 3 01 1 0 2
0 2 0 00 3 5 2
0 5 3 22 4 5 0
1 3 7 30 2 0 8
m bins
n differenthash functions
Updates for new entryQuery result: 1
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Clustering with count-min Sketches Online clustering
For each data point: Map to closest centroid ( compute distances) Update centroid
count-min sketches to represent sum over all vectors in a class
Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009
0 0 3 01 1 0 2
0 2 0 00 3 5 2
0 5 3 22 4 5 0
1 3 7 30 2 0 8
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Wait a minute? Only Counting?
Well, getting the top most active items is already useful. Web analytics, Users, Trending Topics
Counting is statistics!
-
Big Data Beers, April 15, 2014, Berlin
2014 by Mikio L. Braun
Counting is Statistics Empirical mean:
Correlations:
Principal Component Analysis:
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
More: Maximum-Likelihood Estimate probabilistic models
based on
which is slightly biased, but simpler
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Outlier detection Once you have a model, you can compute
p-values (based on recent time frames!)
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Online TF-IDF
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Classification with Nave Bayes
Naive Bayes is also just counting, right?
class priors
Multinomnial nave Bayes
Priors
Total number of words in class
Number of times word appears in classfrequency of word in document
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Classification with Naive Bayes
ICML 2003
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Classification with Naive Bayes
7 Steps to improve NB: transform TF to log( . + 1) IDF-style normalization square length normalization use complement probability another log normalize those weights again Predict linearly using those weights
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
So much more to do with trends
Least Recently Used Caches Sparse Vectors Sparse Matrices Conditional Probabilities (Histograms) Accumulators ...
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Streamdrill Real-Time Analysis Solutions Core Engine:
Heavy Hitters counting + exponential decay Instant counts & top-k results over time windows In-Memory
Modules Profiling and Trending Recommendations Count Distinct
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Architecture Overview
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Example: Twitter Stock Analysis
http://play.streamdrill.com/vis/
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Example: Twitter Stock Analysis
Trends: symbol:combinations $AAPL:$GOOG symbol:hashtag $AAPL:#trading symbol:keywords $GOOG:disruption symbol:mentions $GOOG:WallStreetCom symbol trend $AAPL symbol:url $FB:http://on.wsj.com/15fHaZW
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Example: Twitter Stock Analysis
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Example: Twitter Stock Analysis
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
Example: Twitter Stock Analysis
Twitter
streamdrill
JavaScriptvia REST
Tweet Analyzer
tweets
updates
-
Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun
ML on Streams Constantly deal with new data It's often not that complex, really High data rate & event range Batch: High latency Hadoop Stream: Low latency Storm / Spark Approximate: Good enough Streamdrill
Slide 1Slide 2Slide 4Slide 5Slide 6Slide 8Slide 9Slide 10Slide 11Slide 13Slide 14Slide 15Slide 16Slide 17Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 27Slide 28Slide 29Slide 30Slide 31Slide 34Slide 35Slide 37Slide 38Slide 39Slide 40Slide 41Slide 42Slide 43Slide 44Slide 45Slide 46Slide 47Slide 48Slide 49Slide 50Slide 51