big data beer apr2014

Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

Machine Learning on Streams

Dr. Mikio Braun@mikiobraun

TU Berlin & streamdrill


Where does the data come from?

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

FinanceFinance GamingGaming MonitoringMonitoring

Social MediaSocial MediaSensor NetworksSensor NetworksAdvertismentAdvertisment


Static Data vs. Streaming


Mostly it's not even ML Turn event stream into some statistics

(events/second, trends, counting) Churn prediction (but really just training

on data) Clustering of news stories (ok) Outlier detection for monitoring (ok,

yeah)


Still it's hard! Many Events

100 events / second

360k per hour 8.6M per day 260M per month 3.2B per year

Many Objects

http://www.flickr.com/photos/arenamontanus/269158554/


Time for response Weekly reports: a day Recommendations: a few hours Web Analytics: seconds to a few minutes Ad Optimization: milliseconds

It's only real-time if you can react in real-time.


Approaches to stream mining

streaming feature extraction

learning: update iteratively local ops only nothing beyond O(n^2)

query is often ok (cache)


Isn't Machine Learning easily Online?

Stochastic gradient descent

converges, e.g. if

http://leon.bottou.org/research/stochastic


Time horizons vs. Learning rate

You can't just do online learning on event data!


Batch: Store and Crunch Just like you would

with static data, in a BIG FOR LOOP

data management? huge latency!

Big Data Beers, April 15, 2014, Berlin

2014 by Mikio L. Braun

Map Reduce Process



Map Reduce and ML

Covers Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminative Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines

Neural InformationProcessing SystemsConference, 2006



Apache Hadoop Industry Standard for Map Reduce Originally developed at Yahoo A faithful embodiment of the Java

Enterprise Mindset, bindings & tools for everything

Core components: HDFS + MapReduce



Hadoop Facts You shouldn't be afraid of Java HDFS like a Poor Mans FS (read/write, no seek) to

distribute data Map splits essentially on file level (some file formats

are understood) Job startup takes quite some time (minutes) Map/Reduce jobs can be scripts => interface to

every language in principle Very extensible (in Java): User Defined Functions, etc.


Parallelize: Stream Processing

Split work into small pieces of code handling a single event

Parallelize Good latency No Query Layer!



Actor based concurrency Classical multi-threaded programming

considered harmful (locking, concurrent access)

Interacting actors (each running single-threaded)

Emphasis on functional programming & immutable data structures



Actor based concurrency



Apache Storm Basically scale the actor based model


Parallelize: Micro-Batches Event-oriented

processing can be tedious

Apache Spark: Micro-Batches


Apache Spark Stream processing not just map/reduce but more complete functional collection style API, also for

streams in memory hyped Hadoop competitor developed with support by UC Berkeley


Approximate: Stream Mining

Scaling is nice, but WE ALL KNOW: Data is noisy Not every data point is important Methods are noisy, too Absolute numbers are often not important,

too


Probabilistic Data Structures

Probabilistic Algorithm: With n space, perform task with error e = f(n) with e 0 as n

Mother of all algorithms: Bloomfilter


Heavy Hitters (a.k.a. Top-k) Count activities over large item sets (millions,

even more, e.g. IP addresses, Twitter users) Interested in most active elements only.

Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conference on Database Theory, 2005

132142432553712023

15128532

Fixed tables of counts

Case 1: element already in data base

Case 2: new element

142 142 12 13

713 023 2

713 3


Count-Min Sketches Summarize histograms over large feature sets Like bloom filters, but better

Query: Take minimum over all hash functions

0 0 3 01 1 0 2

0 2 0 00 3 5 2

0 5 3 22 4 5 0

1 3 7 30 2 0 8

m bins

n differenthash functions

Updates for new entryQuery result: 1

G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .


Clustering with count-min Sketches Online clustering

For each data point: Map to closest centroid ( compute distances) Update centroid

count-min sketches to represent sum over all vectors in a class

Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009

0 0 3 01 1 0 2

0 2 0 00 3 5 2

0 5 3 22 4 5 0

1 3 7 30 2 0 8



Wait a minute? Only Counting?

Well, getting the top most active items is already useful. Web analytics, Users, Trending Topics

Counting is statistics!



Counting is Statistics Empirical mean:

Correlations:

Principal Component Analysis:


More: Maximum-Likelihood Estimate probabilistic models

based on

which is slightly biased, but simpler


Outlier detection Once you have a model, you can compute

p-values (based on recent time frames!)


Online TF-IDF


Classification with Nave Bayes

Naive Bayes is also just counting, right?

class priors

Multinomnial nave Bayes

Priors

Total number of words in class

Number of times word appears in classfrequency of word in document


Classification with Naive Bayes

ICML 2003


Classification with Naive Bayes

7 Steps to improve NB: transform TF to log( . + 1) IDF-style normalization square length normalization use complement probability another log normalize those weights again Predict linearly using those weights


So much more to do with trends

Least Recently Used Caches Sparse Vectors Sparse Matrices Conditional Probabilities (Histograms) Accumulators ...


Streamdrill Real-Time Analysis Solutions Core Engine:

Heavy Hitters counting + exponential decay Instant counts & top-k results over time windows In-Memory

Modules Profiling and Trending Recommendations Count Distinct


Architecture Overview


Example: Twitter Stock Analysis

http://play.streamdrill.com/vis/



Trends: symbol:combinations $AAPL:$GOOG symbol:hashtag $AAPL:#trading symbol:keywords $GOOG:disruption symbol:mentions $GOOG:WallStreetCom symbol trend $AAPL symbol:url $FB:http://on.wsj.com/15fHaZW



Twitter

streamdrill

JavaScriptvia REST

Tweet Analyzer

tweets

updates


ML on Streams Constantly deal with new data It's often not that complex, really High data rate & event range Batch: High latency Hadoop Stream: Low latency Storm / Spark Approximate: Good enough Streamdrill

Slide 1Slide 2Slide 4Slide 5Slide 6Slide 8Slide 9Slide 10Slide 11Slide 13Slide 14Slide 15Slide 16Slide 17Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 27Slide 28Slide 29Slide 30Slide 31Slide 34Slide 35Slide 37Slide 38Slide 39Slide 40Slide 41Slide 42Slide 43Slide 44Slide 45Slide 46Slide 47Slide 48Slide 49Slide 50Slide 51

big data beer apr2014

Documents

yeahbig data beers

streamingbig data beers

event data

process big data beers

braunstatic data

ok cachebig data beers

mikio braun

data map splits