big data beer apr2014

42
Big Data Beers, April 15, 2014, Berlin © 2014 by Mikio L. Braun Machine Learning on Streams Dr. Mikio Braun @mikiobraun TU Berlin & streamdrill

Upload: mikio-braun

Post on 25-Nov-2015

3.603 views

Category:

Documents


0 download

DESCRIPTION

Talk held at Big Data Beers at TU Berlin, April 15, 2014.Machine Learning on Streams: examples and challenges.Map Reduce and Apache Hadoop for batch processing.Apache Storm and Spark for stream processing.Approximate instead of scale with stream mining.Streamdrill.

TRANSCRIPT

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Machine Learning on Streams

    Dr. Mikio Braun@mikiobraun

    TU Berlin & streamdrill

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Where does the data come from?

    Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

    FinanceFinance GamingGaming MonitoringMonitoring

    Social MediaSocial MediaSensor NetworksSensor NetworksAdvertismentAdvertisment

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Static Data vs. Streaming

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Mostly it's not even ML Turn event stream into some statistics

    (events/second, trends, counting) Churn prediction (but really just training

    on data) Clustering of news stories (ok) Outlier detection for monitoring (ok,

    yeah)

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Still it's hard! Many Events

    100 events / second

    360k per hour 8.6M per day 260M per month 3.2B per year

    Many Objects

    http://www.flickr.com/photos/arenamontanus/269158554/

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Time for response Weekly reports: a day Recommendations: a few hours Web Analytics: seconds to a few minutes Ad Optimization: milliseconds

    It's only real-time if you can react in real-time.

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Approaches to stream mining

    streaming feature extraction

    learning: update iteratively local ops only nothing beyond O(n^2)

    query is often ok (cache)

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Isn't Machine Learning easily Online?

    Stochastic gradient descent

    converges, e.g. if

    http://leon.bottou.org/research/stochastic

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Time horizons vs. Learning rate

    You can't just do online learning on event data!

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Batch: Store and Crunch Just like you would

    with static data, in a BIG FOR LOOP

    data management? huge latency!

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Map Reduce Process

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Map Reduce and ML

    Covers Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminative Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines

    Neural InformationProcessing SystemsConference, 2006

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Apache Hadoop Industry Standard for Map Reduce Originally developed at Yahoo A faithful embodiment of the Java

    Enterprise Mindset, bindings & tools for everything

    Core components: HDFS + MapReduce

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Hadoop Facts You shouldn't be afraid of Java HDFS like a Poor Mans FS (read/write, no seek) to

    distribute data Map splits essentially on file level (some file formats

    are understood) Job startup takes quite some time (minutes) Map/Reduce jobs can be scripts => interface to

    every language in principle Very extensible (in Java): User Defined Functions, etc.

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Parallelize: Stream Processing

    Split work into small pieces of code handling a single event

    Parallelize Good latency No Query Layer!

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Actor based concurrency Classical multi-threaded programming

    considered harmful (locking, concurrent access)

    Interacting actors (each running single-threaded)

    Emphasis on functional programming & immutable data structures

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Actor based concurrency

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Apache Storm Basically scale the actor based model

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Parallelize: Micro-Batches Event-oriented

    processing can be tedious

    Apache Spark: Micro-Batches

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Apache Spark Stream processing not just map/reduce but more complete functional collection style API, also for

    streams in memory hyped Hadoop competitor developed with support by UC Berkeley

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Approximate: Stream Mining

    Scaling is nice, but WE ALL KNOW: Data is noisy Not every data point is important Methods are noisy, too Absolute numbers are often not important,

    too

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Probabilistic Data Structures

    Probabilistic Algorithm: With n space, perform task with error e = f(n) with e 0 as n

    Mother of all algorithms: Bloomfilter

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Heavy Hitters (a.k.a. Top-k) Count activities over large item sets (millions,

    even more, e.g. IP addresses, Twitter users) Interested in most active elements only.

    Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conference on Database Theory, 2005

    132142432553712023

    15128532

    Fixed tables of counts

    Case 1: element already in data base

    Case 2: new element

    142 142 12 13

    713 023 2

    713 3

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Count-Min Sketches Summarize histograms over large feature sets Like bloom filters, but better

    Query: Take minimum over all hash functions

    0 0 3 01 1 0 2

    0 2 0 00 3 5 2

    0 5 3 22 4 5 0

    1 3 7 30 2 0 8

    m bins

    n differenthash functions

    Updates for new entryQuery result: 1

    G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Clustering with count-min Sketches Online clustering

    For each data point: Map to closest centroid ( compute distances) Update centroid

    count-min sketches to represent sum over all vectors in a class

    Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009

    0 0 3 01 1 0 2

    0 2 0 00 3 5 2

    0 5 3 22 4 5 0

    1 3 7 30 2 0 8

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Wait a minute? Only Counting?

    Well, getting the top most active items is already useful. Web analytics, Users, Trending Topics

    Counting is statistics!

  • Big Data Beers, April 15, 2014, Berlin

    2014 by Mikio L. Braun

    Counting is Statistics Empirical mean:

    Correlations:

    Principal Component Analysis:

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    More: Maximum-Likelihood Estimate probabilistic models

    based on

    which is slightly biased, but simpler

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Outlier detection Once you have a model, you can compute

    p-values (based on recent time frames!)

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Online TF-IDF

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Classification with Nave Bayes

    Naive Bayes is also just counting, right?

    class priors

    Multinomnial nave Bayes

    Priors

    Total number of words in class

    Number of times word appears in classfrequency of word in document

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Classification with Naive Bayes

    ICML 2003

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Classification with Naive Bayes

    7 Steps to improve NB: transform TF to log( . + 1) IDF-style normalization square length normalization use complement probability another log normalize those weights again Predict linearly using those weights

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    So much more to do with trends

    Least Recently Used Caches Sparse Vectors Sparse Matrices Conditional Probabilities (Histograms) Accumulators ...

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Streamdrill Real-Time Analysis Solutions Core Engine:

    Heavy Hitters counting + exponential decay Instant counts & top-k results over time windows In-Memory

    Modules Profiling and Trending Recommendations Count Distinct

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Architecture Overview

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Example: Twitter Stock Analysis

    http://play.streamdrill.com/vis/

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Example: Twitter Stock Analysis

    Trends: symbol:combinations $AAPL:$GOOG symbol:hashtag $AAPL:#trading symbol:keywords $GOOG:disruption symbol:mentions $GOOG:WallStreetCom symbol trend $AAPL symbol:url $FB:http://on.wsj.com/15fHaZW

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Example: Twitter Stock Analysis

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Example: Twitter Stock Analysis

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    Example: Twitter Stock Analysis

    Twitter

    streamdrill

    JavaScriptvia REST

    Tweet Analyzer

    tweets

    updates

  • Big Data Beers, April 15, 2014, Berlin 2014 by Mikio L. Braun

    ML on Streams Constantly deal with new data It's often not that complex, really High data rate & event range Batch: High latency Hadoop Stream: Low latency Storm / Spark Approximate: Good enough Streamdrill

    Slide 1Slide 2Slide 4Slide 5Slide 6Slide 8Slide 9Slide 10Slide 11Slide 13Slide 14Slide 15Slide 16Slide 17Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 27Slide 28Slide 29Slide 30Slide 31Slide 34Slide 35Slide 37Slide 38Slide 39Slide 40Slide 41Slide 42Slide 43Slide 44Slide 45Slide 46Slide 47Slide 48Slide 49Slide 50Slide 51