mining big data in real time
DESCRIPTION
Real-time analysis of streaming data is becoming the fastest and most efficient way to obtain useful knowledge about what is happening now. It lets organizations react quickly when problems appear and detect new trends that can help improve their performance. Evolving data streams are contributing to the growth in data created over the last few years: we now create as much data every two days as we created from the dawn of time up until 2003. Methods for evolving data streams are becoming a low-cost, green methodology for real-time online prediction and analysis. We discuss current and future trends in mining evolving data streams, and the challenges the field will have to overcome in the coming years.
TRANSCRIPT
Mining Big Data in Real Time
Albert Bifet
Turing/SLAIS 2012 Conference
BIG Data
BIG DATA
Measure and React
Motivation
Source: IDC’s Digital Universe Study (EMC), June 2011
Data is growing
Motivation
Memory unit        Size    Binary size
kilobyte (kB/KB)   10^3    2^10
megabyte (MB)      10^6    2^20
gigabyte (GB)      10^9    2^30
terabyte (TB)      10^12   2^40
petabyte (PB)      10^15   2^50
exabyte (EB)       10^18   2^60
zettabyte (ZB)     10^21   2^70
yottabyte (YB)     10^24   2^80
Data is growing
Streaming Data
Big Data & Real Time
Big Data
McKinsey Global Institute (MGI) Report on Big Data, 2011.
Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
BIG Data
- Volume
- Variety
- Velocity
3 Vs
Methodology
Sampling and distributed systems
Methodology
Paolo Boldi
Facebook: Four degrees of separation
Big Data does not need big machines, it needs big intelligence
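The sampling idea in the methodology above can be illustrated with reservoir sampling, which keeps a fixed-size uniform sample of a stream whose length is not known in advance. The slides do not say which sampling method is meant, so this is only one plausible instance; the function name and parameters are illustrative.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to k items using O(k) memory."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 5 values from a stream of 100,000 without storing the stream.
sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

Memory stays constant at k items however long the stream runs, which is the property that matters when time and memory are the limiting resources.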
Real time analytics
We want to analyze what is happening now.
Time and Memory
Number 8 Wire Mentality
Time and memory are the resource dimensions of the process.
Algorithms
Classification, Regression, Clustering, Frequent Pattern Mining.
Applications
- sensor data: industry, cities
- telecom data
- social networks: Twitter, Facebook, Yahoo
- marketing: sales business
Data may come from: humans, sensors, or machines.
New applications: social networks
Twitter: A Massive Data Stream
- Micro-blogging service
- Built to discover what is happening at any moment in time, anywhere in the world.
- 3 billion requests a day via its API.
MOA-TweetReader: a real-time system to
- read tweets in real time
- detect changes
- find the terms whose frequency changed
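As a rough illustration of "find the terms whose frequency changed", the sketch below compares relative term frequencies between an older and a newer window of tweets. It is a deliberately simple two-window comparison and is not the change detector used by MOA-TweetReader; the function names and the sample tweets are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def term_counts(tweets: Iterable[str]) -> Counter:
    """Count lower-cased, whitespace-separated terms across tweets."""
    counts: Counter = Counter()
    for text in tweets:
        counts.update(text.lower().split())
    return counts


def changed_terms(old: Iterable[str], new: Iterable[str],
                  top: int = 3) -> List[Tuple[str, float]]:
    """Rank terms by the change in relative frequency between two windows."""
    old_c, new_c = term_counts(old), term_counts(new)
    old_total = sum(old_c.values()) or 1  # avoid division by zero
    new_total = sum(new_c.values()) or 1
    deltas = {
        term: new_c[term] / new_total - old_c[term] / old_total
        for term in set(old_c) | set(new_c)
    }
    # Largest absolute change first; positive means the term became more frequent.
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]


old_window = ["rain in london", "london traffic again"]
new_window = ["earthquake in chile", "earthquake felt in santiago",
              "chile earthquake news"]
for term, delta in changed_terms(old_window, new_window):
    print(f"{term}: {delta:+.3f}")
```

A real stream would need bounded-memory counters and an adaptive window rather than two fixed lists.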
Sentiment Analysis on Twitter
Sentiment analysis
Classifying messages into two categories depending on whether they convey positive or negative feelings.
Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification.
Positive emoticons   Negative emoticons
:)                   :(
:-)                  :-(
: )                  : (
:D
=)
Table: List of positive and negative emoticons.
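The emoticon table can be turned into a labelling rule along the following lines. This is a minimal sketch: the emoticon lists come from the table, while the function name and the choice to leave a message unlabelled when it has no emoticon or contains both kinds are assumptions made for the example.

```python
from typing import Optional

# Emoticons from the table above.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")


def emoticon_label(text: str) -> Optional[str]:
    """Derive a sentiment class label from emoticons, or None if unclear."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        # Assumption: skip messages with no emoticon or with mixed emoticons.
        return None
    return "positive" if has_pos else "negative"


for tweet in ["great talk :)", "missed my train :-(", "no emoticon here"]:
    print(repr(tweet), "->", emoticon_label(tweet))
```

Labels obtained this way can feed a stream classifier without manual annotation, at the cost of some label noise.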
New problem: structured classification
New methods for structured classification
[Figure: labelled trees with nodes A, B, C, D, shown as structured inputs and outputs]
a, b → class1, class2
- sequences, trees, graphs
- frequent pattern mining techniques
- multi-label data mining
- Example: Lord of the Rings → Action, Adventure, Fantasy
New Techniques: Distributed Systems
Hadoop, S4 and Storm
Hadoop
Hadoop architecture
Apache Mahout
Mahout: open source framework
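Pig
Pig: Similar to SQL
-- Load tab-delimited rows from 'data' with a typed schema.
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group the rows by f1: each output tuple is (group, bag of A rows).
B = GROUP A BY f1;
-- Count the rows in each group; COUNT takes the bag, which is named A.
C = FOREACH B GENERATE COUNT(A);
DUMP C;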
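For readers who do not know Pig, the script on this slide groups rows by their first field and counts each group. The same computation on a small in-memory dataset might look like this in Python; the sample rows are invented for illustration, and unlike the Pig script the sketch also prints the group key next to each count.

```python
from collections import Counter

# Hypothetical rows with the schema (f1, f2, f3) used in the Pig script.
rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# GROUP A BY f1 followed by COUNT: the number of rows per distinct f1 value.
counts = Counter(f1 for f1, _f2, _f3 in rows)

for f1, n in sorted(counts.items()):
    print(f1, n)
```

Pig compiles the same logic into Hadoop jobs, so it scales past a single machine's memory, which this in-memory version does not.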
Apache S4
Storm
Storm from Twitter
Storm
Stream, Spout, Bolt, Topology
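The four terms can be made concrete with a toy word count built from plain Python generators. This is an analogy only and does not use Storm's API: a spout is a source of tuples, a bolt consumes one stream and emits another, and the topology is the wiring between them. All names below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: emits tuples into the stream (finite here for the demo)."""
    yield from ["big data in real time", "real time analytics",
                "big intelligence"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: turns each sentence tuple into one tuple per word."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Bolt: keeps running counts and emits a snapshot after each word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield dict(counts)


def run_topology() -> Dict[str, int]:
    """Topology: spout -> split bolt -> count bolt; returns the last snapshot."""
    final: Dict[str, int] = {}
    for snapshot in count_bolt(split_bolt(sentence_spout())):
        final = snapshot
    return final


print(run_topology())
```

In Storm itself the spout never finishes and each bolt runs as parallel tasks across a cluster; the generator chain only mirrors the data flow.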
Storm
“Lambda Architecture”
[Figure: all data → precomputed batch view (Hadoop) → query; new data stream → precomputed realtime view (Storm) → query]
Tools
- ElephantDB, Voldemort
- Cassandra, Riak, HBase
- Kafka
Source: Runaway complexity in Big Data, Nathan Marz, 2012
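The query path of the Lambda Architecture slide can be sketched as follows: a query combines a batch view, precomputed from all data, with a realtime view that covers only the data that arrived since the last batch run. The counting workload, class and function names are assumptions made for the example and do not correspond to Hadoop or Storm APIs.

```python
from collections import Counter
from typing import Iterable, Mapping


def batch_view(all_data: Iterable[str]) -> Counter:
    """Precomputed batch view: recomputed from the complete dataset."""
    return Counter(all_data)


class RealtimeView:
    """Precomputed realtime view: updated incrementally per new event."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, event: str) -> None:
        self.counts[event] += 1


def query(key: str, batch: Mapping[str, int], realtime: RealtimeView) -> int:
    """Answer a query by merging the batch and realtime views."""
    return batch.get(key, 0) + realtime.counts.get(key, 0)


batch = batch_view(["page_a", "page_b", "page_a"])  # historical events
realtime = RealtimeView()
realtime.update("page_a")                            # event after the batch run
print(query("page_a", batch, realtime))
```

Once the next batch run has absorbed the recent events, the realtime view can be reset, so errors in the incremental path are eventually overwritten by the batch result.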
Data Streams
Big Data & Real Time
Data Streams
Thanks!