Mining Big Data in Real Time, Albert Bifet, Turing/SLAIS 2012 Conference

Upload: albert-bifet

Post on 08-May-2015

DESCRIPTION

Real-time analysis of streaming data is becoming the fastest and most efficient way to obtain useful knowledge about what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years: every two days we now create as much data as we created from the dawn of time up until 2003. Methods for evolving data streams are becoming a low-cost, green methodology for real-time online prediction and analysis. We discuss the current and future trends in mining evolving data streams, and the challenges that the field will have to overcome in the coming years.

TRANSCRIPT

Page 1: Mining Big Data in Real Time

Mining Big Data in Real Time

Albert Bifet

Turing/SLAIS 2012 Conference

Page 2: Mining Big Data in Real Time

BIG Data

BIG DATA

Measure and React

Page 3: Mining Big Data in Real Time

Motivation

Source: IDC’s Digital Universe Study (EMC), June 2011

Data is growing

Page 4: Mining Big Data in Real Time

Motivation

Memory unit        Size     Binary size
kilobyte (kB/KB)   10^3     2^10
megabyte (MB)      10^6     2^20
gigabyte (GB)      10^9     2^30
terabyte (TB)      10^12    2^40
petabyte (PB)      10^15    2^50
exabyte (EB)       10^18    2^60
zettabyte (ZB)     10^21    2^70
yottabyte (YB)     10^24    2^80
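The two size columns can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the table; the function name `unit_sizes` and the "gap" column are assumptions added here, not part of the slides.

```python
# Decimal (power of 10) versus binary (power of 2) sizes for the units in the table.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def unit_sizes(rank):
    """Return (decimal_bytes, binary_bytes) for the rank-th unit, counting from 1."""
    return 10 ** (3 * rank), 2 ** (10 * rank)


for rank, name in enumerate(UNITS, start=1):
    decimal, binary = unit_sizes(rank)
    # The relative gap between the two conventions widens with every step.
    gap_percent = 100 * (binary / decimal - 1)
    print(f"{name:10s} 10^{3 * rank:<2d} 2^{10 * rank:<2d} gap {gap_percent:.1f}%")
```

The gap starts at 2.4% for the kilobyte and grows with each unit, which is why the two conventions are usually listed side by side.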

Data is growing

Page 5: Mining Big Data in Real Time

Motivation

Source: IDC’s Digital Universe Study (EMC), June 2011

Data is growing

Page 8: Mining Big Data in Real Time

Streaming Data

Big Data & Real Time

Page 9: Mining Big Data in Real Time

Big Data

McKinsey Global Institute (MGI) Report on Big Data, 2011.

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

Page 11: Mining Big Data in Real Time

BIG Data

- Volume
- Variety
- Velocity

3 Vs

Page 12: Mining Big Data in Real Time

Methodology

Sampling and distributed systems
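As an illustration of sampling on a stream, here is a minimal sketch of reservoir sampling, one common technique for keeping a fixed-size uniform sample when the stream length is unknown. The slide does not name a specific sampling method, so the choice of technique and the name `reservoir_sample` are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
print(sample)
```

Memory use stays at k items however long the stream runs, which fits the constraint that time and memory are the scarce resources.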

Page 13: Mining Big Data in Real Time

Methodology

Paolo Boldi

Facebook Four degrees of separation

Big Data does not need big machines, it needs big intelligence

Page 14: Mining Big Data in Real Time

Real time analytics

We want to analyze what is happening now.

Page 16: Mining Big Data in Real Time

Time and Memory

Number 8 Wire Mentality

Time and memory are the resource dimensions of the process.

Page 17: Mining Big Data in Real Time

Time and Memory

Time and memory are the resource dimensions of the process.

Page 18: Mining Big Data in Real Time

Algorithms

Classification, Regression, Clustering, Frequent Pattern Mining.

Page 19: Mining Big Data in Real Time

Applications

- sensor data: industry, cities
- telecom data
- social networks: Twitter, Facebook, Yahoo
- marketing: sales business

Data may come from: humans, sensors, or machines.

Page 20: Mining Big Data in Real Time

New applications: social networks

Twitter: A Massive Data Stream

- Micro-blogging service
- Built to discover what is happening at any moment in time, anywhere in the world.
- 3 billion requests a day via its API.

MOA-TweetReader: a real-time system to

- read tweets in real time
- detect changes
- find the terms whose frequency changed
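The last step, finding terms whose frequency changed, can be sketched by comparing word counts in an older and a newer window of messages. This is a simplified illustration and not the MOA-TweetReader algorithm: the function `changed_terms`, its thresholds, and the sample messages are all hypothetical, and it only flags terms whose frequency rises.

```python
from collections import Counter


def changed_terms(old_window, new_window, min_ratio=2.0, min_count=3):
    """Return terms whose relative frequency grew by at least min_ratio."""
    old_counts = Counter(w for text in old_window for w in text.lower().split())
    new_counts = Counter(w for text in new_window for w in text.lower().split())
    old_total = sum(old_counts.values())
    new_total = max(sum(new_counts.values()), 1)
    rising = {}
    for term, count in new_counts.items():
        if count < min_count:
            continue  # ignore terms that are too rare to judge
        # Add-one smoothing so unseen terms do not cause a division by zero.
        old_freq = (old_counts[term] + 1) / (old_total + 1)
        new_freq = count / new_total
        ratio = new_freq / old_freq
        if ratio >= min_ratio:
            rising[term] = ratio
    return rising


# Hypothetical messages standing in for two consecutive windows of tweets.
old = ["nice weather today", "coffee time", "weather is nice"]
new = ["earthquake now", "big earthquake", "earthquake felt here", "nice coffee"]
rising = changed_terms(old, new)
print(sorted(rising))
```

A streaming version would keep approximate counts in bounded memory rather than two full windows; the two-window form is used here only to keep the idea visible.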

Page 21: Mining Big Data in Real Time

Sentiment Analysis on Twitter

Sentiment analysis: classifying messages into two categories depending on whether they convey positive or negative feelings.

Emoticons are visual cues associated with emotional states, which can be used to define class labels for sentiment classification.

Positive Emoticons   Negative Emoticons
:)                   :(
:-)                  :-(
: )                  : (
:D
=)

Table: List of positive and negative emoticons.
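The emoticon table can be turned into a labeling rule directly. The sketch below assumes a simple policy that the slide does not spell out: a message gets a label only when it contains emoticons from exactly one of the two columns. The function name `emoticon_label` is hypothetical.

```python
# Emoticon lists taken from the table of positive and negative emoticons.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")


def emoticon_label(text):
    """Return 'positive', 'negative', or None when the message is unlabeled."""
    has_positive = any(e in text for e in POSITIVE)
    has_negative = any(e in text for e in NEGATIVE)
    if has_positive == has_negative:
        # Either no emoticon at all or a mix of both: leave unlabeled.
        return None
    return "positive" if has_positive else "negative"


for message in ["great talk :)", "missed my train :(", "hello", "good :) bad :("]:
    print(message, "->", emoticon_label(message))
```

Messages labeled this way can serve as training data for a classifier without manual annotation.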

Page 22: Mining Big Data in Real Time

New problem: structured classification

New methods for structured classification

[Figure: tree-structured example with nodes labeled A, B, C and D]

- sequences, trees, graphs
- frequent pattern mining techniques
- multi-label data mining
  - Example: Lord of the Rings → Action, Adventure, Fantasy

Page 24: Mining Big Data in Real Time

New problem: structured classification

New methods for structured classification

[Figure: trees with nodes labeled A, B, C and D; a pair of inputs maps (→) to a pair of outputs]

a, b → class1, class2

- sequences, trees, graphs
- frequent pattern mining techniques
- multi-label data mining
  - Example: Lord of the Rings → Action, Adventure, Fantasy
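In multi-label data, each example carries a set of labels instead of a single class. One common way to represent this, shown as a sketch below, is a 0/1 indicator vector with one position per label. The helper `to_indicator` and the extra label "Comedy" are assumptions added for illustration.

```python
def to_indicator(label_sets, all_labels):
    """Encode each set of labels as a 0/1 vector, one position per known label."""
    return [[1 if label in labels else 0 for label in all_labels]
            for labels in label_sets]


# "Comedy" is a hypothetical extra label so that one position stays at 0.
LABELS = ["Action", "Adventure", "Comedy", "Fantasy"]
movies = {"Lord of the Rings": {"Action", "Adventure", "Fantasy"}}

vectors = to_indicator(movies.values(), LABELS)
print(vectors)  # [[1, 1, 0, 1]]
```

With this encoding, a multi-label problem can be approached as one binary prediction per label position.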

Page 25: Mining Big Data in Real Time

New Techniques: Distributed Systems

Hadoop, S4 and Storm

Page 26: Mining Big Data in Real Time

Hadoop

Hadoop

Page 27: Mining Big Data in Real Time

Hadoop

Hadoop architecture

Page 28: Mining Big Data in Real Time

Apache Mahout

Mahout: open source framework

Page 29: Mining Big Data in Real Time

Pig

Pig: Similar to SQL

Page 30: Mining Big Data in Real Time

Pig

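-- Load three integer fields per row from 'data'.
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group rows by f1; each group holds its key and a bag named A.
B = GROUP A BY f1;
-- Count the rows in each group's bag (the bag A, not the key in $0).
C = FOREACH B GENERATE COUNT(A);
DUMP C;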

Pig: Similar to SQL
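For comparison, the same group-and-count logic can be sketched in plain Python on a single machine. The rows below are hypothetical stand-ins for the contents of 'data'; Pig would run the equivalent steps as distributed jobs on Hadoop.

```python
from collections import Counter

# Hypothetical rows standing in for the contents of 'data': (f1, f2, f3).
rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# GROUP BY f1, then count the rows that fall in each group.
counts = Counter(f1 for f1, _f2, _f3 in rows)

# Print one line per group, similar in spirit to DUMP.
for f1, n in sorted(counts.items()):
    print(f1, n)
```

In SQL terms this is a COUNT(*) with GROUP BY f1, which is the sense in which Pig is similar to SQL.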

Page 31: Mining Big Data in Real Time

Apache S4

Apache S4

Page 32: Mining Big Data in Real Time

Apache S4

Page 33: Mining Big Data in Real Time

Storm

Storm from Twitter

Page 34: Mining Big Data in Real Time

Storm

Stream, Spout, Bolt, Topology
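The four concepts can be illustrated with a toy, single-process sketch built on Python generators. This does not use the Storm API, and every function name below is hypothetical. A real Storm topology is a graph of spouts and bolts running in parallel across a cluster, while this sketch only chains them in sequence.

```python
def number_spout(n):
    """Spout: a source that emits tuples into a stream."""
    for i in range(n):
        yield (i,)


def square_bolt(stream):
    """Bolt: consumes tuples from a stream and emits transformed tuples."""
    for (x,) in stream:
        yield (x * x,)


def even_filter_bolt(stream):
    """Bolt: passes through only the tuples whose value is even."""
    for (x,) in stream:
        if x % 2 == 0:
            yield (x,)


def run_topology(spout, bolts):
    """Topology: here just a linear chain, spout -> bolt -> bolt."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


result = run_topology(number_spout(6), [square_bolt, even_filter_bolt])
print(result)  # [(0,), (4,), (16,)]
```

Because generators are lazy, each tuple flows through the whole chain before the next one is read, which loosely mirrors tuple-at-a-time stream processing.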

Page 35: Mining Big Data in Real Time

Storm

Tools

[Figure: “Lambda Architecture” diagram. Labels: All data, Precomputed batch view, Precomputed realtime view, New data stream, Query. Tools shown: Hadoop, Storm, ElephantDB, Voldemort, Cassandra, Riak, HBase, Kafka.]

Source: Runaway complexity in Big Data, Nathan Marz, 2012
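The idea behind the Lambda Architecture labels (a precomputed batch view, a precomputed realtime view, and a query that combines them) can be sketched in a few lines. This is a toy in-memory illustration under stated assumptions: the names `batch_view`, `RealtimeView`, and `query` are hypothetical, and the event data is invented. In practice, tools such as those named on the slide would take the place of these in-memory pieces.

```python
from collections import Counter


def batch_view(all_data):
    """Batch layer: recompute counts from the full historical data set."""
    return Counter(all_data)


class RealtimeView:
    """Speed layer: update counts incrementally as new events arrive."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        self.counts[event] += 1


def query(key, batch, realtime):
    """Answer a query by merging the batch view with the realtime view."""
    return batch[key] + realtime.counts[key]


# Hypothetical page-view events: history is covered by the batch view,
# newer events have only reached the realtime view so far.
history = ["page_a", "page_b", "page_a"]
batch = batch_view(history)

realtime = RealtimeView()
for event in ["page_a", "page_c"]:
    realtime.update(event)

print(query("page_a", batch, realtime))  # 3
```

The batch view is simple but always somewhat stale; the realtime view covers only the events that arrived since the last batch run, so queries merge the two.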

Page 36: Mining Big Data in Real Time

Data Streams

Big Data & Real Time

Page 37: Mining Big Data in Real Time

Data Streams

Thanks!