if everything seems under control, you're not going fast enough

Post on 15-Jan-2015

Category:

Technology

DESCRIPTION

This presentation is a quick story about how we used beancounter.io to perform a real-time analysis of the #debate hashtag during the 3rd Obama-Romney debate.

TRANSCRIPT

“If everything seems under control, you're not going fast enough.”

realtime analysis of #debate hashtag

Davide Palmisano @dpalmisano

when size matters: the 4Vs of big data

Volume, Velocity, Variety, and Veracity

let’s focus on Velocity

on the night of the first #debate, 2615 tweets per second were recorded

http://www.nbcnews.com/technology/technolog/presidential-debate-sets-twitter-record-6281796

What were the most influential URLs?

What were the implicit concepts underlying the conversation?

How did these concepts evolve during the discussion?

every single tweet potentially contains some hidden information

extracting such information, making it explicit, analysing it

and doing it at a rate of ~2000 tweets/sec?

real-time analytics

Storm, a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

batch analyses

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

+ hdfs, a distributed FS

crunching the Social Web, in real-time.

Beancounter, formerly known as

data gathering from the Social Web

beancounter.io is a SaaS platform to profile your users from their activities on the Social Web

now powering part of the Italian public broadcaster's #socialtv environment

http://www.amazon.com/Strategic-Thinking-New-Science-Complexity/dp/0684842688

(a quick parenthesis)

or ...

“how a butterfly flapping its wings in Asia might cause a hurricane in the Atlantic”

beancounter.io uses Twitter OAuth authorisation to perform check-ins to social TV events

at 13:32 UTC-8, Twitter had an outage

while beancounter.io was handling more than 100 check-ins per minute

https://status.io.watchmouse.com/7617/125017//statuses/home_timeline-(OAuth-1.0a)

[Chart: Facebook and Twitter check-ins rate. Y axis: 0 to 200; X axis: 2012-11-06T20:45 to 2012-11-06T23:30. Annotation: Nov 6, 2012 13:32 UTC-8 Twitter service disruption]

[Chart: Facebook and Twitter overall comments. Series: Facebook, Twitter. Y axis: 0 to 1500; X axis: 2012-11-06T20:45 to 2012-11-06T23:00. Annotation: Nov 6, 2012 13:32 UTC-8 Twitter service disruption]

lesson learnt: the real-time Web is a hyper-connected graph of a myriad of different live systems

always mind the butterflies, even if you can’t see them

back to #debate

<timestamp, <c0...cn>>

concepts are extracted using NLP technologies for each tweet
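
A snapshot of the form <timestamp, <c0...cn>> could be modelled roughly as below. This is only a sketch, not the beancounter.io implementation: the `Snapshot` type, the `bucket_counts` helper and the sample data are hypothetical, and the NLP extraction step itself is left out. Grouping snapshots into one-minute buckets is one simple way to follow how concepts evolve during the discussion.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """One analysed tweet: <timestamp, <c0...cn>> (hypothetical type)."""
    timestamp: datetime
    concepts: tuple  # concept labels c0...cn extracted from the tweet


def bucket_counts(snapshots, bucket_seconds=60):
    """Count how many snapshots mention each concept, per time bucket."""
    buckets = defaultdict(Counter)
    for snap in snapshots:
        epoch = int(snap.timestamp.timestamp())
        start = epoch - epoch % bucket_seconds  # bucket start, in epoch seconds
        buckets[start].update(set(snap.concepts))
    return dict(buckets)


# Invented sample data, not taken from the real stream.
utc = timezone.utc
snapshots = [
    Snapshot(datetime(2012, 10, 22, 21, 0, 5, tzinfo=utc), ("Iran", "Israel")),
    Snapshot(datetime(2012, 10, 22, 21, 0, 40, tzinfo=utc), ("Iran",)),
    Snapshot(datetime(2012, 10, 22, 21, 1, 10, tzinfo=utc), ("Russia",)),
]
counts = bucket_counts(snapshots)
```

Each key of `counts` is the start of a bucket and each value maps a concept to the number of snapshots that mention it in that minute.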

we’ve tied together beancounter.io, Storm and Hadoop

batch analytics

real-time analytics

hdfs, distributed FS

please note, this was only 10% of the firehose

Storm

more than 500k tweets processed in 2h, for an average rate of ~70 tweets/sec

each tweet produced a snapshot (~10 KB each) for an overall size of 4.6 GB of data

highest peak: 253 tweets/sec
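
The figures above can be cross-checked with some quick arithmetic. The sketch below assumes exactly 500,000 tweets, a window of exactly 2 hours, decimal units (1 GB = 10^9 bytes) and a uniform 10% sample of the firehose; the slides only give rounded values, so the results are estimates.

```python
tweets = 500_000            # "more than 500k", taken as exact for the estimate
window_s = 2 * 60 * 60      # 2 h expressed in seconds

avg_rate = tweets / window_s            # about 69.4 tweets/sec, i.e. the "~70" above
peak_to_avg = 253 / avg_rate            # how far the peak sits above the average

total_bytes = 4.6e9                     # 4.6 GB, decimal units assumed
avg_snapshot_bytes = total_bytes / tweets   # about 9.2 KB, consistent with "~10 KB each"

# Assumption: if the sample is a uniform 10% of the firehose,
# the full stream would average roughly ten times as much.
implied_full_rate = avg_rate / 0.10

print(f"average rate: {avg_rate:.1f} tweets/sec")
print(f"peak / average: {peak_to_avg:.2f}")
print(f"average snapshot: {avg_snapshot_bytes / 1000:.1f} KB")
print(f"implied full firehose rate: {implied_full_rate:.0f} tweets/sec")
```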

5 Amazon EC2 x-large instances + 2 mid-sized instances for HDFS

more than 18k different URLs shared
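
Counting distinct URLs, and ranking them by how many tweets share them, could look like the sketch below. It is an illustration only: the regular expression, the `url_counts` helper and the sample tweets are assumptions, and treating the share count as "influence" is a simplification that the slides do not confirm.

```python
import re
from collections import Counter

# Simplistic URL pattern, good enough for a sketch.
URL_RE = re.compile(r"https?://\S+")


def url_counts(texts):
    """Count, for each URL, the number of tweets that contain it."""
    counts = Counter()
    for text in texts:
        counts.update(set(URL_RE.findall(text)))
    return counts


# Invented sample tweets.
sample = [
    "watching #debate http://example.com/a",
    "RT this http://example.com/a and http://example.com/b",
    "no links here #debate",
]
counts = url_counts(sample)
distinct = len(counts)          # number of different URLs shared
top = counts.most_common(1)     # most shared URL with its count
```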

recurring concepts

[Chart: recurring concepts. Axis: 0 to 70000. Concepts: Osama Bin Laden, Iran, Israel, Middle East, Pakistan, Iraq, Afghanistan, Russia]

most co-occurrent concepts

Iran - Israel 35.356%

Russia - Middle East 24.7%

...

Wikileaks - Richard Nixon 93.5%

[Figure values: 17284, 5321, 6960]
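
The slides do not say how the co-occurrence percentages were computed. One common choice is the Jaccard index over the sets of tweets that mention each concept; the sketch below uses it purely as an assumption, with invented sample data and a hypothetical `cooccurrence` helper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(concept_sets):
    """Return {(a, b): jaccard} for every concept pair seen in the same tweet."""
    single = Counter()  # tweets mentioning each concept
    pair = Counter()    # tweets mentioning both concepts of a pair
    for concepts in concept_sets:
        unique = sorted(set(concepts))  # sorted so each pair has one canonical order
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): both / (single[a] + single[b] - both)
        for (a, b), both in pair.items()
    }


# Invented sample data: the concepts extracted from four tweets.
tweets = [
    {"Iran", "Israel"},
    {"Iran", "Israel", "Russia"},
    {"Iran"},
    {"Russia", "Middle East"},
]
scores = cooccurrence(tweets)
```

With this definition, a score of 1.0 means two concepts always appear together and a score near 0 means they rarely do. Pairs that never share a tweet are simply absent from the result.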

data viz is another job entirely

facts

mining data requires science skills; it’s not just about technology: it’s about math

forget about controlling everything when data flows at that speed: make reasoned approximations

?
