“If everything seems under control, you're
not going fast enough.”
realtime analysis of #debate hashtag
Davide Palmisano @dpalmisano
when size matters: the
4Vs of big data
Volume, Velocity, Variety, and Veracity
let’s focus on Velocity
during peak time ~35 people per second top up their Oyster cards
http://www.tfl.gov.uk/corporate/modesoftransport/londonunderground/1608.aspx
every second ~58 new pictures are uploaded to Instagram
http://www.digitalbuzzblog.com/infographic-instagram-stats/
on the night of the first
#debate, 2615 tweets per second
were recorded
http://www.nbcnews.com/technology/technolog/presidential-debate-sets-twitter-record-6281796
What were the most
influential URLs?
What were the implicit concepts underlying the
conversation?
How did these concepts
evolve during the discussion?
every single tweet potentially contains some
hidden information
extract such information,
making it explicit, analysing it
and doing it at a rate of ~2000 tweets/sec?
real-time analytics
Storm, a free and open source distributed realtime computation system. Storm makes it easy to
reliably process unbounded streams of data, doing for realtime
processing what Hadoop did for batch processing.
batch analyses
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.
+ hdfs, a distributed FS
crunching the Social Web, in real-time.
Beancounter, formerly known as
data gathering from the Social Web
beancounter.io is a SaaS platform to profile your
users from their activities on the Social Web
now powering part of the Italian public
broadcaster's #socialtv
environment
http://www.amazon.com/Strategic-Thinking-New-Science-Complexity/dp/0684842688
(a quick parenthesis)
or ...
“how a butterfly flapping its wings in Asia might
cause a hurricane in the Atlantic”
beancounter.io uses Twitter OAuth authorisation to
perform check-ins to social TV events
at 13:32 UTC-8 Twitter had
an outage
while beancounter.io was handling more than 100
check-ins per minute
https://status.io.watchmouse.com/7617/125017//statuses/home_timeline-(OAuth-1.0a)
[chart: Facebook and Twitter check-ins rate; vertical scale 0 to 200; time ticks from 2012-11-06T20:45:01 to 2012-11-06T23:30:05; marker: Nov 6, 2012 13:32 UTC-8 Twitter service disruption]
[chart: Facebook and Twitter overall comments; series: Facebook, Twitter; vertical scale 0 to 1500; time ticks from 2012-11-06T20:45:01 to 2012-11-06T23:00:05; marker: Nov 6, 2012 13:32 UTC-8 Twitter service disruption]
lesson learnt: the real-time Web is a hyper-connected graph of a myriad of different
live systems
always mind the butterflies, even if you can’t see them
back to #debate
<timestamp, <c0...cn>>
concepts are extracted using NLP technologies for each tweet
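A minimal sketch of how one <timestamp, <c0...cn>> tuple could be represented. The keyword table below is a hypothetical stand-in for the NLP concept extraction, and the names `ConceptRecord`, `extract_concepts` and `to_record`, as well as the sample tweet and timestamp, are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical keyword table: a stand-in for a real NLP concept extractor.
KNOWN_CONCEPTS = {
    "iran": "Iran",
    "israel": "Israel",
    "russia": "Russia",
    "middle east": "Middle East",
}


@dataclass(frozen=True)
class ConceptRecord:
    """One <timestamp, <c0...cn>> tuple."""
    timestamp: datetime
    concepts: Tuple[str, ...]


def extract_concepts(text: str) -> Tuple[str, ...]:
    """Return the known concepts mentioned in the text, in table order."""
    lowered = text.lower()
    return tuple(name for key, name in KNOWN_CONCEPTS.items() if key in lowered)


def to_record(timestamp: datetime, text: str) -> ConceptRecord:
    """Turn one tweet into a timestamped tuple of concepts."""
    return ConceptRecord(timestamp, extract_concepts(text))


# Made-up sample tweet and timestamp.
record = to_record(
    datetime(2012, 1, 1, 0, 0, tzinfo=timezone.utc),
    "Tough talk on Iran and Israel tonight #debate",
)
print(record.concepts)
```

Keeping the record immutable makes it safe to pass between the real-time and the batch side without copying.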
we’ve tied together beancounter.io, Storm and Hadoop
batch analytics
real-time analytics
hdfs, distributed FS
please note, this was only 10% of the firehose
Storm
more than 500k tweets processed in 2h, for an average
rate of ~70 tweets/sec
each tweet produced a snapshot (~10 KB each), for an overall size of 4.6 GB of data
highest peak: 253 tweets/sec
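The average and peak figures above can be derived from tweet timestamps alone. The sketch below assumes the timestamps are available as a list of `datetime` objects; `rate_stats` and the sample data are hypothetical. The last line repeats the back-of-the-envelope average for 500k tweets in 2 hours.

```python
from collections import Counter
from datetime import datetime, timedelta


def rate_stats(timestamps):
    """Return (average, peak) tweets per second over the observed span."""
    # Bucket tweets by whole second.
    per_second = Counter(ts.replace(microsecond=0) for ts in timestamps)
    first, last = min(per_second), max(per_second)
    # Inclusive span, so a single busy second counts as 1 second.
    span_seconds = int((last - first).total_seconds()) + 1
    average = len(timestamps) / span_seconds
    peak = max(per_second.values())
    return average, peak


# Made-up sample: 5 tweets spread over 4 seconds, 3 of them in the first second.
start = datetime(2012, 1, 1, 0, 0, 0)
sample = [
    start,
    start,
    start,
    start + timedelta(seconds=1),
    start + timedelta(seconds=3),
]
average, peak = rate_stats(sample)

# Back-of-the-envelope check of the average rate: 500k tweets in 2 hours.
slide_average = 500_000 / (2 * 60 * 60)
print(average, peak, round(slide_average, 1))
```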
5 Amazon EC2 extra-large instances + 2 mid-sized instances for HDFS
more than 18k different URLs shared
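One simple way to rank shared URLs is to count how many tweets carry each one. This is a sketch only: treating share counts as a proxy for influence is an assumption, the regular expression is deliberately naive, shortened links are not resolved, and `count_urls` and the example.com addresses are hypothetical.

```python
import re
from collections import Counter

# Naive pattern: anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def count_urls(tweets):
    """Count, for each URL, the number of tweets that contain it."""
    counts = Counter()
    for text in tweets:
        # A set, so a URL repeated inside one tweet counts once.
        counts.update(set(URL_PATTERN.findall(text)))
    return counts


# Made-up sample tweets.
tweets = [
    "read this http://example.com/a #debate",
    "http://example.com/a and http://example.com/b",
    "no link here",
]
url_counts = count_urls(tweets)
print(len(url_counts), url_counts.most_common(1))
```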
recurring concepts
[bar chart, scale 0 to 70000]
Osama Bin Laden, Iran, Israel, Middle East, Pakistan, Iraq, Afghanistan, Russia
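Given a stream of (timestamp, concepts) pairs, recurring concepts and their evolution over time reduce to counting. The sketch below assumes such pairs are already available; the per-minute bucket size, the function names and the sample records are illustrative choices, not the pipeline used in the talk.

```python
from collections import Counter, defaultdict
from datetime import datetime


def concept_totals(records):
    """Count how many tweets mention each concept."""
    totals = Counter()
    for _timestamp, concepts in records:
        totals.update(set(concepts))
    return totals


def concept_timeline(records):
    """Count concept mentions per minute."""
    timeline = defaultdict(Counter)
    for timestamp, concepts in records:
        minute = timestamp.replace(second=0, microsecond=0)
        timeline[minute].update(set(concepts))
    return timeline


# Made-up sample records.
records = [
    (datetime(2012, 1, 1, 0, 0, 5), ("Iran", "Israel")),
    (datetime(2012, 1, 1, 0, 0, 40), ("Iran",)),
    (datetime(2012, 1, 1, 0, 1, 10), ("Russia", "Iran")),
]
totals = concept_totals(records)
timeline = concept_timeline(records)
print(totals.most_common(1))
```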
Iran - Israel 35.356%
Russia - Middle East 24.7%
...
...
Wikileaks - Richard Nixon 93.5%
most co-occurrent concepts
[diagram values: 17284, 5321, 6960]
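The slides do not say how the co-occurrence percentages are defined. The sketch below assumes one plausible definition: of the tweets mentioning either concept of a pair, the share that mention both (a Jaccard-style ratio). The function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_percentages(concept_sets):
    """Map each concept pair to 100 * |both| / |either| (assumed definition)."""
    single = Counter()
    pairs = Counter()
    for concepts in concept_sets:
        unique = sorted(set(concepts))  # sorted so each pair has one canonical key
        single.update(unique)
        pairs.update(combinations(unique, 2))
    result = {}
    for (a, b), both in pairs.items():
        either = single[a] + single[b] - both
        result[(a, b)] = 100.0 * both / either
    return result


# Made-up sample: one tuple of concepts per tweet.
sample_sets = [
    ("Iran", "Israel"),
    ("Iran",),
    ("Israel", "Iran"),
    ("Russia", "Middle East"),
    ("Israel",),
]
percentages = cooccurrence_percentages(sample_sets)
print(percentages[("Iran", "Israel")])
```

Under a different definition (for example, conditional on one of the two concepts) the numbers would change, so the definition should be stated next to any published percentage.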
data viz is another job entirely
facts
mining data requires science skills: it’s not just about technology, it’s about math
forget about controlling everything when data flows at that speed: make reasoned
approximations
?
@dpalmisano
Davide Palmisano
http://davidepalmisano.com