twitter big data
DESCRIPTION
How to access & analyse Twitter big data. Full working example using Storm and RedStorm in Ruby & JRuby. Code on GitHub: https://github.com/colinsurprenant/tweitgeist and live demo: http://tweitgeist.needium.com/
TRANSCRIPT
![Page 1: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/1.jpg)
Twitter Big Data
Colin Surprenant
@colinsurprenant
Lead ninja
warning: this presentation contains a few rage faces
![Page 2: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/2.jpg)
Twitter - Spring 2012
● 350 million tweets/day
● 140 million active users
● >1 million applications using API
![Page 3: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/3.jpg)
Daily Twitter @ Needium
50 000 000 processed tweets
5 000 opportunities
500 messages sent
100GB data
![Page 4: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/4.jpg)
Needium Geo
![Page 5: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/5.jpg)
Anatomy of a tweet
What's in a tweet?
avatar
user
timestamp
message
Anything else?
![Page 6: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/6.jpg)
![Page 7: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/7.jpg)
How to get the tweets?
● Streaming API
○ Subscribe to realtime feeds moving forward
● REST Search API
○ Search request on past data (1 week)
![Page 8: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/8.jpg)
Streaming API
public statuses from all users
● status/filter (track/location/follow)
○ 5000 follow user ids
○ 400 track keywords
○ 25 location boxes
○ rate limited
● status/sample
○ 1% of all public statuses (message id mod 100)
○ two status/sample streams will result in same data
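The "message id mod 100" note explains why two sample streams carry the same data: the selection is a deterministic function of the status id, not a random draw per connection. A minimal sketch of that idea follows. The bucket value and the id range are assumptions chosen for illustration, not Twitter's actual values.

```ruby
# Assumption: a status is "in the sample" when its id modulo 100 hits one
# fixed bucket. The bucket (0) is illustrative, not Twitter's real choice.
SAMPLE_BUCKET = 0

def in_sample?(status_id, bucket = SAMPLE_BUCKET)
  status_id % 100 == bucket
end

# Two independent "connections" looking at the same public statuses.
status_ids = (1_000..1_999).to_a
stream_a = status_ids.select { |id| in_sample?(id) }
stream_b = status_ids.select { |id| in_sample?(id) }

puts stream_a.size          # 10 of the 1000 ids, i.e. 1%
puts stream_a == stream_b   # true: both connections see the same statuses
```

Because the rule depends only on the id, opening a second status/sample connection does not yield additional tweets.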
![Page 9: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/9.jpg)
Streaming API
per user streams
● User Streams
○ all data required to update a user's display
○ requires user's OAuth token
○ statuses from followings, direct messages, mentions
○ cannot open large number of user streams from same host
● Site Streams
○ multiplexing of multiple User Streams
![Page 10: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/10.jpg)
Streaming API
Firehose
need more/full data? only through partners
● gnip.com
● datasift.com
● filtering/tracking
● partial to full Firehose
What's the catch?
![Page 11: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/11.jpg)
Streaming API
Firehose
Base Twitter data license
$0.10 per 1000 tweets
approx. $1 million/month for full Firehose
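The monthly figure can be checked from numbers already in the deck: 350 million tweets/day at $0.10 per 1000 tweets. The sketch below redoes that arithmetic. The 30-day month is an assumption, and the constant names are illustrative.

```ruby
TWEETS_PER_DAY     = 350_000_000  # from the "Twitter - Spring 2012" slide
PRICE_PER_1000_USD = 0.10         # base data license quoted above
DAYS_PER_MONTH     = 30           # assumption: a 30-day month

def firehose_cost_per_day(tweets_per_day, price_per_1000)
  tweets_per_day / 1000.0 * price_per_1000
end

daily   = firehose_cost_per_day(TWEETS_PER_DAY, PRICE_PER_1000_USD)
monthly = daily * DAYS_PER_MONTH

puts format("daily:   $%.0f", daily)
puts format("monthly: $%.0f", monthly)
```

That works out to about $35,000 per day and $1.05 million per month, which matches the rounded figure on the slide.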
![Page 12: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/12.jpg)
Streaming API
Firehose
startup?
![Page 13: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/13.jpg)
Search API
● REST API (http request/response)
○ search query
○ geocode (lat, long, radius)
○ result type (mixed/recent/popular)
○ since id
● max 100 results per page (rpp) and 1500 results per query
● rate limited (~1 request/sec)
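As a rough illustration of how these parameters combine into one request, the sketch below builds a query string. The endpoint URL and the parameter names (`q`, `rpp`, `page`, `result_type`, `geocode`, `since_id`) are assumptions based on the 2012-era Search API. Treat them as placeholders and check them against whatever API version is actually in use.

```ruby
require "uri"

# Assumption: 2012-era endpoint and parameter names, used as placeholders.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"
MAX_RPP     = 100   # max results per page (from the slide)
MAX_RESULTS = 1500  # max results per query (from the slide)

def search_url(query, lat: nil, long: nil, radius: nil,
               result_type: "recent", since_id: nil, page: 1)
  params = { q: query, rpp: MAX_RPP, page: page, result_type: result_type }
  params[:geocode]  = "#{lat},#{long},#{radius}" if lat && long && radius
  params[:since_id] = since_id if since_id
  "#{SEARCH_ENDPOINT}?#{URI.encode_www_form(params)}"
end

# With 100 results per page, 1500 results means at most 15 pages.
def max_pages
  MAX_RESULTS / MAX_RPP
end

url = search_url("#storm", lat: 45.5, long: -73.57, radius: "25km", since_id: 123)
puts url
puts max_pages
```

Passing the highest id already seen as `since_id` on each poll avoids fetching the same tweets twice.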
![Page 14: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/14.jpg)
Twitter Geo
NO simple way to grab ALL tweets for a given region
![Page 15: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/15.jpg)
Twitter Geo
Streaming API
● status/filter + location (bounding box)
○ only tweets with explicit coordinates
○ < 10% of all tweets
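A bounding-box filter can only match tweets that carry explicit coordinates, which is why it misses most of them. The sketch below shows the check itself. The `BoundingBox` struct, the approximate Montreal coordinates, and the sample tweets are all hypothetical, and boxes that cross the antimeridian are not handled.

```ruby
# A bounding box as south-west and north-east corners, in degrees.
BoundingBox = Struct.new(:sw_long, :sw_lat, :ne_long, :ne_lat) do
  def covers?(long, lat)
    long.between?(sw_long, ne_long) && lat.between?(sw_lat, ne_lat)
  end

  # Comma-separated "sw_long,sw_lat,ne_long,ne_lat" string.
  def to_param
    [sw_long, sw_lat, ne_long, ne_lat].join(",")
  end
end

# Rough, illustrative box around Montreal (coordinates are approximate).
montreal = BoundingBox.new(-73.98, 45.40, -73.47, 45.71)

# Hypothetical tweets: only some carry explicit [long, lat] coordinates.
tweets = [
  { text: "poutine!",           coordinates: [-73.57, 45.50] },
  { text: "hello from Toronto", coordinates: [-79.38, 43.65] },
  { text: "no geo here",        coordinates: nil }
]

matches = tweets.select do |t|
  t[:coordinates] && montreal.covers?(*t[:coordinates])
end

p matches.map { |t| t[:text] }   # ["poutine!"]
```

The third tweet may well come from Montreal, but without coordinates the filter cannot tell.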
![Page 16: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/16.jpg)
Twitter Geo
Streaming API
● Firehose
○ < 10% of all tweets contain explicit coordinates
○ must do reverse geocoding on user profile location
○ user profile location is free form
![Page 17: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/17.jpg)
Twitter Geo
Search API
● geocode (lat, long, radius)
● tweets with explicit coordinates
● tweets reverse geocoded from user profile location
● location field: free form text (Montreal / Montreal,Qc / Mtl / Mourial)
● false positives
● REST API: not for frequent polling
● rate limited (1 req/sec/ip)
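To make the free-form location problem concrete, here is a minimal sketch that maps the spellings from the slide onto one canonical city name. The alias table and the `normalize_location` helper are hypothetical. A real system would need a far larger table or a geocoding service, and would still produce false positives.

```ruby
# Hypothetical alias table built from the spellings on the slide.
CITY_ALIASES = {
  "montreal"    => "Montreal",
  "montreal,qc" => "Montreal",
  "mtl"         => "Montreal",
  "mourial"     => "Montreal"
}.freeze

# Returns the canonical city name, or nil when the text is not recognized.
def normalize_location(raw)
  return nil if raw.nil?
  key = raw.downcase.gsub(/\s+/, "")   # ignore case and all whitespace
  CITY_ALIASES[key]
end

p normalize_location("Mtl")            # "Montreal"
p normalize_location(" Montreal, Qc ") # "Montreal"
p normalize_location("Paris")          # nil
```

Exact matching on a normalized key is deliberately strict: looser substring matching would recognize more profiles but would also match text such as "moved away from Mtl".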
![Page 18: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/18.jpg)
Twitter Geo
Solutions? That's your job! But seriously?
![Page 19: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/19.jpg)
Twitter Geo
● search API intelligent polling farm
○ adjust polling interval to minimize polling in relation to traffic
● streaming API status/filter/follow reader farm?
○ find N relevant users from city, # stream readers = N / 5000
○ must do reverse geocoding
○ user list dynamic update
● TOS gray zone
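Both ideas above come down to small calculations. The sketch below shows one possible polling policy (halve the interval when a poll comes back full, double it when it comes back empty) and the reader count rounded up from N / 5000. The policy, the interval bounds, and the function names are assumptions for illustration, not the approach Needium actually used.

```ruby
MIN_INTERVAL = 1     # seconds; the Search API slide quotes ~1 request/sec
MAX_INTERVAL = 300   # seconds; arbitrary ceiling for quiet queries
PAGE_SIZE    = 100   # max results per page (rpp)
FOLLOW_LIMIT = 5000  # max followed user ids per status/filter stream

# Halve the interval when the last poll came back full (traffic is high),
# double it when it came back empty, otherwise keep it unchanged.
def next_interval(current, results_count)
  proposed =
    if results_count >= PAGE_SIZE
      current / 2.0
    elsif results_count.zero?
      current * 2.0
    else
      current
    end
  [[proposed, MIN_INTERVAL].max, MAX_INTERVAL].min
end

# Number of status/filter readers needed to follow n users.
def stream_readers_for(n_users)
  (n_users.to_f / FOLLOW_LIMIT).ceil
end

p next_interval(30, 100)       # 15.0
p next_interval(30, 0)         # 60.0
p stream_readers_for(12_000)   # 3
```

Rounding up matters: 12 000 users need three readers, since two readers cover only 10 000 ids.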
![Page 20: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/20.jpg)
Storm
Distributed and fault-tolerant realtime computation
https://github.com/nathanmarz/storm
![Page 21: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/21.jpg)
Storm
The promise
● Guaranteed data processing
● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers
● Higher level abstraction than message passing
● Just works
![Page 22: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/22.jpg)
RedStorm
JRuby integration & DSL for Storm
Simplicity of Ruby + power of Storm
https://github.com/colinsurprenant/redstorm
![Page 23: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/23.jpg)
Storm
Typical use cases
● Stream processing
● Continuous computation
![Page 24: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/24.jpg)
Storm
Concepts
Streams
Unbounded sequence of tuples
[Diagram: a stream drawn as a sequence of tuples]
![Page 25: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/25.jpg)
Storm
Concepts
Spouts
Source of streams
[Diagram: a spout emitting streams of tuples]
![Page 26: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/26.jpg)
Storm
Concepts
Bolts
Process input streams and produce new streams
[Diagram: a bolt consuming input streams of tuples and emitting new streams]
![Page 27: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/27.jpg)
Storm
Concepts
Topology
Network of spouts and bolts
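The four concepts can be mimicked in plain Ruby to see how they fit together. This is only an analogy using lazy enumerators, not Storm's or RedStorm's API. The sample texts and the lambda names are made up for illustration.

```ruby
# Spout: a source of tuples. Here, an endless enumerator of fake tweets.
SAMPLE_TEXTS = ["learning #storm", "#Ruby and #storm", "no tags here"].freeze

spout = Enumerator.new do |out|
  i = 0
  loop do
    out << { text: SAMPLE_TEXTS[i % SAMPLE_TEXTS.size] }
    i += 1
  end
end

# Bolts: each consumes a stream and produces a new stream.
extract_hashtags = ->(stream) { stream.flat_map { |t| t[:text].scan(/#\w+/) } }
downcase_tags    = ->(stream) { stream.map(&:downcase) }

# Topology: the spout wired through the bolts. Lazy evaluation keeps the
# unbounded stream from ever being fully materialized.
topology = downcase_tags.call(extract_hashtags.call(spout.lazy))

p topology.first(4)   # ["#storm", "#ruby", "#storm", "#storm"]
```

What this analogy leaves out is the point of Storm: running each spout and bolt as many parallel tasks across machines, with tuples routed between them.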
![Page 28: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/28.jpg)
Storm
What Storm does
● Distributes code
● Robust process management
● Monitors topologies and reassigns failed tasks
● Provides reliability by tracking tuple trees
● Routing and partitioning of stream
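Tuple-tree tracking can be done cheaply with XOR: each tuple id is XORed into a running value once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when every emitted tuple has been acked. The class below is a simplified sketch of that idea as Storm's documentation describes it. The class name, method names, and small fixed ids are hypothetical (Storm uses random 64-bit ids), and none of this is Storm's actual code.

```ruby
# Tracks one tuple tree with a single integer, whatever the tree's size.
class TupleTreeTracker
  def initialize
    @ack_val = 0
  end

  # Called when a tuple is emitted as part of the tree.
  def emitted(tuple_id)
    @ack_val ^= tuple_id
  end

  # Called when a bolt acks the tuple. XOR with the same id cancels it out.
  def acked(tuple_id)
    @ack_val ^= tuple_id
  end

  def complete?
    @ack_val.zero?
  end
end

tracker = TupleTreeTracker.new
root, child_a, child_b = 0xA1, 0xB2, 0xC3   # illustrative ids

[root, child_a, child_b].each { |id| tracker.emitted(id) }
tracker.acked(root)
tracker.acked(child_a)
p tracker.complete?   # false: child_b is still pending
tracker.acked(child_b)
p tracker.complete?   # true: every emitted tuple has been acked
```

If the value does not reach zero within a timeout, the source tuple can be replayed, which is what makes the processing guarantee possible.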
![Page 29: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/29.jpg)
Tweitgeist
Live top 10 trending hashtags on Twitter
DEMO
Code: https://github.com/colinsurprenant/tweitgeist
Live demo: http://tweitgeist.needium.com
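A live trending list boils down to counting hashtags over a sliding time window and keeping the top N. The sketch below shows that computation in a single process. The `RollingCounter` class, the window size, and the integer timestamps are assumptions for illustration. It is not the Tweitgeist implementation, which spreads this work across several bolts.

```ruby
class RollingCounter
  def initialize(window_seconds)
    @window = window_seconds
    @events = []   # [timestamp, hashtag] pairs
  end

  def add(hashtag, at)
    @events << [at, hashtag.downcase]
  end

  # Top n hashtags seen within the window ending at `now`.
  def top(n, now)
    @events.reject! { |ts, _| ts <= now - @window }   # drop expired events
    counts = Hash.new(0)
    @events.each { |_, tag| counts[tag] += 1 }
    # Highest count first; ties broken alphabetically for a stable order.
    counts.sort_by { |tag, count| [-count, tag] }.first(n)
  end
end

counter = RollingCounter.new(60)
counter.add("#storm", 0)
counter.add("#ruby", 10)
counter.add("#Storm", 20)
first_ranking = counter.top(2, 30)
p first_ranking    # [["#storm", 2], ["#ruby", 1]]

counter.add("#jruby", 70)
second_ranking = counter.top(2, 75)
p second_ranking   # [["#jruby", 1], ["#storm", 1]]
```

By the second call the events at 0 and 10 seconds have left the 60-second window, so "#storm" drops to one occurrence and "#ruby" disappears.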
![Page 30: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/30.jpg)
Tweitgeist
[Diagram: topology data flow]
Twitter → Redis queue → TwitterStreamSpout → ExtractMessageBolt → ExtractHashtagBolt → RollingCountBolt → RankBolt → MergeBolt → Redis queue → UI
● TwitterStreamSpout: stream reader
● ExtractMessageBolt: message extract (shuffle grouping)
● ExtractHashtagBolt: hashtag extract (shuffle grouping)
● RollingCountBolt: rolling counter (fields grouping on hashtag)
● RankBolt: ranking (fields grouping on hashtag)
● MergeBolt: merging (global grouping)
Groupings
● Shuffle grouping: Tuples are randomly distributed across the bolt's tasks
● Fields grouping: The stream is partitioned by the fields specified in the grouping
● Global grouping: The entire stream goes to a single one of the bolt's tasks
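The three groupings differ only in how a tuple is mapped to one of a bolt's tasks. The sketch below expresses each as a function returning a task index. The function names and the use of CRC32 as the hash are assumptions for illustration. Storm has its own hashing and task selection.

```ruby
require "zlib"

# Each function returns the index of the task that receives the tuple.

# Shuffle: any task, chosen at random.
def shuffle_grouping(_tuple, n_tasks, rng = Random.new)
  rng.rand(n_tasks)
end

# Fields: hash the field value, so equal values always reach the same task.
def fields_grouping(tuple, n_tasks, field)
  Zlib.crc32(tuple.fetch(field).to_s) % n_tasks
end

# Global: the whole stream goes to one task (index 0 in this sketch).
def global_grouping(_tuple, _n_tasks)
  0
end

tuples = [{ hashtag: "#storm" }, { hashtag: "#ruby" }, { hashtag: "#storm" }]
tasks  = tuples.map { |t| fields_grouping(t, 4, :hashtag) }

puts tasks[0] == tasks[2]   # true: both "#storm" tuples reach the same task
```

This is why the counting and ranking steps use a fields grouping on the hashtag: every occurrence of a hashtag must reach the same task for its count to be correct, while the final merge needs a single task to see all partial rankings.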
![Page 31: Twitter Big Data](https://reader034.vdocuments.us/reader034/viewer/2022042714/54b795274a795953368b4790/html5/thumbnails/31.jpg)
Tweitgeist
Topology definition