possconpresentationjehiahv3
DESCRIPTION
http://posscon.org/assets/Uploads/possconpresentationjehiahv3.pdf
TRANSCRIPT
Data processing @ bit.ly
Jehiah Czebotar @jehiah
http://www.jsps.go.jp/english/e-jafos/2010_01.html!
http://bit.ly/gFNuXa!
DATA
Big Dataset, No Updates
Big Dataset, Lots of Updates
Small Dataset, Lots of Updates
The Big Data Problem
sortdb
http://bit.ly/simplehttp
$ sort data.csv > sorted_data.csv
$ sortdb -F ',' -f sorted_data.csv -p 8080
$ curl http://127.0.0.1:8080/get?key=...
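Serving key lookups from a pre-sorted file can be pictured as a binary search on the key. The following is a minimal Python sketch of that idea under stated assumptions, not sortdb's actual implementation: it keeps the sorted lines in memory, and the function name `lookup` and the sample records are hypothetical.

```python
import bisect

def lookup(sorted_lines, key, sep=","):
    """Return the value stored for key in sorted 'key<sep>value' lines, or None."""
    prefix = key + sep
    # bisect_left finds the first line >= prefix in O(log n) comparisons
    i = bisect.bisect_left(sorted_lines, prefix)
    if i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        return sorted_lines[i][len(prefix):]
    return None

# hypothetical records; sorted() stands in for `sort data.csv`
records = sorted([
    "g0ABCf,http://example.com/a",
    "fK3C3I,http://example.com/b",
    "abc123,http://example.com/c",
])

print(lookup(records, "fK3C3I"))
print(lookup(records, "missing"))
```

Because the data never changes after sorting, no index has to be maintained, which is why this approach suits the "big dataset, no updates" case.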
simplequeue
$ curl -d "data=..." http://simplequeue/put
$ data=`curl http://simplequeue/get`
passing database changes through a queue decouples the performance of
receiving data from a client from the performance of adding it to a database
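The decoupling can be sketched with Python's in-process `queue.Queue` standing in for simplequeue. This is an illustration only: the names `receive` and `write_to_database`, the list used as a "database", and the `None` sentinel are all assumptions made for the example.

```python
import queue
import threading

def receive(q, items):
    """Client-facing side: enqueue each item and return immediately."""
    for item in items:
        q.put(item)

def write_to_database(q, db):
    """Worker side: drain the queue at whatever pace the database allows."""
    while True:
        item = q.get()
        if item is None:   # sentinel: no more work
            break
        db.append(item)    # stand-in for a real database write

q = queue.Queue()
db = []  # hypothetical in-memory "database"
worker = threading.Thread(target=write_to_database, args=(q, db))
worker.start()

receive(q, ["click-1", "click-2", "click-3"])
q.put(None)
worker.join()
print(db)
```

The receiving side never waits on the database; a slow write only makes the queue longer.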
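class BackoffTimer(object):
    def __init__(self):
        self.interval = 0

    def failure(self):
        # double the delay, starting at 1s (max rather than min,
        # otherwise the interval could never grow from 0)
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # decay the delay back toward 0
        self.interval = max(self.interval * .25, 1) - 1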
allows processing to gracefully slow down when remote systems become unavailable
or start returning errors (e.g. sleep 1s, 2s, 4s, 8s, 16s, 32s, ...)
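To make the sleep sequence concrete, the trace below runs six failures followed by three successes. It assumes `failure()` doubles the interval with a floor of 1 second, which is what the 1s, 2s, 4s, ... sequence implies; the variable names `on_failure` and `on_success` are only for illustration.

```python
class BackoffTimer(object):
    def __init__(self):
        self.interval = 0

    def failure(self):
        # double the delay, starting from 1 second
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # quarter the delay and subtract 1s; reaches 0 once the
        # quartered value is 1s or less
        self.interval = max(self.interval * .25, 1) - 1

timer = BackoffTimer()

on_failure = []
for _ in range(6):
    timer.failure()
    on_failure.append(timer.interval)

on_success = []
for _ in range(3):
    timer.success()
    on_success.append(timer.interval)

print(on_failure)  # [1, 2, 4, 8, 16, 32]
print(on_success)  # [7.0, 0.75, 0]
```

Recovery is much faster than backoff: three successes are enough to undo six consecutive failures.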
import time

class QueueReader(object):
    def __init__(self):
        self.backoff_timer = BackoffTimer()

    def run(self):
        while True:
            try:
                data = queue.get()
                if not data:
                    # queue is empty; poll again shortly
                    time.sleep(.5)
                    continue
                self.handle(data)
                self.backoff_timer.success()
            except Exception:
                self.backoff_timer.failure()

            # sleep only while backing off after failures
            if self.backoff_timer.interval:
                time.sleep(self.backoff_timer.interval)
pubsub
• long-lived persistent HTTP connections that stream back JSON messages
• a way to separate the core data collection (or production) from data consumers
$ curl --silent http://pubsubserver/sub
• { "a": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3 like Mac OS X; ja-jp) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8F190", "c": "JP", "nk": 1, "tz": "Asia/Tokyo", "gr": "40", "g": "hkIdmh", "h": "g0ABCf", "k": "4d8547e6-0022b-04438-d8ac8fa8", "l": "portalexcite", "al": "ja-jp", "hh": "bit.ly", "r": "direct", "u": "http://paltyyuria.exblog.jp/15685838/", "t": 1300587345, "hc": 1300587133, "cy": "Tokyo", "ll": [ 35.685001, 139.751404 ], "i": "613d872b3663f1f0cd54b48653ec788" }
• { "a": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; OfficeLiveConnector.1.5; OfficeLivePatch.1.3; .NET4.0C; .NET CLR 3.0.30729)", "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "fTPW1w", "h": "fK3C3I", "k": "4d856351-001b9-07054-c6ac8fa8", "l": "espn", "al": "en-us", "hh": "es.pn", "r": "http://espn.go.com/mlb/", "u": "http://espn.go.com/blog/dallas/texas-rangers/post/_/id/4861596/surprise-six-saturday-camp-recap-4", "t": 1300587345, "hc": 1300584373, "cy": "Dallas", "ll": [ 32.809799, -96.799301 ], "i": "6686654b9493543ff18d36120d5caa9" }
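A consumer of such a stream only needs to decode one JSON object per line. The sketch below tallies messages by the "c" field, which appears to be a country code judging by the sample values. It works on any iterable of lines rather than a live connection; the function name `count_by_country`, the abbreviated sample messages, and the handling of blank lines are assumptions for illustration.

```python
import json
from collections import Counter

def count_by_country(lines):
    """Tally decoded messages by their "c" field."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # assumption: blank lines may appear and carry no data
            continue
        message = json.loads(line)
        counts[message.get("c", "unknown")] += 1
    return counts

# abbreviated versions of the two messages above, plus one blank line
sample = [
    '{"c": "JP", "tz": "Asia/Tokyo", "hh": "bit.ly", "t": 1300587345}',
    '',
    '{"c": "US", "tz": "America/Chicago", "hh": "es.pn", "t": 1300587345}',
]

counts = count_by_country(sample)
print(counts)
```

Since the function takes any iterable of lines, the same code could be fed from a streaming HTTP response or from a saved log file.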
tools we like
• memcached
• tokyo tyrant / tokyo cabinet
• simplehttp (simplequeue, sortdb, pubsub)
• tornado (fast async python framework)
• json files
• mysql (battle tested; reliable replication)
• mongod
github.com/bitly/simplehttp
@jehiah
Thank you! Questions?