possconpresentationjehiahv3

Data processing @ bit.ly Jehiah Czebotar [email protected] @jehiah

Upload: matt-hudson

Post on 13-Mar-2016

DESCRIPTION

http://posscon.org/assets/Uploads/possconpresentationjehiahv3.pdf

TRANSCRIPT

Page 1: possconpresentationjehiahv3

Data processing @ bit.ly

Jehiah Czebotar [email protected] @jehiah

Page 2: possconpresentationjehiahv3

http://www.jsps.go.jp/english/e-jafos/2010_01.html

http://bit.ly/gFNuXa

Page 3: possconpresentationjehiahv3
Page 4: possconpresentationjehiahv3
Page 5: possconpresentationjehiahv3

DATA

Page 6: possconpresentationjehiahv3
Page 7: possconpresentationjehiahv3

Big Dataset, No Updates

Big Dataset, Lots of Updates

Small Dataset, Lots of Updates

The Big Data Problem

Page 8: possconpresentationjehiahv3

sortdb

http://bit.ly/simplehttp

$ sort data.csv > sorted_data.csv
$ sortdb -F ',' -f sorted_data.csv -p 8080
$ curl http://127.0.0.1:8080/get?key=...
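The reason for sorting the file first can be illustrated with a small sketch: once records are sorted by key, a lookup becomes a binary search instead of a full scan. This is only an assumption about the general technique, not sortdb's actual implementation; the `lookup` function and the sample records are hypothetical.

```python
import bisect

def lookup(sorted_lines, key, sep=","):
    """Return the rest of the record whose first field equals key, or None."""
    prefix = key + sep
    # Binary search for the first line that sorts at or after "key,".
    i = bisect.bisect_left(sorted_lines, prefix)
    if i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        return sorted_lines[i][len(prefix):]
    return None

# Hypothetical records, sorted the same way `sort` would sort plain ASCII lines.
records = sorted(["b,2", "a,1", "d,4", "c,3"])
print(lookup(records, "c"))
print(lookup(records, "x"))
```

Each lookup touches only about log2(n) lines, which is what makes a large, never-updated dataset cheap to serve.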

Page 9: possconpresentationjehiahv3

simplequeue

• $ curl -d "data=..." http://simplequeue/put
• $ data=`curl http://simplequeue/get`

http://bit.ly/simplehttp

Page 10: possconpresentationjehiahv3

Passing database changes through a queue lets you decouple the performance of receiving data from a client from the performance of adding it to a database.
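A minimal in-process sketch of that decoupling, assuming Python's standard `queue.Queue` as a stand-in for simplequeue and a plain list as a stand-in for the database (both are illustrative substitutes, not what the talk describes running in production):

```python
import queue
import threading

pending = queue.Queue()   # stand-in for simplequeue
database = []             # stand-in for the real database

def receive(change):
    """Client-facing path: enqueue the change and return immediately."""
    pending.put(change)

def writer():
    """Background path: apply changes at whatever rate the database allows."""
    while True:
        change = pending.get()
        if change is None:   # sentinel value: stop the worker
            break
        database.append(change)

worker = threading.Thread(target=writer)
worker.start()
for n in range(5):
    receive({"id": n})
pending.put(None)
worker.join()
print(len(database))
```

The client-facing `receive` never waits on the database, so a slow or briefly unavailable database only makes the queue grow instead of slowing down clients.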

http://bit.ly/simplehttp

simplequeue

Page 11: possconpresentationjehiahv3

simplequeue

http://bit.ly/simplehttp

Page 12: possconpresentationjehiahv3

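class BackoffTimer(object):
    def __init__(self):
        self.interval = 0  # seconds to sleep before the next attempt

    def failure(self):
        # double the delay, starting from 1 second
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # quarter the delay and subtract 1 second, bottoming out at 0
        self.interval = max(self.interval * .25, 1) - 1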

Allows processing to slow down gracefully when remote systems become unavailable or start returning errors (e.g. sleep 1s, 2s, 4s, 8s, 16s, 32s, ...).
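To see the sleep sequence concretely, the sketch below drives a timer through six failures and then through successes until the delay returns to zero. It assumes `failure()` doubles the interval from a floor of 1 second, which is what the quoted 1s, 2s, 4s, ... sequence implies; the driver loop and list names are illustrative only.

```python
class BackoffTimer(object):
    def __init__(self):
        self.interval = 0

    def failure(self):
        # Assumption: double the delay, starting at 1 second.
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # Quarter the delay and subtract 1 second, bottoming out at 0.
        self.interval = max(self.interval * .25, 1) - 1

timer = BackoffTimer()

after_failures = []
for _ in range(6):
    timer.failure()
    after_failures.append(timer.interval)

after_successes = []
while timer.interval:
    timer.success()
    after_successes.append(timer.interval)

print(after_failures)
print(after_successes)
```

Six consecutive failures give delays of 1, 2, 4, 8, 16 and 32 seconds; three successes then bring the delay back down through 7.0 and 0.75 to 0, so recovery is much faster than the slowdown.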

Page 13: possconpresentationjehiahv3

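import time

class QueueReader(object):
    def __init__(self):
        self.backoff_timer = BackoffTimer()

    def run(self):
        # `queue` is a client object with a get() method, defined elsewhere;
        # handle() is supplied by a subclass
        while True:
            try:
                data = queue.get()
                if not data:
                    # queue is empty; poll again shortly
                    time.sleep(.5)
                    continue
                self.handle(data)
                self.backoff_timer.success()
            except Exception:
                self.backoff_timer.failure()

            # sleep only while the timer is backing off
            if self.backoff_timer.interval:
                time.sleep(self.backoff_timer.interval)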

Page 14: possconpresentationjehiahv3

pubsub

• long-lived, persistent HTTP connections that stream back JSON messages

• a way to separate the core data collection (or production) from data consumers

http://bit.ly/simplehttp

Page 15: possconpresentationjehiahv3

•  $ curl --silent http://pubsubserver/sub

•  { "a": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3 like Mac OS X; ja-jp) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8F190", "c": "JP", "nk": 1, "tz": "Asia/Tokyo", "gr": "40", "g": "hkIdmh", "h": "g0ABCf", "k": "4d8547e6-0022b-04438-d8ac8fa8", "l": "portalexcite", "al": "ja-jp", "hh": "bit.ly", "r": "direct", "u": "http://paltyyuria.exblog.jp/15685838/", "t": 1300587345, "hc": 1300587133, "cy": "Tokyo", "ll": [ 35.685001, 139.751404 ], "i": "613d872b3663f1f0cd54b48653ec788" }

•  { "a": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; OfficeLiveConnector.1.5; OfficeLivePatch.1.3; .NET4.0C; .NET CLR 3.0.30729)", "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "fTPW1w", "h": "fK3C3I", "k": "4d856351-001b9-07054-c6ac8fa8", "l": "espn", "al": "en-us", "hh": "es.pn", "r": "http://espn.go.com/mlb/", "u": "http://espn.go.com/blog/dallas/texas-rangers/post/_/id/4861596/surprise-six-saturday-camp-recap-4", "t": 1300587345, "hc": 1300584373, "cy": "Dallas", "ll": [ 32.809799, -96.799301 ], "i": "6686654b9493543ff18d36120d5caa9" }
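A consumer of such a stream can be sketched as a loop that parses one JSON object per line. The example below assumes newline-delimited messages and reads from an in-memory stream instead of a live connection; the two messages are abbreviated copies of the samples above, and treating "c" as a country code is an inference from its values. `count_by_field` is a hypothetical helper.

```python
import io
import json
from collections import Counter

# Two abbreviated messages, using field values from the samples above.
stream = io.StringIO(
    '{"c": "JP", "hh": "bit.ly", "t": 1300587345}\n'
    '{"c": "US", "hh": "es.pn", "t": 1300587345}\n'
)

def count_by_field(lines, field):
    """Count messages by the value of one field, one JSON object per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, if the stream sends any
        message = json.loads(line)
        counts[message.get(field)] += 1
    return counts

counts = count_by_field(stream, "c")
print(counts["JP"], counts["US"])
```

Because the consumer only reads from the stream, new consumers like this one can be added without touching the system that produces the data.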

Page 16: possconpresentationjehiahv3

tools we like

• memcached
• tokyo tyrant / tokyo cabinet
• simplehttp (simplequeue, sortdb, pubsub)
• tornado (fast async python framework)
• json files
• mysql (battle tested; reliable replication)
• mongod

Page 17: possconpresentationjehiahv3

github.com/bitly/simplehttp

Page 18: possconpresentationjehiahv3

@jehiah

Thank you! Questions?