Hadoop at DataSift

Presentation given at the Edinburgh University Student Tech-Meetup on 6th February 2013.
HADOOP AT DATASIFT
ABOUT ME
Jairam Chandar
Big Data Engineer, DataSift
@jairamc
http://about.me/jairam
http://jairam.me
And I'm a Formula 1 fan!
OUTLINE
• What is DataSift?
• Where do we use Hadoop?
  • The Numbers
  • The Use Cases
  • The Lessons
!! SALES PITCH ALERT !!
WHAT IS DATASIFT?
THE NUMBERS
• Machines
  • HBase
    • 60 machines as RegionServers
    • 1 HMaster
    • 3 ZooKeeper nodes
THE NUMBERS
• Machines
  • Hadoop
    • 135 machines divided into 2 clusters
    • DataNodes/TaskTrackers
    • NameNodes with high-availability failover
    • 1 JobTracker per cluster
THE NUMBERS
• Machines
  • DL380 Gen8
    • 2 × Intel Xeon E5646 @ 2.40 GHz (24 cores total)
    • 48 GB RAM
    • 6 × 2 TB disks in JBOD (small partition on the first disk for the OS; the rest is storage)
    • 1-gigabit network links
THE NUMBERS
• Data
  • Average load of 7,500 interactions per second
  • Peak loads of 15,000 interactions per second sustained over a minute
  • Peak of 21,000 interactions per second during the Super Bowl
  • Total current capacity ~1.6 PB; total current usage ~800 TB
  • Average interaction size of 2 KB; that is ~1 GB a minute, or ~2 TB a day with replication (RF = 3)
  • And that's not all!
THE USE CASES
• HBase
  • Recordings
  • Archive
• Map/Reduce
  • Exports
  • Historics
  • Migration
THE USE CASES
• Recordings
  • User-defined streams
  • Stored in HBase for later retrieval
  • Exported to multiple output formats and stores
  • Row key: <recording-id><interaction-uuid>
  • The recording-id is a SHA-1 hash
  • Allows recordings to be distributed by their key without generating hot-spots
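The row-key scheme above can be sketched as follows. `make_row_key` and the sample recording name are illustrative, not DataSift's actual code; the idea is that the SHA-1 prefix spreads recordings uniformly across the key space while keeping each recording's interactions contiguous:

```python
import hashlib
import uuid

def make_row_key(recording_name: str, interaction_id: uuid.UUID) -> bytes:
    """Build a row key of the form <recording-id><interaction-uuid>.

    Hashing the recording name with SHA-1 gives a fixed-width, uniformly
    distributed 20-byte prefix, so different recordings land on different
    regions (no hot-spots), while all interactions of one recording stay
    contiguous and can be fetched with a single prefix scan."""
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()  # 20 bytes
    return recording_id + interaction_id.bytes                            # + 16 bytes

key = make_row_key("my-recording", uuid.uuid4())
print(len(key))  # 36
```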
THE RECORDER
THE USE CASES
• Exporter
  • Exports data from HBase for customers
  • Export files are ~5-10 GB, or ~3-6 million records
  • MapReduce over HBase using TableInputFormat
  • But the data needs to be sorted
  • TotalOrderPartitioner
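TotalOrderPartitioner produces a globally sorted export by sampling the input keys, choosing split points, and routing each record to the reducer that owns its key range; each reducer then sorts locally and the outputs concatenate into one total order. A minimal sketch of that idea (function names are illustrative, not Hadoop's API):

```python
import bisect
import random

def choose_split_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sample of the input,
    as the InputSampler does before a TotalOrderPartitioner job runs."""
    s = sorted(sample_keys)
    step = len(s) // num_reducers
    return [s[(i + 1) * step] for i in range(num_reducers - 1)]

def partition(key, split_points):
    """Route a key to the reducer that owns its key range."""
    return bisect.bisect_right(split_points, key)

random.seed(42)
keys = [random.randrange(10**6) for _ in range(10_000)]
splits = choose_split_points(random.sample(keys, 500), num_reducers=4)

# Each "reducer" sorts its own bucket; concatenating the buckets in
# order then yields one globally sorted output, file by file.
buckets = [[] for _ in range(4)]
for k in keys:
    buckets[partition(k, splits)].append(k)
merged = [k for bucket in buckets for k in sorted(bucket)]
assert merged == sorted(keys)
```

Without the sampled split points, a hash partitioner would scatter adjacent keys across reducers and only per-file order would survive.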
EXPORTER
HISTORICS
THE USE CASES
• Twitter Import
  • 2 years of Tweets
  • About 95,000,000,000 tweets
  • Over 300 TB with added augmentations
  • The import was not as simple as you might imagine
THE USE CASES
• Archive
  • Not just the Firehose but the Ultrahose
  • Also stored in HBase
  • The HBase architecture (BigTable) creates hot-spots with time-series data
  • Leading randomizing bit (see HBaseWD)
  • Pre-split regions
  • Concurrent writes
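The "leading randomizing bit" technique (as in HBaseWD) can be sketched like this: a small salt derived from the key is prepended, so monotonically increasing timestamp keys fan out across the pre-split regions instead of all hitting the newest one. The names and bucket count below are assumptions for illustration:

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; in practice, match the number of pre-split regions

def salted_key(time_key: bytes) -> bytes:
    """Prepend a one-byte salt derived from the key itself, so rows with
    monotonically increasing timestamps spread across NUM_BUCKETS regions
    instead of all landing on the most recent one."""
    salt = hashlib.md5(time_key).digest()[0] % NUM_BUCKETS
    return bytes([salt]) + time_key

def bucket_scan_ranges(start: bytes, stop: bytes):
    """The cost of salting: one logical time-range read becomes NUM_BUCKETS
    smaller scans, one per salt prefix, merged by the client afterwards."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(NUM_BUCKETS)]

keys = [salted_key(b"%010d" % t) for t in range(100)]
print(len(bucket_scan_ranges(b"0000000000", b"0000000099")))  # 16
```

Deriving the salt from the key (rather than randomly) keeps writes and reads consistent: the same key always lands in the same bucket.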
THE USE CASES
• Historics
  • Exports archive data
  • Slightly different from the Exporter
  • Much larger time ranges (1-3 months)
  • Controlled access to the Hadoop cluster with efficient job scheduling
  • Unfiltered input data
  • Therefore longer processing times
  • Hence more optimization is required
HISTORICS
THE LESSONS
• Tune, tune, tune (defaults == BAD)
  • Based on the use case, tune:
    • Heap
    • Block size
    • Memstore size
• Keep the number of column families low
• Be aware of the hot-spotting issue when writing time-series data
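Most of the tuning above lives in `hbase-site.xml` (heap size is set via `HBASE_HEAPSIZE` in `hbase-env.sh`, and block size per column family at table creation). The property names below are real HBase settings, but the values are illustrative assumptions, not DataSift's actual configuration:

```xml
<!-- hbase-site.xml: example values only; tune per workload -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- flush memstores at 256 MB (write-heavy workloads) -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value> <!-- fraction of heap given to the read block cache -->
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value> <!-- RPC handler threads per RegionServer -->
</property>
```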
THE LESSONS
• Use compression (e.g. Snappy)
• Ops need an intimate understanding of the system
• Monitor system metrics (GC, CPU, compactions, I/O) and application metrics (writes/sec etc.)
• Don't be afraid to fiddle with the HBase code
• Using a distribution is advisable
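Compression and pre-splitting are both declared when a table is created in the HBase shell; the table and column-family names below are hypothetical, not DataSift's schema:

```
hbase> create 'archive', {NAME => 'd', COMPRESSION => 'SNAPPY'},
         SPLITS => ['1', '2', '3', '4']
```

Snappy trades a modest compression ratio for very low CPU cost, which suits write-heavy clusters; the `SPLITS` list pre-creates region boundaries so writes spread out from day one instead of hammering a single initial region.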
QUESTIONS?

We are hiring: http://datasift.com/about-us/careers