Hadoop at DataSift

Presentation given at the Edinburgh University Student Tech-Meetup on 6th February 2013.
HADOOP AT DATASIFT
ABOUT ME
Jairam Chandar
Big Data Engineer, DataSift
@jairamc
http://about.me/jairam
http://jairam.me
And I'm a Formula 1 fan!
OUTLINE
• What is DataSift?
• Where do we use Hadoop?
  • The Numbers
  • The Use Cases
  • The Lessons
!! SALES PITCH ALERT !!
WHAT IS DATASIFT?
THE NUMBERS
• Machines
  • HBase
    • 60 machines as RegionServers
    • 1 HMaster
    • 3 ZooKeeper nodes
THE NUMBERS
• Machines
  • Hadoop
    • 135 machines divided into 2 clusters
    • DataNodes/TaskTrackers
    • NameNodes with high-availability failover
    • 1 JobTracker per cluster
THE NUMBERS
• Machines
  • DL380 Gen8
    • 2 × Intel Xeon E5646 @ 2.40 GHz (24 cores total)
    • 48 GB RAM
    • 6 × 2 TB disks in JBOD (small partition on the first disk for the OS; the rest is storage)
    • 1-gigabit network links
THE NUMBERS
• Data
  • Average load of 7,500 interactions per second
  • Peak loads of 15,000 interactions per second sustained over a minute
  • Peak of 21,000 interactions per second during the Super Bowl
  • Total current capacity ~1.6 PB; total current usage ~800 TB
  • Average interaction size of 2 KB; that is ~1 GB a minute, or ~2 TB a day with replication (RF = 3)
  • And that's not all!
THE USE CASES
• HBase
  • Recordings
  • Archive
• Map/Reduce
  • Exports
  • Historics
  • Migration
THE USE CASES
• Recordings
  • User-defined streams
  • Stored in HBase for later retrieval
  • Exported to multiple output formats and stores
  • Row key: <recording-id><interaction-uuid>
  • The recording-id is a SHA-1 hash
  • Allows recordings to be distributed by their key without generating hot-spots
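The row-key scheme above can be sketched as follows. `make_row_key` and the sample recording name are illustrative, not DataSift's actual code; the idea is that the SHA-1 prefix spreads recordings uniformly across the key space while keeping each recording's interactions contiguous:

```python
import hashlib
import uuid

def make_row_key(recording_name: str, interaction_id: uuid.UUID) -> bytes:
    """Build a row key of the form <recording-id><interaction-uuid>.

    Hashing the recording name with SHA-1 gives a fixed-width, uniformly
    distributed 20-byte prefix, so different recordings land on different
    regions (no hot-spots), while all interactions of one recording stay
    contiguous and can be fetched with a single prefix scan."""
    recording_id = hashlib.sha1(recording_name.encode("utf-8")).digest()  # 20 bytes
    return recording_id + interaction_id.bytes                            # + 16 bytes

key = make_row_key("my-recording", uuid.uuid4())
print(len(key))  # 36
```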
THE RECORDER
THE USE CASES
• Exporter
  • Exports data from HBase for customers
  • Export files are ~5-10 GB, or ~3-6 million records
  • MapReduce over HBase using TableInputFormat
  • But the data needs to be sorted
  • TotalOrderPartitioner
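TotalOrderPartitioner produces a globally sorted export by sampling the input keys, choosing split points, and routing each record to the reducer that owns its key range; each reducer then sorts locally and the outputs concatenate into one total order. A minimal sketch of that idea (function names are illustrative, not Hadoop's API):

```python
import bisect
import random

def choose_split_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sample of the input,
    as the InputSampler does before a TotalOrderPartitioner job runs."""
    s = sorted(sample_keys)
    step = len(s) // num_reducers
    return [s[(i + 1) * step] for i in range(num_reducers - 1)]

def partition(key, split_points):
    """Route a key to the reducer that owns its key range."""
    return bisect.bisect_right(split_points, key)

random.seed(42)
keys = [random.randrange(10**6) for _ in range(10_000)]
splits = choose_split_points(random.sample(keys, 500), num_reducers=4)

# Each "reducer" sorts its own bucket; concatenating the buckets in
# order then yields one globally sorted output, file by file.
buckets = [[] for _ in range(4)]
for k in keys:
    buckets[partition(k, splits)].append(k)
merged = [k for bucket in buckets for k in sorted(bucket)]
assert merged == sorted(keys)
```

Without the sampled split points, a hash partitioner would scatter adjacent keys across reducers and only per-file order would survive.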
EXPORTER
HISTORICS
THE USE CASES
• Twitter Import
  • 2 years of Tweets
  • About 95,000,000,000 tweets
  • Over 300 TB with added augmentations
  • The import was not as simple as you might imagine
THE USE CASES
• Archive
  • Not just the Firehose but the Ultrahose
  • Also stored in HBase
  • The HBase architecture (BigTable) creates hot-spots with time-series data
  • Leading randomizing bit (see HBaseWD)
  • Pre-split regions
  • Concurrent writes
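The "leading randomizing bit" technique (as in HBaseWD) can be sketched like this: a small salt derived from the key is prepended, so monotonically increasing timestamp keys fan out across the pre-split regions instead of all hitting the newest one. The names and bucket count below are assumptions for illustration:

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; in practice, match the number of pre-split regions

def salted_key(time_key: bytes) -> bytes:
    """Prepend a one-byte salt derived from the key itself, so rows with
    monotonically increasing timestamps spread across NUM_BUCKETS regions
    instead of all landing on the most recent one."""
    salt = hashlib.md5(time_key).digest()[0] % NUM_BUCKETS
    return bytes([salt]) + time_key

def bucket_scan_ranges(start: bytes, stop: bytes):
    """The cost of salting: one logical time-range read becomes NUM_BUCKETS
    smaller scans, one per salt prefix, merged by the client afterwards."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(NUM_BUCKETS)]

keys = [salted_key(b"%010d" % t) for t in range(100)]
print(len(bucket_scan_ranges(b"0000000000", b"0000000099")))  # 16
```

Deriving the salt from the key (rather than randomly) keeps writes and reads consistent: the same key always lands in the same bucket.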
THE USE CASES
• Historics
  • Exports archive data
  • Slightly different from the Exporter
  • Much larger time ranges (1-3 months)
  • Controlled access to the Hadoop cluster with efficient job scheduling
  • Unfiltered input data
  • Therefore longer processing times
  • Hence more optimization is required
HISTORICS
THE LESSONS
• Tune, tune, tune (defaults == BAD)
  • Based on the use case, tune:
    • Heap
    • Block size
    • Memstore size
• Keep the number of column families low
• Be aware of the hot-spotting issue when writing time-series data
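Most of the tuning above lives in `hbase-site.xml` (heap size is set via `HBASE_HEAPSIZE` in `hbase-env.sh`, and block size per column family at table creation). The property names below are real HBase settings, but the values are illustrative assumptions, not DataSift's actual configuration:

```xml
<!-- hbase-site.xml: example values only; tune per workload -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- flush memstores at 256 MB (write-heavy workloads) -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value> <!-- fraction of heap given to the read block cache -->
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value> <!-- RPC handler threads per RegionServer -->
</property>
```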
THE LESSONS
• Use compression (e.g. Snappy)
• Ops need an intimate understanding of the system
• Monitor system metrics (GC, CPU, compactions, I/O) and application metrics (writes/sec etc.)
• Don't be afraid to fiddle with the HBase code
• Using a distribution is advisable
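Compression and pre-splitting are both declared when a table is created in the HBase shell; the table and column-family names below are hypothetical, not DataSift's schema:

```
hbase> create 'archive', {NAME => 'd', COMPRESSION => 'SNAPPY'},
         SPLITS => ['1', '2', '3', '4']
```

Snappy trades a modest compression ratio for very low CPU cost, which suits write-heavy clusters; the `SPLITS` list pre-creates region boundaries so writes spread out from day one instead of hammering a single initial region.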
QUESTIONS?

We are hiring: http://datasift.com/about-us/careers