Slides from a presentation at Hadoop Summit 2011 on Facebook's Data Freeway system.
Data Freeway : Scaling Out to Realtime
Eric Hwang, Sam Rash
{ehwang,rash}@fb.com
Agenda
» Data at Facebook
» Data Freeway System Overview
» Realtime Requirements
» Realtime Components
› Calligraphus/Scribe
› HDFS use case and modifications
› Calligraphus: a ZooKeeper use case
› ptail
› Puma
» Future Work
Big Data, Big Applications / Data at Facebook
» Lots of data
› more than 500 million active users
› 50 million users update their statuses at least once each day
› more than 1 billion photos uploaded each month
› more than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
› data rate: over 7 GB/second
» Numerous products can leverage the data
› Revenue related: Ads Targeting
› Product/User Growth related: AYML, PYMK, etc.
› Engineering/Operation related: Automatic Debugging
› Puma: streaming queries
Data Freeway System Diagram
Realtime Requirements
› Scalability: 10-15 GB/second
› Reliability: no single point of failure
› Data loss SLA: 0.01%
• loss due to hardware: at most 1 out of 10,000 machines can lose data
› Delay of less than 10 sec for 99% of data
• typically we see 2s
› Easy to use: as simple as 'tail -f /var/log/my-log-file'
Scribe
• Scalable distributed logging framework
• Very easy to use:
• scribe_log(string category, string message)
• Mechanics:
• runs on every machine at Facebook
• built on top of Thrift
• collects log data into a set of destinations
• buffers data on local disk if the network is down
• History:
• 2007: started at Facebook
• Oct 2008: open-sourced
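The scribe_log() call above is a thin wrapper over a single Thrift RPC. A minimal sketch of a Java client, assuming the stubs generated from the open-source scribe.thrift; the port (1463, the conventional Scribe port) and category are illustrative:

import java.util.Collections;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class ScribeLogExample {
    public static void main(String[] args) throws Exception {
        // Scribe runs locally on every machine; connect over framed transport.
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        // Scribe expects non-strict binary protocol writes.
        TBinaryProtocol protocol = new TBinaryProtocol(transport, false, false);
        scribe.Client client = new scribe.Client(protocol);

        transport.open();
        // scribe_log(category, message) is just Log() with a one-entry batch.
        ResultCode rc = client.Log(Collections.singletonList(
                new LogEntry("my_category", "hello from my app\n")));
        if (rc == ResultCode.TRY_LATER) {
            // server is backed up; caller should buffer and retry
        }
        transport.close();
    }
}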
Calligraphus
» What?
› Scribe-compatible server written in Java
› emphasis on a modular, testable code base, and performance
» Why?
› extract a simpler design from the existing Scribe architecture
› cleaner integration with the Hadoop ecosystem
• HDFS, ZooKeeper, HBase, Hive
» History
› in production since November 2010
› ZooKeeper integration since March 2011
HDFS : a different use case
» Message hub
› add concurrent reader support and sync
› writers + concurrent readers form a pub/sub model
HDFS : add Sync
» Sync
› implemented in 0.20 (HDFS-200)
• partial chunks are flushed
• blocks are persisted
› provides durability
› lowers write-to-read latency
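In the 0.20 lineage, sync is exposed through FSDataOutputStream. A minimal sketch of a writer that makes data visible to concurrent readers, assuming a build with the HDFS-200 sync support (the path and sync interval are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncingWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/freeway/category1/current"));
        for (int i = 0; i < 1000; i++) {
            out.write(("event " + i + "\n").getBytes("UTF-8"));
            if (i % 100 == 99) {
                // flush partial chunks to the datanodes so concurrent
                // readers can see the bytes before the block is closed
                out.sync();
            }
        }
        out.close();
    }
}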
HDFS : Concurrent Reads Overview
» Without changes, stock Hadoop 0.20 does not allow access to the block being written
» Realtime apps need to read the block being written in order to achieve < 10s latency
HDFS : Concurrent Reads Implementation
1. DFSClient asks the Namenode for blocks and locations
2. DFSClient asks the Datanode for the length of the block being written
3. DFSClient opens the last block
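From the client side, a tailing reader simply re-opens the file and reads past its last offset; the re-open triggers steps 1-3 above. A minimal sketch, assuming the concurrent-read support described here (path and poll interval are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TailingReader {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/freeway/category1/current");
        long offset = 0;
        byte[] buf = new byte[64 * 1024];
        while (true) {
            // re-open so the client re-fetches block locations and asks
            // the datanode for the current length of the in-progress block
            FSDataInputStream in = fs.open(path);
            in.seek(offset);
            int n;
            while ((n = in.read(buf)) > 0) {
                offset += n; // process buf[0..n) here
            }
            in.close();
            Thread.sleep(1000); // poll for newly synced data
        }
    }
}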
HDFS : Checksum Problem
» Issue: data and checksum updates are not atomic for the last chunk
» 0.20-append fix:
› detect when data is out of sync with the checksum using a visible length
› recompute the checksum on the fly
» 0.22 fix:
› last chunk data and checksum are kept in memory for reads
Calligraphus: Log Writer
[Diagram: Scribe categories (Category 1-3) fan out to Calligraphus servers, which persist to HDFS]
How to persist to HDFS?
Calligraphus (Simple)
[Diagram: every Calligraphus server writes every category to HDFS, so number of categories x number of servers = total number of directories]
Calligraphus (Stream Consolidation)
[Diagram: routers consolidate each Scribe category onto a single writer, coordinated via ZooKeeper, so total number of directories = number of categories]
ZooKeeper: Distributed Map
» Design
› ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
› canonical ZooKeeper leader elections under each bucket for bucket ownership (see the sketch after the diagram below)
› independent load management: leaders can release tasks
› reader-side caches
› frequent sync with the policy DB
[Diagram: ZooKeeper tree with Root at the top, categories A-D below it, and buckets 1-5 under each category]
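A minimal sketch of the per-bucket leader election, using the standard ZooKeeper recipe with ephemeral sequential nodes (paths are illustrative; this is not the actual Calligraphus code):

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BucketElection {
    // Returns true if this server now owns /root/<category>/<bucket>.
    public static boolean tryBecomeOwner(ZooKeeper zk, String category, int bucket)
            throws Exception {
        String dir = "/root/" + category + "/" + bucket;
        // each contender creates an ephemeral sequential node under the bucket
        String me = zk.create(dir + "/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        // Lowest sequence number wins; losers watch their predecessor.
        // Ephemeral nodes vanish when a writer's session dies, giving the
        // quick failover described on the next slide, and a leader can
        // release a task simply by deleting its node.
        List<String> children = zk.getChildren(dir, false);
        Collections.sort(children);
        return me.endsWith("/" + children.get(0));
    }
}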
ZooKeeper: Distributed Map
» Real-time Properties
› Highly available
› No centralized control
› Fast mapping lookups
› Quick failover for writer failures
› Adapts to new categories and changing throughput
Distributed Map: Performance Summary
» Bootstrap (~3000 categories)
› Full election participation in 30 seconds
› Identify all election winners in 5-10 seconds
› Stable mapping converges in about three minutes
» Election or failure response usually < 1 second
› Worst case bounded in tens of seconds
Canonical Realtime Application
» Examples
› Realtime search indexing
› Site integrity: spam detection
› Streaming metrics
Parallel Tailer
• Why?
• Access data in 10 seconds or less
• Data stream interface
• Command-line tool to tail the log
• Easy to use: ptail -f cat1
• Supports checkpoints: ptail -cp XXX cat1
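Because ptail is an ordinary command-line tool writing to stdout, any consumer can treat it as a data stream. A minimal sketch of consuming it from Java (the category name is an example):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PtailConsumer {
    public static void main(String[] args) throws Exception {
        // launch ptail and read its stdout line by line
        Process p = new ProcessBuilder("ptail", "-f", "cat1").start();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            // each line is one logged message from the category
            System.out.println(line);
        }
    }
}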
Canonical Realtime ptail Application
Puma Overview
» Realtime analytics platform
» Metrics
› count, sum, unique count, average, percentile
» Uses ptail checkpointing for accurate calculations in the case of failure
» Puma nodes are sharded by keys in the input stream
» HBase for persistence
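A minimal sketch of a Puma-style counting node, assuming a tab-separated input stream (e.g. piped in from ptail) and an HBase table named puma_metrics; the table name, column layout, and input format are assumptions, not Puma's actual schema:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CountingNode {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "puma_metrics");
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            // this node only sees the keys in its shard of the input stream
            String key = line.split("\t", 2)[0];
            // bump the per-key counter in HBase
            table.incrementColumnValue(Bytes.toBytes(key),
                    Bytes.toBytes("m"), Bytes.toBytes("count"), 1L);
        }
        table.close();
    }
}

With ptail's checkpoint support, such a node can persist its checkpoint alongside the counters and resume from it after a failure, which is what makes the calculations accurate across restarts.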
Puma Write Path
Puma Read Path
Summary - Data Freeway
» Highlights
› Scalable: 4-5 GB/second
› Reliable: no single point of failure; < 0.01% data loss with hardware failures
› Realtime: delay < 10 sec (typically 2s)
» Open source
› Scribe, HDFS
› Calligraphus/Continuous Copier/Loader/ptail (pending)
» Applications
› Realtime analytics
› Search/Feed
› Spam detection/Ads click prediction (in the future)
Future Work
» Puma
› enhance functionality: add application-level transactions on HBase
› streaming SQL interface
» Seekable compression format
› for large categories, the files are 400-500 MB
› need an efficient way to get to the end of the stream
› Simple Seekable Format (a sketch follows this list)
• container with compressed/uncompressed stream offsets
• contains data segments which are independent virtual files
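One way such a container could look: independently compressed segments plus a footer index mapping uncompressed offsets to compressed offsets, so a reader can seek near the end without decompressing the whole file. A minimal sketch, assuming GZIP segments and this hypothetical layout (not the actual Data Freeway format):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class SeekableFormatWriter {
    private final DataOutputStream out;
    // index entries: {uncompressed offset, compressed offset}
    private final List<long[]> index = new ArrayList<long[]>();
    private long uncompressedOff = 0;
    private long compressedOff = 0;

    public SeekableFormatWriter(String path) throws Exception {
        out = new DataOutputStream(new FileOutputStream(path));
    }

    // Each segment is an independent virtual file: it can be decompressed
    // without touching any earlier segment.
    public void writeSegment(byte[] data) throws Exception {
        index.add(new long[] { uncompressedOff, compressedOff });
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(data);
        gz.close();
        byte[] compressed = buf.toByteArray();
        out.writeInt(compressed.length); // segment header: compressed size
        out.write(compressed);
        uncompressedOff += data.length;
        compressedOff += 4 + compressed.length;
    }

    // Footer: the offset index plus its entry count, so a reader can load
    // the index from the tail and binary-search for any stream position.
    public void close() throws Exception {
        for (long[] e : index) {
            out.writeLong(e[0]);
            out.writeLong(e[1]);
        }
        out.writeInt(index.size());
        out.close();
    }
}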
Fin
» Questions?