data freeway : scaling out to realtime eric hwang, sam rash {ehwang,rash}@fb.com
![Page 1: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/1.jpg)
Data Freeway : Scaling Out to Realtime
Eric Hwang, Sam Rash {ehwang,rash}@fb.com
![Page 2: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/2.jpg)
Agenda
» Data at Facebook
» Data Freeway System Overview
» Realtime Requirements
» Realtime Components
  › Calligraphus/Scribe
  › HDFS use case and modifications
  › Calligraphus: a ZooKeeper use case
  › ptail
  › Puma
» Future Work
![Page 3: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/3.jpg)
Big Data, Big Applications / Data at Facebook
» Lots of data
  › More than 500 million active users
  › 50 million users update their statuses at least once each day
  › More than 1 billion photos uploaded each month
  › More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  › Data rate: over 7 GB/second
» Numerous products can leverage the data
  › Revenue related: Ads Targeting
  › Product/User Growth related: AYML, PYMK, etc.
  › Engineering/Operation related: Automatic Debugging
  › Puma: streaming queries
![Page 4: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/4.jpg)
Data Freeway System Diagram
![Page 5: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/5.jpg)
Realtime Requirements
› Scalability: 10-15 GBytes/second
› Reliability: No single point of failure
› Data loss SLA: 0.01%
  • loss due to hardware: means at most 1 out of 10,000 machines can lose data
› Delay of less than 10 sec for 99% of data
  • Typically we see 2s
› Easy to use: as simple as 'tail -f /var/log/my-log-file'
![Page 6: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/6.jpg)
Scribe
• Scalable distributed logging framework
• Very easy to use:
  • scribe_log(string category, string message)
• Mechanics:
  • Runs on every machine at Facebook
  • Built on top of Thrift
  • Collect the log data into a bunch of destinations
  • Buffer data on local disk if network is down
• History:
  • 2007: Started at Facebook
  • 2008 Oct: Open-sourced
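The "buffer data on local disk if network is down" behaviour can be sketched as below. This is a toy illustration only, not the real Scribe client: `BufferingLogger`, `FlakyNetwork`, and the in-memory queue standing in for the local disk buffer are all hypothetical names introduced here.

```python
from collections import deque
from typing import Callable, Deque, Tuple

Entry = Tuple[str, str]  # (category, message)


class BufferingLogger:
    """Toy stand-in for a Scribe-style client (hypothetical, not the Scribe API)."""

    def __init__(self, send: Callable[[Entry], bool]) -> None:
        self._send = send                      # returns True when delivery succeeded
        self._buffer: Deque[Entry] = deque()   # stands in for the local disk buffer

    def scribe_log(self, category: str, message: str) -> None:
        self._buffer.append((category, message))
        self.flush()

    def flush(self) -> None:
        # Deliver in order; stop at the first failure and keep the rest buffered.
        while self._buffer:
            if not self._send(self._buffer[0]):
                return
            self._buffer.popleft()

    def pending(self) -> int:
        return len(self._buffer)


class FlakyNetwork:
    """Hypothetical transport that can be switched on and off."""

    def __init__(self) -> None:
        self.up = False
        self.delivered = []

    def send(self, entry: Entry) -> bool:
        if not self.up:
            return False
        self.delivered.append(entry)
        return True


net = FlakyNetwork()
log = BufferingLogger(net.send)
log.scribe_log("cat1", "a")   # network down: entry stays buffered
log.scribe_log("cat1", "b")
buffered_while_down = log.pending()
net.up = True
log.flush()                   # network back: entries are delivered in order
```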
![Page 7: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/7.jpg)
Calligraphus
» What
  › Scribe-compatible server written in Java
  › emphasis on modular, testable code-base, and performance
» Why?
  › extract simpler design from existing Scribe architecture
  › cleaner integration with Hadoop ecosystem
    • HDFS, ZooKeeper, HBase, Hive
» History
  › In production since November 2010
  › ZooKeeper integration since March 2011
![Page 8: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/8.jpg)
HDFS : a different use case
» message hub
  › add concurrent reader support and sync
  › writers + concurrent readers a form of pub/sub model
![Page 9: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/9.jpg)
HDFS : add Sync
» Sync
  › implemented in 0.20 (HDFS-200)
    • partial chunks are flushed
    • blocks are persisted
  › provides durability
  › lowers write-to-read latency
![Page 10: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/10.jpg)
HDFS : Concurrent Reads Overview
» Without changes, stock Hadoop 0.20 does not allow access to the block being written
» Need to read the block being written for realtime apps in order to achieve < 10s latency
![Page 11: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/11.jpg)
HDFS : Concurrent Reads Implementation
1. DFSClient asks Namenode for blocks and locations
2. DFSClient asks Datanode for length of block being written
3. DFSClient opens last block
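The three steps can be sketched with in-memory stand-ins. `ToyNamenode`, `ToyDatanode`, and `read_open_file` are hypothetical names for illustration and do not mirror the actual HDFS classes or RPCs; the sketch assumes the namenode knows the length of finished blocks and only the block still being written needs a datanode query.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyDatanode:
    # block id -> bytes synced so far (the writer keeps appending)
    blocks: Dict[str, bytearray] = field(default_factory=dict)

    def visible_length(self, block_id: str) -> int:
        return len(self.blocks[block_id])

    def read(self, block_id: str, length: int) -> bytes:
        return bytes(self.blocks[block_id][:length])


@dataclass
class ToyNamenode:
    # path -> ordered (block id, datanode, length); length is None while
    # the block is still being written.
    files: Dict[str, List[Tuple[str, ToyDatanode, Optional[int]]]] = field(
        default_factory=dict
    )

    def block_locations(self, path: str):
        return list(self.files[path])


def read_open_file(namenode: ToyNamenode, path: str) -> bytes:
    out = bytearray()
    for block_id, datanode, length in namenode.block_locations(path):  # step 1
        if length is None:
            length = datanode.visible_length(block_id)                 # step 2
        out += datanode.read(block_id, length)                         # step 3
    return bytes(out)


dn = ToyDatanode({"b1": bytearray(b"hello "), "b2": bytearray(b"wor")})
nn = ToyNamenode({"/logs/cat1": [("b1", dn, 6), ("b2", dn, None)]})
before = read_open_file(nn, "/logs/cat1")
dn.blocks["b2"] += b"ld"      # the writer appends and syncs more data
after = read_open_file(nn, "/logs/cat1")
```

A reader that repeats the call sees the file grow without waiting for the last block to be closed, which is the property the realtime consumers need.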
![Page 12: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/12.jpg)
HDFS : Checksum Problem
» Issue: data and checksum updates are not atomic for last chunk
» 0.20-append fix:
  › detect when data is out of sync with checksum using a visible length
  › recompute checksum on the fly
» 0.22 fix
  › last chunk data and checksum kept in memory for reads
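A minimal sketch of the idea behind the 0.20-append fix, assuming a CRC32 per chunk and a reader that is told the visible length. The function name `read_last_chunk` and its return shape are hypothetical; the real datanode code path is more involved.

```python
import zlib
from typing import Tuple


def read_last_chunk(
    on_disk: bytes, stored_crc: int, visible_length: int
) -> Tuple[bytes, int, bool]:
    """Return (data, crc, recomputed) for the last, still-growing chunk."""
    # Never expose bytes beyond the visible length, even if more are on disk.
    data = on_disk[:visible_length]
    if zlib.crc32(data) == stored_crc:
        return data, stored_crc, False
    # Data and checksum were not updated atomically: recompute on the fly.
    return data, zlib.crc32(data), True


# The checksum file still covers only the first four bytes of the chunk.
stale_crc = zlib.crc32(b"abcd")
data, crc, recomputed = read_last_chunk(b"abcdef", stale_crc, 6)
```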
![Page 13: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/13.jpg)
Calligraphus: Log Writer
[Diagram. Labels: Scribe categories; Calligraphus Servers (Server ×3); HDFS; Category 1, Category 2, Category 3]
How to persist to HDFS?
![Page 14: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/14.jpg)
Calligraphus (Simple)
[Diagram. Labels: Scribe categories; Calligraphus Servers (Server ×3); HDFS; Category 1, Category 2, Category 3]
Number of categories x Number of servers = Total number of directories
![Page 15: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/15.jpg)
Calligraphus (Stream Consolidation)
[Diagram. Labels: Scribe categories; Calligraphus Servers (Router ×3, Writer ×3); ZooKeeper; HDFS; Category 1, Category 2, Category 3]
Number of categories = Total number of directories
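The difference between the simple layout and stream consolidation is easiest to see as arithmetic. In the sketch below the category count borrows the "~3000 categories" figure from the bootstrap numbers in this deck; the server count is a made-up value, since the slides do not state one.

```python
def directories_simple(categories: int, servers: int) -> int:
    # Simple layout: every server writes its own directory per category.
    return categories * servers


def directories_consolidated(categories: int) -> int:
    # Stream consolidation: one directory per category.
    return categories


CATEGORIES = 3000   # "~3000 categories" from the bootstrap figures
SERVERS = 100       # hypothetical; not given in the slides

simple = directories_simple(CATEGORIES, SERVERS)
consolidated = directories_consolidated(CATEGORIES)
```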
![Page 16: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/16.jpg)
ZooKeeper: Distributed Map
» Design
  › ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
  › Canonical ZooKeeper leader elections under each bucket for bucket ownership
  › Independent load management – leaders can release tasks
  › Reader-side caches
  › Frequent sync with policy db
[Diagram: tree with Root at the top, category nodes A, B, C, D beneath it, and bucket nodes 1-5 under each category]
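Bucket ownership by leader election can be illustrated with an in-memory simulation of the usual ZooKeeper recipe, in which each candidate creates an ephemeral sequential node and the lowest sequence number wins. `ToyElections` and the writer names are hypothetical; a real deployment would use a ZooKeeper client and watches rather than this class.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class ToyElections:
    """In-memory stand-in for ephemeral sequential nodes under a bucket path."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._candidates: Dict[str, List[Tuple[int, str]]] = defaultdict(list)

    def join(self, path: str, writer: str) -> None:
        # Like creating an ephemeral sequential node under the bucket.
        self._candidates[path].append((self._next_seq, writer))
        self._next_seq += 1

    def leave(self, path: str, writer: str) -> None:
        # Covers both a writer failure and a leader releasing the task.
        self._candidates[path] = [
            (seq, w) for seq, w in self._candidates[path] if w != writer
        ]

    def owner(self, path: str) -> Optional[str]:
        entries = self._candidates[path]
        return min(entries)[1] if entries else None


zk = ToyElections()
bucket = "/root/cat1/3"
zk.join(bucket, "writer-a")
zk.join(bucket, "writer-b")
first_owner = zk.owner(bucket)     # lowest sequence number wins
zk.leave(bucket, "writer-a")       # owner fails or releases the bucket
second_owner = zk.owner(bucket)    # the next candidate takes over
```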
![Page 17: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/17.jpg)
ZooKeeper: Distributed Map
» Real-time Properties
  › Highly available
  › No centralized control
  › Fast mapping lookups
  › Quick failover for writer failures
  › Adapts to new categories and changing throughput
![Page 18: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/18.jpg)
Distributed Map: Performance Summary
» Bootstrap (~3000 categories)
  › Full election participation in 30 seconds
  › Identify all election winners in 5-10 seconds
  › Stable mapping converges in about three minutes
» Election or failure response usually <1 second
  › Worst case bounded in tens of seconds
![Page 19: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/19.jpg)
Canonical Realtime Application
» Examples
  › Realtime search indexing
  › Site integrity: spam detection
  › Streaming metrics
![Page 20: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/20.jpg)
Parallel Tailer
• Why?
  • Access data in 10 seconds or less
  • Data stream interface
• Command-line tool to tail the log
  • Easy to use: ptail -f cat1
  • Support checkpoint: ptail -cp XXX cat1
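The slides do not say what a ptail checkpoint contains. The sketch below assumes the simplest possible form, a record offset into one category's stream, to show how a consumer could resume after a restart without skipping or re-reading records. `tail_from` is a hypothetical helper, not part of ptail.

```python
from typing import List, Tuple


def tail_from(
    stream: List[str], checkpoint: int, max_records: int
) -> Tuple[List[str], int]:
    """Return the next batch of records and the checkpoint to resume from."""
    records = stream[checkpoint:checkpoint + max_records]
    return records, checkpoint + len(records)


stream = ["a", "b", "c"]
batch1, cp1 = tail_from(stream, 0, 2)      # consumer reads two records
stream.append("d")                         # the log keeps growing
batch2, cp2 = tail_from(stream, cp1, 10)   # restart from the saved checkpoint
```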
![Page 21: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/21.jpg)
Canonical Realtime ptail Application
![Page 22: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/22.jpg)
Puma Overview
» realtime analytics platform
» metrics
  › count, sum, unique count, average, percentile
» uses ptail checkpointing for accurate calculations in the case of failure
» Puma nodes are sharded by keys in the input stream
» HBase for persistence
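A rough sketch of why checkpointing keeps counts accurate across failures: if a node stores its aggregates together with the input position they cover, replaying input after a restart does not double count. `PumaNodeSketch` and `shard_for` are hypothetical, and the CRC32-based sharding is an assumption; the slides do not describe Puma's actual sharding function or storage layout.

```python
import zlib
from collections import Counter
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    # Deterministic key-to-shard mapping (assumed; any stable hash would do).
    return zlib.crc32(key.encode("utf-8")) % num_shards


class PumaNodeSketch:
    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.checkpoint = 0   # input position already folded into counts

    def process(self, stream: List[str], upto: int) -> None:
        upto = min(upto, len(stream))
        # Only records past the checkpoint are counted, so replays are no-ops.
        for key in stream[self.checkpoint:upto]:
            self.counts[key] += 1
        self.checkpoint = max(self.checkpoint, upto)


stream = ["like", "view", "like", "view", "like"]
node = PumaNodeSketch()
node.process(stream, 3)
node.process(stream, 3)    # replay after a simulated failure: nothing changes
after_replay = dict(node.counts)
node.process(stream, 5)
final = dict(node.counts)
```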
![Page 23: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/23.jpg)
Puma Write Path
![Page 24: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/24.jpg)
Puma Read Path
![Page 25: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/25.jpg)
Summary - Data Freeway
» Highlights:
  › Scalable: 4G-5G Bytes/Second
  › Reliable: No single-point of failure; < 0.01% data loss with hardware failures
  › Realtime: delay < 10 sec (typically 2s)
» Open-Source
  › Scribe, HDFS
  › Calligraphus/Continuous Copier/Loader/ptail (pending)
» Applications
  › Realtime Analytics
  › Search/Feed
  › Spam Detection/Ads Click Prediction (in the future)
![Page 26: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/26.jpg)
Future Work
» Puma
  › Enhance functionality: add application-level transactions on HBase
  › Streaming SQL interface
» Seekable Compression format
  › for large categories, the files are 400-500 MB
  › need an efficient way to get to the end of the stream
  › Simple Seekable Format
    • container with compressed/uncompressed stream offsets
    • contains data segments which are independent virtual files
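One way to picture a seekable container is a sequence of independently compressed segments plus an index of (uncompressed offset, compressed offset) pairs, so a reader near the end of the stream only decompresses the last segment. The layout below is an illustrative guess using zlib, not the actual Simple Seekable Format; `build` and `read_from` are hypothetical names.

```python
import zlib
from bisect import bisect_right
from typing import List, Tuple

Index = List[Tuple[int, int]]  # (uncompressed offset, compressed offset)


def build(segments: List[bytes]) -> Tuple[bytes, Index]:
    blob = bytearray()
    index: Index = []
    uncompressed_offset = 0
    for segment in segments:
        index.append((uncompressed_offset, len(blob)))
        blob += zlib.compress(segment)   # each segment is independent
        uncompressed_offset += len(segment)
    return bytes(blob), index


def read_from(blob: bytes, index: Index, offset: int) -> bytes:
    """Return uncompressed bytes from `offset` (>= 0) to the end of the stream."""
    starts = [u for u, _ in index]
    first = bisect_right(starts, offset) - 1   # segment containing `offset`
    out = bytearray()
    for j in range(first, len(index)):
        c_start = index[j][1]
        c_end = index[j + 1][1] if j + 1 < len(index) else len(blob)
        out += zlib.decompress(blob[c_start:c_end])
    return bytes(out[offset - index[first][0]:])


blob, index = build([b"aaaa", b"bbbb", b"cccc"])
tail = read_from(blob, index, 9)     # touches only the last segment
```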
![Page 27: Data Freeway : Scaling Out to Realtime Eric Hwang, Sam Rash {ehwang,rash}@fb.com](https://reader035.vdocuments.us/reader035/viewer/2022081515/56649e235503460f94b10ba6/html5/thumbnails/27.jpg)
Fin
» Questions?