Slides from a presentation at Hadoop Summit 2011 on Facebook's Data Freeway system.
Data Freeway : Scaling Out to Realtime
Eric Hwang, Sam Rash
{ehwang,rash}@fb.com
Agenda
» Data at Facebook
» Data Freeway System Overview
» Realtime Requirements
» Realtime Components
› Calligraphus/Scribe
› HDFS use case and modifications
› Calligraphus: a ZooKeeper use case
› ptail
› Puma
» Future Work
Big Data, Big Applications / Data at Facebook
» Lots of data
› more than 500 million active users
› 50 million users update their statuses at least once each day
› more than 1 billion photos uploaded each month
› more than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
› data rate: over 7 GB/second
» Numerous products can leverage the data
› Revenue related: Ads Targeting
› Product/User Growth related: AYML, PYMK, etc.
› Engineering/Operation related: Automatic Debugging
› Puma: streaming queries
Data Freeway System Diagram
Realtime Requirements
› Scalability: 10-15 GB/second
› Reliability: no single point of failure
› Data loss SLA: 0.01%
• loss due to hardware: at most 1 out of 10,000 machines can lose data
› Delay of less than 10 sec for 99% of data
• typically we see 2s
› Easy to use: as simple as 'tail -f /var/log/my-log-file'
Scribe
• Scalable distributed logging framework
• Very easy to use:
• scribe_log(string category, string message)
• Mechanics:
• runs on every machine at Facebook
• built on top of Thrift
• collects log data into a set of destinations
• buffers data on local disk if the network is down
• History:
• 2007: started at Facebook
• Oct 2008: open-sourced
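The scribe_log() call above is a thin wrapper over a single Thrift RPC. A minimal sketch of a Java client, assuming the stubs generated from the open-source scribe.thrift; the port (1463, the conventional Scribe port) and category are illustrative:

import java.util.Collections;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class ScribeLogExample {
    public static void main(String[] args) throws Exception {
        // Scribe runs locally on every machine; connect over framed transport.
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        // Scribe expects non-strict binary protocol writes.
        TBinaryProtocol protocol = new TBinaryProtocol(transport, false, false);
        scribe.Client client = new scribe.Client(protocol);

        transport.open();
        // scribe_log(category, message) is just Log() with a one-entry batch.
        ResultCode rc = client.Log(Collections.singletonList(
                new LogEntry("my_category", "hello from my app\n")));
        if (rc == ResultCode.TRY_LATER) {
            // server is backed up; caller should buffer and retry
        }
        transport.close();
    }
}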
Calligraphus
» What?
› Scribe-compatible server written in Java
› emphasis on a modular, testable code base, and performance
» Why?
› extract a simpler design from the existing Scribe architecture
› cleaner integration with the Hadoop ecosystem
• HDFS, ZooKeeper, HBase, Hive
» History
› in production since November 2010
› ZooKeeper integration since March 2011
HDFS : a different use case
» Message hub
› add concurrent reader support and sync
› writers + concurrent readers form a pub/sub model
HDFS : add Sync
» Sync
› implemented in 0.20 (HDFS-200)
• partial chunks are flushed
• blocks are persisted
› provides durability
› lowers write-to-read latency
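In the 0.20 lineage, sync is exposed through FSDataOutputStream. A minimal sketch of a writer that makes data visible to concurrent readers, assuming a build with the HDFS-200 sync support (the path and sync interval are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncingWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/freeway/category1/current"));
        for (int i = 0; i < 1000; i++) {
            out.write(("event " + i + "\n").getBytes("UTF-8"));
            if (i % 100 == 99) {
                // flush partial chunks to the datanodes so concurrent
                // readers can see the bytes before the block is closed
                out.sync();
            }
        }
        out.close();
    }
}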
HDFS : Concurrent Reads Overview
» Without changes, stock Hadoop 0.20 does not allow access to the block being written
» Realtime apps need to read the block being written in order to achieve < 10s latency
HDFS : Concurrent Reads Implementation
1. DFSClient asks the Namenode for blocks and locations
2. DFSClient asks the Datanode for the length of the block being written
3. DFSClient opens the last block
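From the client side, a tailing reader simply re-opens the file and reads past its last offset; the re-open triggers steps 1-3 above. A minimal sketch, assuming the concurrent-read support described here (path and poll interval are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TailingReader {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/freeway/category1/current");
        long offset = 0;
        byte[] buf = new byte[64 * 1024];
        while (true) {
            // re-open so the client re-fetches block locations and asks
            // the datanode for the current length of the in-progress block
            FSDataInputStream in = fs.open(path);
            in.seek(offset);
            int n;
            while ((n = in.read(buf)) > 0) {
                offset += n; // process buf[0..n) here
            }
            in.close();
            Thread.sleep(1000); // poll for newly synced data
        }
    }
}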
HDFS : Checksum Problem
» Issue: data and checksum updates are not atomic for the last chunk
» 0.20-append fix:
› detect when data is out of sync with the checksum using a visible length
› recompute the checksum on the fly
» 0.22 fix:
› last chunk data and checksum are kept in memory for reads
Calligraphus: Log Writer
[Diagram: Scribe categories (Category 1-3) fan out to Calligraphus servers, which persist to HDFS]
How to persist to HDFS?
Calligraphus (Simple)
[Diagram: every Calligraphus server writes every category to HDFS, so number of categories x number of servers = total number of directories]
Calligraphus (Stream Consolidation)
[Diagram: routers consolidate each Scribe category onto a single writer, coordinated via ZooKeeper, so total number of directories = number of categories]
ZooKeeper: Distributed Map
» Design
› ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
› canonical ZooKeeper leader elections under each bucket for bucket ownership (see the sketch after the diagram below)
› independent load management: leaders can release tasks
› reader-side caches
› frequent sync with the policy DB
[Diagram: ZooKeeper tree with Root at the top, categories A-D below it, and buckets 1-5 under each category]
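A minimal sketch of the per-bucket leader election, using the standard ZooKeeper recipe with ephemeral sequential nodes (paths are illustrative; this is not the actual Calligraphus code):

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BucketElection {
    // Returns true if this server now owns /root/<category>/<bucket>.
    public static boolean tryBecomeOwner(ZooKeeper zk, String category, int bucket)
            throws Exception {
        String dir = "/root/" + category + "/" + bucket;
        // each contender creates an ephemeral sequential node under the bucket
        String me = zk.create(dir + "/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        // Lowest sequence number wins; losers watch their predecessor.
        // Ephemeral nodes vanish when a writer's session dies, giving the
        // quick failover described on the next slide, and a leader can
        // release a task simply by deleting its node.
        List<String> children = zk.getChildren(dir, false);
        Collections.sort(children);
        return me.endsWith("/" + children.get(0));
    }
}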
ZooKeeper: Distributed Map
» Real-time Properties
› Highly available
› No centralized control
› Fast mapping lookups
› Quick failover for writer failures
› Adapts to new categories and changing throughput
Distributed Map: Performance Summary
» Bootstrap (~3000 categories)
› Full election participation in 30 seconds
› Identify all election winners in 5-10 seconds
› Stable mapping converges in about three minutes
» Election or failure response usually < 1 second
› Worst case bounded in tens of seconds
Canonical Realtime Application
» Examples
› Realtime search indexing
› Site integrity: spam detection
› Streaming metrics
Parallel Tailer
• Why?
• Access data in 10 seconds or less
• Data stream interface
• Command-line tool to tail the log
• Easy to use: ptail -f cat1
• Supports checkpoints: ptail -cp XXX cat1
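Because ptail is an ordinary command-line tool writing to stdout, any consumer can treat it as a data stream. A minimal sketch of consuming it from Java (the category name is an example):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PtailConsumer {
    public static void main(String[] args) throws Exception {
        // launch ptail and read its stdout line by line
        Process p = new ProcessBuilder("ptail", "-f", "cat1").start();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            // each line is one logged message from the category
            System.out.println(line);
        }
    }
}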
Canonical Realtime ptail Application
Puma Overview
» Realtime analytics platform
» Metrics
› count, sum, unique count, average, percentile
» Uses ptail checkpointing for accurate calculations in the case of failure
» Puma nodes are sharded by keys in the input stream
» HBase for persistence
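A minimal sketch of a Puma-style counting node, assuming a tab-separated input stream (e.g. piped in from ptail) and an HBase table named puma_metrics; the table name, column layout, and input format are assumptions, not Puma's actual schema:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CountingNode {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "puma_metrics");
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            // this node only sees the keys in its shard of the input stream
            String key = line.split("\t", 2)[0];
            // bump the per-key counter in HBase
            table.incrementColumnValue(Bytes.toBytes(key),
                    Bytes.toBytes("m"), Bytes.toBytes("count"), 1L);
        }
        table.close();
    }
}

With ptail's checkpoint support, such a node can persist its checkpoint alongside the counters and resume from it after a failure, which is what makes the calculations accurate across restarts.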
Puma Write Path
Puma Read Path
Summary - Data Freeway
» Highlights
› Scalable: 4-5 GB/second
› Reliable: no single point of failure; < 0.01% data loss with hardware failures
› Realtime: delay < 10 sec (typically 2s)
» Open source
› Scribe, HDFS
› Calligraphus/Continuous Copier/Loader/ptail (pending)
» Applications
› Realtime analytics
› Search/Feed
› Spam detection/Ads click prediction (in the future)
Future Work
» Puma
› enhance functionality: add application-level transactions on HBase
› streaming SQL interface
» Seekable compression format
› for large categories, the files are 400-500 MB
› need an efficient way to get to the end of the stream
› Simple Seekable Format (a sketch follows this list)
• container with compressed/uncompressed stream offsets
• contains data segments which are independent virtual files
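One way such a container could look: independently compressed segments plus a footer index mapping uncompressed offsets to compressed offsets, so a reader can seek near the end without decompressing the whole file. A minimal sketch, assuming GZIP segments and this hypothetical layout (not the actual Data Freeway format):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class SeekableFormatWriter {
    private final DataOutputStream out;
    // index entries: {uncompressed offset, compressed offset}
    private final List<long[]> index = new ArrayList<long[]>();
    private long uncompressedOff = 0;
    private long compressedOff = 0;

    public SeekableFormatWriter(String path) throws Exception {
        out = new DataOutputStream(new FileOutputStream(path));
    }

    // Each segment is an independent virtual file: it can be decompressed
    // without touching any earlier segment.
    public void writeSegment(byte[] data) throws Exception {
        index.add(new long[] { uncompressedOff, compressedOff });
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(data);
        gz.close();
        byte[] compressed = buf.toByteArray();
        out.writeInt(compressed.length); // segment header: compressed size
        out.write(compressed);
        uncompressedOff += data.length;
        compressedOff += 4 + compressed.length;
    }

    // Footer: the offset index plus its entry count, so a reader can load
    // the index from the tail and binary-search for any stream position.
    public void close() throws Exception {
        for (long[] e : index) {
            out.writeLong(e[0]);
            out.writeLong(e[1]);
        }
        out.writeInt(index.size());
        out.close();
    }
}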
Fin
» Questions?