
Page 1: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS: Now and Future

Todd Lipcon ([email protected])
Sanjay Radia ([email protected])

Page 2: Strata + Hadoop World 2012: HDFS: Now and Future

Outline

Part 1 – Todd Lipcon (Cloudera)

• Namenode HA

• HDFS Performance improvements

• Taking advantage of next-gen hardware

• Storage Efficiency (RAID and compression)

Part 2 - Sanjay Radia (Hortonworks)

• Federation and Generalized storage service

– Leverage it for further innovation

• Snapshots

• Other

– WebHDFS

– Wire compatibility


Page 3: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS HA in Hadoop 2.0.0

• Initial implementation last year
– Introduced Standby NameNode and manual hot failover (see Hadoop World 2011 presentation)
– Handled planned maintenance (e.g. upgrades) but not unplanned outages
• Required a highly-available NFS filer to store NameNode metadata
– Complicated and expensive to set up


Page 4: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS HA Phase 2

• Automatic failover
– Uses Apache ZooKeeper to automatically detect NameNode failures and trigger a failover
– Ops may invoke manual failover for planned maintenance windows
• Removed dependency on NFS storage
– HDFS HA is entirely self-contained
– No special hardware or software required
– No SPOF anywhere in the system


Page 5: Strata + Hadoop World 2012: HDFS: Now and Future

Automatic Failover

• Each NameNode has a new process called the ZooKeeperFailoverController (ZKFC); a configuration sketch follows below
– Maintains a session to ZooKeeper
– Periodically runs a health check against its local NameNode to verify that it is running properly
• Triggers failover if the health check fails or the ZK session expires
• Operators may still issue manual failover commands for planned maintenance
• Failover time: 30-40 seconds unplanned; 0-3 seconds planned
• Handles all types of faults: machine, software, network, etc.
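For concreteness, here is a minimal sketch of the configuration an HA deployment with automatic failover implies, expressed through Hadoop's Configuration API (in a real cluster these keys live in hdfs-site.xml and core-site.xml). The nameservice name, NameNode IDs, and hostnames are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class HaFailoverConfigSketch {
    // Sketch of the core HA + automatic-failover keys. The nameservice
    // "mycluster", the NameNode IDs "nn1"/"nn2", and all hostnames are
    // hypothetical.
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2.example.com:8020");
        // Enable the ZKFC-driven automatic failover described above.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        // ZooKeeper ensemble the ZKFCs use for failure detection and election.
        conf.set("ha.zookeeper.quorum",
            "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}
```

A planned, manual failover can still be requested from the command line with `hdfs haadmin -failover nn1 nn2`.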


Page 6: Strata + Hadoop World 2012: HDFS: Now and Future

Removed NFS/filer dependency

• Shared storage on NFS is practical for some organizations, but difficult for others
– Complex configuration, custom fencing scripts
– The filer itself must be highly available
– Expensive to buy, expensive to support
– Buggy NFS clients in Linux
• Introduced a new system for reliable edit log storage: QuorumJournalManager


Page 7: Strata + Hadoop World 2012: HDFS: Now and Future

QuorumJournalManager

• Run 3 or 5 JournalNodes, collocated on your existing hardware investment
• Each edit must be committed to a majority of the nodes (i.e. a quorum)
– A minority of nodes may crash or be slow without affecting system availability
– Run N nodes to tolerate (N-1)/2 failures (same as ZooKeeper): 3 JournalNodes tolerate 1 failure, 5 tolerate 2
• Built into HDFS (see the configuration sketch below)
– Designed for existing Hadoop ops teams to understand
– Hadoop Metrics support, full Kerberos support, etc.
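As a sketch of what "built into HDFS" means in practice, the NameNode's shared edits directory simply points at the JournalNode quorum with a qjournal:// URI instead of an NFS mount (hostnames and the journal ID are hypothetical; 8485 is the JournalNode's default RPC port):

```java
import org.apache.hadoop.conf.Configuration;

public class QjmConfigSketch {
    // Point the shared edits directory at a 3-node JournalNode quorum.
    // Hostnames and the journal ID ("mycluster") are hypothetical.
    public static void configureQjm(Configuration conf) {
        conf.set("dfs.namenode.shared.edits.dir",
            "qjournal://jn1.example.com:8485;jn2.example.com:8485;"
            + "jn3.example.com:8485/mycluster");
        // With 3 JournalNodes, an edit commits once 2 acknowledge it, so a
        // single JournalNode can crash or lag without blocking the NameNode.
    }
}
```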


Page 8: Strata + Hadoop World 2012: HDFS: Now and Future

[Architecture diagram: HDFS HA Architecture (with Automatic Failover and QuorumJournalManager). The Active and Standby NameNodes share state through a quorum of JournalNodes. A FailoverController next to each NameNode monitors the health of its NameNode, OS, and hardware, and holds a heartbeat session with ZooKeeper. DataNodes send block reports to both the Active and the Standby; DN fencing ensures DataNodes obey commands only from the Active.]

Page 9: Strata + Hadoop World 2012: HDFS: Now and Future

HA Improvements Summary

• Automatic failover
– Avoids both planned and unplanned downtime
• Non-NFS shared storage
– No need to buy or configure a filer
• Result: HA with no external dependencies
• Available now in HDFS trunk and CDH4.1
• Come to our 5pm talk in this room for more details on these HA improvements!


Page 10: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS Performance Update: 2.x vs 1.x

• Significant speedups from SSE4.2 hardware checksum calculation (2.5-3x less CPU on read path)

• Rewritten read path for fewer memory copies

• Short-circuit past datanodes for 2-3x faster random reads (HBase workloads; see the configuration sketch below)

• I/O scheduling improvements: push down hints to Linux using posix_fadvise()

• Covered in my presentation from Hadoop World 2011
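For the short-circuit read item above, a minimal configuration sketch; the exact keys changed across releases, so treat these (the ones later 2.x versions use) as an assumption rather than a prescription for any particular version:

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitReadSketch {
    // Enable short-circuit local reads, letting a client co-located with
    // the DataNode read block files directly instead of over TCP.
    public static void enableShortCircuit(Configuration conf) {
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // Unix domain socket the DataNode and client share to exchange
        // block file descriptors (hypothetical path).
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
    }
}
```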


Page 11: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS Performance: Recent Work

• Completed
– Zero-copy read for libhdfs (2-3x improvement for C++ clients like Impala reading cached data)
– Expose the mapping of blocks to disks: 2x improvement by avoiding contention on slower drives (HDFS-3672)
• In progress
– Using native checksum computation on the write path
– Avoiding copies and allocation on the write path


Page 12: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS Performance Benchmarks (as of June 2012)

[Bar chart: read and write throughput, in MB/sec (scale 0-1000), for raw ext4, HDFS, and HDFS with disk awareness.]

Dual quad-core machine, 12x2TB 7200RPM drives; measured max disk throughput at 900MB/sec. Write throughput is CPU bound; improvements in progress bring it to max disk throughput as well. Easily saturates SATA3 bus bandwidth on common hardware.

Page 13: Strata + Hadoop World 2012: HDFS: Now and Future

Hardware Trends

• Denser storage
– 36TB per node already common
– Millions of blocks per DN
• New need to invest in scaling DataNode memory usage
• More RAM
– 64GB common today; 256GB soon inexpensive
– Customers want to explicitly pin recently ingested data in RAM (especially with efficient query engines like Impala)
• Solid-state storage (SSD, FusionIO, etc.)
– HDFS should transparently or explicitly migrate hot random-access data to/from flash
– Hierarchical storage management


Page 14: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS Storage Efficiency

• Many customers are expanding their clusters simply to add storage
– How can we better utilize the disks they already have?
• RAID (Reed-Solomon coding)
– Store blocks at low replication; keep parity blocks to allow reconstruction if blocks are lost
– Effective replication: 1.5x with the same durability, but less locality (e.g. keeping one replica of each block plus Reed-Solomon parity blocks, rather than three full replicas)
• Transparent compression
– Automatically detect infrequently used files and transparently re-compress them with Snappy, gzip, bzip2, or LZMA
– Cloudera workload traces indicate 10% of files are accessed 90% of the time!


Page 15: Strata + Hadoop World 2012: HDFS: Now and Future

Outline

Part 1 – Todd Lipcon (Cloudera)

• Namenode HA

• HDFS Performance improvements

• Taking advantage of next-gen hardware

• Storage Efficiency (RAID and compression)

Part 2 - Sanjay Radia (Hortonworks)

• Federation and Generalized storage service

– Leverage it for further innovation

• Snapshots

• Other

– WebHDFS

– Wire compatibility
– HA in Hadoop 1!


Page 16: Strata + Hadoop World 2012: HDFS: Now and Future

Federation: Generalized Block Storage

• Block storage as a generic storage service
– The set of blocks for a Namespace Volume is called a Block Pool
– DNs store blocks for all the Namespace Volumes – no partitioning
• Multiple independent NameNodes and Namespace Volumes in a cluster (see the client sketch after the diagram)
– Namespace Volume = Namespace + Block Pool


[Diagram: Block Pools as Common Storage. Independent NameNodes NN-1 … NN-k … NN-n each manage a namespace (NS1 … NS k, including a foreign namespace NS n) and its corresponding Block Pool (Pool 1 … Pool k … Pool n); DataNodes DN 1 … DN m hold blocks for all pools, forming the shared block storage layer.]
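Under federation, a client addresses each NameNode by its own URI; a minimal sketch with hypothetical hostnames:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederationAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Two independent namespaces in the same cluster, each served by
        // its own NameNode (hostnames are hypothetical).
        FileSystem userFs =
            FileSystem.get(URI.create("hdfs://nn-user.example.com:8020/"), conf);
        FileSystem logsFs =
            FileSystem.get(URI.create("hdfs://nn-logs.example.com:8020/"), conf);
        // Both namespaces store their blocks on the same DataNodes, but in
        // separate block pools.
        System.out.println(userFs.exists(new Path("/user/alice")));
        System.out.println(logsFs.exists(new Path("/logs/2012/10")));
    }
}
```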

Page 17: Strata + Hadoop World 2012: HDFS: Now and Future

HDFS' Generic Storage Service: Opportunities for Innovation

• Federation – distributed (partitioned) namespace
– Simple and robust due to independent masters
– Scalability, isolation, availability
• New services – independent block pools
– New FS – partial namespace in memory
– MR tmp storage directly on block storage
– Shadow file system – caches HDFS, NFS, S3
• Future: move block management into the DataNodes
– Simplifies namespace/application implementation
– A distributed namenode becomes significantly simpler


[Diagram: a generic storage service layer underneath multiple namespace clients: the HDFS namespace, alternate NN implementations, HBase, and MR tmp.]

Page 18: Strata + Hadoop World 2012: HDFS: Now and Future

Managing Namespaces

• Federation has multiple namespaces
• Don't you need a single global namespace?
– Some tenants want a private namespace
– Analogy: do you create a single DB or a single table?
– Many volumes; share what you want
– Global? The key is to share the data and the names used to access the data
• A client-side mount table can implement global or private namespaces
– Shared mount table => "global" shared view
– Personalized mount table => per-application view
– Share the data that matters by mounting it
• Client-side implementation of mount tables (see the configuration sketch below)
– xInclude from a shared place – global view
– No single point of failure
– No hotspot for root and top-level directories

[Diagram: a client-side mount table maps /home, /project, /tmp, and /data under the root / to namespace volumes NS1, NS2, NS3, and NS4.]
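The client-side mount table is implemented by ViewFs; here is a minimal configuration sketch, assuming a mount table named "clusterX" and hypothetical NameNode hostnames:

```java
import org.apache.hadoop.conf.Configuration;

public class ViewFsMountSketch {
    // Build a client-side mount table. The mount-table name "clusterX"
    // and the NameNode hostnames are hypothetical.
    public static Configuration mountTableConf() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "viewfs://clusterX/");
        // Each link maps a path in the unified view onto a directory in
        // one of the federated namespaces.
        conf.set("fs.viewfs.mounttable.clusterX.link./home",
            "hdfs://nn1.example.com:8020/home");
        conf.set("fs.viewfs.mounttable.clusterX.link./project",
            "hdfs://nn2.example.com:8020/project");
        conf.set("fs.viewfs.mounttable.clusterX.link./data",
            "hdfs://nn3.example.com:8020/data");
        return conf;
    }
}
```

Because the table is evaluated entirely in the client, there is no central mount server to fail or to become a hotspot for the root and top-level directories.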


Page 19: Strata + Hadoop World 2012: HDFS: Now and Future

Next Steps… First-Class Support for Volumes

• NameServer – a container for namespaces
– Lots of small namespace volumes, chosen per user, tenant, or data feed
– Management policies (quota, …)
• Mount tables for a unified namespace
– Centrally managed (xInclude, ZK, …)
• Keep only the working set of the namespace in memory
– Break away from the old NN's full namespace in memory
– Faster startup; billions of names; hundreds of volumes
• Number of NameServers determined by:
– Sum of (namespace working set)
– Sum of (namespace throughput)
– Move namespaces for balancing

[Diagram: NameServers act as containers of namespaces, running on top of a shared storage layer of DataNodes.]


Page 20: Strata + Hadoop World 2012: HDFS: Now and Future

Snapshots

• Take a snapshot of any directory
– Multiple snapshots allowed
• Snapshot metadata is stored in the NameNode
– DataNodes have no knowledge of snapshots
– Blocks are shared
• All regular commands/APIs can be used against snapshots
– e.g. cp /foo/bar/.snapshot/x/y /a/b/z
• New CLIs to create and delete snapshots (see the sketch below)
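Snapshots were still on a feature branch at the time (next slide); in the FileSystem API that eventually shipped, usage looks roughly like this sketch (paths and snapshot names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/foo/bar");  // hypothetical directory
        // Create a named snapshot. The directory must first be made
        // snapshottable (e.g. `hdfs dfsadmin -allowSnapshot /foo/bar`).
        Path snap = fs.createSnapshot(dir, "x");
        // Snapshots are read through the reserved .snapshot path using the
        // ordinary FileSystem API; DataNodes are not involved.
        boolean exists = fs.exists(new Path("/foo/bar/.snapshot/x/y"));
        System.out.println(snap + " contains y: " + exists);
        fs.deleteSnapshot(dir, "x");
    }
}
```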


Page 21: Strata + Hadoop World 2012: HDFS: Now and Future

Snapshots – Status

• HDFS-2802 (feature branch)
– Initial design and prototype – March 2012
– Development active
• Updated design document and test plan posted
– Review meeting – 1st week of November
• 15+ patches
– Expected completion – early December!


Page 22: Strata + Hadoop World 2012: HDFS: Now and Future

Enterprise Use Cases

• Storage fault tolerance – built into the HDFS architecture
– Over seven 9s of data reliability
• High availability
• Standard interfaces
– WebHDFS (REST), FUSE, and NFS access (see the WebHDFS sketch below)
– HttpFS – WebHDFS as a farm of proxy servers
– libwebhdfs – a pure C library for HDFS
• Wire protocol compatibility
– Protocol Buffers
• Rolling upgrades
– Rolling upgrades for dot releases
• Snapshots – under active development
• Disaster recovery
– DistCp does parallel and incremental copies across clusters
– Future: enhance using the journal interface & snapshots
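Since WebHDFS is plain HTTP, any client can use it; a minimal Java sketch that lists a directory through the REST API (host, port, and path are hypothetical; 50070 was the NameNode's default HTTP port at the time):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsListSketch {
    public static void main(String[] args) throws Exception {
        // WebHDFS REST call: list the statuses of a directory's children.
        URL url = new URL(
            "http://namenode.example.com:50070/webhdfs/v1/user/alice?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON FileStatuses response
            }
        }
    }
}
```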

Page 23: Strata + Hadoop World 2012: HDFS: Now and Future

Summary

• HA for the NameNode
– Hot failover; shared storage not required (QJM)
• Performance improvements
– Utilize today's and tomorrow's hardware to their full potential
• Federation and a generalized storage layer
– Opportunities for innovation: partial namespace in memory, shadow/caching file system, MR tmp, etc.
• Wire compatibility, WebHDFS, …
• Snapshots – development well in progress


Page 24: Strata + Hadoop World 2012: HDFS: Now and Future

Questions?
