apache hadoop india summit 2011 keynote talk "hdfs federation" by sanjay radia

Apache Hadoop India Summit 2011

HDFS FederationSanjay Radia, Hadoop Architect Yahoo! Inc

1

Outline

• HDFS - Quick overview• Scaling HDFS - Federation

HDFS Distributed file system

MapReduce Distributed computation

HBase Column store

Pig Dataflow language

Hive Data warehouse

Zookeeper Distributed coordination

Avro Data Serialization

Oozie Workflow

Hadoop Components

4

HDFS

b1

b2

b3 b1

b5

b3 b3

b5

b2

b4b5

b6b2

b3

b4

Namenode

Namespace Metadata & Journal

NamespaceState

Block Map

Heartbeats & Block Reports

Block ID Block Locations

Datanodes

Block ID Data

Backup Namenode

Hierarchal NamespaceFile Name BlockIDs

Horizontally Scale IO and Storage

5

HDFSClient reads and writes

b1

b2

b3 b1

b5

b3 b3

b5

b2

b4b5

b6b2

b3

b4

Client Client

Namenode1 open

2 read2 write

1 create

writewrite

Datanodes

NamespaceState

Block Map

End-to-end checksum

6

HDFS Architecture : Computation close to the data

DataData data data data dataData data data data dataData data data data data

Data data data data dataData data data data dataData data data data data



ResultsData data data dataData data data dataData data data dataData data data dataData data data dataData data data dataData data data dataData data data dataData data data data

Hadoop Cluster

Block 1

Block 1

Block 2

Block 2

Block 2

Block 1

MAP

MAP

MAP

Reduce

Block 3

Block 3

Block 3

Quiz: What Is the Common Attribute?

7

HDFS Actively maintain data reliability

b1

b2

b3 b1

b5

b3 b3

b5

b2

b4b5

b6b2

b3

b4

Namenode

2. copy

3. blockReceived

1. replicate

Datanodes

Bad/lost block replicaBad/lost block replica

Periodically check block checksums

NamespaceState

Block Map

Hadoop at Yahoo!

9

Production

Research

Sandbox

99.2 99.3 99.4 99.5 99.6 99.7 99.8 99.9

99.85

99.47

99.69

Availability SLA

0

50,000

100,000

150,000

200,000

250,000

2006 -Qtr1

2006 -Qtr2

2006 -Qtr3

2006 -Qtr4

2007 -Qtr1

2007 -Qtr2

2007 -Qtr3

2007 -Qtr4

2008 -Qtr1

2008 -Qtr2

2008 -Qtr3

2008 -Qtr4

2009 -Qtr1

2009 -Qtr2

2009 -Qtr3

2009 -Qtr4

2010 -Qtr1

2010 -Qtr2

2010 -Qtr3

Total Nodes = 43,936Total Storage = 206 PB

13,687

22,334

7,803

0 5000 10000 15000 20000 25000

Production

Research

Sandbox

Nodes running Hadoop at Yahoo!

Over 43,000 nodes running Hadoop

1M+ Monthly Hadoop Jobs

10

Scaling Hadoop

Early Gains

• Simple design allowed rapid improvements

• Namespace is all in RAM, simpler locking

• Improved memory usage in 0.16, JVM Heap configuration (Suresh Srinivas)

Growth of number of files and storage is limited by adding RAM to namenode

• 50G heap = 200M “fs objects” = 100M names + 100MBlocks

• 14PB of storage (50MB blocksize)

• 4K nodes

- Job Tracker carries out both job lifecycle management and scheduling

Yahoo’s Response:

• HDFS Federation: horizontal scaling of namespace (0.22)

• Next Generation of Map-Reduce - Complete overhaul of job tracker/task tracker

Goal:

• Clusters of 6000 nodes, 100,000 cores & 10k concurrent jobs, 100 PB raw storage per cluster

6 May 2010

11

Scaling the Name Service: Options

Partial NS (Cache) in memory

Multiple Namespace volumes

All NS in memory Archives

# names

# clients

100M 10B200M 1B 2B

1x

4x

20x

50x

PartialNS in memoryWith Namespace volumes

Not to scale

100x Good isolation properties

Separate Bmaps from NN

20B

Block-reports for Billions of blocks requires rethinking block layer

Distributed NNs

Opportunity:Vertical & Horizontal scaling

Horizontal scaling/federation benefits:– Scale– Isolation, Stability, Availability– Flexibility– Other Namenode implementations or non-HDFS namespaces

12

Namenode Horizontal: Federation

Vertical scalingMore RAM, Efficiency in memory usageFirst class archives (tar/zip like)Partial namespace in main memory

Block (Object) Storage Subsystem

13

NS1 Foreign NS n

... ... NS k

Nam

esp

ace

Datanode 1 Datanode 2 Datanode m... ... ...

Balancer

Block Pools

Pools nPools kPools 1

Blo

ck s

tora

ge

Block (Object) Storage Subsystem• Shared storage provided as pools of blocks• Namespaces (HDFS, others) use one or more block-pools• Note: HDFS has 2 layers today – we are generalizing/extending it.

1st Phase: B-Pool management inside Namenode

14


NS1 Foreign NS n

... ... NS k

Balancer

Block Pools


NN-1 NN-k NN-n

Future: Move Block

mgt into separate

nodes

Future: Move block management out

15


Balancer

Block Pools


NS1Foreign

NS n... ... NS k

Block Managerclient

1. Open

2. getBlockLocations

3. ReadBlock

Easier to scale horizontally

than the name server

What is a HDFS Cluster

Current• HDFS Cluster

– 1 Namespace– A set of blocks

• Implemented as– 1 Namenode– Set of DNs

New• HDFS Cluster

– N Namespaces– Set of block-pools

• Each block-pool is set of blocks• Phase 1: 1 BP per NS

– Implies N block-pools

• Implemented as– N Namenode– Set of DNs

• Each DN stores the blocks for each block-pool

16

Managing Namespaces

• HDFS Namespaces as a first class entity– Many many namespaces: one per-user or per-project

• Why? Because it can’t fit in a server? No• Pieces of data are often autonomous

• Log data from different dates• Photos/videos loaded by a user• A user’s mail, or his home directory

• The key is sharing the data– A global namespace is one way to do that – but even there we talk of

several large “global” namespaces– Client-side mount table is another way to share

• Shared mount-table => “global” shared view• Personalized mount-table => per-application view

– Share the data that matter by mounting it

17

Plan 9, Spring OS:

dad personalized namespaces

18

HDFS Federation Across Clusters

home

projectdata

tmp

/

Cluster 1

home

projectdatatmp

/

Cluster 2

Application mount-table in Cluster 1

Application mount-table in Cluster 2

19

Nameserver as container for namespaces

•Nameserver as a container for namespaces• Each namespace with its own separate state

• Persistent state in shared storage (e.g. Book Keeper)•Each nameserver serves a set of namespaces

• Selected based on isolation and capacity• A namespace can be moved between nameserver

…

…

Shared persistent storage for namespace metadata (e.g. Book keeper)

Nameserver Nameserver

20

Summary

Federated HDFS (Jira HDFS-1052)

• Scale by adding independent Namenodes

• Preserves the robustness of the Namenodes

• Not much code change to the Namenode

• Generalizes the Block storage layer

• Analogous to Sans & Luns

• Can add other implementations of the Namenodes

• Even other name services (HBase?)

• Could move the Block management out of the Namenode in the future

• But to truly scale to 10s or 100s Bilions of blocks we need to rethink the block map and block reports

• Benefits

• Scale number of file names and blocks

• Improved isolation and hence availability

6 May 2010

21

Q & A

apache hadoop india summit 2011 keynote talk "hdfs federation" by sanjay radia

Documents

autonomouslog data

nopieces of data

namespaces hdfs

blockpoolseach blockpool

block mgt

block management out15datanode

nonhdfs namespaces

set of namespaces