Apache Hadoop India Summit 2011 keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011
HDFS Federation
Sanjay Radia, Hadoop Architect, Yahoo! Inc.
Outline
• HDFS - Quick overview
• Scaling HDFS - Federation
Hadoop Components
• HDFS: Distributed file system
• MapReduce: Distributed computation
• HBase: Column store
• Pig: Dataflow language
• Hive: Data warehouse
• Zookeeper: Distributed coordination
• Avro: Data serialization
• Oozie: Workflow
HDFS
[Figure: HDFS architecture. Namenode: namespace state (hierarchical namespace, file name -> block IDs) and block map (block ID -> block locations). Backup Namenode: namespace metadata & journal. Datanodes (block ID -> data) hold replicated blocks b1 to b6 and send heartbeats & block reports to the Namenode.]
Horizontally scale IO and storage
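HDFS: Client reads and writes
[Figure: read path: the client (1) opens the file at the Namenode, then (2) reads blocks from the Datanodes. Write path: the client (1) creates the file at the Namenode, then (2) writes to a Datanode, and the write is forwarded from Datanode to Datanode. The Namenode holds the namespace state and block map.]
End-to-end checksum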
HDFS Architecture: Computation close to the data
[Figure: a Hadoop cluster in which Blocks 1, 2 and 3 are each stored as three replicas; MAP tasks run on the nodes that hold the data, and a Reduce task combines their output into the results.]
Quiz: What Is the Common Attribute?
HDFS: Actively maintain data reliability
[Figure: on a bad or lost block replica, (1) the Namenode tells a Datanode to replicate the block, (2) that Datanode copies the block to another Datanode, and (3) the receiving Datanode reports blockReceived to the Namenode. The Namenode holds the namespace state and block map.]
Periodically check block checksums
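The three-step repair flow on this slide (replicate, copy, blockReceived) can be sketched as a small simulation. This is a toy illustration only, not the Namenode's actual code: every class and method name below (ToyNamenode, report_bad_replica, and so on) is hypothetical, and a target of 3 replicas per block is an assumption.

```python
# Toy model of the repair flow: 1. replicate, 2. copy, 3. blockReceived.
# All names are hypothetical; this is not Hadoop's API.

class ToyDatanode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> data

    def copy_block_to(self, block_id, target, namenode):
        # Step 2: the source Datanode copies the block to the target.
        target.blocks[block_id] = self.blocks[block_id]
        # Step 3: the target tells the Namenode it now holds the block.
        namenode.block_received(block_id, target)


class ToyNamenode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication  # assumed target replica count
        self.block_map = {}  # block ID -> set of Datanode names

    def add_block(self, block_id, data, holders):
        for dn in holders:
            dn.blocks[block_id] = data
        self.block_map[block_id] = {dn.name for dn in holders}

    def block_received(self, block_id, datanode):
        self.block_map[block_id].add(datanode.name)

    def report_bad_replica(self, block_id, datanode):
        # A failed checksum or a lost node removes one replica.
        datanode.blocks.pop(block_id, None)
        self.block_map[block_id].discard(datanode.name)
        self.repair(block_id, exclude={datanode.name})

    def repair(self, block_id, exclude=frozenset()):
        holders = self.block_map[block_id]
        by_name = {dn.name: dn for dn in self.datanodes}
        while holders and len(holders) < self.replication:
            candidates = [dn for dn in self.datanodes
                          if dn.name not in holders and dn.name not in exclude]
            if not candidates:
                break  # no healthy node left to take a new replica
            source = by_name[sorted(holders)[0]]
            # Step 1: the Namenode asks a healthy holder to replicate.
            source.copy_block_to(block_id, candidates[0], self)


dns = [ToyDatanode(f"dn{i}") for i in range(4)]
nn = ToyNamenode(dns)
nn.add_block("b1", b"payload", dns[:3])
nn.report_bad_replica("b1", dns[0])
print(sorted(nn.block_map["b1"]))
```

In this sketch the block map is only updated when the receiving node confirms the copy, which mirrors the ordering of the numbered steps in the figure.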
Hadoop at Yahoo!
[Chart: Availability SLA for the Production, Research and Sandbox clusters; values shown are 99.85, 99.47 and 99.69 on a 99.2 to 99.9 scale.]
[Chart: growth by quarter from 2006 Q1 to 2010 Q3 (vertical axis 0 to 250,000). Total nodes = 43,936; total storage = 206 PB.]
[Chart: Nodes running Hadoop at Yahoo!, by Production, Research and Sandbox; values shown are 13,687, 22,334 and 7,803.]
• Over 43,000 nodes running Hadoop
• 1M+ monthly Hadoop jobs
Scaling Hadoop
Early Gains
• Simple design allowed rapid improvements
• Namespace is all in RAM, simpler locking
• Improved memory usage in 0.16, JVM Heap configuration (Suresh Srinivas)
Growth in the number of files and storage is limited by how much RAM can be added to the Namenode
• 50G heap = 200M “fs objects” = 100M names + 100M blocks
• 14PB of storage (50MB blocksize)
• 4K nodes
• Job Tracker carries out both job lifecycle management and scheduling
Yahoo’s Response:
• HDFS Federation: horizontal scaling of namespace (0.22)
• Next Generation of Map-Reduce - Complete overhaul of job tracker/task tracker
Goal:
• Clusters of 6000 nodes, 100,000 cores & 10k concurrent jobs, 100 PB raw storage per cluster
6 May 2010
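The Namenode sizing figures quoted on this slide imply a per-object heap cost and a logical data volume, which the short calculation below reproduces. Decimal units are assumed (1 GB = 10^9 bytes), and the 3x replication factor is an assumption that the slide does not state.

```python
# Back-of-the-envelope check of the Namenode sizing figures on the slide.
# Decimal units are assumed (1 GB = 10**9 bytes, 1 PB = 10**15 bytes).

heap_bytes = 50 * 10**9          # 50G heap
fs_objects = 200 * 10**6         # 200M fs objects (100M names + 100M blocks)
blocks = 100 * 10**6             # 100M blocks
block_size_bytes = 50 * 10**6    # 50MB block size, as quoted on the slide

bytes_per_object = heap_bytes / fs_objects
logical_pb = blocks * block_size_bytes / 10**15

# Assumption, not stated on the slide: each block is stored 3 times.
assumed_replication = 3
raw_pb = logical_pb * assumed_replication

print(f"heap per fs object: {bytes_per_object:.0f} bytes")
print(f"logical data: {logical_pb:.0f} PB")
print(f"raw storage at {assumed_replication}x replication: {raw_pb:.0f} PB")
```

Under these assumptions the heap cost works out to 250 bytes per fs object, and the raw storage estimate lands in the same range as the 14PB quoted, although the slide does not say how that figure was derived.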
Scaling the Name Service: Options
[Figure (not to scale): options plotted by # names (100M, 200M, 1B, 2B, 10B, 20B) against # clients (1x, 4x, 20x, 50x, 100x). Options shown: all NS in memory; archives; partial NS (cache) in memory; multiple namespace volumes; partial NS in memory with namespace volumes; separate Bmaps from NN; distributed NNs. Annotations: good isolation properties; block reports for billions of blocks require rethinking the block layer.]
Opportunity: Vertical & Horizontal scaling
Horizontal scaling of the Namenode (federation) benefits:
– Scale
– Isolation, Stability, Availability
– Flexibility
– Other Namenode implementations or non-HDFS namespaces
Vertical scaling of the Namenode:
– More RAM, efficiency in memory usage
– First class archives (tar/zip like)
– Partial namespace in main memory
Block (Object) Storage Subsystem
[Figure: a namespace layer (NS 1 ... NS k ... Foreign NS n) on top of a block storage layer. Block pools (Pool 1 ... Pool k ... Pool n) are spread across Datanode 1 ... Datanode m, with a Balancer operating on the block storage layer.]
• Shared storage provided as pools of blocks
• Namespaces (HDFS, others) use one or more block-pools
• Note: HDFS has 2 layers today; we are generalizing/extending it.
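1st Phase: B-Pool management inside Namenode
[Figure: Namenodes NN-1 ... NN-k ... NN-n serve namespaces NS 1 ... NS k ... Foreign NS n and each manages its block pool (Pool 1 ... Pool k ... Pool n). The block pools are stored on Datanode 1 ... Datanode m, with a Balancer. Annotation: in the future, move block management into separate nodes.]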
Future: Move block management out
[Figure: namespaces NS 1 ... NS k ... Foreign NS n sit above a separate Block Manager, which manages block pools (Pool 1 ... Pool k ... Pool n) on Datanode 1 ... Datanode m, with a Balancer. Client steps: 1. Open, 2. getBlockLocations, 3. ReadBlock.]
The Block Manager is easier to scale horizontally than the name server
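The client steps in this design (1. Open, 2. getBlockLocations, 3. ReadBlock) can be sketched as follows. This is a minimal illustration of the split between name lookup and block lookup; the classes and methods are hypothetical and are not Hadoop APIs.

```python
# Toy read path with block management split out of the name server.
# All classes and methods are hypothetical illustrations.

class NameServer:
    """Maps file names to block IDs only (no block locations)."""
    def __init__(self):
        self.files = {}  # path -> list of block IDs

    def open(self, path):
        return list(self.files[path])


class BlockManager:
    """Maps block IDs to the Datanodes that hold them."""
    def __init__(self):
        self.locations = {}  # block ID -> list of Datanode objects

    def get_block_locations(self, block_id):
        return self.locations[block_id]


class Datanode:
    def __init__(self):
        self.blocks = {}  # block ID -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, name_server, block_manager):
    data = b""
    for block_id in name_server.open(path):                     # 1. Open
        replicas = block_manager.get_block_locations(block_id)  # 2. getBlockLocations
        data += replicas[0].read_block(block_id)                # 3. ReadBlock
    return data


ns, bm = NameServer(), BlockManager()
dn1, dn2 = Datanode(), Datanode()
dn1.blocks["b1"] = b"hello "
dn2.blocks["b2"] = b"world"
bm.locations = {"b1": [dn1], "b2": [dn2]}
ns.files["/logs/day1"] = ["b1", "b2"]
print(read_file("/logs/day1", ns, bm))
```

Because the name server here never sees block locations, the block manager's map can be partitioned by block ID without touching the namespace, which is one way to read the slide's remark about horizontal scaling.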
What is an HDFS Cluster
Current
• HDFS Cluster
– 1 Namespace
– A set of blocks
• Implemented as
– 1 Namenode
– Set of DNs
New
• HDFS Cluster
– N Namespaces
– Set of block-pools
  • Each block-pool is a set of blocks
  • Phase 1: 1 BP per NS
– Implies N block-pools
• Implemented as
– N Namenodes
– Set of DNs
  • Each DN stores the blocks for each block-pool
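The "new" cluster definition above can be written down as a small data model: N Namenodes, one block pool per namespace (Phase 1), and Datanodes that store blocks for every pool. The sketch below is illustrative only; the class and field names are hypothetical and do not come from the Hadoop code base.

```python
from dataclasses import dataclass, field

# Toy data model of a federated cluster. Names are hypothetical.

@dataclass
class BlockPool:
    pool_id: str
    block_ids: set = field(default_factory=set)

@dataclass
class Namenode:
    namespace: str
    block_pool: BlockPool  # Phase 1: exactly one block pool per namespace

@dataclass
class Datanode:
    name: str
    storage: dict = field(default_factory=dict)  # pool ID -> {block ID: data}

    def store(self, pool_id, block_id, data):
        self.storage.setdefault(pool_id, {})[block_id] = data

@dataclass
class FederatedCluster:
    namenodes: list
    datanodes: list

    def block_pools(self):
        return [nn.block_pool for nn in self.namenodes]

    def write_block(self, namespace, block_id, data, datanode):
        # Find the Namenode that owns this namespace and record the block
        # in its pool, then store the bytes on the chosen Datanode.
        nn = next(n for n in self.namenodes if n.namespace == namespace)
        nn.block_pool.block_ids.add(block_id)
        datanode.store(nn.block_pool.pool_id, block_id, data)


dn = Datanode("dn1")
cluster = FederatedCluster(
    namenodes=[Namenode("ns1", BlockPool("bp1")), Namenode("ns2", BlockPool("bp2"))],
    datanodes=[dn],
)
cluster.write_block("ns1", "b1", b"x", dn)
cluster.write_block("ns2", "b7", b"y", dn)
print(sorted(dn.storage))
```

A single Datanode ends up holding blocks for both pools, while each Namenode only tracks the blocks of its own pool.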
Managing Namespaces
• HDFS Namespaces as a first class entity
– Many, many namespaces: one per user or per project
• Why? Because it can’t fit in a server? No
• Pieces of data are often autonomous
– Log data from different dates
– Photos/videos loaded by a user
– A user’s mail, or his home directory
• The key is sharing the data
– A global namespace is one way to do that, but even there we talk of several large “global” namespaces
– Client-side mount table is another way to share
  • Shared mount-table => “global” shared view
  • Personalized mount-table => per-application view
– Share the data that matter by mounting it
Plan 9, Spring OS: had personalized namespaces
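HDFS Federation Across Clusters
[Figure: Cluster 1 and Cluster 2 each present a namespace tree rooted at / with home, project, data and tmp. An application mount-table in Cluster 1 and an application mount-table in Cluster 2 map these directories onto the namespaces of the clusters.]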
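A client-side mount table can be sketched as a longest-prefix lookup from a path to the namespace that serves it. The code below is a minimal illustration of the idea, not an HDFS implementation: the MountTable class is hypothetical, and the hdfs:// authorities are made-up example names.

```python
# Toy client-side mount table: longest-prefix match from a path to the
# namespace that serves it. The hdfs:// authorities below are made up.

class MountTable:
    def __init__(self, mounts):
        # mounts: mount point -> target namespace URI
        self.mounts = {(mp.rstrip("/") or "/"): target
                       for mp, target in mounts.items()}

    def resolve(self, path):
        best = None
        for mount_point in self.mounts:
            prefix = mount_point if mount_point == "/" else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        remainder = path[len(best):].lstrip("/")
        target = self.mounts[best].rstrip("/")
        return f"{target}/{remainder}" if remainder else target


# Shared mount table: the same "global" view for every client.
shared = {
    "/home": "hdfs://nn-home.example/home",
    "/data": "hdfs://nn-data.example/data",
}
# Personalized mount table: one application adds its own scratch area.
personal = {**shared, "/tmp": "hdfs://nn-scratch.example/tmp/app1"}

print(MountTable(shared).resolve("/home/alice/mail"))
print(MountTable(personal).resolve("/tmp/part-0"))
```

The shared and personalized tables differ only in their entries, so the same resolution code gives either a "global" view or a per-application view.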
Nameserver as container for namespaces
• Nameserver as a container for namespaces
– Each namespace with its own separate state
– Persistent state in shared storage (e.g. BookKeeper)
• Each nameserver serves a set of namespaces
– Selected based on isolation and capacity
– A namespace can be moved between nameservers
[Figure: two Nameservers, each serving several namespaces, on top of shared persistent storage for namespace metadata (e.g. BookKeeper).]
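The container idea can be illustrated with a toy model in which namespace state lives in shared storage, so that moving a namespace only changes which nameserver serves it. All names below are hypothetical, and a plain in-memory dictionary stands in for the shared persistent store.

```python
# Toy model: namespace state lives in shared storage, so moving a namespace
# between nameservers only changes who serves it. Names are hypothetical.

class SharedStorage:
    def __init__(self):
        self.state = {}  # namespace name -> {path: block IDs}


class Nameserver:
    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.serving = set()  # namespaces this nameserver currently serves

    def attach(self, namespace):
        self.serving.add(namespace)

    def detach(self, namespace):
        self.serving.remove(namespace)

    def lookup(self, namespace, path):
        if namespace not in self.serving:
            raise LookupError(f"{self.name} does not serve {namespace}")
        return self.storage.state[namespace][path]


def move_namespace(namespace, source, target):
    # No metadata is copied: both nameservers read the same shared state.
    source.detach(namespace)
    target.attach(namespace)


storage = SharedStorage()
storage.state["logs"] = {"/2011/02/16": ["b1", "b2"]}
a, b = Nameserver("ns-a", storage), Nameserver("ns-b", storage)
a.attach("logs")
move_namespace("logs", a, b)
print(b.lookup("logs", "/2011/02/16"))
```

A real system would also need to fence the old nameserver and replay any journal before serving, which this sketch leaves out.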
Summary
Federated HDFS (Jira HDFS-1052)
• Scale by adding independent Namenodes
• Preserves the robustness of the Namenodes
• Not much code change to the Namenode
• Generalizes the Block storage layer
• Analogous to SANs & LUNs
• Can add other implementations of the Namenodes
• Even other name services (HBase?)
• Could move the Block management out of the Namenode in the future
• But to truly scale to 10s or 100s of billions of blocks we need to rethink the block map and block reports
• Benefits
• Scale number of file names and blocks
• Improved isolation and hence availability
Q & A