hdfs high availability

A High Availability story for HDFS AvatarNode

Dhruba Borthakur & Dmytro Molkov [email protected] & [email protected] Presented at The Hadoop User Group Meeting, Sept 29, 2010

How infrequently does the NameNode (NN) stop?

  Hadoop Software Bugs –  Two directories in fs.name.dir, but when a write to first

directory failed, the NN ignored the second one (once) –  Upgrade from 0.17 to 0.18 caused data corruption

(once)   Configuration errors

–  Fsimage partition ran out of space (once) –  Network Load Anomalies (about 10 times)

  Maintenance: –  Deploy new patches (once every month)

What does the SecondaryNameNode do?

  Periodically merges Transaction logs   Requires the same amount of memory as NN   Why is it separate from NN?

–  Avoids fine-grain locking of NN data structures –  Avoids implementing copy-on-write for NN data

structures   Renamed as CheckpointNode (CN) in 0.21 release.

Shortcomings of the SecondaryNameNode?

  Does not have a copy of the latest transaction log   Periodic and is not continuous

–  Configured to run every hour

  If the NN dies, the SecondaryNameNode does not take over the responsibilities of the NN

BackupNode (BN)

  NN streams transaction log to BackupNode   BackupNode applies log to in-memory and disk image   BN always commit to disk before success to NN   If BN restarts, it has to catch up with NN   Available in HDFS 0.21 release

Limitations of BackupNode (BN)

  Maximum of one BackupNode per NN –  Support only two-machine failure

  NN does not forward block reports to BackupNode   Time to restart from 12 GB image, 70M files + 100 M

blocks –  3 – 5 minutes to read the image from disk –  20 min to process block reports –  BN will still take 25+ minutes to failover!

Overlapping Clusters for HA

  “Always available for write” model   Two logical clusters each with their own NN   Each physical machine runs two instances of DataNode   Two DataNode instances share the same physical storage

device   Application has logic to failover writes from one HDFS

cluster to another   More details at http://hadoopblog.blogspot.com/2009/06/hdfs-

scribe-integration.html

HDFS+Zookeeper

  HDFS can store transaction logs in Zookeeper/Bookeeper –  http://issues.apache.org/jira/browse/HDFS-234

  Transaction log need not be stored in NFS filer   A new NN will still have to process block reports

–  Not good for HA yet, because NN failover will take 30 minutes

Our use case for High Availability

  Failover should occur in less than a minute   Failovers are needed only for new software upgrades

Challenges

  DataNodes send block location information to only one NameNode

  NameNode needs block locations in memory to serve clients

  The in-memory metadata for 100 million files could be 60 GB, huge!

DataNodes

Primary NameNode

Client

Block location message “yes, I have blockid 123”

Client retrieves block location from NameNode

Introduction for AvatarNode

  Active-Standby Pair –  Coordinated via zookeeper –  Failover in few seconds –  Wrapper over NameNode

  Active AvatarNode –  Writes transaction log to filer

  Standby AvatarNode –  Reads transactions from filer –  Latest metadata in memory

http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html

NFS Filer

Active AvatarNode (NameNode)

Client

Standby AvatarNode (NameNode)

Block location messages

Client retrieves block location from Primary or Standby

write transaction

read transaction

Block location messages

DataNodes

ZooKeeper integration for Clients

  DistributedAvatarFileSystem: –  Connects to ZooKeeper to figure out who the Primary node is.

There is a znode in ZooKeeper that has the current address of the primary. Clients read it on creation and during failover.

–  Is aware of the failover state and pauses until it is over. If the znode is empty the cluster is failing over to the new Primary, just wait for that to finish.

–  Handles failures in calls to the NameNode. If the call failed with network exception – checks for the failover in progress and retries the call after the new Primary is up.

Four steps to failover

  Wipe ZooKeeper entry. Clients will know the failover is in progress. (0 seconds)

  Stop the primary namenode. Last bits of data will be flushed to Transaction Log and it will die. (Seconds)

  Switch Standby to Primary. It will consume the rest of the Transaction log and get out of safemode ready to serve traffic. (Seconds)

  Update the entry in ZooKeeper. All the clients waiting for failover will pick up the new connection. (0 seconds)

  After: Start the first node in the Standby Mode (Takes a while, but the cluster is up and running)

Why add ZooKeeper to the mix

  Provides a clean way to execute failovers in the application layer.

  A centralized control of all the clients. Gives us the ability to Pause clients until the failover is done.

  A good stepping stone for future improvements needed to perform automatic failover: –  Nodes voting on who will be the primary –  DataNodes knowing who has the authority to delete blocks

ZooKeeper is an option

  Can be implemented using IP failover based on the existing infrastructure.

  The clients will not know if the failover is done and it is safe to make the call again.

  IP failover works well in tightly coupled system (both nodes in one rack) so the Single Point of Failure is still there (rack switch)

  There is no need to run a dedicated ZooKeeper cluster.

It is not all about failover

  The Standby node has a lot of CPU that is not used   The Standby node has a full (but delayed a bit) picture of

Blocks and the Namesystem.

  Send all reads that can deal with stale data to Standby.   Have pluggable services run as a part of Standby node and

use the metadata of the filesystem directly from the shared memory instead of querying the namenode all the time.

  My Hadoop Blog: –  http://hadoopblog.blogspot.com/ –  http://www.facebook.com/hadoopfs

hdfs high availability

Education