HBase: How to get MTTR below 1 minute

How to get the MTTR below 1 minute and more Devaraj Das ([email protected]) Nicolas Liochon ([email protected])


DESCRIPTION

Best practices with HBase Mean Time to Recovery.

TRANSCRIPT

Page 1: HBase: How to get MTTR below 1 minute

How to get the MTTR below 1 minute and more

Devaraj Das ([email protected])

Nicolas Liochon ([email protected])

Page 2: HBase: How to get MTTR below 1 minute

Outline

• What is this? Why are we talking about this topic? Why does it matter? …

• HBase Recovery – an overview
• HDFS issues
• Beyond MTTR (performance post recovery)
• Conclusion / Future / Q & A

Page 3: HBase: How to get MTTR below 1 minute

What is MTTR? Why is it important? …

• Mean Time To Recovery -> average time required to repair a failed component (courtesy: Wikipedia)

• Enterprises want an MTTR of ZERO
– Data should always be available with no degradation of perceived SLAs
– Practically hard to obtain, but yeah, it's a goal

• Close-to-zero MTTR is especially important for HBase
– Given it is used in near-realtime systems

• MTTR in other NoSQL systems & databases

Page 4: HBase: How to get MTTR below 1 minute

HBase Basics

• Strongly consistent
– Writes ordered with reads
– Once written, the data will stay

• Built on top of HDFS

• When a machine fails, the cluster remains available, and its data as well

• We're just speaking about the piece of data that was handled by this machine

Page 5: HBase: How to get MTTR below 1 minute

Write path

WAL – Write Ahead Log

A write is finished once written on all HDFS nodes

The client communicates with the region servers
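To make the write path concrete, here is a minimal sketch using the 0.94/0.96-era HBase client API (the table and column names are made up for illustration): put() returns only once the edit has been appended to the WAL on HDFS and applied to the RegionServer's MemStore.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePathExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table / column family names, for illustration only.
        HTable table = new HTable(conf, "usertable");
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        // Returns only after the edit is in the WAL (replicated on HDFS)
        // and in the RegionServer's MemStore.
        table.put(put);
        table.close();
      }
    }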

Page 6: HBase: How to get MTTR below 1 minute

We're in a distributed system

• You can't distinguish a slow server from a dead server

• Everything, or nearly everything, is based on timeouts

• Smaller timeouts mean more false positives
• HBase works well with false positives, but they always have a cost

• The lower the timeouts, the better

Page 7: HBase: How to get MTTR below 1 minute

HBase  components  for  recovery  

Page 8: HBase: How to get MTTR below 1 minute

Recovery in action

Page 9: HBase: How to get MTTR below 1 minute

Recovery process

• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply

• Region assignment: the master reallocates the regions to the other servers

• Failure recovery: read the WAL and rewrite the data again

• The client stops the connection to the dead server and goes to the new one

[Diagram: Client; ZK heartbeat; Region Servers / DataNodes – data recovery; Master, RS, ZK – region assignment.]

Page 10: HBase: How to get MTTR below 1 minute

So…

• Detect the failure as fast as possible
• Reassign as fast as possible
• Read / rewrite the WAL as fast as possible

• That's obvious

Page 11: HBase: How to get MTTR below 1 minute

The obvious – failure detection

• Failure detection
– Set a ZooKeeper timeout of 30s instead of the old 180s default (see the sketch below)
– Beware of the GC, but lower values are possible
– ZooKeeper detects the errors sooner than the configured timeout

• 0.96
– HBase scripts clean the ZK node when the server is kill -9'ed
– => Detection time becomes 0
– Can be used by any monitoring tool
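A minimal sketch of the failure-detection tuning above, assuming the standard zookeeper.session.timeout property (normally set in hbase-site.xml, shown here through the Configuration API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FailureDetectionConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // 30s instead of the old 180s default; beware of long GC pauses
        // before going lower.
        conf.setInt("zookeeper.session.timeout", 30000);
        return conf;
      }
    }

Deleting the dead server's znode (as the 0.96 scripts do on kill -9) short-circuits even this timeout, which is why detection time can drop to zero.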

Page 12: HBase: How to get MTTR below 1 minute

The obvious – faster data recovery

• Not so obvious actually
• Already distributed since 0.92
– The larger the cluster, the better

• Completely rewritten in 0.96
– Recovery itself rewritten in 0.96
– Will be covered in the second part

Page 13: HBase: How to get MTTR below 1 minute

The obvious – faster assignment

• Faster assignment
– Just improving performance
• Parallelism
• Speed
– Globally 'much' faster
– Backported to 0.94

• Still possible to do better for a huge number of regions

• A few seconds for most cases

Page 14: HBase: How to get MTTR below 1 minute

With this

• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds

Page 15: HBase: How to get MTTR below 1 minute

Do you think we're better with this?

• Answer is NO
• Actually, yes, but if and only if HDFS is fine
– But when you lose a RegionServer, you've just lost a DataNode

Page 16: HBase: How to get MTTR below 1 minute

DataNode crash is expensive!

• One replica of the WAL edits is on the crashed DN
– 33% of the reads during the RegionServer recovery will go to it

• Many writes will go to it as well (the smaller the cluster, the higher that probability)

• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
– The NameNode does this work only after a good timeout (10 minutes by default)
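The 10-minute figure comes from the NameNode's dead-node heuristic; a back-of-the-envelope sketch, assuming the standard hdfs-default.xml property names and default values:

    public class DeadNodeInterval {
      public static void main(String[] args) {
        // dfs.namenode.heartbeat.recheck-interval, default 5 minutes (ms)
        long recheckMs = 5L * 60 * 1000;
        // dfs.heartbeat.interval, default 3 seconds (expressed here in ms)
        long heartbeatMs = 3L * 1000;
        // Standard HDFS heuristic: 2 * recheck + 10 * heartbeat
        long deadAfterMs = 2 * recheckMs + 10 * heartbeatMs;
        System.out.println(deadAfterMs / 1000 + " s"); // 630 s, i.e. ~10.5 minutes
      }
    }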

Page 17: HBase: How to get MTTR below 1 minute

HDFS – Stale mode

• Live: as today – used for reads & writes, using locality

• Stale (no heartbeat for 30 seconds, can be less): not used for writes, used as a last resort for reads

• Dead (after 10 minutes, don't change this): as today – not used. And actually, it's better to do the HBase recovery before HDFS replicates the TBs of data of this node
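A minimal sketch of the hdfs-site.xml settings behind stale mode, expressed via the Configuration API (the property names are the standard HDFS ones; check the defaults for your Hadoop version):

    import org.apache.hadoop.conf.Configuration;

    public class StaleModeConfig {
      public static Configuration create() {
        Configuration conf = new Configuration();
        // A DataNode with no heartbeat for 30s is marked stale (can be less).
        conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);
        // Stale nodes: avoided for writes, last resort for reads.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        return conf;
      }
    }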

Page 18: HBase: How to get MTTR below 1 minute

Results

• Do more reads/writes to HDFS during the recovery

• Multiple failures are still possible
– Stale mode will still play its role
– And set the dfs timeout to 30s (see the sketch below)
– This limits the effect of two failures in a row. The cost of the second failure is 30s if you were unlucky
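The "dfs timeout to 30s" above is the deck's shorthand; the client-side HDFS socket timeouts below are my best guess at the knobs meant, so treat the exact property names and the 30s value as assumptions for your Hadoop version:

    import org.apache.hadoop.conf.Configuration;

    public class DfsTimeoutConfig {
      public static void apply(Configuration conf) {
        // HDFS client read timeout (ms)
        conf.setInt("dfs.client.socket-timeout", 30000);
        // HDFS write-pipeline timeout (ms)
        conf.setInt("dfs.datanode.socket.write.timeout", 30000);
      }
    }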

Page 19: HBase: How to get MTTR below 1 minute

Are we done?

• We're not bad
• But there is still something

Page 20: HBase: How to get MTTR below 1 minute

The client

You left it waiting on the dead server

Page 21: HBase: How to get MTTR below 1 minute

Here  it  is  

Page 22: HBase: How to get MTTR below 1 minute

The client

• You want the client to be patient
• Retrying when the system is already loaded is not good

• You want the client to learn about region servers dying, and to be able to react immediately

• You want this to scale
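A sketch of the client-side settings this slide alludes to, with illustrative values rather than the deck's recommendation: a "patient" client keeps a generous RPC timeout and backs off between retries instead of hammering a cluster that is busy recovering.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class PatientClientConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.rpc.timeout", 60000);         // per-RPC timeout (ms)
        conf.setInt("hbase.client.retries.number", 10);  // retry budget
        conf.setInt("hbase.client.pause", 1000);         // base backoff between retries (ms)
        return conf;
      }
    }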

Page 23: HBase: How to get MTTR below 1 minute

Solution

• The master notifies the client
– A cheap multicast message with the "dead servers" list. Sent 5 times for safety
– Off by default (see the sketch below)
– On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
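A sketch of turning the notification on, assuming the 0.96-era hbase.status.published switch; the multicast address and port shown are illustrative assumptions, not values from the deck:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class DeadServerNotificationConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("hbase.status.published", true);            // off by default
        conf.set("hbase.status.multicast.address.ip", "226.1.1.3"); // assumed value
        conf.setInt("hbase.status.multicast.address.port", 16100);  // assumed value
        return conf;
      }
    }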

Page 24: HBase: How to get MTTR below 1 minute

Full workflow

t0 – Client reads and writes; RegionServer serving reads and writes
t1 – RegionServer crashes
t2 – Affected regions reassigned; client writes
t3 – Data recovered
t4 – Client reads and writes

Page 25: HBase: How to get MTTR below 1 minute

Are we done?

• In a way, yes
– There are a lot of things around asynchronous writes, reads during recovery
– Will be for another time, but there will be some nice things in 0.96

• And a couple of them are presented in the second part of this talk!

Page 26: HBase: How to get MTTR below 1 minute

Faster recovery

• Previous algo
– Read the WAL files
– Write new HFiles
– Tell the region server it got new HFiles

• Puts pressure on the NameNode
– Remember: don't put pressure on the NameNode

• New algo (see the sketch below):
– Read the WAL
– Write to the RegionServer
– We're done (have seen great improvements in our tests)
– TBD: assign the WAL to a RegionServer local to a replica
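A sketch of switching the cluster to the new algorithm, assuming the hbase.master.distributed.log.replay flag introduced with this work (off by default in early 0.96):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class LogReplayConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // true  -> distributed log replay (edits replayed straight to the RS)
        // false -> distributed log split (per-region split files written to HDFS)
        conf.setBoolean("hbase.master.distributed.log.replay", true);
        return conf;
      }
    }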

Page 27: HBase: How to get MTTR below 1 minute

[Diagram: Distributed log split. RegionServer0's WAL files (WAL-file1, WAL-file2, WAL-file3, each with interleaved edits such as <region2:edit1><region1:edit2> … <region3:edit1> …) sit in HDFS. RegionServer_x and RegionServer_y read them and write per-region split files (Splitlog-file-for-region1/2/3) back to HDFS, which RegionServer1/2/3 then read.]

Page 28: HBase: How to get MTTR below 1 minute

[Diagram: Distributed log replay. The same WAL files sit in HDFS, but RegionServer_x and RegionServer_y replay the edits directly to RegionServer1/2/3, whose recovered files (Recovered-file-for-region1/2/3) end up in HDFS.]

Page 29: HBase: How to get MTTR below 1 minute

Write during recovery

• Hey, you can write during the WAL replay
• Events stream: your new recovery time is the failure detection time: max 30s, likely less!

Page 30: HBase: How to get MTTR below 1 minute

MemStore flush

• Real life: some tables are updated at a given moment, then left alone
– With a non-empty MemStore
– More data to recover

• It's now possible to guarantee that we don't have a MemStore with old data (see the sketch below)

• Improves real-life MTTR
• Helps snapshots
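One way to bound how old un-flushed MemStore data can get is the periodic flush interval; a sketch with an illustrative 1-hour value (verify the property for your HBase version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class PeriodicMemStoreFlushConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Flush any MemStore whose oldest edit is older than this (ms),
        // so rarely-updated tables don't drag old edits into WAL recovery.
        conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600000L);
        return conf;
      }
    }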

Page 31: HBase: How to get MTTR below 1 minute

.META.

• .META.
– There is no -ROOT- in 0.95/0.96
– But .META. failures are critical

• A lot of small improvements
– The server now tells the client when a region has moved (the client can avoid going to meta)

• And a big one
– The .META. WAL is managed separately to allow an immediate recovery of META
– With the new MemStore flush, ensures a quick recovery

Page 32: HBase: How to get MTTR below 1 minute

Data locality post recovery

• HBase performance depends on data locality
• After a recovery, you've lost it
– Bad for performance

• Here come region groups
• Assign 3 favored RegionServers to every region (see the sketch below)
• On failure, assign the region to one of the secondaries

• The data-locality issue is minimized on failures
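A sketch of enabling favored-node placement so each region keeps three designated RegionServers; the balancer class name is taken from the favored-nodes work in HBase around that time and should be treated as an assumption for your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FavoredNodesConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Balancer that assigns 3 favored RegionServers per region and
        // places the region's HDFS blocks on those machines.
        conf.set("hbase.master.loadbalancer.class",
            "org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer");
        return conf;
      }
    }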

Page 33: HBase: How to get MTTR below 1 minute

[Diagram: RegionServer1 serves three regions, and their StoreFile blocks (Block1, Block2, Block3) are scattered across Rack1/Rack2/Rack3 with one replica local to RegionServer1. After the failure, the servers that take over (e.g. RegionServer4) read Blk1 and Blk2 remotely, and Blk3 remotely.]

Page 34: HBase: How to get MTTR below 1 minute

[Diagram: RegionServer1 serves three regions, and their StoreFile blocks are placed on specific machines on the other racks (the favored nodes). After the failure, the regions move to those machines: no remote reads.]

Page 35: HBase: How to get MTTR below 1 minute

Conclusion

• The target was "from often 10 minutes to always less than 1 minute"
– We're almost there

• Most of it is available in 0.96, some parts were backported

• Real-life testing of the improvements is in progress

• Room for more improvements

Page 36: HBase: How to get MTTR below 1 minute

Q & A

Thanks!