HBase: How to get MTTR below 1 minute
How to get the MTTR below 1 minute and more
Devaraj Das ([email protected])
Nicolas Liochon ([email protected])
Outline
• What is this? Why are we talking about this topic? Why does it matter? …
• HBase recovery – an overview
• HDFS issues
• Beyond MTTR (performance post recovery)
• Conclusion / Future / Q & A
What is MTTR? Why is it important? …
• Mean Time To Recovery → the average time required to repair a failed component (courtesy: Wikipedia)
• Enterprises want an MTTR of ZERO
– Data should always be available with no degradation of perceived SLAs
– Practically hard to obtain, but it's the goal
• Close to zero MTTR is especially important for HBase
– Given it is used in near-realtime systems
• MTTR in other NoSQL systems & Databases
HBase Basics
• Strongly consistent
– Writes are ordered with reads
– Once written, the data will stay
• Built on top of HDFS
• When a machine fails the cluster remains available, and so does its data
• We're only talking about the piece of data that was handled by that machine
Write path
WAL – Write Ahead Log
A write is finished once written on all HDFS nodes
The client communicates with the region servers
We’re in a distributed system
• You can't distinguish a slow server from a dead server
• Everything, or nearly everything, is based on timeouts
• Smaller timeouts mean more false positives
• HBase copes well with false positives, but they always have a cost
• The smaller the timeouts, the better
HBase components for recovery
Recovery in action
Recovery process
• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply
• Region assignment: the master reallocates the regions to the other servers
• Failure recovery: read the WAL and rewrite the data
• The client drops the connection to the dead server and goes to the new one
[Diagram: the client talks to the region servers / DataNodes; ZooKeeper heartbeats detect the failure; master, region servers and ZooKeeper handle region assignment; data recovery runs on the region servers.]
So….
• Detect the failure as fast as possible
• Reassign as fast as possible
• Read / rewrite the WAL as fast as possible
• That’s obvious
The obvious – failure detection
• Failure detection
– Set the ZooKeeper timeout to 30s instead of the old 180s default
– Beware of GC pauses, but lower values are possible
– ZooKeeper detects the errors sooner than the configured timeout
• 0.96
– HBase scripts clean the ZK node when the server is kill -9'ed
– ⇒ Detection time becomes 0
– Can be used by any monitoring tool
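The timeout above maps to a single setting; a minimal hbase-site.xml sketch, assuming the standard `zookeeper.session.timeout` property:

```xml
<!-- hbase-site.xml: lower the ZooKeeper session timeout so a dead
     region server is detected in ~30s instead of the old 180s default.
     GC pauses must stay well below this value, or live servers will
     be declared dead (a false positive). -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- milliseconds -->
</property>
```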
The obvious – faster data recovery
• Not so obvious, actually
• Already distributed since 0.92
– The larger the cluster, the better
• Recovery completely rewritten in 0.96
– Will be covered in the second part
The obvious – Faster assignment
• Faster assignment
– Just improving performance
• Parallelism
• Speed
– Globally 'much' faster
– Backported to 0.94
• Still possible to do better for a huge number of regions
• A few seconds for most cases
With this
• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds
Do you think we're better off with this?
• The answer is NO
• Actually yes, but if and only if HDFS is fine
– When you lose a region server, you've just lost a DataNode
DataNode crash is expensive! • One replica of WAL edits is on the crashed DN – 33% of the reads during the regionserver recovery will go to it
• Many writes will go to it as well (the smaller the cluster, the higher that probability)
• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
– The NameNode only does this work after a long timeout (10 minutes by default)
HDFS – Stale mode

| DataNode state | Use | Detection timeout |
|---|---|---|
| Live | As today: used for reads & writes, using locality | – |
| Stale | Not used for writes, used as last resort for reads | 30 seconds, can be less |
| Dead | As today: not used. And actually, it's better to do the HBase recovery before HDFS replicates the TBs of data of this node | 10 minutes, don't change this |
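A hedged hdfs-site.xml sketch of the stale mode above, assuming the property names introduced with the stale-DataNode feature (HDFS-3703 / HDFS-3912):

```xml
<!-- hdfs-site.xml: mark a DataNode "stale" after 30s without a
     heartbeat. Stale nodes are avoided for writes and used only as a
     last resort for reads. The 10-minute "dead" timeout that triggers
     re-replication is deliberately left untouched. -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value> <!-- 30 seconds, can be less -->
</property>
```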
Results
• Do more read/writes to HDFS during the recovery
• Multiple failures are still possible
– Stale mode will still play its role
– And set the dfs timeout to 30s
– This limits the effect of two failures in a row; the cost of the second failure is 30s if you were unlucky
Are we done?
• We're not bad
• But there is still something
The client
You left it waiting on the dead server
Here it is
The client
• You want the client to be patient
• Retrying when the system is already loaded is not good
• You want the client to learn about region servers dying, and to be able to react immediately.
• You want this to scale.
Solution
• The master notifies the client
– A cheap multicast message with the "dead servers" list. Sent 5 times for safety
– Off by default
– On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
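A minimal hbase-site.xml sketch of this setup: `hbase.status.published` is the switch for the master's status publisher (off by default, as noted above), and `hbase.rpc.timeout` can then stay large. The multicast address and port properties are left at their defaults here:

```xml
<!-- hbase-site.xml: let the master publish the dead-servers list so
     clients can drop connections to dead servers immediately instead
     of waiting for the RPC timeout. -->
<property>
  <name>hbase.status.published</name>
  <value>true</value>
</property>
<!-- With immediate notification, a generous RPC timeout is safe. -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>60000</value> <!-- milliseconds -->
</property>
```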
Full workflow
• t0: the client reads and writes; the RegionServer serves reads and writes
• t1: the RegionServer crashes
• t2: the affected regions are reassigned; the client writes
• t3: data recovered
• t4: the client reads and writes
Are we done?
• In a way, yes
– There are a lot of things around asynchronous writes and reads during recovery
– That will be for another time, but there will be some nice things in 0.96
• And a couple of them are presented in the second part of this talk!
Faster recovery
• Previous algorithm
– Read the WAL files
– Write new HFiles
– Tell the region server it got new HFiles
• Puts pressure on the NameNode
– Remember: don't put pressure on the NameNode
• New algorithm:
– Read the WAL
– Write to the region server
– We're done (we have seen great improvements in our tests)
– TBD: assign the WAL to a RegionServer local to a replica
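The new algorithm can be toggled in configuration; a sketch assuming the 0.96 `hbase.master.distributed.log.replay` switch:

```xml
<!-- hbase-site.xml: switch from distributed log splitting (write
     intermediate per-region files to HDFS) to distributed log replay
     (replay WAL edits directly to the new region servers). -->
<property>
  <name>hbase.master.distributed.log.replay</name>
  <value>true</value>
</property>
```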
[Diagram: Distributed log split. RegionServer0's WAL files (WAL-file1..3, each with interleaved edits such as <region2:edit1><region1:edit2> … <region3:edit1> …) sit in HDFS. RegionServer_x and RegionServer_y read them and write per-region split-log files (Splitlog-file-for-region1..3) back to HDFS, which RegionServer1..3 then read.]
[Diagram: Distributed log replay. RegionServer_x and RegionServer_y read RegionServer0's WAL files (WAL-file1..3) from HDFS and replay the edits directly to RegionServer1..3, which now host the recovered regions (Recovered-file-for-region1..3).]
Write during recovery
• Hey, you can write during the WAL replay
• For event streams: your new recovery time is the failure detection time: max 30s, likely less!
MemStore flush
• Real life: some tables are updated at a given moment, then left alone
– With a non-empty MemStore
– More data to recover
• It's now possible to guarantee that we don't have MemStores with old data
• Improves real life MTTR • Helps snapshots
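A sketch of bounding MemStore age in configuration, assuming the periodic-flush property `hbase.regionserver.optionalcacheflushinterval` (name and default taken as assumptions here):

```xml
<!-- hbase-site.xml: periodically flush MemStores that have not been
     written to for a while, so idle tables don't keep old edits that
     would have to be replayed from the WAL on recovery. -->
<property>
  <name>hbase.regionserver.optionalcacheflushinterval</name>
  <value>3600000</value> <!-- flush MemStores idle for 1 hour (ms) -->
</property>
```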
.META.
• .META.
– There is no -ROOT- in 0.95/0.96
– But .META. failures are critical
• A lot of small improvements
– The server now tells the client when a region has moved (the client can avoid going to .META.)
• And a big one
– The .META. WAL is managed separately to allow an immediate recovery of .META.
– With the new MemStore flush, this ensures a quick recovery
Data locality post recovery
• HBase performance depends on data locality
• After a recovery, you've lost it
– Bad for performance
• Here comes region groups • Assign 3 favored RegionServers for every region • On failures assign the region to one of the secondaries
• The data-‐locality issue is minimized on failures
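A hedged sketch of enabling this, assuming the favored-node balancer class that ships with the feature:

```xml
<!-- hbase-site.xml: use the favored-node balancer so each region gets
     three favored region servers and HDFS block replicas are placed on
     them; on failure the region moves to a secondary that already
     holds local replicas. -->
<property>
  <name>hbase.master.loadbalancer.class</name>
  <value>org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer</value>
</property>
```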
[Diagram: RegionServer1 serves three regions, and their StoreFile blocks (Block1..3) are scattered across Rack1..3 with one replica local to RegionServer1. When RegionServer1 fails and RegionServer4 takes over, it reads Blk1, Blk2 and Blk3 remotely.]
[Diagram: With favored nodes, RegionServer1 serves three regions and their StoreFile block replicas are placed on specific machines on the other racks. After failover to RegionServer4 (one of the favored nodes), there are no remote reads.]
Conclusion
• The target was "from often 10 minutes to always less than 1 minute"
– We're almost there
• Most of it is available in 0.96; some parts were backported
• Real-life testing of the improvements is in progress
• Room for more improvements
Q & A
Thanks!