HBase: How to get MTTR below 1 minute

How to get the MTTR below 1 minute and more Devaraj Das ([email protected]) Nicolas Liochon ([email protected])


DESCRIPTION

Best practices with HBase Mean Time to Recovery.

TRANSCRIPT

Page 1: HBase: How to get MTTR below 1 minute

How to get the MTTR below 1 minute and more

Devaraj Das ([email protected])

Nicolas Liochon ([email protected])

Page 2: HBase: How to get MTTR below 1 minute

Outline

• What is this? Why are we talking about this topic? Why does it matter? …

• HBase Recovery – an overview
• HDFS issues
• Beyond MTTR (performance post recovery)
• Conclusion / Future / Q & A

Page 3: HBase: How to get MTTR below 1 minute

What is MTTR? Why is it important? …

• Mean Time To Recovery -> average time required to repair a failed component (courtesy: Wikipedia)

• Enterprises want an MTTR of ZERO
– Data should always be available with no degradation of perceived SLAs
– Practically hard to obtain, but yeah, it's a goal

• Close-to-zero MTTR is especially important for HBase
– Given it is used in near-realtime systems

• MTTR in other NoSQL systems & databases

Page 4: HBase: How to get MTTR below 1 minute

HBase Basics

• Strongly consistent
– Writes ordered with reads
– Once written, the data will stay

• Built on top of HDFS

• When a machine fails, the cluster remains available, and its data as well

• We're just speaking about the piece of data that was handled by this machine

Page 5: HBase: How to get MTTR below 1 minute

Write path

WAL – Write Ahead Log

A write is finished once written on all HDFS nodes

The client communicates with the region servers
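To make the write path concrete, here is a minimal sketch using the 0.94/0.96-era HBase client API (the table and column names are made up for illustration): put() returns only once the edit has been appended to the WAL on HDFS and applied to the RegionServer's MemStore.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePathExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table / column family names, for illustration only.
        HTable table = new HTable(conf, "usertable");
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        // Returns only after the edit is in the WAL (replicated on HDFS)
        // and in the RegionServer's MemStore.
        table.put(put);
        table.close();
      }
    }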

Page 6: HBase: How to get MTTR below 1 minute

We're in a distributed system

• You can't distinguish a slow server from a dead server

• Everything, or nearly everything, is based on timeouts

• Smaller timeouts mean more false positives
• HBase works well with false positives, but they always have a cost

• The lower the timeouts, the better

Page 7: HBase: How to get MTTR below 1 minute

HBase  components  for  recovery  

Page 8: HBase: How to get MTTR below 1 minute

Recovery in action

Page 9: HBase: How to get MTTR below 1 minute

Recovery process

• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply

• Region assignment: the master reallocates the regions to the other servers

• Failure recovery: read the WAL and rewrite the data again

• The client stops the connection to the dead server and goes to the new one

[Diagram: Client; ZK heartbeat; Region Servers / DataNodes – data recovery; Master, RS, ZK – region assignment.]

Page 10: HBase: How to get MTTR below 1 minute

So…

• Detect the failure as fast as possible
• Reassign as fast as possible
• Read / rewrite the WAL as fast as possible

• That's obvious

Page 11: HBase: How to get MTTR below 1 minute

The obvious – failure detection

• Failure detection
– Set a ZooKeeper timeout of 30s instead of the old 180s default (see the sketch below)
– Beware of the GC, but lower values are possible
– ZooKeeper detects the errors sooner than the configured timeout

• 0.96
– HBase scripts clean the ZK node when the server is kill -9'ed
– => Detection time becomes 0
– Can be used by any monitoring tool
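A minimal sketch of the failure-detection tuning above, assuming the standard zookeeper.session.timeout property (normally set in hbase-site.xml, shown here through the Configuration API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FailureDetectionConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // 30s instead of the old 180s default; beware of long GC pauses
        // before going lower.
        conf.setInt("zookeeper.session.timeout", 30000);
        return conf;
      }
    }

Deleting the dead server's znode (as the 0.96 scripts do on kill -9) short-circuits even this timeout, which is why detection time can drop to zero.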

Page 12: HBase: How to get MTTR below 1 minute

The obvious – faster data recovery

• Not so obvious actually
• Already distributed since 0.92
– The larger the cluster, the better

• Completely rewritten in 0.96
– Recovery itself rewritten in 0.96
– Will be covered in the second part

Page 13: HBase: How to get MTTR below 1 minute

The obvious – faster assignment

• Faster assignment
– Just improving performance
• Parallelism
• Speed
– Globally 'much' faster
– Backported to 0.94

• Still possible to do better for a huge number of regions

• A few seconds for most cases

Page 14: HBase: How to get MTTR below 1 minute

With this

• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds

Page 15: HBase: How to get MTTR below 1 minute

Do you think we're better with this?

• Answer is NO
• Actually, yes, but if and only if HDFS is fine
– But when you lose a RegionServer, you've just lost a DataNode

Page 16: HBase: How to get MTTR below 1 minute

DataNode crash is expensive!

• One replica of the WAL edits is on the crashed DN
– 33% of the reads during the RegionServer recovery will go to it

• Many writes will go to it as well (the smaller the cluster, the higher that probability)

• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
– The NameNode does this work only after a good timeout (10 minutes by default)
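The 10-minute figure comes from the NameNode's dead-node heuristic; a back-of-the-envelope sketch, assuming the standard hdfs-default.xml property names and default values:

    public class DeadNodeInterval {
      public static void main(String[] args) {
        // dfs.namenode.heartbeat.recheck-interval, default 5 minutes (ms)
        long recheckMs = 5L * 60 * 1000;
        // dfs.heartbeat.interval, default 3 seconds (expressed here in ms)
        long heartbeatMs = 3L * 1000;
        // Standard HDFS heuristic: 2 * recheck + 10 * heartbeat
        long deadAfterMs = 2 * recheckMs + 10 * heartbeatMs;
        System.out.println(deadAfterMs / 1000 + " s"); // 630 s, i.e. ~10.5 minutes
      }
    }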

Page 17: HBase: How to get MTTR below 1 minute

HDFS – Stale mode

• Live: as today – used for reads & writes, using locality

• Stale (no heartbeat for 30 seconds, can be less): not used for writes, used as a last resort for reads

• Dead (after 10 minutes, don't change this): as today – not used. And actually, it's better to do the HBase recovery before HDFS replicates the TBs of data of this node
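A minimal sketch of the hdfs-site.xml settings behind stale mode, expressed via the Configuration API (the property names are the standard HDFS ones; check the defaults for your Hadoop version):

    import org.apache.hadoop.conf.Configuration;

    public class StaleModeConfig {
      public static Configuration create() {
        Configuration conf = new Configuration();
        // A DataNode with no heartbeat for 30s is marked stale (can be less).
        conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);
        // Stale nodes: avoided for writes, last resort for reads.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        return conf;
      }
    }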

Page 18: HBase: How to get MTTR below 1 minute

Results

• Do more reads/writes to HDFS during the recovery

• Multiple failures are still possible
– Stale mode will still play its role
– And set the dfs timeout to 30s (see the sketch below)
– This limits the effect of two failures in a row. The cost of the second failure is 30s if you were unlucky
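The "dfs timeout to 30s" above is the deck's shorthand; the client-side HDFS socket timeouts below are my best guess at the knobs meant, so treat the exact property names and the 30s value as assumptions for your Hadoop version:

    import org.apache.hadoop.conf.Configuration;

    public class DfsTimeoutConfig {
      public static void apply(Configuration conf) {
        // HDFS client read timeout (ms)
        conf.setInt("dfs.client.socket-timeout", 30000);
        // HDFS write-pipeline timeout (ms)
        conf.setInt("dfs.datanode.socket.write.timeout", 30000);
      }
    }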

Page 19: HBase: How to get MTTR below 1 minute

Are we done?

• We're not bad
• But there is still something

Page 20: HBase: How to get MTTR below 1 minute

The client

You left it waiting on the dead server

Page 21: HBase: How to get MTTR below 1 minute

Here  it  is  

Page 22: HBase: How to get MTTR below 1 minute

The client

• You want the client to be patient
• Retrying when the system is already loaded is not good

• You want the client to learn about region servers dying, and to be able to react immediately

• You want this to scale
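A sketch of the client-side settings this slide alludes to, with illustrative values rather than the deck's recommendation: a "patient" client keeps a generous RPC timeout and backs off between retries instead of hammering a cluster that is busy recovering.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class PatientClientConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.rpc.timeout", 60000);         // per-RPC timeout (ms)
        conf.setInt("hbase.client.retries.number", 10);  // retry budget
        conf.setInt("hbase.client.pause", 1000);         // base backoff between retries (ms)
        return conf;
      }
    }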

Page 23: HBase: How to get MTTR below 1 minute

Solution

• The master notifies the client
– A cheap multicast message with the "dead servers" list. Sent 5 times for safety
– Off by default (see the sketch below)
– On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
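A sketch of turning the notification on, assuming the 0.96-era hbase.status.published switch; the multicast address and port shown are illustrative assumptions, not values from the deck:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class DeadServerNotificationConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("hbase.status.published", true);            // off by default
        conf.set("hbase.status.multicast.address.ip", "226.1.1.3"); // assumed value
        conf.setInt("hbase.status.multicast.address.port", 16100);  // assumed value
        return conf;
      }
    }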

Page 24: HBase: How to get MTTR below 1 minute

Full workflow

t0 – Client reads and writes; RegionServer serving reads and writes
t1 – RegionServer crashes
t2 – Affected regions reassigned; client writes
t3 – Data recovered
t4 – Client reads and writes

Page 25: HBase: How to get MTTR below 1 minute

Are we done?

• In a way, yes
– There are a lot of things around asynchronous writes, reads during recovery
– Will be for another time, but there will be some nice things in 0.96

• And a couple of them are presented in the second part of this talk!

Page 26: HBase: How to get MTTR below 1 minute

Faster recovery

• Previous algo
– Read the WAL files
– Write new HFiles
– Tell the region server it got new HFiles

• Puts pressure on the NameNode
– Remember: don't put pressure on the NameNode

• New algo (see the sketch below):
– Read the WAL
– Write to the RegionServer
– We're done (have seen great improvements in our tests)
– TBD: assign the WAL to a RegionServer local to a replica
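A sketch of switching the cluster to the new algorithm, assuming the hbase.master.distributed.log.replay flag introduced with this work (off by default in early 0.96):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class LogReplayConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // true  -> distributed log replay (edits replayed straight to the RS)
        // false -> distributed log split (per-region split files written to HDFS)
        conf.setBoolean("hbase.master.distributed.log.replay", true);
        return conf;
      }
    }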

Page 27: HBase: How to get MTTR below 1 minute

[Diagram: Distributed log split. RegionServer0's WAL files (WAL-file1, WAL-file2, WAL-file3, each with interleaved edits such as <region2:edit1><region1:edit2> … <region3:edit1> …) sit in HDFS. RegionServer_x and RegionServer_y read them and write per-region split files (Splitlog-file-for-region1/2/3) back to HDFS, which RegionServer1/2/3 then read.]

Page 28: HBase: How to get MTTR below 1 minute

[Diagram: Distributed log replay. The same WAL files sit in HDFS, but RegionServer_x and RegionServer_y replay the edits directly to RegionServer1/2/3, whose recovered files (Recovered-file-for-region1/2/3) end up in HDFS.]

Page 29: HBase: How to get MTTR below 1 minute

Write during recovery

• Hey, you can write during the WAL replay
• Events stream: your new recovery time is the failure detection time: max 30s, likely less!

Page 30: HBase: How to get MTTR below 1 minute

MemStore flush

• Real life: some tables are updated at a given moment, then left alone
– With a non-empty MemStore
– More data to recover

• It's now possible to guarantee that we don't have a MemStore with old data (see the sketch below)

• Improves real-life MTTR
• Helps snapshots
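One way to bound how old un-flushed MemStore data can get is the periodic flush interval; a sketch with an illustrative 1-hour value (verify the property for your HBase version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class PeriodicMemStoreFlushConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Flush any MemStore whose oldest edit is older than this (ms),
        // so rarely-updated tables don't drag old edits into WAL recovery.
        conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600000L);
        return conf;
      }
    }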

Page 31: HBase: How to get MTTR below 1 minute

.META.

• .META.
– There is no -ROOT- in 0.95/0.96
– But .META. failures are critical

• A lot of small improvements
– The server now tells the client when a region has moved (the client can avoid going to meta)

• And a big one
– The .META. WAL is managed separately to allow an immediate recovery of META
– With the new MemStore flush, ensures a quick recovery

Page 32: HBase: How to get MTTR below 1 minute

Data locality post recovery

• HBase performance depends on data locality
• After a recovery, you've lost it
– Bad for performance

• Here come region groups
• Assign 3 favored RegionServers to every region (see the sketch below)
• On failure, assign the region to one of the secondaries

• The data-locality issue is minimized on failures
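A sketch of enabling favored-node placement so each region keeps three designated RegionServers; the balancer class name is taken from the favored-nodes work in HBase around that time and should be treated as an assumption for your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FavoredNodesConfig {
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Balancer that assigns 3 favored RegionServers per region and
        // places the region's HDFS blocks on those machines.
        conf.set("hbase.master.loadbalancer.class",
            "org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer");
        return conf;
      }
    }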

Page 33: HBase: How to get MTTR below 1 minute

[Diagram: RegionServer1 serves three regions, and their StoreFile blocks (Block1, Block2, Block3) are scattered across Rack1/Rack2/Rack3 with one replica local to RegionServer1. After the failure, the servers that take over (e.g. RegionServer4) read Blk1 and Blk2 remotely, and Blk3 remotely.]

Page 34: HBase: How to get MTTR below 1 minute

[Diagram: RegionServer1 serves three regions, and their StoreFile blocks are placed on specific machines on the other racks (the favored nodes). After the failure, the regions move to those machines: no remote reads.]

Page 35: HBase: How to get MTTR below 1 minute

Conclusion

• The target was "from often 10 minutes to always less than 1 minute"
– We're almost there

• Most of it is available in 0.96, some parts were backported

• Real-life testing of the improvements is in progress

• Room for more improvements

Page 36: HBase: How to get MTTR below 1 minute

Q & A

Thanks!