Ceph Optimization on All Flash Storage

Axel Rosenberg, Systems Field Engineering & Solutions Marketing


Page 1: Ceph Optimization on All Flash Storage

Ceph Optimization on All Flash Storage. Axel Rosenberg, Systems Field Engineering & Solutions Marketing.

Page 2: Ceph Optimization on All Flash Storage

2  

Agenda / Talking Points

• Challenges of storage @scale
• CEPH vs. IFOS (some benchmarks)
• HDD vs. flash
• Object EC vs. 3-way replica (some benchmarks)
• Flash economics with SDS beyond CEPH / customer references & studies

Page 3: Ceph Optimization on All Flash Storage

3  

Why  was  CEPH  invented?  

Page 4: Ceph Optimization on All Flash Storage

4  

This  is  Sage  in  2006.    He’s  an  HPC  engineer.  

Page 5: Ceph Optimization on All Flash Storage

5  

His  Lustre  file  system  isn’t  scaling  very  well  and  he  wants  to  know  why.  

Page 6: Ceph Optimization on All Flash Storage

6  

Eventually  he  realizes...  It’s  the  metadata!  

Page 7: Ceph Optimization on All Flash Storage

7  

So Sage sits down to write his own file system, one that manages metadata better! He calls it Ceph.

Page 8: Ceph Optimization on All Flash Storage

8  

Ceph  is  short  for  cephalopod,  like  an  octopus,  because  it  can  do  a  lot  of  things  in  parallel.  

Page 9: Ceph Optimization on All Flash Storage

9  

He likes it so much that he decides to work on it full time. He starts a company called Inktank.

Page 10: Ceph Optimization on All Flash Storage

10  

Everyone  loves  Inktank.    They  win  lots  of  awards.  

Page 11: Ceph Optimization on All Flash Storage

11  

Inktank is eventually purchased by Red Hat.

Page 12: Ceph Optimization on All Flash Storage

12  

Traditional file systems use lookup tables to store file locations.

Inode  Table   Your  Files  

Page 13: Ceph Optimization on All Flash Storage

13  

For  local  file  systems,  this  is  easy  to  manage.  

Inode  Table  

Page 14: Ceph Optimization on All Flash Storage

14  

For  distributed  file  systems,  it’s  more  difficult.  

Inode  Table  

Page 15: Ceph Optimization on All Flash Storage

15  

At  the  Petabyte  Scale,  it  starts  to  break  down.  

Inode  Table  
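A minimal sketch of the idea that replaces the inode table: every client computes an object's location from its name with a deterministic hash, so no central lookup table has to grow with the cluster. This is an illustration of the principle behind Ceph's CRUSH placement, not the CRUSH algorithm itself; the object and OSD names are hypothetical.

import hashlib

def place_object(obj_name, osds, replicas=3):
    """Deterministically map an object name to `replicas` distinct OSDs.

    Every client runs the same computation, so placement needs no central
    metadata lookup. This is a simple rendezvous hash, not Ceph's CRUSH.
    """
    def score(osd):
        return int.from_bytes(
            hashlib.sha256(f"{obj_name}/{osd}".encode()).digest()[:8], "big")
    return sorted(osds, key=score, reverse=True)[:replicas]

# Hypothetical cluster of 8 OSDs; any client computes the same answer.
osds = [f"osd.{i}" for i in range(8)]
print(place_object("rbd_data.1234.0000000000000000", osds))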

Page 16: Ceph Optimization on All Flash Storage

16  

This  is  why  Sage  doesn’t  like  Lustre.    

Page 17: Ceph Optimization on All Flash Storage

17  


17

8 Years Later

Page 18: Ceph Optimization on All Flash Storage

18  

FLASH READINESS?
• Started with CEPH Emperor (~2 years ago)
• Not much scaling with an increasing number of clients
• Minimal scaling as we increase the number of SSDs
• No resources are saturated
• CPU core usage per IOPS is very high
• The double write due to the CEPH journal increases write amplification (WA) by at least 2x (see the arithmetic sketch below)
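A back-of-the-envelope sketch of that journal write amplification, assuming the journaled FileStore behavior described in this deck (every write hits the OSD journal and then the filestore); the byte counts are illustrative, not measured.

# Minimal write-amplification arithmetic for a journaled (FileStore-style) OSD.
# Every client write is persisted twice per replica: once to the OSD journal,
# once to the backing filestore. Values below are illustrative.

client_bytes = 1_000_000_000        # 1 GB written by the client
replicas = 3                        # osd_pool_default_size = 3
writes_per_replica = 2              # journal write + filestore write

flash_bytes = client_bytes * replicas * writes_per_replica
wa = flash_bytes / client_bytes
print(f"Bytes hitting flash: {flash_bytes:,}  (WA = {wa:.0f}x)")
# -> 6x total; 2x of it comes from the journal double write alone.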

  18

Page 19: Ceph Optimization on All Flash Storage

19  

SanDisk on CEPH
• We saw a lot of potential in CEPH
• Decided to dig deep and optimize the code base for flash
• The CEPH read path was comparatively easier to tackle, so we started with that first
• Also, many of our potential customers' workloads were read-heavy (few writes, many reads)

19

Page 20: Ceph Optimization on All Flash Storage

20  

IFOS: Enhancing CEPH for Enterprise Consumption

Open-source Ceph + SanDisk performance patches:
• Out-of-the-box configurations tuned for performance with flash
• Sizing & planning tool
• Higher node resiliency with multi-path support
• Persistent reservations of drives to nodes
• CEPH installer built specifically for InfiniFlash
• High-performance iSCSI storage with SCST deployment by default
• Better diagnostics with a log collection tool
• Enterprise-hardened QA @scale
• InfiniFlash drive management integrated into CEPH management (coming soon)

The SanDisk CEPH distro adds usability and performance utilities without sacrificing open-source principles.
IFOS = SanDisk Ceph distribution + utilities.
All CEPH performance improvements developed by SanDisk are contributed back to the community.

Page 21: Ceph Optimization on All Flash Storage

21  

IFOS:…what’s  the  deal?  

Page 22: Ceph Optimization on All Flash Storage

22  

Innovating Performance @Massive Scale
InfiniFlash OS: Ceph transformed for flash performance and contributed back to the community
• 10x improvement for block reads, 2x improvement for object reads

Major improvements to enhance parallelism (see the sharded-queue sketch below):
• Removed single dispatch-queue bottlenecks in the OSD and client (librados) layers
• Shard thread pool implementation
• Major lock reordering; improved lock granularity
• Reader/writer locks; granular locks at the object level
• Optimized OpTracking path in the OSD, eliminating redundant locks

Messenger performance enhancements:
• Message signing
• Socket read-aheads
• Resolved severe lock contentions

Backend optimizations (XFS and flash):
• Reduced ~2 CPU cores of usage with improved file-path resolution from object ID
• CPU- and lock-optimized fast path for reads
• Disabled throttling for flash
• Index Manager caching and shared FdCache in the filestore
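A minimal sketch of the sharding idea named above: instead of one dispatch queue guarded by one lock, operations are hashed by placement group (or object) into per-shard queues, each with its own worker threads, so unrelated operations stop contending. This is an illustration of the technique, not Ceph's actual OSD code; the shard and thread counts mirror the osd_op_num_shards / osd_op_num_threads_per_shard settings shown later in this deck.

import queue
import threading

class ShardedOpQueue:
    """Per-shard queues and workers; ops on different PGs rarely share a lock."""

    def __init__(self, num_shards=7, threads_per_shard=2):
        self.shards = [queue.Queue() for _ in range(num_shards)]
        self.workers = []
        for shard in self.shards:
            for _ in range(threads_per_shard):
                t = threading.Thread(target=self._worker, args=(shard,), daemon=True)
                t.start()
                self.workers.append(t)

    def submit(self, pg_id, op):
        # Hash the placement group to a shard; ordering is preserved per shard.
        self.shards[pg_id % len(self.shards)].put(op)

    def _worker(self, shard):
        while True:
            op = shard.get()
            op()              # execute the operation
            shard.task_done()

# Usage: q = ShardedOpQueue(); q.submit(pg_id=42, op=lambda: None)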

Page 23: Ceph Optimization on All Flash Storage

23  

BOTTLENECKS IDENTIFIED AND FIXED

23

• Optimized a lot of CPU-intensive code paths
• Found that context-switching overhead is significant when the backend is very fast
• A lot of lock-contention overhead appeared; sharding helped considerably
• Fine-grained locking helped achieve more parallelism
• new/delete overhead on the IO path becomes significant
• Efficient caching of indexes (placement groups) is beneficial
• Efficient buffering while reading from the socket
• Need to disable Nagle's algorithm when scaling out (see the sketch below)
• Needed to optimize tcmalloc for object sizes < 32k
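For reference, disabling Nagle's algorithm is a one-line socket option; Ceph exposes it as ms_tcp_nodelay = true, visible in the cluster configuration later in this deck. A minimal Python illustration, with a hypothetical monitor address:

import socket

# Hypothetical endpoint; the point is the TCP_NODELAY option itself.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
sock.connect(("192.168.1.10", 6789))
# Small messages are now sent immediately instead of being coalesced,
# which matters for latency-sensitive, small-IO messenger traffic.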

Page 24: Ceph Optimization on All Flash Storage

24  

Software: Ceph Cluster Configuration

auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
osd_op_threads = 2
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 7
filestore_op_threads = 3
ms_nocrc = true
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
cephx_sign_messages = false
cephx_require_signatures = false
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false
ms_tcp_nodelay = true
osd_pool_default_size = 3
osd_pool_default_min_size = 2

[osd]
osd_journal_size = 150000
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
osd_mkfs_options_xfs = -K

[mon]
mon_clock_drift_allowed = 1
mon_clock_drift_warn_backoff = 30

Note: each software solution has specific tuning; refer to the user guide for details.

Page 25: Ceph Optimization on All Flash Storage

25  

Time for a benchmark of...

InfiniFlash(TM) System IF500 = CEPH with all the optimizations mentioned, on top of IF100
+ proper filesystem/kernel tuning optimized for the IF100 box

25

Page 26: Ceph Optimization on All Flash Storage

26  

Tools and IO Profiles

Tools used for benchmarking:
• vdbench (0.503)
• fio 2.1.11
• COSBench 0.4.1.0

IO profiles (block); see the fio sketch below:
• Small block: 4K, 8K random
• Medium block: 64K random
• Large block: 256K random

IO profiles (object):
• Object size: 4 MB
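A sketch of how one of these block profiles might be driven with fio from Python; the RBD-backed device path, runtime, and job name are hypothetical, and the flags shown are standard fio options rather than the exact job files used for these results.

import subprocess

def run_fio(device, bs, read_pct, iodepth, runtime_s=300):
    """Launch one fio job roughly matching the block profiles above."""
    cmd = [
        "fio",
        "--name=ifos-profile",           # hypothetical job name
        f"--filename={device}",          # e.g. a mapped RBD block device
        "--ioengine=libaio", "--direct=1",
        "--rw=randrw", f"--rwmixread={read_pct}",
        f"--bs={bs}", f"--iodepth={iodepth}",
        f"--runtime={runtime_s}", "--time_based",
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)

# Example: 8K random, 75% reads, queue depth 16, against a hypothetical device.
# run_fio("/dev/rbd0", bs="8k", read_pct=75, iodepth=16)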

   

 

 

Page 27: Ceph Optimization on All Flash Storage

27  

IF500  topology  on  a  single  512  TB  IF100  

27

• IF-100 bandwidth is ~8.5 GB/s (with 6Gb SAS; 12Gb is coming by end of year) and ~1.5M 4K random-read IOPS
• We saw that Ceph is very resource hungry, so at least 2 physical nodes are needed on top of an IF-100
• All 8 ports of an HBA must be connected to saturate the IF-100 at larger block sizes

Page 28: Ceph Optimization on All Flash Storage

28  

Setup details

28

Performance config: IF500 2-node cluster (32 drives shared to each OSD node)

OSD nodes: 2 servers (Dell R620), each with 2x E5-2680 12C 2.8 GHz, 4x 16GB RDIMM dual rank x4 (64 GB), 1x Mellanox ConnectX-3 dual 40GbE, 1x LSI 9207 HBA card
RBD clients: 4 servers (Dell R620), each with 1x E5-2680 10C 2.8 GHz, 2x 16GB RDIMM dual rank x4 (32 GB), 1x Mellanox ConnectX-3 dual 40GbE
Storage: IF-100 with 64 Icechips in A2 configuration (connected to 64 x 1YX2 Icechips in A2 topology); total storage = 64 x 8 TB = 512 TB
Network: 40G switch (N/A)

OS details:
OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
LSI card / driver: SAS2308 (9207), mpt2sas
Mellanox 40 Gbps NIC: MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)

Cluster configuration:
CEPH version: sndk-ifos-1.0.0.04 (0.86.rc.eap2)
Replication (default): 2 [host]; note: host-level replication
Pools, PGs & RBDs: 4 pools, 2048 PGs per pool, 2 RBDs from each pool
RBD size: 2 TB
Monitors: 1
OSD nodes: 2
OSDs per node: 32 (total OSDs = 32 x 2 = 64)

Page 29: Ceph Optimization on All Flash Storage

29  

8K Random IO

29

[Chart: sum of total ops (0-300,000) and latency (ms) vs. read percentage (0/25/50/75/100) at 1/4/16 threads per worker, comparing stock Ceph with IF500 (Storm 1.0), 8K block size.]

Page 30: Ceph Optimization on All Flash Storage

30  

64K Random IO

30

[Chart: sum of total ops (0-160,000) and latency (ms) vs. read percentage (0/25/50/75/100) at 1/4/16 threads per worker, comparing stock Ceph with IF500 (Storm 1.0), 64K block size.]

We are saturating ~8.5 GB/s IF-100 BW here

Page 31: Ceph Optimization on All Flash Storage

31  

IF500 scale out Topology

31

Page 32: Ceph Optimization on All Flash Storage

32  

IF500 HW set up

32

Performance config: 6-node cluster (8 drives connected to each OSD node)

OSD nodes: 6 servers (Dell R620), each with 2x E5-2680 v2 2.8 GHz (25M cache), 8x 16GB RDIMM dual rank (128 GB), 1x Mellanox ConnectX-3 dual 40GbE, 1x LSI 9207 HBA card
RBD clients: 5 servers (Dell R620), each with 1x E5-2680 v2 2.8 GHz (25M cache), 4x 16GB RDIMM dual rank (64 GB), 1x Mellanox ConnectX-3 dual 40GbE
Network: 40G switch (N/A)

OS details:
OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
LSI card / driver: SAS2308 (9207) / mpt2sas, 16.100.00.00
Mellanox 40 Gbps NIC: MT27500 [ConnectX-3] / mlx4_en 2.2-1

Cluster configuration:
CEPH version: sndk-ifos-1.0.0.07 (0.87-IFOS-1.0.0.7.beta)
Replication (default): 3 (chassis); note: chassis-level replication
Pools, PGs & RBDs: 5 pools, 2048 PGs per pool, 2 RBDs from each pool
RBD size: 3 TB (total data-set size = 30 TB)
Monitors: 1
OSD nodes: 6
OSDs per node: 8 (total OSDs = 6 x 8 = 48)

Page 33: Ceph Optimization on All Flash Storage

33  

4K Random IOPs Performance

33

[Chart: IOPS and latency (ms) vs. [queue depth] read percent.]

With tuning, a maximum cumulative performance of ~700K IOPS was measured for 4KB blocks, saturating the node CPUs at that point.

Page 34: Ceph Optimization on All Flash Storage

34  

64K Random IOPs Performance

34

[Chart: IOPS and latency (ms) vs. [queue depth] read percent.]

A maximum bandwidth of 12.8 GB/s was measured for 64KB blocks.
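A quick sanity check relating the two headline numbers above (pure arithmetic, no new measurements):

# Convert the reported figures between IOPS and bandwidth.
iops_4k = 700_000
print(f"4K:  {iops_4k * 4 * 1024 / 1e9:.1f} GB/s")      # ~2.9 GB/s at 700K IOPS

bw_64k = 12.8e9                                          # bytes/s
print(f"64K: {bw_64k / (64 * 1024):,.0f} IOPS")          # ~195,000 IOPS at 12.8 GB/s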

Page 35: Ceph Optimization on All Flash Storage

35  

90-10 Random Read Performance

35

[Chart: performance vs. [queue depth] and block size for the 90-10 random-read workload.]

Page 36: Ceph Optimization on All Flash Storage

36  

Performance trends

36

IOPS scaling with number of compute nodes (1-4):

Workload            1 node      2 nodes     3 nodes     4 nodes
4K-100%RR-16QD      60597.88    84505.54    121754.19   183827.8
4K-100%RR-64QD      79288.91    105736.83   174073.72   218235.51
16K-90%RR-16QD      37179.3     48681.03    59981.97    67797.82
16K-90%RR-64QD      55431.02    67530.84    85210.88    92720.89

VM configuration:
• Each compute node hosts 4 VMs, and each VM is mapped four 1 TB RBDs.
• VM root disks also come from the same CEPH cluster.
• Each VM has 8 virtual CPUs and 16 GB RAM.

Page 37: Ceph Optimization on All Flash Storage

37  

Flash Economics / IF100 HW Intro

Page 38: Ceph Optimization on All Flash Storage

38  

On flash economics: what do you say to the myth that flash is expensive?

Page 39: Ceph Optimization on All Flash Storage

39  

Our all-flash designs deliver breakthrough features and breakthrough cost for an all-flash array.

Low power. High performance. Scalable. Reliable. Now: breakthrough economics.

Page 40: Ceph Optimization on All Flash Storage

40  

InfiniFlash System
• Ultra-dense all-flash appliance: 512 TB in 3U, best-in-class $/IOPS/TB
• Scale-out software for massive capacity: unified content (block, object); flash-optimized software with programmable interfaces (SDK)
• Enterprise-class storage features: snapshots, replication, thin provisioning

IF500 = InfiniFlash OS (Ceph); ideal for large-scale object storage use cases

Page 41: Ceph Optimization on All Flash Storage

41  

InfiniFlash(TM) HW Platform: IF100
• Capacity: 512 TB raw, all-flash 3U JBOD of Flash (JBOF); up to 64 x 8TB cards; 4TB cards also available in Q1'16
• Scalable performance: 1M IOPS, 1-3 ms latency, 6-8 GB/s throughput (upgrade to 12-15 GB/s, Nov '15)
• 8TB flash-card innovations: enterprise-class power-fail protection; alerts & monitoring; integrated, monitored latching; direct sampling of air temperature; form factor enables the lowest-cost SSD
• Operational efficiency & resiliency: hot-swappable architecture, easy FRU; low power (typical workload 400-500W; 150W idle to 750W max); MTBF 1.5+ million hours
• Hot swappable: fans, SAS expander boards, power supplies, flash cards
• Host connectivity: connect up to 8 servers through 8 SAS ports
• Available now
(Icechip drive card)

Page 42: Ceph Optimization on All Flash Storage

42  

Disaggregation is the key to breakthrough economics!

Old model: monolithic, proprietary storage OS, costly ($$$$$)
New model (InfiniFlash): disaggregated, open software stack, cost effective ($); software-defined storage on standard x86 servers

Advantages of disaggregation:
• Lower TCA and TCO than traditional models
• Rich choice of ecosystem partners: compute, networking, and software stacks

Evidence:
• Decline of the traditional frame-array business

Page 43: Ceph Optimization on All Flash Storage

43  

InfiniFlash for OpenStack with Disaggregation

• Compute & storage disaggregation enables optimal resource utilization
• Allows more CPU to be given to OSDs for small-block workloads
• Allows higher bandwidth provisioning as required for large object workloads
• Independent scaling of compute and storage: higher storage capacity needs don't force you to add more compute, and vice versa
• Leads to optimal ROI for PB-scale OpenStack deployments
• Clients reach the storage farm through LibRBD (QEMU/KVM, Nova with Cinder & Glance), KRBD behind an iSCSI target, and RGW for object storage; a minimal librbd sketch follows
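A minimal sketch of the LibRBD client path referenced above, using the python-rados/python-rbd bindings; the conf path, pool, and image names are hypothetical, and a reachable cluster keyring is assumed.

import rados
import rbd

# Connect as a client (hypothetical conf path and pool/image names).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("volumes")                  # e.g. the Cinder pool
    try:
        rbd.RBD().create(ioctx, "vm-disk-0", 10 * 1024**3)  # 10 GiB image
        image = rbd.Image(ioctx, "vm-disk-0")
        image.write(b"hello from librbd", 0)                # aligned IO in practice
        print(image.read(0, 17))
        image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()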

[Diagram: OpenStack compute farm (LibRBD via QEMU/KVM with Nova, Cinder & Glance; KRBD behind an iSCSI target; RGW behind a web server) consuming LUNs and objects from a storage farm of IF100 enclosures (HSEB A / HSEB B) running the OSDs.]

Page 44: Ceph Optimization on All Flash Storage

44  

TCO Example (for Hadoop): Flash Performance at the TCO of HDD

InfiniFlash models are comparable to HDD-based TCO with all the benefits of flash. Note that operational/maintenance costs and performance benefits are not accounted for in these models. The $0.60/GB used is an average representative across high- to low-end product offerings. The price-reduction roadmap allows further InfiniFlash cost improvements not shown in this average model.

[Charts: 3-year TCO comparison (TCA plus 3-year opex, $0-$120M) and total rack count (0-300) for traditional Hadoop, InfiniFlash primary flash + HDD tier, InfiniFlash with S3 + disaggregated MapReduce, and InfiniFlash S3 with erasure coding.]

Page 45: Ceph Optimization on All Flash Storage

45  

Example: 6PB data-lake solution on InfiniFlash vs. an HDD array

• InfiniFlash results in TCA reduction and even higher opex savings
• Deterministic QoS with an all-flash architecture vs. an HDD array
• InfiniFlash with an erasure-coded configuration at a 9:3 ratio leads to higher storage efficiency (8PB raw, 6PB usable; see the arithmetic sketch below)
• The InfiniFlash technology roadmap leads to ~10% YoY cost reduction
• Further savings (not shown) on operational & maintenance costs related to the higher failure rate of HDDs vs. flash; flash does not require over-provisioning of storage to compensate for inconsistent performance due to high HDD failure rates
• Service costs are not included for either product

Estimate used for Isilon: $1.2M per 1PB of usable capacity
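The 8PB-raw-to-6PB-usable figure follows directly from the 9:3 erasure-coding ratio; a small arithmetic sketch, with the 3-way-replica comparison added for contrast:

def usable_capacity(raw_pb, k, m):
    """Usable capacity of an erasure-coded pool: raw * k / (k + m)."""
    return raw_pb * k / (k + m)

print(usable_capacity(8, k=9, m=3))   # 6.0 PB usable from 8 PB raw (1.33x overhead)

# The same 6 PB usable under 3-way replication would need 6 * 3 = 18 PB raw (3x overhead).
print(6 * 3)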

[Chart: 3-year TCA plus opex (in $ millions) for Isilon vs. InfiniFlash with EC; roughly $1.5M in savings estimated for a 6PB deployment.]

Page 46: Ceph Optimization on All Flash Storage

46  

CEPH  RECOVERY  OPTIMIZATION  

Page 47: Ceph Optimization on All Flash Storage

47  

Client IO recovery after 1 OSD drive down/out with default settings
IOPS: 2 LUNs per client (5 clients total). Profile: 64K block random read.

[Chart: IOPS (0-180,000) vs. recovery time in seconds (30-5,880), optimal performance vs. performance during recovery.]

Notes:
a. Recovery parameters at default settings.
b. Recovery/rebalancing time is around 90 minutes when one OSD goes down under an IO load of 64K random read at QD 64 (user data 30 TB; with 3-way replication the total data on the cluster is 90 TB, and each OSD holds around 2 TB).
c. Average performance degradation is around 60% while recovery is happening.

Page 48: Ceph Optimization on All Flash Storage

48  

Client IO with recovery after the ‘down’ OSD is ‘in’

48

[Chart: IOPS vs. recovery time in seconds.]

Notes:
a. Recovery/rebalancing time is around 60 minutes when the OSD comes back in under an IO load of 64K random read at QD 64.
b. Average performance degradation is around 5% while recovery is happening.
c. Recovery parameters give priority to client IO.

Page 49: Ceph Optimization on All Flash Storage

49  

Client IO recovery after 1 OSD drive down/out with tuning
IOPS: 2 LUNs per client (5 clients total). Profile: 64K random read.

Notes:
a. Recovery parameter settings give priority to client (application) IO; a sketch of how such settings can be applied at runtime follows.
b. Recovery/rebalancing time is around 90 minutes when one OSD goes down under an IO load of 64K random read at QD 64 (user data 10 TB; with 3-way replication the total data on the cluster is 30 TB, and each OSD holds around 1.2 TB).
c. Average performance degradation is around 15% while recovery is happening.
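A sketch of applying client-IO-priority recovery settings at runtime with ceph tell ... injectargs, driven from Python; the values mirror the osd_recovery_max_active / osd_max_backfills / osd_recovery_op_priority settings shown in the cluster configuration later in this deck, and applying them via injectargs rather than via ceph.conf plus restart is an assumption of this sketch.

import subprocess

# Throttle recovery so client IO keeps priority (values as in the tuned config).
recovery_opts = {
    "osd_recovery_max_active": 1,
    "osd_max_backfills": 1,
    "osd_recovery_threads": 1,
    "osd_recovery_op_priority": 1,
}

for opt, val in recovery_opts.items():
    # Inject into all running OSDs without a restart.
    subprocess.run(
        ["ceph", "tell", "osd.*", "injectargs", f"--{opt.replace('_', '-')} {val}"],
        check=True,
    )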

[Chart: IOPS (0-250,000) vs. recovery time in seconds (0-5,850), optimal vs. during recovery.]

Page 50: Ceph Optimization on All Flash Storage

50  

Client IO with recovery after 1 OSD drive comes back IN, with tuning
IOPS: 2 LUNs per client (5 clients total).

Notes:
a. Recovery parameters give priority to client IO while the drive rejoins the cluster.
b. Recovery/rebalancing time is around 60 minutes when one OSD comes back in under an IO load of 64K random read at QD 64.
c. Average performance degradation is around 5% while recovery is happening.

[Chart: IOPS (0-250,000) vs. recovery time in seconds (0-5,850), optimal vs. during recovery.]

Page 51: Ceph Optimization on All Flash Storage

51  

Client IO recovery after a chassis down/out with default settings
IOPS: 2 LUNs per client (5 clients total). Profile: 64K random read.

Notes:
• Recovery parameters at the default priority for client IO.
• Recovery/rebalancing time is around 6 hours when one chassis goes down under an IO load of 64K random read (user data 30 TB; with 3-way replication the total data on the cluster is 90 TB, and each chassis holds around 30 TB).

[Chart: IOPS (0-200,000) vs. recovery time in seconds (30-21,090), optimal vs. during recovery.]

Page 52: Ceph Optimization on All Flash Storage

52  

Recovery IO performance, 64K random read at QD 64: chassis down/out
IOPS: 2 LUNs per client (5 clients total).

Notes:
a. Recovery parameters give priority to client IO.
b. Recovery/rebalancing time is around 7 hours when one chassis is down under an IO load of 64K random read.

[Chart: IOPS (0-250,000) vs. recovery time in seconds (0-29,250), optimal vs. during recovery.]

Page 53: Ceph Optimization on All Flash Storage

53  

Client IO with recovery after the chassis comes back in, with tuning
IOPS: 2 LUNs per client (5 clients total).

Note: Recovery/rebalancing time is around 2 hours after re-adding the removed chassis under an IO load of 64K random read at QD 64 (user data 10 TB; with 3-way replication the total data on the cluster is 30 TB, and each chassis holds around 10 TB).

[Chart: IOPS (0-250,000) vs. recovery time in seconds (0-8,820), optimal vs. during recovery.]

Page 54: Ceph Optimization on All Flash Storage

54  

Object Storage Topology

Note: Clients are not shown in the topology above; COSBench is used to generate the workload from the clients.

Benchmark environment:
• Ceph release: Giant (0.87.1)
• Object interface: S3 (see the client sketch below)
• Web server: Civetweb

Workload generator:
• COSBench 0.4.1.0

IO profiles:
• 4MB object random read
• 4MB object random write
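A minimal sketch of an S3 client exercising the RGW/Civetweb endpoint described above, using boto3. The endpoint mirrors the civetweb port=8081 / rack1-ramp-1 host shown in the RGW configuration later in this deck; the credentials and bucket name are hypothetical, and 4 MB objects match the benchmark profile.

import os
import boto3

# Hypothetical credentials; endpoint as configured for RGW in this deck.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rack1-ramp-1:8081",
    aws_access_key_id=os.environ["RGW_ACCESS_KEY"],
    aws_secret_access_key=os.environ["RGW_SECRET_KEY"],
)

s3.create_bucket(Bucket="cosbench-test")

payload = os.urandom(4 * 1024 * 1024)                  # 4 MB object, as in the profile
s3.put_object(Bucket="cosbench-test", Key="obj-0001", Body=payload)

obj = s3.get_object(Bucket="cosbench-test", Key="obj-0001")
assert len(obj["Body"].read()) == len(payload)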

 

Page 55: Ceph Optimization on All Flash Storage

55  

Hardware Configuration
Performance config: DA IF100 with quarter population; 6-node cluster (8 drives connected to each OSD node)

OSD nodes: 6 servers (Dell R720), each with 2x E5-2680 v2 2.8 GHz (25M cache), 8x 16GB RDIMM dual rank (128 GB), 1x Mellanox ConnectX-3 dual 40GbE, 1x LSI 9207 HBA card
RGW gateways (RADOS gateway servers): 4 servers (Dell R620), each with 1x E5-2680 v2 2.8 GHz (25M cache), 4x 16GB RDIMM dual rank (64 GB), 1x Mellanox ConnectX-3 dual 40GbE
Storage: IF100 with 16 drives (partitioned 8 drives to each storage node using reservation); the IF100 chassis is connected to 2 hosts with reservation; total storage = 16 x 8 TB = 128 TB
IF100 package version: DA build, FFU 1.0.0.31.1
Network: 40G switch, Brocade VDX 8770

OS details:
OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-24
LSI card / driver: SAS2308 (9207) / mpt2sas, 16.100.00.00
Mellanox 40 Gbps NIC: MT27500 [ConnectX-3] / mlx4_en 2.2-1

Cluster configuration:
CEPH version: sndk-ifos-1.0.0.07 (0.87-IFOS-1.0.0.8.beta)
Replication (default): 3 (chassis); note: chassis-level replication (depends on the test)
Pools & PGs: 5 pools, 2048 PGs per pool; EC pool
Monitors: 3 (3 monitors recommended for deployment)
OSD nodes: 6
OSD drives per OSD node: 8 (total OSDs = 6 x 8 = 48)

Page 56: Ceph Optimization on All Flash Storage

56  

Software: Ceph Cluster Configuration

auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
osd_op_threads = 2
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 7
filestore_op_threads = 3
ms_nocrc = true
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
cephx_sign_messages = false
cephx_require_signatures = false
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false
ms_tcp_nodelay = true
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_recovery_max_active = 1
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1
rbd_cache = false
objecter_inflight_ops = 2048000
objecter_inflight_bytes = 10485760000000000
rgw_thread_pool_size = 1024
rgw_cache_enable = true

[osd]
osd_journal_size = 150000
osd_mkfs_options_xfs = -K
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false

[mon]
mon_clock_drift_allowed = 30
mon_clock_drift_warn_backoff = 30

[client.radosgw.gateway-1]
host = rack1-ramp-1
keyring = /etc/ceph/ceph.client.admin.keyring
rgw_socket_path = /var/log/ceph/radosgw1.sock
log_file = /var/log/ceph/radosgw-1.rack1-ramp-1.log
rgw_max_chunk_size = 4194304
rgw_frontends = "civetweb port=8081"
rgw_dns_name = rack1-ramp-1
rgw_ops_log_rados = false
rgw_enable_ops_log = false
rgw_cache_lru_size = 1000000
rgw_enable_usage_log = false
rgw_usage_log_tick_interval = 30
rgw_usage_log_flush_threshold = 1024
rgw_exit_timeout_secs = 600

Page 57: Ceph Optimization on All Flash Storage

57  

EC Performance on InfiniFlash

Erasure-coded pool configuration (see the pool-creation sketch below):
• Host-level erasure coding, 4:2 (k:m)
• Plugin: jerasure; technique: cauchy_good
• .rgw.buckets is in the erasure-coded pool; the other pools are replicated

Observations:
• Object performance scales up with an increasing number of RGW servers & instances
• Erasure coding on InfiniFlash is very CPU efficient compared to HDDs
• For all read/write and optimal/recovery tests, latency is 20%-50% better with Cauchy-Good

Profile               BW (GB/s)   90% resp. time (ms)   Avg. OSD-node CPU (%)   Disk usage (%)   OSD-node network      Avg. RGW CPU (%)   RGW network (send/recv)   Avg. total disk BW per OSD node (GB/s)
4M_write (ds: 20 TB)  4.1         820                   40-45                   80-90            1.6 GB/s receive      20-25              1.1G / 1.1G               2.2
4M_read (ds: 20 TB)   14.43       220                   50-55                   85-93            4.2 GB/s send         50-52              80-85                     2.2

Ceph gateway servers: 4 RGW servers, 16 RGW instances, 4 clients.
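A sketch of creating such a pool with the ceph CLI from Python; the profile name and PG count are illustrative examples, the pool name follows the .rgw.buckets configuration above, and release-specific flags such as the failure-domain option are omitted, so treat this as an assumption-laden outline rather than the exact commands used.

import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# 4:2 jerasure profile using the cauchy_good technique, as benchmarked above.
ceph("osd", "erasure-code-profile", "set", "ec-4-2-cauchy",
     "k=4", "m=2", "plugin=jerasure", "technique=cauchy_good")

# Put the RGW data pool on the EC profile (PG count is an example).
ceph("osd", "pool", "create", ".rgw.buckets", "2048", "2048",
     "erasure", "ec-4-2-cauchy")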

 

Page 58: Ceph Optimization on All Flash Storage

58  

Write Performance with Scaling OSD

58

Write object performance with an EC pool, scaling from 48 drives to 96 drives:

Profile     OSD drives   RGW servers   RGW instances   Clients   Workers   BW (GB/s)   90% resp. time (ms)   Avg. OSD CPU (%)   Avg. RGW CPU (%)   Avg. COSBench CPU (%)
4M_Write    48           4             16              4         512       2.86        1870                  40                 10                 15
4M_Write    96           4             16              4         512       4.9         420                   65-70              30-35              40-45

With a 2x increase in OSD drives, write performance improves by around 65-70%.

Page 59: Ceph Optimization on All Flash Storage

59  

Optimal EC Performance: Cauchy-Good

Profile               RGW servers   RGW instances   Clients   Workers   BW (GB/s)   90% resp. time (ms)   Avg. OSD CPU (%)   Avg. RGW CPU (%)
4M_Write (ds: 20 TB)  4             16              4         512       4.1         820                   40-45              20-25
4M_Read (ds: 20 TB)   4             16              4         512       14.33       220                   50-55              50-52

• Cauchy-Good optimal write performance is at least 33% better than Reed-Solomon; some runs showed around 45% improvement.
• Cauchy-Good optimal read performance is 10% better; there was little room for improvement, since we were already saturating the network bandwidth of the RGW nodes.

Page 60: Ceph Optimization on All Flash Storage

60  

Degraded EC Read Performance: Reed-Solomon vs. Cauchy-Good

Technique      Cluster state             Profile    BW (GB/s)   90% resp. time (ms)   Avg. OSD CPU (%)
Cauchy_good    Degraded (1 node down)    4M_Read    12.9        230                   71
Cauchy_good    Degraded (2 nodes down)   4M_Read    9.83        350                   75-80
RS             Degraded (1 node down)    4M_Read    8.26        250                   60
RS             Degraded (2 nodes down)   4M_Read    3.88        740                   50-55

Cauchy-Good does far better than the Reed-Solomon technique for optimal and recovery IO performance; on write performance it achieves almost double the bandwidth.

Page 61: Ceph Optimization on All Flash Storage

61  

Replication vs. Erasure-Coded Pool

Single-gateway-node performance comparison between replicated and erasure-coded pools.

Erasure-coded pool configuration:
• OSD-level erasure coding, 9:3 (k:m)
• Plugin: jerasure; technique: Reed-Solomon
• .rgw.buckets is in the erasure-coded pool; the other pools are replicated

            Bandwidth (GB/s)   90% resp. time (ms)
EC-Write    1.1                170
Rep-Write   1.39               130
EC-Read     2.5                320
Rep-Read    2.58               250

• No major degradation in performance for the EC pool compared to the replicated pool, for either reads or writes.

Page 62: Ceph Optimization on All Flash Storage

62  

CPU/Memory/Disk  Usage  

Page 63: Ceph Optimization on All Flash Storage

63  

4K  Random  Read  Performance  stats  for  Block  IO  

[Charts vs. queue depth (1-256): IOPS (up to ~800,000), latency (ms), OSD-node CPU usage (%), OSD-node network throughput (MB/s), OSD-node disk usage (%), and average client CPU usage (%).]

Page 64: Ceph Optimization on All Flash Storage

64  

64K  Random  Read  Performance  stats  for  Block  IO  

[Charts vs. queue depth (1-256): IOPS (up to ~200,000), latency (ms), OSD-node CPU usage (%), OSD-node disk usage (%), OSD-node network throughput, and average client CPU usage (%).]

Page 65: Ceph Optimization on All Flash Storage

65  

256K  Random  Read  Performance  stats  for  Block  IO  

[Charts vs. queue depth (1-256): throughput (MB/s, up to ~12,000), latency (ms), OSD-node CPU usage (%), OSD-node network throughput (MB/s), OSD-node disk usage (%), and average client CPU usage (%).]

Page 66: Ceph Optimization on All Flash Storage

66  

Response-Time Variation for 64K 100% Random Read

[Histogram: % of IO per latency range, 64K random read on a single RBD (99.99% of the data).]

Latency range (µs)   % of IO
100                  0.01
250                  0.01
500                  0.03
750                  0.04
1,000                0.23
2,000                13.91
4,000                67.05
10,000               18.64
20,000               0.02
50,000               0.02
100,000              0.01
250,000              0.06
500,000              0.01
750,000              0.01

Page 67: Ceph Optimization on All Flash Storage

67  

Erasure Coding Summary

• Cauchy-Good does far better than the Reed-Solomon technique for optimal and recovery IO; on write performance it achieves almost double the bandwidth (Reed-Solomon: 2.6 GB/s, Cauchy-Good: 4.1 GB/s).
• Cauchy-Good optimal write performance is at least 33% better than Reed-Solomon; some runs showed around 45% improvement.
• Cauchy-Good optimal read performance is 10% better; with the network bandwidth of the OSD nodes saturated, there was no room for further improvement.
• Cauchy-Good recovery read performance is better than Reed-Solomon by at least 33%; some runs showed around 50% improvement for single-OSD-node-down scenarios.
• For fully degraded scenarios (2 nodes down), Cauchy-Good recovery read performance is better than Reed-Solomon by at least 100%; some runs showed around 125% read improvement.
• For all read/write and optimal/recovery cases, latency with Cauchy-Good is 20%-50% better; in some runs it is even 80% better.
• For writes we are able to saturate the maximum disk bandwidth with Cauchy-Good; for reads there is not much scope to saturate the disk bandwidth.
• With about 20% higher CPU utilization in some cases, we improve performance by about 33% with Cauchy-Good, which points to Cauchy-Good as the encoding technique to choose.

Page 68: Ceph Optimization on All Flash Storage

68  

IF500 future goals

•  Improved  write  performance  

•  Improved write amplification

•  Improved  mixed  read-­‐write  performance  

•  Convenient  end-­‐to-­‐end  management/monitoring  UI  

•  Easy  deployment  

•  RDMA  support  

•  EC  with  block  interface  

 

68

Page 69: Ceph Optimization on All Flash Storage

69  

Fully Validated Software-Defined All-Flash Solutions for Scale-Up and Scale-Out Applications

Scale up and scale out on InfiniFlash hardware, with a choice of software:
• File: IBM GPFS, Gluster, CloudByte
• Block: ION Accelerator

For databases, VDI, OpenStack, cloud backends, file, print, and more.

Page 70: Ceph Optimization on All Flash Storage

70  

Enterprise Public Cloud SaaS: InfiniFlash with GPFS Scale-Out

• 1 GPFS enterprise license per rack; GPFS client plug-in only
• IF100 advantages:
  • All-flash at lower TCO than HDD arrays
  • Disaggregated compute
  • 18-node license reduced to 3, with the client plug-in added
  • IF100 performance: 7 GB/s now, 14 GB/s in Q4'15
  • 128 TB of T1/T2 storage served from a single box; best capacity utilization
  • Local protection available with a 2-way replica vs. 3-way on HDD
  • Moving to erasure coding with GNR will further improve capacity utilization

Page 71: Ceph Optimization on All Flash Storage

71  

GPFS Integration & Optimization
• @Scale testing by SanDisk, hosting multiple customer PoCs
• GPFS basic certification by IBM: ongoing now; preliminary results available from IBM Labs
• IBM GNR certification & a fully integrated solution by SanDisk: target Q4'15
• The solution will be offered by partners as a complete reference architecture with a choice of server vendor

12-Month Roadmap Delivers a Performance Doubling
• Expander upgrade to 12G doubles throughput in Nov 2015; the 32+ Icechip scaling limitation (6G expander) is resolved
• 4TB Icechip introduction in early Q1 2016 right-sizes capacity to bandwidth
• The 12G Icechip upgrade will double performance once again, maximizing box performance at low capacity

IF100 + GPFS solution reference architecture at SanDisk; IBM Labs certification in progress

Page 72: Ceph Optimization on All Flash Storage

72  

OpenStack Private Cloud on InfiniFlash IF500 (Telco Customer)

• High-performance OpenStack private cloud infrastructure on InfiniFlash
• Cinder storage driven by Ceph on IF500
• Latency-sensitive workloads migrate first: Splunk log analytics and Kafka messaging run on InfiniFlash
• 10x performance over HDD-based Ceph; co-exists with the lower-performance HDD-based Ceph
• Highly scalable with the lowest TCO: a 2-copy model, reduced from the 3-copy HDD model
• Ease of deployment with reduced footprint, power, and thermal management

[Diagram: compute farm running Splunk and Kafka in VMs (QEMU/KVM via LibRBD) consuming LUNs from a storage farm of IF100 enclosures (HSEB A / HSEB B).]

Page 73: Ceph Optimization on All Flash Storage

73  

Before and with IF500: OpenStack Private Cloud on InfiniFlash IF500 (Telco Customer)

Before:
• The customer deployed an OpenStack-based private cloud on Red Hat Ceph.
• Pain points: Cinder storage based on Ceph on HDD could not exceed 10K IOPS (8K blocks), limiting private-cloud expansion to include higher-performance applications.
• Solutions considered before IF500 did not work: alternate solutions with separate storage architectures for high-performance applications add significant costs and higher management overhead, defeating the purpose of OpenStack.

With IF500:
• Able to meet the higher performance goals with the same IF500 Ceph-based architecture without disturbing the existing infrastructure.
• The Ceph cluster on HDD co-exists with the IF500 Ceph cluster; applications get deployed on either based on performance needs.
• Initial workloads migrated: Splunk log analytics, Apache Kafka messaging, Cassandra (next target).
• Adding performance in a lower-TCO footprint: 2x 128TB IF500, expanding to 256TB in the next phase; expected to reduce real estate to less than 1/3; >50% power reduction expected.

Page 74: Ceph Optimization on All Flash Storage

74  

SanDisk is introducing disaggregated software-defined all-flash solutions for target workloads, spanning scale-up and scale-out use cases: Hadoop, object and hyperscale storage, xSP and enterprise unified storage with mixed workloads, CDN edge caching & streaming, industry vertical apps (object based), NoSQL databases, Web 2.0 indexing, OCP, active archive, Hadoop HDFS tiered, video surveillance, enterprise and hyperscale custom block apps, enterprise block, data warehouse, HPC and file-based workloads, collaborative decision support, business processing, app dev, Tier 1/Tier 2 databases, data lakes @scale (object, file), IT infrastructure, web infrastructure, industrial R&D, online media storage & streaming, Windows apps (Exchange, SharePoint), and back-ups.

SanDisk offers complete disaggregated reference designs for target workloads. Reference designs include CNSSS (compute, networking, storage, software, support).

Page 75: Ceph Optimization on All Flash Storage

Flash Memory Summit 2015, Santa Clara, CA

Thank you! Questions? [email protected]

© 2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash and IF500 are trademarks of SanDisk Enterprise IP LLC. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).

Page 76: Ceph Optimization on All Flash Storage

76  

This  is  especially  important  at  the  petabyte  scale.  

Page 77: Ceph Optimization on All Flash Storage

77  

GPFS  

References:
• Hussein N. El-Harake and Thomas Schoenemeyer, "Detailed Performance Analysis of Solid State Disks," Swiss National Supercomputing Centre (CSCS), Manno, Switzerland. [email protected], [email protected]
• Madhav Ponamgi and Joshua Blumert, "Integrating FusionIO SSD with GPFS for IOP Performance," IBM Advanced Technical Skills Techdocs, version 10/25/2010, http://w3.ibm.com/support/Techdocs

Page 78: Ceph Optimization on All Flash Storage

78  

Efficient and High-Performance Cloud Solution for IBM Tivoli Storage Manager


At SanDisk, we’re expanding the possibilities of data storage. For more than 25 years, SanDisk’s ideas have helped transform the industry, delivering next generation storage solutions for consumers and businesses around the globe.


[System overview. Before: TSM servers with databases on internal disks in a RAID. After: a single TSM server with the databases moved to a 320GB Fusion ioMemory card (3:1 server consolidation).]


“The savings were enormous. We are now able to fully utilize the processors in our TSM servers. This allows us to cut our server footprint (and accompanying floor space, power and cooling costs) by two-thirds,” Geirr said. “At the same time, we’ve increased the speed of database backups and tasks like inventory expiration jobs, which makes our servers more responsive. Our customers have noticed the difference.”

Easy Implementation and Rapid Software Upgrade

Implementing the Fusion ioMemory cards proved much easier than upgrading to a SAN.

Geirr told us, “Adding a SAN would have required serious planning and thought toward disk architecture, and the physical upgrade would have taken days. It took 30-45 minutes to physically install the Fusion ioMemory cards and copying the databases was very fast.”

Front-safe also took this opportunity to upgrade from TSM 5.5 to 6.2 and discovered rapid upgrades were another unexpected benefit. “IBM estimates a 50-100GB upgrade will take one to three days on a disk-based system,” Geirr said. “The same upgrade on a SanDisk powered system finished in less than five hours.”

Summary

Implementing the Fusion ioMemory solution gave Front-safe the following benefits:• Query times from minutes to seconds• Tivoli expiration jobs from hours to minutes• 3x higher TSM workload• 3:1 server consolidation from improved workload• Easy implementation• 5x faster TSM software upgrade

About the Customer

Front-safe is a Danish-owned company in the JS Holding Group, which in total counts more than 100 employees. Front-safe is a dedicated and focused supplier of backup and archiving solutions, proactively protecting businesses of all types and sizes from data loss. Front-safe operates in Denmark with redundant data centers and offices in Aarhus and Copenhagen. Front-safe provides automated and secure remote online backup, disaster recovery and archiving solutions, freeing its customers’ resources so they can focus on their core businesses.

The performance results discussed herein are based on internal Front-safe testing and use of Fusion ioMemory products. Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications.

©2014 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. Fusion ioMemory, and others are trademarks of SanDisk Enterprise IP LLC. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).

Page 79: Ceph Optimization on All Flash Storage

79  

TSM  (or  any  other  Backup  /  Restore  app)  

CASE STUDY

The Challenge

Front-safe is a dedicated and focused supplier of backup and archiving solutions, proactively protecting businesses of all types and sizes from data loss. IBM Denmark’s Cloud Partner of the Year in 2010, Front-safe offers a unique online, user-friendly web interface that tightly integrates with IBM’s Tivoli Storage Manager (TSM), creating a cloud solution that makes TSM easily available to large and small businesses. Front-safe automates backup and recovery, while also providing reporting and dynamic invoicing.

As Front-safe’s customer base and data volume grew, delivering the best-in-class service its customers expected began to be challenging.

Geirr G. Halvorsen, Senior Systems Engineer, told us the challenges Front-safe faced: “As the database grew, maintenance tasks on this database were becoming very time consuming, using resources needed for other tasks. Server resources showed low processor utilization, but high I/O demand. Our goal was to reduce the time we spent on database maintenance tasks, improve the utilization of our server resources and, if possible, consolidate IBM Tivoli Storage Manager Servers.”

The SanDisk® Solution

“As we researched solutions, we looked at adding more disks and using them differently to maximize I/O, but that would require a SAN, which would increase our data center footprint, increase power and cooling costs, and still wouldn’t give us enough performance. Conventional SSDs didn’t provide us the capacity,” Geirr told us. “We chose the Fusion ioMemory™ solution because it gave us the highest performance with our existing hardware.”

Securing Front-safe Performance

The Fusion ioMemory card dramatically improved the performance of the TSM database. Geirr told us, “Front-safe worked closely with SanDisk staff to install the Fusion ioMemory card and measure performance. We moved several of our highest workload databases onto the SanDisk products and saw significant performance improvements. For example, inventory expiration jobs were reduced from hours to minutes and operational queries from minutes to seconds. This freed up resources for additional database work.”

A Smarter System for Better Service and Attractive Pricing

Front-safe used the increased workload capabilities to reduce its infrastructure needs and operating costs.

Front-safe Creates an Efficient and High-Performance Cloud Solution for IBM Tivoli Storage Manager. The backup and archive solutions provider improves performance and slashes its database server footprint to offer best-in-class service at an attractive price.

Solution focus:
• Tivoli Storage Manager (TSM)
• Cloud computing

Summary of benefits:
• Query times from minutes to seconds
• Tivoli expiration jobs from hours to minutes
• 3x higher TSM workload
• 3:1 server consolidation from improved workload
• Easy implementation
• 5x faster TSM software upgrade


Page 80: Ceph Optimization on All Flash Storage

80  

GPFS  

Page 81: Ceph Optimization on All Flash Storage

Thank you! [email protected]

http://itblog.sandisk.com/   @SanDiskDataCTR   @BigDataFlash

© 2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash, Optimus MAX, and Fusion ioMemory are trademarks of SanDisk Corporation. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).