Mantle: A Programmable Metadata Load Balancer for the Ceph File System
Michael Sevilla, Mantle, Symposium ‘15
Mantle: A Programmable Metadata Load Balancer for the Ceph File System
Michael A. Sevilla, Noah Watkins, Carlos Maltzahn, Ike Nassi, Scott A. Brandt, Sage A. Weil*, Greg Farnum*, Sam Fineberg^
UC Santa Cruz, *Red Hat, ^HP Storage
Published at Supercomputing 2015
Outline: 1. FS Metadata Mngmt 2. CephFS Background 3. Complexity of DSP 4. Mantle 5. Evaluation
Separating Metadata & Data IO
File System
metadata service
Separating Metadata & Data IO
Distributed File System
object store
History: A Simple Solution
• 1 MDS is insufficient [McKusick et al., login; '10], [Beaver et al., OSDI '10], [Thusoo et al., SIGMOD '10]
• How do we distribute metadata?
History: Scalable Solutions
1. Hash file identifier
2. Subtree partitioning
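The two approaches named on this slide can be contrasted with a small sketch. Everything here is illustrative: the cluster size, the subtree-to-server table, and the function names are assumptions, not anything taken from the talk or from CephFS.

```python
import hashlib

NUM_MDS = 4  # hypothetical cluster size


def mds_by_hash(path: str, num_mds: int = NUM_MDS) -> int:
    """Strategy 1: hash the file identifier (here, the path) to pick a server."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_mds


# Strategy 2: subtree partitioning, using a hypothetical assignment table.
SUBTREE_OWNERS = {
    "/": 0,
    "/home": 1,
    "/home/alice": 2,
    "/usr": 3,
}


def mds_by_subtree(path: str, owners: dict = SUBTREE_OWNERS) -> int:
    """Return the owner of the deepest assigned subtree that contains path."""
    parts = [p for p in path.split("/") if p]
    # Walk from the full path up toward the root until an assigned subtree is found.
    for depth in range(len(parts), -1, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in owners:
            return owners[prefix]
    raise KeyError("no root subtree assigned")
```

Under subtree partitioning, files in the same directory resolve to the same server, which preserves locality; hashing depends only on the identifier, so siblings are generally scattered across servers.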
Outline
1. File System Metadata Management
2. CephFS Background
3. Complexity of Dynamic Subtree Partitioning
4. Mantle
5. Evaluation
CephFS Background
Example File System Workload
• Linux kernel compile locality
• Shade of Red: locality
[Figure: x-axis: Time; legend: fewer inode reads/writes to many inode reads/writes]
CephFS Hotspot Detection!
Migration!
Does CephFS work?
[Figure annotations: "what we want"; "bad" (three times)]
Complexity of Dynamic Subtree Partitioning
MDS Cluster
[Diagram: MDS cluster over RADOS and a hierarchical namespace; labels: recv HB, rebalance, migrate?, partition cluster, partition namespace, fragment, migrate]
Why not?
Migration Policies
• How to calculate load?
• When to move load?
• Where to move load?
• How much to move?
CephFS’s Policies
“weighted ∑ [illegible]”
“weighted ∑ [illegible]”
“greater than average”
“underload MDS”
“equal load across cluster”
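Read against the four migration-policy questions, these phrases suggest a balancer along the following lines. This is a rough sketch, not CephFS code: the metric names, the weights, and the proportional split among targets are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical metric weights; the slides do not give the real ones.
WEIGHTS = {"inode_reads": 1.0, "inode_writes": 2.0}


def load(metrics: dict) -> float:
    """How to calculate load: a weighted sum of per-server metrics."""
    return sum(w * metrics.get(name, 0.0) for name, w in WEIGHTS.items())


def plan_migration(me: int, cluster: list) -> dict:
    """Return {target_mds: amount_of_load_to_send} for server `me`."""
    loads = [load(m) for m in cluster]
    avg = mean(loads)
    # When: only if this server's load is greater than the cluster average.
    if loads[me] <= avg:
        return {}
    surplus = loads[me] - avg
    # Where: underloaded servers, i.e. those below the average.
    deficits = {i: avg - l for i, l in enumerate(loads) if l < avg}
    total_deficit = sum(deficits.values())
    # How much: aim for equal load by sharing the surplus in proportion
    # to how far each target sits below the average (an assumption).
    return {i: surplus * d / total_deficit for i, d in deficits.items()}
```

For example, with loads of 100, 0, and 20 the average is 40, so server 0 would plan to send 40 units to server 1 and 20 units to server 2, while servers 1 and 2 plan nothing.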
Different Balancers for Different Workloads
• Which heuristics should we use? [Weil et al., SuperComputing ‘04] [Patil et al., FAST ‘11] [Pai et al., ASPLOS ‘98]
Good for mixed workloads
Good for create-heavy workloads
Simple implementation
Mantle
http://synapostasy.blogspot.com/2007/10/cephalopod-awareness-day.html
Different Balancers for Different Workloads
• Which heuristics should we use? [Weil et al., SuperComputing ‘04] [Patil et al., FAST ‘11] [Pai et al., ASPLOS ‘98]
MDS Cluster
Mantle API
Implementation: API + Environment
MDS Cluster
rebalance
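One way to picture an "API + environment" split is a fixed rebalance mechanism that consults injectable policy callbacks for the four migration questions. The sketch below is hypothetical throughout: the `Policy` class, its field names, the `CAP` threshold, and the `requests` metric are invented for illustration and are not Mantle's actual interface, which this transcript does not show.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Policy:
    """Hypothetical bundle of the four policy decisions as callables."""
    load: Callable[[Metrics], float]                    # how to calculate load
    when: Callable[[int, List[float]], bool]            # when to move load
    where: Callable[[int, List[float]], List[int]]      # where to move load
    howmuch: Callable[[int, int, List[float]], float]   # how much to move


def rebalance(me: int, env: List[Metrics], policy: Policy) -> Dict[int, float]:
    """Mechanism: consult the policy and return {target: load to send}."""
    loads = [policy.load(m) for m in env]
    if not policy.when(me, loads):
        return {}
    return {t: policy.howmuch(me, t, loads) for t in policy.where(me, loads)}


CAP = 100.0  # hypothetical per-server capacity

# Illustrative policy: shed everything above CAP to the least-loaded server.
cap_policy = Policy(
    load=lambda m: m.get("requests", 0.0),
    when=lambda me, loads: loads[me] > CAP,
    where=lambda me, loads: [min(range(len(loads)), key=loads.__getitem__)],
    howmuch=lambda me, target, loads: loads[me] - CAP,
)
```

The point of the design is that `rebalance` never changes; swapping in a different `Policy` changes the balancing behaviour without touching the mechanism.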
Balancers
• Greedy Spill Balancer
• Fill & Spill Balancer
• Adaptable Balancer
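The slide only names the balancers, so the following is a guess at the first two based on their names: greedy spill is assumed to hand half of a server's load to an idle neighbor, and fill & spill is assumed to keep load local until a capacity threshold is crossed and then shed a fixed fraction. The threshold, the fraction, and the "next server" targeting are all assumptions. The adaptable balancer is left out because its name gives too little to sketch from.

```python
def greedy_spill(me: int, loads: list) -> dict:
    """Assumed behaviour: if I have load and my neighbor has none, send half."""
    nxt = me + 1
    if nxt < len(loads) and loads[me] > 0 and loads[nxt] == 0:
        return {nxt: loads[me] / 2}
    return {}


def fill_and_spill(me: int, loads: list, capacity: float = 100.0,
                   fraction: float = 0.25) -> dict:
    """Assumed behaviour: stay local until over capacity, then spill a fraction."""
    nxt = me + 1
    if nxt < len(loads) and loads[me] > capacity:
        return {nxt: loads[me] * fraction}
    return {}
```

With a load of 80 on server 0 and idle neighbors, `greedy_spill` distributes immediately while `fill_and_spill` does nothing until the load exceeds the assumed capacity of 100.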
Evaluation
Evaluation: Creates Workload
• % of total load:
25 25 25 25
25 0 0 75
25 13 13 50
Workload: Creates in Same Directory
[Figure: x-axis: Strategy; regions: better than 1 MDS, worse than 1 MDS; annotations: best speedup, distribution not worthwhile, stable, overloaded MDS]
Workload: Compiling Code
[Figure: Adaptable Balancer; regions: better than 1 MDS, worse than 1 MDS; annotations: system not saturated, best speedup, most stable, too aggressive = bad perf.]
Conclusion: Separate Policy and Mechanism
• Benefits of understanding server capacity: less resource utilization, better performance/stability
• Distribution can hurt performance/stability
• Being too aggressive thrashes workload
Thanks! Questions?
Acknowledgements:
Co-authors: Noah Watkins, Carlos Maltzahn, Ike Nassi, Scott A. Brandt, Sage A. Weil*, Greg Farnum*, Sam Fineberg^
Collaborators: Ivo Jimenez, Adam Crume
Funding: HP Enterprise, storage division
Extra Slides
Why is Locality Important?
More Recent History: Distributed Metadata
Mechanisms for migrating load
Heuristics for migrating resources
Evaluation: Compile Workload
Background CephFS
• Why layering a file system over RADOS is effective
• Random access
• Significant engineering effort
• Specialized subsystem for handling the namespace