Challenges in Using Persistent Memory In Distributed Storage Systems
SNIA SDC 2016
Dan Lambright, Storage System Software Developer, Red Hat; Adjunct Professor, University of Massachusetts Lowell
Aug. 23, 2016
Overview
● Technologies
○ Persistent memory, aka storage class memory (SCM)
○ Distributed storage
● Case studies
○ GlusterFS, Ceph
● Challenges
○ Network latency
○ Accelerating parts of the system with SCM
○ CPU latency
Storage Class Memory: What do we know / expect?
● Near-DRAM speeds
● Endurance better than SSDs (per Intel's claims)
● API available (see the sketch below)
○ Crash-proof transactions
○ Byte or block addressable
● Likely to be at least as expensive as SSDs
● Fast random access
● Has support in Linux
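To make the "API available" bullet concrete, here is a minimal sketch (not from the talk) of a byte-addressable, durable write using libpmem from the NVM Library (now PMDK). The file path and size are illustrative assumptions, and it assumes a DAX-capable filesystem mounted on a /dev/pmem device; the crash-proof transactional API (libpmemobj) builds on these same primitives.

/* Minimal sketch: write a record to persistent memory and make it durable.
 * Path and size are illustrative; requires a DAX-mounted filesystem on pmem. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map (creating if needed) a 4 MB file backed by persistent memory. */
    char *base = pmem_map_file("/mnt/pmem/example", 4 << 20,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (base == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Byte-addressable store: just write through the mapping. */
    const char msg[] = "hello, persistent world";
    memcpy(base, msg, sizeof(msg));

    /* Flush CPU caches so the store survives a crash or power loss.
     * On real pmem this is cache-line flushes; otherwise fall back to msync. */
    if (is_pmem)
        pmem_persist(base, sizeof(msg));
    else
        pmem_msync(base, sizeof(msg));

    pmem_unmap(base, mapped_len);
    return 0;
}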
Distributed Storage: How to scale performance and capacity?
● A single server scales poorly
○ Scaling up a single box is expensive
● Multiple servers in distributed storage scale well
○ Maintain a single namespace
● Commodity nodes
○ Easy expansion by adding nodes
○ Good fit for low-cost hardware
○ Minimal impact from a node failure
Case Studies: GlusterFS
● Scale-out NAS
● Aggregates file systems ("bricks") into a single namespace
● Scalability limited: directories / management replicated across all nodes
○ No metadata servers
Case Studies: Ceph
● Metadata servers manage node membership
● Supports block, object, and file access
○ RADOS as an intermediate representation adds overhead
○ Must translate ingress I/O to {objects, placement groups (PGs), OSDs} (illustrated in the sketch below)
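As a rough illustration of that translation step (not Ceph's actual code, which uses the rjenkins hash, a stable modulo, and the CRUSH algorithm), the sketch below shows the general shape: hash the object name into a placement group, then map the PG onto an ordered set of OSDs. The hash, pool size, and OSD count are illustrative assumptions.

/* Simplified sketch of Ceph-style placement (illustrative only).
 * Real Ceph uses ceph_str_hash_rjenkins() and CRUSH, not this. */
#include <stdint.h>
#include <stdio.h>

#define PG_NUM    128   /* placement groups in the pool (assumed) */
#define NUM_OSDS   12   /* OSDs in the cluster (assumed)          */
#define REPLICAS    3   /* 3x replication, as in the benchmarks   */

/* Toy string hash (FNV-1a) standing in for Ceph's object-name hash. */
static uint32_t hash_name(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++)
        h = (h ^ (uint8_t)*name) * 16777619u;
    return h;
}

int main(void)
{
    const char *obj = "rbd_data.1234.0000000000000001";

    uint32_t pg = hash_name(obj) % PG_NUM;   /* object -> PG */

    /* PG -> ordered list of OSDs. CRUSH does a pseudo-random, topology-aware
     * selection; here we fake it with simple arithmetic for illustration. */
    printf("object %s -> pg %u -> osds:", obj, pg);
    for (int r = 0; r < REPLICAS; r++)
        printf(" %u", (pg + r * 7) % NUM_OSDS);
    printf("\n");
    return 0;
}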
The problem: Must lower latencies throughout the system (storage, network, CPU)

Media                  Latency
HDD                    10 ms
SSD                    1 ms
SCM                    < 1 us
CPU (Ceph, approx.)    ~1000 us
Network (RDMA)         ~50 us
Framing the Problem: How to analyze distributed storage + SCM's benefits
● Plethora of workloads and configurations
○ HPC, sequential, random, mixed read/write/transfer size, etc.
○ Number of OSDs, nodes, replica/EC sets, ...
● Benchmark one
○ 3x replication; one brick/OSD per node
○ Linux SCM support via /dev/pmem
○ 14 Gbps RDMA
○ Two clients, 4 KB / 8 KB random reads and writes
NETWORK LATENCY
Server-Side Replication: Latency cost to replicate across nodes
● "Primary copy": update replicas in parallel
○ The primary processes reads and writes
○ Ceph's choice; also Gluster's "journal-based replication" (under development)
● Other design options
○ Read at the "tail" replica, where the data is always committed
[Diagram: client sends to the primary server, which replicates to replica 1 and replica 2]
Client-Side Replication: Latency cost lower than server-side software replication
● Uses more client-side bandwidth
● The client likely has a slower network than the servers
● Gluster's default replication strategy (AFR)
[Diagram: client writes directly to each replica]
Consistency
● Reads following writes
○ Read and write operations logically occur in some sequential order
○ Completed write operations are reflected by subsequent read operations
● In Ceph,
○ Writes to different objects in the same PG are serialized
○ Serialized even if they originate from different clients
○ There may be many PGs
○ PG size is configurable online
○ (Note: many PGs are resource-intensive)
Improving Network Latency: RDMA
● Avoid the OS data copy; free the CPU from the transfer
● The application must manage its buffers (see the sketch below)
○ Reserve memory up front, sized properly
○ Both Ceph and Gluster have good RDMA interfaces
● Extend the protocol to further improve latency?
○ e.g. ideas from Microsoft's Tom Talpey
○ An RDMA write completion does not indicate the data was persisted
○ An ACK in a higher-level protocol adds overhead
○ Add a "commit bit", perhaps combined with the last write?
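To illustrate the "reserve memory up front" point (not code from Ceph or Gluster), the sketch below pre-allocates and registers a buffer pool with libibverbs at startup, so later transfers can be zero-copy. Queue-pair setup and connection management are omitted, and the pool size is an assumption.

/* Sketch: pre-register a fixed buffer pool for RDMA (libibverbs).
 * Connection / queue-pair setup is omitted; sizes are illustrative. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define POOL_SIZE (64 * 1024 * 1024)   /* 64 MB reserved up front */

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) { fprintf(stderr, "device/PD setup failed\n"); return 1; }

    /* Allocate the buffer pool once, at startup, and pin/register it.
     * Registration is expensive, so it must not happen on the I/O path. */
    void *pool = malloc(POOL_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, pool, POOL_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %d MB, lkey=0x%x rkey=0x%x\n",
           POOL_SIZE >> 20, mr->lkey, mr->rkey);

    /* ... hand out chunks of 'pool' to in-flight I/Os, post sends/recvs ... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(pool);
    return 0;
}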
[Chart: IOPS, RDMA vs. 10 Gbps; GlusterFS 2x replication, 2 clients; sequential and random I/O at 1024-byte transfers. Biggest gain with reads, little gain for small I/O.]
Improving Network Latency: Other techniques
● Reduce protocol traffic (discussed more in the next section)
● Coalesce protocol operations
○ With this, observed a 10% gain in small-file creates on Gluster
● Pipelining
○ In Ceph, for two updates to the same object, start replicating the second before the first completes
ACCELERATION
Adding SCM to Parts of the System: Kernel and application level tiering
● dm-cache (kernel)
● Ceph tiering (application)
Adding SCM to Parts of the System: Candidate destinations
● Ceph filestore's journal
● Ceph bluestore's RocksDB write-ahead log
● The XFS journal
[Diagram: an OSD running bluestore with its RocksDB WAL, and an OSD running filestore over XFS with its journals]
In Depth: Gluster Tiering (illustration of the network problem)
● Heterogeneous storage in a single volume
○ Fast/expensive storage acts as a cache for slower storage
○ Introduced in Q1 2016
○ Fast "hot tier" (e.g. SSD, SCM)
○ Slow "cold tier" (e.g. erasure coded)
● Policies:
○ Data is put on the hot tier until it is "full"
○ Once "full", data is promoted/demoted based on access frequency
Gluster Tiering
Gluster's "Small File" Tiering Problem: Analysis
● Tiering helped large I/Os, not small ones
● The pattern is seen elsewhere:
○ RDMA tests
○ Customer feedback, and GlusterFS's overall reputation
● Observed many LOOKUP operations over the network
● Hypothesis: metadata transfers dominate data transfers for small files
○ Speeding up small-file data transfers fails to help overall I/O latency
Understanding LOOKUPs in Gluster: the path traversal problem
● Each directory in the path is checked on an open() by the client's VFS layer
○ Full path traversal
○ d1/d2/d3/f1
○ Existence
○ Permission
[Diagram: path components d1 / d2 / d3 / f1]
Understanding LOOKUPs in Gluster: coalescing distributed hash ranges
● The distributed hash space is split into parts (see the sketch below)
○ A unique space for each directory
○ Each node owns a piece of this "layout"
○ Stored in extended attributes
○ When new nodes are added to the cluster, the layouts are rebuilt
● When a file is opened, the entire layout is rechecked, for each directory
○ Each node receives a LOOKUP to retrieve its part of the layout
● Work is underway to improve this
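To illustrate the layout idea, here is a simplified sketch, not Gluster's DHT code (which uses a Davies-Meyer hash and stores per-directory ranges in extended attributes): each brick owns a slice of the hash space for a directory, and a file lands on the brick whose range contains the hash of its name. Brick names, ranges, and the hash are illustrative assumptions.

/* Simplified sketch of a DHT-style layout: each brick owns a slice of the
 * 32-bit hash space for a directory; a file is placed on the brick whose
 * range contains the hash of its name. Not Gluster's actual hash or layout. */
#include <stdint.h>
#include <stdio.h>

struct range { const char *brick; uint32_t start, end; };

/* Example layout for one directory, split evenly across four bricks. */
static const struct range layout[] = {
    { "server1:/brick", 0x00000000u, 0x3fffffffu },
    { "server2:/brick", 0x40000000u, 0x7fffffffu },
    { "server3:/brick", 0x80000000u, 0xbfffffffu },
    { "server4:/brick", 0xc0000000u, 0xffffffffu },
};

static uint32_t hash_name(const char *name)   /* toy hash (FNV-1a) */
{
    uint32_t h = 2166136261u;
    for (; *name; name++)
        h = (h ^ (uint8_t)*name) * 16777619u;
    return h;
}

int main(void)
{
    const char *file = "f1";
    uint32_t h = hash_name(file);

    for (unsigned i = 0; i < sizeof(layout) / sizeof(layout[0]); i++) {
        if (h >= layout[i].start && h <= layout[i].end) {
            printf("%s (hash 0x%08x) -> %s\n", file, h, layout[i].brick);
            break;
        }
    }
    return 0;
}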
LOOKUP Amplification
● Opening d1/d2/d3/f1 requires a LOOKUP for each of the four path components
● With four servers (S1-S4), that is 16 LOOKUPs in the worst case
[Diagram: the client's VFS layer and Gluster client issuing LOOKUPs for d2, d3, and f1 to Gluster servers S1-S4]
Client Metadata Cache: Gluster's "md-cache" translator
● Cache file metadata at the client long term
○ Work in progress, under development
● Invalidate a cache entry when another client makes a change
○ Invalidate intelligently, not spuriously
○ Some attributes may change a lot (ctime, ...)
CPU LATENCY
CPU Latency: Services needed to distribute storage add CPU overhead
● It takes CPU cycles to...
○ Distribute data over nodes
○ Perform replication / erasure coding
○ Manage a single namespace
○ Convert between external and internal representations of data
Ceph Datapath: Micro-Optimizations
● Upper (fast) and lower (slow) halves of the I/O path
● Context switches between the halves: consider run-to-completion?
● Many locks taken: use lock-free algorithms?
● Memory allocation matters (jemalloc)
● Etc.
[Diagram: locks on the OSD datapath, including session_dispatch_lock, PG::map_lock, pg_map_lock, and the OpWQ lock]
Community Contributions: SanDisk, CohortFS, many others
● SanDisk
○ Sharded worker thread pools (see the sketch below)
○ Bluestore optimizations
○ Identified TCMalloc problems, introduced jemalloc
○ ... much more, ongoing
● CohortFS (now Red Hat)
○ Accelio RDMA module
○ Divide-and-conquer performance analysis using an "infinite backend" (memstore)
○ Lock-free algorithms / RCU (coming)
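To make the sharded-worker-pool idea concrete, here is a minimal sketch (not Ceph's OpWQ implementation) of sharding an operation queue by placement-group id, so threads contend on per-shard locks rather than one global lock; ops for the same PG always map to the same shard, preserving per-PG ordering. The types, names, and shard count are illustrative assumptions.

/* Sketch: shard the op queue by PG id so enqueue/dequeue contend on
 * per-shard locks instead of one global lock. Not Ceph code. */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_SHARDS 8                      /* assumed shard count */

struct op {
    uint64_t pg_id;                       /* placement group this op targets */
    struct op *next;
};

struct shard {
    pthread_mutex_t lock;                 /* one lock per shard */
    struct op *head, *tail;
};

static struct shard shards[NUM_SHARDS];

void init_shards(void)
{
    for (int i = 0; i < NUM_SHARDS; i++) {
        pthread_mutex_init(&shards[i].lock, NULL);
        shards[i].head = shards[i].tail = NULL;
    }
}

/* Ops for the same PG always land in the same shard. */
static struct shard *shard_for_pg(uint64_t pg_id)
{
    return &shards[pg_id % NUM_SHARDS];
}

void enqueue_op(struct op *op)            /* called by the messenger thread */
{
    struct shard *s = shard_for_pg(op->pg_id);
    op->next = NULL;
    pthread_mutex_lock(&s->lock);
    if (s->tail) s->tail->next = op;
    else         s->head = op;
    s->tail = op;
    pthread_mutex_unlock(&s->lock);
}

struct op *dequeue_op(unsigned shard_idx) /* each worker thread owns a shard */
{
    struct shard *s = &shards[shard_idx % NUM_SHARDS];
    pthread_mutex_lock(&s->lock);
    struct op *op = s->head;
    if (op) {
        s->head = op->next;
        if (!s->head) s->tail = NULL;
    }
    pthread_mutex_unlock(&s->lock);
    return op;
}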
Ceph Datapath: Macro-Optimizations (Bluestore, a key-value database as the backend)
● No longer run Ceph over a file system
● Motivation
○ Transactions are difficult to implement on top of POSIX
○ Writing to Ceph's journal first meant 2x writes
○ Object enumeration was inefficient
● Manage metadata with a database
○ ACID semantics for transactions
○ No longer a double write
Bluestore: How does it help CPU latency?
● Shorter code path
● More of the stack is customizable for Ceph's needs
○ BlueFS allocates directly from the block device
Some results (4/16): Code in flux, YMMV!
[Benchmark charts]
Which database to use? Further optimizations for SCM
● RocksDB is a log-structured merge-tree (LSM) database
○ Optimized for sequential access
○ Has periodic background compaction, write amplification, ...
○ Must carefully tune RocksDB "compaction" options
○ A good fit for disks, not so much for SCM
○ Use a different DB, e.g. SanDisk's ZetaScale?
SUMMARY
Summary and Discussion: Distributed storage and SCM pose unique latency problems
● Network latency reductions
○ Use RDMA
○ Reduce round trips by streamlining the protocol, coalescing, etc.
○ Cache at the client
● CPU latency reductions
○ Aggressively optimize / shrink the stack
○ Remove/replace large components
● Consider
○ SCM as a tier/cache
○ 2-way replication
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
Putting it Together: Benchmarks
● RAM disk via /dev/pmem
● RDMA + Accelio
● Bluestore
● 40 Gb/s InfiniBand
Using Distributed Storage with SCM: the fundamental problem
● Low storage latency must be matched by the network and CPU
● The software stack is deeper
○ Layered on top of a file system
BACKUP SLIDES
TABLES WITH ACTUAL SPACE MEASURES
Create Files. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            42.0430    42.0430    17.4258    17.5156
manual         42.5117    42.5117    16.9180    16.9180
full           42.1953    42.1992    17.4336    17.4336
incremental    42.1836    12.1836    17.4453    17.4453
Create and Delete Files. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            42.1211    42.1914    17.4688    17.4844
manual         10.3516    10.3516    0.6133     0.6133
full           0.0234     0.0234     0.0234     0.0234
incremental    42.1875    42.0703    17.4336    17.4336
Create and Rename Files. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            55.6953    55.5898    19.2344    19.2344
manual         49.3712    49.3750    14.9102    14.9102
full           55.5859    55.5859    18.5508    18.5508
incremental    55.4883    55.4844    19.313     19.313
Create Files and Symbolic Links. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            42.9336    42.9375    17.4570    17.4570
manual         43.3555    43.3555    16         16
full           43.0195    43.0117    17.0313    17.0313
incremental    42.9961    43.0039    17.5234    17.5195