Challenges in Using Persistent Memory In Distributed Storage Systems
SNIA SDC 2016
Dan Lambright, Storage System Software Developer, Red Hat; Adjunct Professor, University of Massachusetts Lowell
Aug. 23, 2016
Overview
● Technologies
○ Persistent memory, aka storage class memory (SCM)
○ Distributed storage
● Case studies
○ GlusterFS, Ceph
● Challenges
○ Network latency
○ Accelerating parts of the system with SCM
○ CPU latency
Storage Class Memory: What do we know / expect?
● Near-DRAM speeds
● Endurance better than SSDs (per Intel's claims)
● API available (see the sketch below)
○ Crash-proof transactions
○ Byte or block addressable
● Likely to be at least as expensive as SSDs
● Fast random access
● Has support in Linux
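To make the "API available" bullet concrete, here is a minimal sketch (not from the talk) of a byte-addressable, durable write using libpmem from the NVM Library (now PMDK). The file path and size are illustrative assumptions, and it assumes a DAX-capable filesystem mounted on a /dev/pmem device; the crash-proof transactional API (libpmemobj) builds on these same primitives.

/* Minimal sketch: write a record to persistent memory and make it durable.
 * Path and size are illustrative; requires a DAX-mounted filesystem on pmem. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Map (creating if needed) a 4 MB file backed by persistent memory. */
    char *base = pmem_map_file("/mnt/pmem/example", 4 << 20,
                               PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (base == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Byte-addressable store: just write through the mapping. */
    const char msg[] = "hello, persistent world";
    memcpy(base, msg, sizeof(msg));

    /* Flush CPU caches so the store survives a crash or power loss.
     * On real pmem this is cache-line flushes; otherwise fall back to msync. */
    if (is_pmem)
        pmem_persist(base, sizeof(msg));
    else
        pmem_msync(base, sizeof(msg));

    pmem_unmap(base, mapped_len);
    return 0;
}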
Distributed Storage: How to scale performance and capacity?
● A single server scales poorly
○ Scaling up a single box is expensive
● Multiple servers in distributed storage scale well
○ Maintain a single namespace
● Commodity nodes
○ Easy expansion by adding nodes
○ Good fit for low-cost hardware
○ Minimal impact from a node failure
Case Studies: GlusterFS
● Scale-out NAS
● Aggregates file systems ("bricks") into a single namespace
● Scalability limited: directories / management replicated across all nodes
○ No metadata servers
Case Studies: Ceph
● Metadata servers manage node membership
● Supports block, object, and file access
○ RADOS as an intermediate representation adds overhead
○ Must translate ingress I/O to {objects, placement groups (PGs), OSDs} (illustrated in the sketch below)
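As a rough illustration of that translation step (not Ceph's actual code, which uses the rjenkins hash, a stable modulo, and the CRUSH algorithm), the sketch below shows the general shape: hash the object name into a placement group, then map the PG onto an ordered set of OSDs. The hash, pool size, and OSD count are illustrative assumptions.

/* Simplified sketch of Ceph-style placement (illustrative only).
 * Real Ceph uses ceph_str_hash_rjenkins() and CRUSH, not this. */
#include <stdint.h>
#include <stdio.h>

#define PG_NUM    128   /* placement groups in the pool (assumed) */
#define NUM_OSDS   12   /* OSDs in the cluster (assumed)          */
#define REPLICAS    3   /* 3x replication, as in the benchmarks   */

/* Toy string hash (FNV-1a) standing in for Ceph's object-name hash. */
static uint32_t hash_name(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++)
        h = (h ^ (uint8_t)*name) * 16777619u;
    return h;
}

int main(void)
{
    const char *obj = "rbd_data.1234.0000000000000001";

    uint32_t pg = hash_name(obj) % PG_NUM;   /* object -> PG */

    /* PG -> ordered list of OSDs. CRUSH does a pseudo-random, topology-aware
     * selection; here we fake it with simple arithmetic for illustration. */
    printf("object %s -> pg %u -> osds:", obj, pg);
    for (int r = 0; r < REPLICAS; r++)
        printf(" %u", (pg + r * 7) % NUM_OSDS);
    printf("\n");
    return 0;
}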
The problem: Must lower latencies throughout the system (storage, network, CPU)

Media                  Latency
HDD                    10 ms
SSD                    1 ms
SCM                    < 1 us
CPU (Ceph, approx.)    ~1000 us
Network (RDMA)         ~50 us
Framing the Problem: How to analyze distributed storage + SCM's benefits
● Plethora of workloads and configurations
○ HPC, sequential, random, mixed read/write/transfer size, etc.
○ Number of OSDs, nodes, replica/EC sets, ...
● Benchmark one
○ 3x replication; one brick/OSD per node
○ Linux SCM support via /dev/pmem
○ 14 Gbps RDMA
○ Two clients, 4 KB / 8 KB random reads and writes
NETWORK LATENCY
Server-Side Replication: Latency cost to replicate across nodes
● "Primary copy": update replicas in parallel
○ The primary processes reads and writes
○ Ceph's choice; also Gluster's "journal-based replication" (under development)
● Other design options
○ Read at the "tail" replica, where the data is always committed
[Diagram: client sends to the primary server, which replicates to replica 1 and replica 2]
Client-Side Replication: Latency cost lower than server-side software replication
● Uses more client-side bandwidth
● The client likely has a slower network than the servers
● Gluster's default replication strategy (AFR)
[Diagram: client writes directly to each replica]
Consistency
● Reads following writes
○ Read and write operations logically occur in some sequential order
○ Completed write operations are reflected by subsequent read operations
● In Ceph,
○ Writes to different objects in the same PG are serialized
○ Serialized even if they originate from different clients
○ There may be many PGs
○ PG size is configurable online
○ (Note: many PGs are resource-intensive)
Improving Network Latency: RDMA
● Avoid the OS data copy; free the CPU from the transfer
● The application must manage its buffers (see the sketch below)
○ Reserve memory up front, sized properly
○ Both Ceph and Gluster have good RDMA interfaces
● Extend the protocol to further improve latency?
○ e.g. ideas from Microsoft's Tom Talpey
○ An RDMA write completion does not indicate the data was persisted
○ An ACK in a higher-level protocol adds overhead
○ Add a "commit bit", perhaps combined with the last write?
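To illustrate the "reserve memory up front" point (not code from Ceph or Gluster), the sketch below pre-allocates and registers a buffer pool with libibverbs at startup, so later transfers can be zero-copy. Queue-pair setup and connection management are omitted, and the pool size is an assumption.

/* Sketch: pre-register a fixed buffer pool for RDMA (libibverbs).
 * Connection / queue-pair setup is omitted; sizes are illustrative. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define POOL_SIZE (64 * 1024 * 1024)   /* 64 MB reserved up front */

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) { fprintf(stderr, "device/PD setup failed\n"); return 1; }

    /* Allocate the buffer pool once, at startup, and pin/register it.
     * Registration is expensive, so it must not happen on the I/O path. */
    void *pool = malloc(POOL_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, pool, POOL_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %d MB, lkey=0x%x rkey=0x%x\n",
           POOL_SIZE >> 20, mr->lkey, mr->rkey);

    /* ... hand out chunks of 'pool' to in-flight I/Os, post sends/recvs ... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(pool);
    return 0;
}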
[Chart: IOPS, RDMA vs. 10 Gbps; GlusterFS 2x replication, 2 clients; sequential and random I/O at 1024-byte transfers. Biggest gain with reads, little gain for small I/O.]
Improving Network Latency: Other techniques
● Reduce protocol traffic (discussed more in the next section)
● Coalesce protocol operations
○ With this, observed a 10% gain in small-file creates on Gluster
● Pipelining
○ In Ceph, for two updates to the same object, start replicating the second before the first completes
ACCELERATION
Adding SCM to Parts of the System: Kernel and application level tiering
● dm-cache (kernel)
● Ceph tiering (application)
Adding SCM to Parts of the System: Candidate destinations
● Ceph filestore's journal
● Ceph bluestore's RocksDB write-ahead log
● The XFS journal
[Diagram: an OSD running bluestore with its RocksDB WAL, and an OSD running filestore over XFS with its journals]
In Depth: Gluster Tiering (illustration of the network problem)
● Heterogeneous storage in a single volume
○ Fast/expensive storage acts as a cache for slower storage
○ Introduced in Q1 2016
○ Fast "hot tier" (e.g. SSD, SCM)
○ Slow "cold tier" (e.g. erasure coded)
● Policies:
○ Data is put on the hot tier until it is "full"
○ Once "full", data is promoted/demoted based on access frequency
Gluster Tiering
Gluster's "Small File" Tiering Problem: Analysis
● Tiering helped large I/Os, not small ones
● The pattern is seen elsewhere:
○ RDMA tests
○ Customer feedback, and GlusterFS's overall reputation
● Observed many LOOKUP operations over the network
● Hypothesis: metadata transfers dominate data transfers for small files
○ Speeding up small-file data transfers fails to help overall I/O latency
Understanding LOOKUPs in Gluster: the path traversal problem
● Each directory in the path is checked on an open() by the client's VFS layer
○ Full path traversal
○ d1/d2/d3/f1
○ Existence
○ Permission
[Diagram: path components d1 / d2 / d3 / f1]
Understanding LOOKUPs in Gluster: coalescing distributed hash ranges
● The distributed hash space is split into parts (see the sketch below)
○ A unique space for each directory
○ Each node owns a piece of this "layout"
○ Stored in extended attributes
○ When new nodes are added to the cluster, the layouts are rebuilt
● When a file is opened, the entire layout is rechecked, for each directory
○ Each node receives a LOOKUP to retrieve its part of the layout
● Work is underway to improve this
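To illustrate the layout idea, here is a simplified sketch, not Gluster's DHT code (which uses a Davies-Meyer hash and stores per-directory ranges in extended attributes): each brick owns a slice of the hash space for a directory, and a file lands on the brick whose range contains the hash of its name. Brick names, ranges, and the hash are illustrative assumptions.

/* Simplified sketch of a DHT-style layout: each brick owns a slice of the
 * 32-bit hash space for a directory; a file is placed on the brick whose
 * range contains the hash of its name. Not Gluster's actual hash or layout. */
#include <stdint.h>
#include <stdio.h>

struct range { const char *brick; uint32_t start, end; };

/* Example layout for one directory, split evenly across four bricks. */
static const struct range layout[] = {
    { "server1:/brick", 0x00000000u, 0x3fffffffu },
    { "server2:/brick", 0x40000000u, 0x7fffffffu },
    { "server3:/brick", 0x80000000u, 0xbfffffffu },
    { "server4:/brick", 0xc0000000u, 0xffffffffu },
};

static uint32_t hash_name(const char *name)   /* toy hash (FNV-1a) */
{
    uint32_t h = 2166136261u;
    for (; *name; name++)
        h = (h ^ (uint8_t)*name) * 16777619u;
    return h;
}

int main(void)
{
    const char *file = "f1";
    uint32_t h = hash_name(file);

    for (unsigned i = 0; i < sizeof(layout) / sizeof(layout[0]); i++) {
        if (h >= layout[i].start && h <= layout[i].end) {
            printf("%s (hash 0x%08x) -> %s\n", file, h, layout[i].brick);
            break;
        }
    }
    return 0;
}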
LOOKUP Amplification
● Opening d1/d2/d3/f1 requires a LOOKUP for each of the four path components
● With four servers (S1-S4), that is 16 LOOKUPs in the worst case
[Diagram: the client's VFS layer and Gluster client issuing LOOKUPs for d2, d3, and f1 to Gluster servers S1-S4]
Client Metadata Cache: Gluster's "md-cache" translator
● Cache file metadata at the client long term
○ Work in progress, under development
● Invalidate a cache entry when another client makes a change
○ Invalidate intelligently, not spuriously
○ Some attributes may change a lot (ctime, ...)
CPU LATENCY
CPU Latency: Services needed to distribute storage add CPU overhead
● It takes CPU cycles to...
○ Distribute data over nodes
○ Perform replication / erasure coding
○ Manage a single namespace
○ Convert between external and internal representations of data
Ceph Datapath: Micro-Optimizations
● Upper (fast) and lower (slow) halves of the I/O path
● Context switches between the halves: consider run-to-completion?
● Many locks taken: use lock-free algorithms?
● Memory allocation matters (jemalloc)
● Etc.
[Diagram: locks on the OSD datapath, including session_dispatch_lock, PG::map_lock, pg_map_lock, and the OpWQ lock]
Community Contributions: SanDisk, CohortFS, many others
● SanDisk
○ Sharded worker thread pools (see the sketch below)
○ Bluestore optimizations
○ Identified TCMalloc problems, introduced jemalloc
○ ... much more, ongoing
● CohortFS (now Red Hat)
○ Accelio RDMA module
○ Divide-and-conquer performance analysis using an "infinite backend" (memstore)
○ Lock-free algorithms / RCU (coming)
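To make the sharded-worker-pool idea concrete, here is a minimal sketch (not Ceph's OpWQ implementation) of sharding an operation queue by placement-group id, so threads contend on per-shard locks rather than one global lock; ops for the same PG always map to the same shard, preserving per-PG ordering. The types, names, and shard count are illustrative assumptions.

/* Sketch: shard the op queue by PG id so enqueue/dequeue contend on
 * per-shard locks instead of one global lock. Not Ceph code. */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_SHARDS 8                      /* assumed shard count */

struct op {
    uint64_t pg_id;                       /* placement group this op targets */
    struct op *next;
};

struct shard {
    pthread_mutex_t lock;                 /* one lock per shard */
    struct op *head, *tail;
};

static struct shard shards[NUM_SHARDS];

void init_shards(void)
{
    for (int i = 0; i < NUM_SHARDS; i++) {
        pthread_mutex_init(&shards[i].lock, NULL);
        shards[i].head = shards[i].tail = NULL;
    }
}

/* Ops for the same PG always land in the same shard. */
static struct shard *shard_for_pg(uint64_t pg_id)
{
    return &shards[pg_id % NUM_SHARDS];
}

void enqueue_op(struct op *op)            /* called by the messenger thread */
{
    struct shard *s = shard_for_pg(op->pg_id);
    op->next = NULL;
    pthread_mutex_lock(&s->lock);
    if (s->tail) s->tail->next = op;
    else         s->head = op;
    s->tail = op;
    pthread_mutex_unlock(&s->lock);
}

struct op *dequeue_op(unsigned shard_idx) /* each worker thread owns a shard */
{
    struct shard *s = &shards[shard_idx % NUM_SHARDS];
    pthread_mutex_lock(&s->lock);
    struct op *op = s->head;
    if (op) {
        s->head = op->next;
        if (!s->head) s->tail = NULL;
    }
    pthread_mutex_unlock(&s->lock);
    return op;
}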
Ceph Datapath: Macro-Optimizations (Bluestore, a key-value database as the backend)
● No longer run Ceph over a file system
● Motivation
○ Transactions are difficult to implement on top of POSIX
○ Writing to Ceph's journal first meant 2x writes
○ Object enumeration was inefficient
● Manage metadata with a database
○ ACID semantics for transactions
○ No longer a double write
Bluestore: How does it help CPU latency?
● Shorter code path
● More of the stack is customizable for Ceph's needs
○ BlueFS allocates directly from the block device
Some results (4/16): Code in flux, YMMV!
[Benchmark charts]
Which database to use? Further optimizations for SCM
● RocksDB is a log-structured merge-tree (LSM) database
○ Optimized for sequential access
○ Has periodic background compaction, write amplification, ...
○ Must carefully tune RocksDB "compaction" options
○ A good fit for disks, not so much for SCM
○ Use a different DB, e.g. SanDisk's ZetaScale?
SUMMARY
Summary and Discussion: Distributed storage and SCM pose unique latency problems
● Network latency reductions
○ Use RDMA
○ Reduce round trips by streamlining the protocol, coalescing, etc.
○ Cache at the client
● CPU latency reductions
○ Aggressively optimize / shrink the stack
○ Remove/replace large components
● Consider
○ SCM as a tier/cache
○ 2-way replication
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
Putting it Together: Benchmarks
● RAM disk via /dev/pmem
● RDMA + Accelio
● Bluestore
● 40 Gb/s InfiniBand
Using Distributed Storage with SCM: the fundamental problem
● Low storage latency must be matched by the network and CPU
● The software stack is deeper
○ Layered on top of a file system
BACKUP SLIDES
TABLES WITH ACTUAL SPACE MEASURES
Create Files. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            42.0430    42.0430    17.4258    17.5156
manual         42.5117    42.5117    16.9180    16.9180
full           42.1953    42.1992    17.4336    17.4336
incremental    42.1836    12.1836    17.4453    17.4453
Create and Delete Files. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            42.1211    42.1914    17.4688    17.4844
manual         10.3516    10.3516    0.6133     0.6133
full           0.0234     0.0234     0.0234     0.0234
incremental    42.1875    42.0703    17.4336    17.4336
Create and Rename Files. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            55.6953    55.5898    19.2344    19.2344
manual         49.3712    49.3750    14.9102    14.9102
full           55.5859    55.5859    18.5508    18.5508
incremental    55.4883    55.4844    19.313     19.313
Create Files and Symbolic Links. Space: Measurements in MB
Compaction     Brick 1    Brick 2    Brick 3    Brick 4
off            42.9336    42.9375    17.4570    17.4570
manual         43.3555    43.3555    16         16
full           43.0195    43.0117    17.0313    17.0313
incremental    42.9961    43.0039    17.5234    17.5195