Remote Persistent Memory
Tom Talpey
Outline
Persistent Memory and Windows PMEM Support: a brief summary, for better awareness
RDMA Persistent Memory Extensions: their motivation and use by storage protocols
Example Application Scenarios for Persistent Memory Operations: RDMA operation behavior
Persistent Memory
Memory & Storage Convergence
Volatile and non-volatile technologies are continuing to converge
[Figure: memory and storage convergence. Today's hierarchy of DRAM (and on-package memory, OPM**) over disk/SSD gains a Persistent Memory (PM*) tier built from new and emerging memory technologies: HMC, HBM, RRAM, 3DXPoint, MRAM, PCM, low-latency NAND, and managed DRAM. Source: Gen-Z Consortium 2016.]
*PM = Persistent Memory
**OPM = On-Package Memory
Persistent Memory Brings Storage to Memory Slots
Fast, like memory. Persistent, like storage.
• For system acceleration
• For real-time data capture, analysis and intelligent response
Persistent Memory (PM) Vision
Persistent Memory (PM) is a type of Non-Volatile Memory (NVM)
Disk-like non-volatile memory: a persistent RAM disk
Appears as disk drives to applications
Accessed as a traditional array of blocks
Usually software arbitrated, e.g. a RAMdisk-like driver, or simply NVMe
Memory-like non-volatile memory (PM): appears as memory to applications
Applications store data directly in byte-addressable memory
No I/O or even DMA is required
Persistent Memory (PM) Characteristics
Byte addressable from programmer’s point of view
Provides Load/Store access via CPU instructions
Resides on the memory bus
Memory-like performance: ultra-low latency
Supports DMA including RDMA
Not prone to unexpected latencies
Persistent Memory Terminology
Terms used to describe the hardware:
Storage Class Memory (SCM)
Byte Addressable Storage (BAS)
Persistent Memory (PM): the industry is converging on this term
“PMEM” is also commonly used
Non-Volatile Memory: the “NVM” term is moving to indicate block-mode access
i.e. not byte-addressable, not a memory interface
e.g. NVMe, NVMe-oF, etc.
Windows Persistent Memory Support
Windows Persistent Memory Support
Persistent Memory is supported in Windows 10 and Windows Server 2016
PM support is foundational in Windows and is SKU-independent
Support for JEDEC-defined NVDIMM-N devices is available in Windows Server 2016 and Windows 10 (Anniversary Update, Fall 2016)
Access methods:
Direct Access (DAX) filesystem
Mapped files with a load/store/flush paradigm
Cached and noncached with a read/write paradigm
Block mode (“persistent RAM disk”)
Raw disk paradigm
Application interfaces:
Mapped and traditional file
NVM Programming Library
“PMEM-aware” open coded
Direct Access Architecture
Overview
Support in Windows Server 2016 and Windows 10 Anniversary Update (Fall 2016)
The app has direct access to Storage Class Memory (SCM/PM) via existing memory-mapping semantics
Updates directly modify memory; the storage stack is not involved
DAX volumes are identified through a new flag
Characteristics:
True device performance (no software overhead)
Byte-addressable
Filter drivers relying on I/O may not work or attach (no I/O; new volume flag)
AV filters can still operate (Windows Defender already updated)
[Figure: Direct Access architecture. A block-mode application uses the standard file API through the SCM disk driver and SCM bus driver to reach the PMEM. A DirectAccess application instead requests a memory-mapped file from the SCM-aware file system (NTFS DAX), which enumerates the NVDIMM; its load/store operations then go from user mode directly to the memory-mapped region, so the direct-access data path bypasses the kernel storage stack, which is involved only in the setup path.]
IO in DAX mode
Memory-mapped access: true zero-copy access to storage
An application has direct access to persistent memory
Important: no paging reads or paging writes will be generated
Cached IO access: the cache manager creates a cache map that maps directly to PM hardware
The cache manager copies directly between the user’s buffer and persistent memory
Cached IO thus has one-copy access to persistent storage
Cached IO is coherent with memory-mapped IO
As with memory-mapped IO, no paging reads or paging writes are generated
There is no Cache Manager Lazy Writer thread
Non-cached IO access: simply converted to cached IO by the file system
The cache manager copies directly between the user’s buffer and persistent memory
Coherent with cached and memory-mapped IO
Backward App Compatibility on PM Hardware
Block-mode volumes maintain existing storage semantics
All IO operations traverse the storage stack to the PM disk driver
Sector atomicity is guaranteed by the PM disk driver
A shortened path through the storage stack reduces latency:
No storport or miniport drivers
No SCSI translations
Fully compatible with existing applications
Supported by all Windows file systems
Works with existing file system filters
Block mode vs. DAX mode is chosen at format time
Performance Comparison
4K random writes
1 Thread, single core, QD=1
                   IOPS       Avg Latency (ns)  MB/sec
NVMe SSD           14,553     66,632            56.85
Block Mode NVDIMM  148,567    6,418             580.34
DAX Mode NVDIMM    1,112,007  828               4,343.78
Using DAX in Windows
DAX volume creation:
format n: /dax /q
Format-Volume -DriveLetter n -IsDAX $true
DAX volume identification:
Is it a DAX volume?
call GetVolumeInformation(“C:\”, ...)
check lpFileSystemFlags for FILE_DAX_VOLUME (0x20000000)
Is the file on a DAX volume?
call GetVolumeInformationByHandleW(hFile, ...)
check lpFileSystemFlags for FILE_DAX_VOLUME (0x20000000)
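Putting the volume check together, a minimal C sketch (the drive letter is just an example, and FILE_DAX_VOLUME may be missing from older SDK headers):

    #include <windows.h>
    #include <stdio.h>

    #ifndef FILE_DAX_VOLUME
    #define FILE_DAX_VOLUME 0x20000000  /* not defined in older SDKs */
    #endif

    int main(void)
    {
        DWORD flags = 0;
        /* Query the filesystem flags for the volume root */
        if (!GetVolumeInformationW(L"C:\\", NULL, 0, NULL, NULL,
                                   &flags, NULL, 0)) {
            fprintf(stderr, "GetVolumeInformationW failed: %lu\n",
                    GetLastError());
            return 1;
        }
        printf("C:\\ is %sa DAX volume\n",
               (flags & FILE_DAX_VOLUME) ? "" : "not ");
        return 0;
    }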
Using DAX in Windows
Memory mapping:
HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
LPVOID baseAddress = MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, size);
memcpy((char *)baseAddress + writeOffset, dataBuffer, ioSize);  // direct store into the mapped PM
FlushViewOfFile(baseAddress, 0);  // size 0 = flush the entire view
OR use non-temporal instructions on NVDIMM-N devices for better performance:
HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
LPVOID baseAddress = MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, size);
RtlCopyMemoryNonTemporal((char *)baseAddress + writeOffset, dataBuffer, ioSize);
Future: use the open-source NVM Programming Library
Linux Kernel 4.4+ NVDIMM-N Support
PMEM: a system-physical-address range where writes are persistent. A block device composed of PMEM is capable of DAX. A PMEM address range may span an interleave of several DIMMs.
DAX: file system extensions to bypass the page cache and block layer, and memory map persistent memory from a PMEM block device directly into a process address space.
BTT (Block, Atomic): Block Translation Table. Persistent memory is byte addressable, but existing software may expect that the power-fail atomicity of writes is at least one sector (512 bytes). The BTT is an indirection table with atomic update semantics that fronts a PMEM/BLK block device driver and presents arbitrary atomic sector sizes.
BLK: a set of one or more programmable memory-mapped apertures provided by a DIMM to access its media. This indirection precludes the performance benefit of interleaving, but enables DIMM-bounded failure modes.
Linux 4.2 added support for NVDIMMs; mostly stable from 4.4 onward
NVDIMM modules presented as device links: /dev/pmem0, /dev/pmem1
QEMU support (experimental)
XFS-DAX and EXT4-DAX available
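For comparison with the Windows example above, a minimal sketch of the same load/store paradigm on Linux; the mount point /mnt/pmem and file name are illustrative, assuming an ext4-DAX or XFS-DAX filesystem:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* /mnt/pmem is assumed to be a DAX-mounted filesystem */
        int fd = open("/mnt/pmem/example", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, 4096) < 0) return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        memcpy(p, "hello, pmem", 12);  /* direct store, no page cache */
        msync(p, 4096, MS_SYNC);       /* request durability */

        munmap(p, 4096);
        close(fd);
        return 0;
    }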
Remote Access to Persistent Memory
Going Remote
A single local copy of storage isn’t storage at all; it's basically temp data
Enterprise-grade storage requires replication: multi-device quorum
In addition to integrity, privacy, manageability, … (requirements vary)
Remote access is therefore required
PM value is all about LATENCY: the goal is single-digit-microsecond remote latency
Which, by the way, is 2-3 orders of magnitude better than today’s block storage
We can take steps to get there, with great benefit at each step
Use RDMA: this requires an RDMA protocol extension
RDMA Protocols
Need a remote guarantee of Durability
RDMA Write alone is not sufficient for this semantic
It indicates only that data has been accepted for ordered delivery, not that it has actually been delivered and placed
An extension is required: the proposed “RDMA Commit”, a.k.a. “RDMA Flush”
Executes like RDMA Read: ordered, flow controlled, acknowledged
The initiator requests specific byte ranges to be made durable
The responder acknowledges only when durability is complete
There is strong consensus on these basics, under discussion in the IBTA, SNIA and other venues
Details are still being worked out
Scope of durability: region-based, region-list-based, and connection scope are all under discussion
Connection scope seems most efficient for implementations
Additional semantics are possible (later in this deck); a sketch of the basic flow follows
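To make the semantics concrete, a pseudocode sketch of the proposed flow; post_rdma_write, post_rdma_commit and wait_for_completion are hypothetical placeholders for verbs that do not exist yet, not a real API:

    /* Hypothetical initiator-side sequence for the proposed extension */
    post_rdma_write(qp, buf_a, remote_pm_addr_a, len_a);  /* posted, pipelined */
    post_rdma_write(qp, buf_b, remote_pm_addr_b, len_b);
    post_rdma_commit(qp, ranges, nranges);  /* executes like a Read: ordered,
                                               flow controlled, acknowledged */
    wait_for_completion(cq);  /* the responder acknowledges only after the
                                 requested byte ranges are durable */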
RDMA-Aware Storage Protocol Use
SMB3/SMB Direct: “Push Mode”
NFS/RDMA: via a possible pNFS layout extension
Other: Commit can work against any remotely-mappable device, e.g. NVMe with a PCIe BAR
Anything that can be memory-registered and accessed via RDMA
[Figure: SMB3 server data paths to PMEM: (1) traditional I/O, (2) DAX load/store by the SMB3 server, (3) Push Mode direct from the RDMA NIC]
Example: Going Remote – SMB3
SMB3 RDMA and “Push Mode” were discussed at previous SNIA Storage Developer Conferences
Enables zero-copy remote read/write to a DAX file
Ultra-low latency and overhead
Paths 2 and 3 can be enabled even before RDMA Commit extensions become available, with only slight incremental cost
[Figure: SMB3 Push Mode. The client reaches the server’s PMEM three ways: I/O requests through the SMB3 server and the DAX filesystem “buffer cache”; server-side load/store through the direct file mapping; or RDMA read/write and Push/Commit straight from the RDMA NIC to the PMEM.]
Remote Persistent Memory Workloads
1) Basic Replication
Write, optionally more Writes, then Commit
No overwrite
No ordering dependency (but see the log-writer and non-posted-write discussion)
No completions at the data sink
Can be pipelined
Other semantics: asynchronous mode
Under discussion in the SNIA NVMP TWG, where it’s affectionately named “Giddy-Up”
Local API behavior at the initiator to perform Commit asynchronously
Enables remote write-behind for load/store access, among other scenarios
Complicates error recovery, but in a well-defined way
Reads are interesting too
But easily interleaved with writes/commits
No additional protocol implications (envisioned)
2) Log Writer (Filesystem)
For (ever):
{ Write log record, Commit }, { Write log pointer, Commit }
Latency is critical
The log pointer cannot be placed until the log record has successfully been made durable
The log pointer is the validity indicator for the log record
Transaction model
Log records are eventually retired; the buffer is circular
Protocol implications (see the sketch after this list):
Must wait for the first commit (and possibly the second)
Introduces a pipeline bubble – very bad for throughput and overall latency
Desire an ordering between the Commit and the second Write
Possible solution: “non-posted write”
A special Write which executes “like a Read” – ordered with other non-posted operations, for example Commits and Reads
Being discussed in the IBTA
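In the same hypothetical placeholder notation as the earlier Commit sketch, the log-writer sequence and its bubble:

    for (;;) {
        post_rdma_write(qp, log_record, rec_addr, rec_len);
        post_rdma_commit(qp, rec_addr, rec_len);
        wait_for_completion(cq);  /* pipeline bubble: the pointer must not
                                     be placed before the record is durable */
        post_rdma_write(qp, log_ptr, ptr_addr, ptr_len);
        post_rdma_commit(qp, ptr_addr, ptr_len);
        wait_for_completion(cq);  /* the pointer now validates the record */
    }
    /* A non-posted write, ordered behind the first commit like a Read,
       would let all four operations be posted back-to-back. */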
3) Log Writer (Database)
For (ever):
While (!full) { Write log record, Commit }
Commit the log segment; persist the segment to disk (asynchronously)
Log record write/commit latency (red 1/2) is critical
Log segment persist-to-disk latency (green 3/4) is not critical
Large improvement to database transaction rate: approximately 2x for SQL Hekaton*
Very similar to the log-based filesystem scenario
Similar RDMA protocol implication, but see next
[Figure: Hekaton (HK) log buffers on NVDIMM-N (byte-addressable), with the log file persisted to SSD (block).]
Configuration          HK on NVMe (block)  HK on NVDIMM-N (DAX)
Row Updates / Second   63,246              124,917
Avg. Time / Txn (ms)   0.379               0.192
4) Log Write With Active-Active Signaling
After log record replication, how do we make the peer aware of it?
Non-posted operations do not generate peer completions
E.g. Commit, RDMA Read, Atomics
… and they are not ordered with posted operations (e.g. Send, Write with Immediate)
Desire: generate a peer completion, but only after durability is achieved
Simple way: the initiator waits for the Commit completion, using an initiator fence or by explicitly waiting (see the sketch below)
Pipeline bubble (bad for latency)
Better way: an ordered-operation rule at the target
Under discussion as part of the extension
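The “simple way”, in the same hypothetical placeholder calls as the earlier sketches:

    post_rdma_write(qp, log_record, rec_addr, rec_len);
    post_rdma_commit(qp, rec_addr, rec_len);
    wait_for_completion(cq);  /* explicit wait (or initiator fence):
                                 durability confirmed, at the cost of a bubble */
    post_rdma_send(qp, notify, notify_len);  /* a posted Send generates
                                                the peer completion */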
Additional Extension: Remote Data Integrity
Assume we have RDMA Write + RDMA Commit, and the Writes + Commit all complete (with success or failure)
How does the initiator know the data is intact?
Or, in case of failure, which data is not intact?
Possibilities:
Reading back: extremely undesirable (and possibly not actually reading the media!)
Signaling the upper layer: high overhead, and the upper layer may be unavailable (the “Memory-Only Appliance”!)
Other?
The same question also applies to:
Array “scrub”
Storage management and recovery
etc.
Proposal: RDMA “VERIFY”
Concept: add integrity hashes to a new operation
Or, possibly, piggybacked on Commit
Note: not unlike SCSI T10 DIF
Hash algorithms to be negotiated by the upper layers
Hashing implemented in the RNIC or in a library “implementation”, which could be in:
The platform, e.g. the storage device itself
RNIC hardware/firmware, e.g. the RNIC performs the readback/integrity computation
Other hardware on the target platform, e.g. chipset or memory controller
Software, e.g. the target CPU
Ideally, as efficiently as possible
Options (a sketch of the first follows):
The source requests hash computation, receives the hash as the result, and performs its own comparison
The source sends its hash to the target; the target computes, compares, and returns success/failure
???
Under discussion in the SNIA NVMP TWG as OptimizedFlushAndVerify()
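A sketch of the first option in the same placeholder style; post_rdma_verify, local_hash and recover are hypothetical, and the hash algorithm would be negotiated by the upper layers:

    uint64_t remote_hash = 0;
    post_rdma_verify(qp, remote_addr, len, &remote_hash);  /* target computes
                                                              the range hash */
    wait_for_completion(cq);
    if (remote_hash != local_hash(local_buf, len))  /* source compares */
        recover(remote_addr, len);                  /* data not intact */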
The (Near?) Future
Persistent Memory is rapidly becoming part of the server hardware ecosystem: NVDIMM-N, 3DXP, etc.
Windows (and Linux) support is foundational and broad
Protocol standards process well under way
Watch for the coming revolution in low latency storage!
Thank You! Questions?