Remote Persistent Memory
Tom Talpey
Outline
Persistent Memory and Windows PMEM Support: a brief summary, for better awareness
RDMA Persistent Memory Extensions: their motivation and use by storage protocols
Example Application Scenarios for Persistent Memory Operations: RDMA operation behavior
Persistent Memory
Memory & Storage Convergence
Volatile and non-volatile technologies are continuing to converge
[Figure: memory and storage convergence. Today's hierarchy of DRAM (and on-package memory, OPM**) over disk/SSD gains a Persistent Memory (PM*) tier built from new and emerging memory technologies: HMC, HBM, RRAM, 3DXPoint, MRAM, PCM, low-latency NAND, and managed DRAM. Source: Gen-Z Consortium 2016.]
*PM = Persistent Memory
**OPM = On-Package Memory
Persistent Memory Brings Storage to Memory Slots
Fast, like memory. Persistent, like storage.
• For system acceleration
• For real-time data capture, analysis and intelligent response
Persistent Memory (PM) Vision
Persistent Memory (PM) is a type of Non-Volatile Memory (NVM)
Disk-like non-volatile memory: a persistent RAM disk
Appears as disk drives to applications
Accessed as a traditional array of blocks
Usually software arbitrated, e.g. a RAMdisk-like driver, or simply NVMe
Memory-like non-volatile memory (PM): appears as memory to applications
Applications store data directly in byte-addressable memory
No I/O or even DMA is required
Persistent Memory (PM) Characteristics
Byte addressable from programmer’s point of view
Provides Load/Store access via CPU instructions
Resides on the memory bus
Memory-like performance: ultra-low latency
Supports DMA including RDMA
Not prone to unexpected latencies
Persistent Memory Terminology
Terms used to describe the hardware:
Storage Class Memory (SCM)
Byte Addressable Storage (BAS)
Persistent Memory (PM): the industry is converging on this term
“PMEM” is also commonly used
Non-Volatile Memory: the “NVM” term is moving to indicate block-mode access
i.e. not byte-addressable, not a memory interface
e.g. NVMe, NVMe-oF, etc.
Windows Persistent Memory Support
Windows Persistent Memory Support
Persistent Memory is supported in Windows 10 and Windows Server 2016
PM support is foundational in Windows and is SKU-independent
Support for JEDEC-defined NVDIMM-N devices is available in Windows Server 2016 and Windows 10 (Anniversary Update, Fall 2016)
Access methods:
Direct Access (DAX) filesystem
Mapped files with a load/store/flush paradigm
Cached and noncached with a read/write paradigm
Block mode (“persistent RAM disk”)
Raw disk paradigm
Application interfaces:
Mapped and traditional file
NVM Programming Library
“PMEM-aware” open coded
Direct Access Architecture
Overview
Support in Windows Server 2016 and Windows 10 Anniversary Update (Fall 2016)
The app has direct access to Storage Class Memory (SCM/PM) via existing memory-mapping semantics
Updates directly modify memory; the storage stack is not involved
DAX volumes are identified through a new flag
Characteristics:
True device performance (no software overhead)
Byte-addressable
Filter drivers relying on I/O may not work or attach (no I/O; new volume flag)
AV filters can still operate (Windows Defender already updated)
[Figure: Direct Access architecture. A block-mode application uses the standard file API through the SCM disk driver and SCM bus driver to reach the PMEM. A DirectAccess application instead requests a memory-mapped file from the SCM-aware file system (NTFS DAX), which enumerates the NVDIMM; its load/store operations then go from user mode directly to the memory-mapped region, so the direct-access data path bypasses the kernel storage stack, which is involved only in the setup path.]
IO in DAX mode
Memory-mapped access: true zero-copy access to storage
An application has direct access to persistent memory
Important: no paging reads or paging writes will be generated
Cached IO access: the cache manager creates a cache map that maps directly to PM hardware
The cache manager copies directly between the user’s buffer and persistent memory
Cached IO thus has one-copy access to persistent storage
Cached IO is coherent with memory-mapped IO
As with memory-mapped IO, no paging reads or paging writes are generated
There is no Cache Manager Lazy Writer thread
Non-cached IO access: simply converted to cached IO by the file system
The cache manager copies directly between the user’s buffer and persistent memory
Coherent with cached and memory-mapped IO
Backward App Compatibility on PM Hardware
Block-mode volumes maintain existing storage semantics
All IO operations traverse the storage stack to the PM disk driver
Sector atomicity is guaranteed by the PM disk driver
A shortened path through the storage stack reduces latency:
No storport or miniport drivers
No SCSI translations
Fully compatible with existing applications
Supported by all Windows file systems
Works with existing file system filters
Block mode vs. DAX mode is chosen at format time
Performance Comparison
4K random writes
1 Thread, single core, QD=1
                   IOPS       Avg Latency (ns)  MB/sec
NVMe SSD           14,553     66,632            56.85
Block Mode NVDIMM  148,567    6,418             580.34
DAX Mode NVDIMM    1,112,007  828               4,343.78
Using DAX in Windows
DAX volume creation:
format n: /dax /q
Format-Volume -DriveLetter n -IsDAX $true
DAX volume identification:
Is it a DAX volume?
call GetVolumeInformation(“C:\”, ...)
check lpFileSystemFlags for FILE_DAX_VOLUME (0x20000000)
Is the file on a DAX volume?
call GetVolumeInformationByHandleW(hFile, ...)
check lpFileSystemFlags for FILE_DAX_VOLUME (0x20000000)
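Putting the volume check together, a minimal C sketch (the drive letter is just an example, and FILE_DAX_VOLUME may be missing from older SDK headers):

    #include <windows.h>
    #include <stdio.h>

    #ifndef FILE_DAX_VOLUME
    #define FILE_DAX_VOLUME 0x20000000  /* not defined in older SDKs */
    #endif

    int main(void)
    {
        DWORD flags = 0;
        /* Query the filesystem flags for the volume root */
        if (!GetVolumeInformationW(L"C:\\", NULL, 0, NULL, NULL,
                                   &flags, NULL, 0)) {
            fprintf(stderr, "GetVolumeInformationW failed: %lu\n",
                    GetLastError());
            return 1;
        }
        printf("C:\\ is %sa DAX volume\n",
               (flags & FILE_DAX_VOLUME) ? "" : "not ");
        return 0;
    }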
Using DAX in Windows
Memory mapping:
HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
LPVOID baseAddress = MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, size);
memcpy((char *)baseAddress + writeOffset, dataBuffer, ioSize);  // direct store into the mapped PM
FlushViewOfFile(baseAddress, 0);  // size 0 = flush the entire view
OR use non-temporal instructions on NVDIMM-N devices for better performance:
HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
LPVOID baseAddress = MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, size);
RtlCopyMemoryNonTemporal((char *)baseAddress + writeOffset, dataBuffer, ioSize);
Future: use the open-source NVM Programming Library
Linux Kernel 4.4+ NVDIMM-N Support
PMEM: a system-physical-address range where writes are persistent. A block device composed of PMEM is capable of DAX. A PMEM address range may span an interleave of several DIMMs.
DAX: file system extensions to bypass the page cache and block layer, and memory map persistent memory from a PMEM block device directly into a process address space.
BTT (Block, Atomic): Block Translation Table. Persistent memory is byte addressable, but existing software may expect that the power-fail atomicity of writes is at least one sector (512 bytes). The BTT is an indirection table with atomic update semantics that fronts a PMEM/BLK block device driver and presents arbitrary atomic sector sizes.
BLK: a set of one or more programmable memory-mapped apertures provided by a DIMM to access its media. This indirection precludes the performance benefit of interleaving, but enables DIMM-bounded failure modes.
Linux 4.2 added support for NVDIMMs; mostly stable from 4.4 onward
NVDIMM modules presented as device links: /dev/pmem0, /dev/pmem1
QEMU support (experimental)
XFS-DAX and EXT4-DAX available
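For comparison with the Windows example above, a minimal sketch of the same load/store paradigm on Linux; the mount point /mnt/pmem and file name are illustrative, assuming an ext4-DAX or XFS-DAX filesystem:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* /mnt/pmem is assumed to be a DAX-mounted filesystem */
        int fd = open("/mnt/pmem/example", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, 4096) < 0) return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        memcpy(p, "hello, pmem", 12);  /* direct store, no page cache */
        msync(p, 4096, MS_SYNC);       /* request durability */

        munmap(p, 4096);
        close(fd);
        return 0;
    }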
Remote Access to Persistent Memory
Going Remote
A single local copy of storage isn’t storage at all; it's basically temp data
Enterprise-grade storage requires replication: multi-device quorum
In addition to integrity, privacy, manageability, … (requirements vary)
Remote access is therefore required
PM value is all about LATENCY: the goal is single-digit-microsecond remote latency
Which, by the way, is 2-3 orders of magnitude better than today’s block storage
We can take steps to get there, with great benefit at each step
Use RDMA: this requires an RDMA protocol extension
RDMA Protocols
Need a remote guarantee of Durability
RDMA Write alone is not sufficient for this semantic
It indicates only that data has been accepted for ordered delivery, not that it has actually been delivered and placed
An extension is required: the proposed “RDMA Commit”, a.k.a. “RDMA Flush”
Executes like RDMA Read: ordered, flow controlled, acknowledged
The initiator requests specific byte ranges to be made durable
The responder acknowledges only when durability is complete
There is strong consensus on these basics, under discussion in the IBTA, SNIA and other venues
Details are still being worked out
Scope of durability: region-based, region-list-based, and connection scope are all under discussion
Connection scope seems most efficient for implementations
Additional semantics are possible (later in this deck); a sketch of the basic flow follows
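To make the semantics concrete, a pseudocode sketch of the proposed flow; post_rdma_write, post_rdma_commit and wait_for_completion are hypothetical placeholders for verbs that do not exist yet, not a real API:

    /* Hypothetical initiator-side sequence for the proposed extension */
    post_rdma_write(qp, buf_a, remote_pm_addr_a, len_a);  /* posted, pipelined */
    post_rdma_write(qp, buf_b, remote_pm_addr_b, len_b);
    post_rdma_commit(qp, ranges, nranges);  /* executes like a Read: ordered,
                                               flow controlled, acknowledged */
    wait_for_completion(cq);  /* the responder acknowledges only after the
                                 requested byte ranges are durable */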
RDMA-Aware Storage Protocol Use
SMB3/SMB Direct: “Push Mode”
NFS/RDMA: via a possible pNFS layout extension
Other: Commit can work against any remotely-mappable device, e.g. NVMe with a PCIe BAR
Anything that can be memory-registered and accessed via RDMA
[Figure: SMB3 server data paths to PMEM: (1) traditional I/O, (2) DAX load/store by the SMB3 server, (3) Push Mode direct from the RDMA NIC]
Example: Going Remote – SMB3
SMB3 RDMA and “Push Mode” were discussed at previous SNIA Storage Developer Conferences
Enables zero-copy remote read/write to a DAX file
Ultra-low latency and overhead
Paths 2 and 3 can be enabled even before RDMA Commit extensions become available, with only slight incremental cost
[Figure: SMB3 Push Mode. The client reaches the server’s PMEM three ways: I/O requests through the SMB3 server and the DAX filesystem “buffer cache”; server-side load/store through the direct file mapping; or RDMA read/write and Push/Commit straight from the RDMA NIC to the PMEM.]
Remote Persistent Memory Workloads
1) Basic Replication
Write, optionally more Writes, then Commit
No overwrite
No ordering dependency (but see the log-writer and non-posted-write discussion)
No completions at the data sink
Can be pipelined
Other semantics: asynchronous mode
Under discussion in the SNIA NVMP TWG, where it’s affectionately named “Giddy-Up”
Local API behavior at the initiator to perform Commit asynchronously
Enables remote write-behind for load/store access, among other scenarios
Complicates error recovery, but in a well-defined way
Reads are interesting too
But easily interleaved with writes/commits
No additional protocol implications (envisioned)
2) Log Writer (Filesystem)
For (ever):
{ Write log record, Commit }, { Write log pointer, Commit }
Latency is critical
The log pointer cannot be placed until the log record has successfully been made durable
The log pointer is the validity indicator for the log record
Transaction model
Log records are eventually retired; the buffer is circular
Protocol implications (see the sketch after this list):
Must wait for the first commit (and possibly the second)
Introduces a pipeline bubble – very bad for throughput and overall latency
Desire an ordering between the Commit and the second Write
Possible solution: “non-posted write”
A special Write which executes “like a Read” – ordered with other non-posted operations, for example Commits and Reads
Being discussed in the IBTA
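In the same hypothetical placeholder notation as the earlier Commit sketch, the log-writer sequence and its bubble:

    for (;;) {
        post_rdma_write(qp, log_record, rec_addr, rec_len);
        post_rdma_commit(qp, rec_addr, rec_len);
        wait_for_completion(cq);  /* pipeline bubble: the pointer must not
                                     be placed before the record is durable */
        post_rdma_write(qp, log_ptr, ptr_addr, ptr_len);
        post_rdma_commit(qp, ptr_addr, ptr_len);
        wait_for_completion(cq);  /* the pointer now validates the record */
    }
    /* A non-posted write, ordered behind the first commit like a Read,
       would let all four operations be posted back-to-back. */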
3) Log Writer (Database)
For (ever):
While (!full) { Write log record, Commit }
Commit the log segment; persist the segment to disk (asynchronously)
Log record write/commit latency (red 1/2) is critical
Log segment persist-to-disk latency (green 3/4) is not critical
Large improvement to database transaction rate: approximately 2x for SQL Hekaton*
Very similar to the log-based filesystem scenario
Similar RDMA protocol implication, but see next
[Figure: Hekaton (HK) log buffers on NVDIMM-N (byte-addressable), with the log file persisted to SSD (block).]
Configuration          HK on NVMe (block)  HK on NVDIMM-N (DAX)
Row Updates / Second   63,246              124,917
Avg. Time / Txn (ms)   0.379               0.192
4) Log Write With Active-Active Signaling
After log record replication, how do we make the peer aware of it?
Non-posted operations do not generate peer completions
E.g. Commit, RDMA Read, Atomics
… and they are not ordered with posted operations (e.g. Send, Write with Immediate)
Desire: generate a peer completion, but only after durability is achieved
Simple way: the initiator waits for the Commit completion, using an initiator fence or by explicitly waiting (see the sketch below)
Pipeline bubble (bad for latency)
Better way: an ordered-operation rule at the target
Under discussion as part of the extension
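The “simple way”, in the same hypothetical placeholder calls as the earlier sketches:

    post_rdma_write(qp, log_record, rec_addr, rec_len);
    post_rdma_commit(qp, rec_addr, rec_len);
    wait_for_completion(cq);  /* explicit wait (or initiator fence):
                                 durability confirmed, at the cost of a bubble */
    post_rdma_send(qp, notify, notify_len);  /* a posted Send generates
                                                the peer completion */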
Additional Extension: Remote Data Integrity
Assume we have RDMA Write + RDMA Commit, and the Writes + Commit all complete (with success or failure)
How does the initiator know the data is intact?
Or, in case of failure, which data is not intact?
Possibilities:
Reading back: extremely undesirable (and possibly not actually reading the media!)
Signaling the upper layer: high overhead, and the upper layer may be unavailable (the “Memory-Only Appliance”!)
Other?
The same question also applies to:
Array “scrub”
Storage management and recovery
etc.
Proposal: RDMA “VERIFY”
Concept: add integrity hashes to a new operation
Or, possibly, piggybacked on Commit
Note: not unlike SCSI T10 DIF
Hash algorithms to be negotiated by the upper layers
Hashing implemented in the RNIC or in a library “implementation”, which could be in:
The platform, e.g. the storage device itself
RNIC hardware/firmware, e.g. the RNIC performs the readback/integrity computation
Other hardware on the target platform, e.g. chipset or memory controller
Software, e.g. the target CPU
Ideally, as efficiently as possible
Options (a sketch of the first follows):
The source requests hash computation, receives the hash as the result, and performs its own comparison
The source sends its hash to the target; the target computes, compares, and returns success/failure
???
Under discussion in the SNIA NVMP TWG as OptimizedFlushAndVerify()
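A sketch of the first option in the same placeholder style; post_rdma_verify, local_hash and recover are hypothetical, and the hash algorithm would be negotiated by the upper layers:

    uint64_t remote_hash = 0;
    post_rdma_verify(qp, remote_addr, len, &remote_hash);  /* target computes
                                                              the range hash */
    wait_for_completion(cq);
    if (remote_hash != local_hash(local_buf, len))  /* source compares */
        recover(remote_addr, len);                  /* data not intact */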
The (Near?) Future
Persistent Memory is rapidly becoming part of the server hardware ecosystem: NVDIMM-N, 3DXP, etc.
Windows (and Linux) support is foundational and broad
Protocol standards process well under way
Watch for the coming revolution in low latency storage!
Thank You! Questions?