Introduction of System Software for Persistent Memory
Makoto Shimazu
@Reading Circle
2014/12/18
S. R. Dulloor¹,³, S. Kumar¹, A. Keshavamurthy², P. Lantz¹, D. Reddy¹, R. Sankaran¹, J. Jackson¹
¹Intel Labs, ²Intel Corp, ³Georgia Institute of Technology
EuroSys 2014
Contributions
Introduction of pm_wbarrier
File system architecture optimized for PM
light-weight and consistent POSIX file system
memory-mapped I/O
protection against stray writes
Performance evaluation with PM emulator
Outline
Volatile cache problem
Architecture
Consistency
Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion
Caching problem in PM
Explicitly flushing the CPU cache with clflush works well
However, clflush cannot flush data out of the memory controller (MC)
[Figure: access paths compared: load/store to DRAM, read/write to SSD/HDD, load/store to PM. Labels: Cache, MC, Volatile Area, Non-volatile Area. HDD/SSD image from http://storage-system.fujitsu.com/jp/lib-f/tech/beginner/ssd/]
pm_wbarrier
Feature
Enforces the durability of a cache line
Steps of usage
1. clflush A: flush the cache line that contains A
2. sfence: ensure the completion of the preceding store and flush
3. pm_wbarrier: ensure the durability of every store to PM
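The three steps can be illustrated with a small simulation. This is only a toy model under stated assumptions, not real hardware behavior: the class ToyPM, its dictionaries, and the helper persist are hypothetical names, and sfence is a no-op because the model executes sequentially.

```python
class ToyPM:
    """Toy model: a store passes through the CPU cache and the memory
    controller (MC) buffer, both volatile, before reaching PM."""

    def __init__(self):
        self.cache = {}  # volatile: CPU cache lines
        self.mc = {}     # volatile: memory-controller buffer
        self.pm = {}     # non-volatile: persistent memory

    def store(self, addr, value):
        self.cache[addr] = value  # a store lands in the cache first

    def clflush(self, addr):
        # Write the line back, but only as far as the MC buffer.
        if addr in self.cache:
            self.mc[addr] = self.cache.pop(addr)

    def sfence(self):
        pass  # ordering only; this sequential model has nothing to reorder

    def pm_wbarrier(self):
        self.pm.update(self.mc)  # drain the MC buffer into PM
        self.mc.clear()

    def crash(self):
        self.cache.clear()  # power loss wipes both volatile areas
        self.mc.clear()


def persist(mem, addr, value):
    """Store a value and make it durable with the three-step sequence."""
    mem.store(addr, value)
    mem.clflush(addr)
    mem.sfence()
    mem.pm_wbarrier()
```

In this model, a crash after clflush and sfence but before pm_wbarrier loses the value, while a crash after persist does not.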
Consistency
Three existing techniques:
Copy on Write (CoW): used for updates to the data area
Journaling: used for updates to metadata (inodes)
Log-structured updates
One more PM-specific technique:
Atomic in-place writes: used for updates to a small portion of data
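Copy on Write (Shadow Paging)
Safe and consistent method to modify data
Three steps:
1: Copy the block
2: Modify the copy
3: Refer (switch the parent's pointer to the copy)
Recursive copy: switching the pointer modifies the parent block, so the parent must be copied too, up to the root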
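The recursive copy can be sketched on an in-memory tree. This is a minimal illustration, not the file system's on-PM layout: Node and cow_update are hypothetical names, and a path is assumed to be a tuple of child indexes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    data: str = ""
    children: tuple = ()


def cow_update(node, path, new_data):
    """Return a new root in which the node at `path` carries new_data.

    Existing nodes are never modified; every node on the path is copied.
    """
    if not path:
        # Steps 1 and 2: copy the target node, with the modification applied.
        return Node(new_data, node.children)
    i = path[0]
    new_child = cow_update(node.children[i], path[1:], new_data)
    # Step 3: the parent must refer to the new child, so it is copied as well.
    children = node.children[:i] + (new_child,) + node.children[i + 1:]
    return Node(node.data, children)
```

Until the caller switches its root reference to the returned node, the old tree stays intact and consistent; subtrees off the path are shared between both versions.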
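Journaling
[Figure: hello.txt initially holds “Hello World!”. The log records 1: WRITE “RINKO” and 2: WRITE “NOW!!!”. A crash (CRASH!) during the second write leaves the file as “Hello World! / RINKO / NXXXX”; applying the log to the snapshot recovers “Hello World! / RINKO / NOW!!!”.]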
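Recovery by log replay can be sketched as follows. This is a simplified redo-log model, assuming the file is a list of lines and each log entry is a (line index, text) pair; apply_write and recover are hypothetical names.

```python
def apply_write(lines, op):
    """Apply one logged write (line index, text) and return the new contents."""
    index, text = op
    out = list(lines)
    out.extend([""] * (index + 1 - len(out)))  # grow the file if needed
    out[index] = text
    return out


def recover(state, log):
    """Replay every logged write, in order, on top of the given state."""
    for op in log:
        state = apply_write(state, op)
    return state


snapshot = ["Hello World!"]
log = [(1, "RINKO"), (2, "NOW!!!")]
torn = ["Hello World!", "RINKO", "NXXXX"]  # contents after a crash mid-write
```

Because each entry overwrites a whole line, replaying the log is idempotent: it gives the same result whether it starts from the snapshot or from the torn contents.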
Hybrid method
Metadata: updated by fine-grained logging
Data: updated by Copy on Write
Comparison (◯ = well suited, ☓ = poorly suited):
                 Distributed small modification    Centralized large modification
Copy on Write    ☓ (write amplification)           ◯ (modify freely after copy)
Journaling       ◯ (just append logs)              ☓ (double writes)
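The trade-off can be made concrete with a rough cost model. The numbers below are assumptions for illustration only: a 4096-byte block size, each modification touching its own blocks, and log headers ignored; cow_bytes and journal_bytes are hypothetical names.

```python
import math


def cow_bytes(mod_sizes, block_size=4096):
    """Bytes written by Copy on Write: every touched block is copied whole."""
    return sum(math.ceil(s / block_size) * block_size for s in mod_sizes)


def journal_bytes(mod_sizes):
    """Bytes written by journaling: once to the log, once in place."""
    return sum(2 * s for s in mod_sizes)


small = [8] * 100    # distributed small modifications
large = [1 << 20]    # one centralized 1 MiB modification
```

Under these assumptions, the small workload costs 409600 bytes with Copy on Write but 1600 bytes with journaling, while the large one costs 1 MiB with Copy on Write and 2 MiB with journaling.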
Extended atomic in-place writes
8 bytes (the same as BPFS): update an inode's access time
16 bytes (using the cmpxchg16b instruction): update an inode's size and modification time together
64 bytes (using RTM, which was introduced in Haswell and has an erratum): update several inode fields at once, e.g. on delete
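The size classes can be sketched as a simple selection rule. This only illustrates the thresholds; Python cannot issue cmpxchg16b or RTM itself. The function name, the returned labels, the field layout (two unsigned 64-bit integers), and the fallback to fine-grained logging for larger updates are all assumptions.

```python
import struct


def pick_update_mechanism(nbytes):
    """Choose how an update of nbytes could be made atomic (illustrative)."""
    if nbytes <= 8:
        return "8-byte atomic write"
    if nbytes <= 16:
        return "cmpxchg16b"
    if nbytes <= 64:
        return "RTM"
    return "fine-grained logging"  # assumed fallback for larger updates


# Assumed layout: access time as one 64-bit field; size and modification
# time as two adjacent 64-bit fields.
atime = struct.pack("<Q", 1418860800)
size_and_mtime = struct.pack("<QQ", 4096, 1418860800)
```

With this layout the access time fits the 8-byte class, and the size plus modification time fit exactly into the 16 bytes that cmpxchg16b can update at once.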
Write Protection
Supervisor Mode Access Prevention (SMAP)
Prohibits kernel writes into the user area
Write windows (introduced in this paper)
PM is mapped read-only at mount time
Only while writing, CR0.WP is set to zero so that the kernel can write to the read-only mapping
Figure (right): http://en.wikipedia.org/wiki/Protection_ring
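The idea of a write window can be modeled in user space. This is an analogy only, not kernel code: ProtectedPM is a hypothetical class whose _wp flag stands in for CR0.WP, and a context manager plays the role of the window.

```python
from contextlib import contextmanager


class ProtectedPM:
    """Toy PM region that rejects writes outside an explicit write window."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._wp = True  # models CR0.WP = 1: read-only mapping is enforced

    @contextmanager
    def write_window(self):
        self._wp = False  # models clearing CR0.WP
        try:
            yield self
        finally:
            self._wp = True  # protection is restored even on error

    def write(self, offset, payload):
        if self._wp:
            raise PermissionError("stray write blocked: PM is read-only")
        self._data[offset:offset + len(payload)] = payload

    def read(self, offset, length):
        return bytes(self._data[offset:offset + length])
```

A write issued outside `with pm.write_window():` raises PermissionError, so only code that deliberately opens a window can modify the region.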
Implementation on Linux
Execute In Place (XIP)
Interface originally meant for loading data directly from flash in RAM-limited environments
PMFS uses it to bypass the block device and page cache layers
Testing and Validation
Yat: hypervisor-based validation framework
Checks that cache flushes and pm_wbarrier are executed in the correct order
The Yat paper was published at USENIX ATC’14
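The kind of ordering rule being validated can be sketched as a trace checker. This is not Yat's algorithm, only a simplified stand-in: the trace format of (operation, address) pairs, the "commit" marker, and the function check_trace are assumptions made for illustration.

```python
def check_trace(trace):
    """Return (index, addresses) for each commit reached with non-durable stores.

    A store counts as durable only after clflush, then sfence, then pm_wbarrier.
    """
    dirty, flushed, fenced = set(), set(), set()
    violations = []
    for i, (op, addr) in enumerate(trace):
        if op == "store":
            dirty.add(addr)
            flushed.discard(addr)
            fenced.discard(addr)
        elif op == "clflush":
            if addr in dirty:
                dirty.remove(addr)
                flushed.add(addr)
        elif op == "sfence":
            fenced |= flushed  # earlier flushes are now complete
            flushed.clear()
        elif op == "pm_wbarrier":
            fenced.clear()     # completed flushes become durable
        elif op == "commit":
            pending = dirty | flushed | fenced
            if pending:
                violations.append((i, sorted(pending)))
    return violations


good = [("store", "A"), ("clflush", "A"), ("sfence", None),
        ("pm_wbarrier", None), ("commit", None)]
missing_barrier = good[:3] + [("commit", None)]
```

The checker accepts the full sequence and flags the trace that commits without pm_wbarrier, as well as one that issues pm_wbarrier before sfence.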
Evaluation
Environment
PM Emulation Platform (PMEP)
PM Block Driver (PMBD)
Results
File-based Access
Memory-Mapped I/O
Write Protection
Evaluation Settings
PM Emulation Platform (PMEP)
Configurable latencies and bandwidth for PM
Configurable pm_wbarrier latency
Environment
Partitioned memory channels using custom BIOS?
Latency emulation: debug hook and HW counter counting LLC stall cycles
Bandwidth emulation: memory controller
Element    Value
CPU        Xeon (2.6GHz), 8 cores x 2 sockets
DRAM       16GB
PM         256GB (disabled NUMA?)
PMBD
Persistent Memory Block Driver (PMBD), presented at MSST’14
Introduced for a fair comparison with block-based file systems running on PM
Open-source implementation: https://github.com/linux-pmbd/pmbd
Partitions memory between DRAM and PM
Uses non-temporal stores
File-based Access
File I/O (right 4 graphs)
Single thread
Single 64GB file
File utilities (bottom)
Run on a Linux kernel tarball
In-place updates/Logging
Effect of in-place updates, compared with fine-grained logging:
Using 16-byte atomic writes: 1.8x faster
Using 64-byte atomic writes: 18% faster
Logging Overhead
Mmap
Random read/write in a single 64GB file
PMFS-D: default 4kB pages
PMFS-L: 1GB pages
The file is large enough not to fit in the page cache
Gains thanks to omitting the page cache
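For readers unfamiliar with memory-mapped I/O, the access pattern looks like this from user space. The sketch uses Python's standard mmap module on an ordinary temporary file, so it still goes through the page cache of whatever file system backs it; mmap_roundtrip is a hypothetical helper, and the file size is an arbitrary assumption.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload, offset, size=mmap.PAGESIZE * 4):
    """Write payload through a memory mapping, then read it back with read()."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # give the file a fixed length to map
        with mmap.mmap(fd, size) as m:
            m[offset:offset + len(payload)] = payload  # plain memory store
            m.flush()  # ask the OS to write the mapped pages back
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
        os.unlink(path)
```

The application touches file contents with ordinary loads and stores at arbitrary offsets instead of read/write system calls, which is the pattern the random read/write benchmark exercises.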
Neo4j (user application of mmap)
Dataset: 10M nodes/100M edges from the Wikipedia dataset
Workload:
Delete: deleting 2000 nodes and associated edges
Insert: adding back the 2000 nodes and the edges
Query: selecting two nodes and calculating the shortest path
Improvements from having no copy overhead
Improvements from lower synchronous write latency
Related Work
Enhancements for new storage: DFS[30], Log-structured File System[37], Conquest FS[41]
Hybrid of NVM and disk or flash: Rio File Cache[24], Conquest FS[41]
PM-only storage: BPFS[27], SCMFS[43]
High-level API on PM: Failure-atomic msync[33], NV-Heaps[26], Mnemosyne[40], library solutions[39]