nvmfs: a new file system designed specifically to take advantage of nonvolatile memory
TRANSCRIPT
NVMFS: A New File System Designed Specifically to Take Advantage of
Nonvolatile Memory
Dhananjoy Das, Sr. Systems Architect SanDisk Corp.
Flash Memory Summit 2015 Santa Clara, CA
1
Forward-Looking Statements During our meeting today we will make forward-looking statements. Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to market growth, products and their capabilities, performance and compatibility, cost savings and other benefits to customers. Information in this presentation may also include or be based upon information from third parties, which reflects their expectations and projections as of the date of issuance. We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.
Agenda: Applications are KING!
§ Storage landscape (Flash / NVM) § Non Volatile Memory File System
§ [Use Case ] MySQL Atomics § [Use Case ] MySQL NVM-Compressed § [Use Case ] Extended memory, DB Acceleration
Flash Memory Summit 2015 Santa Clara, CA
123
Non-Volatile Memory (NVM) Current: NAND Flash
§ Capacity: 100s of GB to 100s of TB per device § Trends: Higher capacity, lower cost/GB,
lower write cycles, SLC->MLC->3BPC § IOPS: 100K to millions, GB/s of bandwidth
Non-Volatile Memory § DDR/PCI-e attached NVDIMM / Capacitance backed power-
safe buffers + FLASH § orders of magnitude performance improvement
Future: Non-Volatile Memory technologies (Phase Change Memory, MRAM, STT-RAM, etc.)
PCIe attached
SAS and SATA attach SSDs
Fabric attached
NVDIMMs
Flash Memory Summit 2015 Santa Clara, CA
Why Do Applications Need Optimization for Flash?
Flash Memory Summit 2015 Santa Clara, CA
Read/Write Performance
Largely symmetrical Heavily asymmetrical
Wear out/ Background ops
Largely unlimited / Rare
Limited write cycles / Regularly
IOPS/
Latency
100 to 1,000 / 10’s milli sec
100Ks to Millions / 10’s-100’s micro sec
Addressing
Sequence, Sector
Direct, addressable
HDD Flash
Managed writes = greater device lifetime (wear leveling, endurance) Improved system efficiency (TCO and TCA)
Becoming “Flash Aware”: SanDisk NVMFS
Non-Volatile Memory File System – Optimized for Flash and Persistent Memory
Native Flash Translation Layer block allocation, mapping, recycling
ACID updates, logging/journaling, crash-recovery
NVMFS file metadata mgmt
Kernel block layer
kernel-space
user-space
Standard Linux File Systems file metadata mgmt,
block allocation, mapping, recycling, ACID updates, logging/journaling,
crash-recovery
NVM Primitives
MySQL™ Database
Linux VFS (virtual file system) abstraction layer
New File system
Eliminating Duplicate Logic and Leveraging New Primitives for Optimal Flash Performance and Efficiency Flash Memory Summit 2015 Santa Clara, CA
§ NVMFS is POSIX compliant
§ Leverages the functionality of underlying Flash Translation Layer (FTL)
§ Namespace Management
§ Enables Direct flash/memory access and crash recovery
§ NVM primitives are exposed through standard system file interface
Flash Beyond Disk: Adapting the Software Stack
§ Flash-Aware Acceleration § Changes to MySQL are “aware”
of flash and automatically leverage optimized API
§ NVMFS 1.1 New! § NVM optimized filesystem § Standard file namespace, all
existing customer management tools work
§ Raw performance of NVM § Flash-aware interfaces direct to
applications Flash-Aw
are Acceleration
NVM Primitives Standard Block I/O
Flash-Aware Applications Traditional Databases
Enhanced APIs Double-Buffer Writes
(writes) (writes)
NVM Optimized Filesystem NVMFS 1.1 Linux File Systems
Traditional FTL
SanDisk ioMemory Other SSDs
Flash Memory Summit 2015 Santa Clara, CA
Legacy MySQL Challenges Double-Write and Compression Penalties
Every MySQL write translates to 2 writes to storage device
80% performance penalty with legacy MySQL
compression enabled
Page C Page
B
Page A
Buffer
DRAM Buffer
SSD (or HDD) Database
Database Server
Page C
Page B
Page A
Page C
Page B
Page A
Page C
Page B
Page A
Application initiates updates to pages A, B, and C.
1
MySQL copies updated pages to memory buffer.
2
MySQL writes to double-write buffer on the media.
3
Once step 3 is acknowledged, MySQL writes the updates to the actual tablespace.
4
2 Writes
80%
redu
ctio
n in
TP
S
21
Tran
sact
ion
Rat
e
Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications. Flash Memory Summit 2015
Santa Clara, CA
Solving the Double-Write Problem SanDisk NVMFS with Atomic Write
▸ MySQL with Atomic Write
Fusion ioMemory Database
Page C
Page B
Page A
DRAM Buffer
Page C
Page B
Page A
Application initiates updates to pages A, B, and C.
1
MySQL copies updated pages to memory buffer.
2
MySQL writes to actual tablespace, bypassing the double-write buffer step due to inherent atomicity guaranteed by the intelligent Fusion ioMemory device.
3
Database Server
Page C Page
B
Page A
§ Enhanced Life Expectancy of Fusion ioMemory Devices:
– Reduce Writes to flash by half at similar throughput
§ Improved performance consistency
§ Reduced latency, increased transaction/sec
§ Higher performance – Especially workloads with datasets that are bigger than
DRAM
Perfect Fit for ACID-compliant MySQL
1
One, Single Atomic Write!
The performance results discussed herein are based on internal testing and use of Fusion ioMemory products. Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications.
Flash Memory Summit 2015 Santa Clara, CA
SanDisk NVMFS Improves Latency Consistency Lower Latency with Greater Stability
0 20 40 60 80 100 120 140 160 180 200
1 78
155
232
309
386
463
540
617
694
771
848
925
1002
1079
1156
1233
1310
1387
1464
1541
1618
1695
1772
1849
1926
2003
2080
2157
2234
2311
2388
2465
2542
2619
2696
2773
2850
2927
3004
3081
3158
3235
3312
3389
3466
3543
Latency (m
sec)
Time (Seconds)
NVMFS atomics XFS double-‐write
NVMFS Atomic Write Significantly Reduces Latency while Increasing Performance Consistency
Sysbench -‐ MariaDB 10.0.15, 4000 OLTP TXN injecLon/second, 99% latency, 220GB data -‐ 10GB buffer pool
NVMFS latency range
XFS latency range
Flash Memory Summit 2015 Santa Clara, CA
Improving MySQL Compression SanDisk Contribution to MySQL Community
§ Benefits of compression without severe performance penalty • Within 10% of uncompressed
§ Up to 50% improvement in capacity utilization1
§ Enhanced life expectancy of flash devices2
• Up to 4x fewer writes to storage with Compression and Atomic Write
Compression with almost no performance penalty 1For workloads that compress well. Improvement will vary 2At Similar Throughput (assuming same load)
Transaction Rate
2
The performance results discussed herein are based on internal testing and use of Fusion ioMemory products. Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications.
Flash Memory Summit 2015 Santa Clara, CA
Flash Beyond Disk: NVM Compression
High level approach
§ Application operates on sparse address space which is always the size of uncompressed.
§ Compressed data block is written in place at same virtual address as the un-compressed. Leaving a hole, empty space in the remainder of space allocated.
§ FTL garbage collection naturally coalesces the addresses in physical space, allowing for re-provisioning of physical space.
Database
File System File 1 File 2 File 3
NVM Compression
write fallocate(PUNCH_HOLE)
FTL (Mapping and Garbage Collection
Sparse Address Space Addr (0) -> (248 – 1)
Physical Address Space (0 – Device capacity)
Read/Write/PTrim
FTL indirect map
Flash Memory Summit 2015 Santa Clara, CA
2
Compression without Performance Penalty Improved Flash Utilization
0
5000
10000
15000
20000
25000
30000
Tim
e 80
160
240
320
400
480
560
640
720
800
880
960
1040
11
20
1200
12
80
1360
14
40
1520
16
00
1680
17
60
1840
19
20
2000
20
80
2160
22
40
2320
24
00
2480
25
60
2640
27
20
2800
28
80
2960
30
40
3120
32
00
3280
33
60
3440
35
20
# of
Tra
nsac
tions
Time (Seconds)
TPC-C-Like benchmark, 1000 warehouses - 75GB Buffer pool, MariaDB 10.0.15
5x improvement
Legacy MySQL with Compression ON
MySQL /NVMFS with Compression ON
Legacy MySQL with Compression OFF
Flash Memory Summit 2015 Santa Clara, CA
2
NVM Compression Eliminates Legacy MySQL’s Compression Penalty
Database Acceleration using Flash Extended Memory
NVMFS (POSIX compliant file system) PMM (Persistent Memory Manager)
NVMFS Help to Increase Performance of Databases
Flash Devices Memory Persistent Memory
Memory Device Technologies:
Memory Interfaces
Oracle MySQL™
(No application changes) (Application optimized)
“Flash -Aware” “Transparent”
Flash Extended Memory
Flash Memory Summit 2015 Santa Clara, CA
3
DDR / PCI-e
MySQL Configuration
16
Fusion ioMemory™ PCIe-based flash
Standard MySQL
database
“Baseline”
Flash Extended Memory Enabled
Fusion ioMemory™ and persistent memory with
NVMFS
Standard MySQL
database
“Transparent”
Optimized MySQL
database
“Flash Aware”
Fusion ioMemory™ and persistent memory with
NVMFS
Flash Memory Summit 2015 Santa Clara, CA
3NVM attach points: DDR / PCI-e
Performance Results
Latency (lower is better)
Throughput (higher is better)
Source: Based on internal testing by SanDisk Flash Memory Summit 2015 Santa Clara, CA
3PCI-e attached NVM
MySQL :: Transparent Acceleration (latency) Performance Overview:: Comparison Between Baseline and NVDIMM
Flash Memory Summit 2015 Santa Clara, CA
3 Using DDR NVDIMM
5614%
22857% 19512% 17204% 14035%11189%
48485%39024% 34783%
28571%
(25%)Insert/Delete/
(50%)Insert/Lookup5
70%Insert530%Lookup5
80%Insert520%Lookup5
100%Insert5
MySQL55.6.245AcceleraDon5Mixed%Workload;%Threads;16,%dataset(thread):%100K%%
(Higher(is(Be+er)(
Baseline% NVDIMM;enabled%
Flash Memory Summit 2015 Santa Clara, CA
3
Up to 2x higher transactions per second using DDR attached NVDIMM
MySQL :: Transparent Acceleration Performance Overview:: Comparison Between Baseline and NVDIMM
Atomic Writes Summary § Transac?on throughput improvements of roughly 2.4x over conven?onal flash SSDs
§ Half as many writes per transac?on means poten?al for as
much as 2x flash endurance for write intensive workloads: more cost effecLve flash storage
§ Standardized: Approved SNIA standard,SBC-‐4 SPC-‐5 Atomic-‐Write hTp://www.t10.org/cgi-‐bin/ac.pl?t=d&f=11-‐229r6.pdf
§ Uses unique flash-‐aware op?miza?ons from SanDisk
Flash Memory Summit 2015 Santa Clara, CA
§ Available commercially and fully supported from SanDisk (NVMFS 1.1) and Oracle MySQL 5.7.4, Percona Server 5.5, 5.6 and MariaDB 10
1
The performance results discussed herein are based on internal testing and use of Fusion ioMemory products. Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications.
NVM Compression Summary § Performance within 10% of uncompressed (and
sometimes greater) for Linkbench and TPC-C like workloads. 5x acceleration of TPC-C like as compared to Row Compression
§ Storage savings of ~2x (data dependent) with as much or more compressibility as MySQL row compression
§ Upto 4x better flash endurance when combined with Atomic Writes
§ Addition power/cooling benefits and scalability benefits § Available commercially and fully supported from SanDisk (NVMFS 1.0)
and MariaDB 10, coming soon from Oracle and Percona distributions Flash Memory Summit 2015 Santa Clara, CA
2
The performance results discussed herein are based on internal testing and use of Fusion ioMemory products. Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications.
Extended Memory: Advantages & Benefits
§ Uses “Flash-as-Memory” byte-addressable architecture and interface § Seamless deployment – add ioMemory and NVMFS/PMM software to Linux
§ Increase performance and capacity in flexible configurations
Flash Memory Summit 2015 Santa Clara, CA
3 Workload: Insert Heavy
Transparent (no software mods)
Flash-Aware (modified software)
Throughput 1.8x - 2x 4x Latency 2x 4x
The performance results discussed herein are based on internal testing and use of Fusion ioMemory products. Results and performance may vary according to configurations and systems, including drive capacity, system architecture and applications.
Who is NVMFS for? § NVMFS will optimize customer database flash storage by improving
§ Transactional performance such as, latency and throughput § Enhanced lifespan of flash devices § Practical capacity
§ Enterprise environment § OLTP databases running in a Linux OS environment § Insert heavy workloads needing to persist large amounts of data § Latency sensitive OLTP workloads § Concerned about flash endurance
§ Hyperscale environment § To improve CPU utilization per node § Clusters of MySQL nodes by being able to store more data
Flash Memory Summit 2015 Santa Clara, CA
Questions?
Flash Memory Summit 2015 Santa Clara, CA
24
Thank You!
Booth#207 @BigDataFlash #bigdataflash
ITblog.sandisk.com hTp://bigdataflash.sandisk.com
Paper: https://www.usenix.org/system/files/conference/inflow14/inflow14-das.pdf http://www.sandisk.com/assets/white-papers/MySQL_with_directFS_and_Atomic_Writes.pdf Blog: http://itblog.sandisk.com/in-a-battle-of-hardware-software-innovation-comes-out-on-top/
Performance -‐ Datapath
Micro-‐benchmark § Parallel, direct I/O on a single
file on a very fast device ApplicaLons § Databases § Virtual Machines
0
0.2
0.4
0.6
0.8
1
1.2
0 1 2 3 4 5 6 7 8 9 Pe
rfor
man
ce
(nor
mal
ized
) Threads
block nvmfs xfs
0
1
2
3
4
5
6
7
8
nvmfs xfs Elap
sed Time
(normalize
d)
Performance -‐ TRIM Handling
Micro-‐benchmark § Trim aher write § 16 KiB Direct Write + 4 KiB TRIM
ApplicaLons § MySQL Page-‐compression
Performance -‐ File Logging
block device nvmfs xfs IOPS 1 0.97 0.36 Physical Bytes Written 1 1 1.38
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Perf
orm
ance
(n
orm
aliz
ed)
Micro-‐benchmark § Append data at end of file § 4 KiB write(2) + fdatasync(2) ApplicaLons § Databases § HFT § Log Structured Systems
§ Up to 3.2x reduction in Writes to flash resulting in a longer device lifetime § Utilize flash hardware longer 80.77%
43.03%
24.37%
42.4%
18.75%7.7%
100%%Update% 50%%Update,%50%%Lookup%
30%%Update,%70%%Lookup%
Flash%Writes%(GB)%(Lower'is'Be+er)'
Baseline%Writes% OpDmized%Writes%
Endurance benefits for Cassandra Performance Overview:: Comparison Between Baseline and NVDIMM
Flash Memory Summit 2015 Santa Clara, CA