emerging persistent memory hardware and zufs - pm-based file systems in user space

25
KernelTLV Filesystem Supersession Part I File Systems: why, how and where (another slide deck) [email protected] Part II Emerging PM HW & SW stack implications [email protected] Part III ZUFS – PM-based file systems in user space [email protected] [email protected] © 2017 NetApp, Inc. All rights reserved 1 KernelTLV Meetup Nov. 14 th 2017

Upload: kernel-tlv

Post on 22-Jan-2018

436 views

Category:

Software


0 download

TRANSCRIPT

Page 1: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

KernelTLV Filesystem Supersession

Part I File Systems: why, how and where (another slide deck)[email protected]

Part II Emerging PM HW & SW stack [email protected]

Part III ZUFS – PM-based file systems in user [email protected]@netapp.com

© 2017 NetApp, Inc. All rights reserved1

KernelTLV MeetupNov. 14th 2017

Page 2: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Part IIPersistent Memory (PM) – HW & SW implications

Emerging PM/NVDIMM devices, the value they bring to applications and

how they revolutionize the storage stack

© 2017 NetApp, Inc. All rights reserved2

KernelTLV MeetupNov. 14th 2017

Page 3: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

About

© 2017 NetApp, Inc. All rights reserved.3

~12,000employees

50+countries

Only Top 5 vendor that is rapidly

growing

Celebrating

25 Years

Founded in 1992

NetApp acquired @ June 2017

TLV areabased

Recruiting FS Dev.

+1

PM SoftwarePioneer

Since 2014

Ground breaking Latencies

Page 4: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Storage Media Generations

© 2017 NetApp, Inc. All rights reserved4

PM marries the best of both worlds: + StoragePersistency

MemorySpeed

HDD FLASH

IOPS(even if random…)

Latency(even under load…)

NVDIMM / PM

Page 5: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Definitions

Rounded latency numbers & under typical load

5

SCM (Storage Class Memory)

Byte-addressable Media@ Near-memory speed

<1us

5 © 2017 NetApp, Inc. All rights reserved.

PM (Persistent Memory)

Byte-addressable Device@ Near-memory speed, on memory bus

Page 6: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

PM-based Storage - Question Traditional Assumptions

Byte-addressable media

Block-addressable wrapper

SW layers Network SW caching Block abstraction

?

66 © 2017 NetApp, Inc. All rights reserved.

Memory Vs. Storage

Page 7: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Block wrapper

PM-based FS

Application

Block-based FS

Page Cache

bio

PM-based Software Approaches

Application Re-written Application

NPM

DAX-enabled FS

SW reuse Performance

App

SWInfrastructure

HW

Page 8: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Linux Kernel Enablers

“-o dax”

Built in Kernel driver nd_btt.ko. Source: drivers/nvdimm/btt.c

Built in Kernel driver nd_pmem.ko. Source: drivers/nvdimm/pmem.c

Built in Kernel driver core.ko. Source: drivers/nvdimm/core.c

Linux 4.1+ subsystems added support of NVDIMMs. Mostly stable from 4.4

NVDIMM modules presented as device links: /dev/pmem0, /dev/pmem1

QEMO support

BTT (Block, Atomic)

PMEM (Direct Access)

DAX Enabled FS

NFIT Core

8Can also refer to kTLV Meetup from 2016 - https://www.youtube.com/watch?v=FVrgt9JtcwQ

Page 9: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Block wrapper

PM-based FS

Applications

DAX-enabled FS

Storage semantics

PM-based Software Approaches

Memorysemantics

Block-based FS

Page Cache

bio

NPM

Mmap, ld/st, msyncRead/write, fsync

NVDIMM Driver

Page 10: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Examples:

Block wrapper

PM-based FS

Applications

DAX-enabled FS

Storage semantics

Memorysemantics

Block-based FS

Page Cache

bio

NPM

NTFS-DAXREFS-DAX

XFS-DAXEXT4-DAX

NOVALUMFS SIMFSHINFS

Plexistor M1FS

NVDIMM Driver

Examples:

Windows server 2016Linux 4.4 and aboveUbuntu 16.04 LTSRHEL 7.3Fedora 23SLES 12 SP2

Examples:

NVML 1.3

(*) Huge variance in features and stability(*) Good portability

PM-based Software Approaches

Page 11: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

11 1111© 2017 NetApp, Inc. All rights reserved.

Page 12: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Part IIIZUFS - Zero-copy User-mode FS

New style user-mode filesystems that require: - Extremely Low-Latency- Synchronous & DAX

© 2017 NetApp, Inc. All rights reserved12

KernelTLV MeetupNov. 14th 2017

Page 13: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

From VFS to Zufs

© 2017 NetApp, Inc. All rights reserved13

Page 14: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Why Userspace?

• Resiliency

• Ease of development

• Externals (e.g. compress, encrypt)

• Licensing

• Market requirements (avoid kernel modules)

© 2017 NetApp, Inc. All rights reserved14

Page 15: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

ZUFS and FUSE are complementary tools

© 2017 NetApp, Inc. All rights reserved15

Page 16: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

MotivationFuSE is great for HDDs and ok(ish) for SSDs, but not suitable for PMEM

© 2017 NetApp, Inc. All rights reserved.16

FlashHDD Memory

FUSE

SCM

?RDMATCP

Latency$/GB

FUSE ZUFS

Typical medias Built for HDDs & extended to Flash Built for PM/NVDIMMs and DRAM

SW Perf. goals • Secondary (High latency media)• Async I/O Throughput

• SW is the bottleneck • Latency is everything

SW caching Slow media ->Rely on OS Page Cache

Near-memory speed media ->Bypass OS Page Cache

Access method I/O only I/O and mmap (DAX)

Cost of redundant copy / context switch

Negligible The bottleneck ->Avoid copies, queues & remain on core

Latency penalty under load

100s of µs 3-4 µs

De

sig

n A

ss

um

pti

on

s

Page 17: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Zufs Overview

Core 1

Core 2

Core 3

Core 4

© 2017 NetApp, Inc. All rights reserved18

Page 18: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Kernel to Userspace

© 2017 NetApp, Inc. All rights reserved19

Page 19: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

ZUFS – Kernel Zoom-in

© 2017 NetApp, Inc. All rights reserved20

KernelTLV MeetupNov. 14th 2017

Page 20: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Preliminary Results FUSE Vs. ZUFS (PM Media)

© 2017 NetApp, Inc. All rights reserved.21

• Measured on Dual socket Intel XEON 2650v4 (48 HW Threads)DRAM-backed PMEM type

• Random 4KB DirectIO writ(ish) access

Page 21: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Architecture

© 2017 NetApp, Inc. All rights reserved.22

APP

zt-vma

PPP

App pages Mapped into Server VM

Unmapped on return

ZUSZu Server

ZUFZu Feeder

zt per cpu ...

kernel

Page 22: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

ZT - ZUFS Thread per CPU, affinity on a single CPU (thread_fifo/rr)

Special ZUFS communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)

ZT-vma - Mmap 4M vma zero copy communication area per ZT

IOCTL_ZU_WAIT_OPT – threads sleeps in Kernel waiting for an operation

On App IO current CPU ZT is selected, app pages mapped into ZT-vma. Server thread released with an operation

After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released, Server wait for new operation.

On exit (or server crash) file is closed, Kernel cleans all resources

Async operation is also supported. Server can return EAGAIN.

Server will later complete the operation ASYNC. App will be woken up.

Application mmap (DAX) is the opposite direction. ZUS exposes pages (opt_get_data_block) into the app VM

© 2017 NetApp, Inc. All rights reserved. 23

Architecture

Page 23: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Perf. Optimizations - 1 MMAP_LOCAL_CPU

© 2017 NetApp, Inc. All rights reserved24

• mm patch to allow single-core TLB invalidate (in the common case)

0

5

10

15

20

25

30

- 200,000 400,000 600,000 800,000 1,000,000 1,200,000 1,400,000 1,600,000 1,800,000 2,000,000

Late

ncy

[us]

IOPS

ZUFS w/wo mm patch

ZUFS_unpatched_mm

ZUFS_patched_mm

Page 24: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

Perf. Optimizations - 2

© 2017 NetApp, Inc. All rights reserved. 25

• scheduler patch to allow efficient context switch on same core (Relay Object)

UnimplementedNo Perf. Results

Page 25: Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User Space

© 2017 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---26

Questions