Managing NVM in The Machine Rocky Craig, Master Linux Technologist Linux Foundation Vault 2016


Page 1: Managing NVM in The Machine - events.static.linuxfound.org · –Basic unit of NVM access granularity is the 8 GB “book” – A collection of pages – 4T per node == 512 books,

Managing NVM in The Machine

Rocky Craig, Master Linux Technologist
Linux Foundation Vault 2016

Page 2

The Machine Project from Hewlett Packard Enterprise

Massive SoC pool
Massive memory pool
Photonic fabric

http://www.labs.hpe.com/research/themachine/
“The Machine: A New Kind of Computer”

Page 3

Memory-Centric Computing: “No IO” from NVM persistence

// Give me some space in a way I can find it again tomorrow

int *vaddr = TheMachineVoodoo(...., identifier, ….., size, ….);

// Use it

*vaddr = 42;

// Don't lose it

exit(0);

Page 4

The NVM Fabric of The Machine

[Diagram: four SoCs, each with its own DRAM and a Fabric Bridge; a pool of NVM modules; a central Fabric Switch.]

Page 5

Hardware Point of View for Fabric-Attached Memory (FAM)

– Basic unit of SoC HW memory access is still the page
    – Looks like DRAM, smells like DRAM...
    – But it's not identified as DRAM

– Basic unit of NVM access granularity is the 8 GB “book”
    – A collection of pages
    – 4 TB per node == 512 books, goal of 80 nodes

– Memory-mapping operations provide direct load/store access
    – FAM on the same node as the SoC doing load/store is cache-coherent
    – FAM on a different node is not cache-coherent
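The book arithmetic on this slide can be checked with a few lines of C. This is only a sketch: the constants come from the slide, while the helper names (`books_per_node`, `book_index`, and so on) are made up for illustration and are not part of any LFS API.

```c
#include <stdint.h>

/* Figures from the slide: 8 GB books, 4 TB of FAM per node, goal of 80 nodes. */
#define BOOK_SIZE (8ULL << 30)
#define NODE_FAM  (4ULL << 40)
#define NODE_GOAL 80ULL

/* Hypothetical helpers, named here only for illustration. */
uint64_t books_per_node(void) { return NODE_FAM / BOOK_SIZE; }
uint64_t books_total(void)    { return books_per_node() * NODE_GOAL; }

/* Which book a byte offset falls in, and where inside that book. */
uint64_t book_index(uint64_t offset)  { return offset / BOOK_SIZE; }
uint64_t book_offset(uint64_t offset) { return offset % BOOK_SIZE; }

/* Allocation is per book, so a size is rounded up to whole books. */
uint64_t books_needed(uint64_t nbytes)
{
    return (nbytes + BOOK_SIZE - 1) / BOOK_SIZE;
}
```

With these numbers a node holds 512 books, 80 nodes hold 40,960 books (320 TB), and a 40 GB request occupies five books.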

Page 6

Hardware Platform Basics

[Diagram: Node 1, Node 2, ... Node N; each node runs Linux on an SoC with a Fabric Bridge and NVM; the nodes are joined by Fabric Switches.]

Page 7

Single Load/Store Domain

[Diagram: several nodes, each with an SoC, a Fabric Bridge, 256 GB DRAM, and 1-4 TB of Fabric-Attached Memory.]

Page 8

TheMachineVoodoo(): rough consensus and running code

• Provide a new file system for FAM allocation

• File system daemon
    – Runs on each node
    – File system API under a mount point, typically “/lfs”
    – Communicates with the metadata server over SoC Ethernet
    – Provides access to FAM books for applications on the SoC

• Librarian
    – Runs on the Top of Rack Management Server (ToRMS)
    – FS metadata (“shelves” and attributes) managed in an SQL database
    – Never sees actual book contents in FAM

Page 9

Memory-Centric Computing under LFS

// TB is assumed to be defined as (1ULL << 40)

int fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);

ftruncate(fd, 10 * TB);

int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

*vaddr = 42;
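A fuller version of the same sequence, with error checking, might look like the sketch below. The function names `store_value` and `load_value` are hypothetical. The path and size are parameters so the code can be tried on any file system with a small file; nothing in it is specific to LFS.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or reuse) a file of `size` bytes, map it, and store `value`
 * in its first int. Returns 0 on success, -1 on any failure. */
int store_value(const char *path, size_t size, int value)
{
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    int *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (vaddr == MAP_FAILED)
        return -1;
    *vaddr = value;             /* plain store, no write() call */
    return munmap(vaddr, size);
}

/* Map the same file again and read the first int back into *out. */
int load_value(const char *path, size_t size, int *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int *vaddr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (vaddr == MAP_FAILED)
        return -1;
    *out = *vaddr;
    return munmap(vaddr, size);
}
```

Calling `store_value()` and then `load_value()` on the same path, even from a different process, is the "find it again tomorrow" behavior the earlier pseudocode asks for.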

Page 10

Possible usage pattern

● open(.....)
● truncate(1 or 2 books)
● mmap() and use “briefly”
● read() or write() mixed in
● truncate(up or down) a lot
● close()
● copy it, unlink it, save it for later...
● open(....)
● truncate(1 or 2 books)
● lather, rinse, repeat, especially across SoCs

Page 11

Expected use patterns

● open(.....)
● truncate(1 or 2 books)
● mmap() and use “briefly”
● read() or write() mixed in
● close()
● unlink()
● open(....)
● truncate(1 or 2 books)
● lather, rinse, repeat

● open()
● truncate(thousands of books)
● mmap() sections across many cores/SoCs
● Run until solution convergence
● Sporadically, truncate(increase size)

Implications:

● Solution architectures need re-thinking
● It's not only about persistence
● File-system performance is not critical

Page 12

NUMA and cache coherency

[Diagram: four SoCs, each with its own DRAM and a Fabric Bridge; a pool of NVM modules; a central Fabric Switch.]

Page 13

LFS POSIX Extended File Attributes

$ touch /lfs/myshelf

$ getfattr -d /lfs/myshelf

getfattr: Removing leading '/' from absolute path names

# file: lfs/myshelf

user.LFS.AllocationPolicy="RandomBooks"

user.LFS.AllocationPolicyList="RandomBooks,LocalNode,Nearest,...."

user.LFS.<other stuff but you get the idea>

$ truncate -s40G /lfs/myshelf
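The same attributes can be reached from a program through the Linux `getxattr(2)` and `setxattr(2)` calls. The sketch below is hedged in two ways: the helper names are hypothetical, and it assumes that writing `user.LFS.AllocationPolicy` is how a policy is selected, which the slide implies but does not state.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Read one extended attribute into buf as a NUL-terminated string.
 * Returns the value length, or -1 if the attribute or file is missing,
 * xattrs are unsupported, or buf is too small. */
ssize_t read_xattr(const char *path, const char *name, char *buf, size_t buflen)
{
    if (buflen == 0)
        return -1;
    ssize_t n = getxattr(path, name, buf, buflen - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}

/* Assumption: the policy is changed by writing the attribute shown on the
 * slide. Returns 0 on success, -1 on error. */
int set_policy(const char *shelf, const char *policy)
{
    return setxattr(shelf, "user.LFS.AllocationPolicy",
                    policy, strlen(policy), 0);
}
```

On a file system without these attributes, `read_xattr()` simply returns -1, so the helper can be tried safely outside LFS.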

Page 14

Librarian and Librarian File System

[Diagram: on one SoC, myprocess makes system calls on files under /lfs; requests pass through VFS and fuse.ko in the kernel, then /dev/fuse, libfuse.so, fuse.py, and lfs_fuse.py in user space. lfs_fuse.py talks over Ethernet to librarian.py on the ToRMS, which keeps books and shelves in an SQL database.]

Database is initialized with book layout and topology of all nodes / enclosures / racks

During runtime it tracks shelves, usage, and attributes

Where's the beef?

Page 15

Oh this one again

[Diagram: Node 1, Node 2, ... Node N; each node runs Linux on an SoC with a Fabric Bridge and NVM; the nodes are joined by Fabric Switches.]

Page 16

Developing without hardware

[Diagram: Encapsulation 1, Encapsulation 2, ... Encapsulation N, each running lfs_fuse.py over its own physical memory, with emulated sharing between them; all connect to librarian.py over the LAN.]

Page 17

Early LFS development: self-hosted

[Diagram: myprocess makes system calls through VFS and fuse.ko in the kernel, then /dev/fuse, libfuse.so, fuse.py, and lfs_fuse.py in user space; librarian.py and its SQL database are reached over localhost; a shadow file stands in for FAM.]

$ vi smalltm.ini # node count, book size, book total

$ create_db.py smalltm.ini smalltm.db

$ librarian.py …. --db_file=smalltm.db

$ truncate -s 16G /tmp/GlobalNVM

$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs1 1

$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs2 2

::
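The shadow file works as emulated FAM because every MAP_SHARED mapping of one file sees the same pages. The sketch below shows only that underlying mechanism. It is not taken from lfs_fuse.py, and the function names are hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Open and map a shadow file the way one emulated node might. */
static int *map_shadow(const char *path, size_t size)
{
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    int *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/* "Node 1" stores a value; "node 2" loads it through its own mapping.
 * Returns the value seen by node 2, or -1 on error. */
int shadow_roundtrip(const char *path, size_t size, int value)
{
    int *node1 = map_shadow(path, size);
    int *node2 = map_shadow(path, size);
    int seen = -1;
    if (node1 && node2) {
        *node1 = value;
        seen = *node2;
    }
    if (node1)
        munmap(node1, size);
    if (node2)
        munmap(node2, size);
    return seen;
}
```

Both mappings live in one process here; separate processes mapping the same file on one host behave the same way.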

Page 18

Address Translations

[Diagram: an SoC with ARM cores (VA -> PA), caches, book firewalls, and fabric requesters on a coherent interconnect, alongside DRAM (max 1T) and PCI, etc.; a Fabric Bridge holding book descriptors.]

VA: 48b (256 TB)

PA: 44 - 48b (16 - 256 TB)

Fabric Bridge: 14.9T of apertures (worst case), ~1900 PA → LA book descriptors

“Book space”: 53b (8 PB)

Fabric space: 75b (32 ZB)
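The sizes quoted on this slide follow directly from the bit widths. The sketch below reproduces them, under the assumption (inferred from the numbers, not stated on the slide) that each descriptor covers one 8 GB book. The helper names are hypothetical.

```c
#include <stdint.h>

#define BOOK_SIZE (8ULL << 30)   /* 8 GB book */

/* Size, in units of 2^40 bytes (TB), of an address space `bits` wide.
 * Valid for 40 <= bits < 104. */
uint64_t space_tb(unsigned bits) { return 1ULL << (bits - 40); }

/* Aperture space in GB covered by `n` descriptors, assuming one book each. */
uint64_t aperture_gb(uint64_t n) { return n * (BOOK_SIZE >> 30); }
```

`space_tb(48)` is 256 and `space_tb(44)` is 16, matching the VA and PA ranges; `space_tb(53)` is 8192 TB, or 8 PB; `space_tb(75)` is 2^35 TB, or 32 ZB. `aperture_gb(1900)` is 15,200 GB, about 14.8 TB, which agrees with the quoted 14.9T for roughly 1900 descriptors.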

Page 19

Page and book faults

fd = open("/lfs/bigone"...

Passthrough to lfs_fuse::open()
● lfs_fuse converses with Librarian – create a new shelf
● lfs_fuse returns a file descriptor for VFS

ftruncate(fd, 20 * TB);

Passthrough to lfs_fuse::ftruncate()
● Requests keyed on fd
● lfs_fuse converses with Librarian – allocate books (LA)

int *vaddr = mmap(... fd, ...);

Stay in kernel (FuSE hook)
● Allocate VMA
● LFS changes: set up caching structures to assist faulting

*vaddr = 42;

Start in kernel LFS page fault handler
● If first fault in a book
    ● Overload getxattr() into lfs_fuse
    ● lfs_fuse converses with Librarian – get book LA info
    ● Kernel caches book LA
● Get book LA info from cache
● Select and program unused descriptor
● map with vm_insert_pfn()
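The "first fault in a book" step is a cache in front of a slow lookup. The user-space sketch below models only that idea: every name in it is hypothetical, `fake_librarian` stands in for the getxattr() round trip to the Librarian, and none of it is the kernel code from the talk.

```c
#include <stdint.h>

#define MAX_BOOKS 64                 /* illustrative limit on shelf size */

typedef uint64_t (*la_query_fn)(uint64_t book);

struct la_cache {
    uint64_t la[MAX_BOOKS];          /* cached book LAs */
    int      valid[MAX_BOOKS];
    unsigned queries;                /* how often the slow path was taken */
};

/* Stand-in for asking the Librarian; returns a made-up LA per book. */
uint64_t fake_librarian(uint64_t book) { return 0x1000 + book; }

/* Return the LA of the book containing `offset`. Only the first fault in
 * a book reaches `query`; later faults are served from the cache.
 * Returns UINT64_MAX if the offset lies beyond MAX_BOOKS books. */
uint64_t book_la(struct la_cache *c, uint64_t offset, uint64_t book_size,
                 la_query_fn query)
{
    uint64_t book = offset / book_size;
    if (book >= MAX_BOOKS)
        return UINT64_MAX;
    if (!c->valid[book]) {
        c->la[book] = query(book);
        c->valid[book] = 1;
        c->queries++;
    }
    return c->la[book];
}
```

Faulting twice inside book 0 and once inside book 1 leaves `queries` at 2, which is why only the first touch of each book pays for the trip to user space.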

Page 20

Librarian File System – Data in FAM

[Diagram: on one node, myprocess makes system calls through VFS and tm-fuse.ko in the kernel, then /dev/fuse, tm-libfuse.so, tm-fuse.py, and lfs_fuse.py in user space; lfs_fuse.py reaches librarian.py and its SQL database on the ToRMS over Ethernet. The hardware layer is the fabric bridge FPGA.]

Page 21

Descriptors are in short supply

*(vaddr + 1G) = 43;

Start in kernel LFS page fault handler
● If first fault in a book
    ● Overload getxattr hook to lfs_fuse
    ● lfs_fuse converses with Librarian – get book LA info
    ● Kernel caches book LA
● Get book LA info from cache
● Reuse previous descriptor/aperture as address base
● map with vm_insert_pfn()

(touch enough space to use all descriptors)

Lather, rinse, repeat

*onetoomany = 43;

Need to reclaim a descriptor
● Select an LRU candidate
● For all VMAs mapped into that descriptor (book LA):
    ● flush caches
    ● zap_vma_ptes()
● Reprogram selected descriptor with LA, vm_insert_pfn()
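The reclaim policy above can be modeled as a small LRU table. This is a user-space sketch with hypothetical names and a four-entry table; the real work of flushing caches and zapping PTEs is reduced to a counter.

```c
#include <stdint.h>

#define NDESC 4                  /* tiny for illustration; the talk cites ~1900 */

struct desc_table {
    uint64_t la[NDESC];          /* book LA programmed into each descriptor */
    uint64_t last_use[NDESC];    /* logical time of the most recent fault */
    int      in_use[NDESC];
    uint64_t clock;
    unsigned reclaims;           /* number of evictions so far */
};

/* Return the descriptor that maps book `la`, reusing an existing one,
 * taking a free one, or reclaiming the least recently used one. */
int desc_for_book(struct desc_table *t, uint64_t la)
{
    int free_slot = -1, lru = 0;
    t->clock++;
    for (int i = 0; i < NDESC; i++) {
        if (t->in_use[i] && t->la[i] == la) {   /* already programmed */
            t->last_use[i] = t->clock;
            return i;
        }
        if (!t->in_use[i] && free_slot < 0)
            free_slot = i;
        if (t->last_use[i] < t->last_use[lru])
            lru = i;
    }
    int slot = free_slot;
    if (slot < 0) {              /* table full: evict the LRU descriptor */
        slot = lru;
        t->reclaims++;           /* the kernel would flush and zap PTEs here */
    }
    t->la[slot] = la;
    t->in_use[slot] = 1;
    t->last_use[slot] = t->clock;
    return slot;
}
```

Touching four books fills the table; touching a fifth evicts whichever book was used least recently, which corresponds to the `*onetoomany = 43` case.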

Page 22

LFS & Driver Development on QEMU and IVSHMEM

[Diagram: several QEMU guests act as nodes, each running lfs_fuse.py with its own apertures (*); IVSHMEM serves as global FAM; a modified Nahanni server manages the file used as backing store; librarian.py serves all guests.]

* Guest-private IVSHMEM regions emulate bridge resource space

Page 23

Platforms and environments

[Diagram: three platforms side by side: Fabric-Attached Memory Emulation (Develop), The Machine Architectural Simulator (Validate), and The Machine. Each stacks Application over New APIs and POSIX APIs, LFS, Librarian, and Drivers; Firmware and Hardware layers appear beneath.]

Page 24

libpmem

– Part of http://pmem.io/nvml/

– API for controlling data persistence
    – Flushing SoC caches
    – Clearing memory controller buffers

– Accelerated APIs for persistent data movement
    – Non-temporal copies
    – Bypass SoC caches

– Additions for The Machine
    – APIs for invalidating SoC caches

Page 25

Fabric-Attached Memory Atomics

– Native SoC atomic instructions are cache-dependent
    – Do not work between nodes

– Bridge and switch hardware includes fabric-native atomic operations

– Proprietary fam-atomic library provides API
    – Atomic read/write, compare/exchange, add, bitwise and/or
    – Cross-node spin locks
    – Depends on LFS for VA → PA → FA translations
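For contrast, here is the node-local case that does work with native atomics: a compare-and-exchange spin lock in standard C11. It is a sketch with hypothetical names and says nothing about the fam-atomic API. Because it relies on the SoC's cache coherence, the same code placed in FAM shared between nodes would not be safe.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A spin lock that is only correct among cores sharing coherent caches. */
typedef struct { atomic_int held; } node_local_lock;

/* Try once to take the lock; true on success. */
bool lock_try(node_local_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

/* Spin until the lock is taken. */
void lock_acquire(node_local_lock *l)
{
    while (!lock_try(l)) {
        /* busy-wait */
    }
}

void lock_release(node_local_lock *l)
{
    atomic_store(&l->held, 0);
}
```

A cross-node lock needs the compare/exchange to be carried out by the fabric hardware rather than in a core's cache, which is the role the slide assigns to the bridge and switch atomics.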

Page 26

LFS native block devices

– Legacy applications or frameworks that need a block device
    – File-system dependent (ext4)
    – Ceph

– Triggered via mknod

– Simplifications for proof-of-concept
    – Plagiarize drivers/nvdimm/pmem.c
    – Avoid cache complications: node-local only
    – Lock the descriptors

Page 27

The Future

● Short-term
    – Full integration into management infrastructure of The Machine
    – Frameworks / Middleware / Demos / Applications / Stress testing
    – Optimizations (e.g., huge pages)
    – Learn, learn, learn

● And beyond
    – More capable or specialized SoCs
    – Deeper integration of fabric
    – Enablement of NVM technologies at production scale
    – Harden proven software (e.g., replace FuSE with a “real” file system)
    – True concurrent file system
    – Eliminate separate ToRMS server
    – ???????

Page 28

Open Source

– Yes, we're going to release all system software
    – Librarian, LFS, kernel modules

– Started with FAM Emulation in December 2015
    – http://github.com/FabricAttachedMemory
    – x86 and Debian Jessie
    – “Platform” only

Page 29

How Fast Is The Field of Dreams?

Page 30

Thank you

Rocky Craig
<first.last>@hpe.com