Managing NVM in The Machine
Rocky Craig, Master Linux Technologist
Linux Foundation Vault 2016
The Machine Project from Hewlett Packard Enterprise
[Diagram: a massive SoC pool and a massive memory pool connected by a photonic fabric]
"The Machine: A New Kind of Computer" – http://www.labs.hpe.com/research/themachine/
Memory-Centric Computing: “No IO” from NVM persistence
// Give me some space in a way I can find it again tomorrow
int *vaddr = TheMachineVoodoo(...., identifier, ...., size, ....);
// Use it
*vaddr = 42;
// Don't lose it
exit(0);
The NVM Fabric of The Machine
[Diagram: SoCs with local DRAM reach a shared pool of NVM modules through Fabric Bridges and a Fabric Switch]
Hardware Point of View for Fabric-Attached Memory (FAM)
– Basic unit of SoC HW memory access is still the page
  – Looks like DRAM, smells like DRAM...
  – But it's not identified as DRAM
– Basic unit of NVM access granularity is the 8 GB "book"
  – A collection of pages
  – 4 TB per node == 512 books; goal of 80 nodes
– Memory-mapping operations provide direct load/store access
  – FAM on the same node as the SoC doing the load/store is cache-coherent
  – FAM on a different node is not cache-coherent
Hardware Platform Basics
[Diagram: Node 1 ... Node N, each running Linux on an SoC with a Fabric Bridge and local NVM, joined by Fabric Switches into a single load/store domain]
[Diagram: within that domain, each node pairs an SoC and Fabric Bridge with 256 GB of DRAM and 1-4 TB of Fabric-Attached Memory]
TheMachineVoodoo(): rough consensus and running code
• Provide a new file system for FAM allocation
• File system daemon
  – Runs on each node
  – File system API under a mount point, typically "/lfs"
  – Communicates with the metadata server over SoC Ethernet
  – Provides access to FAM books for applications on the SoC
• Librarian
  – Runs on the Top of Rack Management Server (ToRMS)
  – FS metadata ("shelves" and attributes) managed in an SQL database
  – Never sees actual book contents in FAM
Memory-Centric Computing under LFS
fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
ftruncate(fd, 10 * TB);
int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
*vaddr = 42;
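Expanded into a self-contained sketch of the same call sequence (the TB constant, headers, error checks, and the final msync() are additions here, not from the slide; it assumes an LFS or emulation mount at /lfs):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define TB (1UL << 40)

int main(void)
{
    /* Create (or reopen) a shelf under the LFS mount point. */
    int fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }

    /* Size the shelf; the Librarian backs it with whole books. */
    if (ftruncate(fd, 10 * TB) != 0) { perror("ftruncate"); return 1; }

    /* Map it and use plain loads and stores -- no read()/write(). */
    int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (vaddr == MAP_FAILED) { perror("mmap"); return 1; }

    *vaddr = 42;                       /* lands in fabric-attached NVM */

    msync(vaddr, 4096, MS_SYNC);       /* conservative flush; see libpmem later */
    munmap(vaddr, 10 * TB);
    close(fd);
    return 0;
}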
Possible usage pattern
● open(.....)
● truncate(1 or 2 books)
● mmap() and use “briefly”
● read() or write() mixed in
● truncate(up or down) a lot
● close()
● copy it, unlink it, save it for later...
● open(....)
● truncate(1 or 2 books)
● lather rinse repeat especially across SoCs
Expected use patterns
● open(.....)
● truncate(1 or 2 books)
● mmap() and use “briefly”
● read() or write() mixed in
● close()
● unlink()
● open(....)
● truncate(1 or 2 books)
● lather rinse repeat
● open()
● truncate(thousands of books)
● mmap() sections across many cores/SoCs
● Run until solution convergence
● Sporadically, truncate(increase size)
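A rough sketch of that second expected pattern (one huge shelf, mapped in sections across many cores/SoCs, run to convergence). The shelf name, sizes, rank/convergence helpers, and single-threaded loop below are illustrative assumptions, not taken from the slides:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BOOK  (8UL << 30)               /* 8 GB book, per the slides       */
#define NODES 80                        /* stated goal of 80 nodes         */

/* Placeholder application logic -- purely illustrative. */
static int  my_rank(void)              { return 0; }
static int  converged(const double *s) { return s[0] > 100.0; }
static void do_iteration(double *s)    { s[0] += 1.0; }

int main(void)
{
    size_t shelf_size = 4000 * BOOK;        /* "thousands of books"         */
    size_t section    = shelf_size / NODES; /* 50 books per node            */

    int fd = open("/lfs/solver_state", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); return 1; }
    ftruncate(fd, shelf_size);              /* allocate the whole shelf once */

    /* Each SoC maps only its own section of the shared shelf. */
    double *mine = mmap(NULL, section, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, (off_t)(my_rank() * section));
    if (mine == MAP_FAILED) { perror("mmap"); return 1; }

    while (!converged(mine))
        do_iteration(mine);
    /* Sporadically the job may ftruncate() the shelf larger and then
       mmap() the newly added books; omitted here. */

    munmap(mine, section);
    close(fd);
    return 0;
}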
Implications:
● Solution architectures need re-thinking
● It's not only about persistence
● File-system performance is not critical
NUMA and cache coherency
[Diagram repeated: SoCs with local DRAM reach the shared NVM pool through Fabric Bridges and a Fabric Switch; only node-local FAM accesses are cache-coherent]
LFS POSIX Extended File Attributes
$ touch /lfs/myshelf
$ getfattr -d /lfs/myshelf
getfattr: Removing leading '/' from absolute path names
# file: lfs/myshelf
user.LFS.AllocationPolicy="RandomBooks"
user.LFS.AllocationPolicyList="RandomBooks,LocalNode,Nearest,...."
user.LFS.<other stuff but you get the idea>
$ truncate -s40G /lfs/myshelf
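The same attributes can be driven from a program with the standard xattr calls. A minimal sketch; whether LFS accepts policy changes through setxattr() is an assumption here (the slide only shows getfattr), and the chosen policy "LocalNode" is taken from the AllocationPolicyList above:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
    const char *shelf = "/lfs/myshelf";

    /* Create the shelf if it does not exist yet (like `touch`). */
    int fd = open(shelf, O_CREAT | O_RDWR, 0666);
    if (fd >= 0) close(fd);

    /* Pick an allocation policy before sizing the shelf (assumed to work). */
    if (setxattr(shelf, "user.LFS.AllocationPolicy",
                 "LocalNode", strlen("LocalNode"), 0) != 0)
        perror("setxattr");

    /* Read it back. */
    char policy[64];
    ssize_t n = getxattr(shelf, "user.LFS.AllocationPolicy",
                         policy, sizeof(policy) - 1);
    if (n >= 0)
        printf("AllocationPolicy = %.*s\n", (int)n, policy);

    /* Allocate 40 GB of books under that policy (like `truncate -s40G`). */
    truncate(shelf, 40L * 1024 * 1024 * 1024);
    return 0;
}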
Librarian and Librarian File System
[Diagram: on each SoC, a process makes FS API system calls that pass through the VFS and fuse.ko (/dev/fuse) to libfuse.so, fuse.py, and lfs_fuse.py in user space; lfs_fuse.py talks over Ethernet to librarian.py and its SQL database on the ToRMS, which manages the books and shelves exposed as files under /lfs]
The database is initialized with the book layout and topology of all nodes / enclosures / racks. During runtime it tracks shelves, usage, and attributes.
Where's the beef?
Oh, this one again
[Diagram repeated: Node 1 ... Node N, each with Linux on an SoC, a Fabric Bridge, and NVM, joined by Fabric Switches]
Developing without hardware
Emulated sharing
[Diagram: Encapsulation 1 ... Encapsulation N, each running lfs_fuse.py over its own slice of physical memory, connected over a LAN to librarian.py]
Early LFS development: self-hosted
[Diagram: the same single-node user/kernel stack (myprocess → VFS → fuse.ko → libfuse.so → fuse.py → lfs_fuse.py), but librarian.py, its SQL database, and lfs_fuse.py all run on one machine and talk over localhost]
Shadow file
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs1 1
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs2 2
  ... (one mount per emulated node)
$ vi smalltm.ini    # node count, book size, book total
$ create_db.py smalltm.ini smalltm.db
$ librarian.py .... --db_file=smalltm.db
$ truncate -s 16G /tmp/GlobalNVM
Address Translations
[Diagram: ARM cores translate VA (48b, 256 TB) to PA (44-48b, 16-256 TB) through their caches and a coherent interconnect shared with PCI etc.; the Fabric Bridge holds ~1900 PA → LA book descriptors giving 14.9 TB of apertures (worst case), behind a book firewall and fabric requester; node-local DRAM is at most 1 TB; "book space" is 53b (8 PB) and total fabric space is 75b (32 ZB)]
Page and book faults

fd = open("/lfs/bigone"...
ftruncate(fd, 20 * TB);
int *vaddr = mmap(... fd, ...);
*vaddr = 42;
open(): passthrough to lfs_fuse::open()
● lfs_fuse converses with the Librarian – creates a new shelf
● lfs_fuse returns a file descriptor to VFS
ftruncate(): passthrough to lfs_fuse::ftruncate()
● Requests keyed on fd
● lfs_fuse converses with the Librarian – allocates books (LA)
mmap(): stays in the kernel (FuSE hook)
● Allocates a VMA
● LFS changes: set up caching structures to assist faulting
First store: starts in the kernel LFS page fault handler
● If this is the first fault in a book:
  ● getxattr() is overloaded into lfs_fuse
  ● lfs_fuse converses with the Librarian – gets the book's LA info
  ● Kernel caches the book LA
● Get the book LA info from the cache
● Select and program an unused descriptor
● Map with vm_insert_pfn()
Librarian File System – Data in FAM
[Diagram: the same per-node stack, now with tm-fuse.ko, tm-fuse.py, and tm-libfuse.so in place of the stock FuSE components and a Fabric Bridge FPGA below the SoC; lfs_fuse.py still talks over Ethernet to librarian.py and its SQL database on the ToRMS; data now lives in FAM hardware]
Descriptors are in short supply

*(vaddr + 1G) = 43;
(touch enough space to use all descriptors)
*onetoomany = 43;
Start in the kernel LFS page fault handler
● If this is the first fault in a book:
  ● the getxattr hook is overloaded into lfs_fuse
  ● lfs_fuse converses with the Librarian – gets the book's LA info
  ● Kernel caches the book LA
● Get the book LA info from the cache
● Reuse a previous descriptor/aperture as the address base
● Map with vm_insert_pfn()
Lather, rinse, repeat...
Need to reclaim a descriptor
● Select an LRU candidate
● For all VMAs mapped into that descriptor (book LA):
  ● Flush caches
  ● zap_vma_ptes()
● Reprogram the selected descriptor with the new LA, vm_insert_pfn()
LFS & Driver Development on QEMU and IVSHMEM
IVSHMEM as Global FAM
[Diagram: several QEMU guests act as nodes, each running lfs_fuse.py with its own aperture regions*; a modified Nahanni server manages the host file used as the backing store for the shared "Global FAM"; librarian.py coordinates the guests]
* Guest-private IVSHMEM regions emulate bridge resource space
Platforms and environments
[Diagram: three environments – Fabric-Attached Memory Emulation (Develop), The Machine Architectural Simulator (Validate), and The Machine itself – each running the same stack: Application, New APIs / POSIX APIs, LFS, Librarian and drivers, over firmware and hardware]
libpmem
– Part of http://pmem.io/nvml/
– API for controlling data persistence
  – Flushing SoC caches
  – Clearing memory controller buffers
– Accelerated APIs for persistent data movement
  – Non-temporal copies
  – Bypass SoC caches
– Additions for The Machine
  – APIs for invalidating SoC caches
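A hedged sketch of how libpmem's persistence and data-movement calls fit the mmap() pattern used throughout this deck. pmem_persist(), pmem_memcpy_nodrain(), and pmem_drain() are standard NVML/libpmem calls; the file name, sizes, and flush granularity are illustrative (build with -lpmem):

#include <fcntl.h>
#include <libpmem.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1UL << 30;                      /* 1 GB for the example */
    int fd = open("/lfs/persistent_log", O_CREAT | O_RDWR, 0666);
    ftruncate(fd, len);

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Store through the mapping, then push it out of the SoC caches and
       memory-controller buffers so it is truly persistent. */
    strcpy(base, "committed record");
    pmem_persist(base, strlen(base) + 1);

    /* Bulk movement with non-temporal copies that bypass the SoC caches;
       a final drain makes the whole copy persistent. */
    pmem_memcpy_nodrain(base + 4096, base, 4096);
    pmem_drain();

    munmap(base, len);
    close(fd);
    return 0;
}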
Fabric-Attached Memory Atomics
– Native SoC atomic instructions are cache-dependent
  – Do not work between nodes
– Bridge and switch hardware includes fabric-native atomic operations
– Proprietary fam-atomic library provides the API
  – Atomic read/write, compare/exchange, add, bitwise and/or
  – Cross-node spin locks
  – Depends on LFS for VA → PA → FA translations
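The fam-atomic API is not spelled out in these slides, so the declarations below (fam_atomic_register_region, fam_atomic_64_fetch_and_add, fam_spin_lock/unlock) are assumed names sketching its likely shape, not its documented interface. The point is the usage pattern: the operands live inside an LFS-backed mapping, which is how the library can derive the fabric address for each operation:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed fam-atomic declarations -- hypothetical names, see above. */
extern int     fam_atomic_register_region(void *region, size_t len,
                                          int fd, off_t offset);
extern int64_t fam_atomic_64_fetch_and_add(int64_t *addr, int64_t inc);
extern void    fam_spin_lock(void *lock);
extern void    fam_spin_unlock(void *lock);

int main(void)
{
    size_t len = 1UL << 30;
    int fd = open("/lfs/shared_counters", O_CREAT | O_RDWR, 0666);
    ftruncate(fd, len);
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Tell the library about the mapping so it can do VA -> PA -> FA. */
    fam_atomic_register_region(base, len, fd, 0);

    /* A counter and a lock that are atomic across nodes, not just cores. */
    int64_t *counter = (int64_t *)base;
    void    *lock    = base + 64;

    int64_t old = fam_atomic_64_fetch_and_add(counter, 1);
    printf("this node saw count %lld\n", (long long)(old + 1));

    fam_spin_lock(lock);
    /* ... cross-node critical section ... */
    fam_spin_unlock(lock);

    munmap(base, len);
    close(fd);
    return 0;
}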
LFS native block devices
– Legacy applications or frameworks that need a block device
  – File-system dependent (ext4)
  – Ceph
– Triggered via mknod
– Simplifications for proof of concept
  – Plagiarize drivers/nvdimm/pmem.c
  – Avoid cache complications: node-local only
  – Lock the descriptors
The Future
● Short-term
  – Full integration into the management infrastructure of The Machine
  – Frameworks / middleware / demos / applications / stress testing
  – Optimizations (e.g., huge pages)
  – Learn, learn, learn
● And beyond
  – More capable or specialized SoCs
  – Deeper integration of the fabric
  – Enablement of NVM technologies at production scale
  – Harden proven software (e.g., replace FuSE with a "real" file system)
  – True concurrent file system
  – Eliminate the separate ToRMS server
  – ???????
Open Source
– Yes, we're going to release all the system software
  – Librarian, LFS, kernel modules
– Started with FAM Emulation in December 2015
  – http://github.com/FabricAttachedMemory
  – x86 and Debian Jessie
  – "Platform" only
How Fast Is The Field of Dreams?
Thank you
Rocky Craig
<first.last>@hpe.com