Non-Volatile Memory for Next Generation I/O
Dr Michèle, [email protected]
Current trends & approaches
Burst buffer
[Diagrams: compute nodes connected to an external filesystem over a high performance network; with a burst buffer, a burst filesystem sits between the compute nodes and the external filesystem.]
Moving beyond burst buffer
• Non-volatile storage is coming to the node rather than to the filesystem
• Argonne's Theta machine has a 128GB SSD in each compute node
[Diagram: compute nodes, now with node-local non-volatile storage, connected to the external filesystem over a high performance network.]
Non-volatile memory
• Non-volatile RAM
  • 3D XPoint technology is one example
• Much larger capacity than DRAM
• Hosted in the DRAM slots (DIMM form factor), controlled by a standard memory controller
• Slower than DRAM by a small factor, but significantly faster than SSDs
[Diagram: memory hierarchies. Today: Cache → Memory → Storage. With NVRAM: Cache → Memory → NVRAM → Fast Storage → Slow Storage.]
The NEXTGenIO approach
NEXTGenIO key facts
• FETHPC Research & Innovation Action
• Active for 18 months so far
• 8 partners, covering:
  • Hardware
  • HPC centres and users
  • Software
  • Tools developers
Our objectives
• Hardware platform prototype
  Ø Demonstrating the prototype's broad applicability for both HPC and data-centric applications
• Exascale I/O investigation
  Ø Understanding how best to exploit NVRAM
• Systemware development
  Ø Producing the necessary software to enable (Exascale) application execution on the hardware platform
• Application co-design
  Ø Understanding individual application I/O profiles and typical I/O workloads on shared systems running multiple different applications
Systemware
• System software must understand the extra level present in the memory hierarchy
• Work on adapting the job scheduler (SLURM)
• Development of a data scheduler (see the sketch below)
• Object stores as alternatives to file systems:
  • DAOS (Distributed Application Object Storage)
  • dataClay
• Multi-node NVRAM file system:
  • echoFS
☛ Key goal: the platform must be usable “as is” by legacy applications
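To make the scheduling discussion concrete, here is a toy Python sketch of the extra per-job information a data scheduler would need alongside what SLURM already tracks. The field names and structure are illustrative assumptions, not the project's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class JobRequest:
    """Hypothetical job description extended with NVRAM requirements."""
    nodes: int                                     # compute nodes requested
    walltime_s: int                                # wall-clock limit in seconds
    nvram_gb_per_node: int                         # NVRAM capacity to reserve per node
    preload: list = field(default_factory=list)    # files to stage into NVRAM
    postmove: list = field(default_factory=list)   # results to drain back out

# A job asking for 64 nodes, each with 100GB of NVRAM reserved, with
# one input staged in beforehand and results drained out afterwards.
req = JobRequest(nodes=64, walltime_s=3600, nvram_gb_per_node=100,
                 preload=["/lustre/in/mesh.h5"],
                 postmove=["out/result.h5"])
```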
Workloads & I/O
• Try to understand how different I/O behaviours and scheduling policies will impact job throughput
• Three different workloads:
  • Generic → EPCC
  • Special purpose → ECMWF
  • Commercial → Arctur
• I/O Workload Simulator (see the sketch below):
  • Create a benchmark of synthetic jobs generated from real workloads → to be deployed on an HPC system
  • Create a simulator of the workload schedule to test the impact of policies and I/O performance → to be deployed on a laptop
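As a purely illustrative sketch (not the project's actual simulator), the following Python fragment shows the shape of such a workload simulation: synthetic jobs with compute and I/O phases, replayed under different assumed I/O bandwidths.

```python
import random

class Job:
    """Synthetic job: a compute phase plus a volume of I/O."""
    def __init__(self, jid, compute_s, io_gb):
        self.jid = jid
        self.compute_s = compute_s   # seconds of pure compute
        self.io_gb = io_gb           # total data read/written

def makespan(jobs, io_bandwidth_gbs):
    """FIFO schedule in which all I/O is serialised through one shared
    resource of the given bandwidth (a deliberate oversimplification
    of a real parallel filesystem)."""
    clock = 0.0
    for job in jobs:
        clock += job.compute_s + job.io_gb / io_bandwidth_gbs
    return clock

random.seed(0)
jobs = [Job(i, random.uniform(60, 600), random.uniform(1, 100))
        for i in range(20)]

# Compare an external filesystem against faster node-local NVRAM.
print("filesystem makespan:", makespan(jobs, io_bandwidth_gbs=5.0))
print("NVRAM makespan:     ", makespan(jobs, io_bandwidth_gbs=50.0))
```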
ARCHER workload
[Charts, two slides: the ARCHER I/O workload broken down into metadata, write and read operations.]
Tools support
• Profiling and debugging tools need to understand the implications of an additional (and potentially persistent) memory layer
Applications
• Traditional HPC:
  • OpenFOAM → CFD
  • CASTEP → chemistry
  • IFS → weather forecasting
  • MONC → cloud modelling
• Novel uses:
  • OSPRay → ray tracing, rendering engine
  • Halvade → genome sequencing
  • Tiramisu → deep learning (based on Caffe)
  • K-means → machine learning
Usage models
NVRAM usage models
• The “memory” usage model allows for the extension of main memory
  • The data is volatile, like normal DRAM-based main memory
• The “storage” usage model supports the use of NVRAM like a classic block device
  • E.g. like a very fast SSD
• The “application direct” usage model maps persistent storage from the NVRAM directly into the main memory address space (see the sketch below)
  • Direct CPU load/store instructions access persistent main memory regions
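A minimal Python sketch of the application-direct idea, using a memory-mapped file. The path is a hypothetical NVRAM-backed mount, and a real deployment would use a DAX filesystem and a persistent-memory library such as PMDK rather than plain mmap.

```python
import mmap
import os

PATH = "/mnt/nvram/state.bin"   # hypothetical NVRAM-backed mount point
SIZE = 4096

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, SIZE)

# Map the persistent region into the address space; accesses to
# `buf` are ordinary CPU loads and stores on the mapping.
buf = mmap.mmap(fd, SIZE)
buf[0:11] = b"hello nvram"      # a store into persistent memory
buf.flush()                     # force the stores to be durable
buf.close()
os.close(fd)
```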
Exploiting distributed storage
[Diagrams: (1) compute nodes with memory only, connected over the network to an external filesystem; (2) every node additionally has local NVRAM; (3) a mixed system where only some nodes have NVRAM.]
Using distributed storage
• Without changing applications:
  • Large memory space, in-memory database, etc.
  • Local filesystem
    Ø Users manage data themselves
    Ø No global data access/namespace; large number of files
    Ø Still require the global filesystem for persistence
[Diagram: each node exposes its NVRAM as a node-local /tmp filesystem; the external filesystem remains reachable over the network.]
Using distributed storage
• Without changing applications:
  • Filesystem buffer (see the sketch below)
    Ø Pre-load data into NVRAM from the filesystem
    Ø Use NVRAM for I/O, and write data back to the filesystem at the end
    Ø Requires systemware to preload and post-move data
    Ø Uses the filesystem as the namespace manager
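A sketch of the preload/post-move staging in Python; the Lustre and NVRAM paths are illustrative assumptions, and in practice the systemware would do this on the job's behalf.

```python
import shutil
from pathlib import Path

GLOBAL_IN = Path("/lustre/project/input")    # hypothetical paths
GLOBAL_OUT = Path("/lustre/project/output")
BUFFER = Path("/mnt/nvram/buffer")           # node-local NVRAM buffer

def preload(files):
    """Stage input files from the global filesystem into NVRAM."""
    BUFFER.mkdir(parents=True, exist_ok=True)
    for name in files:
        shutil.copy2(GLOBAL_IN / name, BUFFER / name)

def postmove(files):
    """Drain results from NVRAM back to the global filesystem."""
    GLOBAL_OUT.mkdir(parents=True, exist_ok=True)
    for name in files:
        shutil.copy2(BUFFER / name, GLOBAL_OUT / name)

# The unmodified application then performs all of its I/O against
# BUFFER, e.g. by being started with its working directory there.
```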
[Diagram: each node's NVRAM acting as a filesystem buffer in front of the external filesystem.]
Using distributed storage
• Without changing applications:
  • Global filesystem
    Ø Requires functionality to create and tear down global filesystems for individual jobs
    Ø Requires a filesystem that works across nodes
    Ø Requires functionality to preload and post-move filesystems
    Ø Needs to support multiple filesystems across the system
[Diagram: the NVRAM of a job's nodes combined into a job-wide global filesystem, alongside the external filesystem.]
Using distributed storage
• With changes to applications:
  • Object store (see the sketch below)
    Ø Needs the same functionality as a global filesystem
    Ø Removes the need for POSIX, or POSIX-like, functionality
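A toy Python illustration of the programming model an application would target instead of POSIX files: named objects with put/get, and no paths, directories or seek(). This is not the API of DAOS or dataClay, just the general shape of the model.

```python
class ObjectStore:
    """Toy object store; a dict stands in for distributed NVRAM."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, value: bytes) -> None:
        self._objects[key] = value

    def get(self, key: str) -> bytes:
        return self._objects[key]

store = ObjectStore()
# Objects are addressed by name, not by a filesystem path.
store.put("simulation/step-0042/field-u", b"...raw array bytes...")
data = store.get("simulation/step-0042/field-u")
```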
[Diagram: the NVRAM of a job's nodes combined into a job-wide object store, alongside the external filesystem.]
Using distributed storage
• Without changing applications:
  • Automatic check-pointing (see the sketch below)
    • Resiliency: local check-pointing without hitting the filesystem
    • Pause and restart: just-in-time scheduling, high-priority jobs, or waiting for something else to happen…
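A minimal checkpoint/restore sketch in Python, writing to an assumed node-local NVRAM path; a production system would checkpoint transparently, without application changes.

```python
import os
import pickle

CKPT = "/mnt/nvram/ckpt.bin"    # hypothetical node-local NVRAM path

def checkpoint(state: dict) -> None:
    """Serialise application state to node-local NVRAM, never
    touching the shared filesystem."""
    with open(CKPT, "wb") as f:
        f.write(pickle.dumps(state))
        f.flush()
        os.fsync(f.fileno())    # make the checkpoint durable

def restore() -> dict:
    with open(CKPT, "rb") as f:
        return pickle.loads(f.read())

checkpoint({"step": 42, "temperature": 1.5})
print(restore())                # {'step': 42, 'temperature': 1.5}
```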
Using distributed storage
• New usage models:
  • Resident data sets
    • Sharing preloaded data across a range of jobs
    • Data analytic workflows
    • How to control access/authorisation/security/etc.?
  • Workflows
    • Producer-consumer model
    • Remove the filesystem from intermediate stages
[Diagram: a producer-consumer workflow where Jobs 1-4 pass intermediate data through NVRAM rather than the filesystem.]
Using distributed storage
• Workflows:
  • How to enable different-sized applications?
  • How to schedule these jobs fairly?
  • How to enable secure access?
[Diagram: a workflow of differently sized jobs (Jobs 1-4, some running multiple instances) sharing data via NVRAM.]
The Challenge of distributed storage
• Enabling all the use cases in a multi-user, multi-job environment is the real challenge:
  • Heterogeneous scheduling mix
  • Different requirements on the NVRAM
  • Scheduling across these resources
  • Enabling sharing of nodes
  • Not impacting node compute performance
• Enabling applications to do more I/O:
  • Many of our applications don't make heavy use of I/O at the moment
  • What can we enable if I/O is significantly cheaper?
Questions?