TRANSCRIPT
HPC Innovation Exchange
A series that examines key trends, technology infrastructures and solutions within high-performance computing.

The Data Accelerator
A detailed examination of how the University of Cambridge, Dell EMC and Intel solved HPC I/O bottlenecks by co-developing the world’s fastest open source, software-defined NVMe storage solution.
Contents
Outline
Introduction
1 The HPC I/O bottleneck
1.1 Generic burst buffers
1.2 Cambridge “Data Accelerator” overview
1.2.1 Freely available code repository and detailed “How-To”
1.2.2 Data Accelerator workflows
1.2.3 Burst buffer lifecycle
1.2.4 Burst buffer access modes
1.2.5 Data Accelerator use cases
2 Architecture
2.1 Server & storage hardware
2.2 Placement of DACs within the OPA network
2.3 NVMe and Lustre file system configuration
2.4 Slurm and Orchestration software
3 Synthetic benchmark overview
3.1 Motivation for DAC performance study
3.2 Synthetic benchmark tools
3.3 Synthetic benchmark summary
4 Synthetic benchmark results
4.1 Bandwidth scaling within a single DAC
4.2 Bandwidth scaling across multiple DACs
4.3 Bandwidth limit of a single client node
4.4 Bandwidth — comparison to HDD Lustre system
4.5 Metadata performance of a single DAC
4.6 Metadata limit of a single client node
4.7 Metadata scaling across multiple DACs
4.8 Metadata — comparison to HDD Lustre system
4.9 IOPS scaling within a single DAC
4.10 IOPS scaling across multiple DACs
4.11 IOPS limit of a single client node
4.12 IOPS — comparison to HDD Lustre system
5 Application use cases
5.1 Large checkpoint restart and simulation output data
5.2 Relion — Cryo-EM structural biology
5.3 Square Kilometre Array — Science Data Processor benchmark
6 Discussion
7 References
Traditional centralised HPC file systems, built from
spinning HDDs, can create I/O bottlenecks and produce
unresponsive systems, significantly impacting the return on
investment in HPC solutions.
As workloads become ever more data-intensive, these
bottlenecks are only exacerbated. They frustrate and
restrain the art of the possible in the mind of the
researcher, limiting the imagination of the data scientists,
engineers and astronomers who wish to make tomorrow’s
discoveries today.
The Cambridge Dell EMC Data Accelerator removes
these limitations. It is architected to integrate with
traditional HPC storage without the need for re-design,
to interact with commonly used scheduling tools, to be
completely open source and extensible, and to make
optimal use of modern SSD and fabric technologies.
With special thanks to: Paul Calleja, Matthew Raso-Barnett, Alasdair King, Jeffrey Salmond, Wojciech Turek, John Garbutt, John Taylor, Jonathan Steel
During the development and testing of the DAC solution, a comprehensive set of I/O performance
tests was undertaken to probe the scalability and efficiency of the solution. In addition, issues
such as maximum single-client-node performance and OPA network saturation are examined. A final
DAC configuration is thus described that provides very high absolute performance with excellent
scalability and efficiency relative to the raw NVMe performance. The performance of the DAC is
summarised in the tables below:
Table 1 Single DAC node performance with 12 P4600 NVMe drives

                                              Read  Write  Read 4K  Write 4K  Metadata
                                              GB/s  GB/s   IOPS     IOPS      Create  Stat  Delete
Performance of all 12 NVMe within single DAC  22    14.7   5.6M     4M        83K     136K  69K
Efficiency relative to spec for 12 drives     58%   93%    N/A      N/A       N/A     N/A   N/A
Table 2 Multiple DAC node performance running across 24 DACs

                        Read  Write  Read  Write  Metadata
                        GB/s  GB/s   IOPS  IOPS   Create  Stat  Delete
No. of DAC nodes        24    24     24    24     24      24    24
Multi-DAC performance   513   340    15M   28M    1.3M    2.9M  0.7M
Scaling efficiency      97%   96%    N/A   N/A    N/A     N/A   N/A
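As a sanity check, the efficiency figures in Tables 1 and 2 follow directly from the per-drive specifications and the single-node results. A short Python sketch (values taken from the tables in this document; small differences from the quoted percentages are rounding):

```python
# Per-drive spec for the Intel P4600 1.6 TB: 3200 MB/s sequential read,
# 1325 MB/s sequential write (see the hardware table in Section 2.1)
spec_read_gbs = 12 * 3.2       # aggregate read spec of 12 drives: 38.4 GB/s
spec_write_gbs = 12 * 1.325    # aggregate write spec of 12 drives: 15.9 GB/s

# Single-DAC efficiency relative to spec (Table 1)
eff_read = 22 / spec_read_gbs      # ~0.57, quoted as 58%
eff_write = 14.7 / spec_write_gbs  # ~0.92, quoted as 93%

# Scaling efficiency across 24 DAC nodes relative to 24x single-DAC (Table 2)
scale_read = 513 / (24 * 22)       # ~0.97
scale_write = 340 / (24 * 14.7)    # ~0.96
```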
To test the real-world benefits of the DAC, three HPC application use cases have been examined:
HPC simulations with large checkpoint-restart requirements (the so-called Burst Buffer problem);
a challenging contemporary Cryo-EM structural biology data set; and a very demanding SKA
performance prototype. Executing on standard parallel file systems constructed with HDDs, these
applications exhibited significant I/O overheads, which were completely eliminated using the DAC.
With the checkpoint-restart use case alone, it is calculated that the cost of the DAC would be
recovered after one year of continuous usage through the increased throughput of the system.
Future work with the DAC will develop alternative file system deployments using the DAC
Orchestrator, such as BeeGFS and XFS/ext4 via NVMe-over-fabrics. We will also investigate I/O
monitoring and profiling on a per-job basis at both the client and the server. Finally, we will
investigate a wider range of application test cases within the materials, astronomy, engineering,
bioinformatics, medical imaging and AI domains, in order to build up an understanding of how best
to use tiered storage of this nature within a production HPC environment.
Outline
This paper describes the high-level architecture, comprehensive synthetic benchmarking, and
initial application testing of a performance-optimised prototype and reference implementation of
a solid-state storage Data Accelerator (DAC) subsystem, which can be deployed as an add-on to
Slurm[1]-enabled HPC systems to significantly boost the performance of data-intensive HPC and AI
workloads. The DAC described here is currently deployed as part of the Dell EMC Cumulus Cloud
HPC system[2] at the University of Cambridge and is accessible from conventional HPC systems and
Slurm as a Service[3] platforms atop OpenStack-enabled infrastructure. The work was undertaken
as a co-design project involving the University of Cambridge, Dell EMC, Intel and StackHPC,
and represents the first demonstrator of an open source, non-proprietary solid-state burst buffer
implementation. This demonstrator implementation of the Cambridge Dell EMC DAC was so
successful that it reached #1 in the June 2019 IO500[4] world HPC storage ranking, making this
early prototype the fastest HPC storage solution in the world, with almost twice the performance
of the second-placed entry on the list. With a footprint of just one rack of storage, the solution
was able to deliver over 500 GB/s of bandwidth, 1.3 million file creates per second and 28 million IOPS.
The motivation for the DAC is three-fold:
a. To alleviate the performance bottleneck often experienced by data-intensive HPC and AI
applications running on central networked file systems.
b. To provide deterministic, high-performance, schedulable I/O resources for these applications,
delivering breakthrough I/O performance for data-intensive workloads.
c. To create an open source software solution, using infrastructure-as-code and cloud-native
technologies built on readily available server and networking hardware, with all the
configuration scripts and detailed build instructions freely available, thereby making the
technology as accessible as possible and promoting the further development and testing of
solid-state I/O accelerators within the HPC domain.
From a hardware perspective, the DAC exploits modern NVMe SSD storage, packaged to
balance I/O connectivity and provide a highly scalable storage subsystem. The current DAC provides
an on-demand Lustre parallel file system, created on a per-job basis and fully integrated with
the open source Slurm HPC job scheduler, allowing stage-in/stage-out data to be defined at job
submission time, along with the size, performance and usage mode of the SSD Lustre file system
to be created. The creation and lifecycle management of the Lustre file system is handled by a
new open source software component developed at the University of Cambridge called the “DAC
Orchestrator”[5].
The DAC system at Cambridge is deployed within a 1,152-node Intel Skylake Omni-Path (OPA)
cluster and comprises 24 so-called ‘DAC server nodes’ providing around 500 TB of usable capacity
that can be presented as a Lustre parallel file system. A single DAC server node consists of a
Dell EMC R740xd server, 12 Intel P4600 1.6 TB NVMe drives and two Intel Omni-Path low-latency,
high-bandwidth network adapters.
HPC Innovation Exchange | 001 The Data Accelerator
Introduction
1 The HPC I/O bottleneck
Over recent years, scientific and technical workloads have become ever more data-centric, driven
by burgeoning volumes of data and by the emergence of data analytics and AI/machine
learning workloads, which are often both bandwidth- and latency- (IOPS-) sensitive. Traditional HPC
architectures commonly attach many thousands of compute cores to a single or, more
commonly now, multiple large centralised parallel file systems, with hundreds of HPC applications
running concurrently and placing simultaneous high demands on these monolithic central parallel
file systems. This not only causes an I/O bottleneck that degrades HPC application performance
through increased I/O wait time; during periods of high I/O intensity, such file systems can also be
rendered inoperable for normal file system operations.
Although the aggregate file system bandwidth of a large spinning-disk HPC file system can be high,
the performance seen by a single HPC application spanning multiple files, or a single shared file,
is often much lower. This low single-file performance on traditional disk-based parallel file systems
arises because file stripe patterns tend to cross only small elements of the total file system. There
quickly comes a point on traditional parallel file systems constructed with spinning disk where
performance gains from increased striping are offset by increasing contention. As a result, it is
common to find large central file systems with hundreds or even thousands of GB/s of total aggregate
bandwidth realising only a fraction of this in practice due to such operational issues.
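For context, Lustre exposes this stripe layout per file or directory; a hedged illustration using the standard `lfs` commands (the paths and stripe counts are examples, not the Cumulus configuration):

```shell
# Show the current stripe layout of a directory on a Lustre mount
lfs getstripe /lustre/scratch/mydir

# Stripe new files in this directory across 8 OSTs with a 1 MiB stripe size;
# wider striping raises single-file bandwidth until contention offsets the gain
lfs setstripe -c 8 -S 1M /lustre/scratch/mydir

# A stripe count of -1 spreads a file across all available OSTs
lfs setstripe -c -1 /lustre/scratch/bigfile
```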
Another factor that can add to the I/O bottleneck on a large spinning-disk parallel file system is the
need for long-term data resiliency. The bigger the file system, the more moving parts it contains that
can fail. To ensure that a large monolithic storage system does not fail, causing long downtimes
for its users, data protection techniques such as RAID are used, typically with dual or even triple
parity. While RAID provides additional resiliency and protects against disk failures, it also introduces
a significant performance hit. Further resiliency is provided by hardware with redundant components,
which raises the TCO higher still. The Cambridge Dell EMC DAC completely removes the need for
this resiliency by introducing the concept of short-lived, transient parallel file systems that are
tightly bound to the user workload and controlled by the workload manager (Slurm).
The poor application I/O performance obtained with spinning disk central HPC file systems reduces
investment returns from current-day HPC systems, and unresponsive interactive sessions deliver
poor user experiences. This situation continues to worsen as workloads become more data-intensive.
Even more pressing is that we now have more data-intensive use cases than we can exploit due to
lack of I/O capability. This restrains the art of the possible in the mind of the researcher, limiting
the imagination of the data scientists, engineers and astronomers wishing to make tomorrow’s
discoveries. The Cambridge Dell EMC DAC removes this limitation, unbinding the art of the possible
and unlocking tomorrow’s discoveries today, while reducing time to discovery and improving ROI on
HPC/AI solutions.
1.1 Generic burst buffers
The work presented here is motivated by the generalisation of the ‘Burst Buffer’ solution and the
adoption of state-of-the-art SSD technologies, high-performance networking, and software CI/CD
methodologies for ease of integration and extensibility. The use of SSD technology within large-scale
HPC system architectures has emerged over recent years as supercomputing sites have attempted to
overcome the growing HPC I/O bottleneck, helped by the rapid commoditisation of this technology
driven by other markets. Early Burst Buffer work can be seen at Los Alamos National Lab with the
Trinity system[6], used initially for checkpoint/restart of large running simulations. Another early use
of flash-accelerated file systems for HPC was at the San Diego Supercomputing Centre (SDSC) with
their system called Flash Gordon[7], which was primarily motivated by underlying application
acceleration. The work described here builds on this early Burst Buffer work and extends its
functionality. As a result, the term ‘Burst Buffer’ no longer seems appropriate, and the term
“Data Accelerator” or “DAC” was coined.
1.2 Cambridge “Data Accelerator” overview
To mitigate these I/O-related performance issues and to enable the next generation of data-
intensive workflows, we describe here the high-level architecture, performance characterisation
and integration of an NVMe-based Data Accelerator within the Cambridge HPC cluster. The solution
uses the Intel P4600, a PCIe-based SSD drive, combined with Intel Omni-Path HPC networking,
packaged on the Dell EMC R740xd 14G server platform. On top of this storage, server and
network hardware layer sits a software layer that allows the creation and deletion of an on-demand,
per-job Lustre parallel file system. The software-defined design approach exploits modern cloud
methodologies to promote a reproducible and extensible code base, making heavy use of Ansible
for infrastructure as code, together with etcd, a distributed key-value store used, for example, by
Kubernetes.
The DAC forms an integral part of the Cumulus Science Cloud system at the University of Cambridge
as seen in Figure 1 below:
[Figure: the High Performance Cumulus Science Cloud, combining high-performance networking (OPA network, HTE network) and high-performance storage (the main Lustre file system on SSD+HDD and the all-flash DAC) with the OpenStack IRIS HPC, CSD3-KNL, CSD3-GPU, CSD3-Skylake and OpenStack Clinical clusters.]
Figure 1 Cumulus Cloud at the University of Cambridge showing how the DAC fits into the overall
hardware architecture.
The current DAC is integrated with the Slurm scheduling software with no changes required to the
base installation, so that an individual per-job Lustre parallel file system is created from within the
job submission command. The size and performance of each Lustre file system can be set at job
submission time. Because each job obtains its own freshly created, ephemeral Lustre parallel file
system, no RAID redundancy features are needed on the file system, maximising performance. The
result is that I/O bandwidth and IOPS performance become deterministic and fully scalable in terms
of both single-file and aggregate performance of the whole file system.
1.2.1 Freely available code repository and detailed “How-To”
A GitHub code repository[5] contains all the Ansible scripts to build the solution, the orchestrator
software, and the relevant documentation and build instructions, so that a skilled HPC engineer with
the relevant Lustre system administration knowledge can recreate the solution. In addition, all test
scripts and results are provided so the solution can be verified. All the software is open source and
freely available, meaning that this extreme performance capability is now widely available and not
tied to expensive proprietary solutions.
1.2.2 Data Accelerator workflows
The DAC orchestration has focused on integrating with Slurm’s existing burst buffer support[8].
There is currently only one plugin for this, contributed by Cray for its Cray DataWarp product; an
example of its use can be seen on the NERSC Cori system[9]. The DAC Orchestrator reuses Slurm’s
Cray burst buffer plugin but with a different underlying implementation. The following sections list
the functionality available to Slurm users when using the Lustre-based DAC.
1.2.3 Burst buffer lifecycle
When a Slurm burst buffer is created by the DAC, it creates a new parallel file system of the
requested size. The granularity of resources assigned to the parallel file system is that of a single
NVMe drive, and Slurm rounds the requested size up to the nearest whole number of NVMe drives.
This approach was chosen to help deliver the most predictable levels of performance while allowing
many users to share the resources.
Each NVMe is split up into an LVM volume for storage and an LVM volume for metadata. This means
if a user needs to store more files than a single NVMe can hold, they simply request more storage
space to increase the number of supported files.
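The rounding rule above can be sketched as follows — a minimal illustration, where the 1.6 TB granularity matches the NVMe drives described later and the function names are ours:

```python
import math

DRIVE_TB = 1.6  # allocation granularity: one Intel P4600 NVMe drive

def drives_for_request(requested_tb: float) -> int:
    """Round a requested buffer size up to a whole number of NVMe drives."""
    return math.ceil(requested_tb / DRIVE_TB)

def allocated_tb(requested_tb: float) -> float:
    """Capacity actually assigned to the buffer after rounding up."""
    return drives_for_request(requested_tb) * DRIVE_TB

# A 2 TB request is rounded up to 2 drives, i.e. 3.2 TB of allocated capacity
print(drives_for_request(2.0), allocated_tb(2.0))
```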
When creating a burst buffer there are two ‘life cycles’ that can be chosen:
• Per-job burst buffer
- The job script specifies the required size of the buffer to be created. The system ensures the
buffer is ready to use before the job is assigned compute nodes; the buffer is deleted and
cleaned up after the job releases its compute node resources.
• Persistent burst buffer
- A named buffer that can be used by multiple jobs; the user controls when the buffer
is deleted.
1.2.4 Burst buffer access modes
A job can have zero or one per-job buffers and/or zero or more persistent buffers. The job is supplied
with environment variables that give the location of the mount points on the compute nodes.
While the latest versions of Slurm can create buffers on login nodes, this is not currently supported
by the DAC.
There are currently two supported access modes of the DAC:
• The simplest is called “striped” and involves a single shared namespace being mounted on all
the compute nodes running a given job. This is currently the only supported access mode for a
persistent burst buffer.
• If using a per-job burst buffer, a “private” access mode is also supported, where each compute
node is given its own independent namespace. A per-job burst buffer can support both the
“private” and “striped” access modes. The DAC implements these using separate directories for
each required namespace, which means any single mount point could consume all of the resources
assigned to that buffer.
A future access mode under consideration is a “transparent cache”, using either a persistent or
per-job burst buffer. We also hope to explore access modes that use NVMe-over-fabrics rather than
a parallel file system, either to improve the performance of a read-only data set mounted in several
locations, or to split an NVMe drive into namespaces and mount a separate file system on each
compute node, similar to the private access mode above.
Burst buffer data movement
One further aspect of orchestration is the data staging functionality that is available for per-job
burst buffers. Before the file system is mounted, data can be copied into the buffer; before the file
system is destroyed, data can be copied out to an external parallel file system. The copying is
performed by the DAC nodes and does not consume compute node wall clock time.
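The lifecycle, access-mode and staging options above are expressed through burst-buffer directives in the job script. Since the DAC Orchestrator reuses Slurm’s Cray (DataWarp) burst buffer plugin, a job might look like the following sketch — the capacity, paths and the `$DW_JOB_STRIPED` variable follow DataWarp conventions and are illustrative here, not the DAC’s documented interface:

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=01:00:00

# Create a per-job burst buffer (rounded up to whole NVMe drives),
# mounted on every compute node as a single shared "striped" namespace
#DW jobdw type=scratch access_mode=striped capacity=3TiB

# Stage data in before the job starts and results out after it ends;
# staging runs on the DAC nodes, not on compute-node wall clock time
#DW stage_in  type=directory source=/lustre/projects/input  destination=$DW_JOB_STRIPED/input
#DW stage_out type=directory source=$DW_JOB_STRIPED/results destination=/lustre/projects/results

srun ./my_simulation --scratch "$DW_JOB_STRIPED"
```

The mount point is passed to the job via an environment variable, as described above.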
1.2.5 Data Accelerator use cases
Building on the available lifecycles and access modes, the following use cases are expected to be
where users will see the biggest benefit from the Data Accelerator:
• Checkpoint and restart
- a persistent buffer to help split very long-running jobs into shorter jobs that can recover state.
• Faster access to an existing data set
- some subset of a data set pre-staged into a per-job buffer before the job starts;
- or “hot” data sets made available via a read-only mount of a persistent buffer.
• High-speed scratch space
- with the option to copy results back to the capacity file system after the job completes;
- reduces compute node time being wasted on pure I/O.
• Dealing with lots of small files
- the DAC’s dedicated metadata helps reduce the impact of noisy neighbours on a capacity-oriented
parallel file system.
• Swap
- for jobs with non-deterministic memory requirements.
2 Architecture
Conceptually, the high-level system architecture of the DAC can be viewed as consisting of 5 layers
(see Figure 2), each of which contributes to the performance profile and functionality of the DAC.
There is not a clean separation between these layers, but it is a useful construct to help segment and
understand how the various elements affect the attributes of the DAC.
[Figure: conceptual five-layer stack. Layer 1, the orchestration software and Slurm integration, sits above a DAC node comprising the Lustre parallel file system, LNET and local file system, the operating system, and two NICs plus 2 x 6 NVMe drives spread across the PCI lanes of two NUMA nodes; the OPA network fabric connects it to client nodes, each running a Lustre mount, LNET and the operating system over a PCI-attached NIC.]
Figure 2 Conceptual layering of the DAC high-level architecture
Each layer needs to be optimised to achieve a high percentage of the total theoretical performance,
combined with transparent application integration from the user’s perspective. The layers are
described below; during the development of the DAC, each was tuned repeatedly to obtain optimal
performance and functionality.
• Layer 1 — Orchestration: exposes the system to users by dynamically creating new parallel file
systems of the requested size; consists of the new orchestrator software integrated with the
Slurm HPC scheduling software.
• Layer 2 — Parallel file system: the parallel file system server and client are tuned to make the
best use of available storage and network resources.
• Layer 3 — Operating system: kernel and drivers tuned for optimal NVMe and OPA networking
performance.
• Layer 4 — PCI layout: OPA NICs and NVMe disks are evenly spread between NUMA nodes to
reduce PCI bus and CPU interconnect bottlenecks.
• Layer 5 — Network fabric: configuration tuned to ensure bandwidth can be sustained
between clients and servers.
2.1 Server & storage hardware
The Data Accelerator is built using Dell EMC R740xd 2U servers. Each DAC server has two 16-core
Intel Xeon Gold 6142 CPUs and contains two PLX PCIe gen-3 switches, with six Intel P4600 NVMe
SSDs[10] connected to each of the two PLX PCIe switches. This is summarised in the table below,
together with the base performance specifications of the NVMe drives.

Processor         2x Intel Xeon Gold 6142 at 2.5 GHz
Memory            24x 16 GB 2666 MT/s DDR4
Networking        2x Intel OPA v1 100 Gbps (12.5 GB/s) HFI
PCIe storage      2x PLX PCIe switch for NVMe
NVMe              12x Intel P4600 (1.6 TB)
Operating system  Red Hat Enterprise Linux 7.6

Intel P4600 SSD 1.6 TB specs
Sequential read (64K), up to    3200 MB/s
Sequential write (64K), up to   1325 MB/s
Random read (4K), up to         559,550 IOPS
Random write (4K), up to        176,500 IOPS
The DAC test configuration consists of a total of 24 DAC servers, each attached with two Intel OPA
links to the University of Cambridge’s production 1,152-node Intel Skylake cluster, a component of
the UK Science Cloud in Cambridge called Cumulus. The DAC will be rolled out over the rest of the
HPC estate, including the Cumulus-KNL and GPU clusters together with the UK IRIS OpenStack
cluster. The DAC provides a total of around 0.5 PB of storage using 288 NVMe drives.
2.2 Placement of DACs within the OPA network
The 24-node DAC configuration described here is connected to the Cambridge Cumulus system via
its OPA network and can generate over 500 GiB/s of I/O bandwidth. Ensuring that this data can
move around the cluster without bottlenecks requires careful consideration of the placement of the
DAC nodes within the fabric and optimisation of the OPA settings.
Cumulus employs a traditional oversubscribed (2:1) fat-tree topology, using 48 port Omni-Path edge
switches. Each edge switch has 32 ports connecting compute nodes and 16 ports connecting to the
core OPA network.
Our initial placement connected the 24 DAC nodes to three dedicated edge switches: eight DAC
nodes per switch, 16 links down to the DAC nodes and 16 links up to the core OPA fabric. The
motivation was to remove any oversubscription between the DAC nodes and the core network.
In this configuration, scalability across the DAC nodes was observed to be significantly impaired.
Further experimentation found that DAC scaling stopped when more than three DAC server nodes
were connected to a single leaf switch, even with congestion control turned on. We have yet to
understand the root cause of these issues.
To work around this issue, an alternative topology approach was used where the DAC nodes were
spread out over a much larger number of switches. Here we place one DAC node per edge switch
which connects client nodes to the core OPA fabric. Thus all 24 DAC nodes are spread out over 24
edge switches that also plug directly into client nodes. With this configuration full DAC bandwidth
scaling is observed.
Besides fabric topology, the DAC nodes and compute clients were tuned according to revision 13.0
of the Intel Omni-Path Fabric Performance Tuning User Guide.14 For all the benchmarks in this paper,
the DAC nodes used the inbox hfi1 driver shipped with Red Hat Enterprise Linux 7.6. The specific hfi1 driver parameters used for benchmarking on both servers and clients were:
options hfi1 krcvqs=8 pcie_caps=0x51 rcvhdrcnt=8192
Compute nodes used the Intel Fabric Software (IFS) version 10.8 driver.
2.3 NVMe and Lustre file system configurationThe current DAC deploys Lustre on top of the ldiskfs backend file system. Lustre is a long-standing
high-performance parallel file system and is open-sourced under the GPL v2 licence.
Each OST or MDT is built from a single physical volume with no RAID redundancy. This is because the design of the DAC is optimised for performance rather than long-term resilience: all data to be stored on the DAC is also held on long-term storage and is staged in and out continuously. This is one of the key design elements that allow maximum performance to be obtained.
Each NVMe device is formatted with a 4KiB sector size using the Intel Solid-State Drive Data Centre
Tool. We then create an LVM physical volume for each device, which the DAC Orchestrator will use
to partition into MDT and OST logical volumes.
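As a minimal sketch of this per-drive preparation, assuming the drive has already been formatted to 4KiB sectors with the Intel tool; the device names, volume group name, sizes and Lustre identifiers below are illustrative placeholders, not the values used by the DAC's fs-ansible scripts:

```shell
# One LVM physical volume per NVMe device:
pvcreate /dev/nvme0n1
vgcreate dac_vg /dev/nvme0n1

# The DAC Orchestrator then carves OST and (optionally) MDT
# logical volumes out of each physical volume:
lvcreate -n ost0 -L 1400G dac_vg
lvcreate -n mdt0 -L 100G dac_vg

# Lustre (ldiskfs backend) is built on top; fsname/index/MGS node
# are placeholders:
mkfs.lustre --ost --backfstype=ldiskfs --fsname=dacfs --index=0 \
            --mgsnode=mgs@o2ib /dev/dac_vg/ost0
```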
For the benchmarks in this paper, we are using Lustre 2.12.2 on the servers, which contains improvements for flash storage systems, ensuring optimum performance. Lustre 2.12.2 is also used on clients. Lustre was built and installed as described in the Lustre documentation.15 Both servers and clients were running Red Hat Enterprise Linux 7.6 with kernel 3.10.0-957.10.1.
During the benchmarking tests for this paper, we reliably hit RDMA timeout errors in Lustre, particularly during IOR read tests. We have filed a bug report regarding this problem12 and were advised to revert one commit on top of 2.12.2 as a workaround until the issue is resolved. We are still working with the developers to root-cause this issue; the benchmarks in this paper were run with 2.12.2 plus this commit reverted.
Tunables for the Lustre file system are provided in the DAC ansible configuration in the code
repository,5 but a summary is given below:
Lustre server tunables
Increase MDT maximum number of modifying RPCs in-flight allowed per client:
mds$ echo 127 > /sys/module/mdt/parameters/max_mod_rpcs_per_client
Increase maximum RPC size to 16MiB:
oss$ lctl set_param obdfilter.*OST*.brw_size=16
Lustre client tunables
Disable client LNET data integrity checksums:
client$ lctl set_param osc.*.checksums=0
Increase client maximum RPC size to 16MiB:
client$ lctl set_param osc.*OST*.max_pages_per_rpc=16M
Increase client max_rpcs_in_flight and max_mod_rpcs_in_flight to match server:
client$ lctl set_param mdc.*.max_rpcs_in_flight=128
client$ lctl set_param osc.*.max_rpcs_in_flight=128
client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=127
Increase client maximum amount of data readahead on a file, and the global limit:
client$ lctl set_param llite.*.max_read_ahead_mb=2048
client$ lctl set_param llite.*.max_read_ahead_per_file_mb=256
Increase the amount of dirty data per OST the client will store in the pagecache:
client$ lctl set_param osc.*OST*.max_dirty_mb=512
2.4 Slurm and Orchestration software

The DAC uses an open source orchestration tool developed at the University of Cambridge with
StackHPC that integrates with Slurm. The DAC Orchestrator is currently configured to deploy a
Lustre parallel file system per job as directed by Slurm. It is planned that the orchestrator could be
extended to use other file systems such as BeeGFS, but here we only discuss Lustre.
Slurm already supports burst buffers via its burst buffer plugin interface. However, there is currently only one working plugin, the Cray DataWarp plugin. To integrate with Slurm, the DAC Orchestrator reuses both the user and orchestrator interfaces that Slurm created to expose Cray DataWarp to its users. As such, the standard Slurm documentation is valid for DAC Orchestrator-provided burst buffers.8
Requesting a Burst Buffer from Slurm

Users interact with burst buffers by adding special directives to their job scripts. For example, to
create a burst buffer that can be attached to multiple different jobs, you can create a persistent
burst buffer using the following job script:
#BB create_persistent name=alpha capacity=2TB access=striped type=scratch
In this second example we look at the directives to add if you want to use the persistent buffer created in the previous example. In addition, we ask Slurm to create a per-job burst buffer, copy an input file into that per-job buffer before the job starts, and copy out a directory of files after the job has completed. Finally, we also request that additional swap is added to every compute node.
#DW persistentdw name=alpha
#DW jobdw capacity=1400GB access_mode=striped,private type=scratch
#DW stage_in source=/mnt/user1/input/file1 destination=$DW_JOB_STRIPED/file1 type=file
#DW stage_out source=$DW_JOB_STRIPED/outdir destination=/mnt/user1/out/run3 type=directory
#DW swap 1TB
When the user’s job executes, Slurm uses the burst buffer plugin to ensure the buffer has been
created and mounted on the compute nodes before the job is started. The user’s job script is
supplied with environment variables that tell the user where the buffer has been mounted.
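Putting these pieces together, a complete job script might look like the following sketch. The node count, application name and staging paths are illustrative, and the DW_PERSISTENT_STRIPED_<name> variable follows the Cray DataWarp naming convention that the plugin reuses:

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=01:00:00
#DW persistentdw name=alpha
#DW jobdw capacity=1400GB access_mode=striped type=scratch
#DW stage_in source=/mnt/user1/input/file1 destination=$DW_JOB_STRIPED/file1 type=file
#DW stage_out source=$DW_JOB_STRIPED/outdir destination=/mnt/user1/out/run3 type=directory

# Slurm exports the mount points before the job starts:
echo "per-job buffer mounted at:   $DW_JOB_STRIPED"
echo "persistent buffer alpha at:  $DW_PERSISTENT_STRIPED_alpha"

srun ./my_app --input "$DW_JOB_STRIPED/file1" --output "$DW_JOB_STRIPED/outdir"
```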
To better understand what is available to users, let us consider the #DW jobdw directive more carefully. The type is defined to be scratch. Currently, this is the only valid value, although it is hoped that a transparent cache mode will be added in the future.
The capacity requested will be “rounded-up” to the nearest level of granularity that is available in
the requested burst buffer pool (by default we are requesting from the default pool). The size of the
NVMe disks currently determines the granularity that is reported by the DAC Orchestrator.
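As a sketch of the rounding rule, assuming a hypothetical pool granularity of 1490 GiB (roughly one 1.6 TB drive):

```shell
requested=1400     # GiB asked for in the jobdw directive
granularity=1490   # GiB per allocation unit reported by the pool
units=$(( (requested + granularity - 1) / granularity ))   # ceiling division
allocated=$(( units * granularity ))
echo "request rounded up to ${allocated} GiB (${units} unit(s))"
```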
Finally, we should note that two access modes are supported: striped and private. “Striped” means
a buffer is exposed as a single namespace to all compute nodes. “Private” means each compute
node receives its own dedicated namespace. In the example above, the buffer is dynamically shared
between both access modes.
The DAC Orchestrator implements this by creating a new parallel file system for each burst buffer.
This parallel file system is mounted on all the compute nodes, with a directory created for each of
the required namespaces. In addition, the same parallel file system is extended to include the space
needed for any compute node swap files, that are also stored on the same parallel file system as
shown in Figure 3.
[Figure 3 shows each host mounting /dac/<name>/global (the shared namespace), /dac/<name>/private (a per-host namespace) and a per-host swap file, all backed by the same buffer.]

Figure 3 Mount points and namespaces within a per job buffer
DAC Orchestrator Implementation
The DAC Orchestrator is composed of three key components, written in Go and Ansible:

• dacctl: a command line tool called by Slurm's Cray DataWarp burst buffer plugin
• dacd: a service that runs on each DAC node and creates the requested parallel file systems
• fs-ansible: Ansible scripts used by dacd to orchestrate the creation and deletion of the requested parallel file systems

dacctl and dacd communicate with each other via the etcd datastore, as shown in Figure 4.
[Figure 4 shows Slurm's Cray DataWarp plugin invoking dacctl, which communicates with dacd through etcd; dacd in turn drives the hardware via Ansible.]

Figure 4 Data Accelerator Orchestrator's Architecture
The Slurm plugin defines the workflow for management of the burst buffers. Figure 5 shows at each
step the evolution of the user’s request. At each point along the line, there is input from Slurm via
dacctl as the buffer is created, consumed and removed. Slurm also delegates to the DAC Orchestrator
the responsibility for copying users’ files, so that compute resources are not tied up during data
copy. The mount and unmount do happen on the compute nodes assigned to the job, but these are designed to be relatively quick operations and do not overly impact the time spent in execution.
[Figure 5 shows the buffer lifecycle: Validate Request, Setup Buffer, Stage-In Data, Mount on the chosen compute nodes, Execute, Unmount, Stage-Out Data, Release Buffer.]

Figure 5 Lifecycle of a buffer requested by a user's job submission
For further guidance on setting up and configuring the Orchestrator please see the Orchestrator
installation guide.5
3 Synthetic benchmark overview
A detailed set of synthetic benchmark studies was undertaken to probe I/O scaling both within a single DAC and across multiple DAC units connected to the large scale (1152 node) Intel Skylake HPC cluster deployed at the University of Cambridge. This system is part of the "Cumulus" science cloud, the largest academic HPC system in the UK from 2017 to 2019.
3.1 The motivation for the DAC performance study
Characterise DAC performance as a design tool

A key design goal of the DAC is to build a storage server with the highest possible performance for
the lowest possible cost, with the system being balanced in terms of read and write I/O metrics,
including sequential bandwidth, metadata and IOPS. To achieve this, the Dell-EMC R740xd server
platform was adopted, which provided a flexible PCIe sub-system that can be populated with variable
amounts of NVMe and OPA network cards allowing a wide range of cost-effective flexible system
configurations to be explored.
Given the flexibility of the hardware unit, it was necessary to maintain a rigorous testing environment
to ensure the DAC configuration was optimised to provide the highest performance possible and to
verify that the performance was maintained both within a single DAC server and across multiple DAC
servers connected across the OPA switch fabric. In order to do this a series of synthetic benchmarks
were undertaken to probe performance and scalability up through the data path, within a single DAC,
across multiple DAC units and out to remote clients.
Link application I/O requirements to DAC I/O performance

The questions we look to answer in respect of performance optimisation are:
• What is the maximum performance you can extract from a single DAC? Are we obtaining all the performance we can expect? Where are the bottlenecks?
• Given an application's I/O requirements, how many NVMe drives should there be in each DAC node?
• How many DAC nodes are needed if the performance required is more than a single DAC can deliver?
When trying to understand how application I/O will perform and scale on the DAC we have
undertaken a basic study looking at a range of I/O operations that can be characterised by simple
synthetic benchmark metrics. These synthetic benchmark metrics can then be run on the DAC
allowing the results to be related back to application I/O operation requirements. The table below lists the metrics associated with these operations:
I/O pattern                | Metric
Bulk file read/write       | Bandwidth — IOR13
File creation and deletion | Metadata — mdtest13
Small file read/write      | IOPS — IOR with 4KiB transfer size
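For illustration, these metrics map onto invocations of roughly the following shape (rank counts, transfer sizes and block sizes are placeholders, not the exact IO-500 parameters used later in this paper):

```shell
# Bulk bandwidth: large sequential transfers, one file per process
mpirun -np 384 ior -w -r -t 1m -b 8g -F

# Metadata: file create/stat/delete rates, unique directory per task
mpirun -np 384 mdtest -n 100000 -u

# Small-file IOPS: IOR again, but with a 4KiB transfer size
mpirun -np 384 ior -w -r -t 4k -b 5g -F
```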
We have probed the performance of these metrics as you scale across NVMe drives within a single
DAC and then as you scale across multiple DACs. By understanding an application’s bandwidth,
metadata or IOPS requirements and knowing how the performance of these metrics scale within
and across the DAC units, it is possible to estimate the performance of a particular sized DAC in
terms of NVMe drives or multiple DAC server nodes. The synthetic benchmark regime used here is
described in the section below and provides results for an idealised application. As such, it explores
the uppermost performance an application could deliver.
3.2 Synthetic benchmark tools

The synthetic benchmarks used in this work were taken from the recently developed IO-500 benchmark suite. The IO-500 creates a single score from a range of different benchmarks. The IO-500 is used as a worldwide HPC storage ranking that takes place twice a year.
For the most recent (June 2019) submission, we worked in close collaboration with engineers from
Whamcloud, the primary developers of Lustre, to get the best possible performance we could from
the system, with the result that the DAC came top of the list, with a score almost twice that of the
second-place entry.
The DAC result in the June 2019 IO-500 submission used the very latest tip of the development version of Lustre, and included additional patches developed by Whamcloud that provided significant performance improvements to a number of the tests in the benchmark. Since these improvements had not yet landed in an official stable release of Lustre, we chose to use the latest stable version (2.12.2) for the benchmarks in this paper to provide a more easily reproducible configuration.
IO-500: the official ranked list for ISC HPC 2019, showing the best result for a given combination of system/institution/filesystem.

# | Institution | System | Storage vendor | Filesystem type | Client nodes | Client total procs | IO500 score | BW (GiB/s) | MD (kIOP/s)
1 | University of Cambridge | Data Accelerator | Dell EMC | Lustre | 512 | 8192 | 620.69 | 162.05 | 2377.44
2 | Oak Ridge National Laboratory | Summit | IBM | Spectrum Scale | 504 | 1008 | 330.56 | 88.20 | 1238.93
3 | JCAHPC | Oakforest-PACS | DDN | IME | 2048 | 2048 | 275.65 | 492.06 | 154.41
4 | Korea Institute of Science and Technology Information (KISTI) | NURION | DDN | IME | 2048 | 4096 | 156.91 | 554.23 | 44.43

Figure 6 June 2019 IO-500 Listing with the Cambridge DAC 1st
Figure 7 below, shows all the entries made in the three submission rounds since the IO-500 began,
each submission round having a different coloured data point. By looking at the different tests within
the IO-500 benchmark suite, it is clear that the DAC's high metadata and IOPS performance means it stands head and shoulders above the pack.
[Figure 7 plots IO-500 score against position in the list (top 50) for the 2018-06, 2018-11 and 2019-06 submission rounds. The Cambridge Dell R740xd NVMe Lustre entry sits in first place with roughly twice the score of second-place Summit (IBM Spectrum Scale) and around four times most of the pack, which includes Oakforest-PACS (DDN IME) and Weka-IO Matrix.]

Figure 7 IO-500 entries coloured by submission round
For the work in this paper, we have taken several of the individual tests within the set of benchmarks
that make up the complete IO-500 suite, in order to probe the I/O performance metrics of interest
to us. The use of these subcomponents provides a well-thought-out and well-tested benchmark regime, and also allows the results generated to be compared with the wider set of results from other storage systems already stored within the IO-500 database.
For the purposes of this paper, we have used the “ior-easy” and “mdtest-easy” tests from the
IO-500 since they probe the maximum performance values possible on the hardware, and thus
provide the upper bound in performance obtained from an idealised application.
3.3 Synthetic benchmark summary

The following provides a high-level summary of the synthetic benchmark results:
• Bandwidth
  - Single DAC bandwidth.
    > Max performance across all 12 NVMe drives: read 22 GiB/s, write 14 GiB/s.
    > A single NVMe drive gives 3.1 GiB/s read and 1.3 GiB/s write, 95% of the spec value.
    > Scales linearly across NVMe drives within a single DAC, subject to internal PCIe switch and Lustre LNET data transfer limits via the OPA.
  - Multiple DAC bandwidth.
    > Max bandwidth across 24 DACs: 513 GiB/s read and 340 GiB/s write.
    > Scaling across DACs is linear with over 97% efficiency.
  - Single client node bandwidth.
    > The IOR easy benchmark from the IO-500 demonstrates that a single client node is able to reach read/write bandwidths of 10 GiB/s. This is very close to the 11 GiB/s maximum bandwidth Lustre can support over a single OPA card.
  - Comparison to a traditional SAS-HDD based Lustre file system with 36 OSTs built from RAID6 arrays.
    > Maximum performance of a busy file system under load, when striped over all 36 OSTs, was approximately 10 GiB/s write, 15 GiB/s read.
• Metadata
  - Single DAC metadata performance with 1 MDT.
    > Maximum performance appeared to peak around 12 clients, with 83k file create IOPS, 136k stat IOPS and 69k delete IOPS.
    > Maximum single client performance of 21k create IOPS, 71k stat IOPS and 58k delete IOPS.
  - Multiple DAC metadata performance scaling.
    > Good scalability was observed using Lustre's DNE2 striped directories as we scaled the number of MDTs up to 48 in the file system. At 24 DACs with 128 clients, we reached a peak of 1.3M create IOPS, 2.9M stat IOPS and 0.7M delete IOPS.
  - Comparison to a traditional SAS-HDD based Lustre file system with 1 MDT on a RAID10 array.
    > Maximum performance appears to peak around 6-8 clients: 22K file create IOPS, 150K stat IOPS, 29K delete IOPS.
    > Maximum single client performance of 18K create IOPS, 48K stat IOPS and 19K delete IOPS.
• 4K IOPS
  - Single DAC 4k IOPS performance with Direct I/O off (buffered I/O).
    > Maximum performance with 32 clients is 5.6M read and 4M write IOPS.
  - Single DAC 4k IOPS performance with Direct I/O on.
    > Maximum performance with 32 clients is 250K read and 70K write IOPS.
  - Multiple DAC 4k IOPS performance scaling — Direct I/O off.
    > Maximum performance with 32 clients is 15M read and 28M write IOPS.
  - Multiple DAC 4k IOPS performance scaling — Direct I/O on.
    > Maximum performance with 32 clients is 1.6M read and 1.6M write IOPS.
  - Comparison to a traditional HDD-based Lustre file system with 36x RAID6-based OSTs — Direct I/O off.
    > Maximum performance with 32 clients is 1.7M read and 2.3M write IOPS.
  - Comparison to a traditional HDD-based Lustre file system with 36x RAID6-based OSTs — Direct I/O on.
    > Maximum performance with 32 clients is 372K read and 15K write IOPS.
4 Synthetic benchmark results
This section provides a detailed description of the synthetic benchmark tests that were undertaken to profile performance scaling within and across multiple DAC units as configured within the Cumulus OPA cluster. All the test scripts and raw data can be found in the code repository.
4.1 Bandwidth scaling within a single DAC

Here we explore a single DAC node's bandwidth performance as we scale the Lustre file system across an increasing number of NVMe drives within a single DAC node, using the IO-500 ior_easy test with 12 client nodes.
[Chart: Single DAC R/W bandwidth scaling; bandwidth (GiB/s, 0-25) against the number of NVMe drives in the Lustre pool (1-12), with separate write and read series.]
Figure 8 Read / Write performance as measured from the ‘ior_easy’ phases of the IO-500 benchmark.
The number of IOR clients is fixed at 12, each with 32 MPI ranks and varying the number of NVMe
OSTs in the Lustre file system being tested. All NVMe drives are housed in a single DAC server.
Single NVMe performance

First, let's consider the results for a file system consisting of a single NVMe. Here we see a read
performance of 3.1 GB/s and a write performance of 1.3 GB/s for a single NVMe which is very close
to published spec values of the NVMe disks which have a maximum value of read 3.2 GB/s and write
of 1.3 GB/s.10 This demonstrates that the Lustre parallel file system and the underlying configuration
of the NVMe drive, OS and drivers are able to obtain the full read/write performance of the NVMe
drives through the file system and out across the network to a remote client node.
Scaling across multiple NVMe

As more NVMe OSTs are added to the Lustre file system, we see the write performance scale
linearly all the way to 12 disks, with an overall value of 14.6 GiB/s and a write efficiency of 93%,
compared to the theoretical maximum for 12 NVMe disks. The read performance scales with a more
complex pattern across the drives. We see linear read performance scalability up to NVMe #4 with
a delivered bandwidth of 12.5 GiB/s and a read efficiency across the 4 drives of 97% compared to
the theoretical maximum value. Adding disk #5 only increases the bandwidth by 0.5 GiB/s reaching a
maximum value of 13 GiB/s at 5 drives.
The performance then plateaus until drive #7, at which point it starts to scale linearly again before plateauing once more at drive #10, with a maximum read performance across all 12 NVMe drives of 22.3 GiB/s, an overall read efficiency of 58%.
Understanding the read and write behaviour

This complex read scaling behaviour is the result of a bandwidth limit within the data path, from NVMe drive to remote client, which reduces the read performance from its maximum potential value for 12 drives of 38 GiB/s down to 22 GiB/s, in line with the network bandwidth limit of two OPA cards. The write performance scaling is straightforward: it is linear, reaching 97% of the maximum potential value. This is because the write performance per disk is lower than the read performance, so as more disks are added the aggregate remains below the bandwidth limit of the OPA cards.
The read scaling needs to be viewed in the context of the architecture of the DAC server. Each DAC node consists of a Dell R740xd configured with two OPA NICs, which are presented as a single "bonded pair" using Lustre's LNET Multi-Rail feature. The NVMe disks are evenly spread across two internal server PCIe switches. The Dell R740xd two-socket configuration is balanced by having each CPU socket connected to one Omni-Path adapter and one PCIe switch.
LNET self-test was used to establish a maximum bandwidth of 22 GiB/s for a single node. In a similar way, IOR was used to establish the maximum bandwidth through a single PCIe switch to be 13.6 GiB/s. Thus, from a data path perspective, the first I/O bandwidth limit is the 13.6 GiB/s cap of a PCIe switch: this limits read bandwidth at disk #5, when the bandwidth of the disk pool exceeds 13.6 GiB/s (each disk provides 3 GiB/s, so five disks cross the limit). This is observed in Figure 8. The limit is removed once the system uses disks from the second PCIe switch, which happens at disk #7, and again this is confirmed in Figure 8. The next cap in bandwidth, also shown in Figure 8, is introduced by the 22 GiB/s limit of the dual OPA cards and is observed at disk #11.
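The plateaus can be read straight off this arithmetic, using the approximate per-drive read rate of 3 GiB/s against the measured 13.6 GiB/s PCIe switch and 22 GiB/s OPA limits:

```shell
per_drive=3   # GiB/s sequential read per NVMe drive (approximate)
echo "4 drives:  $((4 * per_drive)) GiB/s -> under the 13.6 GiB/s PCIe switch cap"
echo "5 drives:  $((5 * per_drive)) GiB/s -> over the switch cap: first plateau"
echo "7+ drives: second PCIe switch in play, scaling resumes"
echo "11 drives: $((11 * per_drive)) GiB/s -> over the 22 GiB/s OPA cap: second plateau"
```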
Bandwidth scaling over 95% of the expected maximum value

We can see that the bandwidth scaling of the file system in terms of the theoretical maximum value obtained from the underlying NVMe drives is good, reaching over 95% of the maximum value when the internal bandwidth limitations of the internal PCIe switch and OPA network cards are considered.
This helps us understand that the tunings made at the NVMe firmware level, LVM level, Lustre file
system and OPA network levels are all optimal. We now understand the internal hardware limits of
the server, and how this affects the performance of NVMe drives as they are scaled within a DAC
node, and we also understand how to obtain linear scaling of a Lustre file system as it is scaled
across NVMe drives within a single DAC node.
Test our understanding and obtain linear scaling for both read and write

To help prove our understanding of bandwidth scaling, and to obtain a linear scaling curve for both read and write, the tests were repeated, but this time balancing the disks between the two PCIe switches as we scale up. The results are shown in Figure 9.
[Chart: Single DAC R/W bandwidth scaling with a PCIe-aware disk layout; bandwidth (GiB/s, 0-25) against the number of NVMe drives in the Lustre pool (1-12), with separate write and read series.]
Figure 9 Read/Write performance as NVMe OSTs within a single DAC node are added in a PCIe-aware fashion to the Lustre OST pool
As expected, we see the same write bandwidth scaling, but a different pattern in the read bandwidth. With careful balancing of the NVMe drives between the PCIe bridges, the only remaining limit on read bandwidth is the OPA bandwidth limit, with a smooth curve growing at 3 GiB/s per NVMe drive until hitting the 22 GiB/s limit of the OPA links out of the server.
To further prove the limit is related to a single host and not some fundamental limitation in a single
Lustre file system, we can remove the PCIe and OPA bottleneck by spreading the NVMe evenly
between three servers. (By default, this is what the orchestrator will do when creating a file system
across three DAC servers).
[Chart: bandwidth scaling as NVMe drives are added across DAC units by the Orchestrator; bandwidth (GiB/s, 0-40) against the number of NVMe drives in the Lustre pool (1-12), with separate write and read series.]
Figure 10 Bandwidth scaling as NVMe drives spread evenly across 3 DAC nodes
4.2 Bandwidth scaling across multiple DACs

In this section, we show the linear scaling of read and write bandwidth across all 24 DACs in our
configuration. Using ior_easy in the same way as the previous test, we grow the size of the Lustre
file system one DAC at a time until all 24 DAC units are used.

[Chart: multiple DAC bandwidth scaling; bandwidth (GiB/s, 0-600) against the number of DAC units in the Lustre pool (0-24), with separate write and read series.]
Figure 11 R/W IOR bandwidth scaled across 1-24 DAC units
Here we see the single DAC bandwidth performance of 22 GiB/s read and 14.7 GiB/s write scale
linearly as we increase the number of DAC servers, giving a total of 513 GiB/s read and 340 GiB/s
write across 24 DACs. The number of clients in the test is scaled with the number of DACs to ensure
all the bandwidth can be consumed. At 24 DAC nodes, we are using 160 client nodes to fully saturate
the bandwidth.
4.3 Bandwidth limit of a single client node

Running across all 24 DAC nodes, with a range of different client-node counts, using the same ior_easy test from the IO-500 that was used in the last section, we see that individual client nodes can sustain a peak of 10 GB/s for both read and write. The single OPA link within each client has a maximum throughput with Lustre via the LNET protocol of 11 GB/s, so 10 GB/s is a high percentage of the maximum achievable performance.
Thus, we now understand the bandwidth behaviour of the DAC I/O server and the bandwidth that
can be consumed by a single client, providing us with a complete picture of the upper bound of
bandwidth performance as seen by an idealised application.
4.4 Bandwidth — comparison to HDD Lustre system

Current HDD-based Lustre systems in Cambridge are built on Dell EMC MD direct-attached SAS
storage arrays. We currently run multiple 2.4PiB file systems instead of one large file system as this
allows us more management flexibility and creates smaller failure domains. Each file system is built
in a highly-available configuration with pairs of storage arrays, each connected in a fully redundant
configuration to two storage servers. We utilise the MD3460 array for the OSTs, which contains 60
disks, and we arrange this into six 10-disk RAID6 volumes as the OST devices. Most of our production
file systems are composed of six MD3460 arrays, giving us 36 OSTs in total per file system.
The current maximum read or write bandwidth we obtain from an entire 2PiB Lustre file system is
approximately 15GiB/s, but to obtain this for a single job or workflow it must utilise all the OSTs
within the file system, which on a shared HPC system is very unlikely and in a multiuser environment
would lead to considerable performance contention. Comparing this to the DAC, we obtain almost
the same performance from just one DAC unit, i.e. 20 GiB/s read, 15 GiB/s write, and this is available
to a single job or workflow with zero performance contention. The cost delta between 2PiB of
spinning disk Lustre and 19TiB of DAC storage—where the DAC provides more real-world single job
bandwidth performance—is 13x. That is to say, a DAC offers 13x lower cost-per-unit of performance
than our HDD-based Lustre.
4.5 Metadata performance of a single DAC

Here we use the mdtest 'easy' phases from the IO-500 benchmark to probe the maximum metadata performance we can obtain from a single DAC node (see Figure 12). In this configuration, the file system consists of a single MDT and 12 OSTs. For this benchmark, we are using the following settings in the io500.sh script:
io500_mdtest_easy_params="-u -L -vvv"
io500_mdtest_easy_files_per_proc=200000
[Chart: mdtest-easy IO-500 results for a single DAC server (1x MDT); kIOPS (0-140) against the number of clients (1-32), with write, stat and delete series.]
Figure 12 Metadata performance within a single node vs the number of clients
From the graph, we can see a maximum of 83K write (file creates) IOPS, 136K stat IOPS and 69K
delete IOPS.
4.6 Metadata limit of a single client node

When we utilise just one client node to drive metadata performance, we see the following: create 21 kIOPS, stat 71 kIOPS, delete 58 kIOPS.
4.7 Metadata scaling across multiple DACs

To measure metadata performance as we scale the Lustre file system across multiple DAC units, we
use the same mdtest easy benchmarks that we used in the last section. We found we were almost
able to scale linearly up from one DAC to 24 DACs as shown in Figure 13, where each DAC server
contains a single MDT and 12 OSTs. We are using Lustre’s DNE2 striped directories to spread the
metadata load across all the MDT targets. We also tested a configuration using two MDTs per DAC,
48 MDTs in total and saw a further-improved performance. However, we were unable to explore
configurations with three or more MDTs per DAC due to a current bug in Lustre11 limiting the
maximum number of MDTs in a file system.
All tests were run using 128 clients. Our io500.sh configuration was similar to the single DAC tests.
io500_mdtest_easy_params="-u -L -vvv"
io500_mdtest_easy_files_per_proc=120000
In the directory setup section of io500.sh the mdtest directory was configured to be striped over all
MDTs.
# Use DNE2 for mdt_easy
lfs setdirstripe -c -1 -D $io500_mdt_easy
The maximum performance was achieved with 48 MDTs, with approximately 1.3M creates, 2.9M
stats and 0.7M deletes per second.
[Chart: mdtest-easy IO-500 results for multiple MDTs using DNE2; kIOPS (0-3000) against the number of MDTs in the file system (1-48), with write, stat and delete series.]
Figure 13 Multiple DAC unit metadata scalability
It is worth noting that the DAC Orchestrator will only limit a file system to one MDT per host when a
user requests more than 24 NVMe drives to be added into their file system. Smaller file systems are
given an MDT for every OST.
4.8 Metadata — comparison to HDD Lustre system

We have made a comparison of metadata performance seen on the DAC with the same metadata
tests run on our traditional spinning disk Lustre file systems, as described above in the bandwidth
comparison section. These file systems contain only a single MDT, which is built on the Dell MD3420
SAS array, which can contain up to 24 2.5” SAS drives. Our MDT devices are configured as a RAID10
of 20 300GiB 15K RPM SAS HDDs, with 4 SAS SSDs used as a read and writethrough cache.
Similar to the single-DAC benchmark above, here we varied the number of client nodes in the
benchmark to see how metadata performance scaled. This was carried out on a ‘live’ in-use production
file system, so it was not possible to ensure other jobs were not causing contention in the system.
Here we see the write and delete performance remaining relatively consistent, with peaks of 22K creates, 150K stats and 29K deletes per second. Other than the stat performance, which can be affected by client caching, the create and delete performance of a single DAC MDT is 2-3x greater, and multiple DAC MDTs vastly outperform this configuration.
[Chart: mdtest-easy IO500 on HDD-based Lustre with 1x MDT on RAID10 SAS-HDD with flash cache; write, stat and delete kIOPS versus number of clients (1 to 32).]
Figure 14 Metadata performance of a HDD-based production file system with one MDT
4.9 IOPS scaling within a single DAC
For this test, we utilise the same IO-500 ior-easy benchmarks as we used for the bandwidth tests.
However, we limit the IOR transfer size to 4KiB. We perform this test both with O_DIRECT on and
off to also see a worst-case performance without any page-cache buffering or readahead.
For all ior-easy benchmarks, we configure the ior-easy directory in io500.sh as follows:
# 1 stripe, 16M stripe-size
lfs setstripe -c 1 -S 16M $io500_ior_easy
The ior-easy options used are, for O_DIRECT:
io500_ior_easy_size="5G"
io500_ior_easy_params="-vvv -a=POSIX --posix.odirect -t 4k -b ${io500_ior_easy_size} -F"
and for tests without O_DIRECT we use:
io500_ior_easy_size="5G"
io500_ior_easy_params="-vvv -a=POSIX -t 4k -b ${io500_ior_easy_size} -F"
Direct IO Disabled — Buffered IO / Readahead
Most applications do not utilise O_DIRECT and can make use of asynchronous writes and Lustre's client readahead to increase small-I/O performance. Small writes are written to the VFS page cache and flushed as bulk RPCs, whereas reads cause the Lustre client to prefetch file data into memory, so subsequent small sequential reads find the data already in memory without further RPCs to the server. Figure 15 below shows how such buffered 4K sequential I/O scales as we add additional OSTs within a single DAC, with 32 clients.
[Chart: IOR 4k transfer-size IOPS on a single DAC server; maximum 4k write and read kIOPS versus number of OSTs in the file system (2 to 12).]
Figure 15 Single DAC IOPS scaling with direct I/O off and 32 clients
Direct I/O Enabled
With Direct I/O enabled, all write and read calls cause a request over the network to the Lustre servers and bypass all caches, giving us a worst-case performance for such small I/O sizes. Figure 16 shows the results for 32 clients as we increase the number of OSTs in a single DAC server.
[Chart: IOR 4k transfer-size with O_DIRECT on a single DAC server; maximum 4k write and read kIOPS versus number of OSTs in the file system (2 to 12).]
Figure 16 ior-easy from 32 clients doing 4k sequential reads/writes with Direct IO enabled, as we increase the number of OSTs in a single DAC node
4.10 IOPS scaling across multiple DACs
This test is identical to the single DAC tests above, except that we now use multiple DAC servers, each configured with 12 OSTs. As before, we test with Direct IO both enabled and disabled.
Direct IO Disabled — Buffered IO / Readahead
Figure 17 shows the sequential read/write 4k IOPS achieved as we scale the number of DAC servers in the file system, where each DAC server contains 12 OSTs.
[Chart: IOR 4k transfer-size IOPS across multiple DAC servers; maximum 4k write and read kIOPS versus number of DAC servers (1 to 24).]
Figure 17 ior-easy from 32 clients doing 4k sequential reads/writes with Direct IO disabled, as we increase the number of DAC servers in the file system
Direct IO Enabled
Figure 18 shows the same test as Figure 17, but with Direct IO enabled: the sequential read/write 4k IOPS achieved as we scale the number of DAC servers in the file system, where each DAC server contains 12 OSTs.
[Chart: IOR 4k transfer-size with O_DIRECT across multiple DAC servers; maximum 4k write and read kIOPS versus number of DAC servers (1 to 24).]
Figure 18 ior-easy from 32 clients doing 4k sequential reads/writes with Direct IO enabled, as we increase the number of DAC servers in the file system
4.11 IOPS limit of a single client node
Here we probe the maximum IOPS achievable from a single client to a single DAC node (12 OSTs), performing the same tests as above with Direct I/O enabled and disabled.
4k Sequential IOPS              Write kIOPS    Read kIOPS
Direct IO Enabled                        74            69
Direct IO Disabled / Buffered          1300          1350
4.12 IOPS — comparison to HDD Lustre system
As described above in the bandwidth test section, our HDD-based Lustre file systems usually consist of 36 OSTs, where each OST is a 10-disk RAID6 array and the OSTs are spread across 6 OSS servers.
The values below were the maximum IOPS values achieved under the same ior-easy tests as used
above on the DAC, with 32 client nodes. As before, we limit the ior transfer size to 4KiB and measure
the sequential read/write IOPS values.
4k Sequential IOPS              Write kIOPS    Read kIOPS
Direct IO Enabled                        16           370
Direct IO Disabled / Buffered          2300          1700
5 Application use cases
Here we undertake an initial investigation of the real-life HPC application performance benefits obtained from using the DAC. We have chosen three use cases: first, a traditional large-scale, long-running HPC simulation that uses regular checkpointing; second, a large-scale structural biology cryo-EM workload (using the Relion application); and third, a radio astronomy SKA prototype application. We expect many more use cases across machine learning, engineering, computational chemistry and bioinformatics to emerge as further testing is undertaken.
5.1 Large checkpoint restart and simulation output data
As mentioned in the introduction, the concept of the DAC is a generalisation of the "Burst Buffer" solutions that were built to support large-scale checkpointing of long-running HPC simulations. The requirement here is to dump the resident memory of a multi-node application at regular intervals. For the CSD3-skylake cluster, each node comprises 384 GB of DRAM, totalling around 150 TB across the cluster.
To read this amount in from the main Lustre file system and write it out again for a checkpoint would take approximately 16 hours, a prohibitively long time given the nature of the simulation. Using the DAC, the same operation takes around 12 minutes. This has important consequences for the core hours consumed by the simulation: it reduces the cost from 344,000 core hours to 147,000 core hours. This has significant monetary value, with the cost of the DAC being recovered after just one year of this class of usage.
5.2 Relion — Cryo-EM structural biology
The field of cryogenic electron microscopy (Cryo-EM) is currently attracting a great deal of interest due to dramatic increases in resolution and the possibilities this offers to the scientists using these techniques. Relion, now at version 3 with significant improvements in computational performance, is one of the key applications for analysing and processing these large data sets. Greater resolution brings challenges: the volume of data ingested from such instruments increases dramatically, and the compute requirements for processing and analysing this data explode.
The Relion refinement pipeline is an iterative process that performs multiple iterations over the same
data to find the best structure. As the total volume of data can be tens of terabytes in size, this is
beyond the memory capacity of almost all current-generation computers and thus, the data must
be repeatedly read from the file system. Therefore, coupled with the various optimisations made to the compute performance of the application, the bottleneck in application performance moves to the I/O.
A recent challenging test case produced by Cambridge research staff has a size of 20TB. The I/O
time for this test case on the Cumulus traditional Lustre file system versus the new NVMe DAC
reduces I/O wait times from over an hour to just a couple of minutes. This has an immediate impact
as it reduces the amount of time biological samples remain in situ, in the instrument, increasing the
overall throughput of the service.
5.3 Square Kilometre Array — Science Data Processor benchmark
As part of the work to prototype the SKA's SDP component, the DAC was used to provide a secondary, larger-scale prototyping environment to assess the potential performance of the "Buffer component" of the Science Data Processor architecture, which provides a pivotal storage resource for intermediate (hot buffer) and final (cold buffer) science data products. The performance requirements for the hot buffer are anticipated to be on the order of 4 GB/s per compute node across a cluster of around 3,000 nodes. Initial prototyping was performed on an OpenStack bare-metal cluster supporting a number of execution frameworks together with a networked NVMe storage appliance. For this exercise, a Slurm service was made available, and the user was able to move easily on to the Cumulus-skylake cluster and exploit the DAC, demonstrating the scalability of the prototype application without impacting other users.
6 Discussion
We present here a generalisation of the 'Burst Buffer' concept: a Data Accelerator based on commodity hardware components sourced from Dell EMC and Intel, together with an open source orchestration layer, developed at the University of Cambridge, that seamlessly integrates with the widely used Slurm job scheduler. The orchestrator creates a Lustre parallel file system on the DAC on a per-job basis, as directed by the job submission command, with the size, performance capability and stage-in/stage-out routes required by the job.
The solution delivers a large fraction of the theoretical NVMe SSD performance to a single job in a completely transparent and deterministic fashion, delivering 100x the job performance of traditional spinning-disk Lustre file systems. The solution was tested with synthetic benchmarks and three HPC applications, and shown to deliver significant application performance advantages.
The synthetic benchmark data collected shows that more than 90% of the raw write performance is delivered to remote clients, while read efficiency was 58% of the raw drive performance, due to the limitation of available network bandwidth. Furthermore, unlike traditional spinning-disk central file systems, a large fraction of this total performance can be delivered to a single HPC job.
This breakthrough performance puts the Cambridge DAC prototype at number 1 in the IO-500 in the
June 2019 submission. This makes the solution the fastest HPC storage system in the world at an
unrivalled price point compared to competitive proprietary solutions.
When designing a DAC commodity storage solution there are a wide range of server, storage and
networking elements available and a wide range of configuration options. The design choices made
here were intended to produce a DAC appliance with the highest and most balanced performance
for the lowest cost, with performance prioritised over capacity. Thus, medium-sized 1.6 TB NVMe
drives were used, with 12 drives per server helping to balance the overall R/W performance, since NVMe drives tend to offer more read performance than write. The 12-drive solution was a compromise that yielded an overall raw R/W performance per DAC of 36 / 15 GB/s. This is limited by the two-OPA-card network solution to an achievable R/W bandwidth out of the DAC of 22 / 15 GB/s, which explains why the read efficiency is lower than the write efficiency: read bandwidth is capped by the OPA cards while write is not, and we over-provision read performance in order to bring the write performance up. In any case, the PCIe switches in the R740xd server max out at around 27.21 GB/s, so even if another OPA card were added, only an additional 5.2 GB/s of read bandwidth could be realised.
A more balanced system could also be built using Intel Optane drives, but current market pricing makes this a far more expensive option; it is more cost-effective to over-provision read bandwidth in order to increase relative write performance. It would also be possible to add more NVMe drives to achieve equal R/W performance: within the Dell EMC R740xd, it is possible to add twice as many drives at 2 or 3 times the capacity, enabling a broad range of capacity-versus-performance solutions to be designed. With new PCIe Gen4 servers on the horizon and the increasing availability of 200 Gb/s HPC networking products, future DAC designs with even higher performance per DAC will soon be possible, offering even better price points for performance and capacity targets.
The DAC appliance is built using cloud-native infrastructure-as-code methodologies with Ansible. All the scripts and software documentation, along with a detailed "How-To" covering Ansible and the golang tuning and build of the DAC, are freely available in the DAC code repository (reference 5). Combined with the DAC's integration with Slurm, this means that, for the first time, extremely high and fully deterministic I/O bandwidth, IOPS and metadata performance is readily available to the wider HPC community. Importantly, this no longer requires expensive proprietary solutions but can be easily constructed from commodity server and storage technology and open source software.
Solutions like the DAC will help drive uptake of new SSD-based storage solutions in HPC and propel the next wave of data-intensive application development needed to meet the demands of the data-centric world we now live in. The fields of astronomy, physics and medicine are generating vast and
rapidly increasing data sets which require ever more complex analytical techniques such as Hadoop,
Spark and machine learning. If HPC systems are going to keep pace with these workflows, hybrid
multi-tiered storage solutions will be needed, with seamless, automatic data movement. The solution
presented here allows for just this type of solution to be explored and further developed and tested
with a wide range of applications and workflows.
Future DAC development work includes implementing BeeGFS as an alternative parallel file system to Lustre, and a direct-attached NVMe storage pool exposed to HPC nodes within the cluster via NVMe over Fabrics, again under the control of the Slurm scheduler. In addition, extensive I/O telemetry functionality is being investigated that will report on the consumption of I/O resources at a per-job level from both the server and the client. We will also test a wider range of HPC applications, profiling the I/O activity and performance increase to build a better understanding of how best to utilise tiered storage solutions combining NVMe and traditional spinning disk.
7 References
1 https://slurm.schedmd.com/documentation.html
2 https://www.dellemc.com/resources/en-us/asset/customer-profiles-case-studies/products/servers/cambridge_case_study.pdf
3 https://www.stackhpc.com/cluster-as-a-service.html
4 https://www.vi4io.org/
5 https://github.com/RSE-Cambridge/data-acc
6 https://www.lanl.gov/projects/trinity/about.php
7 https://phys.org/news/2012-03-sdsc-gordon-supercomputer-ready.html
8 https://slurm.schedmd.com/burst_buffer.html
9 https://www.nersc.gov/users/computational-systems/cori/burst-buffer/
10 https://ark.intel.com/content/www/us/en/ark/products/97005/intel-ssd-dc-p4600-series-1-6tb-2-5in-pcie-3-1-x4-3d1-tlc.html
11 https://jira.whamcloud.com/browse/LU-12506
12 https://jira.whamcloud.com/browse/LU-12385
13 https://github.com/hpc/ior
14 https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Performance_Tuning_UG_H93143_v13_0.pdf
15 http://wiki.lustre.org/Installing_the_Lustre_Software
The information in this publication is provided "as is." Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license.
Copyright ©2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC are trademarks of Dell Inc. and subsidiaries in the United States and other countries. Other trademarks may be trademarks of their respective owners. Dell Corporation Limited. Registered in England. Reg. No. 02081369 Dell House, The Boulevard, Cain Road, Bracknell, Berkshire, RG12 1LF, UK.
Intel and the Intel Logo are trademarks of Intel Corporation and subsidiaries in the U.S. and/or other countries.