Dell | Terascala HPC Storage Solution (DT-HSS3)
A Dell Technical White Paper
David Duncan
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR
IMPLIED WARRANTIES OF ANY KIND.
© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without
the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.
Other trademarks and trade names may be used in this document to refer to either the entities
claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in
trademarks and trade names other than its own.
Contents
Figures .................................................................................................................... 2
Tables ..................................................................................................................... 3
Introduction ............................................................................................................. 4
Lustre Object Storage Attributes .................................................................................... 4
Dell | Terascala HPC Storage Solution Description ............................................................... 6
Metadata Server ...................................................................................................... 6
Object Storage Server Pair ......................................................................................... 7
Scalability ............................................................................................................. 8
Networking .......................................................................................................... 10
Managing the Dell | Terascala HPC Storage Solution ........................................................... 10
Performance Studies ................................................................................................. 11
N-to-N Sequential Reads / Writes ............................................................................... 14
Random Reads and Writes ........................................................................................ 15
IOR N-to-1 Reads and Writes ..................................................................................... 16
Metadata Testing .................................................................................................. 17
Conclusions ............................................................................................................ 21
Appendix A: Benchmark Command Reference................................................................... 22
Appendix B: Terascala Lustre Kit .................................................................................. 23
References ............................................................................................................. 24
Figures
Figure 1: Lustre Overview ............................................................................................... 5
Figure 2: Dell PowerEdge R710 ......................................................................................... 6
Figure 3: Metadata Server Pair ........................................................................................ 6
Figure 4: OSS Pair ........................................................................................................ 7
Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration ............................................ 8
Figure 6: MDS Cable Configuration ..................................................................................... 9
Figure 7: OSS Scalability ................................................................................................. 9
Figure 8: Cable Diagram for OSS ...................................................................................... 10
Figure 9: Terascala Management Console ........................................................................... 11
Figure 10: Test Cluster Configuration ................................................................................ 13
Figure 11: Sequential Reads / Writes DT-HSS3 ..................................................................... 15
Figure 12: N-to-N Random reads and writes ........................................................................ 16
Figure 13: N-to-1 Read / Write ....................................................................................... 17
Figure 14: Metadata create comparison ............................................................................. 18
Figure 15: Metadata File / Directory Stat ........................................................................... 19
Figure 16: Metadata File / Directory Remove ...................................................................... 20
Tables
Table 1: Test Cluster Details .......................................................................................... 12
Table 2: DT-HSS3 Configuration ...................................................................................... 14
Table 3: Metadata Server Configuration ............................................................................ 17
Introduction
In high-performance computing (HPC), efficient delivery of data to and from compute nodes is critical. Given the rate at which research in HPC systems consumes data, storage is fast becoming a bottleneck. Managing complex storage systems is another growing burden on storage administrators and researchers. Data requirements are increasing rapidly, in terms of both performance and capacity, and ways to increase the throughput and scalability of the storage devices supporting the compute nodes typically require a great deal of planning and configuration.
The Dell | Terascala HPC Storage Solution (DT-HSS) is designed for researchers, universities, HPC users, and enterprises looking to deploy a fully supported, easy-to-use, scale-out, and cost-effective parallel file system storage solution. DT-HSS is a new scale-out storage solution that provides high-throughput storage as an appliance. Utilizing an intelligent yet intuitive management interface, the solution simplifies managing and monitoring all of the hardware and file system components. It is easy to scale in capacity, performance, or both, providing a path for future growth. The storage appliance uses Lustre®, the leading open source parallel file system.
The storage solution is delivered as a pre-configured, ready-to-run appliance and is available with full hardware and software support from Dell and Terascala. Utilizing enterprise-ready Dell PowerEdge™ servers and PowerVault™ storage products, the latest-generation Dell | Terascala HPC Storage Solution, referred to as DT-HSS3 in the rest of this paper, delivers a strong combination of performance, reliability, ease of use, and cost-effectiveness.
Because of the changes made in this third generation of the Dell | Terascala High-Performance Computing Storage Solution, it is important to evaluate the solution's performance to determine what should be expected. This paper describes this generation of the solution and outlines some of its performance characteristics.
Many details of the Lustre filesystem and its integration with Platform Computing's Platform Cluster Manager (PCM) remain unchanged from the previous-generation solution, DT-HSS2. Readers may refer to Appendix B on Page 23 for instructions on integrating with PCM.
Lustre Object Storage Attributes
Lustre is a clustered file system, offering high performance through parallel access and distributed
locking. In the DT-HSS family, access to the storage is provided via a single namespace. This single
namespace is easily mounted from PCM and managed through the Terascala Management Console.
A parallel file system such as Lustre delivers its performance and scalability by distributing (called
“striping”) data across multiple access points, allowing multiple compute engines to access data
simultaneously. A Lustre installation consists of three key systems: the metadata system, the object
storage system, and the compute clients.
The metadata system is comprised of the Metadata Target (MDT) and the Metadata Server (MDS). The
MDT stores all metadata for the file system including file names, permissions, time stamps, and where
the data objects are stored within the object storage system. The MDS is the dedicated server that
manages the MDT. For the Lustre version used in our testing, there is an active/passive pair of MDS
servers providing a highly available metadata service to the cluster.
The object storage system is comprised of the Object Storage Target (OST) and the Object Storage
Server (OSS). The OST provides storage for file object data, while the OSS manages one or more OSTs.
There are typically several active OSSs at any time. Lustre delivers increased throughput as the number of active OSSs grows: each additional OSS increases the maximum network throughput and storage capacity. Figure 1 shows the relationship of the OSS and OST components of the Lustre filesystem.
Figure 1: Lustre Overview
The Lustre client software is installed on the compute nodes to provide access to data stored in the Lustre clustered filesystem. To the clients, the filesystem appears as a single branch of the filesystem tree, which makes a simple starting point for application data access and ensures access via tools native to the operating system for ease of administration.
Lustre also includes a sophisticated storage networking protocol layer, LNet, that can leverage features of certain network types; the DT-HSS3 utilizes Infiniband in support of LNet's capabilities. When LNet runs over Infiniband, Lustre can take advantage of the RDMA capabilities of the Infiniband fabric to provide more rapid I/O transport than can be achieved by typical networking protocols.
The elements of the Lustre clustered file system are as follows:
Metadata Target (MDT) – Tracks the location of “stripes” of data
Object Storage Target (OST) – Stores the stripes or extents on a filesystem
Lustre Clients – Access the MDS to determine where files are located, then access the OSSs to read and write data
Typically, Lustre deployments and configurations are considered complex and time consuming. Lustre installation and administration is done via a command-line interface, which can deter a systems administrator unfamiliar with Lustre from attempting an installation, possibly preventing the organization from experiencing the benefits of a clustered filesystem. The Dell | Terascala HPC Storage Solution removes the complexities of installation and minimizes Lustre deployment and configuration time. This decrease in time and effort leaves more opportunity for filesystem testing and speeds general preparation for production use.
Dell | Terascala HPC Storage Solution Description
The DT-HSS3 solution provides a pre-configured storage solution consisting of Lustre Metadata Servers
(MDS), Lustre Object Storage Servers (OSS), and the associated storage. The application software images have been modified to support the PowerEdge R710 as a standards-based replacement for the custom-fabricated servers previously used as the Object Storage and Metadata Servers. This substitution, shown in Figure 2, significantly enhances the performance and serviceability of these solution components while decreasing the overall complexity of the solution.
Figure 2: Dell PowerEdge R710
Metadata Server Pair
In the latest DT-HSS3 solution, the Metadata Server (MDS) pair is comprised of two Dell PowerEdge
R710 servers configured as an active/passive cluster. Each server is directly attached to a Dell
PowerVault MD3220 storage array housing the Lustre MDT. The MD3220 is fully populated with 24 500GB 7.2K RPM 2.5” near-line SAS drives configured as a RAID 10 for a total available storage of 6TB. In this single Metadata Target (MDT), the DT-HSS3 provides 4.8TB of formatted space for filesystem metadata. The MDS is responsible for handling file and directory requests and routing tasks to the appropriate Object Storage Targets (OSTs) for fulfillment. With this single MDT, the maximum number of files served will be in excess of 1.45 billion. Storage requests are handled across a single 40Gb QDR Infiniband connection via LNet.
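As a quick sanity check on the RAID 10 arithmetic above (the 4.8TB formatted figure and the 1.45-billion-file limit come from the product configuration, not from this calculation):

```shell
# Sanity check of the MDT capacity figures above (sizes in GB).
drives=24; size_gb=500
raw=$((drives * size_gb))      # total raw capacity of the MD3220
usable=$((raw / 2))            # RAID 10 mirrors every drive, halving capacity
echo "raw=${raw}GB usable=${usable}GB"   # prints raw=12000GB usable=6000GB
```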
Figure 3: Metadata Server Pair
Object Storage Server Pair
Figure 4: OSS Pair
The hardware improvements also extend to the Object Storage Servers (OSS). The previous generation of the OSS was built on a custom-fabricated server; the PowerEdge R710 is now the standard server for this solution. In the DT-HSS3, Object Storage Servers are arranged in two-node high availability (HA) clusters providing active/active access to two Dell PowerVault MD3200 storage arrays. Each MD3200 array is fully populated with 12 2TB 3.5” 7.2K RPM near-line SAS drives. The capacity of each MD3200 array can be extended with up to seven additional MD1200s. Each OSS pair provides raw storage capacity ranging from 48TB up to 384TB.
Object Storage Servers (OSSs) are the building blocks of the DT-HSS solutions. With two 6Gb SAS controllers in each PowerEdge R710, the two servers are redundantly connected to each of the two MD3200 storage arrays. The MD3200 storage arrays are extended with SAS-attached MD1200 enclosures to provide additional capacity.
The 12 7.2K RPM 2TB near-line SAS drives in each PowerVault MD3200 or MD1200 enclosure provide a total of 24TB of raw storage capacity. The storage is divided equally into two RAID 5 (five data and one parity disk) virtual disks per enclosure to yield two Object Storage Targets (OSTs) per enclosure. Each OST provides 9TB of formatted object storage space. The OSS pair is expanded by four OSTs with each stage of increase. The OSTs are connected to LNet via 40Gb QDR Infiniband connections.
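The per-enclosure arithmetic can be checked as follows (raw capacities only; the 9TB formatted figure per OST reflects filesystem overhead on the five 2TB data disks):

```shell
# Sanity check of the per-enclosure OST math above (sizes in TB).
drives=12; size_tb=2
raw_tb=$((drives * size_tb))        # 24TB raw per MD3200/MD1200 enclosure
per_ost_data=$((5 * size_tb))       # each RAID 5 OST has five data disks
echo "raw=${raw_tb}TB per_ost_raw=${per_ost_data}TB"   # prints raw=24TB per_ost_raw=10TB
```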
When viewed from any compute node equipped with the Lustre client, the entire namespace can be reviewed and managed like any other filesystem, but with enhancements for Lustre management.
Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration
Scalability
Providing the Object Storage Servers in active/active cluster configurations yields greater throughput and reliability. This is also the recommended configuration for decreasing maintenance requirements and, consequently, potential downtime.
The PowerEdge servers provide greater performance and simplify maintenance by eliminating specialty hardware. This third-generation solution continues to scale from 48TB up to 384TB of raw storage per OSS pair. The DT-HSS3 solution leverages QDR Infiniband interconnects for high-speed, low-latency storage transactions. The third-generation upgrade to the PowerEdge R710 takes advantage of the PCIe Gen2 interface for QDR Infiniband, helping achieve higher network throughput per OSS. The Terascala Management Console offers the same level of management and monitoring as the previous generation, and the Terascala Lustre Kit for PCM 2.0.1 remains unchanged at version 1.1. This combination of components, from storage to client access, is formally offered as the DT-HSS3 appliance. An example 96TB configuration with the PowerEdge R710 is shown in Figure 5.
Figure 6: MDS Cable Configuration
Figure 7: OSS Scalability
Scaling the DT-HSS3 is achieved in two ways. The first method, demonstrated in Figure 8, involves simply expanding the storage capacity of a single OSS pair by adding MD1200 storage arrays in pairs. This increases the volume of available storage while maintaining a consistent maximum network throughput. As shown in Figure 7, the second method adds OSS pairs, increasing both the total network throughput and the storage capacity at once. See Figure 8 for detail of the cable configuration and of expanding the DT-HSS3 storage by increasing the number of storage arrays.
Figure 8: Cable Diagram for OSS
Networking
Management Network
The private management network provides a communication infrastructure for Lustre and Lustre HA
functionality as well as storage configuration and maintenance. This network creates the segmentation
required to facilitate day-to-day operations and to limit the scope of troubleshooting and maintenance.
Access to this network is provided via the Terascala Management Server, the TMS2000.
The TMS2000 provides a single communications port for the external management network, and all user-level access is via this device. The TMS2000 is responsible for user interaction as well as systems health management and monitoring. While the TMS2000 is responsible for data collection and management, it does not play an active role in the Lustre filesystem and can be serviced without requiring filesystem downtime. The TMS2000 presents the collected data and provides management via an interactive GUI. Users interact with the TMS2000 through the Terascala Management Console.
Data Network
The Lustre filesystem is served via a preconfigured LNet implementation on Infiniband, which delivers fast transfer speeds with low latency. LNet leverages RDMA for rapid file and metadata transfer from MDTs and OSTs to the clients. The OSS and MDS servers take advantage of the full QDR Infiniband fabric with single-port ConnectX-2 40Gb adapters. The QDR Infiniband controllers can be easily integrated into existing DDR networks using QDR-to-DDR crossover cables if necessary.
Managing the Dell | Terascala HPC Storage Solution
The Terascala Management Console (TMC) takes the complexity out of administering a Lustre file
system by providing a centralized GUI for management purposes. The TMC can be used as a tool to
standardize the following actions: mounting and unmounting the file system, initiating failover of the
file system from one node to another, and monitoring performance of the file system and the status of
its components. Figure 9 illustrates the TMC main interface.
The TMC is a Java-based application that can be run from any computer with a Java JRE to remotely manage the solution (assuming all security requirements are met). It provides a comprehensive view of
the hardware and filesystem, while allowing monitoring and management of the DT-HSS solution.
Figure 9 shows the initial window view of a DT-HSS3 system. In the left pane of the window are all the
key hardware elements of the system. Each element can be selected to drill down for additional
information. In the center pane is a view of the system from the Lustre perspective, showing the status
of the MDS and various OSS nodes. In the right pane is a message window that highlights any conditions
or status changes. The bottom pane displays a view of the PowerVault storage arrays.
Using the TMC, many tasks that once required complex CLI commands can now be completed easily with a few mouse clicks. The TMC can be used to shut down a file system, initiate a failover from one MDS to another, monitor the MD32XX arrays, and so on.
Figure 9: Terascala Management Console
Performance Studies
The goal of the performance studies in this paper is to profile the capabilities of a selected DT-HSS3 configuration: identifying points of peak performance and the most appropriate methods for scaling to a variety of use cases. The test bed is a Dell HPC compute cluster with the configuration described in Table 1.
A number of performance studies were executed, stressing a DT-HSS3 192TB configuration with
different types of workloads to determine the limitations of performance and define the sustainability
of that performance.
Table 1: Test Cluster Details
Component Description
Compute Nodes: Dell PowerEdge M610, 16 each in 4 PowerEdge M1000e chassis
Server BIOS: 3.0.0
Processors: Intel Xeon X5650
Memory: 6 x 4GB 1333 MHz RDIMMs
Interconnect: Infiniband - Mellanox MTH MT26428 [ConnectX IB QDR, Gen-2 PCIe]
IB Switch: Mellanox 3601Q QDR blade chassis I/O switch module
Cluster Suite: Platform PCM 2.0.1
OS: Red Hat Enterprise Linux 5.5 x86_64 (2.6.18-194.el5)
IB Stack: OFED 1.5.1
Testing focused on three key performance markers:
Throughput, data transferred in MiB/s
I/O Operations per second (IOPS), and
Metadata Operations per second
The goal is a broad but accurate review of the computing platform.
Two file-access methods are used in these benchmarks. The first is N-to-N, where every thread of the benchmark (N clients) writes to a different file (N files) on the storage system. IOzone and IOR can both be configured to use the N-to-N file-access method. The second is N-to-1, where every thread writes to the same file (N clients, 1 file). IOR can use MPI-IO, HDF5, or POSIX to run N-to-1 file-access tests. N-to-1 testing determines how the file system handles the overhead introduced when multiple clients (threads) write concurrently to the same file; the overhead comes from threads contending with Lustre's file locking and serialized writes. See Appendix A for examples of the commands used to run these benchmarks. The number of threads or tasks chosen for each test is based on the test cluster configuration. Up to 64 threads, the thread count corresponds to the number of hosts, with one thread per host. Totals above 64 are achieved by increasing the number of threads evenly across all clients. IOzone limits the number of threads to 255, so at the final IOzone data point one client runs three threads rather than the four run by the other 63 clients.
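A minimal sketch of that thread schedule, assuming threads above 64 are spread as evenly as possible across the 64 client hosts:

```shell
# Sketch of the benchmark thread schedule described above.
# 64 client hosts; IOzone caps the total thread count at 255.
hosts=64
for total in 1 2 4 8 16 32 64 128 255; do
  if [ "$total" -le "$hosts" ]; then
    per_host=1                                     # one thread per host up to 64
  else
    per_host=$(( (total + hosts - 1) / hosts ))    # ceiling of threads per host
  fi
  echo "total=$total per_host<=$per_host"
done
```

At 255 total threads the ceiling is 4, matching the description: 63 clients run 4 threads and one runs 3.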
Figure 10 shows a diagram of the cluster configuration used for this study.
Figure 10: Test Cluster Configuration
[Figure 10 diagram: the head node and compute nodes / Lustre clients housed in four PowerEdge M1000e chassis with M3601Q Infiniband switch modules, connected over a QDR Infiniband network through an external 36-port switch; a separate Ethernet management network links the TMS2000, the MDS servers, and the OSS pair, with MD1200s for additional storage.]
The test environment for the DT-HSS3 has a single MDS pair and a single OSS pair, with a total of 192TB of raw disk space for performance testing. The OSS pair contains two PowerEdge R710s, each with 24GB of memory, two 6Gbps SAS controllers, and a single Mellanox ConnectX-2 QDR HCA. Each MDS has 48GB of memory, a single 6Gbps SAS controller, and a single Mellanox ConnectX-2 QDR HCA.
The Infiniband fabric was configured as a 50% blocking fabric by populating Mezzanine slot B on all blades, using a single Infiniband switch in I/O module slot B1 of each chassis, and using one external 36-port switch. Eight external ports on each internal switch were connected to the external 36-port switch. Because the total throughput to and from any chassis exceeds the throughput to the OSS servers by 4:1, this fabric configuration exceeded the bandwidth necessary to fully utilize the OSS pair and MDS pair in all testing.
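The 50% blocking figure follows from the 16 blade ports on each chassis switch against the 8 uplinks described above; a one-line check:

```shell
# Check of the 50% blocking calculation (16 blade ports per chassis
# switch, 8 uplinks to the external switch, as described above).
blades=16; uplinks=8
echo "blocking=$(( 100 - uplinks * 100 / blades ))%"   # prints blocking=50%
```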
Table 2: DT-HSS3 Configuration
DT-HSS3 Configuration
Configuration Size 192TB Medium
Lustre Version 1.8.4
# of Drives 96 x 2TB NL SAS drives.
OSS Nodes 2 x PowerEdge R710 Servers with 24GB Memory
OSS Storage Array 2 x MD3200, each with 3 x MD1200
Drives in OSS Storage Arrays 96 3.5" 2TB 7200RPM Nearline SAS
MDS Nodes 2 x PowerEdge R710 Servers with 48GB Memory
MDS Storage Array 1 x MD3220
Drives in MDS Storage Array 24 2.5" 500GB 7200RPM Nearline SAS
Infiniband
PowerEdge R710 Servers Mellanox ConnectX-2 QDR HCA
Compute Nodes Mellanox ConnectX-2 QDR HCA - Mezzanine Card
External QDR IB Switch Mellanox 36 Port is5030
IB Switch Connectivity QDR Cables
All tests were performed with a cold cache established using the following technique. After each test, the filesystem is unmounted from all clients and then from the metadata server. Once the metadata server is unmounted, a sync is performed and the kernel is instructed to drop its caches with the following commands:
sync
echo 3 > /proc/sys/vm/drop_caches
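The whole cold-cache procedure can be sketched as a small helper function; the mount points, hostnames, and ssh-based orchestration here are illustrative assumptions, not part of the product:

```shell
# Illustrative sketch of the cold-cache reset described above.
# Mount points (/mnt/lustre, /mnt/mdt) and ssh orchestration are
# assumptions for this sketch, not the product's actual procedure.
drop_cold_cache() {
  mds=$1; shift
  for client in "$@"; do
    ssh "$client" "umount /mnt/lustre"        # unmount from every client first
  done
  ssh "$mds" "umount /mnt/mdt"                # then unmount the metadata target
  ssh "$mds" "sync; echo 3 > /proc/sys/vm/drop_caches"   # flush, drop caches
}
# Example (would contact real hosts): drop_cold_cache mds1 n001 n002
```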
To further thwart any client-cache effects, the compute node responsible for writing a sequential file was never the one used to read it back. A file size of 64GB was chosen for sequential writes, 25% greater than the total combined memory of the PowerEdge R710s, so all sequential reads and writes have an aggregate sample size of 64GB multiplied by the number of threads, up to 64GB * 255 threads, or greater than 15TB. The block size for IOzone was set to 1MB to match the 1MB request size.
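A quick check of the aggregate sample-size arithmetic:

```shell
# Check of the maximum aggregate sequential sample size quoted above.
file_gb=64; max_threads=255
total_gb=$((file_gb * max_threads))
echo "total=${total_gb}GB (~$((total_gb / 1024))TB)"   # prints total=16320GB (~15TB)
```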
In measuring the performance of the DT-HSS3, all tests were performed within similar environments.
The filesystem was configured to be fully functional and the targets tested were emptied of files and
directories prior to each test.
N-to-N Sequential Reads / Writes
Sequential testing was done with the IOzone testing tool, version 3.347; throughput results are reported by IOzone in KB/sec but are converted to MiB/sec here for consistency. N-to-N testing generates a single file for each process thread spawned by IOzone. A 64GB file size was determined to be appropriate for testing, as each individual file written is large enough to saturate the total available cache on each OSS/MD3200 pair. Files written were distributed across the
client LUNs evenly, to prevent an uneven I/O load on any single SAS connection, in the same way that a user would be expected to balance a workload.
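As a hedged illustration only (the actual commands are listed in Appendix A; the flags and the clientlist file below are assumptions), an N-to-N IOzone run of this shape could be generated as:

```shell
# Dry-run sketch of a hypothetical N-to-N IOzone invocation of the shape
# described above: 1MB records (-r), one 64GB file per thread (-s), the
# thread count (-t), and a per-thread host/file mapping (-+m clientlist).
iozone_nn() {
  threads=$1
  echo "iozone -i 0 -i 1 -c -e -w -r 1m -s 64g -t ${threads} -+m ./clientlist"
}
iozone_nn 16   # prints the command line for a 16-thread run
```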
Figure 11: Sequential Reads / Writes DT-HSS3
Figure 11 demonstrates the performance of the DT-HSS3 192TB test configuration. Write performance peaks at 4.1GB/sec at sixteen concurrent threads and continues to exceed 3.0GB/sec while each of the 64 clients writes up to four files at a time. With four files per client, sixteen files are being written simultaneously on each OST. Read performance exceeds 6.2GB/sec for sixteen simultaneous processes and continues to exceed 3.5GB/sec while four threads per compute node access individual OSTs. Write and read performance rises steadily as the number of process threads increases up to sixteen; this is the result of increasing the number of OSTs utilized along with the number of processes and compute nodes. When the client-to-OST ratio exceeds one, write performance is evenly distributed between the multiple files on each OST, consequently increasing seek and Lustre callback time. By careful crafting of the IOzone hosts file, each added thread is balanced between blade chassis, compute nodes, and SAS controllers while adding a single file to each of the OSTs, allowing a consistent increase in the number of files written while creating minimal locking contention between the files. As the number of files written increases beyond one per OST, throughput declines, most likely because of the larger number of requests per OST: throughput drops as the number of clients making requests to each OST increases, and positional delays and other overhead are expected as client requests become more random. To maintain higher throughput for a greater number of files, increasing the number of OSTs should help. A review of storage array performance from the Dell PowerVault Modular Disk Storage Manager or the SMcli performance monitor independently confirmed the throughput values produced by the benchmarking tools.
Random Reads and Writes
The same tool, IOzone, was used to gather random reads and writes metrics. In this case, we have
chosen to write a total of 64GB during any one execution. This total 64GB size is divided amongst the
threads evenly. The IOzone hosts file is arranged to distribute the workload evenly across the compute nodes. The storage is addressed as a single volume with an OST count of 16 and a stripe size of 4MB. Because performance is measured in operations per second (OPS), a 4KB request size is used: it aligns with this Lustre version's 4KB filesystem block size.
Figure 12 shows that random writes saturate at sixteen threads, at around 5,200 IOPS. At sixteen threads, the client-to-OST ratio is one; as the number of IOzone threads increases, this ratio rises to a maximum of 4:1 at 64 threads. Reads increase rapidly from sixteen to 32 threads and then continue to increase at a relatively steady rate. As writes require a file lock per OST accessed, saturation is not unexpected. Reads take advantage of Lustre's ability to grant overlapping read extent locks for part or all of a file. Increasing the number of disks in the single OSS pair, or adding OSS pairs, can increase throughput.
Figure 12: N-to-N Random reads and writes.
IOR N-to-1 Reads and Writes
Performance of the DT-HSS3 with reads and writes to a single file was evaluated with the IOR
benchmarking tool. IOR is written to accommodate MPI communications for parallel operations and has
support for manipulating Lustre striping. IOR allows several different IO interfaces for working with the
files, but this testing used the POSIX interface to exclude more advanced features and their associated
overhead. This gives us an opportunity to review the filesystem and hardware performance
independent of those additional enhancements; in addition, most applications have adopted a
"one-file-per-processor" approach to disk IO rather than using parallel IO APIs.
IOR benchmark version 2.10.3 was used in this study, and is available at
http://sourceforge.net/projects/ior-sio/. The MPI stack used for this study was Open MPI version
1.4.1, and was provided as part of the Mellanox OFED kit.
The configuration for writing included a directory set with striping characteristics designed to stripe
across all sixteen OSTs with a stripe chunk size of 4MB. The file to which all clients write is evenly
striped across all OSTs. The default Lustre request size is 1MB, but in this test a transfer size of 4MB
was used to match the stripe size.
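The striped-directory setup described above can be sketched with the lfs utility. The mount point, task count, and output file are hypothetical, the stripe-size flag follows the older lfs syntax of this Lustre generation, and the commands are echoed for review rather than executed on a live filesystem.

```shell
#!/bin/sh
# Dry-run sketch of the N-to-1 test directory setup (drop "echo" to run
# on a Lustre client). /mnt/lustre is a hypothetical mount point.
DIR=/mnt/lustre/ior_n1
echo mkdir -p $DIR
# stripe every new file across all 16 OSTs in 4MB chunks
echo lfs setstripe -c 16 -s 4m $DIR
# all ranks write one shared file with a 4MB transfer size matching the stripe
echo mpirun -np 32 --hostfile hosts IOR -a POSIX -i 2 -e -k -w -o $DIR/testfile -s 1 -b 4g -t 4m
```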
Figure 13: N-to-1 Read / Write
After setting the 4MB transfer size for IOR, reads have the advantage and peak at greater than 5.5GB
per second, a value close to the sequential N-to-N performance. Write performance peaks within 10%
of the sequential writes at 3.9GB per second, but as the client-to-OST ratio (CO) increases, the write
performance settles around a sustained 2.6GB per second. Reads are less affected since the read locks
for clients are allowed to overlap some or all of a given file.
Metadata Testing
Metadata testing measures the time to complete certain file and directory operations that return
attributes. Mdtest is an MPI-coordinated benchmark that performs create, stat, and delete operations
on files and directories; its output reports the rate at which those operations complete, in operations
per second. Because mdtest is MPI enabled, mpirun launches test threads from each included compute
node. Mdtest can also be configured to compare metadata performance for directories and files.
This study used mdtest version 1.8.3, which is available at http://sourceforge.net/projects/mdtest/.
The MPI stack used for this study was Open MPI version 1.4.1, and was provided as part of the Mellanox
OFED kit. The Metadata Server internal details, including processor and memory configuration, are
outlined in Table 3 for comparison.
Table 3: Metadata Server Configuration
Processor 2 x Intel Xeon E5620 2.4GHz, 12M Cache, Turbo
Memory 48GB Memory (12x4GB), 1333MHz Dual Ranked RDIMMs
Controller 2 x SAS 6Gbps HBA
PCI Riser Riser with 2 PCIe x8 + 2 PCIe x4 Slot
On a Lustre file system, the directory metadata operations are all performed on the MDT. The OSTs
are queried for object identifiers in order to allocate or locate extents associated with the metadata
operations. This interaction requires the indirect involvement of OSTs in most metadata operations. As
a result of this dependency, Lustre favors metadata operations on files with lower stripe counts: the
greater the stripe count, the more operations are required to complete the OST object allocations or
calculations.
The first test stripes files in 1MB stripes over all available OSTs. The next test repeats the same
operations on files confined to a single OST for comparison. The results for metadata operations on
single-OST files and on files striped evenly over all sixteen OSTs are very similar.
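A minimal sketch of preparing the two layouts being compared, assuming a hypothetical /mnt/lustre mount point; the stripe flags follow the older lfs syntax, and the commands are echoed rather than executed.

```shell
#!/bin/sh
# Dry-run sketch of the two mdtest target directories: single-OST stripes
# versus 1MB stripes over all sixteen OSTs (drop "echo" to execute).
echo lfs setstripe -c 1 -s 1m /mnt/lustre/md_1ost
echo lfs setstripe -c 16 -s 1m /mnt/lustre/md_16ost
# mdtest is then run against each directory in turn with the flags from Appendix A
echo mpirun -np 64 --hostfile hosts mdtest -d /mnt/lustre/md_16ost -i 6 -z 1 -L -F -C
```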
Figure 14: Metadata create comparison
Preallocating objects from sixteen OSTs or from a single OST does not seem to change the rate at
which files are created. The performance of the MDT and MDS is more important to the create
operation.
Figure 15: Metadata File / Directory Stat
The mdtest benchmark for stat uses the native system call, as opposed to a bash shell call, to retrieve
file and directory metadata. The directory stats seem to peak around 26,000 per second, but the
increased number of OSTs makes a significant difference in the number of calls handled at one task or
greater. If the files were striped across more than one OST, all of those OSTs would be contacted in
order to accurately calculate the file size. The file stats indicate that beyond 8 tasks the advantage
lies in distributing the requests across more than a single OST, but the increase is not large because
the requests to any one OST represent no more than a fraction of the data requested by the test.
Directory creates do not require the same draw from the prefetch pool of OST object IDs that file
creates do. The result is fewer operations to completion and therefore a faster time to completion.
Removing files requires less communication between the compute nodes (Lustre clients) and the MDS
than removing directories, because directory removal requires verification from the MDS that the
directory is empty prior to execution. At 32 simultaneous tasks, a break occurs in the file removal
results between the single-OST case and the sixteen-OST case. The metadata operations for
directories across multiple OSTs appear to gain a slight advantage after the number of tasks exceeds
32. For file removal, there is a definite increase in performance when commands are executed across
multiple OSTs.
Figure 16: Metadata File / Directory Remove
Conclusions
There is a well-known requirement for scalable, high-performance clustered filesystem solutions, and
there are many solutions for customized storage. The third generation of the Dell | Terascala HSS, the
DT-HSS3, addresses this need with the added benefit of standards-based, proven Dell PowerEdge and
PowerVault products and Lustre, the leading open source clustered filesystem. The Terascala
Management Console (TMC) unifies the enhancements and optimizations included in the DT-HSS3 into a
single control and monitoring panel for ease of use.
The latest generation Dell | Terascala HPC Storage Solution continues to offer the same advantages as
the previous generation solutions, but with greater processing power, more memory, and greater IO
bandwidth. Raw storage scaling from 48TB up to a total of 384TB per Object Storage Server Pair, and
up to 6.2GB/s of throughput in a packaged component, is consistent with the needs of high
performance computing environments. The DT-HSS3 is also capable of scaling horizontally as easily as
it scales vertically.
The performance studies demonstrate an increase in throughput for both reads and writes over
previous generations for N-to-N and N-to-1 file access. Results from mdtest show a generous increase
in metadata operation speed. With the PCIe 2.0 interface, controllers and HCAs will excel in both
high-IOPS and high-bandwidth applications.
The continued use of generally available, industry-standard tools like IOzone, IOR, and mdtest
provides an easy way to match current and expected growth with the performance outlined for the
DT-HSS3. The profiles reported by each of these tools provide sufficient information to align the
configuration of the DT-HSS3 with the requirements of many applications or groups of applications.
In short, the Dell | Terascala HPC Storage Solution delivers all the benefits of scale-out, parallel
filesystem-based storage for your high performance computing needs.
Appendix A: Benchmark Command Reference
This section describes the commands used to benchmark the Dell | Terascala HPC Storage Solution 3.

IOzone
IOzone Sequential Writes –
/root/iozone/iozone -i 0 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts
IOzone Sequential Reads –
/root/iozone/iozone -i 1 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts
IOzone IOPS Random Reads / Writes –
/root/iozone/iozone -i 2 -w -O -r 4k -s $SIZE -t $THREAD -I -+n -+m ./hosts
The O_Direct command line parameter ("-I") allows us to bypass the cache on the compute nodes
where the IOzone threads are running.

IOR
IOR Writes –
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -w -o $IORFILE -s 1 -b $SIZE -t 4m
IOR Reads –
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -r -o $IORFILE -s 1 -b $SIZE -t 4m

mdtest
Metadata Create Files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -C
Stat Files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -R -T
Remove Files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -r
Appendix B: Terascala Lustre Kit
PCM software uses the following terminology to describe software provisioning concepts:
1. Installer Node – Runs services such as DNS, DHCP, HTTP, TFTP, etc.
2. Components – RPMs
3. Kits – Collection of components
4. Repositories – Collection of kits
5. Node Groups – Allow association of software to a particular set of compute nodes

The following is an example of how to integrate a Dell | Terascala HPC Storage Solution into an
existing PCM cluster:
1. Access a list of existing repositories on the head node:
The key here is the Repo name: line.
2. Add the Terascala kit to the cluster:
3. Confirm the kit has been added:
4. Associate the kit to the compute node group on which the Lustre client should be installed:
a. Launch ngedit at the console: # ngedit
b. On the Node Group Editor screen, select the compute node group to add the Lustre client.
c. Select the Edit button at the bottom of the screen.
d. Accept all default settings until you reach the Components screen.
e. Using the down arrow, select the Terascala Lustre kit.
f. Expand and select the Terascala Lustre kit component.
g. Accept the default settings, and on the Summary of Changes screen, accept the changes and
push the packages out to the compute nodes.

On the front-end node, there is now a /root/terascala directory that contains sample directory setup
scripts. There is also a /home/apps-lustre directory that contains Lustre client configuration
parameters. This directory contains a file that the Lustre file system startup script uses to optimize
clients for Lustre operations.
References Dell | Terascala HPC Storage Solution Brief
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-hpcstorage-
solution-brief.pdf
Dell | Terascala HPC Storage Solution Whitepaper
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-dt-hss2.pdf
Dell PowerVault MD3200 / MD3220
http://www.dell.com/us/en/enterprise/storage/powervault-md3200/pd.aspx?refid=powervault-
md3200&s=biz&cs=555
Dell PowerVault MD1200
http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=555&l=en&oc=MLB1218&s=biz
Lustre Home Page
http://wiki.lustre.org/index.php/Main_Page
Transitioning to 6Gb/s SAS
http://www.dell.com/downloads/global/products/pvaul/en/6gb-sas-transition.pdf
Dell HPC Solutions Home Page
http://www.dell.com/hpc
Dell HPC Wiki
http://www.HPCatDell.com
Terascala Home Page
http://www.terascala.com
Platform Computing
http://www.platform.com
Mellanox Technologies
http://www.mellanox.com