GPFS in Today's HPC Processing Center
"This is not your father's GPFS"
Raymond L. Paden, Ph.D.
HPC Technical Architect
Deep Computing
3 June 2005
raypaden@us.ibm.com
713-940-1084
This presentation was produced in the United States. IBM may not offer the products, programs, services or features discussed herein in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the products, programs, services, and features available in your area. Any reference to an IBM product, program, service or feature is not intended to state or imply that only IBM's product, program, service or feature may be used. Any functionally equivalent product, program, service or feature that does not infringe on any of IBM's intellectual property rights may be used instead of the IBM product, program, service or feature.
Information in this presentation concerning non-IBM products was obtained from the suppliers of these products, published announcement material or other publicly available sources. Sources for non-IBM list prices and performance numbers are taken from publicly available information including D.H. Brown, vendor announcements, vendor WWW Home Pages, SPEC Home Page, GPC (Graphics Processing Council) Home Page and TPC (Transaction Processing Performance Council) Home Page. IBM has not tested these products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this presentation. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.
All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of a specific Statement of General Direction.
The information contained in this presentation has not been submitted to any formal IBM test and is distributed "AS IS". While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. The use of this information or the implementation of any techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. Customers attempting to adapt these techniques to their own environments do so at their own risk.
IBM is not responsible for printing errors in this presentation that result in pricing or information inaccuracies.
The information contained in this presentation represents the current views of IBM on the issues discussed as of the date of publication. IBM cannot guarantee the accuracy of any information presented after the date of publication.
IBM products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
Any performance data contained in this presentation was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements quoted in this presentation may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this presentation may have been estimated through extrapolation. Actual results may vary. Users of this presentation should verify the applicable data for their specific environment.
Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States and/or other countries.
UNIX is a registered trademark in the United States and other countries licensed exclusively through The Open Group.
LINUX is a registered trademark of Linus Torvalds. Intel and Pentium are registered trademarks and MMX, Itanium, Pentium II Xeon and Pentium III Xeon are trademarks of Intel Corporation in the United States and/or other countries.
Other company, product and service names may be trademarks or service marks of others.
Special Notices from IBM Legal
GPFS has matured significantly over the years since its version 1.x releases for AIX, yet many HPC practitioners do not fully appreciate GPFS's flexibility and ease of use today. This presentation examines these newer features, including multi-clustering, GPFS in a mixed AIX/Linux environment, its ability to work with a wider variety of disk vendors, and more. Sample configurations with benchmark results will be given. The presentation includes a cursory review of the GPFS roadmap.
Abstract
Some say people have evolved...
Some say people have evolved into something intelligent?!?!
But what about HPC systems?
How have they evolved?
A Common HPC Evolutionary Path
Mainframe: IBM mainframe with attached vector processor(s)
Vector Processor: Cray (with attached mainframe?)

A Common HPC Evolutionary Path
Proprietary Clusters: Room full of SPs
Mainframe: IBM mainframe with attached vector processor(s)
Vector Processor: Cray (with attached mainframe?)
After n steps...

A Common HPC Evolutionary Path
Proprietary Clusters: Room full of SPs
Then some renegade starts experimenting with Beowulf clusters...
Mainframe: IBM mainframe with attached vector processor(s)
Vector Processor: Cray (with attached mainframe?)
After n steps...
starting like this...

A Common HPC Evolutionary Path
Proprietary Clusters: Room full of SPs
Then some renegade starts experimenting with Beowulf clusters...
Mainframe: IBM mainframe with attached vector processor(s)
Vector Processor: Cray (with attached mainframe?)
After n steps...
starting like this... then evolving into this...

A Common HPC Evolutionary Path
Proprietary Clusters: Room full of SPs
Rack-mounted Linux Nodes: IBM Cluster 1350
Blades (using Linux): IBM BladeCenter
Proprietary SMP Clusters: Room full of IBM p690s
Beowulf cluster: Then some renegade starts experimenting with Beowulf clusters...
Mainframe: IBM mainframe with attached vector processor(s)
Vector Processor: Cray (with attached mainframe?)
After n steps...
What Next?
[Diagram: Cluster 1 nodes with the Cluster 1 file system on a Site 1 SAN, Cluster 2 nodes on a Site 2 SAN, and a visualization system on a Site 3 SAN, all joined by a global SAN interconnect; local and remote disk access paths are shown]
Rack-mounted Linux Nodes: IBM Cluster 1350
Blades (using Linux): IBM BladeCenter
Proprietary SMP Clusters: Room full of IBM p690s
... and everybody is talking about grids!
But Where Does Storage I/O Fit?
Storage I/O... the oft-forgotten stepchild
But Where Does Storage I/O Fit?
Early adopters of proprietary clusters (e.g., IBM SP) generally adopted vendor storage solutions (e.g., SSA and GPFS or JFS)
GPFS is NOT the same beast it used to be!
Early adopters of commodity clusters approached storage I/O with a potpourri of solutions (e.g., NFS)
There are alternatives to NFS
Customers trying to integrate proprietary and commodity systems often feel forced to use NFS
There are alternatives to NFS
And what about grids?
Let's take a closer look at this.
I will begin with the Linux cluster perspective.
I will get to the SP-to-pSeries perspective in a moment.
He who does not study history is predestined to relive it... errr, but is NFS really history?
Common First Step
For Something Small
For something like this, it is common to do NFS and FTP between servers over a GbE or 100 MbE network.
[Diagram:
sam: IntelliStation M Pro, 2 CPUs, 4 GB, Linux 2.4.21 (SuSE), one SCSI disk (sda); local /fs_sam, NFS mounts /fs_frodo and /fs_gandalf
frodo: p615, 2 CPUs, 4 GB, AIX 5.2, four SCSI disks (hdisk0-hdisk3); local /fs_frodo, NFS mounts /fs_sam and /fs_gandalf
gandalf: p615, 2 CPUs, 4 GB, AIX 5.2, four SCSI disks (hdisk0-hdisk3); local /fs_gandalf, NFS mounts /fs_sam and /fs_frodo
all three connected by Ethernet (GbE)]
Common Second Step
Make the Small Solution BIGGER
"Client" Nodes: e.g., x336 (Linux); access the NFS-mounted file system from the head node; internal SCSI used for local scratch
FTP files as necessary between clients
Head Node: e.g., x346 (Linux); file system based on internal storage, NFS-exported to the clients
Node Switch: generally an IP-based network (GbE, Myrinet, HPS, IB?)
Common Application Organization: use the IP network to distribute data via MPI, NFS and/or other "home-grown mid-layer" codes
works well for applications using minimal or no parallel I/O
do application developers (i.e., computational scientists) want to become computer scientists?
[Diagram: a head node with internal SCSI or SATA NFS-exports its local file system over a node switch to 32 client nodes]
Common Third Step
Create "Islands of Nodes" as You Get Even BIGGER
[Diagram: two "islands", each with a head node (internal SCSI or SATA) NFS-exporting its local file system over a node switch to 32 client nodes]
Higher Level Switch (LAN or WAN): clusters are connected via a hierarchical switching network
COMMENTS: Provides "any to any" connectivity
the poor man's way
Works well when the I/O model is not parallel, but may require aggregating files
ISLs can be bottlenecks
Inadequate NFS semantics (especially for parallel writes)
Poor I/O performance
Limited storage capacity
Can add more head nodes
Storage BW limited by the head node's switch adapter
Inconvenience of FTP or other utilities to manually move files
Common fabric for message passing and storage I/O
Naturally generalizes to a grid with all of its issues, compounded by the variability of geographic separation!
Common Third Step
Create "Islands of Nodes" as You Get Even BIGGER
[Diagram: one island with NFS server nodes and client nodes backed by a disk controller and disk enclosures, another island with iSCSI-based storage and client nodes; the islands are connected by a higher-level switch (LAN or WAN) in a hierarchical switching network]
More sophisticated storage systems can be adopted to work within this NFS/IP based model over a WAN/grid
iSCSI-based systems: NFS not necessarily required; "plain vanilla" iSCSI can be used, but more sophisticated schemes are being investigated (e.g., file replication, Univ. of Tokyo)
SAN-based file systems provide local I/O, but the local file systems are NFS-exported over the WAN
still must deal with NFS shortcomings
In this SAN example, the SAN attached storage would typically be distributed over the 6 servers with 6 or more different NFS exported file systems (at least 1 file system per NFS server).
The Final(?) Step
Global Storage with Parallel File System
Storage servers with external storage using the node switch
COMMENTS:
homogeneous network
a parallel FS facilitates the effective use of this architecture
disks are accessed via the LAN/WAN (virtual disks)
performance scales linearly in the number of servers: increasing the number of servers will increase BW
can add capacity without increasing the number of servers
server switch adapters can become a bottleneck
can inexpensively scale out the number of nodes
largest GPFS cluster with a single file system: 2300 nodes
natural model for a grid-based file system
[Diagram: 64 client nodes connected over a node switch (LAN or WAN) to 4 storage nodes; the storage nodes attach via a SAN to a disk controller with 5 disk enclosures]
The Final(?) Step
Another Global Storage Parallel File System Model
SAN attached storage
COMMENTS:
separate switch fabrics
a parallel FS facilitates the effective use of this architecture
performance scales in the number of disk controllers
can add capacity without increasing the number of controllers
scaling out the number of direct-attach nodes is limited by the SAN
largest SAN cluster is 200+ nodes; scaling larger requires remote nodes accessing storage over an IP network
direct-attach nodes get better file system BW: BW is not restricted by the server node's switch adapters (typically, an FC HBA is faster than GbE... but does this change with IB?)
allows greater aggregate BW, e.g., 15 GB/s on 40 nodes
SAN works well in a processing center
use a LAN/WAN to scale out beyond SAN limits
[Diagram: 64 client nodes attached via a SAN switch to a disk controller with 5 disk enclosures for file data; a separate node switch (LAN or WAN) also connects the nodes]
In the early days of HPC clusters, there were limited choices for parallel/global file systems... and generally it was necessary to use the vendor's file system. Today there are other choices (at least 18 at my last count) that have been enabled by the development of Linux based clusters.
In order to more clearly understand how GPFS fits into this environment, the following pages discuss a coarse HPC storage architecture taxonomy covering the range of file systems used on HPC systems... this is a work in progress!
The following pages examine a taxonomy of storage I/O architectures commonly used in HPC systems. They support varying degrees of parallel I/O and do not represent mutually exclusive choices.
Conventional I/O
Asynchronous I/O
Network File Systems
Basic Parallel I/O (Single Component Architecture)
Centralized Metadata Server with SAN Attached Disk (Dual Component Architecture)
Recent Developments (Triple Component Architecture)
High Level Parallel I/O
HPC Storage Architecture Taxonomy
Local file systems
Basic, "no frills, out of the box" file systems
Journaling, extent-based semantics
journaling: logs information about operations performed on the file system metadata as atomic transactions. In the event of a system failure, the file system is restored to a consistent state by replaying the log and applying the log records for the appropriate transactions.
extent: a sequence of contiguous blocks allocated to a file as a unit, described by a triple consisting of <logical offset, length, physical>
If they are a native FS, they are integrated into the OS (e.g., caching done via the VMM)
more favorable toward temporal than spatial locality
Intra-node process parallelism
Disk-level parallelism possible via striping
Not truly a parallel file system
Examples: Ext3, JFS2, XFS
Conventional I/O
COMMENT: GPFS cache (i.e., pagepool) is more favorable toward spatial than temporal locality. Very large pagepools (up to 8 GB using 64 bit OS) may do better with temporal locality.
Asynchronous I/O
Abstractions allowing multiple threads/tasks to safely and simultaneously access a common file
Built on top of a base file system
Parallelism available if it is supported in the base file system
Part of POSIX 4, but not supported on all UNIX-based file systems (e.g., not in Linux 2.4, though Linux 2.6 now includes it?)
AIX, Irix and Solaris support AIO
Network File System (NFS)
Disk access from remote nodes via network access (e.g., TCP/IP over Ethernet)
NFS is ubiquitous and the most common example
it is not truly parallel
old versions are not cache coherent (is V3 or V4 truly safe?)
safe writes require O_SYNC and the -noac mount option
poorer performance for I/O-intensive HPC jobs
write: only 90 MB/s on a system capable of 400 MB/s (4 tasks)
read: only 381 MB/s on a system capable of 740 MB/s (16 tasks)
uses the POSIX I/O API, but not its semantics
useful for on-line interactive access to smaller files
while NFS is not designed for general parallel file access on an HPC system, by placing restrictions on an application's storage I/O model, some customers get "good enough" performance from it
COMMENT: enhancements have been proposed for NFS V4 under AIX that should improve NFS parallel writes.
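A minimal sketch of what the "safe" NFS settings above look like in practice (the server name, export path and mount point are hypothetical): the noac mount option disables attribute caching on the client, and applications additionally open shared files with O_SYNC so writes are pushed through synchronously.

  # on each client node (illustrative names)
  mount -t nfs -o noac,hard,intr nfsserver:/export/data /data
  # applications then open shared files with O_SYNC, e.g. open(path, O_WRONLY|O_SYNC)

This trades a good deal of performance for consistency, which is part of why the write numbers quoted above fall so far below the hardware's capability.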
Parallelizes file, metadata and control operations
single component architecture: does not require a distinction between metadata, storage and client nodes
POSIX I/O model with extensions
byte stream using an API with read(), write(), open(), close(), lseek(), stat(), etc.
extends the POSIX model to support safe parallel data access semantics
these options guarantee portability to other POSIX-based file systems for applications using the POSIX I/O API
generally has API extensions, but these compromise portability
Good performance for large volume, I/O-intensive jobs
Works best for large block, sequential access patterns, but vendors can add optimizations for other patterns
Examples: GPFS (IBM... best of class), GFS (Sistina/Red Hat)
Basic Parallel I/O
Single Component Architecture
Centralized Metadata Servers with SAN Attached Disk
Dual Component Architecture
Parallel user data semantics, but non-parallel metadata semantics
Support POSIX API, but with parallel data access semantics
Dual component architecture (storage client/server, metadata server)
Metadata maintained and accessed from a single common server
Failover features allow a backup metadata server to take over if the primary fails
Uses Ethernet (100 MbE or 1 GbE) for metadata access
Potential scaling bottleneck (but SANs already limit scaling); latency more than BW is the potential issue.
All "disks" connected to all client nodes via the SAN
file data accessed via the SAN, not the node network
removes need for expensive node network (e.g., Myrinet)
inhibits scaling due to cost of FC Switch Tree (i.e., SAN)
Ideal for smaller numbers of nodes
SNFS advertises up to 50 clients (and can go as high as 100 nodes), and is capable of very high BW
on a very carefully configured/tuned p690, GPFS, JFS2 and SNFS all got 15 GB/s
CXFS scales only to 10-12 servers for some users, perhaps at most 30?
good enough for large SMPs?
Examples: CXFS (SGI), SNFS (ADIC), SanFS (IBM)
SanFS and SNFS place heavy emphasis on storage virtualization
Recent Developments
Triple Component Architecture
Lustre and Panasas are two recently developed HPC-style parallel file systems that began "from a clean sheet of paper" in their design, which distinguishes them from the other file systems in this taxonomy. They have a number of architectural similarities.
Triple component architecture
storage clients, storage servers, metadata servers
file data access over the node network between storage clients and servers (e.g., GbE, Myrinet)
Object-oriented architecture
object-oriented disks are not generally available yet, so the current implementation is in SW and not fully generalized
the OO design is blind to the application (i.e., uses the POSIX API with parallel semantics)
Designed to facilitate storage management (i.e., "storage virtualization")
Focus on Linux/COTS environments
Higher Level Parallel I/O
High-level abstraction layer providing a parallel model
Built on top of a base file system (conventional or parallel)
MPI-IO is the ubiquitous model
parallel disk I/O extension to MPI in the MPI-2 standard
semantically richer API
portable
Requires significant source code modification for use in legacy codes, but it has the advantage of being a standard (e.g., syntactic portability)
Which Architecture is Best?
There is no concise answer to this question; it is application/customer specific.
All of them serve specific needs. All of them work well if properly deployed and used according to their design specs.
Issues to consider are:
application requirements
often requires compromise between competing needs
how the product implements a specific architecture
What Others Say About GPFS
Two recent papers comparing/contrasting parallel file systems.
Margo, Kovatch, Andrews, Banister. "An analysis of State-of-the-Art Parallel File Systems for Linux.", 5th International Conf on Linux Clusters: HPC Revolution 2004, Austin, TX, May 2004
Compared GPFS, Lustre, PVFS
Criteria: performance, system administration, redundancy, special features
"In both SAN and NSD modes, GPFS performed the best. It was also easy to install and had numerous redundancy and special features."
Cope, Oberg, Tufo, Woitaszek. "Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments.", 6th International Conf on Linux Clusters: HPC Revolution 2005, Chapel Hill, NC, April 2005
Compared GPFS, Lustre, NFS, PVFS2, TerraFS
Criteria: performance, usability, stability, special features
"Our experiences with GPFS were very positive, and we found it to be a very powerful filesystem with well documented administrative tools."
What is GPFS?
This question no longer has a simple answer.
[Diagram: 36 SP nodes (compute clients plus VSD servers)]
GPFS in The "Good Old Days"
A Typical "Winterhawk" Config with GPFS 1.3
SP Switch
SSA Disk
Compute clients
hdisks 1..64
VSD Servers
All nodes run AIX
Thin nodes are compute clients, wide nodes are VSD servers
GPFS packets transit the switch
The disk is SSA
Peak Aggregate BW for this configuration: at most 440 MB/s
Took an experienced sysadmin a day or 2 to configure
... but GPFS can't do that!?!?!
Old ideas die hard. GPFS is far more versatile than it was in its early days. The following pages highlight many of these newer features.
GPFS Today
GPFS under Linux with GbE
[Diagram, Client System: 28 x345 nodes (x345-1 through x345-28) attached to an Ethernet switch via 1 GbE]
Scaling Out: actual GbE-based designs have been extended up to 1100 Linux nodes (e.g., Intel or Opteron) with GPFS 2.2. The current designed maximum for GPFS 2.3 is 2000+ nodes.
While not officially supported, 100 MbE can also be used among the client nodes instead of 1 GbE.
[Diagram, File Server System: 4 x345 NSD servers (x345-29 through x345-32) attached through a Brocade 2109-F16 SAN switch to a FAStT900 controller with 6 EXP700 enclosures; the NSD servers connect to the Ethernet switch via 1 GbE]
Benchmark results on storage nodes
I/O Benchmark (IBM)
command line: ./ibm_vg{w|r} /gpfs/xxx -nrec 4k -bsz 1m -pattern seq -ntasks 4
summary: write rate 491.4 MB/s*, read rate 533.0 MB/s
iozone
command line: ./iozone.206 -c -R -b output.xls -C -r 32k -s 1024m -i 0 -i 1 -i 2 -i 5 -i 6 -i 7 -i 8 -W -t 4 -+m hostlist.cfs2a
summary: Initial write 554.1 MB/s*, Rewrite 264.1 MB/s*, Read 526.7 MB/s, Re-read 533.6 MB/s, Stride read 31.3 MB/s, Random read 11.4 MB/s, Random mix 26.2 MB/s, Random write 54.0 MB/s
* write caching = on
Red Hat Linux 9.0 (kernel 2.4.24-st2), GPFS 2.2
Benchmark results on "client nodes"
I/O Benchmark (IBM), write caching off
BW constrained by the single GbE adapter on each NSD server
1 client: write = 92.3 MB/s, read = 111 MB/s
8 clients: write = 360 MB/s, read = 384 MB/s
Using Myrinet on 8 clients and 2 FAStT900s: write = 397 MB/s, read = 585 MB/s
A second FAStT900 was needed since peak read BW exceeded the ability of 1 FAStT900.
GPFS Today
Mixed AIX/Linux Config
[Diagram: two p690s (each with two RIO drawers) and 256 x e325 nodes on a GbE switch; a p615 CSM server, a p615 TSM client/server, two cluster management nodes, and application/scheduling nodes (e325-3..6); two 16-port SAN switches attach two FAStT900 controllers with EXP700 enclosures and an LTO tape library (tape_1..tape_4); an existing user network is also attached]
p690s run AIX, e325s run Linux
p690s are NSD servers and compute clients
e325s are compute clients
GPFS mounted on the TSM/HSM server
GPFS packets transit the GbE network
The disk is FC
Overall Peak Aggregate BW < 800+ MB/s
Peak Aggregate BW on the e325s < 640 MB/s
GPFS Today
Mixed AIX/Linux Config
Benchmark Config: 1 p690, 32 x335s

p690 (write caching = off, pattern = sequential, bsz = 1 MB):

Nodes  Tasks/node | Natural Agg Write / Read (MB/s) | Harmonic Agg Write / Read (MB/s) | Harmonic Mean Write / Read (MB/s) | sizeof(file) GB
1      1          | 640.10 / 765.28                 |                                  |                                   | 4
1      2          | 641.07 / 760.75                 | 648.02 / 761.71                  | 324.01 / 380.85                   | 8
1      4          | 631.84 / 739.31                 | 646.62 / 758.13                  | 161.65 / 189.53                   | 16
1      8          | 649.14 / 724.88                 | 670.83 / 755.63                  | 83.85 / 94.45                     | 32
1      16         | 646.97 / 721.26                 | 671.29 / 788.48                  | 41.96 / 49.28                     | 64

x335 (write caching = off, pattern = sequential, bsz = 1 MB):

Nodes  Tasks/node | Natural Agg Write / Read (MB/s) | Harmonic Agg Write / Read (MB/s) | Harmonic Mean Write / Read (MB/s) | sizeof(file) GB
1      1          | 109.27 / 110.89                 |                                  |                                   | 4
2      1          | 198.92 / 218.65                 | 198.92 / 219.51                  | 99.46 / 109.76                    | 8
4      1          | 269.91 / 435.37                 | 269.95 / 437.68                  | 67.49 / 109.42                    | 16
8      1          | 282.44 / 624.19                 | 283.44 / 626.81                  | 35.43 / 78.35                     | 32
16     1          | 253.23 / 595.50                 | 281.78 / 598.18                  | 17.61 / 37.39                     | 64
32     1          | 269.77 / 577.53                 | 269.83 / 581.23                  | 8.43 / 18.16                      | 128

Mixed nodes (write caching = off, pattern = sequential, bsz = 1 MB, ntasks = 4):

                 Nodes  Tasks/node | Natural Agg Write / Read (MB/s) | Harmonic Agg Write / Read (MB/s) | Harmonic agg over x335 and p690, Write / Read (MB/s) | sizeof(file) GB
x335 only ***    4      1          | 267.33 / 435.72                 | 268.29 / 437.23                  | N/A / N/A                                            | 16
p690 only        1      4          | 632.45 / 771.00                 | 650.98 / 788.25                  | N/A / N/A                                            | 16
x335 with p690   4      1          | 233.03 / 386.27                 | 233.03 / 389.61                  | 707.48* / 801.66*                                    | 24**
p690 with x335   1      4          | 470.22 / 407.79                 | 477.30 / 412.81                  |                                                      | 32**

* Job times are nearly identical; therefore, the iostat-measured rate was very close to the harmonic aggregate rate.
** Size of the combined files from each job; file sizes were adjusted so job times were approximately equal. Combined files for the write = 24 GB, combined files for the read = 32 GB.
*** x335 aggregate read rates were gated by the 4 GbE adapters at a little over 100 MB/s per adapter.
[Diagram: a BladeCenter chassis with 14 HS20 blades (GPFS NSD) attached via GbE ports 01-14; 4 x345 disk server systems (GPFS SAN) attached via FC ports to a DS4500 controller with EXP 710 enclosures; the SAN switch (Brocade 2109-F16) is optional]
SYSTEM ANALYSIS
1. DS4500 - sustained peak performance < 540 MB/s
2. FC Network - sustained peak performance < 600 MB/s
3. GbE Network (adapter aggregate measured over all 4 x345s) - sustained peak performance < 360 MB/s
4. Aggregate x345 Rate - sustained peak performance < 500 MB/s
5. Predicted Aggregate HS20 Rate - sustained peak performance < 360 MB/s
Comments
- HS20 performance constrained by limited GbE ports
- The rates for items 1-4 are based on benchmark tests
- The SAN switch is optional; using it may reduce load on the GbE network and reduce aggregate application disk I/O bandwidth
Lower Cost/Bandwidth Alternative
If less file access bandwidth is required or a lower cost solution is required, then the x345/FAStT900 system can be replaced with the following:
- 2 disk servers (x345), each with 1 GbE and 1 FC HBA
- 1 FAStT600 and 1 disk enclosure (EXP700)
- SAN switch is optional
- sustained peak performance < 200 MB/s
Global File System over Multiple HS20 Systems
This standalone system can be replicated N times. By routing HS20 and x345 GbE traffic through a switch, the NSD layer in GPFS will enable all blades to see all LUNs; i.e., multiple HS20 systems can all safely mount the same GPFS file system and performance will scale linearly.
WARNING: Do not connect an FC controller to the FC ports on the blade chassis... it's not supported.
GPFS Today
GPFS in a BladeCenter - Standard Configuration
GPFS Today
GPFS in a BladeCenter - Alternative Configuration
BladeCenter - internal GbE network (highlighted in blue in the original diagram)
[Diagram: 14 blades (blade 01 through blade 14), each with two internal IDE disks, connected by the BladeCenter's internal GbE network; external GbE ports lead out of the chassis to a GbE switch]
BENCHMARK ANALYSIS
- A private internal GbE network connects all blades
- Each blade has a single GbE adapter
effective BW = 80 to 100 MB/s
- Baseline performance (read from a single local disk using ext2)
application read rate = 30 MB/s
- Single task GPFS performance
application read rate = 80 MB/s
2.7x faster than the single disk rate
- Further analysis
assume the active task is on blade 01
GPFS stripes over all 28 IDE drives in this configuration
GPFS uses the GbE network for striping activity, therefore single-blade GPFS performance is limited to the GbE adapter BW (i.e., up to 100 MB/s)
each blade
has only one GbE adapter, used by GPFS and the general system
acts as a disk server and GPFS client
single task performance = 80 MB/s
in a similar test on x345s where
there were separate GPFS clients and disk servers
there were 2 GbE adapters per node (one for GPFS and one for everything else)
single task performance = 100 MB/s
- Aggregate rate (1 task per blade)
read rate = 560 MB/s
- Analysis
each blade is acting as a disk server (as well as a client)
since GPFS scales linearly in the number of disk servers, it yields good aggregate performance
in a Winterhawk 2 system using SSA disk with each node acting as a GPFS client and disk server, the aggregate rate over 14 nodes was < 420 MB/s
NOTE: This is not a recommended GPFS configuration since the disks are not twin-tailed... "but it works".
[Diagram: 40 Linux/Intel nodes, each with 3 FC HBAs (one connection through each FC switch); 3 x 2 Gb FC switches (switch 01-03); all disks directly attached to the servers via the FC switches; 15 storage frames (Frame-01 through Frame-15), each with 4 x FAStT600 and 8 x EXP700 (2 EXP700 per FAStT600) and 8 connections to the switches]
Aggregate BW = 15 GB/s (sustained)
goto: http://www.sdsc.edu/Press/2004/11/111204_SC04.html
GPFS Today
SDSC/IBM StorCloud Architecture
40 Linux Nodes
3 FC HBAs per Node
15 Storage frames
60 FAStT600s
2520 disks
240 LUNs
8+P
4 LUNs per FAStT600
73 GB/disk @ 15 Krpm
Sustained Aggregate Rate: 15 GB/s
380 MB/s per node
256 MB/s per FAStT600
GPFS Today
Selected SDSC/IBM StorCloud Statistics
COMMENT: IP vs FC
With today's technology, direct-attached disk models (e.g., SAN attached) can yield greater per-node BW than IP-based models.
IP-based systems are limited by the Ethernet adapter (e.g., 80 MB/s for 1 GbE, 120 MB/s for dual bonded GbE)
direct-attached systems can have multiple FC HBAs (e.g., with 3 HBAs/node the BW is 380 MB/s)
Will 10 GbE change this? Will IB change this?
GPFS Today
Storage Pools
Motivation: Some newer file systems implement a concept called "storage pools"; GPFS supports a form of this.
Disks present themselves to GPFS as LUNs
A GPFS file system can be built over any subset of these LUNs
There is 1 storage pool per FS
Max: 32 file systems per GPFS cluster
Example
Monolithic disk architecture (e.g., SATA); access is "bursty"
To avoid striping over all disks and stressing all disks, divide the disks into 16 disjoint subsets and hence 16 file systems. File striping is confined to a file system. When a file system is not in use, GPFS is not spinning its disks. (n.b., all FSs are seen by all nodes in the GPFS cluster... up to 2000+ nodes)
Example
2 classes of disk: FC disk and SATA disk
1 FS for FC disk, used for constant access
1 FS for SATA disk, used infrequently
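A command-level sketch of the second example above, assuming GPFS 2.x mmcrnsd/mmcrfs syntax and a simple disk descriptor format (all disk names, descriptor fields, mount points and options here are illustrative, not taken from an actual configuration):

  # hypothetical descriptor files (DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup)
  printf 'hdisk10:::dataAndMetadata:1\nhdisk11:::dataAndMetadata:2\n' > /tmp/fc_disks.desc
  printf 'hdisk20:::dataAndMetadata:3\nhdisk21:::dataAndMetadata:4\n' > /tmp/sata_disks.desc
  # turn the LUNs into NSDs, then build one file system per disk class
  mmcrnsd -F /tmp/fc_disks.desc
  mmcrnsd -F /tmp/sata_disks.desc
  mmcrfs /gpfs/fc fsfc -F /tmp/fc_disks.desc -A yes       # FC file system, in constant use
  mmcrfs /gpfs/sata fssata -F /tmp/sata_disks.desc -A no  # SATA file system, mounted only when needed

Because striping is confined to a file system, the SATA disks sit idle whenever /gpfs/sata is not being used.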
[Diagram: 12 p575 nodes (p575-1 through p575-12) and 4 BladeCenters (blade center #1-#4) connected by a GbE switch and a Federation switch; a SAN switch attaches a FAStT900 and a FAStT100]
GPFS Today
Storage Pools in a Mixed BladeCenter/pSeries Cluster
Problem: nodes outside the cluster need access to GPFS files
Solution: allow nodes outside the cluster to mount the file system
"Owning" cluster responsible for admin, managing locking, recovery, ...
Separately administered remote nodes have limited status
Can request locks and other metadata operations
Can do I/O to file system disks over global SAN (IP, Fibre Channel, …)
Are trusted to enforce access control, map user Ids, …
Uses:
High-speed data ingestion, postprocessing (e.g., visualization)
Sharing data among clusters
Separate data and compute sites (Grid)
Forming multiple clusters into a "supercluster" for grand challenge problems
[Diagram: Cluster 1 nodes with the Cluster 1 file system on a Site 1 SAN (local disk access); Cluster 2 nodes on a Site 2 SAN and a visualization system on a Site 3 SAN (remote disk access); all joined by a global SAN interconnect]
GPFS Today
Cross-cluster Mounts
GPFS Today
Cross-cluster Mounts -- Example
[Diagram: Cluster_A ("Home Cluster", file system /fsA) has compute nodes and NSD servers nsd_A1-nsd_A4 on an IP-based switch, with a SAN switch behind the NSD servers; Cluster_B ("Remote Cluster", mounts /fsAonB) has compute nodes and NSD servers NSD_B1-NSD_B2 on its own IP-based switch and SAN switch; the two IP switches are joined by inter-switch links (at least GbE speed!)]
COMMENTS: Cluster_B accesses /fsA from Cluster_A via the NSD nodes
see example on next page
Cluster_B mounts /fsA locally as /fsAonB
OpenSSL (secure sockets layer) provides secure access between clusters
UID MAPPING EXAMPLE (i.e., Credential Mapping)
1. pass Cluster_B UID/GID(s) from the I/O thread node to mmuid2name
2. map the UID to GUN(s) (Globally Unique Names)
3. send the GUN(s) to mmname2uid on a node in Cluster_A
4. generate the corresponding Cluster_A UID/GID(s)
5. send the Cluster_A UID/GIDs back to the Cluster_B node running the I/O thread (for the duration of the I/O request)
COMMENTS: mmuid2name and mmname2uid are user-written scripts made available to all users in /var/mmfs/etc; these scripts are called ID remapping helper functions (IRHF) and implement access policies
simple strategies (e.g., a text-based file with UID <-> GUN mappings) or 3rd-party packages (e.g., Globus Security Infrastructure from TeraGrid) can be used to implement the remapping procedures
UID/GID_B -> mmuid2name -> GUN -> mmname2uid -> UID/GID_A
See http://www-1.ibm.com/servers/eserver/clusters/whitepapers/uid_gpfs.html for details.
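As a hedged illustration of the "simple strategy" above, the two helper scripts could be little more than lookups in a flat UID <-> GUN mapping file. The calling convention (a single argument, result on stdout) and the file /var/mmfs/etc/uid_gun.map are assumptions for this sketch; check the white paper above for the real IRHF interface.

  #!/bin/sh
  # /var/mmfs/etc/mmuid2name (sketch): print the GUN recorded for a numeric UID
  awk -v u="$1" '$1 == u { print $2 }' /var/mmfs/etc/uid_gun.map

  #!/bin/sh
  # /var/mmfs/etc/mmname2uid (sketch): print the local UID recorded for a GUN
  awk -v g="$1" '$2 == g { print $1 }' /var/mmfs/etc/uid_gun.map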
GPFS Today
Cross-cluster Mounts -- Sysadm Commands
On Cluster_A
1. Generate public/private key pair: mmauth genkey
COMMENTS: creates a public key file with the default name id_rsa.pub; start the GPFS daemons after this command!
2. Enable authorization: mmchconfig cipherList=AUTHONLY
3. Sysadm gives the following file to Cluster_B: /var/mmfs/ssl/id_rsa.pub
COMMENT: rename it as cluster_A.pub
7. Authorize Cluster_B to mount the FS owned by Cluster_A:
mmauth add cluster_B -k cluster_B.pub
On Cluster_B
4. Generate public/private key pair: mmauth genkey
COMMENTS: creates a public key file with the default name id_rsa.pub; start the GPFS daemons after this command!
5. Enable authorization: mmchconfig cipherList=AUTHONLY
6. Sysadm gives the following file to Cluster_A: /var/mmfs/ssl/id_rsa.pub
COMMENT: rename it as cluster_B.pub
8. Define the cluster name, contact nodes and public key for cluster_A:
mmremotecluster add cluster_A -n nsd_A1,nsd_A2,nsd_A3,nsd_A4 -k cluster_A.pub
9. Identify the FS to be accessed on cluster_A:
mmremotefs add /dev/fsAonB -f /dev/fsA -C cluster_A -T /fsAonB
10. Mount the FS locally: mount /fsAonB
Mount a GPFS file system from Cluster_A onto Cluster_B (assume the diagram from the previous page)
See http://publib.boulder.ibm.com/clresctr/docs/gpfs/gpfs23/200412/bl1adm10/bl1adm1031.html#admmcch for details.
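Pulling the Cluster_B side together, a minimal sketch of the whole sequence (node names, key file names and the /fsAonB mount point follow the example above; verify the exact option syntax against the GPFS 2.3 administration guide linked here):

  # on Cluster_B, after receiving cluster_A.pub from the Cluster_A sysadm
  mmauth genkey                      # creates /var/mmfs/ssl/id_rsa.pub; then start the GPFS daemons
  mmchconfig cipherList=AUTHONLY     # enable authorization
  # define the owning cluster, its contact (NSD) nodes and its public key
  mmremotecluster add cluster_A -n nsd_A1,nsd_A2,nsd_A3,nsd_A4 -k cluster_A.pub
  # identify the remote file system, giving it a local device name and mount point
  mmremotefs add /dev/fsAonB -f /dev/fsA -C cluster_A -T /fsAonB
  mount /fsAonB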
GPFS Today
GPFS is Easier to Administer
[Diagram: sam (NSD client; IntelliStation M Pro, 2 CPUs, 4 GB, Linux 2.4.20, GPFS 2.2), frodo (NSD server; p615, 2 CPUs, 4 GB, AIX 5.2, GPFS 2.2) and gandalf (p615, 2 CPUs, 4 GB, AIX 5.2, GPFS 2.2) connected by Ethernet (GbE); frodo and gandalf each have internal SCSI disks (hdisk0-hdisk3) and share external storage: an SSA drawer (16 disks, 9 GB, 10 Krpm; pdisks 0..15) and an EXP300 (14 SCSI disks, 36 GB, 15 Krpm; pdisks 0..6 and 7..13); each server has 1 VG over hdisk4..10 with 8 LVs, and NSDs are defined over hdisk11..26]
To build the file system*, do the following on gandalf...
1. mmcrcluster
2. mmstartup
3. mmcrnsd
   specify primary, secondary, client, server nodes in the disk descriptor file
4. mmcrfs
5. mount /<FS name>
* GPFS 2.3
COMMENT: This could be a FAStT disk controller.
COMMENTS: Once the SAN zoning and low-level disk formats are complete, one can build GPFS in under 5 minutes on smaller systems.
For StorCloud, it took ~30 minutes, but that was for a 135 TB file system (n.b., 240 LUNs over 2520 disks).
Other dynamic features...
mmadddisk, mmaddnode, mmdeldisk, mmdelnode, mmchattr, mmchfs, mmchcluster, mmchconfig, mmchnsd, mmpmon
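A hedged sketch of steps 1-5 above as actual commands, assuming GPFS 2.3-era option syntax (the node list, descriptor file contents, block size and mount point are illustrative; check the GPFS administration guide for the exact flags):

  # 1. create the cluster (gandalf as primary, frodo as secondary configuration server)
  mmcrcluster -n /tmp/nodes.list -p gandalf -s frodo
  # 2. start the GPFS daemons on all nodes
  mmstartup -a
  # 3. create NSDs from a disk descriptor file (DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup)
  mmcrnsd -F /tmp/disks.desc
  # 4. create the file system over those NSDs
  mmcrfs /gpfs/fs1 fs1 -F /tmp/disks.desc -B 256K -A yes
  # 5. mount it
  mount /gpfs/fs1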
So what is GPFS?... in one line or less
So what is GPFS?
It is IBM's shared disk, parallel file system for AIX and Linux clusters.
What is GPFS?
GPFS = General Parallel File System
IBM's shared disk, parallel file system for AIX and Linux clusters
Cluster: 2300+ nodes (tested), fast reliable communication, common admin domain
Shared disk: all data and metadata on disk accessible from any node through disk I/O interface (i.e., "any to any" connectivity)
Parallel: data and metadata flows from all of the nodes to all of the disks in parallel
RAS: reliability, accessibility, serviceability
General: supports wide range of HPC application needs over a wide range of configurations
GPFS Features
1. General Parallel File System: a mature IBM product, generally available for 7 years
2. Clustered, shared disk, parallel file system for AIX and Linux
3. Adaptable to many customer environments by supporting a wide range of basic configurations and disk technologies
4. Provides safe, high-BW access using the POSIX I/O API
5. Provides non-POSIX advanced features, e.g., DMAPI, data-shipping, multiple access hints (also used by MPI-IO)
6. Provides good performance for large volume, I/O-intensive jobs
7. Works best for large record, sequential access patterns; has optimizations for other patterns (e.g., strided, backward)
8. Strong RAS features (reliability, accessibility, serviceability)
9. Converting to GPFS does not require application code changes, provided the code works in a POSIX-compatible environment
GPFS Features
GPFS Performance Features
1. striping
2. large blocks (with support for sub-blocks)
3. byte range locking (rather than file or extent locking)
4. access pattern optimizations
5. file caching (i.e., the pagepool) that optimizes streaming access
6. prefetch, write-behind
7. multi-threading
8. distributed management functions (e.g., metadata, tokens)
9. multi-pathing (i.e., multiple, independent paths to the same file data from anywhere in the cluster)
GPFS Features
GPFS provides many of its own RAS features and exploits RAS features provided by various subsystems
1. If a node providing GPFS management functions fails, an alternative node assumes responsibility, reducing the risk of losing the file system.
2. When using dedicated NSD servers with "twin-tailed" disks, specifying primary and secondary nodes lets the secondary node provide access to the disk if the primary node fails.
   WARNING: internal SCSI and IDE drives are not twin-tailed
3. In a SAN environment, failover reduces the risk of lost access to data.
4. GPFS on RAID architectures reduces the risk of lost data.
5. Online and dynamic system management allows file system modifications without bringing down the file system: mmadddisk, mmdeldisk, mmaddnode, mmdelnode, mmchconfig, mmchfs
GPFS Features
Other Features
1. Disk scaling, allowing large, single-instantiation global file systems (100's of TB now, PB in the future)
2. Node scaling (2300+ nodes), allowing large clusters and high BW (many GB/s)
3. Multi-cluster architecture (i.e., grid)
4. Journaling (logging) file system - logs information about operations performed on the file system metadata as atomic transactions that can be replayed
5. Data Management API (DMAPI) - industry-standard interface that allows third-party applications (e.g., TSM) to implement hierarchical storage management
Why is GPFS needed?
Clustered applications impose new requirements on the file system
"Any to any" accessany node in the cluster has access to any data in the cluster
Parallel applications need access to the same data from multiple nodes
Serial applications dynamically assigned to processors based on load need high-performance access to their data from wherever they run
Require both good availability of data and normal file system semantics
Scalability to large numbers of nodes
GPFS supports this via:
Uniform access – single-system image across cluster
Conventional Posix API – no program modification
High capacity – large files, 100TB + file system
High throughput – wide striping, large blocks, many GB/sec to one file
Parallel data and metadata access – shared disk and distributed locking
Reliability and fault-tolerance - node and disk failures
Online system management – dynamic configuration and monitoring
Customer applications that require fast, scalable access to large amounts of file data. These applications may be serial or parallel, reading or writing.
Applications that serve data to visualization engines
Seismic data acquisition processing for serial or parallel reading/writing of files
Environments with very large data, especially when single file servers (such as NFS) reach capacity limits
Digital library file serving
Access to large CAD/CAM file sets
Data mining applications
Data cleansing applications preparing data for data warehouses
Oracle RAC
Applications requiring data rates which exceed what can be delivered by other file systems
Large aggregate scratch space for commercial or scientific applications
Internet serving of content to users with balanced performance
Applications with high availability (HA) file system requirements
Example GPFS Applications
Selected New Features in GPFS 2.3
Scale out to larger clusters (over 2000 nodes)
Current largest production cluster is 2300 blades
Bigger file systems: 100's of TB
Bigger LUNs
There is no GPFS limitation; it is now an OS and disk limitation only
over 2 TB on AIX in 64 bit mode, up to 2 TB in "other supported" OS's
Depends less on disk protocols for many features (e.g., SCSI persistent reserve)
therefore GPFS is portable to a wider variety of disk hardware
No longer requires RSCT
Only one cluster type (n.b., no need to specify sp, rpd or hacmp)
Simpler quorum definition rules
GPFS-specific performance monitor: measures latency and bandwidth (a usage sketch follows below)
GPFS is easier to administer and use
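The performance monitor referred to above is the mmpmon command listed earlier among the administration tools. A hedged sketch of how it might be driven (the command file contents and flags shown follow typical mmpmon usage but should be verified against the GPFS 2.3 documentation):

  # ask for per-file-system I/O statistics, repeated 10 times at 2-second intervals
  echo "fs_io_s" > /tmp/mmpmon.cmds
  mmpmon -i /tmp/mmpmon.cmds -r 10 -d 2000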