Abaqus/Implicit IO Profiling
April 2010
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, Dell, SIMULIA, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank SIMULIA for their support and guidelines
• For more info please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com
– http://www.simulia.com
3
SIMULIA Abaqus
• ABAQUS offers a suite of engineering design analysis
software products, including tools for:
– Nonlinear finite element analysis (FEA)
– Advanced linear and dynamics application problems
• ABAQUS/Standard provides general-purpose FEA that includes a broad range of analysis capabilities
• ABAQUS/Explicit provides nonlinear, transient, dynamic
analysis of solids and structures using explicit time
integration
4
Objectives
• The presented research was done to provide best practices and IO
profiling information for Abaqus/Standard
– Determination of application IO requirements
– Testing of application on NFS IO subsystem
• Provide recommendations on Storage systems for
Abaqus/Standard
5
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB memory, DDR2 800MHz per node
• OS: RHEL5U3, OFED 1.4.1 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: Abaqus 6.9 EF1
• Single SCSI hard drive in master node using NFS over GigE connection
• Benchmark Workload
– Abaqus/Standard Server Benchmarks: s4b
6
Mellanox InfiniBand Solutions
• Industry standard – hardware, software, cabling, management
  – Designed for clustering and storage interconnect
• Performance
  – 40Gb/s node-to-node
  – 120Gb/s switch-to-switch
  – 1us application latency
  – Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
  – RDMA and transport offload
  – Kernel bypass
  – CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation including storage
[Chart: "InfiniBand Delivers the Lowest Latency" – the InfiniBand performance gap over Ethernet and Fibre Channel is increasing; bandwidth points shown: 20Gb/s, 40Gb/s, 60Gb/s, 80Gb/s (4X), 120Gb/s, 240Gb/s (12X)]
7
• Performance
  – Quad-Core
    • Enhanced CPU IPC
    • 4x 512K L2 cache
    • 6MB L3 cache
  – Direct Connect Architecture
    • HyperTransport™ Technology
    • Up to 24 GB/s peak per processor
  – Floating point
    • 128-bit FPU per core
    • 4 FLOPS/clk peak per core
  – Integrated memory controller
    • Up to 12.8 GB/s
    • DDR2-800 MHz or DDR2-667 MHz
• Scalability
  – 48-bit physical addressing
• Compatibility
  – Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
November 5, 2007
[Diagram: Quad-Core AMD Opteron™ processor with Direct Connect Architecture – dual-channel registered DDR2 memory, 8 GB/s HyperTransport links to the I/O hub and PCI-E® bridges, plus USB and PCI]
8
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
  – 24-node cluster built with Dell PowerEdge™ SC 1435 servers
  – Servers optimized for High Performance Computing environments
  – Building block foundations for best price/performance and performance/watt
• Dell HPC Solutions
  – Scalable architectures for high performance and productivity
  – Dell's comprehensive HPC services help manage the lifecycle requirements
  – Integrated, tested and validated architectures
• Workload Modeling
  – Optimized system size, configuration and workloads
  – Test-bed benchmarks
  – ISV application characterization
  – Best practices & usage analysis
9
Dell PowerEdge™ Server Advantage
• Dell™ PowerEdge™ servers incorporate AMD Opteron™ and Mellanox ConnectX InfiniBand to provide leading-edge performance and reliability
• Building block foundations for best price/performance and performance/watt
• Investment protection and energy efficiency
• Longer-term server investment value
• Faster DDR2-800 memory
• Enhanced AMD PowerNow!
• Independent Dynamic Core Technology
• AMD CoolCore™ and Smart Fetch Technology
• Mellanox InfiniBand end-to-end for highest networking performance
10
Introduction to Profiling
11
Abaqus/Standard Benchmark Results
• Master node has a single hard drive
  – SAS drive
• Exported using NFS over GigE
  – Full bi-sectional bandwidth
• No special NFS options used on server or clients
  – Default options used
  – For example, on the server: /application *(rw,sync,no_root_squash)
12
Abaqus/Standard Benchmark Results
• Input dataset: s4b
  – Cylinder head bolt-up (mildly non-linear)
  – 5,000,000 DOFs, 5 iterations
  – 4GB memory
  – 23GB disk space
• Profile was done on 8 cores
  – Only 1 core per node was used
  – All nodes are connected via InfiniBand
• Analysis was done using strace_analyzer (clusterbuffer.wetpaint.com)
  – GPL application
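strace_analyzer works by parsing the output of strace. As an illustration only (not the actual GPL tool), a minimal Python sketch of the same idea – tallying read/write syscall counts and byte totals per process from `strace -f -ttt -T` output – might look like this; the exact log format and the function name are assumptions:

```python
import re
from collections import defaultdict

# Matches lines like:
#   10759 1270.000001 write(17, "..."..., 2048) = 2048 <0.000042>
# as produced by: strace -f -ttt -T -o trace.out <application>
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<ts>[\d.]+)\s+(?P<call>read|write)\((?P<fd>\d+),"
    r".*\)\s+=\s+(?P<ret>-?\d+)\s+<(?P<dur>[\d.]+)>"
)

def tally(lines):
    """Per-PID [count, bytes] totals for read and write syscalls."""
    stats = defaultdict(lambda: {"read": [0, 0], "write": [0, 0]})
    for line in lines:
        m = LINE_RE.match(line)
        if not m or int(m.group("ret")) < 0:
            continue  # skip non-IO lines and failed syscalls
        entry = stats[int(m.group("pid"))][m.group("call")]
        entry[0] += 1                    # syscall count
        entry[1] += int(m.group("ret"))  # bytes actually transferred
    return stats
```

Pointing a parser like this at a full trace file yields per-process totals of the kind shown in the command-count table later in the deck.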
13
Abaqus/Standard IO Profiling
• The goal of IO profiling is to examine:
  – How the application performs IO
    • How many processes do IO?
    • How much writing? How much reading?
    • Sizes of syscalls?
    • Number of lseek() calls? (head thrashing)
  – How the profile results translate into IO requirements (i.e. design)
  – For applications with source code, IO profiling can be used to change the application for better performance
14
Executive Summary
15
Abaqus/Standard - Summary
• This particular case of Abaqus/Standard does a great deal of IO
  – ~45% of the time is spent doing IO
• All processes do IO
• Most of the IO is write
  – The rank-0 process does 83.7% of its writes in the 1KB – 8KB range
• IOPS performance can be important, but it appears to matter mostly for syscalls in the 1KB – 8KB range
• N-N file patterns seem to dominate
  – But there is some N-1 that is about the same size as N-N
• Recommendations:
  – A fast central storage system seems desirable
  – The impact of local storage performance is unknown at this time
16
Abaqus/Standard - Summary
• This particular case of Abaqus/Standard does a great deal of IO (27% – 46% of the total time is spent doing IO)
  – A mixture of IO that is local and over the central file system
  – For this case the central file system was NFS mounted
• All ranks do a great deal of IO
  – There is a recognizable rank-0 process (10759)
• Most of the IO is write
  – Rank-0 does 309,147 writes (83.7% are 1KB – 8KB)
• Still a reasonable amount of read:
  – Rank-0 does 72,827 read syscalls (72.5% are 1KB – 8KB)
• Local performance can be a driver
  – Need to test alternative setups and alternative storage
17
Details
18
Abaqus/Standard – Run Times
Process ID | Total Run Time (secs) | IO Time (secs) | % of Time for IO
6764       | 837.93 | 370.95 | 44.27%
10759      | 761.73 | 34.17  | 4.49%
24119      | 807.53 | 216.00 | 26.75%
24272      | 808.19 | 314.45 | 38.90%
24289      | 853.53 | 359.51 | 42.12%
24302      | 831.41 | 313.23 | 37.67%
25930      | 831.34 | 280.70 | 33.76%
28228      | 853.41 | 390.39 | 45.75%
• It appears that 10759 is rank-0
  – Smallest IO time and IO percentage versus the other processes
• The other processes spend a great deal of time doing IO
19
Abaqus/Standard – Command Count
• Number of times each IO system function is called:

Process ID | access | lseek  | fcntl | stat | unlink | open  | close | fstat | read   | mkdir | getdents | write
6764       | 7      | 14,716 | 6     | 256  | 31     | 1,051 | 1,053 | 902   | 42,695 | 0     | 8        | 27,137
10759      | 15     | 38,514 | 32    | 347  | 35     | 1,110 | 1,119 | 959   | 72,827 | 8     | 8        | 309,147
24119      | 7      | 14,715 | 6     | 251  | 31     | 1,022 | 1,024 | 877   | 42,638 | 0     | 8        | 27,145
24272      | 7      | 14,717 | 6     | 256  | 31     | 1,047 | 1,049 | 901   | 42,689 | 0     | 8        | 27,203
24289      | 7      | 14,723 | 6     | 251  | 31     | 1,021 | 1,023 | 877   | 42,643 | 0     | 8        | 27,047
24302      | 7      | 14,73  | 6     | 246  | 31     | 994   | 996   | 853   | 42,589 | 0     | 8        | 26,930
25930      | 7      | 14,071 | 6     | 250  | 31     | 1,017 | 1,019 | 877   | 42,631 | 0     | 8        | 27,030
28228      | 7      | 14,734 | 6     | 253  | 31     | 1,022 | 1,024 | 877   | 42,643 | 0     | 8        | 27,100

• Process 10759 is likely the rank-0 process
  – access() count is greater than the others
  – mkdir() is called only from this process
  – Many more read() and write() calls than the other processes
20
Abaqus/Standard – Command Count
• Open/close counts don't match because of sockets
  – Creating a socket looks like an open()
  – Sockets are opened but not closed
• open() is also used for .so libraries
  – .so libraries are opened and read, so they look like IO
• Is there truly a rank-0 process?
  – All processes are doing IO
• Possible rank-0: process 10759
  – 10 times more write() calls than the other processes
  – Does all the mkdir() calls
  – Does 2 times more lseek() calls than the other processes
  – One lseek() for every 10 writes
21
Abaqus/Standard – Write Statistics
Write syscall sizes:

Process ID | 0–1KB  | 1KB–8KB | 8KB–32KB | 32KB–128KB | 128KB–256KB | 256KB–512KB | 512KB–1MB | 1MB–10MB | 10MB–100MB
6764       | 5,727  | 19,418  | 21       | 1,303      | 32          | 50          | 14        | 267      | 3
10759      | 48,092 | 258,881 | 115      | 1,304      | 30          | 55          | 16        | 635      | 7
24119      | 5,727  | 19,422  | 21       | 1,304      | 31          | 52          | 12        | 571      | 3
24272      | 5,793  | 19,424  | 14       | 1,304      | 31          | 48          | 16        | 567      | 3
24289      | 5,645  | 19,410  | 18       | 1,303      | 28          | 52          | 9         | 577      | 3
24302      | 5,520  | 19,419  | 21       | 1,305      | 37          | 48          | 13        | 562      | 3
25930      | 5,520  | 19,419  | 21       | 1,305      | 37          | 48          | 13        | 562      | 3
28228      | 5,688  | 19,415  | 19       | 1,303      | 29          | 48          | 15        | 578      | 3
• There are no write syscalls larger than 100 MB
  – But the 10MB – 100MB writes are very large single syscalls
• For process 10759
  – 642 write syscalls larger than 1 MB
• The rank-0 process does many more writes smaller than 8KB
  – About 10 times more
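The size buckets in the table above are straightforward to reproduce from a list of raw syscall sizes. A small sketch (the bin labels, exclusive upper bounds, and function name are our own choices):

```python
# The slide's size bins, in bytes (upper bounds, exclusive).
BINS = [
    ("0-1KB", 1 << 10), ("1KB-8KB", 8 << 10), ("8KB-32KB", 32 << 10),
    ("32KB-128KB", 128 << 10), ("128KB-256KB", 256 << 10),
    ("256KB-512KB", 512 << 10), ("512KB-1MB", 1 << 20),
    ("1MB-10MB", 10 << 20), ("10MB-100MB", 100 << 20),
]

def size_histogram(sizes):
    """Count syscall sizes into the same bins used on the slide.

    Sizes of 100MB or more fall outside every bin and are ignored;
    none were observed in this run.
    """
    hist = {label: 0 for label, _ in BINS}
    for s in sizes:
        for label, upper in BINS:
            if s < upper:
                hist[label] += 1
                break
    return hist
```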
22
Abaqus/Standard – Write Statistics
Process ID | Total Bytes   | Mean Bytes/call | Std Dev (Bytes) | Mean Abs Dev (Bytes) | Median Bytes/call | Median Abs Dev (Bytes) | Slowest write (secs)
6764       | 1,209,892,905 | 44,587.91       | 389,382.93      | 406,411.02           | 4096              | 42,256.99              | 30.3112
10759      | 1,998,797,029 | 6,465.75        | 232,122.19      | 233,500.39           | 2048              | 5,046.46               | 2.4528
24119      | 1,215,273,444 | 44,773.00       | 390,391.32      | 407,477.38           | 4096              | 42,442.30              | 0.5274
24272      | 1,209,949,469 | 44,481.80       | 388,991.56      | 406,008.85           | 4096              | 42,166.66              | 4.5482
24289      | 1,216,555,764 | 44,982.65       | 390,330.21      | 407,460.32           | 4096              | 42,630.84              | 2.9125
24302      | 1,206,742,345 | 44,813.66       | 390,802.04      | 407,876.36           | 4096              | 42,431.95              | 2.8695
25930      | 1,206,742,345 | 44,813.66       | 390,802.04      | 407,876.36           | 4096              | 42,431.95              | 2.8695
28228      | 1,223,024,346 | 45,133.38       | 391,201.85      | 408,385.38           | 4096              | 42,791.41              | 14.6159
• The average bytes per syscall for 10759 (rank-0?) is much smaller than for the others
  – 6.5KB vs. 45KB
• VERY slow write times for some processes
  – 30 seconds and 14 seconds
• Each node writes a great deal of data
  – ~1.2GB for each non-rank-0 process
  – ~2GB for rank-0
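The gap between mean and median write size is worth illustrating: a handful of multi-MB writes on top of many 2KB writes drags the mean far above the median, which is exactly the shape of rank-0's row in the table. A sketch with hypothetical data loosely shaped like that profile (the function name and sample sizes are our own):

```python
import statistics

def write_size_summary(sizes):
    """Mean, median, and sample standard deviation of syscall sizes.

    A long tail of large writes pulls the mean far above the median,
    which is why the tables report both statistics.
    """
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "stdev": statistics.stdev(sizes) if len(sizes) > 1 else 0.0,
    }

# Hypothetical mix: many 2 KB writes plus a few large ones.
sizes = [2048] * 98 + [1 << 20, 10 << 20]
summary = write_size_summary(sizes)
```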
23
Abaqus/Standard – Read Statistics
Read syscall sizes:

Process ID | 0–1KB  | 1KB–8KB | 8KB–32KB | 32KB–128KB | 128KB–256KB | 256KB–512KB | 512KB–1MB | 1MB–10MB | 10MB–100MB
6764       | 13,145 | 29,319  | 12       | 13         | 20          | 7           | 9         | 168      | 2
10759      | 18,701 | 52,780  | 27       | 17         | 26          | 11          | 18        | 1,239    | 8
24119      | 13,091 | 29,320  | 12       | 14         | 19          | 7           | 8         | 165      | 2
24272      | 13,142 | 29,320  | 11       | 13         | 20          | 6           | 9         | 166      | 2
24289      | 13,092 | 29,320  | 11       | 14         | 19          | 6           | 7         | 172      | 2
24302      | 13,040 | 29,320  | 12       | 13         | 20          | 7           | 8         | 167      | 2
25930      | 13,086 | 29,320  | 12       | 13         | 20          | 8           | 8         | 162      | 2
28228      | 13,092 | 29,320  | 11       | 13         | 20          | 5           | 9         | 171      | 2
• There are no read syscalls larger than 100 MB
• Process 10759:
  – 1,247 read syscalls greater than 1 MB (8 in the 10MB – 100MB range)
    • 10 times the number for the other processes
  – Twice the number of reads of the other processes
24
Abaqus/Standard – Read Statistics
Process ID | Total Bytes   | Mean Bytes/call | Std Dev (Bytes) | Mean Abs Dev (Bytes) | Median Bytes/call | Median Abs Dev (Bytes) | Slowest read (secs)
6764       | 580,525,222   | 13,597.03       | 254,496.37      | 260,030.28           | 4096              | 12,005.78              | 11.3631
10759      | 2,313,198,661 | 31,762.93       | 477,423.33      | 487,689.97           | 4                 | 31,759.02              | 0.0446
24119      | 576,763,800   | 13,526.99       | 254,790.91      | 260,297.70           | 4096              | 11,929.49              | 1.5676
24272      | 578,160,440   | 13,543.55       | 254,445.30      | 259,963.03           | 4096              | 11,951.97              | 5.7766
24289      | 582,451,173   | 13,658.78       | 254,761.43      | 260,340.48           | 4096              | 12,061.11              | 5.9373
24302      | 579,003,926   | 13,595.15       | 254,790.07      | 260,331.58           | 4096              | 11,991.69              | 5.6893
25930      | 572,938,928   | 13,439.49       | 253,880.60      | 259,321.27           | 4096              | 11,841.53              | 1.9479
28228      | 583,322,316   | 13,679.20       | 255,137.85      | 260,728.20           | 4096              | 12,081.61              | 127.1235
• The average bytes per read syscall for rank-0 is much larger than for the other processes
  – 31KB vs. 13.6KB
  – The standard deviation is also much larger
• The median bytes per read for rank-0 is extremely small (4 bytes)
• One VERY slow read
  – Process 28228: 127 seconds (almost 2 minutes!)
25
Abaqus/Standard – IOPS Statistics
Process ID | Avg Write IOPS | Avg Read IOPS | Avg Total IOPS | Max Write IOPS | Max Write IOPS Time (secs) | Max Read IOPS | Max Read IOPS Time (secs) | Max Total IOPS | Max Total IOPS Time (secs)
6764       | 175   | 355 | 480   | 6,812  | 835 | 15,721 | 13 | 26,151 | 480
10759      | 1,807 | 475 | 2,232 | 31,092 | 750 | 20,540 | 9  | 32,801 | 9
24119      | 212   | 398 | 588   | 7,234  | 788 | 12,929 | 14 | 21,975 | 14
24272      | 200   | 391 | 552   | 6,642  | 788 | 20,391 | 11 | 33,157 | 11
24289      | 187   | 387 | 534   | 612    | 850 | 20,391 | 11 | 33,157 | 11
24302      | 205   | 394 | 578   | 6,671  | 828 | 15,083 | 14 | 25,206 | 14
25930      | 207   | 406 | 595   | 6,683  | 828 | 12,181 | 14 | 20,853 | 14
28228      | 186   | 355 | 501   | 6,682  | 850 | 16,493 | 13 | 27,310 | 13
• Run time is about 850 secs
  – Some IOPS peaks occur at the beginning (read) and some near the end (write)
• Max Write IOPS occurs for rank-0 (31,092)
  – Its average is just 1,807
• Max Read IOPS occurs for several processes (~20,000)
  – Averages are just about 400
• Max Total IOPS occurs for several processes (~33,000)
  – Averages are about 600, but for rank-0 it is 2,232
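Peak and average IOPS figures like these can be derived by bucketing the syscall timestamps into one-second bins. A sketch (the function name and the seconds-since-run-start timestamp convention are assumptions):

```python
from collections import Counter

def iops_history(timestamps):
    """Bucket syscall timestamps (seconds since run start) into
    1-second bins; return (peak IOPS, average IOPS over the run)."""
    per_sec = Counter(int(t) for t in timestamps)
    duration = max(per_sec) + 1          # run length in whole seconds
    peak = max(per_sec.values())         # busiest single second
    avg = len(timestamps) / duration     # syscalls per second overall
    return peak, avg
```

The gap between peak and average (31,092 vs. 1,807 for rank-0's writes) is what makes the peaks easy to miss if only averages are reported.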
26
Abaqus/Standard – File Details
Filename | Read Bytes | Avg. Read Bytes/sec | Write Bytes | Avg. Write Bytes/sec
/application/Simulia/benchmark/s4b_dellamd_64core.prt | 0 | 150,920,784.99 | 75,854,689 | 353,233,129.06
/dev/infiniband/uverbs0 | 0 | 66,002,083.33 | 106,608 | 212,763.29
/sys/class/infiniband_verbs/uverbs0/abi_version | 2 | 889,087.30 | 0 | 0
/proc/cpuinfo | 6,248 | 1,413,552.42 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.use.1 | 0 | 0 | 54,492 | 22,686,914.50
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.dat.1 | 0 | 68,590,373.73 | 195 | 3,855,281.49
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.msg.1 | 0 | 0 | 249 | 3,803,347.58
/application/Simulia/benchmark/s4b_dellamd_64core.023 | 42,143,900 | 359,660,376.97 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.sim | 1,733,172 | 91,525,283.73 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.mdl.1 | 154,123,640 | 801,527,477.80 | 1,332,088 | 325,424,705.64
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core_tmp_.1 | 2,048 | 208,035,387.05 | 2,280 | 36,897,650.79
/storage/root_s4b_dellamd_64core_10591/SimTemp.8u95Fh | 0 | 146,285,714.29 | 13,392 | 109,557,851.00
/storage/root_s4b_dellamd_64core_10591/fortK2YecK | 0 | 21,387,121.21 | 42,041,344 | 720,370,419.59
/application/Simulia/benchmark/s4b_dellamd_64core.stt.1 | 203,076,124 | 303,339,180.30 | 362,530,784 | 599,991,946.68
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.cax.1 | 87,432 | 222,829,369.91 | 16,604 | 366,313,402.97
/proc/6766/status | 762 | 22,524,755.28 | 0 | 108,140,241.03
• Process 6764 (non-rank-0)
• The /application directory is NFS shared
• The /storage directory is local
• Little writing is done to local storage (42MB is the largest)
• Very little reading is done from local storage (2KB)
• Most read() and write() activity is done over NFS
• s4b_dellamd_64core.stt.1 appears to do the most IO
27
Abaqus/Standard – File Details
Filename | Read Bytes | Avg. Read Bytes/sec | Write Bytes | Avg. Write Bytes/sec
/application/Simulia/benchmark/s4b_dellamd_64core.prt | 0 | 116,026,282.15 | 75,854,689 | 360,865,013.73
/dev/infiniband/uverbs0 | 0 | 41,178,259.38 | 124,008 | 208,728.97
/sys/class/infiniband_verbs/uverbs0/abi_version | 2 | 546,236.17 | 0 | 0
/proc/cpuinfo | 6,248 | 1,358,478.76 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.use | 0 | 0 | 58,686 | 15,626,030.44
/etc/protocols | 4,096 | 120,230,505.89 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.dat | 0 | 372,363,636.36 | 4,815 | 7,236,145.84
/application/Simulia/benchmark/s4b_dellamd_64core.msg | 0 | 0 | 16,203 | 8,754,438.58
/application/Simulia/benchmark/s4b_dellamd_64core.odb | 47,768,316 | 351,966,985.98 | 492,464,040 | 272,677,237.48
/application/Simulia/benchmark/s4b_dellamd_64core.023 | 42,143,900 | 381,006,383.53 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.sim | 2,039,284 | 118,308,073.20 | 306,212 | 243,821,651.09
/application/Simulia/benchmark/s4b_dellamd_64core_0.mdl | 154,036,199 | 763,215,988.75 | 1,332,088 | 603,230,471.89
/application/Simulia/benchmark/s4b_dellamd_64core_0.stt | 202,671,800 | 845,758,301.20 | 361,935,344 | 588,916,616.78
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core_0.cax | 87,432 | 401,370,493.04 | 16,604 | 264,176,077.10
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64coreTmp0.sim | 2,048 | 208,071,565.50 | 2,280 | 36,841,904.76
/storage/root_s4b_dellamd_64core_10591/SimTemp.Vfyuto | 0 | 186,181,818.18 | 13,392 | 107,492,574.09
/storage/root_s4b_dellamd_64core_10591/fort1YOKEw | 0 | 21,115,327.38 | 42,041,344 | 742,441,432.12
/application/Simulia/benchmark/s4b_dellamd_64core.sta | 0 | 20,822,534.14 | 438 | 251,623,075.48
/proc/10763/status | 767 | 23,167,553.69 | 0 | 213,893,079.01
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core_0.cax | 0 | 0 | 88,432 | 152,445,198.22
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.scr | 0 | 0 | 168,326,652 | 97,006,113.13
• Process 10759 (rank-0?)
• The /application directory is NFS shared
• The /storage directory is local
• Some write IO is local (s4b_dellamd_64core.scr)
• Some write IO (and most of the read IO) is over NFS
28
Abaqus/Standard – Local vs. Central IO
Process ID | Local Read (MB) | Local Write (MB) | Central Read (MB) | Central Write (MB)
6764       | 0.089 | 42.07  | 357.20 | 363.92
10759      | 0.089 | 210.49 | 404.47 | 855.81
24119      | 0.089 | 42.07  | 353.45 | 367.37
24272      | 0.089 | 42.07  | 359.16 | 364.01
24289      | 0.069 | 42.07  | 359.16 | 368.64
24302      | 0.089 | 42.07  | 355.71 | 362.14
25930      | 0.089 | 42.07  | 349.63 | 357.56
28228      | 0.069 | 42.07  | 360.03 | 372.04
• “Local” IO is done on the drive in the node
• “Central” IO is done to the NFS server
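Splitting IO totals into local vs. central reduces to matching each file path against the mount points – here /storage is the node-local drive and /application is the NFS export, per the slides. A sketch (the function and constant names are our own):

```python
# Mount map for this cluster, per the slides: /storage is the
# node-local drive, /application is the NFS export.
LOCAL_PREFIXES = ("/storage",)
CENTRAL_PREFIXES = ("/application",)

def classify_io(per_file_bytes):
    """Split {path: bytes} totals into local, central, and other IO."""
    local = central = other = 0
    for path, nbytes in per_file_bytes.items():
        if path.startswith(LOCAL_PREFIXES):
            local += nbytes
        elif path.startswith(CENTRAL_PREFIXES):
            central += nbytes
        else:
            other += nbytes  # /proc, /dev, libraries, sockets
    return local, central, other
```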
29
Abaqus/Standard – Shared File IO
File | Total Reads (MB) | Total Writes (MB)
s4b_dellamd_64core.prt | 0      | 227.56
s4b_dellamd_64core.023 | 337.15 | 0
s4b_dellamd_64core.sim | 14.17  | 0.306

• These three files are the only shared files during the run
• They are located on the NFS server
• All processes read/write to these files
• Heavy read: the *.023 file
• Heavy write: the *.prt file
30
Abaqus/Standard – Time Histories
• The next slides present time histories of:
  – Write and read syscall size
    • Amount of data in each system call
  – Write IOPS and read IOPS
  – Write throughput and read throughput
    • MB/s for each syscall
  – Read and write cumulative syscall size (MB)
  – File offsets (bytes)
• The charts can present useful visual information about the IO behavior over time
• Two MPI processes (ranks) are shown:
  – 10759: probable rank-0
  – 6764: example of non-rank-0 IO
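File-offset histories like the ones on the following slides can be extracted from the lseek() calls in an strace log. A simplified sketch (the regex and log format are assumptions, and real traces also move the offset implicitly through read/write, which this ignores):

```python
import re

# Matches lseek lines from `strace -ttt`, e.g.
#   1270.000001 lseek(17, 8192, SEEK_SET) = 8192
LSEEK_RE = re.compile(r"^(?P<ts>[\d.]+)\s+lseek\((?P<fd>\d+),\s*(?P<off>-?\d+),")

def offset_history(lines, fd):
    """(timestamp, requested offset) pairs for one file descriptor --
    the raw data behind the file-offset-history plots."""
    out = []
    for line in lines:
        m = LSEEK_RE.match(line)
        if m and int(m.group("fd")) == fd:
            out.append((float(m.group("ts")), int(m.group("off"))))
    return out
```

Plotting offset against timestamp for a single descriptor reproduces the "file pointer movement" views; dense vertical scatter indicates the head-thrashing pattern the profile is looking for.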
31
Rank-0 Plots
32
Write syscall and Write IOPS History for 10759
• Largest amount written in a short time
• Write IOPS peak is 31,092
• Average write syscall = 6.4KB; median write syscall = 2KB
• Solver time?
• Write IOPS average is 750
33
Write syscall call distribution - 10759
• Vast majority of write syscalls are very small (2KB)
• Very few large write syscalls
• Writes are very small: average = 6.4 KB, median = 2KB
34
Write Throughput and Write IOPS History for 10759
• Peak throughput is a bit more than 1750 MB/s
• Solver time?
• Four peaks: 1. at the beginning, 2. just before the solver, 3. just after the solver, 4. at the very end
• Two write throughput peaks match peaks in write IOPS: 1. just after the solver, 2. at the very end
35
Cumulative Write syscall History for 10759
• Some writing at the beginning of the run (1GB), just before the solver starts
• More writing about half-way through the run (about 500MB), just after the solver finishes
• Final writing (about 600MB)
• Solver time?
36
Read syscall and Read IOPS History for 10759
• Solver time?
• Very few “larger” reads, and only at the beginning
• Read IOPS peak is 20,540
• Reads are very small: average is 31.7KB, median is 4 bytes!
• Average read IOPS is 475
37
Read syscall call distribution - 10759
• Vast majority of read syscalls are very small (4 bytes!)
• Very few large read syscalls
• Reads are very small: average is 31.7KB, median is 4 bytes!
38
Read Throughput History and Read IOPS for 10759
• Peak read IOPS is 20,540
• Solver time?
• Average read IOPS is 475
• Peak read throughput occurs at the end: ~3,700 MB/s
• Four read throughput peaks: 1. at the very beginning, 2. just before the solver, 3. just after the solver, 4. at the very end
39
Cumulative read syscall history for 10759
• Some reading at the beginning of the run (~650MB)
• Solver time? Large number of very small reads
• Very large amount of reading just prior to the solver: about 1.4GB
40
File offset history – 10759 (fort1YOKEw)
• Lots of file pointer movement at the beginning
• This file is a local scratch file and does a great deal of writing at the beginning of the run
41
File offset history – 10759 (s4b_dellamd_64core.scr)
• Solver time? Lots of file pointer movement after the solver is finished
• Lots of file pointer movement at the very end of the run
• This is a local scratch file that does about 168MB of writing
42
File offset history – 10759 (s4b_dellamd_64core.023)
• Lots of file pointer movement at the beginning
• This is a central file that does about 42MB of reading
43
File offset history – 10759 (s4b_dellamd_64core.odb)
• Lots of file pointer movement after the solver is finished
• Solver time?
• Central file that does about 47MB of reads and 492MB of writes
44
File offset history – 10759 (s4b_dellamd_64core.odb) Detailed view
• Lots of file pointer movement
• Detailed view of the file offsets near the end of the run
45
File offset history – 10759 (s4b_dellamd_64core.scr)
• Small amounts of file pointer movement
46
File offset history – 10759 (s4b_dellamd_64core.mdl)
• Solver time?
• Lots of file pointer movement after the solver is finished
• Central file that does 154MB of reading
47
Non-Rank-0 Plots
48
Write syscall and Write IOPS History for non-rank-0 (6764)
• Most write syscalls are very small, but there are larger ones
• Peak write IOPS is 6,812
• Average write IOPS is 175
49
Write syscall call distribution – non-Rank-0 (6764)
• Vast majority of write syscalls are very small (4KB!)
• Very few large write syscalls
50
Write throughput and Write IOPS History fornon-Rank-0 (6764)
• Max write throughput is about 2,300 MB/s
• Peaks in throughput occur at various times
• Notice that peak throughput does not mean peak write IOPS
51
Cumulative Write syscall History for non-Rank-0 (6764)
• Fair amount of writing at the beginning (around 200MB)
• More writing at the end: about 450 MB
52
Read syscall and Read IOPS History for non-Rank-0 (6764)
• Lots of reading at the beginning, just prior to the solver, and just after the solver finishes
• Solver time?
• Peak read IOPS is 15,721
• Average read IOPS is 355
53
Read syscall call distribution – non-Rank-0 (6764)
• The majority of reads are very small: average = 13.6 KB, median = 4KB
• Very few large reads
54
Read Throughput History and Read IOPS fornon-Rank-0 (6764)
• Max read throughput is about 2,700 MB/s
• Small amounts of reading toward the end of the run
• Solver time?
• Peak read IOPS does not match peak throughput
55
Cumulative read syscall history for non-Rank-0 (6764)
• Solver time?
• Lots of reading at the beginning (160 MB)
• Lots of reading after the solver (~350MB)
56
File offset history – 6764 (fortK2YecK)
• Lots of file pointer movement at the beginning
• This file is a local scratch file and does a great deal of writing at the beginning of the run (42MB)
57
File offset history – 6764 (s4b_dellamd_64core.023)
• Lots of file pointer movement at the beginning
• This is a central scratch file and does a great deal of reading at the beginning of the run (42MB)
58
File offset history – 6764 (s4b_dellamd_64core.mdl.1)
• Lots of file pointer movement after the solver finishes
• This is a central file and does a great deal of reading at the beginning of the run (154MB)
59
File offset history – 6764 (s4b_dellamd_64core.prt)
• Lots of file pointer movement at the very end
• This is a central file and does a great deal of writing at the end of the run (75MB)
60
File offset history – 6764 (s4b_dellamd_64core.stt.1)
• Lots of file pointer movement after the solver finishes
• Lots of file pointer movement at the very end
• This is a central file and does a great deal of reading and writing during the run (203MB of reading and 362MB of writing)
61
File offset history – 6764 (s4b_dellamd_64core.stt.1) Detailed View
• Lots of file pointer movement after the solver finishes
62
Summary
63
Abaqus/Standard – IO Profile Summary
• A great deal of time is spent doing IO compared to the total run time
  – 27% – 46% of the time involves IO
  – This is over NFS (GigE) with a single drive on the server
  – Each node has 1 local drive
• All the nodes do a great deal of IO, but one process seems to do more than the others (10759)
  – Possible rank-0 process
    • 10 times more write() calls than the other processes
    • Does all the mkdir() calls
    • Does 2 times more lseek() calls than the other processes
    • One lseek() for every 10 writes
64
Abaqus/Standard – IO Profile Summary cont’d
• Rank-0 process:
  – 642 write syscalls are 1MB – 100MB
  – 258,881 write syscalls are 1KB – 8KB
  – 48,092 write syscalls are 0 – 1KB
  – 1,247 read syscalls are 1MB – 100MB
  – 52,780 read syscalls are 1KB – 8KB
  – 18,701 read syscalls are 0 – 1KB
• Non-rank-0 processes:
  – Around 600 write syscalls are 1MB – 100MB
  – ~19,400 write syscalls are 1KB – 8KB
  – ~5,700 write syscalls are 0 – 1KB
  – ~170 read syscalls are 1MB – 100MB
  – 29,320 read syscalls are 1KB – 8KB
  – ~13,100 read syscalls are 0 – 1KB
65
Abaqus/Standard – IO Profile Summary cont’d
• Rank-0 process:
  – Write() syscalls:
    • Average is 6.46KB +/- 232KB
    • Median is 2KB +/- 5KB
  – Read() syscalls:
    • Average is 31.76KB +/- 477KB
    • Median is 4 bytes +/- 31.76KB (that isn’t a typo)
• Non-rank-0 processes:
  – Write() syscalls:
    • Average is 44.8KB +/- 390KB
    • Median is 4KB +/- 42.4KB
  – Read() syscalls:
    • Average is 13.6KB +/- 254.7KB
    • Median is 4KB +/- 12KB
66
Abaqus/Standard – IO Profile Summary cont’d
• Rank-0 has much higher IOPS than the other processes
  – Particularly writes
• Rank-0:
  – Write IOPS:
    • Peak write IOPS is 31,092
    • Average write IOPS is 1,807
  – Read IOPS:
    • Peak read IOPS is 20,540
    • Average read IOPS is 475
  – Total IOPS:
    • Peak total IOPS is 32,801
    • Average total IOPS is 2,232
• Non-rank-0:
  – Write IOPS:
    • Peak write IOPS is ~6,600
    • Average write IOPS is ~200
  – Read IOPS:
    • Peak read IOPS is ~20,000
    • Average read IOPS is ~400
  – Total IOPS:
    • Peak total IOPS is ~33,000
    • Average total IOPS is ~550
67
Abaqus/Standard – IO Profile Summary cont’d
• The overall run happens in several phases:
  – Up to 100-150 seconds:
    • A great deal of reading and a fair amount of writing
  – 150-400 seconds:
    • Solver phase? Very little IO
  – 400-740 seconds:
    • A fair amount of IO, primarily writing (intermediate output?)
    • Fair amount of lseek activity
    • Lots of smaller reads and writes
  – 740-760 seconds:
    • A great deal of writing (final output?)
68
Abaqus/Standard - Summary
• This particular case of Abaqus/Standard does a great deal of IO (27% – 46% of the total time is spent doing IO)
  – A mixture of IO that is local and over the central file system
  – For this case the central file system was NFS mounted
• All ranks do a great deal of IO
  – There is a recognizable rank-0 process (10759)
• Most of the IO is write
  – Rank-0 does 309,147 writes (83.7% are 1KB – 8KB)
• Still a reasonable amount of read:
  – Rank-0 does 72,827 read syscalls (72.5% are 1KB – 8KB)
• IOPS performance can be important
  – Most of the syscalls are in the 1KB – 8KB range
69
File Patterns
• Most of the IO is done to a central file system
  – Rank-0 does a fair amount of local IO
• N-N IO dominates
  – Each rank does about ~350 MB of reads and ~362 MB of writes
• Only 3 shared files (N-1)
  – 337 MB of reads (total)
  – 227 MB of writes (total)
• Overall file patterns are dominated by N-N
  – Almost all of it is to the central file system
• Some N-1 file patterns
  – About the same size as the N-N files
70
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “as-is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.