Abaqus/Implicit IO Profiling
April 2010
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, Dell, SIMULIA, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank SIMULIA for their support and guidelines
• For more info please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com
– http://www.simulia.com
3
SIMULIA Abaqus
• ABAQUS offers a suite of engineering design analysis
software products, including tools for:
– Nonlinear finite element analysis (FEA)
– Advanced linear and dynamics application problems
• ABAQUS/Standard provides general-purpose FEA that includes a broad range of analysis capabilities
• ABAQUS/Explicit provides nonlinear, transient, dynamic
analysis of solids and structures using explicit time
integration
4
Objectives
• The presented research was done to provide best practices and IO
profiling information for Abaqus/Standard
– Determination of application IO requirements
– Testing of application on NFS IO subsystem
• Provide recommendations on Storage systems for
Abaqus/Standard
5
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB memory, DDR2 800MHz per node
• OS: RHEL5U3, OFED 1.4.1 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: Abaqus 6.9 EF1
• Single SCSI hard drive in master node using NFS over GigE connection
• Benchmark Workload
– Abaqus/Standard Server Benchmarks: s4b
6
Mellanox InfiniBand Solutions
• Industry standard – hardware, software, cabling, management
  – Designed for clustering and storage interconnect
• Performance
  – 40Gb/s node-to-node
  – 120Gb/s switch-to-switch
  – 1us application latency
  – Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
  – RDMA and transport offload
  – Kernel bypass
  – CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation including storage
[Chart: "InfiniBand Delivers the Lowest Latency" – the InfiniBand performance gap over Ethernet and Fibre Channel is increasing; bandwidth points shown: 20Gb/s, 40Gb/s, 60Gb/s, 80Gb/s (4X), 120Gb/s, 240Gb/s (12X)]
7
• Performance
  – Quad-Core
    • Enhanced CPU IPC
    • 4x 512K L2 cache
    • 6MB L3 cache
  – Direct Connect Architecture
    • HyperTransport™ Technology
    • Up to 24 GB/s peak per processor
  – Floating point
    • 128-bit FPU per core
    • 4 FLOPS/clk peak per core
  – Integrated memory controller
    • Up to 12.8 GB/s
    • DDR2-800 MHz or DDR2-667 MHz
• Scalability
  – 48-bit physical addressing
• Compatibility
  – Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
November 5, 2007
[Diagram: Quad-Core AMD Opteron™ processor with Direct Connect Architecture – dual-channel registered DDR2 memory, 8 GB/s HyperTransport links to the I/O hub and PCI-E® bridges, plus USB and PCI]
8
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
  – 24-node cluster built with Dell PowerEdge™ SC 1435 servers
  – Servers optimized for High Performance Computing environments
  – Building block foundations for best price/performance and performance/watt
• Dell HPC Solutions
  – Scalable architectures for high performance and productivity
  – Dell's comprehensive HPC services help manage the lifecycle requirements
  – Integrated, tested and validated architectures
• Workload Modeling
  – Optimized system size, configuration and workloads
  – Test-bed benchmarks
  – ISV application characterization
  – Best practices & usage analysis
9
Dell PowerEdge™ Server Advantage
• Dell™ PowerEdge™ servers incorporate AMD Opteron™ and Mellanox ConnectX InfiniBand to provide leading-edge performance and reliability
• Building block foundations for best price/performance and performance/watt
• Investment protection and energy efficiency
• Longer-term server investment value
• Faster DDR2-800 memory
• Enhanced AMD PowerNow!
• Independent Dynamic Core Technology
• AMD CoolCore™ and Smart Fetch Technology
• Mellanox InfiniBand end-to-end for highest networking performance
10
Introduction to Profiling
11
Abaqus/Standard Benchmark Results
• Master node has a single hard drive
  – SAS drive
• Exported using NFS over GigE
  – Full bi-sectional bandwidth
• No special NFS options used on server or clients
  – Default options used
  – For example, on the server: /application *(rw,sync,no_root_squash)
12
Abaqus/Standard Benchmark Results
• Input dataset: s4b
  – Cylinder head bolt-up (mildly non-linear)
  – 5,000,000 DOFs, 5 iterations
  – 4GB memory
  – 23GB disk space
• Profile was done on 8 cores
  – Only 1 core per node was used
  – All nodes are connected via InfiniBand
• Analysis was done using strace_analyzer (clusterbuffer.wetpaint.com)
  – GPL application
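strace_analyzer works by parsing the output of strace. As an illustration only (not the actual GPL tool), a minimal Python sketch of the same idea – tallying read/write syscall counts and byte totals per process from `strace -f -ttt -T` output – might look like this; the exact log format and the function name are assumptions:

```python
import re
from collections import defaultdict

# Matches lines like:
#   10759 1270.000001 write(17, "..."..., 2048) = 2048 <0.000042>
# as produced by: strace -f -ttt -T -o trace.out <application>
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<ts>[\d.]+)\s+(?P<call>read|write)\((?P<fd>\d+),"
    r".*\)\s+=\s+(?P<ret>-?\d+)\s+<(?P<dur>[\d.]+)>"
)

def tally(lines):
    """Per-PID [count, bytes] totals for read and write syscalls."""
    stats = defaultdict(lambda: {"read": [0, 0], "write": [0, 0]})
    for line in lines:
        m = LINE_RE.match(line)
        if not m or int(m.group("ret")) < 0:
            continue  # skip non-IO lines and failed syscalls
        entry = stats[int(m.group("pid"))][m.group("call")]
        entry[0] += 1                    # syscall count
        entry[1] += int(m.group("ret"))  # bytes actually transferred
    return stats
```

Pointing a parser like this at a full trace file yields per-process totals of the kind shown in the command-count table later in the deck.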
13
Abaqus/Standard IO Profiling
• The goal of IO profiling is to examine:
  – How the application performs IO
    • How many processes do IO?
    • How much writing? How much reading?
    • Sizes of syscalls?
    • Number of lseek() calls? (head thrashing)
  – How the profile results translate into IO requirements (i.e. design)
  – For applications with source code, IO profiling can be used to change the application for better performance
14
Executive Summary
15
Abaqus/Standard - Summary
• This particular case of Abaqus/Standard does a great deal of IO
  – ~45% of the time is spent doing IO
• All processes do IO
• Most of the IO is write
  – The rank-0 process does 83.7% of its writes in the 1KB – 8KB range
• IOPS performance can be important, but it appears to matter mostly for syscalls in the 1KB – 8KB range
• N-N file patterns seem to dominate
  – But there is some N-1 that is about the same size as N-N
• Recommendations:
  – A fast central storage system seems desirable
  – The impact of local storage performance is unknown at this time
16
Abaqus/Standard - Summary
• This particular case of Abaqus/Standard does a great deal of IO (27% – 46% of the total time is spent doing IO)
  – A mixture of IO that is local and over the central file system
  – For this case the central file system was NFS mounted
• All ranks do a great deal of IO
  – There is a recognizable rank-0 process (10759)
• Most of the IO is write
  – Rank-0 does 309,147 writes (83.7% are 1KB – 8KB)
• Still a reasonable amount of read:
  – Rank-0 does 72,827 read syscalls (72.5% are 1KB – 8KB)
• Local performance can be a driver
  – Need to test alternative setups and alternative storage
17
Details
18
Abaqus/Standard – Run Times
Process ID | Total Run Time (secs) | IO Time (secs) | % of Time for IO
6764       | 837.93 | 370.95 | 44.27%
10759      | 761.73 | 34.17  | 4.49%
24119      | 807.53 | 216.00 | 26.75%
24272      | 808.19 | 314.45 | 38.90%
24289      | 853.53 | 359.51 | 42.12%
24302      | 831.41 | 313.23 | 37.67%
25930      | 831.34 | 280.70 | 33.76%
28228      | 853.41 | 390.39 | 45.75%
• It appears that 10759 is rank-0
  – Smallest IO time and IO percentage versus the other processes
• The other processes spend a great deal of time doing IO
19
Abaqus/Standard – Command Count
• Number of times each IO system function is called:

Process ID | access | lseek  | fcntl | stat | unlink | open  | close | fstat | read   | mkdir | getdents | write
6764       | 7      | 14,716 | 6     | 256  | 31     | 1,051 | 1,053 | 902   | 42,695 | 0     | 8        | 27,137
10759      | 15     | 38,514 | 32    | 347  | 35     | 1,110 | 1,119 | 959   | 72,827 | 8     | 8        | 309,147
24119      | 7      | 14,715 | 6     | 251  | 31     | 1,022 | 1,024 | 877   | 42,638 | 0     | 8        | 27,145
24272      | 7      | 14,717 | 6     | 256  | 31     | 1,047 | 1,049 | 901   | 42,689 | 0     | 8        | 27,203
24289      | 7      | 14,723 | 6     | 251  | 31     | 1,021 | 1,023 | 877   | 42,643 | 0     | 8        | 27,047
24302      | 7      | 14,73  | 6     | 246  | 31     | 994   | 996   | 853   | 42,589 | 0     | 8        | 26,930
25930      | 7      | 14,071 | 6     | 250  | 31     | 1,017 | 1,019 | 877   | 42,631 | 0     | 8        | 27,030
28228      | 7      | 14,734 | 6     | 253  | 31     | 1,022 | 1,024 | 877   | 42,643 | 0     | 8        | 27,100

• Process 10759 is likely the rank-0 process
  – access() count is greater than the others
  – mkdir() is called only from this process
  – Many more read() and write() calls than the other processes
20
Abaqus/Standard – Command Count
• Open/close counts don't match because of sockets
  – Creating a socket looks like an open()
  – Sockets are opened but not closed
• open() is also used for .so libraries
  – .so libraries are opened and read, so they look like IO
• Is there truly a rank-0 process?
  – All processes are doing IO
• Possible rank-0: process 10759
  – 10 times more write() calls than the other processes
  – Does all the mkdir() calls
  – Does 2 times more lseek() calls than the other processes
  – One lseek() for every 10 writes
21
Abaqus/Standard – Write Statistics
Write syscall sizes:

Process ID | 0–1KB  | 1KB–8KB | 8KB–32KB | 32KB–128KB | 128KB–256KB | 256KB–512KB | 512KB–1MB | 1MB–10MB | 10MB–100MB
6764       | 5,727  | 19,418  | 21       | 1,303      | 32          | 50          | 14        | 267      | 3
10759      | 48,092 | 258,881 | 115      | 1,304      | 30          | 55          | 16        | 635      | 7
24119      | 5,727  | 19,422  | 21       | 1,304      | 31          | 52          | 12        | 571      | 3
24272      | 5,793  | 19,424  | 14       | 1,304      | 31          | 48          | 16        | 567      | 3
24289      | 5,645  | 19,410  | 18       | 1,303      | 28          | 52          | 9         | 577      | 3
24302      | 5,520  | 19,419  | 21       | 1,305      | 37          | 48          | 13        | 562      | 3
25930      | 5,520  | 19,419  | 21       | 1,305      | 37          | 48          | 13        | 562      | 3
28228      | 5,688  | 19,415  | 19       | 1,303      | 29          | 48          | 15        | 578      | 3
• There are no write syscalls larger than 100 MB
  – But the 10MB – 100MB writes are very large single syscalls
• For process 10759
  – 642 write syscalls larger than 1 MB
• The rank-0 process does many more writes smaller than 8KB
  – About 10 times more
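The size buckets in the table above are straightforward to reproduce from a list of raw syscall sizes. A small sketch (the bin labels, exclusive upper bounds, and function name are our own choices):

```python
# The slide's size bins, in bytes (upper bounds, exclusive).
BINS = [
    ("0-1KB", 1 << 10), ("1KB-8KB", 8 << 10), ("8KB-32KB", 32 << 10),
    ("32KB-128KB", 128 << 10), ("128KB-256KB", 256 << 10),
    ("256KB-512KB", 512 << 10), ("512KB-1MB", 1 << 20),
    ("1MB-10MB", 10 << 20), ("10MB-100MB", 100 << 20),
]

def size_histogram(sizes):
    """Count syscall sizes into the same bins used on the slide.

    Sizes of 100MB or more fall outside every bin and are ignored;
    none were observed in this run.
    """
    hist = {label: 0 for label, _ in BINS}
    for s in sizes:
        for label, upper in BINS:
            if s < upper:
                hist[label] += 1
                break
    return hist
```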
22
Abaqus/Standard – Write Statistics
Process ID | Total Bytes   | Mean Bytes/call | Std Dev (Bytes) | Mean Abs Dev (Bytes) | Median Bytes/call | Median Abs Dev (Bytes) | Slowest write (secs)
6764       | 1,209,892,905 | 44,587.91       | 389,382.93      | 406,411.02           | 4096              | 42,256.99              | 30.3112
10759      | 1,998,797,029 | 6,465.75        | 232,122.19      | 233,500.39           | 2048              | 5,046.46               | 2.4528
24119      | 1,215,273,444 | 44,773.00       | 390,391.32      | 407,477.38           | 4096              | 42,442.30              | 0.5274
24272      | 1,209,949,469 | 44,481.80       | 388,991.56      | 406,008.85           | 4096              | 42,166.66              | 4.5482
24289      | 1,216,555,764 | 44,982.65       | 390,330.21      | 407,460.32           | 4096              | 42,630.84              | 2.9125
24302      | 1,206,742,345 | 44,813.66       | 390,802.04      | 407,876.36           | 4096              | 42,431.95              | 2.8695
25930      | 1,206,742,345 | 44,813.66       | 390,802.04      | 407,876.36           | 4096              | 42,431.95              | 2.8695
28228      | 1,223,024,346 | 45,133.38       | 391,201.85      | 408,385.38           | 4096              | 42,791.41              | 14.6159
• The average bytes per syscall for 10759 (rank-0?) is much smaller than for the others
  – 6.5KB vs. 45KB
• VERY slow write times for some processes
  – 30 seconds and 14 seconds
• Each node writes a great deal of data
  – ~1.2GB for each non-rank-0 process
  – ~2GB for rank-0
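The gap between mean and median write size is worth illustrating: a handful of multi-MB writes on top of many 2KB writes drags the mean far above the median, which is exactly the shape of rank-0's row in the table. A sketch with hypothetical data loosely shaped like that profile (the function name and sample sizes are our own):

```python
import statistics

def write_size_summary(sizes):
    """Mean, median, and sample standard deviation of syscall sizes.

    A long tail of large writes pulls the mean far above the median,
    which is why the tables report both statistics.
    """
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "stdev": statistics.stdev(sizes) if len(sizes) > 1 else 0.0,
    }

# Hypothetical mix: many 2 KB writes plus a few large ones.
sizes = [2048] * 98 + [1 << 20, 10 << 20]
summary = write_size_summary(sizes)
```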
23
Abaqus/Standard – Read Statistics
Read syscall sizes:

Process ID | 0–1KB  | 1KB–8KB | 8KB–32KB | 32KB–128KB | 128KB–256KB | 256KB–512KB | 512KB–1MB | 1MB–10MB | 10MB–100MB
6764       | 13,145 | 29,319  | 12       | 13         | 20          | 7           | 9         | 168      | 2
10759      | 18,701 | 52,780  | 27       | 17         | 26          | 11          | 18        | 1,239    | 8
24119      | 13,091 | 29,320  | 12       | 14         | 19          | 7           | 8         | 165      | 2
24272      | 13,142 | 29,320  | 11       | 13         | 20          | 6           | 9         | 166      | 2
24289      | 13,092 | 29,320  | 11       | 14         | 19          | 6           | 7         | 172      | 2
24302      | 13,040 | 29,320  | 12       | 13         | 20          | 7           | 8         | 167      | 2
25930      | 13,086 | 29,320  | 12       | 13         | 20          | 8           | 8         | 162      | 2
28228      | 13,092 | 29,320  | 11       | 13         | 20          | 5           | 9         | 171      | 2
• There are no read syscalls larger than 100 MB
• Process 10759:
  – 1,247 read syscalls greater than 1 MB (8 in the 10MB – 100MB range)
    • 10 times the number for the other processes
  – Twice the number of reads of the other processes
24
Abaqus/Standard – Read Statistics
Process ID | Total Bytes   | Mean Bytes/call | Std Dev (Bytes) | Mean Abs Dev (Bytes) | Median Bytes/call | Median Abs Dev (Bytes) | Slowest read (secs)
6764       | 580,525,222   | 13,597.03       | 254,496.37      | 260,030.28           | 4096              | 12,005.78              | 11.3631
10759      | 2,313,198,661 | 31,762.93       | 477,423.33      | 487,689.97           | 4                 | 31,759.02              | 0.0446
24119      | 576,763,800   | 13,526.99       | 254,790.91      | 260,297.70           | 4096              | 11,929.49              | 1.5676
24272      | 578,160,440   | 13,543.55       | 254,445.30      | 259,963.03           | 4096              | 11,951.97              | 5.7766
24289      | 582,451,173   | 13,658.78       | 254,761.43      | 260,340.48           | 4096              | 12,061.11              | 5.9373
24302      | 579,003,926   | 13,595.15       | 254,790.07      | 260,331.58           | 4096              | 11,991.69              | 5.6893
25930      | 572,938,928   | 13,439.49       | 253,880.60      | 259,321.27           | 4096              | 11,841.53              | 1.9479
28228      | 583,322,316   | 13,679.20       | 255,137.85      | 260,728.20           | 4096              | 12,081.61              | 127.1235
• The average bytes per read syscall for rank-0 is much larger than for the other processes
  – 31KB vs. 13.6KB
  – The standard deviation is also much larger
• The median bytes per read for rank-0 is extremely small (4 bytes)
• One VERY slow read
  – Process 28228: 127 seconds (almost 2 minutes!)
25
Abaqus/Standard – IOPS Statistics
Process ID | Avg Write IOPS | Avg Read IOPS | Avg Total IOPS | Max Write IOPS | Max Write IOPS Time (secs) | Max Read IOPS | Max Read IOPS Time (secs) | Max Total IOPS | Max Total IOPS Time (secs)
6764       | 175   | 355 | 480   | 6,812  | 835 | 15,721 | 13 | 26,151 | 480
10759      | 1,807 | 475 | 2,232 | 31,092 | 750 | 20,540 | 9  | 32,801 | 9
24119      | 212   | 398 | 588   | 7,234  | 788 | 12,929 | 14 | 21,975 | 14
24272      | 200   | 391 | 552   | 6,642  | 788 | 20,391 | 11 | 33,157 | 11
24289      | 187   | 387 | 534   | 612    | 850 | 20,391 | 11 | 33,157 | 11
24302      | 205   | 394 | 578   | 6,671  | 828 | 15,083 | 14 | 25,206 | 14
25930      | 207   | 406 | 595   | 6,683  | 828 | 12,181 | 14 | 20,853 | 14
28228      | 186   | 355 | 501   | 6,682  | 850 | 16,493 | 13 | 27,310 | 13
• Run time is about 850 secs
  – Some IOPS peaks occur at the beginning (read) and some near the end (write)
• Max Write IOPS occurs for rank-0 (31,092)
  – Its average is just 1,807
• Max Read IOPS occurs for several processes (~20,000)
  – Averages are just about 400
• Max Total IOPS occurs for several processes (~33,000)
  – Averages are about 600, but for rank-0 it is 2,232
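Peak and average IOPS figures like these can be derived by bucketing the syscall timestamps into one-second bins. A sketch (the function name and the seconds-since-run-start timestamp convention are assumptions):

```python
from collections import Counter

def iops_history(timestamps):
    """Bucket syscall timestamps (seconds since run start) into
    1-second bins; return (peak IOPS, average IOPS over the run)."""
    per_sec = Counter(int(t) for t in timestamps)
    duration = max(per_sec) + 1          # run length in whole seconds
    peak = max(per_sec.values())         # busiest single second
    avg = len(timestamps) / duration     # syscalls per second overall
    return peak, avg
```

The gap between peak and average (31,092 vs. 1,807 for rank-0's writes) is what makes the peaks easy to miss if only averages are reported.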
26
Abaqus/Standard – File Details
Filename | Read Bytes | Avg. Read Bytes/sec | Write Bytes | Avg. Write Bytes/sec
/application/Simulia/benchmark/s4b_dellamd_64core.prt | 0 | 150,920,784.99 | 75,854,689 | 353,233,129.06
/dev/infiniband/uverbs0 | 0 | 66,002,083.33 | 106,608 | 212,763.29
/sys/class/infiniband_verbs/uverbs0/abi_version | 2 | 889,087.30 | 0 | 0
/proc/cpuinfo | 6,248 | 1,413,552.42 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.use.1 | 0 | 0 | 54,492 | 22,686,914.50
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.dat.1 | 0 | 68,590,373.73 | 195 | 3,855,281.49
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.msg.1 | 0 | 0 | 249 | 3,803,347.58
/application/Simulia/benchmark/s4b_dellamd_64core.023 | 42,143,900 | 359,660,376.97 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.sim | 1,733,172 | 91,525,283.73 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.mdl.1 | 154,123,640 | 801,527,477.80 | 1,332,088 | 325,424,705.64
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core_tmp_.1 | 2,048 | 208,035,387.05 | 2,280 | 36,897,650.79
/storage/root_s4b_dellamd_64core_10591/SimTemp.8u95Fh | 0 | 146,285,714.29 | 13,392 | 109,557,851.00
/storage/root_s4b_dellamd_64core_10591/fortK2YecK | 0 | 21,387,121.21 | 42,041,344 | 720,370,419.59
/application/Simulia/benchmark/s4b_dellamd_64core.stt.1 | 203,076,124 | 303,339,180.30 | 362,530,784 | 599,991,946.68
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.cax.1 | 87,432 | 222,829,369.91 | 16,604 | 366,313,402.97
/proc/6766/status | 762 | 22,524,755.28 | 0 | 108,140,241.03
• Process 6764 (non-rank-0)
• The /application directory is NFS shared
• The /storage directory is local
• Little writing is done to local storage (42MB is the largest)
• Very little reading is done from local storage (2KB)
• Most read() and write() activity is done over NFS
• s4b_dellamd_64core.stt.1 appears to do the most IO
27
Abaqus/Standard – File Details
Filename | Read Bytes | Avg. Read Bytes/sec | Write Bytes | Avg. Write Bytes/sec
/application/Simulia/benchmark/s4b_dellamd_64core.prt | 0 | 116,026,282.15 | 75,854,689 | 360,865,013.73
/dev/infiniband/uverbs0 | 0 | 41,178,259.38 | 124,008 | 208,728.97
/sys/class/infiniband_verbs/uverbs0/abi_version | 2 | 546,236.17 | 0 | 0
/proc/cpuinfo | 6,248 | 1,358,478.76 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.use | 0 | 0 | 58,686 | 15,626,030.44
/etc/protocols | 4,096 | 120,230,505.89 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.dat | 0 | 372,363,636.36 | 4,815 | 7,236,145.84
/application/Simulia/benchmark/s4b_dellamd_64core.msg | 0 | 0 | 16,203 | 8,754,438.58
/application/Simulia/benchmark/s4b_dellamd_64core.odb | 47,768,316 | 351,966,985.98 | 492,464,040 | 272,677,237.48
/application/Simulia/benchmark/s4b_dellamd_64core.023 | 42,143,900 | 381,006,383.53 | 0 | 0
/application/Simulia/benchmark/s4b_dellamd_64core.sim | 2,039,284 | 118,308,073.20 | 306,212 | 243,821,651.09
/application/Simulia/benchmark/s4b_dellamd_64core_0.mdl | 154,036,199 | 763,215,988.75 | 1,332,088 | 603,230,471.89
/application/Simulia/benchmark/s4b_dellamd_64core_0.stt | 202,671,800 | 845,758,301.20 | 361,935,344 | 588,916,616.78
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core_0.cax | 87,432 | 401,370,493.04 | 16,604 | 264,176,077.10
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64coreTmp0.sim | 2,048 | 208,071,565.50 | 2,280 | 36,841,904.76
/storage/root_s4b_dellamd_64core_10591/SimTemp.Vfyuto | 0 | 186,181,818.18 | 13,392 | 107,492,574.09
/storage/root_s4b_dellamd_64core_10591/fort1YOKEw | 0 | 21,115,327.38 | 42,041,344 | 742,441,432.12
/application/Simulia/benchmark/s4b_dellamd_64core.sta | 0 | 20,822,534.14 | 438 | 251,623,075.48
/proc/10763/status | 767 | 23,167,553.69 | 0 | 213,893,079.01
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core_0.cax | 0 | 0 | 88,432 | 152,445,198.22
/storage/root_s4b_dellamd_64core_10591/s4b_dellamd_64core.scr | 0 | 0 | 168,326,652 | 97,006,113.13
• Process 10759 (rank-0?)
• The /application directory is NFS shared
• The /storage directory is local
• Some write IO is local (s4b_dellamd_64core.scr)
• Some write IO (and most of the read IO) is over NFS
28
Abaqus/Standard – Local vs. Central IO
Process ID | Local Read (MB) | Local Write (MB) | Central Read (MB) | Central Write (MB)
6764       | 0.089 | 42.07  | 357.20 | 363.92
10759      | 0.089 | 210.49 | 404.47 | 855.81
24119      | 0.089 | 42.07  | 353.45 | 367.37
24272      | 0.089 | 42.07  | 359.16 | 364.01
24289      | 0.069 | 42.07  | 359.16 | 368.64
24302      | 0.089 | 42.07  | 355.71 | 362.14
25930      | 0.089 | 42.07  | 349.63 | 357.56
28228      | 0.069 | 42.07  | 360.03 | 372.04
• “Local” IO is done on the drive in the node
• “Central” IO is done to the NFS server
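Splitting IO totals into local vs. central reduces to matching each file path against the mount points – here /storage is the node-local drive and /application is the NFS export, per the slides. A sketch (the function and constant names are our own):

```python
# Mount map for this cluster, per the slides: /storage is the
# node-local drive, /application is the NFS export.
LOCAL_PREFIXES = ("/storage",)
CENTRAL_PREFIXES = ("/application",)

def classify_io(per_file_bytes):
    """Split {path: bytes} totals into local, central, and other IO."""
    local = central = other = 0
    for path, nbytes in per_file_bytes.items():
        if path.startswith(LOCAL_PREFIXES):
            local += nbytes
        elif path.startswith(CENTRAL_PREFIXES):
            central += nbytes
        else:
            other += nbytes  # /proc, /dev, libraries, sockets
    return local, central, other
```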
29
Abaqus/Standard – Shared File IO
File | Total Reads (MB) | Total Writes (MB)
s4b_dellamd_64core.prt | 0      | 227.56
s4b_dellamd_64core.023 | 337.15 | 0
s4b_dellamd_64core.sim | 14.17  | 0.306

• These three files are the only shared files during the run
• They are located on the NFS server
• All processes read/write to these files
• Heavy read: the *.023 file
• Heavy write: the *.prt file
30
Abaqus/Standard – Time Histories
• The next slides present time histories of:
  – Write and read syscall size
    • Amount of data in each system call
  – Write IOPS and read IOPS
  – Write throughput and read throughput
    • MB/s for each syscall
  – Read and write cumulative syscall size (MB)
  – File offsets (bytes)
• The charts can present useful visual information about the IO behavior over time
• Two MPI processes (ranks) are shown:
  – 10759: probable rank-0
  – 6764: example of non-rank-0 IO
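File-offset histories like the ones on the following slides can be extracted from the lseek() calls in an strace log. A simplified sketch (the regex and log format are assumptions, and real traces also move the offset implicitly through read/write, which this ignores):

```python
import re

# Matches lseek lines from `strace -ttt`, e.g.
#   1270.000001 lseek(17, 8192, SEEK_SET) = 8192
LSEEK_RE = re.compile(r"^(?P<ts>[\d.]+)\s+lseek\((?P<fd>\d+),\s*(?P<off>-?\d+),")

def offset_history(lines, fd):
    """(timestamp, requested offset) pairs for one file descriptor --
    the raw data behind the file-offset-history plots."""
    out = []
    for line in lines:
        m = LSEEK_RE.match(line)
        if m and int(m.group("fd")) == fd:
            out.append((float(m.group("ts")), int(m.group("off"))))
    return out
```

Plotting offset against timestamp for a single descriptor reproduces the "file pointer movement" views; dense vertical scatter indicates the head-thrashing pattern the profile is looking for.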
31
Rank-0 Plots
32
Write syscall and Write IOPS History for 10759
• Largest amount written in a short time
• Write IOPS peak is 31,092
• Average write syscall = 6.4KB; median write syscall = 2KB
• Solver time?
• Write IOPS average is 750
33
Write syscall call distribution - 10759
• Vast majority of write syscalls are very small (2KB)
• Very few large write syscalls
• Writes are very small: average = 6.4 KB, median = 2KB
34
Write Throughput and Write IOPS History for 10759
• Peak throughput is a bit more than 1750 MB/s
• Solver time?
• Four peaks: 1. at the beginning, 2. just before the solver, 3. just after the solver, 4. at the very end
• Two write throughput peaks match peaks in write IOPS: 1. just after the solver, 2. at the very end
35
Cumulative Write syscall History for 10759
• Some writing at the beginning of the run (1GB), just before the solver starts
• More writing about half-way through the run (about 500MB), just after the solver finishes
• Final writing (about 600MB)
• Solver time?
36
Read syscall and Read IOPS History for 10759
• Solver time?
• Very few “larger” reads, and only at the beginning
• Read IOPS peak is 20,540
• Reads are very small: average is 31.7KB, median is 4 bytes!
• Average read IOPS is 475
37
Read syscall call distribution - 10759
• Vast majority of read syscalls are very small (4 bytes!)
• Very few large read syscalls
• Reads are very small: average is 31.7KB, median is 4 bytes!
38
Read Throughput History and Read IOPS for 10759
• Peak read IOPS is 20,540
• Solver time?
• Average read IOPS is 475
• Peak read throughput occurs at the end: ~3,700 MB/s
• Four read throughput peaks: 1. at the very beginning, 2. just before the solver, 3. just after the solver, 4. at the very end
39
Cumulative read syscall history for 10759
• Some reading at the beginning of the run (~650MB)
• Solver time? Large number of very small reads
• Very large amount of reading just prior to the solver: about 1.4GB
40
File offset history – 10759 (fort1YOKEw)
• Lots of file pointer movement at the beginning
• This file is a local scratch file and does a great deal of writing at the beginning of the run
41
File offset history – 10759 (s4b_dellamd_64core.scr)
• Solver time? Lots of file pointer movement after the solver is finished
• Lots of file pointer movement at the very end of the run
• This is a local scratch file that does about 168MB of writing
42
File offset history – 10759 (s4b_dellamd_64core.023)
• Lots of file pointer movement at the beginning
• This is a central file that does about 42MB of reading
43
File offset history – 10759 (s4b_dellamd_64core.odb)
• Lots of file pointer movement after the solver is finished
• Solver time?
• Central file that does about 47MB of reads and 492MB of writes
44
File offset history – 10759 (s4b_dellamd_64core.odb) Detailed view
• Lots of file pointer movement
• Detailed view of the file offsets near the end of the run
45
File offset history – 10759 (s4b_dellamd_64core.scr)
• Small amounts of file pointer movement
46
File offset history – 10759 (s4b_dellamd_64core.mdl)
• Solver time?
• Lots of file pointer movement after the solver is finished
• Central file that does 154MB of reading
47
Non-Rank-0 Plots
48
Write syscall and Write IOPS History for non-rank-0 (6764)
• Most write syscalls are very small, but there are larger ones
• Peak write IOPS is 6,812
• Average write IOPS is 175
49
Write syscall call distribution – non-Rank-0 (6764)
• Vast majority of write syscalls are very small (4KB!)
• Very few large write syscalls
50
Write throughput and Write IOPS History fornon-Rank-0 (6764)
• Max write throughput is about 2,300 MB/s
• Peaks in throughput occur at various times
• Notice that peak throughput does not mean peak write IOPS
51
Cumulative Write syscall History for non-Rank-0 (6764)
• Fair amount of writing at the beginning (around 200MB)
• More writing at the end: about 450 MB
52
Read syscall and Read IOPS History for non-Rank-0 (6764)
• Lots of reading at the beginning, just prior to the solver, and just after the solver finishes
• Solver time?
• Peak read IOPS is 15,721
• Average read IOPS is 355
53
Read syscall call distribution – non-Rank-0 (6764)
• The majority of reads are very small: average = 13.6 KB, median = 4KB
• Very few large reads
54
Read Throughput History and Read IOPS fornon-Rank-0 (6764)
• Max read throughput is about 2,700 MB/s
• Small amounts of reading toward the end of the run
• Solver time?
• Peak read IOPS does not match peak throughput
55
Cumulative read syscall history for non-Rank-0 (6764)
• Solver time?
• Lots of reading at the beginning (160 MB)
• Lots of reading after the solver (~350MB)
56
File offset history – 6764 (fortK2YecK)
• Lots of file pointer movement at the beginning
• This file is a local scratch file and does a great deal of writing at the beginning of the run (42MB)
57
File offset history – 6764 (s4b_dellamd_64core.023)
• Lots of file pointer movement at the beginning
• This is a central scratch file and does a great deal of reading at the beginning of the run (42MB)
58
File offset history – 6764 (s4b_dellamd_64core.mdl.1)
• Lots of file pointer movement after the solver finishes
• This is a central file and does a great deal of reading at the beginning of the run (154MB)
59
File offset history – 6764 (s4b_dellamd_64core.prt)
• Lots of file pointer movement at the very end
• This is a central file and does a great deal of writing at the end of the run (75MB)
60
File offset history – 6764 (s4b_dellamd_64core.stt.1)
• Lots of file pointer movement after the solver finishes
• Lots of file pointer movement at the very end
• This is a central file and does a great deal of reading and writing during the run (203MB of reading and 362MB of writing)
61
File offset history – 6764 (s4b_dellamd_64core.stt.1) Detailed View
• Lots of file pointer movement after the solver finishes
62
Summary
63
Abaqus/Standard – IO Profile Summary
• A great deal of time is spent doing IO compared to the total run time
  – 27% – 46% of the time involves IO
  – This is over NFS (GigE) with a single drive on the server
  – Each node has 1 local drive
• All the nodes do a great deal of IO, but one process seems to do more than the others (10759)
  – Possible rank-0 process
    • 10 times more write() calls than the other processes
    • Does all the mkdir() calls
    • Does 2 times more lseek() calls than the other processes
    • One lseek() for every 10 writes
64
Abaqus/Standard – IO Profile Summary cont’d
• Rank-0 process:
  – 642 write syscalls are 1MB – 100MB
  – 258,881 write syscalls are 1KB – 8KB
  – 48,092 write syscalls are 0 – 1KB
  – 1,247 read syscalls are 1MB – 100MB
  – 52,780 read syscalls are 1KB – 8KB
  – 18,701 read syscalls are 0 – 1KB
• Non-rank-0 processes:
  – Around 600 write syscalls are 1MB – 100MB
  – ~19,400 write syscalls are 1KB – 8KB
  – ~5,700 write syscalls are 0 – 1KB
  – ~170 read syscalls are 1MB – 100MB
  – 29,320 read syscalls are 1KB – 8KB
  – ~13,100 read syscalls are 0 – 1KB
65
Abaqus/Standard – IO Profile Summary cont’d
• Rank-0 process:
  – Write() syscalls:
    • Average is 6.46KB +/- 232KB
    • Median is 2KB +/- 5KB
  – Read() syscalls:
    • Average is 31.76KB +/- 477KB
    • Median is 4 bytes +/- 31.76KB (that isn’t a typo)
• Non-rank-0 processes:
  – Write() syscalls:
    • Average is 44.8KB +/- 390KB
    • Median is 4KB +/- 42.4KB
  – Read() syscalls:
    • Average is 13.6KB +/- 254.7KB
    • Median is 4KB +/- 12KB
66
Abaqus/Standard – IO Profile Summary cont’d
• Rank-0 has much higher IOPS than the other processes
  – Particularly writes
• Rank-0:
  – Write IOPS:
    • Peak write IOPS is 31,092
    • Average write IOPS is 1,807
  – Read IOPS:
    • Peak read IOPS is 20,540
    • Average read IOPS is 475
  – Total IOPS:
    • Peak total IOPS is 32,801
    • Average total IOPS is 2,232
• Non-rank-0:
  – Write IOPS:
    • Peak write IOPS is ~6,600
    • Average write IOPS is ~200
  – Read IOPS:
    • Peak read IOPS is ~20,000
    • Average read IOPS is ~400
  – Total IOPS:
    • Peak total IOPS is ~33,000
    • Average total IOPS is ~550
67
Abaqus/Standard – IO Profile Summary cont’d
• The overall run happens in several phases:
  – Up to 100-150 seconds:
    • A great deal of reading and a fair amount of writing
  – 150-400 seconds:
    • Solver phase? Very little IO
  – 400-740 seconds:
    • A fair amount of IO, primarily writing (intermediate output?)
    • Fair amount of lseek activity
    • Lots of smaller reads and writes
  – 740-760 seconds:
    • A great deal of writing (final output?)
68
Abaqus/Standard - Summary
• This particular case of Abaqus/Standard does a great deal of IO (27% – 46% of the total time is spent doing IO)
  – A mixture of IO that is local and over the central file system
  – For this case the central file system was NFS mounted
• All ranks do a great deal of IO
  – There is a recognizable rank-0 process (10759)
• Most of the IO is write
  – Rank-0 does 309,147 writes (83.7% are 1KB – 8KB)
• Still a reasonable amount of read:
  – Rank-0 does 72,827 read syscalls (72.5% are 1KB – 8KB)
• IOPS performance can be important
  – Most of the syscalls are in the 1KB – 8KB range
69
File Patterns
• Most of the IO is done to a central file system
  – Rank-0 does a fair amount of local IO
• N-N IO dominates
  – Each rank does about ~350 MB of reads and ~362 MB of writes
• Only 3 shared files (N-1)
  – 337 MB of reads (total)
  – 227 MB of writes (total)
• Overall file patterns are dominated by N-N
  – Almost all of it is to the central file system
• Some N-1 file patterns
  – About the same size as the N-N files
70
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “as-is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.