TRANSCRIPT
National Aeronautics and Space Administration
www.nasa.gov

Automatically Encapsulating HPC Best Practices Into Data Transfers
Paul Z. Kolano, NASA Advanced Supercomputing Division
NASA High End Computing Capability
Outline of Presentation
• Introduction
• Transport tuning and selection
• Global resource management
• File system optimization
• Conclusions
Introduction
• Data transfers are part of life in HPC environments
  - Finite storage capacity
    • Transfer to cheaper tape storage
      - Back up existing data
      - Make room for new data
    • Transfer from tape storage to reprocess old data
    • Transfer between file systems to fix imbalances
  - Finite computational capacity
    • Transfer from off-site systems with cheaper pre-processing
    • Transfer to off-site systems for cheaper post-processing
Introduction (cont.)
• User transfer concerns
  - Ease of use, integrity, turnaround time
• Administrator and owner transfer concerns
  - Environment stability, cost effectiveness
• These items can conflict with each other
  - Easy-to-use tools or those ensuring integrity may not be fast
  - Easiest file structure may degrade tape performance
  - Fastest turnaround time may lead to resource exhaustion
• Takes an HPC expert to reconcile conflicts
  - Understands and applies accepted best practices to achieve fast and efficient verified transfers without impact on stability
Goal
• Let scientists focus on science without wading through documentation on transfer best practices
  - Specify transfers in the simplest, naive fashion
    • Source and destination
• Provide a tool to perform the transfer as if the scientist were an HPC expert
  - Choose appropriate tools and optimize for best performance
  - Fully utilize available resources without starving other users
  - Manage files appropriately by file system type to ensure efficient access by later and/or behind-the-scenes processes
Shift: Self-Healing Independent File Transfer
• Satisfies user requirements
  - Simple cp/scp syntax for local/remote transfers
  - End-to-end integrity via checksums and sanity checks
  - High speed via transport selection/tuning and automatic parallelization
• Satisfies administrator and owner requirements
  - Helps prevent resource exhaustion that leads to environment instability
    • Global throttling to allocate resources fairly
    • Load balancing to avoid highly loaded hosts
    • Automatic striping to avoid imbalanced disk utilization
  - Helps prevent wasted resources that impact cost effectiveness
    • Allows easy utilization of idle resources
    • Reduces wasted CPU cycles during jobs due to inefficient disk I/O
    • Prevents issues leading to inefficient tape I/O
Shift: Self-Healing Independent File Transfer (cont.)
• Used in production at NASA's Advanced Supercomputing Facility for over 3.5 years
  - User transfers across local/LAN/WAN
  - Disaster recovery backups to/from remote organizations
  - Rebalancing entire multi-PB Lustre file systems
  - etc.
• Facilitated transfers of over 14 PB in past year
  - 8 PB local transfers
  - 4 PB LAN transfers
  - 2 PB WAN transfers
• "I used to hate archiving data - now I almost look for a reason to archive something" – Shift user
Shift Interface
> shiftc --create-tar /nobackup/user1/dataset1 archive1:dataset1.tar
Shift id is 36
Detaching process (use --status option to monitor progress)
> shiftc --status
 id | state |    dirs     |    files    |   file size   | date  |    run     |  rate
    |       |    sums     |    attrs    |   sum size    | time  |    left    |
 ---+-------+-------------+-------------+---------------+-------+------------+---------
 34 | error |     0/0     | 23121/23121 | 39.5TB/39.5TB | 10/02 | 2d14h32m5s | 175MB/s
    |       | 46222/46242 | 23111/23121 |   79TB/79TB   | 10:26 |            |
 35 | done  |     1/1     |  5131/5131  |  303GB/303GB  | 10/05 |   1m35s    | 3.19GB/s
    |       | 10262/10262 |  5132/5132  |  605GB/605GB  | 12:28 |            |
 36 | run   |    24/24    | 26656/26656 | 1.78TB/1.78TB | 10/06 |  2h48m37s  | 176MB/s
    |       | 15463/53312 |  10/26684   | 1.02TB/3.56TB | 12:11 |  1h47m55s  |
Shift Components
• Command-line client
  - Performs file operations and reports results to manager
• Command-line manager
  - Invoked by clients to track operations and parcel out work

[Diagram: Shift architecture: a client host runs applications and Shift client processes (C1..Cj) over the client OS and file system; an interconnect links it to a remote host running applications and remote processes (R1..Rk) with the Shift manager(s) over the remote OS and file system]
Transport Tuning and Selection
• Shift includes built-in local/remote transports and checksum capabilities
  - Fully functional out of the box
  - Perl-based equivalents of cp, sftp, fish, m(d5)sum
• Shift calls higher performance tools when available
  - bbcp, bbftp, gridftp, mcp, rsync, msum
  - Knows how to construct command lines and parse output
• Tune transports for optimal performance
• Select transports based on transfer characteristics
Transport Tuning
• TCP-based transports
  - bbcp, bbftp, gridftp
  - Choose TCP window size
• Transports with internal parallelism
  - TCP streams (bbcp, bbftp, gridftp) or threads (mcp, msum)
  - Choose appropriate level of parallelism
• SSH-based transports
  - fish, rsync, sftp-perl
  - Choose fastest SSH cipher and MAC algorithm
TCP Window Size Tuning
• TCP window is the amount of data a sender or receiver is willing to buffer while waiting for acknowledgment
• Optimal value is the bandwidth-delay product (BDP)
  - bandwidth * round-trip time
• Constrained by configured operating system limits
  - e.g. Linux net.core.[wr]mem_max
  - A single stream only achieves full bandwidth if the limit is at least the BDP
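The window-size choice described above can be sketched as a small calculation (a minimal illustration, not Shift's actual code; the function name and the example figures are assumptions):

```python
def tcp_window_bytes(bandwidth_bps, rtt_seconds, os_limit_bytes):
    """Choose a TCP window: the bandwidth-delay product, capped at the OS limit.

    bandwidth_bps:  estimated path bandwidth in bits per second
    rtt_seconds:    round-trip time measured via ping
    os_limit_bytes: configured kernel maximum (e.g. net.core.wmem_max)
    """
    bdp_bytes = int(bandwidth_bps / 8 * rtt_seconds)
    return min(bdp_bytes, os_limit_bytes)

# A 10 Gbit/s path with 50 ms RTT needs a ~62.5 MB window,
# but a 16 MB kernel limit caps what one stream can use.
print(tcp_window_bytes(10e9, 0.050, 16 * 1024 * 1024))  # 16777216
```

The cap is what motivates the stream-count tuning on the following slides: whatever the BDP demands beyond the OS limit has to come from parallelism.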
TCP Window Size Tuning (cont.)
[Charts: LAN and WAN transfer performance (MB/s) vs. TCP window size (default, 1, 4, 16, 64, 100 MB) for bbcp, bbftp, and gridftp]
• Shift determines latency using icmp/echo/syn ping
• Shift guesses bandwidth based on network type and client hardware if not given via --bandwidth
  - Bandwidth is difficult to compute a priori
• Chooses window size up to operating system limit
Transport Parallelism Tuning
• Number of streams in TCP-based transports
  - Overcome improperly configured TCP window maximums
  - Overcome improperly specified TCP windows
  - Overcome interference by cross traffic
• Number of threads in mcp and msum
  - Take advantage of excess resource capacity on one host
Transport Parallelism Tuning (cont.)
[Charts: LAN and WAN transfer performance (MB/s) vs. number of TCP streams (1-8) for bbcp, bbftp, and gridftp]
• Shift chooses streams based on bandwidth available beyond the operating system window limit
• A minimum value can be configured for LAN/WAN to help overcome cross traffic
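The stream selection above can be approximated as dividing the BDP by the per-stream window the OS allows (a hedged sketch with assumed names and caps; Shift's real heuristic may differ):

```python
import math

def tcp_streams(bandwidth_bps, rtt_seconds, os_limit_bytes, minimum=1, maximum=8):
    """Pick enough parallel TCP streams to cover the bandwidth-delay product
    when each single stream is capped by the OS window limit."""
    bdp_bytes = bandwidth_bps / 8 * rtt_seconds
    needed = math.ceil(bdp_bytes / os_limit_bytes)
    # Respect a configured floor (e.g. to ride out cross traffic) and a sane cap.
    return max(minimum, min(needed, maximum))

# 10 Gbit/s, 50 ms RTT, 16 MB kernel limit: the BDP is ~62.5 MB, so 4 streams.
print(tcp_streams(10e9, 0.050, 16 * 1024 * 1024))  # 4
```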
Transport Parallelism Tuning (cont.)
[Chart: local copy performance (MB/s) vs. mcp threads (1-64) on Haswell (12-core, 4x FDR), Ivy Bridge (10-core, 4x FDR), Sandy Bridge (8-core, 4x FDR), and Westmere (6-core, 4x QDR)]
• Threads can be centrally configured on the manager
• High thread counts can induce high load on shared resources
  - Intentionally set lower than optimal at NAS due to high load on archive front-ends
SSH Cipher and MAC Algorithm Tuning
• SSH-based transports use an SSH pipe to communicate
  - Performance directly correlated to SSH performance
• SSH does not expose TCP window settings
  - HPN-SSH patches can be used for better window handling
• Main SSH tuning parameters available
  - Encryption (cipher) algorithm
  - Message authentication code (MAC) algorithm
SSH Cipher and MAC Algorithm Tuning (cont.)
[Charts: LAN transfer performance (MB/s) of fish by SSH cipher (aes128/192/256-cbc, aes128/192/256-ctr, arcfour, arcfour128, arcfour256, blowfish-cbc, cast128-cbc, chacha20, 3des-cbc; with umac-64 MAC) and by SSH MAC algorithm (hmac-md5, hmac-md5-96, hmac-ripemd160, hmac-sha1, hmac-sha1-96, hmac-sha2-256, hmac-sha2-512, umac-64, umac-128; with arcfour256 cipher)]
• Shift allows preferred cipher/MAC order to be centrally configured
• Availability checked on client host before transfer
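Matching a centrally configured preference order against what the client actually supports can be sketched like this (illustrative only; the policy list and function name are assumptions, and real availability would come from querying the SSH client, e.g. `ssh -Q cipher`):

```python
def pick_cipher(preferred_order, available):
    """Return the first centrally preferred cipher the client host supports."""
    supported = set(available)
    for cipher in preferred_order:
        if cipher in supported:
            return cipher
    return None  # fall back to the SSH default negotiation

# Hypothetical site policy favoring fast ciphers first.
policy = ["arcfour256", "aes128-ctr", "aes256-ctr"]
client = ["aes256-ctr", "aes128-ctr", "3des-cbc"]
print(pick_cipher(policy, client))  # aes128-ctr
```

The same selection applies symmetrically to the MAC algorithm list.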
Transport Selection
• Different transports have different strengths and weaknesses
• Supporting multiple transports provides opportunity to vary transport across each batch of files within a transfer
Transport Selection (cont.)
[Charts: LAN and WAN transfer performance (MB/s) vs. file size (1 MB - 64 GB) for bbcp, bbftp, fish, gridftp, rsync, and sftp-perl, showing the SSH/non-SSH break-even point]
• Using a single transport does not achieve maximum performance
• Shift's support of multiple transports allows it to use the optimum transport for each batch of files
Transport Selection (cont.)
[Chart: local transfer performance (MB/s) vs. file size (1 MB - 64 GB) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, and rsync]
• Transports traditionally used for remote transfers also perform well for local transfers
Transport Selection (cont.)
[Charts: performance (MB/s) of each transport (bbcp, bbftp, cp/sftp-perl, fish, gridftp, mcp, rsync) for 64 x 1 GB files and for 1000 x 4 MB files, across local, LAN, and WAN]
• Shift has configurable small file thresholds
  - Preferred local/LAN/WAN transports above/below the threshold
• Transport chosen using the average file size of each batch
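That batch-level selection could look roughly like the following (a sketch under assumed names and thresholds, not Shift's configuration format):

```python
def pick_transport(batch_sizes, small_threshold, small_transport, large_transport):
    """Choose a transport for one batch based on its average file size.

    SSH-based transports tend to win on many small files (lower per-file
    overhead), while parallel TCP transports win on large files.
    """
    avg = sum(batch_sizes) / len(batch_sizes)
    return small_transport if avg < small_threshold else large_transport

# Hypothetical LAN policy: fish below a 32 MB average, bbftp above.
batch = [4 * 2**20] * 1000  # 1000 x 4 MB files
print(pick_transport(batch, 32 * 2**20, "fish", "bbftp"))  # fish
```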
Global Resource Management
• Transfer parallelization
  - The only way to maximize performance in HPC environments is to use multiple resources at once
• Global throttling
  - If everyone tries to run at the maximum rate possible, everyone loses
• Load balancing
  - Avoid resource exhaustion on individual hosts
Transfer Parallelization
• A single host may have excess capacity
  - Transport does not have its own parallelism
  - A single client cannot fully utilize the host's resources
• The full HPC environment may have excess capacity
  - Single-system bottlenecks
  - Aggregate resources of many hosts
• Two complementary forms of parallelization
  - Multiple clients running on a single host
  - One or more clients running on multiple hosts
Transfer Parallelization (cont.)
[Charts: local transfer performance (MB/s) vs. parallel clients (1-16) and vs. parallel hosts (1-32) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, and rsync]
• Shift derives file system equivalency from mount information to determine parallelization candidates
• Shift can automatically parallelize all transfers using centrally configured clients/hosts values
• Can trivially run across allocated compute nodes
  - e.g. --host-file=$PBS_NODEFILE
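Deciding whether two paths live on the same file system (and so could be served by parallel clients on any host that mounts it) can be sketched with stat device IDs; Shift actually compares mount tables across hosts, so this single-host check is only an illustration:

```python
import os

def same_file_system(path_a, path_b):
    """True if both paths are on the same mounted file system,
    judged by the device ID reported by stat()."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Two paths under the same mount share a device ID.
print(same_file_system("/tmp", "/tmp"))  # True
```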
Global Throttling
• Support transfers at full speed when excess capacity exists
• Prevent resource exhaustion when systems are busy

[Chart: local write performance (MB/s) over time, unthrottled (5.34 GB/s) vs. throttled to a configured limit (3.62 GB/s)]
• Shift supports throttling limits on CPU %, disk usage, I/O rate, and network rate
  - Limits can be global, per user, per host, or per file system
• Transfers are directed to sleep until their average rate reaches (the transfer's fair share of) the user's fair share of the limit
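A transfer can be slowed to a target average rate by sleeping just long enough that bytes-so-far divided by elapsed-plus-sleep time equals the limit (a simplified sketch; the names and the single-transfer scope are assumptions, and the limit passed in would already be the transfer's fair share):

```python
def throttle_sleep(bytes_done, elapsed_seconds, limit_bytes_per_sec):
    """Seconds to sleep so the average rate drops back to the limit.

    Solves bytes_done / (elapsed + sleep) == limit for sleep; zero if
    the transfer is already at or below its share of the limit.
    """
    target_elapsed = bytes_done / limit_bytes_per_sec
    return max(0.0, target_elapsed - elapsed_seconds)

# 1000 MiB moved in 100 s is 10 MiB/s; to average 8 MiB/s, sleep 25 s.
print(throttle_sleep(1000 * 2**20, 100.0, 8 * 2**20))  # 25.0
```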
Load Balancing
• Single-host resource limitations impact the maximum transfer speed of all processes on that host
• Less loaded hosts have more resources available for new transfers

[Chart: I/O bandwidth utilization (MB/s) over time for a 32-way dd write and iperf IPoIB across 1-32 hosts]
Load Balancing (cont.)
• Shift load balances during host parallelization
  - Picks least loaded hosts when spawning clients
• Shift load balances during remote transfers
  - Remote host switched to least loaded
  - Overlap prevented between parallel clients

[Chart: LAN transfer performance (MB/s) vs. client hosts (1-32), disjoint remote hosts vs. same remote host]
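Picking the least loaded host when spawning clients reduces, in essence, to a minimum over load measurements (a toy sketch; how Shift measures load is not shown here, and the load-average metric and host names are assumptions):

```python
def least_loaded(host_loads):
    """Return the host with the lowest reported load average."""
    return min(host_loads, key=host_loads.get)

# Hypothetical 1-minute load averages gathered from candidate hosts.
loads = {"host1": 12.3, "host2": 2.1, "host3": 7.8}
print(least_loaded(loads))  # host2
```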
File System Optimization
• Different file systems have different idiosyncrasies
  - Can impact performance, stability, and operating cost
  - Not always directly visible to users
• Tape optimization
  - File extents impact tape write speed
  - Sequential retrieval results in inefficient tape movement
• Tar creation/extraction
  - Tape I/O is inefficient with small files
• Lustre striping
  - Too few stripes creates later I/O inefficiencies
Tape Optimization
• Tape-backed file systems have limited parallelism due to large, slow-moving physical components
• Inefficient access by one user can have major impact on all others
  - Large number of file extents degrades write speed
  - Sequential file retrieval inhibits minimization of internal tape movement
Tape Optimization (cont.)
[Charts: performance (MB/s) vs. source file extents (1-65536) for tape write and buffered/direct dd read, annotated "Tape Writes Degrade"; and extents produced vs. file size (1-64 GB) for bbcp, bbftp, cp-perl, fish, gridftp, mcp, and rsync]
• Shift can be configured to preallocate files below a given sparsity
  - Minimize extents for most common regular files
  - Minimize disk usage for large sparse files
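The preallocation decision can be sketched as a sparsity test: compare the bytes a file actually occupies on disk with its apparent size, and only preallocate reasonably dense files (illustrative; the threshold value and function names are assumptions):

```python
import os

def should_preallocate(apparent_size, allocated_bytes, max_sparsity=0.5):
    """Preallocate only files whose sparsity (unallocated fraction of the
    apparent size) is below the threshold, so large sparse files keep
    their holes while ordinary files get contiguous extents."""
    if apparent_size == 0:
        return False
    sparsity = 1.0 - min(1.0, allocated_bytes / apparent_size)
    return sparsity < max_sparsity

def file_should_preallocate(path):
    """Apply the policy to a real file; st_blocks is in 512-byte units."""
    st = os.stat(path)
    return should_preallocate(st.st_size, st.st_blocks * 512)
```

On Linux, the preallocation itself could then be done with os.posix_fallocate() on the destination before writing.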
Tape Optimization (cont.)
[Chart: tape retrieval time (s) vs. number of 1 GB files (1-64), sequential vs. batched]

• Files automatically retrieved from tape if offline when accessed
• May be accessed in a different order than stored on tape
• Shift automatically initiates a retrieval of all source files on SGI DMF file systems
• Retrieval initiated again for files in each batch in case files were pushed offline before being transferred
Tar Creation/Extraction
[Chart: tape write and read performance (MB/s) vs. source file size (1 MB - 64 GB)]

• Users prefer archiving data in original hierarchy
  - Each file accessed as needed
• Small files impact stability
  - Not migrated below a certain size, so they consume limited disk
  - Tape I/O is inefficient
• Normally use tar files
  - Tar is very slow (100 MB/s)
  - Must retrieve to view contents
  - No assurance of integrity
  - Retrieve entire tar for one file
  - May not have quota to create
Tar Creation/Extraction (cont.)
[Charts: tar creation and extraction performance (MB/s) vs. parallel hosts (1-16) for local (mcp), LAN (fish), WAN (fish), and GNU tar]
• Shift has built-in tar creation/extraction
  - Uses high performance transports and parallelism
  - Automatic creation of index files to see contents
  - Integrity verification of every file added/extracted
  - Automatic split of tar files at configurable size
  - Direct creation/extraction over network without use of quota
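Splitting an archive at a configurable size amounts to packing the file list into size-bounded groups, one per tar member file (a simplified sketch; a real tar splitter must also account for headers and padding, and the names here are assumptions):

```python
def split_into_tars(file_sizes, max_tar_bytes):
    """Greedily group files so each tar stays under the size limit.
    An oversized single file gets a tar of its own."""
    tars, current, current_bytes = [], [], 0
    for size in file_sizes:
        if current and current_bytes + size > max_tar_bytes:
            tars.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        tars.append(current)
    return tars

# Four files against a 100-byte limit.
print(split_into_tars([60, 30, 50, 120], 100))  # [[60, 30], [50], [120]]
```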
Lustre Striping
• Striping must be set before a file is first written
• Higher stripe count
  - More I/O bandwidth available for large files
  - More balanced distribution of large files over OSTs
• Lower stripe count
  - Less contention during metadata operations
  - Less wasted space for small files
• Striping can only be changed by copying the file
Lustre Striping (cont.)
[Charts: noncontiguous write and read performance (MB/s) vs. Lustre stripes (1-64) for 2-64 nodes]
• Shift automatically stripes files according to size
  - Increases write performance during parallel transfers
  - Increases read performance during later job access
• Reduces wasted CPU cycles due to I/O
• Preserves non-default striping when applicable
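A size-based striping policy can be sketched as one stripe per fixed chunk of file size, clamped to the number of OSTs (an illustration; the 1 GiB-per-stripe granularity and the names are assumptions, and a real implementation would apply the result with `lfs setstripe` before the file is first written):

```python
import math

def stripe_count(file_bytes, stripe_unit=2**30, max_stripes=64):
    """One stripe per stripe_unit of data, at least 1, at most the OST count."""
    if file_bytes <= 0:
        return 1
    return min(max_stripes, max(1, math.ceil(file_bytes / stripe_unit)))

# A 10 GiB file gets 10 stripes; a 4 KB file stays on a single OST.
print(stripe_count(10 * 2**30), stripe_count(4096))  # 10 1
```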
Conclusion
• Shift is an automated transfer tool that encapsulates HPC best practices to achieve better performance while preserving stability of HPC environments
  - Centralized configuration in the manager component allows policies to be changed globally across all clients
  - New best practices can be incorporated transparently without the user in the loop
• Shift is open source and available for download
  - http://shiftc.sourceforge.net
• Contact info
  - [email protected]
• Questions?