File Transfer Joon Kim PICSciE Research Computing Workshop Department of Chemistry 10/30/2017

Page 1: Data Transfer Chem Workshop - Princeton Universitychemlabs.princeton.edu › ... › 2017 › 11 › Data_Transfer_intro.pdf · 2017-11-01 · 1.Start data transfer using SCP at 10pm

File Transfer

Joon Kim, PICSciE

Research Computing Workshop, Department of Chemistry

10/30/2017

Introduction

Hyojoon (Joon) Kim, Ph.D.
Cyber Infrastructure Engineer

Role: Design and build a campus network and infrastructure that will better support our researchers

Our goal today

• Overview of widely used tools

• Learn about data transfer basics
• Pick the right tool for your job
• Know what to expect

• Learn about Globus

• Q&A

Data transfer

[Diagram: Data source → Data destination. The transfer between them is marked "This part is our topic!"]

Why do we care?

Without good practice, you will waste time and effort

1. Start data transfer using SCP at 10pm. Usually takes 10 hours.

2. At 2am, a brief 1-minute network outage occurs. The transfer job aborts.

3. Arrive at 8am. See the damage. Start again, which will take another 10 hours.

4. Lost a day of work.

Look again at step 1 ("Usually takes 10 hours"). Really? Are you SURE that’s the best?

We want you to focus on your research, not on transferring data around.

Data Transfer Tools

Transfer tools

What transfer tool do you use?

What we normally see

Secure Copy (SCP)

• Uses SSH for authentication and data transfer (TCP port 22)
• Unix-based systems (including Mac OS X): should have it by default
• Windows: WinSCP (https://winscp.net/eng/download.php)

rsync (rsync over SSH)

• Syncs files and directories between two endpoints.
• Good for running backups. Be careful with the “--delete” option: it *mirrors* directories, so files that no longer exist at the source are deleted at the destination.
• Unix-based systems (including Mac OS X): should have it by default
• Windows: CwRsync (https://itefix.net/cwrsync)
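
A minimal sketch of how an rsync-over-SSH transfer could be made to survive a brief outage like the one in the SCP story earlier: keep partially transferred files (`--partial`) and simply run the same command again. The host name, paths, retry count, and the `run`/`sleep` hooks are hypothetical placeholders for illustration, not part of the workshop material.

```python
import subprocess
import time


def build_rsync_cmd(src, dest, dry_run=False):
    """Build an rsync-over-SSH command that keeps partially transferred files."""
    cmd = ["rsync", "-a", "--partial", "-e", "ssh"]
    if dry_run:
        # Preview what would change; worth doing before ever adding --delete.
        cmd.append("--dry-run")
    return cmd + [src, dest]


def transfer_with_retry(src, dest, attempts=5, wait_s=60,
                        run=subprocess.call, sleep=time.sleep):
    """Re-run rsync until it exits with status 0 or the attempts are used up."""
    cmd = build_rsync_cmd(src, dest)
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return attempt              # number of attempts it took
        if attempt < attempts:
            sleep(wait_s)               # give the network time to recover
    raise RuntimeError(f"rsync still failing after {attempts} attempts")


if __name__ == "__main__":
    # Hypothetical source directory and destination host/path.
    print(build_rsync_cmd("/data/run42/", "user@dtn.example.edu:/scratch/run42/"))
```

Because rsync only sends what the destination is missing, a retry after an outage does not start again from zero, which is the main difference from a plain `scp` restart.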

File Transfer Protocol (FTP) and ‘Secure’ File Transfer Protocol (‘S’FTP)

• Widely used for file transfers
• SFTP runs over SSH and encrypts both login and data, so it is more secure than plain FTP. Use it if available.
• Unix-based systems (including Mac OS X): should have it by default
• Windows: FileZilla (https://filezilla-project.org)

These tools are okay, but not always

Good for:
• Great compatibility. Widely available.
• Small datasets. Quick transfers (< 15 mins).

Not good for:
• Large bulk data transfers.
• Transfers on unreliable connections and hosts.

When should you look for other solutions?

• Transfer is unreliable and takes a long time, to the point that it affects your workflow
• You are getting speeds lower than the following (for large datasets):
  • Within campus
    • 800 Mbps (megabits per second). 100 GB = ~800,000 Mb. Takes ~17 minutes
    • 3-4 Gbps if you have a 10G connection. 100 GB = ~800 Gb. Takes ~4 minutes
  • Between campus and outside
    • Hard to tell because of things out of our control
    • 200 Mbps to 5000 Mbps (5 Gbps). 100 GB = ~800,000 Mb. Takes ~1 hour at 200 Mbps
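
The estimates above are plain arithmetic: size in megabits divided by rate in megabits per second. A small sketch, assuming decimal units (1 GB = 1000 MB = 8000 Mb) and taking 3.5 Gbps as the midpoint of the 3-4 Gbps range; the function name is made up for illustration.

```python
def transfer_time_s(size_gb, rate_mbps):
    """Seconds needed to move size_gb gigabytes at rate_mbps megabits per second."""
    size_megabits = size_gb * 1000 * 8   # decimal units: 1 GB = 1000 MB = 8000 Mb
    return size_megabits / rate_mbps


if __name__ == "__main__":
    for label, rate_mbps in [("800 Mbps", 800), ("3.5 Gbps", 3500), ("200 Mbps", 200)]:
        minutes = transfer_time_s(100, rate_mbps) / 60
        print(f"100 GB at {label}: about {minutes:.0f} minutes")
```

The results land close to the slide's ~17 minutes, ~4 minutes, and ~1 hour. If your own transfer of a known size takes much longer than the estimate for your link, that is the signal to look at other tools.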

Data Transfer Basics

Data transfer: Overview

• The key players
  • Endpoints (source and destination)
  • Network (1/10/100 Gbps)
  • Transfer tool (SCP, SFTP, rsync over SSH, FTP, rsync)
  • Transfer settings (encrypted vs. not encrypted)

(Why) is my data transfer slow?

Where are the bottlenecks?

[Diagram: source and destination endpoints, each running scp or ftp, with the network in between]

1. Potential bottlenecks in endpoints

• CPU and memory
  • Higher clock speed is better than # of cores
  • E.g., 2 x Intel Xeon® Broadwell processor E5-2643 3.4 GHz (total 12 cores)
  • RAM: 32 GB or more recommended
• Disk I/O
  • Disk type (SATA HDD, SSD), configuration (RAID), and file system (ext4, GPFS, Lustre)
  • A decent server with HDD, ext4, and RAID performs around 4 Gbps
  • RAID is required to get > 1 Gbps (ref: http://fasterdata.es.net/data-transfer-tools/)
• Network Interface Controller (NIC)
  • Wireless: don’t expect much (mostly < 130 Mbps)
  • Wired: 1/10/40/100 Gbps
• Miscellaneous tuning
  • NIC tx buffer, enabling jumbo frames (9K instead of 1.5K), TCP/UDP tuning, …

You won’t get more than 1 Gb/s (125 MB/s) with your laptop, most desktops, and un-optimized servers
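
One way to read this slide: a transfer can go no faster than its slowest component. The sketch below illustrates that with the slide's 4 Gbps disk figure; the 1 Gbps NIC and 10 Gbps network values, and the function name, are assumptions chosen for the example.

```python
def end_to_end_gbps(**component_gbps):
    """Return (slowest component name, its rate): the ceiling for the whole transfer."""
    name = min(component_gbps, key=component_gbps.get)
    return name, component_gbps[name]


if __name__ == "__main__":
    # Disk figure from the slide (HDD + ext4 + RAID); NIC and network are assumed.
    name, gbps = end_to_end_gbps(disk=4.0, nic=1.0, network=10.0)
    # 1 Gb/s = 1000 Mb/s = 125 MB/s, matching the slide's conversion.
    print(f"bottleneck: {name} at {gbps} Gb/s = {gbps * 1000 / 8:.0f} MB/s")
```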

2. Potential bottlenecks in networks

• Bandwidth
  • E.g., 1/10/40/100 Gb/s
• Congestion
  • Time of day
• Distance
  • E.g., Round Trip Time (RTT) between Stanford and Princeton: ~80 ms
• “Things” along the way
  • Routers, switches, firewalls, NAT, security devices, …
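
Why distance matters can be shown with a short calculation: a single TCP stream can send at most one window of data per round trip, so its throughput is bounded by window size divided by RTT. The sketch below uses the ~80 ms RTT from the slide; the 64 KiB window (the classic TCP limit without window scaling) and the function names are assumptions for illustration.

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound for one TCP stream: at most one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


def window_needed_bytes(rate_gbps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return rate_gbps * 1e9 * (rtt_ms / 1000) / 8


if __name__ == "__main__":
    rtt_ms = 80  # Stanford to Princeton, from the slide
    print(f"64 KiB window: {max_throughput_mbps(64 * 1024, rtt_ms):.1f} Mbps at most")
    print(f"Window to fill 1 Gbps: {window_needed_bytes(1, rtt_ms) / 1e6:.0f} MB")
```

With a small window, a long path caps a single stream at a few Mbps no matter how fast the link is, which is one reason host tuning and multi-stream tools help on WAN transfers.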

Our network, their network, and networks in between

Parts we have limited visibility into, and no control over

3. Transfer tools: Single vs multi stream

• Single stream (one connection from src to dst): scp, ftp, rsync
• Multi stream (several parallel connections from src to dst): GridFTP, BBCP
  • Less packet loss (w/ dups)
  • Better utilization of link
  • Faster transfer speed
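
To make the multi-stream idea concrete, here is a sketch of the first step such a tool has to take: dividing a file into byte ranges so that each range can travel over its own connection. This only illustrates the concept; GridFTP and BBCP have their own protocols, and the function name is hypothetical.

```python
def split_into_ranges(file_size, streams):
    """Split file_size bytes into `streams` contiguous (offset, length) ranges."""
    if streams < 1:
        raise ValueError("need at least one stream")
    base, extra = divmod(file_size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        # The first `extra` ranges take one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    # A 10-byte "file" over 4 streams, small enough to check by hand.
    print(split_into_ranges(10, 4))
```

Each range can then be sent independently, so one slow or lossy connection delays only its own share of the file.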

Transfer tools: scp vs. GridFTP

(ref: http://fasterdata.es.net/data-transfer-tools/)

Downloading 500 GB of data: about 8 hours with scp, about 10 minutes with GridFTP

4. Transfer settings: Encryption

• Not fully encrypted: FTP, HTTP (even with password-based access), BBCP, BBFTP, Globus/GridFTP
• Encrypted control and encrypted data (✔ ✔): SCP, SFTP, rsync over SSH, Globus/GridFTP with encryption on, HTTPS

Data encryption provides the best security, but negatively impacts transfer speed

In summary

• Data transfer speed is affected by: endpoints, network, transfer tool, and transfer settings
• In most cases, your endpoint cannot handle much
  • Use a wired connection, and check your bandwidth as far as you can
  • Ask for a 10 Gbps or 40 Gbps connection if needed
• Use a better transfer tool if possible.
  • scp, (s)ftp, rsync, and wget/curl work fine for small transfers.
  • For large transfers (> GBs) over the WAN (RTT > 5 ms: beyond Philly or NYC), don’t expect much from scp, (s)ftp, rsync, wget/curl, or robocopy

Hey… that’s a lot of work

[Photo from Flickr (Billy Abbott), labeled: Endpoint, Network, Transfer Tool, Transfer Settings]

What we have at Princeton: Data Transfer Nodes

Data Transfer Nodes (DTNs)
• High-end servers
• 10 Gbps connections
• Tuned and optimized. RAID configured
• Good transfer tools (e.g., Globus)
• Supported

[Diagram: Test DTN, Tigress DTN, Lewis-Sigler DTN, PNI DTN, and CS DTN run Globus (GridFTP) and reach ESnet and Internet2 over two 10 Gbps links (Nov. 2015 - and Mar. 2016 -)]

Leverage our resources

• Contact us or your departmental staff
  • About existing departmental DTNs and the best workflow/dataflow
  • About having a departmental DTN

Data Transfer Nodes @ Princeton

• Test DTN
  • /u/<NetID>
  • Contact: [email protected]
• Tigress DTN (Princeton TIGRESS)
  • GPFS-based /tigress and scratch (tiger, della, orbital) disk space
  • Contact: [email protected]
• LSI DTN (Lewis-Sigler Institute Core DTN)
  • LSI local cluster storage, lab data volumes, and scratch spaces
  • Contact: [email protected]
• PNI DTN (Princeton Neuroscience Institute DTN)
  • All PNI `/Jukebox` volumes (Bucket and Scratch spaces) are available
  • Contact: [email protected]
• CS DTN (Computer Science Department DTN)
  • /n/fs/scratch/ and CS “project” storage spaces
  • Contact: [email protected]
• Physics DTN (Princeton Physics DTN)
  • Restricted to users who have an account on the Feynman cluster. /group, /scratch, and /mnt/<project> NFS file systems.
  • Contact: Vinod Gupta ([email protected]), Sumit Saluja ([email protected])

What is Globus?

• Fast, reliable data transfer and management service

• Uses GridFTP underneath

• Main advantages
  • Fast transfer speed (multi-stream)
  • Convenient to use: “Fire-and-Forget”

https://www.globus.org

How it works

1. Pick two endpoints

2. Submit transfer request at the Globus website

3. Dataset is transferred between the two endpoints
• Your machine’s web browser is just a “remote control”
• But your machine can be an endpoint too (more later)

4. Get a notification when the transfer is done

[Diagram: (1) select endpoints A and B and (2) request the transfer at Globus.org; (3) data moves between A and B; (4) get notification]

What you need to use Globus

• Account
  • Have a Princeton NetID? You’re set.
• Web browser
  • Chrome, Firefox, Safari, IE (Edge), etc.
  • E.g., use a smartphone to submit a transfer job (note: your phone is not transferring the dataset)

• Access to source and destination endpoints

Your peers are actively using it!

09/2017: Total of 257 transfers via Globus

Globus has good coverage

• Universities
  • Cornell, NYU, Yale, Johns Hopkins, Dartmouth, Purdue, Georgia Tech, Virginia Tech, UVA, Michigan, Indiana, Stanford, Berkeley, U Chicago, …
• Most of the DOE national labs
  • ESnet at CERN, ANL, LBNL, LLNL, LANL, ORNL, and PNNL
• National computing facilities
  • NERSC, NCSA, SDSC, …
• Federal agencies
  • NIH, USDA, NASA/JPL, USGS, …
• Over 50,000 registered endpoints at over 500 institutions worldwide

How to use it

[Screenshot of the Globus transfer page with two transfer settings annotated:]
• ‘On’ by default. ‘On’ is around 25% slower than ‘Off’
• ‘On’ is 10%-40% slower than ‘Off’

Globus Sharing

• Share a file or directory with other Globus users

Data Transfer via CLI

• Install the Globus CLI package (Linux or Cygwin on Windows)
  • https://docs.globus.org/cli/installation/
• Documentation
  • https://docs.globus.org/cli/
• Scheduling transfer jobs
  • $ echo "globus transfer $ep1:/share/godata/file1.txt $ep2:~/file1.txt --label 'CLI Test Transfer 2'" | at 13:07 oct 30 2017
  • Cron job: 0 2 * * * runglobus.sh > ~/runglobus.log 2>&1

Globus Connect Personal

• Make your own machine a Globus endpoint

• Mac, Windows, Linux

• You are the administrator for your own Globus endpoint

• Limited performance (# of streams), but convenient!

https://www.globus.org/globus-connect-personal

Use cases

Scenario 1
• Data on your laptop
• Globus Connect Personal

Scenario 2
• Data on a shared Windows PC
• Globus Connect Personal

Better for Scenario 2
• Data on a shared Windows PC
• SMB mount to a Linux server
• Globus Connect Server

[Diagram: each scenario transfers to a DTN (e.g., Tigress DTN); the last one goes through an SMB mount]

Learn more about Globus

• Globus documentation
  • https://docs.globus.org/how-to/
• Research Computing mini-course
  • “Transferring Large Data Sets, Plus Hands-on Tutorial with the Globus Transfer Tool” (Spring, 2018)
  • http://www.princeton.edu/researchcomputing/education/mini-courses/

Other tools

• BBCP
  • Free, easy to use, and comparable in performance to Globus
  • Mac OS X, Linux-based systems. SSH-based access control
  • Both endpoints need it installed, but it is easier to install and configure
  • Supported DTNs: Test DTN, Tigressdata
  • “$ bbcp -V -s 16 /local/path/largefile.tar remotesystem:/remote/path/largefile.tar”
  • More info
    • http://www.slac.stanford.edu/~abh/bbcp/
    • https://www.olcf.ornl.gov/kb_articles/transferring-data-with-bbcp/
• Fast Data Transfer (FDT)
  • Java-based tool from Caltech & CERN (http://monalisa.cern.ch/FDT/)
  • Can theoretically run on any operating system, including Windows
  • Needs a server-side instance running in server mode
  • “$ java -jar ./fdt.jar -ss 1M -P 10 -c remotehost.domain.uci.edu ~/file.633M -d /userdata/hjm”
  • Slower than BBCP or Globus

Other tools (continued)

• Aria2c (https://aria2.github.io)
  • Faster http/https, ftp, sftp, BitTorrent, and Metalink download tool (x4 faster)
  • Windows, Mac, Linux, Android app
  • http: “$ ./aria2c -x4 -k1M http://foo.com/foo.zip”
• LFTP
  • Faster download (get) speed (2-5x) for ftp, http, sftp, fish, torrent. Upload (put) speed is the same.
  • Seems to be compatible with normal FTP and HTTP servers.
  • Mac OS X, Linux-based systems (apt-get install lftp; yum install lftp; brew install lftp)
  • ftp: “$ lftp ftp://speedtest.tele2.net”
  • http: “$ lftp -e 'pget -n 5 foo.zip' http://foo.com/”
  • More info: http://lftp.tech
• HPN-patched SCP/SSH
  • https://www.psc.edu/hpn-ssh
  • Slower than BBCP or Globus

Summary and Takeaways

• Small and quick transfers: basic transfer tools

• Large bulk transfers: Data Transfer Nodes (DTNs) and Globus when possible

• Large bulk transfers, but Globus is unavailable: use the better tools mentioned above (e.g., BBCP)

• Know your environment and limitations
  • Endpoints, network, transfer tool, and transfer settings

• Speak up and reach out to us

Q&A

[email protected]

Computational Science and Engineering Support: [email protected]