File Transfer
Joon Kim, PICSciE
Research Computing Workshop, Department of Chemistry
10/30/2017
Introduction
Hyojoon (Joon) Kim, Ph.D.
Cyber Infrastructure Engineer
Role: Design & build a campus network and infrastructure that will better support our researchers
Our goal today
• Overview of widely used tools
• Learn about data transfer basics
  • Pick the right tool for your job
  • Know what to expect
• Learn about Globus
• Q&A
Data transfer
Data source
Data destination
This part is our topic!
Why do we care?
Without good practice, you will waste time and effort
1. Start a data transfer using SCP at 10 pm. It usually takes 10 hours.
2. At 2 am, a brief 1-minute network outage occurs. The transfer job aborts.
3. Arrive at 8 am. See the damage. Start again, which will take another 10 hours.
4. Lose a day of work.
1. Start a data transfer using SCP at 10 pm. It usually takes 10 hours.
Really? Are you SURE that’s the best?
We want you to
Focus on your research, not on transferring data around
Data Transfer Tools
Transfer tools
What transfer tool do you use?
What we normally see
Secure Copy (SCP)
• Uses SSH for authentication and data transfer (TCP port 22)
• Unix-based systems (including Mac OS X): Should have it by default
• Windows: WinSCP (https://winscp.net/eng/download.php)
rsync (rsync over SSH)
• Sync files and directories between two endpoints.
• Good for running backups. Careful with the “--delete” option: it *mirrors* directories, so files missing from the source are deleted at the destination.
• Unix-based systems (including Mac OS X): Should have it by default
• Windows: cwRsync (https://itefix.net/cwrsync)
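As a hedged illustration of the “--delete” caution, the sketch below assembles an rsync-over-SSH command in Python and defaults to a dry run. The host name, the paths, and the helper name build_rsync_command are hypothetical; -a, -e, --delete, and --dry-run are standard rsync options.

```python
import shlex


def build_rsync_command(src, dest, delete=False, dry_run=True):
    """Assemble an rsync-over-SSH command as an argument list.

    dry_run defaults to True so that a mirroring run with --delete
    only reports what it would change until it is switched off.
    """
    cmd = ["rsync", "-a", "-e", "ssh"]
    if delete:
        cmd.append("--delete")   # mirror: remove destination files missing from the source
    if dry_run:
        cmd.append("--dry-run")  # report changes without making them
    cmd += [src, dest]
    return cmd


# Hypothetical source directory and destination host, for illustration only.
preview = build_rsync_command("/data/run42/",
                              "user@dtn.example.edu:/backup/run42/",
                              delete=True)
print(shlex.join(preview))
```

Once the dry-run listing looks right, the same call with dry_run=False yields the real command, which could then be passed to subprocess.run.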
File Transfer Protocol (FTP) and ‘Secure’ File Transfer Protocol (‘S’FTP)
• Widely used for file transfers
• SFTP is more secure. Use it if available.
• Unix-based systems (including Mac OS X): Should have it by default
• Windows: FileZilla (https://filezilla-project.org)
These tools are okay, but not always
Where they work well:
• Great compatibility. Widely available.
• Small datasets. Quick transfers (< 15 mins).
Where they fall short:
• Large bulk data transfers.
• Transfers on unreliable connections and hosts.
When should you look for other solutions?
• Transfer is unreliable and takes a long time, to the point it affects your workflow
• You are getting speeds less than (for large datasets):
  • Within campus
    • 800 Mbps (megabits per second)
      • 100 GB = ~820,000 Mb. Takes ~17 minutes
    • 3-4 Gbps if you have a 10G connection
      • 100 GB = ~800 Gb. Takes ~4 minutes
  • Between campus and outside
    • Hard to tell because of things out of our control
    • 200 Mbps – 5000 Mbps (5 Gbps)
      • 100 GB = ~820,000 Mb. Takes ~1 hour at 200 Mbps
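The estimates on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original talk: the function name is made up, 3.5 Gbps is an assumed midpoint of the "3-4 Gbps" range, and the binary flag is there because the slide's "100 GB = ~820,000 Mb" counts 1 GB as 1024 MB.

```python
def transfer_minutes(size_gb, rate_mbps, binary=True):
    """Estimate transfer time in minutes for size_gb gigabytes at rate_mbps.

    binary=True counts 1 GB as 1024 MB, which reproduces the slide's
    "100 GB = ~820,000 Mb"; binary=False uses 1 GB = 1000 MB.
    """
    megabits = size_gb * (1024 if binary else 1000) * 8
    return megabits / rate_mbps / 60


for label, rate in [("campus, 800 Mbps", 800),
                    ("campus 10G link, 3.5 Gbps (assumed midpoint)", 3500),
                    ("off campus, 200 Mbps", 200),
                    ("off campus, 5 Gbps", 5000)]:
    print(f"{label}: {transfer_minutes(100, rate):.1f} min")
```

If a measured transfer takes much longer than the estimate for your link, that is a hint to look at the bottlenecks discussed in the rest of the talk.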
Data Transfer Basics
Data transfer: Overview
• The key players
  • Endpoints: source and destination
  • Network: 1/10/100 Gbps
  • Transfer tool
  • Transfer settings: encrypted vs. not encrypted
    • Encrypted: SCP, SFTP, rsync over SSH
    • Not encrypted: FTP, rsync
(Why) is my data transfer slow?
Where are the bottlenecks?
[Figure: source and destination endpoints, each running scp/ftp]
1. Potential bottlenecks in endpoints
• CPU and memory
  • Higher clock speed is better than # of cores
    • E.g., 2 x Intel Xeon® Broadwell processor E5-2643 3.4 GHz (total 12 cores)
  • RAM: 32 GB or more recommended
• Disk I/O
  • Disk type (SATA HDD, SSD), configuration (RAID), and file system (ext4, GPFS, Lustre)
  • Decent server with HDD, ext4 with RAID performs around 4 Gbps
  • RAID is required to get > 1 Gbps (ref: http://fasterdata.es.net/data-transfer-tools/)
• Network Interface Controller (NIC)
  • Wireless: don’t expect much (mostly < 130 Mbps)
  • Wired: 1/10/40/100 Gbps
• Miscellaneous tuning
  • NIC tx buffer, enabling jumbo frames (9K instead of 1.5K), TCP/UDP tuning …
You won’t get more than 1 Gb/s (125 MB/s) with your laptop, most desktops, and un-optimized servers
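One way to read this slide is that a transfer runs only as fast as its slowest stage. The sketch below makes that explicit. The stage names and the helper are hypothetical, the disk (4 Gbps) and wireless (0.13 Gbps) figures are taken from the slide, and the 10 Gbps values are assumed for illustration.

```python
def slowest_stage(stages_gbps):
    """Return the slowest stage and its rate; the pipeline cannot go faster."""
    name = min(stages_gbps, key=stages_gbps.get)
    return name, stages_gbps[name]


# Disk and Wi-Fi rates come from the slide; the 10 Gbps values are assumed.
laptop_on_wifi = {"disk": 4.0, "nic": 0.13, "network": 10.0}
tuned_server = {"disk": 4.0, "nic": 10.0, "network": 10.0}

print(slowest_stage(laptop_on_wifi))
print(slowest_stage(tuned_server))
```

In the first case the wireless NIC caps the transfer no matter how fast the disk or campus link is; in the second, the disk becomes the limit.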
2. Potential bottlenecks in networks
• Bandwidth
  • E.g., 1/10/40/100 Gb/s
• Congestion
  • Time of day
• Distance
  • E.g., Round Trip Time (RTT) between Stanford and Princeton: ~80 ms
• “Things” along the way
  • Routers, switches, firewalls, NAT, security devices, …
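Distance matters because a single TCP stream can have at most one window of data in flight per round trip, so its throughput is bounded by window size divided by RTT. The sketch below applies that bound to the ~80 ms Stanford-Princeton RTT quoted on the slide. The window sizes are illustrative assumptions, not measured settings, and the function names are made up.

```python
def max_stream_mbps(window_bytes, rtt_ms):
    """Upper bound on one TCP stream: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


def window_needed_mb(target_gbps, rtt_ms):
    """Bandwidth-delay product: megabytes in flight needed to fill the link."""
    return target_gbps * 1e9 * (rtt_ms / 1000) / 8 / 1e6


RTT_MS = 80  # Stanford - Princeton, from the slide

print(f"64 KiB window: {max_stream_mbps(64 * 1024, RTT_MS):.2f} Mbps")
print(f"4 MiB window:  {max_stream_mbps(4 * 1024 * 1024, RTT_MS):.1f} Mbps")
print(f"Window to fill 10 Gbps: {window_needed_mb(10, RTT_MS):.0f} MB")
```

With a small window, a long-distance stream stays far below the link speed even when the link itself is idle, which is one reason TCP tuning shows up in the endpoint checklist.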
Our network, their network, and networks in between
Parts where we have limited visibility and no control
3. Transfer tools: Single vs multi stream
• Single stream: scp, ftp, rsync
• Multi stream: GridFTP, BBCP
  • Less packet loss (w/ dups)
  • Better utilization of link
  • Faster transfer speed
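Conceptually, a multi-stream tool divides the work so that several connections move data at once. The sketch below only illustrates that idea by splitting a file into contiguous byte ranges, one per stream. It is not how GridFTP or BBCP are actually implemented, and split_ranges is a hypothetical helper.

```python
def split_ranges(file_size, streams):
    """Split [0, file_size) into `streams` contiguous (start, end) byte ranges."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    base, extra = divmod(file_size, streams)
    ranges, start = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((start, start + length))
        start += length
    return ranges


print(split_ranges(10, 4))
```

Each range could then be sent over its own connection, so a stall or loss on one stream does not hold up the others.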
Transfer tools: scp vs. GridFTP
(ref: http://fasterdata.es.net/data-transfer-tools/)
Downloading 500 GB data
• scp: 8 hours
• GridFTP: 10 minutes
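To put the two durations in the same units as the rest of the talk, the sketch below converts them to average throughput. It assumes the 8 hours belongs to scp and the 10 minutes to GridFTP (the order of the slide title), and it counts 1 GB as 1000 MB; the function name is made up.

```python
def average_mbps(size_gb, seconds):
    """Average rate in Mbps, counting 1 GB as 1000 MB (decimal units)."""
    return size_gb * 1000 * 8 / seconds


scp_mbps = average_mbps(500, 8 * 3600)      # 8 hours
gridftp_mbps = average_mbps(500, 10 * 60)   # 10 minutes

print(f"scp:     {scp_mbps:.0f} Mbps")
print(f"GridFTP: {gridftp_mbps:.0f} Mbps")
print(f"speedup: {gridftp_mbps / scp_mbps:.0f}x")
```

Under these assumptions the scp run averages well under 1 Gbps while the GridFTP run needs most of a 10 Gbps path, which is why the endpoint and network also have to be up to the job.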
4. Transfer settings: Encryption
Tool                                      Encrypted Control   Encrypted Data
FTP, HTTP (even password-based access)    -                   -
BBCP, BBFTP, Globus/GridFTP               ✔                   -
SCP, SFTP, rsync over SSH, HTTPS,         ✔                   ✔
  Globus/GridFTP with encryption on
Data encryption provides the best security, but negatively impacts transfer speed
In summary
• Data transfer speed is affected by: endpoints, network, transfer tool, and transfer settings
• In most cases, your endpoint cannot handle much
  • Use a wired connection, and check your bandwidth as far as you can
  • Ask for a 10 Gbps or 40 Gbps connection if needed
• Use a better transfer tool if possible
  • scp, (s)ftp, rsync, and wget/curl work fine for small transfers
  • For large transfers (> GBs) over the WAN (RTT > 5 ms: beyond Philly or NYC), don’t expect much from: scp, (s)ftp, rsync, wget/curl, robocopy
Hey… that’s a lot of work
Photo from Flickr (Billy Abbott)
Endpoint
Network
Transfer Tool
Transfer Settings
What we have at Princeton: Data Transfer Nodes
• DTNs: Test DTN, Tigress DTN, Lewis-Sigler DTN, PNI DTN, CS DTN
• Transfer tool: Globus (GridFTP)
• Connections: ESnet, Internet2
  • 10 Gbps (Mar. 2016 -)
  • 10 Gbps (Nov. 2015 -)
Data Transfer Nodes (DTNs)
• High-end servers
• 10 Gbps connections
• Tuned and optimized. RAID configured
• Good transfer tools (e.g., Globus)
• Supported
Leverage our resources
• Contact us or your departmental staff
  • About existing departmental DTNs and best work/dataflow
  • About having a departmental DTN
Data Transfer Nodes @ Princeton
• Test DTN
  • /u/<NetID>
  • Contact: [email protected]
• Tigress DTN (Princeton TIGRESS)
  • GPFS-based /tigress and scratch (tiger, della, orbital) disk space
  • Contact: [email protected]
• LSI DTN (Lewis-Sigler Institute Core DTN)
  • LSI local cluster storage, lab data volumes, and scratch spaces
  • Contact: [email protected]
• PNI DTN (Princeton Neuroscience Institute DTN)
  • All PNI `/Jukebox` Volumes (Bucket and Scratch spaces) are available
  • Contact: [email protected]
• CS DTN (Computer Science Department DTN)
  • /n/fs/scratch/ and CS “project” storage spaces
  • Contact: [email protected]
• Physics DTN (Princeton Physics DTN)
  • Restricted to users who have an account on the Feynman cluster. /group, /scratch, and /mnt/<project> NFS file systems.
  • Contact: Vinod Gupta ([email protected]), Sumit Saluja ([email protected])
What is Globus?
• Fast, reliable data transfer and management service
• Uses GridFTP underneath
• Main advantages
  • Fast transfer speed (multi-stream)
  • Convenient to use: “Fire-and-Forget”
https://www.globus.org
How it works
1. Pick two endpoints
2. Submit transfer request at the Globus website
3. Dataset is transferred between two endpoints
  • Your machine’s web browser is just a “remote control”
  • But, your machine can be an endpoint too (more later)
4. Get notification when transfer is done
[Figure: endpoints A and B and Globus.org. 1. Select endpoints, 2. Request transfer, 3. Data, 4. Get notification]
What you need to use Globus
• Account
  • Have Princeton NetID? You’re set.
• Web browser
  • Chrome, Firefox, Safari, IE (Edge), etc.
  • E.g., use smartphone to submit a transfer job (note: your phone is not transferring the dataset)
• Access to source and destination endpoints
Your peers are actively using it!
09/2017: Total 257 transfers via Globus
Globus has good coverage
• Universities
  • Cornell, NYU, Yale, Johns Hopkins, Dartmouth, Purdue, Georgia Tech, Virginia Tech, UVA, Michigan, Indiana, Stanford, Berkeley, U Chicago, …
• Most of the DOE national labs
  • ESnet at CERN, ANL, LBNL, LLNL, LANL, ORNL, and PNNL
• National computing facilities
  • NERSC, NCSA, SDSC …
• Federal agencies
  • NIH, USDA, NASA/JPL, USGS …
• Over 50,000 registered endpoints at over 500 institutions worldwide
How to use it
[Screenshot: transfer options with annotations]
• ‘On’ by default. ‘On’ is around 25% slower than ‘Off’
• ‘On’ is 10%-40% slower than ‘Off’
Globus Sharing
• Share file or directory with other Globus users
Data Transfer via CLI
• Install Globus CLI package (Linux or Cygwin on Windows)
  • https://docs.globus.org/cli/installation/
• Documentation
  • https://docs.globus.org/cli/
• Scheduling transfer jobs
  • $ echo "globus transfer $ep1:/share/godata/file1.txt $ep2:~/file1.txt --label 'CLI Test Transfer 2'" | at 13:07 oct 30 2017
  • Cron job: 0 2 * * * runglobus.sh > ~/runglobus.log 2>&1
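For scripted use, the same transfer request can be assembled from Python. This is a minimal sketch under stated assumptions: the endpoint IDs are placeholders, the helper name is made up, and the paths and label are the ones from the slide's example. It only builds and prints the command; running it would additionally require the Globus CLI to be installed and logged in.

```python
import shlex


def globus_transfer_command(src_ep, src_path, dst_ep, dst_path, label):
    """Build the argument list for a `globus transfer` call."""
    return ["globus", "transfer",
            f"{src_ep}:{src_path}", f"{dst_ep}:{dst_path}",
            "--label", label]


# Placeholder endpoint IDs; real ones have to be looked up beforehand.
EP1 = "SOURCE-ENDPOINT-ID"
EP2 = "DEST-ENDPOINT-ID"

cmd = globus_transfer_command(EP1, "/share/godata/file1.txt",
                              EP2, "~/file1.txt", "CLI Test Transfer 2")
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems with labels that contain spaces; the list could be handed to subprocess.run from a script that cron or at invokes.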
Globus Connect Personal
• Make your own machine a Globus endpoint
• Mac, Windows, Linux
• You are the administrator for your own Globus endpoint
• Limited performance (# of streams), but convenient!
https://www.globus.org/globus-connect-personal
Use cases
Scenario 1
• Data on your laptop
• Globus Connect Personal
Scenario 2
• Data on shared Windows PC
• Globus Connect Personal
Better for Scenario 2
• Data on shared Windows PC
• SMB mount to Linux server
• Globus Connect Server
[Figure labels: e.g., Tigress DTN; SMB mount]
Learn more about Globus
• Globus documentation
  • https://docs.globus.org/how-to/
• Research Computing mini-course
  • “Transferring Large Data Sets, Plus Hands-on Tutorial with the Globus Transfer Tool” (Spring, 2018)
  • http://www.princeton.edu/researchcomputing/education/mini-courses/
Other tools
• BBCP
  • Free, easy to use, and comparable performance to Globus
  • Mac OS X, Linux-based systems. SSH-based access control
  • Both endpoints need it installed, but easier to install and configure
  • Supported DTNs: Test DTN, Tigressdata
  • $ bbcp -V -s 16 /local/path/largefile.tar remotesystem:/remote/path/largefile.tar
  • More info
    • http://www.slac.stanford.edu/~abh/bbcp/
    • https://www.olcf.ornl.gov/kb_articles/transferring-data-with-bbcp/
• Fast Data Transfer (FDT)
  • Java-based tool from Caltech & CERN (http://monalisa.cern.ch/FDT/)
  • Can theoretically run on any operating system, including Windows
  • Needs the server side running in server mode
  • $ java -jar ./fdt.jar -ss 1M -P 10 -c remotehost.domain.uci.edu ~/file.633M -d /userdata/hjm
  • Slower than BBCP or Globus
Other tools
• Aria2c (https://aria2.github.io)
  • Faster http/https, ftp, sftp, BitTorrent, and Metalink download tool (x4 faster)
  • Windows, Mac, Linux, Android App
  • http: $ ./aria2c -x4 -k1M http://foo.com/foo.zip
• LFTP
  • Faster download (get) speed (2-5x) for ftp, http, sftp, fish, torrent. Upload (put) speed is the same.
  • Seems to be compatible with normal FTP, HTTP servers.
  • Mac OS X, Linux-based systems (apt-get install lftp; yum install lftp; brew install lftp)
  • ftp: $ lftp ftp://speedtest.tele2.net
  • http: $ lftp -e 'pget -n 5 foo.zip' http://foo.com/
  • More info: http://lftp.tech
• HPN-patched SCP/SSH
  • https://www.psc.edu/hpn-ssh
  • Slower than BBCP or Globus
Summary and Takeaways
• Small and quick transfers: basic transfer tools
• Large bulk transfers: Data Transfer Nodes (DTNs) and Globus when possible
• Large bulk transfers, but Globus is unavailable: the other tools mentioned above
• Know your environment and limitations
  • Endpoints, network, transfer tool, and transfer settings
• Speak up and reach out to us