Archiving data from Durham to RAL using the File Transfer Service (FTS)
TRANSCRIPT

Lydia Heck
Institute for Computational Cosmology
Manager of the DiRAC-2/2.5 Data Centric Facility (COSMA)

19 October 2016, Campus Network Engineering for Data Intensive Science Workshop
Introduction to DiRAC
• DiRAC (Distributed Research utilising Advanced Computing), established in 2009 with DiRAC-1
• Supports research in theoretical astronomy, particle physics and nuclear physics
• Funded by STFC, with infrastructure money allocated from the Department for Business, Innovation and Skills (BIS)
• The running costs, such as staff costs and electricity, are funded by STFC
• DiRAC is classed by STFC as a major research facility, on a par with the big telescopes
What is DiRAC?
• A national service run, managed and allocated by the scientists who do the science, funded by BIS and STFC
• The systems are built around and for the applications with which the science is done
• We do not rival a facility like ARCHER, as we do not aspire to run a general national service
What is DiRAC – cont’d
• For the highlights of science carried out on the DiRAC facility, please see: http://www.dirac.ac.uk/science.html
• Specific example: large-scale structure calculations with the Eagle run
  • 4096 cores, ~8 GB RAM/core, 47 days = 4,620,288 CPU hours, 200 TB of data
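The quoted CPU-hour figure follows directly from the core count and run time. A quick sanity check (my arithmetic, not from the slides):

```python
# Eagle run numbers as quoted on the slide: 4096 cores for 47 days.
cores, days = 4096, 47

# CPU hours = cores * wall-clock hours.
cpu_hours = cores * days * 24
print(cpu_hours)  # 4620288, matching the quoted 4,620,288 CPU hours
```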
The DiRAC computing systems
• Blue Gene (Edinburgh)
• Cosmos (Cambridge)
• Complexity (Leicester)
• Data Centric (Durham)
• Data Analytic (Cambridge)
COSMA @ DiRAC (Data Centric), Durham
• IBM iDataPlex
• 6720 Intel Sandy Bridge cores
• 53.8 TB of RAM
• FDR10 InfiniBand, 2:1 blocking
• 2.5 PByte of GPFS storage (2.2 PByte used!)
Resources of DiRAC
• Long projects with a significant amount of CPU hours, typically allocated for 3 years on a specific system, on one or more of the 5 available systems
• Resources available:
System                    Cores                             CPU hours  Storage                                        Location
Blue Gene                 98,304 cores                      861 M      1 PB (GPFS)                                    Edinburgh
Data Centric (DiRAC2)     6720 Xeon cores                   59 M       2.5 PB (GPFS)                                  Durham
Data Centric (DiRAC2.5)   8000 Xeon cores                   > 71 M     2.5 PB data (Lustre), 1.8 PB scratch (Lustre)  Durham
Complexity                4352 Xeon cores                   38 M       0.8 PB (Panasas)                               Leicester
Data Analytic             4800 Xeon cores                   42 M       0.75 PB (Lustre)                               Cambridge
SMP                       1784 Xeon cores (shared memory)   15.6 M     146 TB (EXT)                                   Cambridge
Why do we need to copy data?
• During a project, and when it is completed, copy data to home institutions
  • requires additional storage resources at researchers' home institutions
  • not enough provision: will require additional funds
• Make backup copies
  • if disaster struck, many CPU hours of calculations would be lost
• Copy data to other sites to leverage compute resources for post-processing
• Storage on the HPC facility runs out of capacity: data creation considerably above expectation?
Why do we copy data to RAL?
• Research data must now be available to interested parties for a specified period of time
• We could install DiRAC's own archive
  • requires funds, and there is (currently) no budget
• We needed to get started:
  • to gain experience
  • to get a valid backup
  • to remove data as the resources run out
  • to identify bottlenecks and technical challenges
• Jeremy Yates (Director of DiRAC) negotiated access to the RAL archiving systems
• Set up collaborations, make use of previous experience and pool resources
• AND: copy data!
Network connectivity of Durham University
• 2012: upgrade to 4x1 Gbit to Janet
• Janet advised to investigate optimal utilisation of the available bandwidth before applying for a further upgrade
• 2014: upgrade to 6 Gbit to Janet
• Currently: 8 Gbit to Janet; should be a full 10 Gbit by the end of the year (technical issues)
Network bandwidth – situation for Durham
[slide figure: 2014 measured throughput]
[slide figure: 2014 measured limits]
[slide figure: September 2014 measured limits]
Making optimal use of available bandwidth
• Planning and investment to bypass the external campus firewall
  • preparatory work started in October/November 2014: two new routers (~£80k), configured for throughput with minimal ACLs, enough to safeguard the site
• Deploying internal firewalls: part of the new security infrastructure anyhow, but essential for such a venture
• Security now relies on the front-end systems of Durham DiRAC and Durham GridPP
• IPPP was moved outside the firewall in April 2015, with a clear mandate to manage security for their installation
• The DiRAC data transfer system was moved outside about 1 month later
GridPP site firewall config for endpoint node
[slide diagram: GridFTP traffic options at the site firewall – port blocking, pass-through, monitoring with the firewall, and bypassing the site firewall]
Result for DiRAC and GridPP in Durham
• Guaranteed 3 Gbit/sec in/out
• Consequences:
  • pushed the network performance for Durham GridPP from the bottom 3 in the country to the top 5 of the UK GridPP sites
  • they now experience different bottlenecks, but these are under their control
  • DiRAC data transfers achieve up to 300-400 MByte/sec throughput to RAL on archiving, depending on file sizes
  • faster data sharing with other collaboration sites
  • recently (October 2016) offered the service to Earth Sciences, with 70-80 MByte/sec from a site in Switzerland
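The observed archiving throughput sits right at the link's capacity. A back-of-envelope check (my arithmetic, not from the slides) of the 3 Gbit/sec guarantee against the measured 300-400 MByte/sec, plus the time such a link needs for a 200 TB dataset like the Eagle run:

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert a line rate in Gbit/s to MByte/s (decimal units)."""
    return gbit_per_s * 1e9 / 8 / 1e6

def transfer_days(size_tb: float, rate_mbyte_s: float) -> float:
    """Days needed to move size_tb terabytes at rate_mbyte_s MByte/s."""
    return size_tb * 1e12 / (rate_mbyte_s * 1e6) / 86400

link = gbit_to_mbyte_per_s(3)    # 375.0 MByte/s: the 300-400 MByte/s
                                 # observed is essentially line rate
eagle = transfer_days(200, 350)  # roughly a week for 200 TB
```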
Collaboration between DiRAC and GridPP/RAL
• The Durham Institute for Computational Cosmology (ICC) volunteered to be the prototype installation
• Huge thanks to Jens Jensen and Brian Davies: there were many emails exchanged, many questions asked and many answers given
• Resulting document, "Setting up a system for data archiving using FTS3" by Lydia Heck, Jens Jensen and Brian Davies: https://www.cosma.dur.ac.uk/documentation
Setting up the archiving tools
• Identify appropriate hardware (could mean extra expense):
  • need the freedom to modify and experiment with it
  • cannot have HPC users logged in and working when you need to reboot the system!
  • free to apply the very latest security updates; this might not always be possible on an HPC system
• Requires an optimal connection to storage; for the transfer system this meant an InfiniBand card
Setting up the archiving tools – cont’d
• Create an interface to access the file/archiving service at RAL using the GridPP tools:
  • gridftp (Globus Toolkit; also provides Globus Connect)
  • trust anchors (egi-trustanchors)
  • VOMS tools (emi3-xxx)
  • fts3 (CERN)
Chose to use FTS3 with GridFTP
[slide diagram: the user submits transfer lists (and credentials) to FTS3, which drives GridFTP transfers between data.cosma.dur.ac.uk (GridFTP, backed by GPFS) and srm-dirac.gridpp.rl.ac.uk (SRM, backed by CASTOR-GEN)]
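The transfer lists the user submits pair a source URL on the Durham GridFTP endpoint with a destination URL on the RAL SRM endpoint. A hypothetical sketch of building such a list (the endpoint names are from the slide; the one-pair-per-line layout and the path prefixes are my assumptions, so check the FTS3 client documentation for the exact bulk-submission format):

```python
# Endpoints as named on the slide; destination path layout is illustrative.
SRC = "gsiftp://data.cosma.dur.ac.uk"
DST = "srm://srm-dirac.gridpp.rl.ac.uk"

def transfer_pairs(paths):
    """Map file paths to 'source-URL destination-URL' lines for a bulk
    FTS3 submission, mirroring the local path at the destination."""
    return [f"{SRC}{p} {DST}{p}" for p in paths]

lines = transfer_pairs(["/cosma/data/run1/snap_000.tar"])
```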
Learning to use certificates and proxies
• Long-lived VOMS proxy?
  • myproxy-init; myproxy-logon; voms-proxy-init; fts-transfer-delegation
• How to create a proxy and delegation that lasts weeks, even months?
  • This is still an issue for a VOMS proxy, but we circumvented it using a normal proxy:
  • grid-proxy-init; fts-transfer-delegation
  • grid-proxy-init -valid HH:MM
  • fts-transfer-delegation -e time-in-seconds
  • creates a proxy that lasts up to the certificate lifetime
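Since fts-transfer-delegation takes its lifetime in seconds, a small helper (my addition, not from the talk) makes the "weeks, even months" arithmetic explicit:

```python
def delegation_seconds(weeks: int = 0, days: int = 0, hours: int = 0) -> int:
    """Total lifetime in seconds, the unit fts-transfer-delegation -e expects."""
    return ((weeks * 7 + days) * 24 + hours) * 3600

four_weeks = delegation_seconds(weeks=4)  # 2419200 seconds
```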
Experiences
1. Large files: optimal throughput, limited by network bandwidth
2. Many small files: limited by latency
3. Many parallel sessions: impedes the proper functioning of the archive server
4. Ownership and creation dates are not preserved: one grid owner
5. A simple approach of "just" pushing files will not work!
Actions to overcome issues
• Tar files up in chunks of ~256 GByte
• Exclude checked-out versioning subdirectories
• Preserves ownership and time stamps in the tar archive
• Keep a record of archived files
• The files to transfer are large: limited by bandwidth, not by latency
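The chunking step above can be sketched as follows. This is a minimal illustration of the idea, not the site's actual script; each resulting batch would then be tarred (preserving ownership and time stamps) and its contents appended to the record of archived files:

```python
CHUNK_BYTES = 256 * 10**9  # target ~256 GByte per tar file

def plan_chunks(files):
    """Group (path, size_in_bytes) pairs into batches of at most
    CHUNK_BYTES; a single file larger than that gets its own batch."""
    chunks, current, total = [], [], 0
    for path, size in files:
        if current and total + size > CHUNK_BYTES:
            chunks.append(current)
            current, total = [], 0
        current.append(path)
        total += size
    if current:
        chunks.append(current)
    return chunks

batches = plan_chunks([("a", 150 * 10**9),
                       ("b", 150 * 10**9),
                       ("c", 10 * 10**9)])
```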
Open issues
• Depends on a single admin to carry it out; not automatic
• What happens when the content of directories changes? Complete new archive sessions?
• Creating a tool more like rsync would require extensive scripting
• When trying to get data back, you get back all of a subset just to find a single file or a string of files
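The rsync-like behaviour wished for above amounts to comparing the current directory state against the record of what was archived and re-archiving only the differences. A hypothetical sketch (no such DiRAC tool exists per the slide; the path-to-metadata layout is my assumption):

```python
def files_to_rearchive(current, manifest):
    """current and manifest both map path -> (size, mtime).
    Return paths that are new or whose metadata changed since archiving."""
    return sorted(p for p, meta in current.items()
                  if manifest.get(p) != meta)

changed = files_to_rearchive(
    {"run1/a.dat": (10, 100), "run1/b.dat": (20, 200)},  # on disk now
    {"run1/a.dat": (10, 100)},                            # already archived
)  # only b.dat needs archiving
```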
Conclusions
• With the right network speed we can archive the DiRAC data to RAL, or anywhere else, given the right tools and connectivity
• Documenting the procedure is very important, to transfer the knowledge and to avoid duplicating effort. The documentation is online: https://www.cosma.dur.ac.uk/documentation
• Each DiRAC site should have its own dirac0X account
• Start archiving, and keep on archiving: this is the more difficult part, as it is not completely automatic yet and more development is required
• Collaboration between DiRAC and GridPP/RAL DOES work!
• The work has been of benefit to other transfer activities, which significantly helps research and reflects well on the service we can deliver
• Can we aspire to more?