
Data Transfer Nodes (DTNs)

Improving large-scale data transfer performance


Tim Chown, Jisc
GÉANT GN4-3 project, WP6 joint WP leader

DeiC Conference, 31 October 2019

Introduction

• A little about the GÉANT GN4-3 project

• The data transfer problem / challenge

• Science DMZ architecture and DTNs

• Using perfSONAR to test the benefits of Science DMZ and DTNs

• DTN-related projects

• NREN DTN survey

• Transfers to/from commercial cloud providers

• Commercial vs R&E networks

• Closing thoughts

The GÉANT GN4-3 project

• Collaborative project between the European NRENs
• Runs the GÉANT network and associated services
  • https://www.geant.org/Projects/GEANT_Project_GN4-3
• Approximately €70m over 4 years (Jan 2019 – Dec 2022)
• Parallel GN4-3N project is restructuring the backbone network
• WP6 is Network Technologies and Services Development
  • Evaluating new technologies and building new services
  • Task 1 – Network Technology Evolution (includes DTNs in Year 1)
  • Task 2 – Network Services Evolution and Development
  • Task 3 – Management and Monitoring
  • Co-led by Ivana Golub (PSNC) and me

The problem / challenge

• Growing interest in the R&E community in moving large volumes of research data
  • From point of capture or generation to a remote computing facility
  • For remote data visualisation
  • Data replication, distributed storage and backups
  • To or from cloud providers
• Data set volumes are increasing (see the sketch below)
  • 10 TB data sets are not unusual
  • 100 TB is no longer very large
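To make these volumes concrete, here is a quick back-of-the-envelope estimate in Python. The rates are illustrative: roughly the ~30 MB/s rsync figure from the Southampton case study later in the deck, a 2 Gbit/s target, and a full 10G link.

```python
# Rough transfer-time estimates for large data sets (illustrative rates only).

def transfer_hours(size_tb: float, rate_gbps: float) -> float:
    """Hours to move size_tb terabytes at rate_gbps gigabits per second."""
    bits = size_tb * 1e12 * 8          # 1 TB taken as 10**12 bytes
    return bits / (rate_gbps * 1e9) / 3600

for size_tb in (10, 100):
    for rate in (0.24, 2, 10):         # ~30 MB/s rsync, a 2 Gbit/s target, a 10G link
        print(f"{size_tb:>4} TB at {rate:>5} Gbit/s: {transfer_hours(size_tb, rate):7.1f} h")
```

At 2 Gbit/s this works out to roughly 0.9 TB per hour, which is the basis of the "~1 TB per hour" target quoted in the Southampton case study below.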

[Images: www.diamond.ac.uk, www.skatelescope.org]

Researcher network expectations?

https://community.jisc.ac.uk/groups/janet-end-end-performance-initiative/document/network-expectations-data-intensive-science

The Science DMZ and DTNs – optimising data transfers

• ESnet published the Science DMZ model in 2012/13:
  • https://www.es.net/assets/pubs_presos/sc13sciDMZ-final.pdf
• Three key elements:
  • Design an appropriate campus network architecture, avoiding local bottlenecks and causes of packet loss, especially generic campus firewalls
  • Deploy persistent network performance measurement (i.e., perfSONAR)
  • Optimise data transfer node (DTN) design and configuration / tuning
• Apply security policy without negatively impacting performance
  • Streamlined filters, not complex deep packet inspection
• Differential handling of day-to-day and science traffic

Example of a Science DMZ architecture

[Figure: Science DMZ architecture (source: fasterdata.es.net). A border router at the WAN edge feeds both the enterprise border router/firewall in front of the site/campus LAN and the Science DMZ switch/routers, which connect DTNs in different buildings (Project A DTN, Facility B DTN, Cluster DTN and its cluster), each with per-project security policy and an adjacent perfSONAR node, over 10GE and dark fiber links.]

Design aims to minimise packet loss and thus maximise TCP throughput, especially for higher-RTT (international) paths.

Security via efficient ACLs and host firewalls, not a generic campus firewall.

Using perfSONAR to test the benefit of Science DMZ and DTNs

• Jisc is encouraging Janet-connected sites to deploy perfSONAR, so we set up our own servers against which they can run tests

• When working with communities who want to move data, we can also host perfSONAR meshes for them
  • Jisc provides MaDDash on a VM platform
  • Allows an at-a-glance view of network performance across a community
  • Offering central archiving of measurement data soon
• Our nodes:
  • London PoP near GÉANT, 10G: https://ps-londhx1.ja.net/toolkit/
  • Slough shared DC, 10G: http://ps-slough-10g.ja.net/toolkit/

• Nodes are open for remote tests, and available over IPv4 or IPv6 (see the sketch below)

• Also have perfSONAR small nodes (PMP-like) available for loan
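As a sketch of what testing against these open nodes can look like from a host with the perfSONAR tools installed, the following drives a pscheduler throughput test towards the London node listed above; output formatting may differ between perfSONAR versions.

```python
import subprocess

# Ask pscheduler for a throughput test towards the Jisc London perfSONAR
# node (listed above). Assumes the perfSONAR/pscheduler tools are installed.
result = subprocess.run(
    ["pscheduler", "task", "throughput", "--dest", "ps-londhx1.ja.net"],
    capture_output=True, text=True, timeout=600,
)
print(result.stdout)                    # human-readable throughput summary
if result.returncode != 0:
    print("pscheduler failed:", result.stderr)
```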

Case study – University of Southampton

• An example of data that was being moved by physical media
  • Southampton µ-VIS X-Ray Imaging Centre
  • Taking samples to Diamond Light Source about six times a year
  • Might gather 10-40 TB of experimental result data per visit
  • One data set typically a ~50 GB file, plus up to 5,000 8-25 MB files
  • Tried using network and rsync; obtained ~30 MB/s (240 Mbit/s)
  • Using physical media the full transfer process took around 3 weeks
• We ought to be able to do better…
  • Diamond end has already deployed a Science DMZ
  • Southampton has a 10 Gbit/s campus link to Janet
  • A target of ~2 Gbit/s would allow ~1 TB per hour

Working with the computing service and researchers

• Met with Diamond and campus IT & research staff
• Agreed a phased plan of action:
  1. Change to using Globus transfer tools
  2. Deploy perfSONAR to measure network characteristics
  3. Engineer a 10 Gbit/s link to the research file store, internal to the campus firewall
  4. Pilot a 10 Gbit/s Science DMZ DTN at campus edge
• Outcome:
  • External data transfers achieving 2-4 Gbit/s
  • Able to transfer their most recent 12 TB data set in 6-12 hours (i.e., overnight)

[Figure: Southampton deployment. The 10G Janet connection terminates at the campus edge, where the external DTN and external perfSONAR node (perfsonar-ext, p1p1) sit outside the campus firewall; 10G links run through the core switches to the internal DTN and internal perfSONAR node (perfsonar-b5: management em1, data p1p1) next to the research file store.]

perfSONAR network measurements

• We set up a perfSONAR mesh for the Southampton case study (running on a Jisc VM)

• Used measurement points at Diamond, Janet (London), and two at Southampton (by internal file store, and by DTN at campus edge)
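Mesh results of this kind are stored in a perfSONAR measurement archive (esmond), which can be queried over HTTP. A minimal sketch, assuming a hypothetical archive host; the real mesh archive URL would come from the MaDDash configuration:

```python
import requests

ARCHIVE = "https://ps-archive.example.ja.net"   # hypothetical archive host

# List throughput measurements recorded in the last 24 hours.
resp = requests.get(
    f"{ARCHIVE}/esmond/perfsonar/archive/",
    params={"event-type": "throughput", "time-range": 86400},
    timeout=30,
)
resp.raise_for_status()

for meta in resp.json():
    # Each metadata entry describes one measured source/destination pair.
    print(meta.get("source"), "->", meta.get("destination"))
```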

Jisc London pS node to Southampton internal filestore pS node

[Graph: throughput shows peaks and troughs every day/night; small (<1%) packet loss notably impacts performance, with loss appearing when the campus firewall is loaded; the Christmas vacation period is marked on the graph.]
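The sensitivity to small loss rates is what the Mathis et al. TCP model predicts: achievable throughput scales as MSS / (RTT · √loss). A rough illustration (the RTT values are assumptions, and the model's ~1.2 constant factor is ignored):

```python
from math import sqrt

def mathis_mbps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis et al. ceiling: throughput <= MSS / (RTT * sqrt(loss))."""
    return mss_bytes * 8 / (rtt_s * sqrt(loss)) / 1e6

# A short path (~5 ms RTT, assumed) vs an international one (~150 ms, assumed).
for rtt in (0.005, 0.150):
    for loss in (0.0001, 0.001, 0.01):
        print(f"RTT {rtt*1000:>4.0f} ms, loss {loss:.2%}: "
              f"~{mathis_mbps(1460, rtt, loss):9.1f} Mbit/s ceiling")
```

The same loss rate that is merely annoying on a short path is crippling at intercontinental RTTs, which is why the Science DMZ model focuses on eliminating loss.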

Jisc London pS node to Southampton campus DTN pS node

[Graph: more consistent throughput to the DTN; no packet loss.]

Examples of DTN-related projects

• AENEAS – https://www.aeneas2020.eu
  • Federated European Science Data Center (ESDC) to support the astronomy community in achieving the goals of the Square Kilometre Array (SKA)
• PRP – https://prp.ucsd.edu
  • Effort to improve data transfer performance between the DoE ASCR HPC facilities at ANL, LBNL, ORNL, NCSA
  • FIONA DTNs have SSD but also support additional GPU compute
• PROCESS – https://www.process-project.eu
  • Creating data applications for collaborative research: exascale learning on medical image data, and many other research applications
• Data Mover Challenge 2020
  • https://www.sc-asia.org/data-mover-challenge-2020/
  • Seven teams testing their software on a worldwide DTN network
  • Includes DTNs in Europe, Asia and the US

Jisc’s backbone / reference DTN deployments

• Useful to be able to let our members run disk-to-disk tests, to try different data transfer software and to test disk i/o and tuning

• We host two 10G Data Transfer Nodes (DTNs) at our Slough DC

• Production DTN:
  • Hosts a Globus endpoint; can read/write at 10 Gbit/s
  • Allows direct iperf tests by prior arrangement
• Experimental DTN:
  • Runs alternative TCP congestion control algorithms, e.g., BBR (see the sketch below)
  • Allows alternative transfer tools to be evaluated: WDT, QUIC, …

• Not built as staging DTNs

• Challenge: federated access to the systems – OAuth? eduGAIN?

• We’re running 100G DTN / transfer tests in a private testbed
  • Our first university, Imperial College, recently connected to Janet at 100G
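For the BBR experiments mentioned above, Linux allows the congestion control algorithm to be chosen per socket rather than system-wide. A minimal sketch (Linux-only; the DTN host name and port are placeholders, and the kernel's tcp_bbr module must be available):

```python
import socket

# Select BBR congestion control for a single TCP connection (Linux only).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
except OSError:
    print("BBR not available; using the system default congestion control")

sock.connect(("dtn.example.ja.net", 5201))   # placeholder DTN host/port
algo = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print("using:", algo.split(b"\x00")[0].decode())
```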

Aside – quick iperf tests

• I spotted that DeiC supports ad-hoc iperf tests
  • https://www.deic.dk/en/node/759
• Can be useful – we support this on our Slough DTN and our NOC runs a server for its internal use (see the sketch below)
  • But doesn’t provide data over time

• And measurements may conflict with other tests
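Scripting such an ad-hoc test is straightforward with iperf3's JSON output; a minimal sketch (the server name is a placeholder for one agreed in advance):

```python
import json
import subprocess

# Run a 10-second iperf3 test and pull the received throughput out of the
# JSON report. The server name is a placeholder; use one agreed in advance.
proc = subprocess.run(
    ["iperf3", "-c", "iperf.example.net", "-t", "10", "-J"],
    capture_output=True, text=True, timeout=60,
)
report = json.loads(proc.stdout)
bps = report["end"]["sum_received"]["bits_per_second"]
print(f"received: {bps / 1e9:.2f} Gbit/s")
```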

NREN DTN survey

• Run during October 2019

• 29 NREN responses (from GÉANT APM contacts)

• User groups mentioned as moving data intensively over the NREN/GÉANT networks:
  • Physics: (HEP) LHCONE, astrophysics, LOFAR | HPC: PRACE | Astronomy | Biology: human brain, ELIXIR ... | Environment and climate research: CMIP6, Copernicus

• Some NRENs see their role purely as transport capacity providers

• Results presented at STF18
  • https://wiki.geant.org/pages/viewpage.action?spaceKey=APM&title=18th+STF+-+Copenhagen%2C+22-23+Oct+2019

NREN DTN Survey – reported issues

• Network: long-distance transfers, firewalls, last-mile networking, connection capacity
• Poor network performance that is difficult to troubleshoot
• Tuning campus, LAN and local systems
• Implementing a Science DMZ while meeting security requirements is perceived to be difficult
• IaaS usage without coordination with the NREN
• Low user expectations – researchers transport large volumes of data on hard drives


NREN DTN Survey – measuring transfer performance?

“Other” includes internal probes, iperf, NetMinder, Cacti, HawkEye, and in-house tools.


NREN DTN survey – ways to support large-scale data transfers?

“Other” includes remote support, help with system tuning, Aspera, dedicated links, LHCONE, running head nodes, and consultancy.


NREN DTN Survey – other assistance?

“Other” includes optimizing the TCP stack, talks at events, help with Globus, annual meetings, bandwidth checks, and engaging with research communities.

GÉANT DTN work – beyond the survey

• The DTN survey showed there is (currently) no clear demand for specific software or hardware development to support improved data transfer performance
  • Good tools and practices exist
  • Part of the issue is research engagement and dissemination
  • Many NRENs are doing this well, some less so
• We are thus setting up a focus group in WP6 T2 to take the data transfer infrastructure work area forward
  • 11 NRENs are interested in working with the project
  • Will identify priorities, and consider providing a best practice wiki
  • One suggestion is to explore DTN-as-a-Service
  • An example to look at is AARNet’s CloudStor service

Bear in mind researchers need simple tools – Globus example

• Globus Connect
  • https://www.globus.org/globus-connect
  • Run an endpoint on the DTN
• Presents a GUI to the researcher
  • Just “drag and drop” data to transfer
  • Can selectively transfer files
• Base transfer tool is free to use
  • Subscription for advanced features
• Uses GridFTP under the hood (see the sketch below)
  • Parallel TCP data transfer
  • Typically four data streams
  • Adds resilience to packet loss
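For scripted rather than drag-and-drop use, Globus also provides a Python SDK on top of the same transfer service. A minimal sketch of submitting a transfer between two endpoints; the client ID, endpoint UUIDs and paths are all placeholders, and error handling is elided:

```python
import globus_sdk

# Placeholder: a client ID from registering a native app with Globus.
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"

# Interactive native-app login flow to obtain transfer tokens.
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_tokens["access_token"])
)

# Endpoint UUIDs and paths are placeholders; real UUIDs come from the
# Globus web app for the two DTN endpoints involved.
tdata = globus_sdk.TransferData(
    tc, "SRC-ENDPOINT-UUID", "DST-ENDPOINT-UUID", label="DTN test transfer"
)
tdata.add_item("/data/experiment/", "/ingest/experiment/", recursive=True)
task = tc.submit_transfer(tdata)
print("submitted task:", task["task_id"])
```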

What about transfers to/from commercial cloud providers?

• Excellent paper to be presented at CHEP2019
  • “Characterising network paths in and out of the clouds”
  • https://indico.cern.ch/event/773049/contributions/3473824/
• Cloud computing is increasingly important for many science disciplines
  • Details of networking to/from the cloud are not well documented
• Intra-cloud throughput of many 100s of Gbps
  • Free to transfer data into a cloud provider
• Data export is the interesting part (see the cost sketch below)
  • Tests show 20-30 Gbps export is possible
  • Ballpark movement costs: $70/TB egress, $20/TB between regions
  • Provider may waive network costs up to 15% of the total bill
  • So if you use a lot of compute, the data movement may be “free”
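Using the ballpark numbers above, a small calculation shows how the waiver interacts with compute spend. These are illustrative figures only, and the reading of "up to 15% of total bill" as offsetting egress is my assumption, not a provider's pricing rule:

```python
# Ballpark cloud egress economics from the figures above (illustrative only).
EGRESS_PER_TB = 70.0          # $/TB out of the provider
WAIVER_FRACTION = 0.15        # network costs waived up to 15% of total bill

def effective_egress_cost(tb_moved: float, compute_bill: float) -> float:
    """Egress cost after a waiver worth up to 15% of the total bill."""
    egress = tb_moved * EGRESS_PER_TB
    return max(0.0, egress - WAIVER_FRACTION * (compute_bill + egress))

# Moving 100 TB: with little compute the egress bill dominates; with heavy
# compute spend, the 15% waiver can absorb it entirely ("free" movement).
for compute in (0, 10_000, 50_000):
    print(f"compute ${compute:>6}: egress ${effective_egress_cost(100, compute):7.0f}")
```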

Commercial vs R&E networks

• Do NREN R&E networks deliver better large-scale data transfer performance?
  • Are we better at supporting our researchers and scientists?
• Interesting comparison done in 2017
  • https://connect.geant.org/2017/05/15/taking-it-to-the-limit-testing-the-performance-of-re-networking
• Europe (GÉANT) to Australia (AARNet), between DTNs (see the buffer sketch below)
  • 9.27 Gbps over the R&E network, with 400 MB TCP buffers
  • Commercial provider 1 – 0.9 Gbps with a 200 MB buffer
  • Commercial provider 2 – 1.72 Gbps with a 300 MB buffer, dropped to zero after ~30 seconds – anti-DoS kicking in?
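The buffer sizes quoted here line up with the bandwidth-delay product of the path; a quick check, assuming a Europe-Australia RTT of roughly 280 ms (my assumption, not a figure from the post):

```python
# Bandwidth-delay product: the TCP buffer needed to keep a long path full.
def bdp_mbytes(rate_gbps: float, rtt_ms: float) -> float:
    """TCP buffer (MB) needed to sustain rate_gbps over a path with rtt_ms RTT."""
    return rate_gbps * 1e9 * (rtt_ms / 1000) / 8 / 1e6

# ~10 Gbit/s at ~280 ms needs ~350 MB in flight, hence the 400 MB buffer.
print(f"{bdp_mbytes(10, 280):.0f} MB")
# Conversely, a 200 MB buffer at 280 ms caps TCP at about 5.7 Gbit/s, so the
# commercial-path results were limited by more than just buffer size.
print(f"{200 * 8 / 0.280 / 1e3:.1f} Gbit/s cap with a 200 MB buffer")
```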

Closing thoughts

• You almost certainly have users or research communities wanting to move significant amounts of data
  • Do they know what is available to them to support that?
  • Research engagement is really, really important!
• Science DMZ principles, including DTNs, should be adopted
  • Differentiate handling of day-to-day and science traffic
• Being able to measure performance is really useful
  • perfSONAR for network throughput, but consider disk-to-disk too
• Deploying an NREN backbone DTN can be helpful for users
  • But not many NRENs are doing this at present
  • Perhaps there’s a potential DTN-as-a-Service offering to be built?
• Need to think cloud for the future


Thank you


Any questions?

Contact: [email protected]

© GÉANT Association on behalf of the GN4 Phase 3 project (GN4-3). The research leading to these results has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 856726 (GN4-3).