Parallel Session A: Supporting data-intensive applications
Chair: Tim Chown


TRANSCRIPT


Please switch your mobile phones to silent

17:30 - 19:00: Exhibitor showcase and drinks reception
18:00 - 19:00: Birds of a feather sessions
No fire alarms are scheduled. In the event of an alarm, please follow the directions of NCC staff.

Janet end-to-end performance initiative update
Tim Chown, Jisc

Why is end-to-end performance important?
- Overall goal: help optimise a site's use of its Janet connectivity
- Seeing a growth in data-intensive science applications; this includes established areas like GridPP (moving LHC data around) as well as new areas like cryo-electron microscopy
- Seeing an increasing number of remote computation scenarios, e.g. scientific networked equipment with no local compute; these might require 10Gbit/s to remote compute to return computation results on a 100GB data set to a researcher for timely visualisation
- Starting to see more 100Gbit/s connectivity requests, likely to have challenging data transfer requirements behind them
- As networking people, how do we help our researchers and their applications?


Speaking to your researchers
- Are you or your computing service department speaking to your researchers? If not, how do you understand their data-intensive requirements? If so, is this happening on a regular basis? Ideally you'd want to be able to plan ahead, rather than adapt on the fly.
- Do you conduct networking future looks? Any step changes might dwarf your site's organic growth; this issue should have some attention at CIO or PVC Research level.
- Do you know what your application 'elephants' are? What's the breakdown of your site's network traffic? How are you monitoring network flows?

Researcher expectations
How can we help set and manage researcher expectations?

One aspect is helping them articulate their network requirements: volume of data in time X => data rate required (a rough calculation of this kind is sketched below). It's also about understanding practical limitations. You can determine the theoretical network throughput; e.g., in principle you can transfer 100TB over a 10Gbit/s link in one day, but in practice many factors may prevent this. We should encourage researchers to speak to their computing service; computing services can in turn talk to the Janet Service Desk: jisc.ac.uk/contact
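As a hedged illustration of the "volume in time X => data rate" arithmetic (the function name and figures are ours, not from the talk):

```python
# Estimate the sustained rate needed to move a data set by a deadline,
# and sanity-check the talk's 100TB-in-a-day example.

def required_rate_gbits(volume_tb: float, hours: float) -> float:
    """Sustained throughput (Gbit/s) to move volume_tb within `hours`."""
    bits = volume_tb * 1e12 * 8          # decimal terabytes -> bits
    return bits / (hours * 3600) / 1e9   # bits/sec -> Gbit/s

# 100 TB in 24 hours needs ~9.3 Gbit/s of sustained goodput, so a
# 10 Gbit/s link only just suffices even in principle.
print(f"{required_rate_gbits(100, 24):.2f} Gbit/s")
```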

And noting here that cloud capabilities are becoming increasingly important.

Janet end-to-end performance initiative
This is the context in which Jisc set up the Janet end-to-end performance initiative. The goals of the initiative include:
- Engaging with existing data-intensive research communities and identifying emerging communities
- Creating dialogue between Jisc, computing service groups, and research communities
- Holding workshops, facilitating discussion on e-mail lists, etc.
- Helping researchers manage expectations
- Establishing and sharing best practices in identifying and rectifying causes of poor performance
More information: jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative


Understanding the factors affecting E2E
Achieving optimal end-to-end performance is a multi-faceted problem.

It includes:
- Appropriate provisioning between the end sites
- Properties of the local campus network (at each end), including capacity of the Janet connectivity, internal LAN design, the performance of firewalls and the configuration of other devices on the path
- End system configuration and tuning: network stack buffer sizes, disk I/O, memory management, etc.
- The choice of tools used to transfer data, and the underlying network protocols

To optimise end-to-end performance, you need to address each aspect


Janet network engineering
From Jisc's perspective, it's important to ensure there is sufficient capacity in the network for its connected member sites. We perform an ongoing review of network utilisation:
- Provision the backbone network and regional links
- Provision external connectivity, to other NRENs and networks
- Observe the utilisation, model growth, predict ahead
- Step changes have a bigger impact at the local rather than backbone scale
Janet has no differential queueing for regular IP traffic; the Netpath service exists for dedicated / overlay links. In general, Jisc plans regular network upgrades with a view to ensuring that there is sufficient latent capacity in the network.

Janet backbone, October 2016

Major external links, October 2016

E2EPI site visits
We (Duncan Rand and I) have visited a dozen or so sites, met with a variety of networking staff and researchers, and spoken to many others via email. Really interesting to hear what sites are doing:
- Some good practice is evident, especially in local network engineering, e.g. routing those elephant flows around main campus firewalls; campus firewalls are often not designed for single high-throughput flows
- Seeing varying use of site links, e.g. 10G for campus, 10G resilient, 20G for research (GridPP)
- Some sites are using their resilient link for bulk data transfers
- Some rate limiting of researcher traffic, to avoid adverse impact on general campus traffic; we'd encourage sites to talk to the JSD rather than rate limit

The Science DMZ model
ESnet published the Science DMZ design pattern in 2012/13: es.net/assets/pubs_presos/sc13sciDMZ-final.pdf
Three key elements:
- Network architecture: avoiding local bottlenecks
- Network performance measurement
- Data transfer node (DTN) design and configuration
It's also important to apply your security policy without impacting performance.
The NSF's Cyberinfrastructure (CC*) Program funded this model in over 100 US universities: see nsf.gov/funding/pgm_summ.jsp?pims_id=504748
There is no current funding equivalent in the UK; it's down to individual campuses to fund changes to network architectures for data-intensive science. But this can and should be part of your network architecture evolution.

Good news: you're doing a lot already
There are several examples of sites in the UK that have a form of Science DMZ deployment. In many cases the deployments were made without knowledge of the Science DMZ model, but the Science DMZ is simply a set of good principles to follow, so it's not surprising that some Janet sites are already doing it.
Examples in the UK:
- Diamond Light Source (more from Alex next)
- JASMIN/CEDA Data Transfer Zone
- Imperial College GridPP; supports up to 40Gbit/s of IPv4/IPv6
To realise the benefit, both end sites need to apply the principles.


Examples of campus network engineering
- In principle you can just use your Janet IP service
- Where specific guarantees are required, the Netpath Plus service is available (see jisc.ac.uk/netpath), but then you'll not be able to exceed that capacity
- Some sites split their campus links, e.g. 10G campus, 10G science
- Again, be careful about using your resilient link for bulk data; it's better to speak to the JSD about a primary link upgrade, and ensure appropriate resilience for that
- Some sites rate-limit research traffic
- Some examples of physical / virtual overlays: the WLCG (for LHC data) deployed the OPN (optical) and LHCONE (virtual) networks; it's not clear that one overlay per research community would scale
- At least one site is exploring Cisco ACI (their SDN solution)

Measuring network characteristics
It's important to have telemetry on your network. The Science DMZ model recommends perfSONAR for this (more from Alex and Szymon later in this session); the current version is about to be 4.0 (as of April 17th, all being well).
perfSONAR uses proven measurement tools under the hood, e.g. iperf and owamp. You can run tests between two perfSONAR systems or build a mesh. It collects telemetry over time (throughput, loss, latency, traffic path), helps you assess the impact of changes to your network or systems, and helps you understand variance in characteristics over time. It can highlight poor performance, but doesn't troubleshoot per se. (A minimal ad-hoc throughput test of the kind perfSONAR schedules is sketched below.)
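A minimal sketch of the kind of ad-hoc test perfSONAR automates, using iperf3's JSON output; the test host name here is a placeholder:

```python
import json
import subprocess

# Run a 10-second iperf3 test and parse its JSON report.
result = subprocess.run(
    ["iperf3", "-c", "ps-test.example.ac.uk", "-t", "10", "-J"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

# Receiver-side aggregate across all streams, in bits per second.
bps = report["end"]["sum_received"]["bits_per_second"]
print(f"Achieved throughput: {bps / 1e9:.2f} Gbit/s")
```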

Example: UK GridPP perfSONAR mesh


Janet perfSONAR / DTN test node(s)
We've installed a 10G perfSONAR node at a Janet PoP in London. It lets you test your site's throughput to/from the Janet backbone. It might be useful if you know you'll want to run some data-intensive applications in the near future but don't yet have perfSONAR at the far end, or if you just want to benchmark your site's connectivity. Ask us if you're interested in using it.
We're planning to add a second perfSONAR test node in our Slough DC, and also planning to install a 10G reference SSD-based DTN there (see https://fasterdata.es.net/science-dmz/DTN/). This will allow disk-to-disk tests, using a variety of transfer tools. We can also run a perfSONAR mesh for you, using MaDDash on a VM. We may also deploy an experimental perfSONAR node, e.g. to evaluate the new Google TCP-BBR implementation.

Small node perfSONAR
In some cases, just an indicative perfSONAR test is useful; i.e., run loss/latency tests as normal, but limit throughput tests to 1Gbit/s. For this scenario, you can build a small node perfSONAR system for under £250.
Jisc took part in the GEANT small node pilot project, using Gigabyte Brix; there is an IPv4 and IPv6 test mesh at http://perfsonar-smallnodes.geant.org/maddash-webui/
We now have a device build that we can offer to communities for testing. The aim is to make them as plug and play as possible, and a stepping stone to a full perfSONAR node.
Further information and TNC2016 meeting slide deck: https://lists.geant.org/sympa/d_read/perfsonar-smallnodes/

perfSONAR small node test mesh

Aside: Google's TCP-BBR
Traditional TCP performs poorly with even a fraction of 1% loss rate. Google have been developing a new version of TCP:
- TCP-BBR was open-sourced last year
- Requires just sender-side deployment
- Seeks high throughput with a small queue
- Good performance at up to 15% loss
- Google are using it in production today
It would be good to explore this further: understand the impact on other TCP variants, and when used for parallelised TCP applications like GridFTP. See the presentation from the March 2017 IETF meeting: ietf.org/proceedings/98/slides/slides-98-iccrg-an-update-on-bbr-congestion-control-00.pdf
(A sketch of opting a single sender socket into BBR on Linux follows.)
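Since BBR needs only sender-side deployment, an application can opt individual connections into it. A minimal sketch, assuming a Linux kernel (4.9+) with the tcp_bbr module available:

```python
import socket

# Ask the kernel to use BBR congestion control for this socket only,
# leaving the system-wide default untouched.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
except OSError:
    print("BBR not available on this kernel; keeping the default")

# Read back whichever algorithm is actually in effect.
algo = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print("congestion control:", algo.rstrip(b"\x00").decode())
```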

Building on Science DMZ?
We should seek to establish good principles and practices at all campuses, and at the research organisations they work with, like Diamond. There's already a good foundation at many GridPP sites, and the Janet backbone is heading towards 600G capacity and beyond.
We can seed further communities of good practice on this foundation, e.g. the DiRAC HPC community, the SES consortium, and grow a Research Data Transfer Zone (RDTZ) within and between campuses. We can build towards a UK RDTZ, inspired by the US Pacific Research Platform model of multi-site, multi-discipline research-driven collaboration built on the NSF Science DMZ investment. There are many potential benefits, such as enabling new types of workflow, e.g. streaming data to CPUs without the need to store locally.


Transfer tools
Your researchers are likely to find many available data transfer tools:
- There are the simpler old friends like ftp and scp, but these are likely to give a bad initial impression of what your network can do
- There are TCP-based tools designed to mitigate the impact of packet loss: GridFTP typically uses four parallel TCP streams; Globus Connect is free for non-profit research and education use (see globus.org/globus-connect)
- There are tools to support management of transfers: FTS, see http://astro.dur.ac.uk/~dph0elh/documentation/transfer-data-to-ral-v1.4.pdf
- There's also a commercial UDP-based option, Aspera: see asperasoft.com/

It would be good to establish more benchmarking of these tools at Janet campuses


E2E performance to cloud compute
We're seeing growing interest in the use of commercial cloud compute, e.g. to provide remote CPU for scientific equipment. This complements compute available at the new EPSRC Tier-2 HPC facilities: epsrc.ac.uk/research/facilities/hpc/tier2/
There are anecdotal reports of 2-3Gbit/s into AWS, e.g. by researchers at the Institute of Cancer Research; see the presentations at the RCUK Cloud Workshop: https://cloud.ac.uk/
Bandwidth for AWS depends on the VM size; see https://aws.amazon.com/ec2/instance-types/
We're keen to explore cloud compute connectivity further. This includes AWS, MS ExpressRoute, and scaling Janet connectivity to these services as appropriate.

Future plans for E2EPI
Our future plans include:
- Writing up and disseminating best practice case studies
- Growing a UK RDTZ by promoting such best practices within communities
- Deploying a second 10G Janet perfSONAR test node at our Slough DC
- Deploying a 10G reference DTN at Slough and performing transfer tool benchmarking
- Promoting wider campus perfSONAR deployment and community meshes
- Integrating perfSONAR data with router link utilisation
- Developing our troubleshooting support further
- Experimenting with Google's TCP-BBR
- Exploring best practices in implementing security models for Science DMZ
- Expanding Science DMZ to include IPv6, SDN and other technologies
- Running a second community E2EPI workshop in October
- Holding a hands-on perfSONAR training event (after 4.0 is out)

Useful links
- Janet E2EPI project page: jisc.ac.uk/rd/projects/janet-end-to-end-performance-initiative
- E2EPI Jisc community page: https://community.jisc.ac.uk/groups/janet-end-end-performance-initiative
- JiscMail E2EPI list (approx 100 subscribers): jiscmail.ac.uk/cgi-bin/webadmin?A0=E2EPI
- Campus Network Engineering for Data-Intensive Science workshop slides: jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
- Fasterdata knowledge base: http://fasterdata.es.net/
- eduPERT knowledge base: http://kb.pert.geant.net/PERTKB/WebHome

contact

Tim Chown, Jisc

[email protected]

jisc.ac.uk

Using perfSONAR and Science DMZ to resolve throughput issues
Alex White, Diamond

Diamond Light Source

perfSONAR and Science DMZ at Diamond

Diamond
What is the Diamond Light Source? The Diamond machine is a type of particle accelerator.
- CERN: high energy particles are smashed together, and the crash analysed
- Diamond: exploits the light produced by high energy particles undergoing acceleration
We use this light to study matter, like a super microscope.

Diamond
What is the Diamond Light Source? It sits in the Oxfordshire countryside, by the A34 near Didcot. Diamond is a not-for-profit joint venture between STFC and the Wellcome Trust. Cost of use: free access for scientists through a competitive scientific application process. Over 7000 researchers from academia and industry have used our facility.

The Machine
Three particle accelerators:
- Linear accelerator
- Booster synchrotron
- Storage ring: 48 straight sections angled together to make a ring, 562m long (it could be called a tetracontakaioctagon)


The Machine

The Science
Bright beams of light from each bending magnet or wiggler are directed into laboratories known as beamlines. We also have several cutting-edge electron microscopes. All of these experiments operate concurrently!
- Spectroscopy
- Crystallography
- Tomography (think CAT scan)
- Infrared
- X-ray absorption
- X-ray scattering

X-ray diffraction

Data Rates
Central Lustre and GPFS filesystems for science data: 420TB, 900TB, 3.3PB (as of 2016). A typical x-ray camera writes 4MB frames, 100 times per second, and an experiment can easily produce 300GB-1TB. Scientists want to take their data home.

Data-intensive research

Data Rates
Data-intensive problems:
- Each dataset might only be downloaded once
- I don't know where my users are
- Some scientists want to push data back to Diamond

Each dataset might only be downloaded once: users might only want a small section of their data; over the course of an hour of data collection, perhaps only the last five minutes were useful. It does not make sense to push every byte to a CDN.

I don't know where my users are: some organisations have federated agreements with partner institutions (dark fibre, or VPNs over the Internet); I, however, have users performing ad-hoc access from anywhere. I can't block segments of the world; we have users from Russia, China, etc.

Some scientists want to push data back to Diamond: reprocessing of experiments with new techniques, and new facilities located outside Diamond using our software analysis stack.


Sneakernet

The old ways are the best?

When I started at Diamond, this is how users moved large amounts of data offsite. It's a dedicated machine called a 'data dispenser': it's actually a member of the Lustre and GPFS parallel filesystems, and it speaks USB3. It's an embarrassment.

Network Limits
The problem: science data downloads from Diamond to visiting users' institutes were inconsistent and slow, even though the facility had a 10Gb/s JANET connection from STFC. The limit on download speeds was delaying post-experiment analysis at users' home institutes.

Characterising the problem
Our target: a stable 50Mb/s over a 10ms path, using our site's shared JANET connection. In real terms, 10ms was approximately DLS to Oxford.


We started testing Diamond's own network by using cheap servers with perfSONAR to run ad-hoc tests. First we checked them back-to-back to confirm they could do 10Gb, then we tested:
- deep inside the science network, next to the storage servers
- at the edge of the Diamond network, where we link to STFC
Then we used perfSONAR's built-in community list to find a third server just up the road, at the University of Oxford, to test against.

Baseline findings with iperf
- Inside our network: 10Gb/s, no packet loss
- Over the STFC/JANET segment between Diamond's edge and the Physics Department at Oxford: low and unpredictable speeds, and a small amount of packet loss

Packet loss? Is that a problem? In my experience, on the local area, no, it's never concerned me before.

This was the attitude of my predecessors also.


Packet Loss

TCP performance is predicted by the Mathis equation:

    TCP throughput <= MSS / (RTT * sqrt(p))

where MSS is the maximum segment (packet) size, RTT is the round-trip latency, and p is the packet loss probability.

There are three major factors that affect TCP performance. All three are interrelated.

The Mathis equation predicts the maximum transfer speed of a TCP link.

TCP has the concept of a Window Size, which describes the amount of data that can be in flight in a TCP connection between ACK packets.

The sending system will send a TCP window worth of data, and then wait for an acknowledgement from the receiver.

At the start of a transfer, TCP begins with a small window size, which it ramps up as more and more data is successfully sent. You've heard of this before; this is just the bandwidth-delay product calculation for TCP performance over 'long fat networks'.

However, BDP calculations for a long fat network assume a lossless path, and in the real world we don't have lossless paths.

When a TCP stream encounters packet loss, it has to recover. To do this, TCP rapidly decreases the window size for the amount of data in flight. This sounds obvious, but the effect of spurious packet loss on TCP is pretty profound.
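As a quick illustration of the bandwidth-delay product mentioned above (the rate and RTT figures are ours, chosen to match the talk's scenario):

```python
def bdp_bytes(rate_gbits: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to fill rate_gbits at rtt_ms RTT."""
    return rate_gbits * 1e9 * (rtt_ms / 1000) / 8

# Filling a 10 Gbit/s path at 10 ms RTT needs a ~12.5 MB TCP window,
# far beyond historical default buffer sizes.
print(f"{bdp_bytes(10, 10) / 1e6:.1f} MB")
```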

Packet Loss

Interesting effects of packet loss

All else being equal, the time for a TCP connection to recover from loss goes up as the latency goes up.

Packet Loss
According to Mathis, to achieve our initial goal (a stable 50Mb/s) over a 10ms path, the maximum tolerable packet loss is 0.026%.

The packet loss between Diamond and Oxford was previously too small to cause concern to network engineers, but was found to be large enough to disrupt high-speed transfers over distances greater than a few miles. (The Mathis arithmetic behind the 0.026% figure is sketched below.)
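A sketch of the Mathis bound rearranged for tolerable loss. The MSS and the constant C here are assumptions on our part (the exact constant depends on the TCP variant and ACK behaviour), chosen so the result lands near the slide's figure:

```python
def max_tolerable_loss(target_bps: float, mss_bytes: int, rtt_s: float,
                       c: float = 0.7) -> float:
    """Largest loss probability at which Mathis still permits target_bps."""
    # target <= C * MSS / (RTT * sqrt(p))  =>  p <= (C * MSS / (RTT * target))^2
    return (c * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

p = max_tolerable_loss(50e6, 1460, 0.010)   # 50 Mb/s over a 10 ms path
print(f"tolerable loss ~ {p * 100:.3f}%")   # ~0.027% with these assumptions
```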


Finding the problem in the last mile
We worked with STFC to connect a perfSONAR server directly to the main Harwell campus border router, to look for loss in the last mile.


So... once we had an idea what the problem was, instead of just complaining at them, I could take actual fault data to my upstream supplier.

Phil at STFC was gracious in working with us, allowing us to connect up another perfSONAR server.

Testing with this server let us run traffic just through the firewall, which showed us that the packet loss problem was with the firewall, and not with the JANET connection between us and Oxford


Science DMZ
The fix: Science DMZ

Science DMZ (an idea from ESnet) is otherwise known as moving your data transfer server to the edge of your network

Your JANET connection has high latency, but hopefully no loss.

If your local LAN has an amount of packet loss, you might still see reasonable speed over it because the latency on the LAN is so small.

The Science DMZ model puts your application server between the long and short latency parts, bridging them. Essentially, you're splitting the latency domains.


Security
Aside: security without firewalls. Data-intensive science traffic interacts poorly with enterprise firewalls. Does this mean we can use the Science DMZ idea to just ignore security? No!

This section is cribbed from an utterly excellent presentation by Kate Petersen Mace from ESnet.

Science DMZ is about reducing degrees of freedom and reducing the number of network devices in the critical path.

Removing redundant devices, methods of entry, etc. is a positive enhancement to security.

Security
Science DMZ as a security architecture: implementing a Science DMZ means segmenting your network traffic. Apply specific controls to data transfer hosts, and avoid unnecessary controls.

The typical corporate LAN is a wild west, filled with devices that you didn't do the engineering on yourself:
- Network printers
- Web browsers
- Financial databases
- In my case, modern oscilloscopes running Windows 95
- That cool touch-screen display kiosk in reception
- Windows devices

In comparison to the corporate LAN, the Science DMZ is a calm, relaxed place. You know precisely what services you're moving to the data transfer hosts, and you're typically only exposing one service to the Internet: FTP, for example, or Aspera.

You leave the enterprise firewall to do its job, that of preventing access to myriad unknown ports, and you use a more appropriate tool to secure each side.

Security
Techniques to secure Science DMZ hosts:
- Only run the services you need
- Protect each host with router ACLs
- Use an on-host firewall: Linux iptables is performant

Linux iptables will happily do port ACLs on the host at 10Gb with no real hit on the CPU.

Globus GridFTP

We adopted Globus GridFTP as our recommended transfer method:
- It uses parallel TCP streams
- It retries and resumes automatically
- It has a simple, web-based interface

This means that if a packet gets dropped in one stream, the other streams carry on the transfer while the affected one recovers.
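Under the Mathis model each stream independently tolerates the path's loss rate, so aggregate goodput grows roughly linearly with the stream count until the link fills. A rough sketch, reusing the assumed MSS and constant from the loss calculation above:

```python
import math

def mathis_bps(mss_bytes: int, rtt_s: float, loss: float, c: float = 0.7) -> float:
    """Mathis ceiling for a single TCP stream, in bits per second."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss))

single = mathis_bps(1460, 0.010, 1e-3)   # one stream, 0.1% loss, 10 ms RTT
for n in (1, 4, 8):                      # GridFTP-style parallelism
    print(f"{n} streams: ~{n * single / 1e6:.0f} Mbit/s aggregate")
```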


The Diamond Science DMZ

This is how we implemented the Science DMZ at Diamond. You can see the data transfer node, connected directly to the campus site front door routers. There is a key question: how do you get your data to your data transfer node?
- Move data to it on a hard drive
- Route your data through your lossy firewall front door; it's low latency, so Mathis says this should still be quick
- We went a step further and put in a private, non-routable fibre link to a server in the datacentre
The internal server has an important second function: authentication of access to data. I didn't want to put an AD server in the Science DMZ if I didn't need to, so this internal server runs the authentication server for Globus Online from inside our corporate network.

Diamond Science DMZ
Actual transfers:
- Test data: 2Gb/s+ consistently between Diamond and the ESnet test point at Brookhaven Lab, New York State, USA
- Speed records! Biggest: electron microscope data from DLS to Imperial, 1120GB at 290Mb/s. Fastest: crystallography dataset from DLS to Newcastle, 260GB at 480Mb/s

Bad bits
- Globus logs
- Netflow
- Globus install
- My own use of perfSONAR

I tend to use perfSONAR just for ad-hoc testing. It's fabulous for this, and I can direct my users to it for self-service testing of their network connections. I know it is very useful for periodic network monitoring, but I don't take advantage of that yet.

Different transfer tools? SCP and Aspera

The common implementation of SCP has a fixed TCP window size; it will never grow big enough to fill a fast link.
Aspera is a commercial, UDP-based system. I've never used it; I've heard some people from various corners of science talk about it, but I've never been asked for it. I'm sure it would do well on the Science DMZ if we ever need Aspera.

Near Future
- 10Gb+ performance
- Globus cluster for multiple 10Gb links
- 40Gb single links

(Globus Online default stream splitting: per file or in chunks?)

Thank you!
- Use real-world testing
- Never believe your vendor
- Zero packet loss is crucial
- Enterprise firewalls introduce packet loss
Alex White, Diamond Light Source
[email protected]


contact

Alex White, Diamond

[email protected]

jisc.ac.uk

Essentials of the Modern Performance Monitoring with perfSONAR
Szymon Trocha, PSNC / GÉANT, [email protected]
April 2017

Motivations
- Identify problems when they happen, or (better) earlier
- The tools must be available (at campus endpoints, demarcations between networks, at exchange points, and near data resources such as storage and computing elements)
- Access to testing resources

Problem statement
The global Research & Education network ecosystem is comprised of hundreds of international, national, regional and local-scale networks. While these networks all interconnect, each network is owned and operated by a separate organization (called a domain) with different policies, customers, funding models, hardware, bandwidth and configurations.

This complex, heterogeneous set of networks must operate seamlessly from end to end to support your science and research collaborations that are distributed globally

Where Are The (multidomain) Problems?

[Diagram: path from source campus (S) to destination campus (D), showing congested or faulty links between domains, congested intra-campus links, and latency-dependent problems inside domains with small RTT]

Challenges
Delivering end-to-end performance:
- Get the user, service delivery teams, local campus and metro/backbone network operators working together effectively
- Have tools in place
- Know your (network) expectations
- Be aware of network troubleshooting

What is perfSONAR?
It's infeasible to perform at-scale data movement all the time; as we see in other forms of science, we need to rely on simulations. perfSONAR is a tool to:
- Set network performance expectations
- Find network problems ('soft failures')
- Help fix these problems
All in multi-domain environments; these problems are all harder when multiple networks are involved.
perfSONAR provides a standard way to publish monitoring data. This data is interesting to network researchers as well as network operators.

The Toolkit
Network performance comes down to a few key metrics:
- Throughput (e.g. how much can I get out of the network)
- Latency (the time it takes to get to/from a destination)
- Packet loss/duplication/ordering (for some sampling of packets, do they all make it to the other side without serious abnormalities occurring?)
- Network utilization (the converse of throughput for a moment in time)
We can get many of these from a selection of measurement tools: the perfSONAR Toolkit.

The perfSONAR Toolkit is an open source implementation and packaging of the perfSONAR measurement infrastructure and protocols. All components are available as RPMs, DEBs, and bundled as a CentOS ISO. It is very easy to install and configure (a default install usually takes less than 30 minutes).

Importance of Regular Testing
We can't wait for users to report problems and then fix them; things just break sometimes:
- Bad system or network tuning
- Failing optics
- Somebody messed around in a patch panel and kinked a fibre
- Hardware goes bad
Problems that get fixed have a way of coming back:
- System defaults come back after hardware/software upgrades
- New employees may not know why the previous employee set things up a certain way, and back out fixes
It is important to continually collect, archive, and alert on active test results (a sketch of querying archived results follows).
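A hedged sketch of pulling recent archived throughput results from a perfSONAR measurement archive (esmond) over its REST API; the host, threshold and parameters here are placeholders rather than a documented recipe:

```python
import requests

ARCHIVE_HOST = "http://ps-archive.example.ac.uk"

# Find throughput measurements recorded in the last 24 hours.
meta = requests.get(ARCHIVE_HOST + "/esmond/perfsonar/archive/",
                    params={"event-type": "throughput",
                            "time-range": 86400}).json()

for m in meta:
    for et in m["event-types"]:
        if et["event-type"] != "throughput":
            continue
        points = requests.get(ARCHIVE_HOST + et["base-uri"],
                              params={"time-range": 86400}).json()
        for point in points:
            gbps = point["val"] / 1e9
            if gbps < 1.0:  # alert threshold: our assumption
                print(m["source"], "->", m["destination"],
                      f"only {gbps:.2f} Gbit/s")
```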

perfSONAR Deployment Possibilities
- Dedicated server: a single CPU with multiple cores (2.7 GHz for 10Gbps tests), 4GB RAM, 1Gbps onboard NIC (management + delay tests), 10Gbps PCI-slot NIC (throughput tests)
- Small node: a low-cost small PC, e.g. the GIGABYTE BRIX GB-BACE-3150; various models exist

Deployment Styles
- Ad hoc testing
- Docker
- Beacon
- Island
- Mesh

Location criteria
- Where can it be integrated into the facility software/hardware management systems?
- Where can it do the most good for the network operators or users?
- Where can it do the most good for the community?

Distributed Deployment in a Few Steps
[Diagram: measurement hosts deployed at the network edge and next to services, coordinated by a central server]

New 4.0 release announcement (April 17th)
- Introduction of pScheduler: new software for measurements, replacing BWCTL
- GUI changes: more information, better presentation
- OS support upgrade: Debian 7, Ubuntu 14, CentOS 7 (still supporting CentOS 6)
- Mesh config GUI
- MaDDash alerting

4.0 Data Collection

New GUI

Thank you!
[email protected]
http://www.perfsonar.net/
http://docs.perfsonar.net/
http://www.perfsonar.net/about/getting-help/
perfSONAR videos: https://learn.nsrc.org/perfsonar
A hands-on course in London is coming

Szymon Trocha
PSNC / GÉANT
[email protected]


jisc.ac.uk