HPC computing at CERN - use cases from the engineering and physics communities
Michal HUSEJKO, Ioannis AGTZIDIS (IT/PES/ES)



Agenda
- Introduction: where we are now
- CERN applications requiring HPC infrastructure
- Use cases (Engineering): Ansys Mechanical, Ansys Fluent
- Physics HPC applications
- Next steps
- Q&A

Introduction
- Some 95% of our applications are served well by bread-and-butter machines
- We (CERN IT) have invested heavily in AI (Agile Infrastructure), including a layered approach to responsibilities, virtualisation, and a private cloud
- There are certain applications, traditionally called HPC applications, which have different requirements
- Even though these applications sail under the common HPC name, they differ from one another and have different requirements
- These applications need a detailed requirements analysis

Some 95% of our applications are served perfectly well by standard bread-and-butter machines (dual CPU, 3-4 GB/core, a bit of disk space, GigE or at best 10 GbE), as most of these applications are single-threaded.

In order to operate these machines more efficiently, we have invested in AI (Agile Infrastructure), including a layered approach to responsibilities, virtualisation, and a private cloud.

A small fraction of applications (engineering, theory, accelerator) have requirements that are not well served by these bread-and-butter machines, due to core count, memory size/latency/bandwidth, inter-process communication, and so on. These are traditionally called HPC applications, and they require different architectures.

Even if all of that sails under the label HPC, the requirements (and hence the optimal technical solutions) can be very different.

Every effort must be made to serve these requirements as much as possible with the same layered approach. In order to understand to what extent this is possible, a detailed understanding of the requirements is needed.

Scope of talk
- We contacted our user community and started to gather user requirements continuously
- We have started a detailed system analysis of our HPC applications to gain knowledge of their behaviour
- In this talk I would like to present the progress and the next steps
- At a later stage, we will look at how the HPC requirements can fit into the IT infrastructure

In order to understand the requirements of HPC applications, we have reached out to our user community to gather user test cases.

With these in hand, we have prepared a system analysis environment based on available hardware and standard Linux performance monitoring tools, with the main goal of better understanding the requirements of user applications.

In this talk I would like to introduce some of the HPC applications present at CERN, present the progress made on understanding the requirements imposed by these applications, and outline the next steps we will take towards deciding how this fits into our IT infrastructure.


HPC applications

Engineering applications:
- Used at CERN in different departments to model and design parts of the LHC machine
- The IT-PES-ES section supports the user community of these tools
- Tools used for: structural analysis, fluid dynamics, electromagnetics, multiphysics
- Major commercial tools: Ansys, Fluent, HFSS, Comsol, CST; but also open source: OpenFOAM (fluid dynamics)

Physics simulation applications:
- PH-TH Lattice QCD simulations
- BE LINAC4 plasma simulations
- BE beam simulations (CLIC, LHC, etc.)
- HEP simulation applications for theory and accelerator physics

Based on our interaction with the user community, we have distilled two different groups of HPC applications: engineering applications (general-purpose HPC applications targeting different markets) and physics simulation applications (purpose-built HPC applications developed at CERN, some with the help of external institutes).


BACKUP: Transition: While working closely with our users, we have understood that, in principle, there are two types of HPC applications present at CERN.


To visually explain the challenges imposed by these applications, I will walk you through two use cases, which stress the computing power, memory size and bandwidth, file I/O, and interconnect of the systems they run on.

Use case 1: Ansys Mechanical
- Where? The LINAC4 Beam Dump System
- Who? Ivo Vicente Leitao, Mechanical Engineer (EN/STI/TCD)
- How? Ansys Mechanical for design modelling and simulations (stress and thermal structural analysis)

Use case 1: Ansys Mechanical - how does it work?
- Structural analysis: stress and thermal, steady and transient
- Finite Element Method:
  - We have a physical problem defined by differential equations
  - It is impossible to solve it analytically for a complicated structure (problem)
  - We divide the problem into subdomains (elements)
  - We solve the differential equations (numerically) for selected points (nodes)
  - Then, by means of approximation functions, we project the solution onto the global structure
- The example has 6.0 million (6M0) mesh nodes
- Compute intensive and memory intensive

To perform stress and thermal analysis of a given structure, Ansys Mechanical employs the Finite Element Method to numerically solve the differential equations which define the physical problem.

In order to perform the numerical analysis, the structure under study is imported into Ansys and meshing is performed.

Meshing creates a 3D (or possibly 2D) mesh of points (nodes). The differential equations are then solved numerically at these nodes.

We then combine the solutions at all nodes to form the global solution.

The example contains 6 million nodes. Due to the high number of nodes, it creates a high demand for computing power and memory.
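To make the element/node picture above concrete, here is a minimal one-dimensional finite-element sketch in Python. It is purely illustrative and not the Ansys solver: the domain is split into elements, each element contributes a small local matrix to a global system, and the solution at the nodes approximates the continuous solution.

```python
# Minimal 1-D finite-element sketch (illustrative only, not the Ansys solver):
# solve -u'' = f on [0, 1] with u(0) = u(1) = 0 using linear elements.
import numpy as np

def fem_1d_poisson(n_elements=100, f=lambda x: 1.0):
    n_nodes = n_elements + 1
    x = np.linspace(0.0, 1.0, n_nodes)               # node coordinates
    h = x[1] - x[0]                                  # uniform element length
    K = np.zeros((n_nodes, n_nodes))                 # global stiffness matrix
    b = np.zeros(n_nodes)                            # global load vector

    # Element-by-element assembly: each element adds a small local matrix
    # into the global system at the indices of its two nodes.
    k_local = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    for e in range(n_elements):
        nodes = [e, e + 1]
        K[np.ix_(nodes, nodes)] += k_local
        x_mid = 0.5 * (x[e] + x[e + 1])
        b[nodes] += 0.5 * h * f(x_mid)               # midpoint quadrature

    # Dirichlet boundary conditions u(0) = u(1) = 0: solve for interior nodes.
    u = np.zeros(n_nodes)
    u[1:-1] = np.linalg.solve(K[1:-1, 1:-1], b[1:-1])
    return x, u

x, u = fem_1d_poisson()
print(u.max())   # exact solution x(1 - x)/2 peaks at 0.125
```

A 3D thermal or stress model works on the same principle, but with millions of nodes (6 million in this example) the global system becomes enormous, which is exactly where the compute and memory demands come from.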

Use case 1: Ansys Mechanical

Read up on this in the document from Ivo.

Use case 1: Ansys Mechanical - simulation results
- Measurement hardware configuration: 2x HP 580 G7 servers (4x E7-8837, 512 GB RAM, 32 cores each), 10 Gb low-latency Ethernet link
- Time to obtain a single-cycle 6M0 solution:
  - 8 cores: 63 hours to finish the simulation, 60 GB of RAM used during the simulation
  - 64 cores: 17 hours to finish the simulation, 2x 200 GB of RAM used during the simulation
- The user is interested in 50 cycles: this would need 130 days on 8 cores, or 31 days on 64 cores
- It is impossible to get simulation results for this case in a reasonable time on a standard engineering workstation

So, we took this case and simulated it on our test setup.

It took 63 hours to simulate the use case on 8 cores, and 17 hours on 64 cores.

And this was just one cycle. The user will need to simulate 50 cycles to get meaningful results for beam deposition in the dump system.

50 cycles will take around 31 days on the 64-core system, and 130 days on the 8-core system.
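As a rough cross-check of the figures above, the short calculation below (a sketch using only the numbers quoted on the slide) derives the speedup, the parallel efficiency, and the jobs-per-week metric used later in the talk.

```python
# Timings quoted above: 63 h on 8 cores vs. 17 h on 64 cores for one 6M0 cycle.
hours_8c, hours_64c = 63.0, 17.0
cores_8c, cores_64c = 8, 64

speedup = hours_8c / hours_64c                    # ~3.7x from 8x more cores
efficiency = speedup / (cores_64c / cores_8c)     # ~46% parallel efficiency
jobs_week_8c = 7 * 24 / hours_8c                  # ~2.7 cycles/week
jobs_week_64c = 7 * 24 / hours_64c                # ~9.9 cycles/week

print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
print(f"cycles/week: {jobs_week_8c:.1f} (8 cores) vs {jobs_week_64c:.1f} (64 cores)")
```

The sub-linear efficiency is consistent with the solver-dependent scalability limits discussed a few slides later.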

And we don't want engineers to start buying expensive 128 GB RAM workstations to keep under their desks.

Challenges
- Why do we care? Every day we face users asking us how to speed up some engineering application.
- Challenges: the problem size and its complexity are challenging users' workstations in terms of computing power, memory size, and file I/O. This can be extrapolated to other engineering HPC applications.
- How to solve the problem? Can we use the current infrastructure to provide a platform for these demanding applications, or do we need something completely new? And if something new, how could it fit into our IT infrastructure?

So, let's have a look at what is happening behind the scenes.

The way users work is: design, simulation, correction, simulation, re-design, simulation, before they actually get final results.

This is happening due to increasing problem sizes and the increasing number of details being simulated. More and more users are also interested in multiphysics simulations.

Analysis tools
- Standard Linux performance monitoring tools used:
  - Memory usage: sar
  - Memory bandwidth: Intel PCM (Performance Counter Monitor, open source)
  - CPU usage: iostat, dstat
  - Disk I/O: dstat
  - Network traffic monitoring: netstat
- Monitoring scripts are started from the same node where the simulation job is started. Collection of the measurement results is done automatically by our tools.

To better understand the requirements imposed on the computing infrastructure, we need to perform a detailed system analysis of these applications.

Before you can start performance analysis, you need to have monitoring tools available on the system where the applications will be tested.

We investigated the options and chose tools which we think give a good overview of what is happening at the system level while running our test cases.


We have also developed scripts that wrap around the standard Linux monitoring tools, so that we can record timestamped measurements of different system aspects.
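The actual CERN scripts are not reproduced here, but a minimal sketch of this kind of wrapper might look as follows; the tool invocations (sar -r 1, iostat -x 1, dstat 1) are illustrative and may need adjusting for a given distribution.

```python
#!/usr/bin/env python3
# Minimal sketch of a monitoring wrapper (not the actual CERN scripts): start
# standard Linux monitors alongside a simulation job, write their output to
# timestamped log files, and stop them when the job finishes.
# Usage: ./monitor_wrapper.py <simulation command and arguments>
import shlex
import subprocess
import sys
import time

MONITORS = {
    "sar_memory": "sar -r 1",      # memory utilisation, 1 s interval
    "iostat_disk": "iostat -x 1",  # extended per-device disk statistics
    "dstat_all": "dstat 1",        # CPU / disk / network overview
}

def run_with_monitoring(job_argv):
    stamp = time.strftime("%Y%m%d-%H%M%S")
    procs, logs = [], []
    for name, cmd in MONITORS.items():
        log = open(f"{name}-{stamp}.log", "w")
        procs.append(subprocess.Popen(shlex.split(cmd),
                                      stdout=log, stderr=subprocess.STDOUT))
        logs.append(log)
    try:
        return subprocess.call(job_argv)          # run the simulation job itself
    finally:
        for p in procs:                           # stop the monitors
            p.terminate()
        for log in logs:
            log.close()

if __name__ == "__main__":
    sys.exit(run_with_monitoring(sys.argv[1:]))
```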

With that we were able to get system-level measurements, which we could then analyse visually to spot bottlenecks or trends.

Use case 1: Ansys Mechanical - multi-core scalability
- Measurement info: LINAC4 beam dump system, single-cycle simulation; 64 cores @ 1 TB, 2 nodes (quad-socket Westmere, E7-8837, 512 GB each), 10 Gb iWARP
- Results: the Ansys Mechanical simulation scales well beyond a single multi-core box, greatly improving the number of jobs/week, or simulation cycles/week
- Next steps: scale to more than two nodes and measure the impact of MPI
- Conclusion: multi-core platforms are needed to finish the simulation in a reasonable time

To better understand how many cores these applications can utilise, we have started multi-core scalability measurements.

So, we took the previously presented use case (the beam dump system) and ran it distributed over two nodes (64 cores).

Use case 1: Ansys Mechanical - memory requirements
- In-core/out-of-core simulations (avoiding costly file I/O):
  - In-core: most of the temporary data is stored in RAM (the solver can still write to disk during the simulation)
  - Out-of-core: uses files on the file system to store temporary data
  - The preferable mode is in-core, to avoid costly disk I/O accesses, but this requires more RAM and more memory bandwidth
- Ansys Mechanical (and some other engineering applications) has limited scalability, which depends heavily on the solver and the user's problem
- All commercial engineering applications use some licensing scheme, which can skew the choice of platform
- Conclusion:
  - We are investigating whether the required memory can be spread over multiple dual-socket systems, or whether 4-socket systems are necessary for some HPC applications
  - Certain engineering simulations seem to be limited by memory bandwidth; this also has to be considered when choosing a platform

To avoid costly file I/O, we need enough RAM to store all of the temporary data.

Some tools have scalability limitations: there is a boundary on how far the RAM required for an optimal run can be spread.

All the commercial tools have some licensing scheme, which puts another constraint on scalability.

In order to profit from today's CPU power, the temporary data used during the simulation should be kept in memory (in-core).

Ansys HPC licensing: 320 euro per core (research), 700 euro per core (associate).
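As an illustration of the in-core question above, a back-of-the-envelope helper (an assumption-laden sketch, not a sizing tool): the 400 GB value corresponds to the 2x 200 GB of the 64-core run shown earlier, and the 128 GB and 512 GB node sizes match the dual-socket and quad-socket test machines mentioned in this talk.

```python
# Rough in-core sizing sketch: how many nodes are needed to keep a solver's
# working set entirely in RAM? (Illustrative only; real sizing also depends on
# whether the solver scales to that many MPI ranks and on licensing costs.)
import math

def nodes_needed(total_ram_gb, ram_per_node_gb, usable_fraction=0.9):
    """Minimum node count so the working set stays in memory (in-core)."""
    usable = ram_per_node_gb * usable_fraction    # leave headroom for OS/buffers
    return math.ceil(total_ram_gb / usable)

print(nodes_needed(400, 128))   # ~4 dual-socket 128 GB nodes
print(nodes_needed(400, 512))   # 1 quad-socket 512 GB node
```

Whether the solver actually scales to that many MPI ranks, and what the per-core licences cost, are the separate constraints noted above.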

Use case 1: Ansys Mechanical - disk I/O impact
- Ansys Mechanical, BE CLIC test system
- Two Supermicro servers (dual E5-2650, 128 GB), 10 Gb iWARP back-to-back
- Disk I/O impact on speedup: two configurations compared, measured with sar and iostat
- The application spends a lot of time in iowait
- Using an SSD instead of an HDD increases jobs/week by almost 100%
- Conclusion: we need to investigate more cases to see whether this is a marginal case or something more common

There can be cases (we observed at least one) where, even though you have enough memory, the system still performs a lot of file I/O operations.

In such cases, using an SSD instead of an HDD can lead to better results.

We ran the test case on an HDD, and in the log files we observed that the simulation process spends a great deal of time in iowait, so most of the cores were stalled waiting for disk access.
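For reference, iowait can also be sampled directly from /proc/stat instead of being read off sar or iostat; a minimal sketch:

```python
# Sample the aggregate "cpu" line of /proc/stat twice and report the fraction
# of CPU time spent waiting for I/O in between (Linux-specific, sketch only).
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()     # "cpu user nice system idle iowait ..."
    return [int(v) for v in fields[1:]]

def iowait_fraction(interval=5.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    return deltas[4] / sum(deltas)        # the fifth counter of the cpu line is iowait

print(f"iowait over 5 s: {iowait_fraction():.1%}")
```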

Based on that, we decided to mount SSD drives and perform a comparison. As expected, we immediately observed an improvement.

We still have to investigate whether this is a marginal case or something more common.

Use case 2: Fluent CFD
- Computational Fluid Dynamics (CFD) application, Fluent (now provided by Ansys)
- Beam dump system at the PS Booster: heat is generated inside the dump, and it needs to be cooled to prevent it from melting or breaking because of mechanical stresses
- Extensively parallelised, MPI-based software
- Performance characteristics similar to other MPI-based software:
  - Importance of low latency for short messages
  - Importance of bandwidth for medium and big messages
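To illustrate the latency/bandwidth point, here is a minimal mpi4py ping-pong sketch (illustrative only; the measurements in this talk were made with Fluent itself, not with this benchmark): small messages are dominated by latency, large messages by bandwidth.

```python
# Minimal MPI ping-pong between two ranks; run with: mpirun -np 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 1000

for size in (8, 1024, 1024 * 1024):               # message sizes in bytes
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    one_way = (MPI.Wtime() - t0) / REPS / 2       # half of the round-trip time
    if rank == 0:
        # small sizes: one_way ~ latency; large sizes: size/one_way ~ bandwidth
        print(f"{size:>8} B  {one_way * 1e6:8.1f} us  {size / one_way / 1e6:8.1f} MB/s")
```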

Use case 2: Fluent - interconnect network latency impact
- Ansys Fluent, CFD-heavy test case from the CFD group (EN-CV-PJ)
- 64 cores @ 1 TB, 2 nodes (quad-socket Westmere, E7-8837, 512 GB each), 10 Gb iWARP
- Speedup beyond a single node can be diminished by a high-latency interconnect
- The graph shows good scalability beyond a single box with 10 Gb low-latency links, and dips in performance when switching to 1 Gb for node-to-node MPI
- Next step: perform MPI statistical analysis (size and type of messages, computation vs. communication)

We took a case from a CERN group which performs CFD simulations.

This tool is reported to scale very well across many nodes (linearly up to 10-15,000 nodes).

We measured the impact of interconnect latency on the number of jobs per week.

Going from a low-latency to a high-latency interconnect diminishes the speedup achieved with a multi-node setup.

Use case 2: Fluent - memory bandwidth impact
- Measured with Intel PCM, at the memory controller level
- Supermicro Sandy Bridge server (dual E5-2650), 102.5 GB/s peak memory bandwidth
- We observed peaks of a few seconds demanding 57 GB/s, over a 5-second period; this is very close to the numbers measured with the STREAM synthetic benchmark on this platform
- Next step: check the impact of memory speed on solution time

Sampling rate: 1 second, Intel PCM.
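For a quick comparison against such PCM readings, a rough STREAM-style "scale" kernel can be run with numpy (a sketch only, and single-threaded, so it will report far less than the multi-threaded STREAM figure quoted on the slide):

```python
# STREAM-style "scale" kernel (a = scalar * b) to estimate sustainable memory
# bandwidth from a single process; illustrative, not a replacement for STREAM.
import time
import numpy as np

N = 50_000_000                                 # two ~400 MB double-precision arrays
a = np.empty(N)
b = np.random.rand(N)
scalar = 3.0

best = 0.0
for _ in range(5):                             # keep the best of a few repetitions
    t0 = time.perf_counter()
    np.multiply(b, scalar, out=a)              # scale kernel, no temporaries
    dt = time.perf_counter() - t0
    best = max(best, 2 * 8 * N / dt / 1e9)     # one read + one write of 8-byte doubles
print(f"~{best:.1f} GB/s sustained (single process)")
```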

Analysis done so far
- We have invested our time in building a first generation of tools to monitor different system parameters:
  - Multi-core scalability (Ansys Mechanical)
  - Memory size requirements (Ansys Mechanical)
  - Memory bandwidth requirements (Fluent)
  - Interconnect network (Fluent)
  - File I/O (Ansys Mechanical)
- Redo some parts: Westmere 4-socket -> Sandy Bridge 4-socket
- Next steps: start performing detailed interconnect monitoring using MPI tracing tools (Intel Trace Analyzer and Collector)

Physics HPC applications
- PH-TH: Lattice QCD simulations
- BE LINAC4 plasma simulations: plasma formation in the Linac4 ion source
- BE CLIC simulations: preservation of the luminosity over time under the effects of dynamic imperfections, such as vibrations, ground motion, and failures of accelerator components

We have tested system-level monitoring on single-node and dual-node setups. We are investigating how to scale the system-level analysis to a multi-node cluster.

Preservation of the luminosity over time under the effects of dynamic imperfections, such as vibrations, ground motion, failures of accelerator components, and so on.

Plasma formation in the Linac4 ion source: a parallel code is currently under development in collaboration with KEIO University (Japan).

The goal is to study the RF plasma heating mechanism and how the resulting plasma properties affect the ion production, with the aim of optimising the geometry and the operation of the ion source. The code is a Particle-In-Cell (PIC) algorithm, parallelised via the Message Passing Interface (MPI).
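For readers unfamiliar with the method, the snippet below sketches one highly simplified step of a 1-D electrostatic particle-in-cell cycle (deposit, field solve, gather, push). It is purely illustrative; the actual Linac4/KEIO code is, as noted above, an MPI-parallelised production code and is not shown here.

```python
# One simplified step of a 1-D electrostatic PIC cycle (illustrative sketch,
# normalised units, periodic domain; not the Linac4 ion-source code).
import numpy as np

np.random.seed(0)
L, NG, NP, DT = 1.0, 64, 10_000, 0.01          # domain length, grid cells, particles, time step
dx = L / NG
pos = np.random.rand(NP) * L                    # particle positions
vel = np.random.randn(NP) * 0.01                # particle velocities
q_macro = -1.0 / NP                             # electron macro-particle charge

# 1) Deposit: nearest-grid-point weighting of particle charge onto the mesh.
cells = (pos / dx).astype(int) % NG
rho = np.bincount(cells, minlength=NG) * q_macro / dx
rho += 1.0                                      # uniform neutralising ion background

# 2) Field solve: periodic Poisson equation phi'' = -rho via FFT.
k = 2.0 * np.pi * np.fft.fftfreq(NG, d=dx)
k[0] = 1.0                                      # avoid dividing by zero for the mean mode
phi = np.real(np.fft.ifft(np.fft.fft(rho) / k**2))
E = -np.gradient(phi, dx)                       # electric field on the grid

# 3) Gather: interpolate the field back to each particle (nearest grid point).
E_part = E[cells]

# 4) Push: advance velocities and positions (unit mass, charge -1).
vel += -E_part * DT
pos = (pos + vel * DT) % L
print("mean |E| on the grid:", float(np.abs(E).mean()))
```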

Lattice QCD
- MPI-based application with inline assembly in the most time-critical parts of the program
- Main objective is to investigate:
  - The impact of memory bandwidth on performance
  - The impact of the interconnection network on performance (comparison of 10 Gb iWARP and InfiniBand QDR)

BE LINAC4 plasma studies
- MPI-based application
- Users are requesting a system with 250 GB of RAM for 48 cores
- Main objective is to investigate: scalability of the application beyond 48 cores, in order to spread the memory requirement over more than 48 cores

Clusters
- To better understand the requirements of CERN physics HPC applications, two clusters have been prepared:
  - Investigate scalability
  - Investigate the importance of interconnect, memory bandwidth, and file I/O
- Test configuration:
  - 20x Sandy Bridge dual-socket nodes with a 10 Gb iWARP low-latency link
  - 16x Sandy Bridge dual-socket nodes with Quad Data Rate (40 Gb/s) InfiniBand

Next steps
- An activity has started to better understand the requirements of CERN HPC applications
- The standard Linux performance monitoring tools give us a very detailed overview of system behaviour for different applications
- Next steps are to:
  - Refine our approach and our scripts to work at a higher scale (first target: 20 nodes)
  - Gain more knowledge about the impact of the interconnection network on MPI jobs

We have embarked on this activity to better understand how to best serve the HPC requirements.

Thank you. Q&A