Integrating DIRAC workflows in supercomputers
TRANSCRIPT
Integrating DIRAC Workflows in Supercomputers: Status and Next Steps
Alexandre Boyer – Université Clermont Auvergne, CERN – alexandre.franck.boyer@cern.ch
Virtual DIRAC Users' Workshop – Monday 10th May 2021 – 1/21
Introduction
The DIRAC WMS implements the Pilot-Job paradigm
• Able to federate a large variety of heterogeneous computing resources
• Mainly Grid Sites and Clouds, but also opportunistic resources. What about Supercomputers?
Supercomputers represent a significant amount of computing power
• Different from Grid Sites: integrating VO-specific workflows on such machines through DIRAC requires work
• Each machine is unique and the landscape evolves quickly
Table of Contents
• DIRAC WMS on Grid Sites
• DIRAC WMS and Supercomputers
• Tackling the distributed computing challenges
DIRAC WMS on Grid Sites
WMS Workflow
[Diagram: the DIRAC Pilot-Factory (1) submits pilots through a Computing Element to the worker nodes of a Grid Site; the pilots then (2) fetch the waiting jobs from the DIRAC services]
DIRAC WMS on Grid Sites
"Typical" job requirements
[Diagram: 4 single-process (SP) jobs running on the cores of a Grid Site worker node. Typical requirements: single-core allocation, x86 architecture, SLC6/CC7 compatible, CVMFS endpoint mounted on the nodes, >2 GB RAM per core, network access to the Internet (fetch jobs, upload outputs), LRMS accessible from outside]
DIRAC WMS and Supercomputers
Presentation
Definition: a mainframe computer that is among the largest, fastest, or most powerful of those available at a given time.
• Twice a year, top500.org releases the list of the most powerful supercomputers (SC) in the world
• #1, Fugaku, is composed of ARM processors and contains ~7 million cores
• #2 and #3 leverage IBM Power processors and Nvidia GPUs and contain ~1.5-2 million cores
• In comparison, WLCG provides ~1 million cores (though many additional parameters have to be taken into account for a fair comparison)
DIRAC WMS and Supercomputers
Features of Supercomputers
[Diagram: features of an HPC machine: single/multi-core allocation, x86/non-x86 architecture, fast node interconnectivity, GPU usage (e.g. 2 multi-process (MP) jobs running on a fat node, 1 MP job running on a GPU), shared file system, no network connectivity, restrictive user authentication]
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
• SC have many-core architectures
• They can include non-x86 CPUs (ARM, AMD, Power) and GPUs
• They might provide less than 2 GB per core
Distributed computing (DIRAC)
• SC policies may differ from those of HEP Grid Sites
• They might lack CVMFS, outbound connectivity, or external access to the LRMS
⇒ SC are all made differently: it is hard to build a unique solution for all of them
Tackling the distributed computing challenges
Overview
• 1 main variable directly affects the chosen solution (push or pull model):
+ Do WNs have external connectivity? yes (or only via the edge node) / no
• Other variables generate variations that can be added on top of the proposed solution:
+ Is CVMFS mounted on the WNs? yes / no
+ Is the LRMS accessible from outside? yes / no
+ What type of allocations can we make? single-core / multi-core / multi-node
⇒ We will go through the different cases, from the easiest to the hardest one
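As an illustration, the decision logic above can be sketched as a small helper function (a hypothetical sketch with invented names; this is not part of DIRAC):

```python
def choose_submission_model(wn_external_connectivity: bool,
                            cvmfs_on_wns: bool,
                            lrms_accessible_from_outside: bool):
    """Sketch of the decision variables described above."""
    # Main variable: external connectivity on the worker nodes decides
    # between the pull model (pilots fetch jobs) and the push model
    # (work is pushed from outside the machine).
    model = "pull" if wn_external_connectivity else "push"

    # Secondary variables add variations on top of the base model.
    variations = []
    if not cvmfs_on_wns:
        variations.append("deploy CVMFS (cvmfsexec or a CVMFS subset)")
    if not lrms_accessible_from_outside:
        variations.append("run the pilot factory on the edge node")
    return {"model": model, "variations": variations}

# A supercomputer behaving like a Grid Site needs no variation:
print(choose_submission_model(True, True, True))
# {'model': 'pull', 'variations': []}
```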
Tackling the distributed computing challenges
Pull model: single-core allocation
Similar to a Grid Site
• Uncommon for a SC
• Often need to collaborate with the system administrators
[Diagram: single-core allocations on the worker nodes; CVMFS endpoint mounted on the nodes, shared FS, edge node, external connectivity (fetch jobs, upload outputs), LRMS accessible from outside (push pilots from the DIRAC server)]
Tackling the distributed computing challenges
Pull model: multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores, or even nodes, to run a program (queue configuration)
Fat node partition [3]
• One pilot per fat node executes several SP/MP jobs per allocation
• In the Queue configuration, add LocalCEType=Pool and NumberOfProcessors=N
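In a DIRAC configuration file, such a queue might look roughly like this (a sketch; the queue name and the value of N are invented for illustration):

```
fat-node-queue
{
  # Inner "Pool" CE lets one pilot run several SP/MP payloads in parallel
  LocalCEType = Pool
  # N = number of logical cores of one fat node (example value)
  NumberOfProcessors = 64
}
```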
[Diagram: multi-core allocation on a fat node; CVMFS endpoint mounted on the nodes, shared FS, edge node, external connectivity (fetch jobs, upload outputs), LRMS accessible from outside (push pilots from the DIRAC server)]
Tackling the distributed computing challenges
Pull model: multi-node allocation
Almost integrated in v7r1
Allows getting a large number of resources with a small number of allocations
Sub-Pilots (currently specific to SLURM)
• One sub-pilot per allocated fat node; the sub-pilots share a same id, status, and output
• In the Queue configuration, add ParallelLibrary=PL and NumberOfNodes=N<-M>
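A corresponding multi-node queue section might look like this (again a sketch; the queue name and values are invented, and srun is only an assumed value for the parallel library on a SLURM machine):

```
multi-node-queue
{
  LocalCEType = Pool
  NumberOfProcessors = 48
  # Parallel library used to spawn one sub-pilot per allocated node (assumption)
  ParallelLibrary = srun
  # Request between N=2 and M=4 nodes per allocation
  NumberOfNodes = 2-4
}
```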
[Diagram: multi-node allocation; CVMFS endpoint mounted on the nodes, shared FS, edge node, external connectivity (fetch jobs, upload outputs), LRMS accessible from outside (push pilots from the DIRAC server)]
Tackling the distributed computing challenges
Pull model: CVMFS not mounted on the WNs
Not integrated, VO action
SC, by default, do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
• Mount CVMFS as an unprivileged user
• Purely a site admin/VO action; we might actually need to add a parameter in DIRAC to ease the process
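Based on the cvmfsexec README, the site admin/VO workflow would look roughly like this (a sketch; the repository name is only an example):

```
# Build a self-contained CVMFS distribution once, as an unprivileged user
git clone https://github.com/cvmfs/cvmfsexec
cd cvmfsexec
./makedist default

# Mount the repository and run a command with /cvmfs available
./cvmfsexec lhcb.cern.ch -- ls /cvmfs/lhcb.cern.ch
```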
[Diagram: CVMFS not mounted on the nodes; CVMFS-exec mounts the CVMFS endpoint on the shared file system; edge node, external connectivity (fetch jobs, upload outputs), LRMS accessible from outside (push pilots from the DIRAC server)]
Tackling the distributed computing challenges
Pull model: no remote access to the LRMS
Integrated since v7r0, VO action
Some SC can only be accessed via a VPN (no CE, no direct SSH)
Site Director on the edge node
• Directly submits pilots from the edge node
• Would need to be allowed to execute agents on the edge node
• Would need to be updated manually
[Diagram: LRMS not accessible from outside; the Pilot-Factory runs on the edge node instead of pushing pilots from the DIRAC server; CVMFS endpoint mounted on the nodes, shared FS, external connectivity (fetch jobs, upload outputs)]
Tackling the distributed computing challenges
Pull model: external connectivity only from the edge node
Not integrated, VO action
Some SC only provide external connectivity from the edge node. Pilots cannot directly interact with the DIRAC services in this case.
Gateway
• Would be installed on the edge node (if possible)
• Would capture the Pilot and Job calls and redirect them
[Diagram: external connectivity only from the edge node; a gateway service on the edge node captures the calls and redirects them; CVMFS endpoint mounted on the nodes, shared FS, LRMS accessible from outside (push pilots from the DIRAC server)]
Tackling the distributed computing challenges
Push model: no external connectivity
In progress, v7r2
Some SC do not provide any external connectivity at all, neither on the WNs nor on the edge node
PushJobAgent
• Works like a pilot, outside of the SC
• Fetches jobs, deals with inputs and outputs, submits the application part to a SC
• Requires direct access to the LRMS
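The cycle above can be sketched as follows (a toy illustration with invented stand-in classes; this is not the actual DIRAC API):

```python
# Minimal sketch of the PushJobAgent cycle described above; the class and
# method names are invented for illustration.

class FakeWMS:
    """Stands in for the DIRAC WMS services."""
    def __init__(self):
        self.jobs = [{"id": 1, "app": "sim.py", "inputs": ["in.dat"]}]
        self.status = {}
    def fetch_waiting_job(self):
        return self.jobs.pop(0)
    def set_status(self, job_id, status):
        self.status[job_id] = status

class FakeCE:
    """Stands in for the supercomputer's LRMS, reached directly."""
    def submit(self, app, inputs):
        return {"app": app, "inputs": inputs}
    def wait_for_outputs(self, handle):
        return [f"{handle['app']}.out"]

def push_job_cycle(wms, ce):
    job = wms.fetch_waiting_job()                  # 1. fetch a job and get its inputs
    handle = ce.submit(job["app"], job["inputs"])  # 2. submit the application part to the SC
    outputs = ce.wait_for_outputs(handle)          # 3. get the outputs and store them
    wms.set_status(job["id"], "Done")
    return outputs

wms, ce = FakeWMS(), FakeCE()
print(push_job_cycle(wms, ce))  # ['sim.py.out']
```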
[Diagram: no external connectivity; the PushJobAgent, outside the SC, (1) fetches a job and gets its inputs from DIRAC, (2) submits the application through a Computing Element to the edge node, and (3) gets the outputs and stores them; CVMFS endpoint mounted on the nodes, shared FS]
Tackling the distributed computing challenges
Push model: no external connectivity, multi-core/node
In progress, v7r2
BundleCE
• Would aggregate multiple applications into one allocation
[Diagram: no external connectivity, multi-core/multi-node allocations; DIRAC (1) submits applications, the BundleCE (2) aggregates them and (3) submits them as one multi-core job through the Computing Element; CVMFS endpoint mounted on the nodes, shared FS]
Tackling the distributed computing challenges
Push model: no external connectivity, no CVMFS
In progress, v7r2, VO action
As already said, SC do not provide CVMFS by default, and CVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
• Run & extract the CVMFS dependencies of given jobs
• Use CVMFS-Shrinkwrapper [1] to make a subset of CVMFS
• Test it & deploy it on the SC shared FS
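Per the CVMFS documentation, the shrinkwrap utility is driven by a specification file listing the paths to export; an invocation might look like this (a sketch; the repository name, paths, and spec content are examples):

```
# example.spec lists the paths of the repository to export, one pattern per line
cvmfs_shrinkwrap --repo lhcb.cern.ch \
    --src-base /cvmfs \
    --dest-base /shared-fs/cvmfs-subset \
    --spec-file example.spec -j 8
```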
[Diagram: no external connectivity, no CVMFS; the CVMFS Subset Builder builds a subset of CVMFS from the CVMFS endpoint and deploys it on the shared FS; DIRAC submits through the Computing Element to the edge node]
Conclusion
Status
• Support many SC with external connectivity (multi-core allocations)
• Tools to exploit SC with no external connectivity are in progress
Next Steps
• Provide the push model solution and its variations
• Work on DB12 (CPU power computation) support for multi-core allocations
• Provide complete documentation about SC integration
• Provide side projects to minimize VO actions (Subset-CVMFS-Builder)
Thanks
Any questions? Comments?
[1] CVMFS. cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html. Online, accessed 4 May 2021.
[2] CVMFS. cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online, accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 (June 2020). URL: http://arxiv.org/abs/2006.13603
Backup Slides
Real use-cases we have in LHCb
LHCb-supercomputers collaboration
Available HPCs
• Piz Daint at CSCS (Switzerland)
• Marconi-A2 at CINECA (Italy) – not used anymore
• SDumont at LNCC (Brazil)
• MareNostrum at BSC (Spain)
LHCb-supercomputers collaboration
Piz Daint CSCS
• Ranked 12th in the Top500 (Nov 2020)
• 387,872 cores (Nov 2020)
+ Collaboration with the local system administrators allows a traditional Grid Site usage
⇒ No change required
LHCb-supercomputers collaboration
Marconi-A2 CINECA
• Ranked 19th in the Top500 (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
• 348,000 cores (Nov 2019)
- Multi-core allocations: 272 logical cores per node (Intel KNL)
- Low memory per core
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation
[Diagram: the DIRAC Pilot-Factory (1) submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); the pilots then (2) fetch the waiting jobs from the DIRAC services]
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a V100 GPU and Power9 CPU cluster
Done
• Exploited 68/272 cores per node; not enough memory for more jobs
To be done
• Nothing to do: Marconi-A2 disappeared
• LHCb software not ready for GPUs
LHCb-supercomputers collaboration
SDumont LNCC
• Ranked 277th in the Top500 (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
• 33,856 cores (Nov 2020)
- Multi-core allocations: 24 or 48 logical cores per node
- Multi-node allocations: 21 nodes per allocation required by some queues
LHCb-supercomputers collaboration
SDumont LNCC Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation
• Multi-node allocations: use the sub-pilots variation
[Diagram: the DIRAC Pilot-Factory (1) submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); the pilots then (2) fetch the waiting jobs from the DIRAC services]
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
• Exploit 24/24 and 48/48 cores per node
To be done
• Multi-node allocation: should have results soon
• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time
LHCb-supercomputers collaboration
Mare Nostrum BSC
• Ranked 42nd in the Top500 (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocation possible but not recommended
• 153,216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on the WNs
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation
[Diagram: the BundleCE (1) fetches waiting jobs from the DIRAC services and (2) submits the applications through an ARC CE to the MareNostrum worker nodes; the CVMFS Subset Builder provides the CVMFS subset on the machine]
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder
To be done
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Introduction
The DIRAC WMS implements the Pilot-Job paradigm
bull Able to federate a large variety of heterogeneous computingresources
bull Mainly Grid Sites Clouds but also opportunistic resources Whatabout Supercomputers
Supercomputers represent an important computing power
bull Different from the Grid Sites integrating VO-specific workflows onsuch machines through DIRAC requires work
bull Each machine is unique and the landscape quickly evolves
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 221
Table of Contents
bull DIRAC WMS on Grid Sites
bull DIRAC WMS and Supercomputers
bull Tackling the distributed computing challenges
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 321
DIRAC WMS on Grid Sites
WMS Workflow
ComputingElement
1 Submit pilots
Grid SiteWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 421
DIRAC WMS on Grid Sites
rdquoTypicalrdquo job requirements
Single-Coreallocation
x86architecture
SLC6CC7compatible
4 SP Jobs running on a Grid Site
CVMFS endpointmounted on the nodes
gt2Gb RAMper core
Internet
Network access(Fetch jobs upload outputs)
Worker Node
Core
LRMS accessiblefrom outside
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 521
DIRAC WMS and Supercomputers
Presentation
Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time
bull Twice a year top500org releases the list of the most powerful SCof the world
bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores
bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores
bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621
DIRAC WMS and Supercomputers
Features of Supercomputers
SingleMulti-Coreallocation
x86Non-x86architecture
FastNodes-interconnectivty
2Multi-Process(MP)Jobsrunningonafatnode
1MPJobrunningonGPU
GPUusage
SharedFileSystem
HPC
Nonetworkconnectivity
Restrictiveuserauthentication
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
bull SC are many-core architecture
bull They can include non x86CPUs (ARM AMD Power)GPUs
bull They might contain less than2Gbcore
Distributed computing (DIRAC)
bull SC policies may differ fromthose of HEP Grid Sites
bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS
rArr SC are all made differently hard to build a unique solution for all ofthem
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821
Tackling the distributed computing challenges
Overview
bull 1 main variable directly affects the chosen solution (push pull)
+ Do WNs have an external connectivity yes (or only via the edgenode) no
bull Other variables generate variations that can be added up to theproposed solution
+ Is CVMFS mounted on the WNs yes no
+ Is LRMS accessible from outside yes no
+ What type of allocations can we make single-core multi-coremulti-node
rArr We will go through different cases from the easiest to the hardestone
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challengesPull model Ext connectivity only from the edge node
Not Integrated VO action
Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case
Gateway
bull Would be installed onthe edge node (ifpossible)
bull Would capture the Pilotand Job calls and wouldredirect them
Shared FS
External connectivityonly from the edge node
Internet
External connectivity(Fetch jobs upload outputs)
Gateway serviceon the edge node
captures calls andredirects them
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521
Tackling the distributed computing challengesPush model No Ext connectivity
In Progress v7r2
Some SC do not provide any external connectivity at all neither on theWNs or the edge node
PushJobAgent
bull Works like a pilotoutside of the SC
bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC
bull Require a direct accessto the LRMS
Shared FS
No externalConnectivity
CVMFS endpointmounted on the nodes
Internet
1 Fetch a joband get inputs
3 Get outputsand store them
DIRAC 2 Submitthe
Application
Computing Element
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Table of Contents
bull DIRAC WMS on Grid Sites
bull DIRAC WMS and Supercomputers
bull Tackling the distributed computing challenges
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 321
DIRAC WMS on Grid Sites
WMS Workflow
ComputingElement
1 Submit pilots
Grid SiteWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 421
DIRAC WMS on Grid Sites
rdquoTypicalrdquo job requirements
Single-Coreallocation
x86architecture
SLC6CC7compatible
4 SP Jobs running on a Grid Site
CVMFS endpointmounted on the nodes
gt2Gb RAMper core
Internet
Network access(Fetch jobs upload outputs)
Worker Node
Core
LRMS accessiblefrom outside
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 521
DIRAC WMS and Supercomputers
Presentation
Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time
bull Twice a year top500org releases the list of the most powerful SCof the world
bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores
bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores
bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621
DIRAC WMS and Supercomputers
Features of Supercomputers
SingleMulti-Coreallocation
x86Non-x86architecture
FastNodes-interconnectivty
2Multi-Process(MP)Jobsrunningonafatnode
1MPJobrunningonGPU
GPUusage
SharedFileSystem
HPC
Nonetworkconnectivity
Restrictiveuserauthentication
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
bull SC are many-core architecture
bull They can include non x86CPUs (ARM AMD Power)GPUs
bull They might contain less than2Gbcore
Distributed computing (DIRAC)
bull SC policies may differ fromthose of HEP Grid Sites
bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS
rArr SC are all made differently hard to build a unique solution for all ofthem
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821
Tackling the distributed computing challenges
Overview
bull 1 main variable directly affects the chosen solution (push pull)
+ Do WNs have an external connectivity yes (or only via the edgenode) no
bull Other variables generate variations that can be added up to theproposed solution
+ Is CVMFS mounted on the WNs yes no
+ Is LRMS accessible from outside yes no
+ What type of allocations can we make single-core multi-coremulti-node
rArr We will go through different cases from the easiest to the hardestone
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challenges
Pull model: multi-node allocation
Almost integrated in v7r1
Allows getting a large number of resources with a small number of allocations
Sub-pilots (currently specific to SLURM)
• One sub-pilot per fat node allocated; sub-pilots share the same ID, status, and output
• In the queue configuration, add ParallelLibrary=PL and NumberOfNodes=N<-M>
[Diagram: multi-node allocation; CVMFS endpoint mounted on the nodes; external connectivity (fetch jobs, upload outputs); LRMS accessible from outside (pilots pushed from the DIRAC server); shared FS; edge node]
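Extending the previous excerpt, a multi-node queue would add the two options named on the slide; the concrete values below are illustrative only (the slide gives them as ParallelLibrary=PL and NumberOfNodes=N<-M>):

```
# dirac.cfg excerpt (hypothetical queue name; values for illustration)
Queues
{
  multi_node_queue
  {
    LocalCEType = Pool
    NumberOfProcessors = 48
    ParallelLibrary = PL     # wrapper generating the multi-node startup (SLURM today)
    NumberOfNodes = 2-4      # request between N=2 and M=4 nodes per allocation
  }
}
```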
Tackling the distributed computing challenges
Pull model: CVMFS not mounted on the WNs
Not integrated; VO action
By default, SCs do not provide CVMFS on the WNs
cvmfsexec on the shared FS [2]
• Mounts CVMFS as an unprivileged user
• Purely a site-admin/VO action; we might need to add a parameter in DIRAC to ease the process
[Diagram: CVMFS not mounted on the nodes; cvmfsexec on the shared file system pulls from a CVMFS endpoint; external connectivity (fetch jobs, upload outputs); LRMS accessible from outside (pilots pushed from the DIRAC server); edge node]
Tackling the distributed computing challenges
Pull model: no remote access to the LRMS
Integrated since v7r0; VO action
Some SCs can only be accessed via a VPN (no CE, no direct SSH)
Site Director on the edge node
• Directly submits pilots from the edge node
• Would need permission to execute agents on the edge node
• Would need to be updated manually
[Diagram: LRMS not accessible from outside; Pilot-Factory on the edge node; CVMFS endpoint mounted on the nodes; external connectivity (fetch jobs, upload outputs); shared FS]
Tackling the distributed computing challenges
Pull model: external connectivity only from the edge node
Not integrated; VO action
Some SCs only provide external connectivity from the edge node; pilots cannot directly interact with the DIRAC services in this case
Gateway
• Would be installed on the edge node (if possible)
• Would capture the pilot and job calls and redirect them
[Diagram: external connectivity only from the edge node; a gateway service on the edge node captures calls and redirects them; CVMFS endpoint mounted on the nodes; LRMS accessible from outside (pilots pushed from the DIRAC server); shared FS]
Tackling the distributed computing challenges
Push model: no external connectivity
In progress, v7r2
Some SCs do not provide any external connectivity at all, neither on the WNs nor on the edge node
PushJobAgent
• Works like a pilot outside of the SC
• Fetches jobs, deals with inputs and outputs, and submits the application part to the SC
• Requires direct access to the LRMS
[Diagram: no external connectivity; CVMFS endpoint mounted on the nodes; shared FS; edge node; DIRAC 1. fetches a job and gets its inputs, 2. submits the application to the Computing Element, 3. gets the outputs and stores them]
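The three numbered steps can be sketched as one agent cycle (a hypothetical sketch, not the real PushJobAgent code; the four callables stand in for the DIRAC services, the shared FS, and the LRMS so the example stays self-contained):

```python
# Minimal sketch of the push-model cycle: the agent lives outside the SC
# and plays the pilot's role, so no network interaction happens on the WNs.
def push_cycle(fetch_job, stage_in, run_on_lrms, stage_out):
    """Run one agent iteration over injected service callables."""
    job = fetch_job()                      # 1. fetch a waiting job and its inputs
    if job is None:
        return None
    inputs = stage_in(job)                 # place input data on the shared FS
    outputs = run_on_lrms(job, inputs)     # 2. submit the application part to the SC
    stage_out(job, outputs)                # 3. retrieve the outputs and store them
    return job["id"]
```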
Tackling the distributed computing challenges
Push model: no external connectivity, multi-core/node
In progress, v7r2
BundleCE
• Would aggregate multiple applications into one allocation
[Diagram: no external connectivity; multi-core/multi-node; CVMFS endpoint mounted on the nodes; shared FS; DIRAC 1. submits applications, 2. the BundleCE aggregates them, 3. submits them to the Computing Element as one multi-core job]
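The aggregation step can be pictured as simple chunking (a hypothetical sketch of the BundleCE idea, not DIRAC code):

```python
# Pack independent single-core applications into fixed-size allocations so
# that each LRMS submission fills one multi-core (or multi-node) slot.
def bundle(applications, slots_per_allocation):
    """Split the application list into chunks of slots_per_allocation;
    each chunk would be submitted to the LRMS as one job."""
    return [
        applications[i:i + slots_per_allocation]
        for i in range(0, len(applications), slots_per_allocation)
    ]
```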
Tackling the distributed computing challenges
Push model: no external connectivity, no CVMFS
In progress, v7r2; VO action
As already said, SCs do not provide CVMFS by default, and cvmfsexec cannot be used in this context
Subset-CVMFS-Builder
• Runs jobs and extracts their CVMFS dependencies
• Uses CVMFS-Shrinkwrapper [1] to make a subset of CVMFS
• Tests the subset and deploys it on the SC shared FS
[Diagram: no external connectivity; no CVMFS on the WNs; the CVMFS Subset Builder pulls from a CVMFS endpoint, then builds and deploys a subset of CVMFS on the shared FS; DIRAC, edge node, Computing Element]
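The "extract dependencies" step can be illustrated in miniature (a hypothetical helper; the real cvmfs_shrinkwrap specification syntax is richer than this): collapse the file paths a traced job accessed into deduplicated directory entries that a shrinkwrap spec could include.

```python
import os

# Simplified illustration of the Subset-CVMFS-Builder idea: derive
# per-directory wildcard entries from a list of files a traced job touched.
def spec_entries(accessed_paths):
    """Collapse accessed files into sorted per-directory wildcard entries."""
    directories = {os.path.dirname(path) for path in accessed_paths}
    return [directory + "/*" for directory in sorted(directories)]
```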
Conclusion
Status
• Many SCs with external connectivity are supported (multi-core allocations)
• Tools to exploit SCs with no external connectivity are in progress
Next Steps
• Provide the push-model solution and its variations
• Work on DB12 (CPU power computation) support for multi-core allocations
• Provide complete documentation about SC integration
• Provide side projects to minimize VO actions (Subset-CVMFS-Builder)
Thanks
Any questions? Comments?
[1] CVMFS cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html. Online; accessed 4 May 2021.
[2] CVMFS cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online; accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 [hep-ex] (June 2020). URL: http://arxiv.org/abs/2006.13603.
Backup Slides
Real use cases we have in LHCb
LHCb-supercomputers collaboration
Available HPCs
• Piz Daint at CSCS (Switzerland)
• Marconi-A2 at CINECA (Italy) – not used anymore
• SDumont at LNCC (Brazil)
• MareNostrum at BSC (Spain)
LHCb-supercomputers collaboration
Piz Daint (CSCS)
• Ranked 12th in the Top500 (Nov 2020)
• 387,872 cores (Nov 2020)
+ Collaboration with the local system administrators allows a traditional Grid-Site usage
⇒ No change required
LHCb-supercomputers collaboration
Marconi-A2 (CINECA)
• Ranked 19th in the Top500 (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
• 348,000 cores (Nov 2019)
- Multi-core allocations: 272 logical cores per node (Intel KNL)
- Low memory per core
LHCb-supercomputers collaboration
Marconi-A2 (CINECA): Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat-node partitioning variation
[Diagram: the Pilot-Factory submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); pilots fetch the waiting jobs from the DIRAC services]
LHCb-supercomputers collaboration
Marconi-A2 (CINECA): Status
⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a V100 GPU and Power9 CPU cluster
Done
• Exploited 68/272 cores per node; not enough memory for more jobs
To be done
• Nothing to do: Marconi-A2 disappeared
• LHCb software not ready for GPUs
LHCb-supercomputers collaboration
SDumont (LNCC)
• Ranked 277th in the Top500 (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
• 33,856 cores (Nov 2020)
- Multi-core allocations: 24 or 48 logical cores per node
- Multi-node allocations: 21 nodes per allocation required by some queues
LHCb-supercomputers collaboration
SDumont (LNCC): Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat-node partitioning variation
• Multi-node allocations: use the sub-pilots variation
[Diagram: the Pilot-Factory submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); pilots fetch the waiting jobs from the DIRAC services]
LHCb-supercomputers collaboration
SDumont (LNCC): Status
Done
• Exploit 24/24 and 48/48 cores per node
To be done
• Multi-node allocations: should have results soon
• Dirac Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time
LHCb-supercomputers collaboration
MareNostrum (BSC)
• Ranked 42nd in the Top500 (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocations possible but not recommended
• 153,216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on the WNs
LHCb-supercomputers collaboration
MareNostrum (BSC): Development
• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation
[Diagram: 1. fetch the waiting jobs from the DIRAC services; 2. the BundleCE submits the applications through an ARC CE to the MareNostrum worker nodes, which use the CVMFS subset provided by the CVMFS Subset Builder]
LHCb-supercomputers collaboration
MareNostrum (BSC): Status
Done
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder
To be done
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
DIRAC WMS on Grid Sites
WMS Workflow
ComputingElement
1 Submit pilots
Grid SiteWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 421
DIRAC WMS on Grid Sites
rdquoTypicalrdquo job requirements
Single-Coreallocation
x86architecture
SLC6CC7compatible
4 SP Jobs running on a Grid Site
CVMFS endpointmounted on the nodes
gt2Gb RAMper core
Internet
Network access(Fetch jobs upload outputs)
Worker Node
Core
LRMS accessiblefrom outside
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 521
DIRAC WMS and Supercomputers
Presentation
Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time
bull Twice a year top500org releases the list of the most powerful SCof the world
bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores
bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores
bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621
DIRAC WMS and Supercomputers
Features of Supercomputers
SingleMulti-Coreallocation
x86Non-x86architecture
FastNodes-interconnectivty
2Multi-Process(MP)Jobsrunningonafatnode
1MPJobrunningonGPU
GPUusage
SharedFileSystem
HPC
Nonetworkconnectivity
Restrictiveuserauthentication
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
bull SC are many-core architecture
bull They can include non x86CPUs (ARM AMD Power)GPUs
bull They might contain less than2Gbcore
Distributed computing (DIRAC)
bull SC policies may differ fromthose of HEP Grid Sites
bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS
rArr SC are all made differently hard to build a unique solution for all ofthem
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821
Tackling the distributed computing challenges
Overview
bull 1 main variable directly affects the chosen solution (push pull)
+ Do WNs have an external connectivity yes (or only via the edgenode) no
bull Other variables generate variations that can be added up to theproposed solution
+ Is CVMFS mounted on the WNs yes no
+ Is LRMS accessible from outside yes no
+ What type of allocations can we make single-core multi-coremulti-node
rArr We will go through different cases from the easiest to the hardestone
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challengesPull model Ext connectivity only from the edge node
Not Integrated VO action
Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case
Gateway
bull Would be installed onthe edge node (ifpossible)
bull Would capture the Pilotand Job calls and wouldredirect them
Shared FS
External connectivityonly from the edge node
Internet
External connectivity(Fetch jobs upload outputs)
Gateway serviceon the edge node
captures calls andredirects them
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521
Tackling the distributed computing challengesPush model No Ext connectivity
In Progress v7r2
Some SC do not provide any external connectivity at all neither on theWNs or the edge node
PushJobAgent
bull Works like a pilotoutside of the SC
bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC
bull Require a direct accessto the LRMS
Shared FS
No externalConnectivity
CVMFS endpointmounted on the nodes
Internet
1 Fetch a joband get inputs
3 Get outputsand store them
DIRAC 2 Submitthe
Application
Computing Element
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
DIRAC WMS on Grid Sites
rdquoTypicalrdquo job requirements
Single-Coreallocation
x86architecture
SLC6CC7compatible
4 SP Jobs running on a Grid Site
CVMFS endpointmounted on the nodes
gt2Gb RAMper core
Internet
Network access(Fetch jobs upload outputs)
Worker Node
Core
LRMS accessiblefrom outside
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 521
DIRAC WMS and Supercomputers
Presentation
Definition A mainframe computer that is among the largest fastestor most powerful of those available at a given time
bull Twice a year top500org releases the list of the most powerful SCof the world
bull 1 Fugaku is composed of ARM processors and contains sim7 millioncores
bull 2 and 3 leverage IBM Power processors and Nvidia GPUs andcontain sim15-2 million cores
bull In comparison WLCG provides sim1 million cores (many additionalparameters have to be taken into account for a fair comparisonthough)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 621
DIRAC WMS and Supercomputers
Features of Supercomputers
SingleMulti-Coreallocation
x86Non-x86architecture
FastNodes-interconnectivty
2Multi-Process(MP)Jobsrunningonafatnode
1MPJobrunningonGPU
GPUusage
SharedFileSystem
HPC
Nonetworkconnectivity
Restrictiveuserauthentication
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
bull SC are many-core architecture
bull They can include non x86CPUs (ARM AMD Power)GPUs
bull They might contain less than2Gbcore
Distributed computing (DIRAC)
bull SC policies may differ fromthose of HEP Grid Sites
bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS
rArr SC are all made differently hard to build a unique solution for all ofthem
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821
Tackling the distributed computing challenges
Overview
bull 1 main variable directly affects the chosen solution (push pull)
+ Do WNs have an external connectivity yes (or only via the edgenode) no
bull Other variables generate variations that can be added up to theproposed solution
+ Is CVMFS mounted on the WNs yes no
+ Is LRMS accessible from outside yes no
+ What type of allocations can we make single-core multi-coremulti-node
rArr We will go through different cases from the easiest to the hardestone
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challengesPull model Ext connectivity only from the edge node
Not Integrated VO action
Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case
Gateway
bull Would be installed onthe edge node (ifpossible)
bull Would capture the Pilotand Job calls and wouldredirect them
Shared FS
External connectivityonly from the edge node
Internet
External connectivity(Fetch jobs upload outputs)
Gateway serviceon the edge node
captures calls andredirects them
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521
Tackling the distributed computing challengesPush model No Ext connectivity
In Progress v7r2
Some SC do not provide any external connectivity at all neither on theWNs or the edge node
PushJobAgent
bull Works like a pilotoutside of the SC
bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC
bull Require a direct accessto the LRMS
Shared FS
No externalConnectivity
CVMFS endpointmounted on the nodes
Internet
1 Fetch a joband get inputs
3 Get outputsand store them
DIRAC 2 Submitthe
Application
Computing Element
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2, CINECA
• Ranked 19th in Top500 (Nov 2019)
• 348,000 cores (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
- Multi-core allocations: 272 logical cores per node (Intel KNL)
- Low memory per core
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation
[Diagram: the Pilot-Factory of the DIRAC services (1) submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); the pilots (2) fetch waiting jobs.]
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Status
⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a V100 GPU and Power9 CPU cluster
Done
• Exploited 68/272 cores per node: not enough memory for more jobs
To be done
• Nothing to do: Marconi-A2 disappeared
• LHCb software not ready for GPUs
LHCb-supercomputers collaboration
SDumont, LNCC
• Ranked 277th in Top500 (Nov 2020)
• 33,856 cores (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
- Multi-core allocations: 24 or 48 logical cores per node
- Multi-node allocations: 21 nodes per allocation required by some queues
LHCb-supercomputers collaboration
SDumont, LNCC: Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation
• Multi-node allocations: use the sub-pilots variation
[Diagram: the Pilot-Factory of the DIRAC services (1) submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); the pilots (2) fetch waiting jobs.]
LHCb-supercomputers collaboration
SDumont, LNCC: Status
Done
• Exploit 24/24 and 48/48 cores per node
To be done
• Multi-node allocations: should have results soon
• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time
LHCb-supercomputers collaboration
MareNostrum, BSC
• Ranked 42nd in Top500 (Nov 2020)
• 153,216 cores (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocation possible but not recommended
- No network connectivity
- CVMFS not mounted on the WNs
LHCb-supercomputers collaboration
MareNostrum, BSC: Development
• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation
[Diagram: the BundleCE (1) fetches waiting jobs from the DIRAC services and (2) submits the applications through an ARC CE to the MareNostrum worker nodes; the CVMFS Subset Builder provides the CVMFS subset.]
LHCb-supercomputers collaboration
MareNostrum, BSC: Status
Done
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder
To be done
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
DIRAC WMS and Supercomputers
Features of Supercomputers
• Single/Multi-core allocation
• x86/non-x86 architecture
• Fast node interconnectivity
• GPU usage
• Shared file system
• No network connectivity
• Restrictive user authentication
[Diagram: an HPC running 2 Multi-Process (MP) jobs on a fat node and 1 MP job on a GPU.]
Virtual DIRAC Users' Workshop - Monday 10th May 2021 (7/21)
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
• SC are many-core architectures
• They can include non-x86 CPUs (ARM, AMD, Power), GPUs
• They might provide less than 2 GB/core
Distributed computing (DIRAC)
• SC policies may differ from those of HEP Grid Sites
• They might lack CVMFS, outbound connectivity, external access to the LRMS
⇒ SC are all made differently: hard to build a unique solution for all of them
Tackling the distributed computing challenges
Overview
• 1 main variable directly affects the chosen solution (push, pull):
+ Do WNs have external connectivity? Yes (or only via the edge node) / no
• Other variables generate variations that can be added to the proposed solution:
+ Is CVMFS mounted on the WNs? Yes / no
+ Is the LRMS accessible from outside? Yes / no
+ What type of allocations can we make? Single-core, multi-core, multi-node
⇒ We will go through the different cases, from the easiest to the hardest one
Tackling the distributed computing challenges
Pull model, single-core allocation
Similar to a Grid Site
• Uncommon for a SC
• Often need to collaborate with the system administrators
[Diagram: single-core allocations; CVMFS endpoint mounted on the nodes; external connectivity (fetch jobs, upload outputs); LRMS accessible from outside (push pilots from the DIRAC server); shared FS; edge node.]
Tackling the distributed computing challenges
Pull model, multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to run a program (queue configuration)
Fat node partitioning [3]
• One pilot per fat node executes several SP/MP jobs per allocation
• In the Queue configuration, add LocalCEType=Pool and NumberOfProcessors=N
[Diagram: multi-core allocations; CVMFS endpoint mounted on the nodes; external connectivity (fetch jobs, upload outputs); LRMS accessible from outside (push pilots from the DIRAC server); shared FS; edge node.]
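In the DIRAC Configuration System, the two queue options above translate into a queue description along the following lines; this is a sketch, in which the site, CE, and queue names are hypothetical and only the LocalCEType and NumberOfProcessors options come from the slide:

```
Resources
{
  Sites
  {
    LCG
    {
      LCG.ExampleHPC.org
      {
        CEs
        {
          ce.example.org
          {
            Queues
            {
              fat-node
              {
                # Run several SP/MP jobs inside one multi-core
                # allocation via the inner Pool CE
                LocalCEType = Pool
                # N = number of cores obtained per allocation
                NumberOfProcessors = 272
              }
            }
          }
        }
      }
    }
  }
}
```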
Tackling the distributed computing challenges
Pull model, multi-node allocation
Almost integrated in v7r1
Allows getting a large number of resources with a small number of allocations
Sub-pilots (currently specific to SLURM)
• One sub-pilot per allocated fat node; the pilots share a same id, status and output
• In the Queue configuration, add ParallelLibrary=PL and NumberOfNodes=N<-M>
[Diagram: multi-node allocations; CVMFS endpoint mounted on the nodes; external connectivity (fetch jobs, upload outputs); LRMS accessible from outside (push pilots from the DIRAC server); shared FS; edge node.]
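As a sketch, the sub-pilot variation adds the two options above to the same kind of queue section; the queue name, the ParallelLibrary value (srun here, since the feature currently targets SLURM) and the node range are assumptions for illustration:

```
multi-node
{
  LocalCEType = Pool
  NumberOfProcessors = 48
  # One sub-pilot per allocated node, all sharing the same
  # pilot id, status and output
  ParallelLibrary = srun
  # N<-M>: request between 2 and 4 nodes per allocation
  NumberOfNodes = 2-4
}
```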
Tackling the distributed computing challenges
Pull model, CVMFS not mounted on the WNs
Not integrated, VO action
SC by default do not provide CVMFS on the WNs
cvmfsexec on the shared FS [2]
• Mount CVMFS as an unprivileged user
• Purely a site admin/VO action; might need to add a parameter in DIRAC to ease the process
[Diagram: CVMFS endpoint; external connectivity (fetch jobs, upload outputs); LRMS accessible from outside (push pilots from the DIRAC server); CVMFS not mounted; cvmfsexec on the shared file system; edge node.]
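A possible deployment of cvmfsexec [2] on the shared file system could look as follows; the repository name is only an example, and the exact invocation should be checked against the cvmfsexec README:

```shell
# Build an unprivileged, self-contained cvmfsexec distribution
# on the shared file system (no root access needed)
git clone https://github.com/cvmfs/cvmfsexec.git
cd cvmfsexec
./makedist default

# Mount the requested repositories for the duration of one
# command, e.g. list the content of an example repository
./cvmfsexec lhcb.cern.ch -- ls /cvmfs/lhcb.cern.ch
```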
Tackling the distributed computing challenges
Pull model, no remote access to the LRMS
Integrated since v7r0, VO action
Some SC can only be accessed via a VPN (no CE, no direct SSH)
Site Director on the edge node
• Directly submits pilots from the edge node
• Would need to be allowed to execute agents on the edge node
• Would need to be updated manually
[Diagram: LRMS not accessible from outside; the Pilot-Factory runs on the edge node; CVMFS endpoint mounted on the nodes; external connectivity (fetch jobs, upload outputs); shared FS.]
Tackling the distributed computing challenges
Pull model, external connectivity only from the edge node
Not integrated, VO action
Some SC only provide external connectivity from the edge node; pilots cannot directly interact with the DIRAC services in this case
Gateway
• Would be installed on the edge node (if possible)
• Would capture the pilot and job calls and redirect them
[Diagram: external connectivity only from the edge node; a Gateway service on the edge node captures the calls and redirects them; LRMS accessible from outside (push pilots from the DIRAC server); CVMFS endpoint mounted on the nodes; shared FS.]
Tackling the distributed computing challenges
Push model, no external connectivity
In progress, v7r2
Some SC do not provide any external connectivity at all, neither on the WNs nor on the edge node
PushJobAgent
• Works like a pilot outside of the SC
• Fetches jobs, deals with inputs and outputs, submits the application part to a SC
• Requires a direct access to the LRMS
[Diagram: no external connectivity; the agent (1) fetches a job and gets its inputs from DIRAC, (2) submits the application through the Computing Element on the edge node, and (3) gets the outputs and stores them; CVMFS endpoint mounted on the nodes; shared FS.]
Tackling the distributed computing challenges
Push model, no external connectivity, multi-core/node
In progress, v7r2
BundleCE
• Would aggregate multiple applications into one allocation
[Diagram: no external connectivity, multi-core/multi-node; DIRAC (1) submits the applications, the BundleCE (2) aggregates them and (3) submits them as one multi-core job through the Computing Element; CVMFS endpoint mounted on the nodes; shared FS.]
Tackling the distributed computing challenges
Push model, no external connectivity, no CVMFS
In progress, v7r2, VO action
As already said, SC do not provide CVMFS by default; cvmfsexec cannot be used in this context
Subset-CVMFS-Builder
• Run and extract the CVMFS dependencies of given jobs
• Use the CVMFS shrinkwrap utility [1] to make a subset of CVMFS
• Test it and deploy it on the SC shared FS
[Diagram: no external connectivity, no CVMFS; the CVMFS Subset Builder builds a subset of CVMFS from the CVMFS endpoint and deploys it on the shared FS; DIRAC submits through the Computing Element on the edge node.]
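The subset-building step can be sketched with the cvmfs_shrinkwrap utility [1]; the repository name, spec-file content, and destination path below are illustrative, and the exact flags should be checked against the shrinkwrap documentation:

```shell
# Specification file: the paths of the repository that the jobs
# actually depend on (wildcards are supported); example content
cat > lhcb.cern.ch.spec <<'EOF'
/lib/*
EOF

# Copy that subset of the repository into a self-contained tree
# that can then be tested and deployed on the SC shared FS
cvmfs_shrinkwrap --repo lhcb.cern.ch \
  --src-base /cvmfs \
  --dest-base /scratch/cvmfs-subset \
  --spec-file lhcb.cern.ch.spec
```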
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
DIRAC WMS and Supercomputers
Features of Supercomputers
SingleMulti-Coreallocation
x86Non-x86architecture
FastNodes-interconnectivty
2Multi-Process(MP)Jobsrunningonafatnode
1MPJobrunningonGPU
GPUusage
SharedFileSystem
HPC
Nonetworkconnectivity
Restrictiveuserauthentication
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 721
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
bull SC are many-core architecture
bull They can include non x86CPUs (ARM AMD Power)GPUs
bull They might contain less than2Gbcore
Distributed computing (DIRAC)
bull SC policies may differ fromthose of HEP Grid Sites
bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS
rArr SC are all made differently hard to build a unique solution for all ofthem
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821
Tackling the distributed computing challenges
Overview
bull 1 main variable directly affects the chosen solution (push pull)
+ Do WNs have an external connectivity yes (or only via the edgenode) no
bull Other variables generate variations that can be added up to theproposed solution
+ Is CVMFS mounted on the WNs yes no
+ Is LRMS accessible from outside yes no
+ What type of allocations can we make single-core multi-coremulti-node
rArr We will go through different cases from the easiest to the hardestone
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challengesPull model Ext connectivity only from the edge node
Not Integrated VO action
Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case
Gateway
bull Would be installed onthe edge node (ifpossible)
bull Would capture the Pilotand Job calls and wouldredirect them
Shared FS
External connectivityonly from the edge node
Internet
External connectivity(Fetch jobs upload outputs)
Gateway serviceon the edge node
captures calls andredirects them
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521
Tackling the distributed computing challengesPush model No Ext connectivity
In Progress v7r2
Some SC do not provide any external connectivity at all neither on theWNs or the edge node
PushJobAgent
bull Works like a pilotoutside of the SC
bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC
bull Require a direct accessto the LRMS
Shared FS
No externalConnectivity
CVMFS endpointmounted on the nodes
Internet
1 Fetch a joband get inputs
3 Get outputsand store them
DIRAC 2 Submitthe
Application
Computing Element
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
DIRAC WMS and Supercomputers
Challenges
Software architecture (VO)
bull SC are many-core architecture
bull They can include non x86CPUs (ARM AMD Power)GPUs
bull They might contain less than2Gbcore
Distributed computing (DIRAC)
bull SC policies may differ fromthose of HEP Grid Sites
bull They might lack of CVMFSoutbound connectivity externalaccess to the LRMS
rArr SC are all made differently hard to build a unique solution for all ofthem
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 821
Tackling the distributed computing challenges
Overview
bull 1 main variable directly affects the chosen solution (push pull)
+ Do WNs have an external connectivity yes (or only via the edgenode) no
bull Other variables generate variations that can be added up to theproposed solution
+ Is CVMFS mounted on the WNs yes no
+ Is LRMS accessible from outside yes no
+ What type of allocations can we make single-core multi-coremulti-node
rArr We will go through different cases from the easiest to the hardestone
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challenges
Pull model: External connectivity only from the edge node

Not integrated (VO action)

Some SC only provide external connectivity from the edge node; pilots cannot directly interact with the DIRAC services in this case.

Gateway
• Would be installed on the edge node (if possible)
• Would capture the pilot and job calls and redirect them

[Diagram: external connectivity only from the edge node; a gateway service on the edge node captures calls and redirects them; CVMFS endpoint mounted on the nodes; shared FS; LRMS accessible from outside (push pilots from the DIRAC server)]
Tackling the distributed computing challenges
Push model: No external connectivity

In progress (v7r2)

Some SC do not provide any external connectivity at all, neither on the WNs nor on the edge node.

PushJobAgent
• Works like a pilot, but outside of the SC
• Fetches jobs, deals with inputs and outputs, submits the application part to the SC
• Requires direct access to the LRMS

[Diagram: no external connectivity; the agent (1) fetches a job and gets its inputs from DIRAC, (2) submits the application through the Computing Element, (3) gets the outputs and stores them; shared FS; CVMFS endpoint mounted on the nodes; edge node]
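The three-step flow above can be sketched in a few lines of Python. The class and helper names here are hypothetical stand-ins, not DIRAC's actual PushJobAgent API:

```python
# Minimal sketch of the push-model flow, under the assumption that the agent
# runs outside the SC but has direct access to the LRMS.
from dataclasses import dataclass, field

@dataclass
class Job:
    job_id: int
    inputs: list
    outputs: list = field(default_factory=list)
    status: str = "Waiting"

class PushAgent:
    """Fetches jobs, stages data, and pushes only the application part
    through the locally reachable LRMS."""
    def __init__(self, waiting_jobs, submit_to_lrms):
        self.waiting_jobs = waiting_jobs      # stand-in for the DIRAC WMS queue
        self.submit_to_lrms = submit_to_lrms  # stand-in for direct LRMS access

    def cycle(self):
        job = self.waiting_jobs.pop(0)                 # 1. fetch a job and get its inputs
        job.outputs = self.submit_to_lrms(job.inputs)  # 2. submit the application part
        job.status = "Done"                            # 3. get outputs and store them
        return job

queue = [Job(1, ["event.sim"])]
agent = PushAgent(queue, lambda inputs: ["processed:" + i for i in inputs])
done = agent.cycle()
print(done.status, done.outputs)  # Done ['processed:event.sim']
```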
Tackling the distributed computing challenges
Push model: No external connectivity, multi-core/node allocations

In progress (v7r2)

BundleCE
• Would aggregate multiple applications into one allocation

[Diagram: no external connectivity, multi-core/multi-node; DIRAC (1) submits the applications, the BundleCE (2) aggregates them and (3) submits them as one multi-core job through the Computing Element; shared FS; CVMFS endpoint mounted on the nodes]
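The aggregation idea behind the (in-progress) BundleCE can be illustrated as packing single-core applications into multi-core allocation requests. The function name and shape below are hypothetical, not DIRAC's actual API:

```python
# Pack single-core applications into groups, one group per multi-core allocation.
def bundle(applications, cores_per_allocation):
    """Group applications so that each group fills one allocation."""
    return [
        applications[i:i + cores_per_allocation]
        for i in range(0, len(applications), cores_per_allocation)
    ]

apps = [f"app-{i}" for i in range(10)]
allocations = bundle(apps, cores_per_allocation=4)
print([len(a) for a in allocations])  # [4, 4, 2]
```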
Tackling the distributed computing challenges
Push model: No external connectivity, no CVMFS

In progress (v7r2, VO action)

As already mentioned, SC do not provide CVMFS by default, and cvmfsexec cannot be used in this context.

Subset-CVMFS-Builder
• Runs jobs and extracts their CVMFS dependencies
• Uses CVMFS-Shrinkwrapper [1] to make a subset of CVMFS
• Tests the subset and deploys it on the SC shared FS

[Diagram: no external connectivity, no CVMFS; the CVMFS Subset Builder reads from the CVMFS endpoint over the Internet, then builds and deploys a subset of CVMFS on the shared FS; edge node; Computing Element; DIRAC]
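For illustration, a possible cvmfs_shrinkwrap invocation to materialize such a subset onto the shared FS. The flag names follow the CVMFS shrinkwrap documentation [1] but should be checked against the installed version; the repository, spec file, and destination path are placeholders:

```
cvmfs_shrinkwrap --repo lhcb.cern.ch \
  --spec-file lhcb.cern.ch.spec \   # file list obtained by tracing the jobs
  --dest-base /shared/cvmfs \       # target on the SC shared file system
  -j 8
```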
Conclusion

Status
• Many SC with external connectivity are supported (multi-core allocations)
• Tools to exploit SC with no external connectivity are in progress

Next steps
• Provide the push-model solution and its variations
• Work on DB12 (CPU power computation) support for multi-core allocations
• Provide complete documentation about SC integration
• Provide side projects to minimize VO actions (Subset-CVMFS-Builder)
Thanks!

Any questions? Comments?
References

[1] CVMFS cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html. Online; accessed 4 May 2021.
[2] CVMFS cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online; accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 [hep-ex, physics] (June 2020). http://arxiv.org/abs/2006.13603
Backup Slides

Real use cases we have in LHCb
LHCb-supercomputers collaboration
Available HPCs

• Piz Daint at CSCS (Switzerland)
• Marconi-A2 at CINECA (Italy): not used anymore
• SDumont at LNCC (Brazil)
• MareNostrum at BSC (Spain)
LHCb-supercomputers collaboration
Piz Daint, CSCS

• Ranked 12th in the Top500 (Nov 2020)
• 387,872 cores (Nov 2020)
+ Collaboration with the local system administrators allows a traditional Grid-Site usage
⇒ No change required
LHCb-supercomputers collaboration
Marconi-A2, CINECA

• Ranked 19th in the Top500 (Nov 2019)
• 348,000 cores (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
- Multi-core allocations: 272 logical cores per node (Intel KNL)
- Low memory per core
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Development

• External connectivity: use the pull model
• Multi-core allocations: use the fat-node partitioning variation

[Diagram: the Pilot-Factory (1) submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); the pilots (2) fetch waiting jobs from the DIRAC services]
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Status

⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a V100 GPU and Power9 CPU cluster

Done
• Exploited 68 of the 272 cores per node; not enough memory for more jobs

To be done
• Nothing to do: Marconi-A2 disappeared
• LHCb software not ready for GPUs
LHCb-supercomputers collaboration
SDumont, LNCC

• Ranked 277th in the Top500 (Nov 2020)
• 33,856 cores (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
- Multi-core allocations: 24 or 48 logical cores per node
- Multi-node allocations: 21 nodes per allocation required by some queues
LHCb-supercomputers collaboration
SDumont, LNCC: Development

• External connectivity: use the pull model
• Multi-core allocations: use the fat-node partitioning variation
• Multi-node allocations: use the sub-pilots variation

[Diagram: the Pilot-Factory (1) submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); the pilots (2) fetch waiting jobs from the DIRAC services]
LHCb-supercomputers collaboration
SDumont, LNCC: Status

Done
• Exploit 24/24 and 48/48 cores per node

To be done
• Multi-node allocations: should have results soon
• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time
LHCb-supercomputers collaboration
MareNostrum, BSC

• Ranked 42nd in the Top500 (Nov 2020)
• 153,216 cores (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocations possible but not recommended
- No network connectivity
- CVMFS not mounted on the WNs
LHCb-supercomputers collaboration
MareNostrum, BSC: Development

• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation

[Diagram: DIRAC (1) fetches waiting jobs; the BundleCE (2) submits the applications through an ARC CE to the MareNostrum worker nodes, where the CVMFS Subset Builder has deployed a CVMFS subset]
LHCb-supercomputers collaboration
MareNostrum, BSC: Status

Done
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder

To be done
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
Tackling the distributed computing challenges
Overview
bull 1 main variable directly affects the chosen solution (push pull)
+ Do WNs have an external connectivity yes (or only via the edgenode) no
bull Other variables generate variations that can be added up to theproposed solution
+ Is CVMFS mounted on the WNs yes no
+ Is LRMS accessible from outside yes no
+ What type of allocations can we make single-core multi-coremulti-node
rArr We will go through different cases from the easiest to the hardestone
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 921
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challengesPull model Ext connectivity only from the edge node
Not Integrated VO action
Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case
Gateway
bull Would be installed onthe edge node (ifpossible)
bull Would capture the Pilotand Job calls and wouldredirect them
Shared FS
External connectivityonly from the edge node
Internet
External connectivity(Fetch jobs upload outputs)
Gateway serviceon the edge node
captures calls andredirects them
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521
Tackling the distributed computing challengesPush model No Ext connectivity
In Progress v7r2
Some SC do not provide any external connectivity at all neither on theWNs or the edge node
PushJobAgent
bull Works like a pilotoutside of the SC
bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC
bull Require a direct accessto the LRMS
Shared FS
No externalConnectivity
CVMFS endpointmounted on the nodes
Internet
1 Fetch a joband get inputs
3 Get outputsand store them
DIRAC 2 Submitthe
Application
Computing Element
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Tackling the distributed computing challengesPull model single-core allocation
Similar to a Grid Site
bull Uncommon for a SC
bull Often need tocollaborate with thesystem administrators
Single-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1021
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challengesPull model Ext connectivity only from the edge node
Not Integrated VO action
Some SC only provide external connectivity from the edge node Pilotscannot directly interact with DIRAC services in this case
Gateway
bull Would be installed onthe edge node (ifpossible)
bull Would capture the Pilotand Job calls and wouldredirect them
Shared FS
External connectivityonly from the edge node
Internet
External connectivity(Fetch jobs upload outputs)
Gateway serviceon the edge node
captures calls andredirects them
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1521
Tackling the distributed computing challengesPush model No Ext connectivity
In Progress v7r2
Some SC do not provide any external connectivity at all neither on theWNs or the edge node
PushJobAgent
bull Works like a pilotoutside of the SC
bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC
bull Require a direct accessto the LRMS
Shared FS
No externalConnectivity
CVMFS endpointmounted on the nodes
Internet
1 Fetch a joband get inputs
3 Get outputsand store them
DIRAC 2 Submitthe
Application
Computing Element
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Tackling the distributed computing challengesPull model Multi-core allocation
Integrated since v7r0
SC often require their users to allocate many cores or even nodes to runa program (queue configuration)
Fat node partition [3]
bull One pilot per fat nodeexecute several SPMPjobs per allocation
bull In the Queue conf addLocalCEType=Pool
andNumberOfProcessors=N
Multi-Coreallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Shared FS
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1121
Tackling the distributed computing challengesPull model Multi-node allocation
Almost integrated in v7r1
Allows to get a large number of resources with a small number ofallocations
Sub-Pilots (specific to SLURM currently)
bull One sub-pilot per fatnode allocated pilotssharing a same idstatus and output
bull In the Queue conf addParallelLibrary=PL
andNumberOfNodes=Nlt-Mgt
Shared FS
Multi-Nodeallocation
CVMFS endpointmounted on the nodes
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1221
Tackling the distributed computing challengesPull model CVMFS not mounted on WNs
Not Integrated VO action
SC by default do not provide CVMFS on the WNs
CVMFS-exec on the shared FS [2]
bull Mount CVMFS as anunprivileged user
bull Purely a siteadminVOaction actually mightneed to add aparameter in DIRAC toease the process
CVMFS endpoint
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
CVMFS notmounted
CVMFS-exec on theshared file system
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1321
Tackling the distributed computing challengesPull model No remote access to LRMS
Integrated since v7r0 VO action
Some SC can only be accessed via a VPN (No CE no direct SSH)
Site Director on the edge node
bull Directly submit pilotsfrom the edge node
bull Would need to beallowed to executeagents on the edgenode
bull Would need to beupdated manually
LRMS not accessiblefrom outside
Internet
External connectivity(Fetch jobs upload outputs)
LRMS accessible from outside(push pilots from the DIRAC server)
Pilot-Factory onthe edge node
CVMFS endpointmounted on the nodes
Shared FS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1421
Tackling the distributed computing challenges
Pull model: External connectivity only from the edge node

Not integrated; VO action

Some SCs only provide external connectivity from the edge node. Pilots cannot directly interact with the DIRAC services in this case.

Gateway
• Would be installed on the edge node (if possible)
• Would capture the Pilot and Job calls and redirect them
[Diagram: external connectivity only from the edge node; a gateway service on the edge node captures the pilot and job calls and redirects them to the Internet (fetch jobs, upload outputs); LRMS accessible from outside (push pilots from the DIRAC server); CVMFS endpoint mounted on the nodes; shared FS]
Tackling the distributed computing challenges
Push model: No external connectivity

In progress, v7r2

Some SCs do not provide any external connectivity at all, neither on the WNs nor on the edge node.

PushJobAgent
• Works like a pilot outside of the SC
• Fetches jobs, deals with inputs and outputs, and submits the application part to the SC
• Requires direct access to the LRMS
[Diagram: no external connectivity; the agent (1) fetches a job from DIRAC and gets its inputs, (2) submits the application through a Computing Element to the edge node, and (3) gets the outputs and stores them; CVMFS endpoint mounted on the nodes; shared FS]
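The three numbered steps can be sketched as a small control loop. This is a toy illustration with hypothetical names, not the actual DIRAC implementation: the agent runs outside the SC, plays the role of a pilot, and only the application step crosses into the SC.

```python
# Sketch of the PushJobAgent control flow (hypothetical names).
def push_job_agent(fetch_job, submit_application, store_outputs):
    job = fetch_job()                 # 1. fetch a job and get its inputs
    if job is None:
        return None                   # nothing waiting
    result = submit_application(job)  # 2. submit the application to the SC
    store_outputs(job, result)        # 3. get the outputs and store them
    return result

# Toy stand-ins for the three interactions with DIRAC and the LRMS.
jobs = [{"id": 1, "inputs": ["in.dat"]}]
stored = {}
res = push_job_agent(
    fetch_job=lambda: jobs.pop() if jobs else None,
    submit_application=lambda job: "output-of-%d" % job["id"],
    store_outputs=lambda job, out: stored.update({job["id"]: out}),
)
print(stored)  # prints {1: 'output-of-1'}
```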
Tackling the distributed computing challenges
Push model: No external connectivity, multi-core/node

In progress, v7r2

BundleCE
• Would aggregate multiple applications into one allocation
[Diagram: no external connectivity, multi-core/multi-node allocations; DIRAC (1) submits applications to a BundleCE, which (2) aggregates them and (3) submits them as one multi-core job through the Computing Element; CVMFS endpoint mounted on the nodes; shared FS]
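The aggregation step can be illustrated with a toy sketch (hypothetical, not the actual DIRAC interface): pack independent single-core applications into groups that each fill one multi-core allocation.

```python
# Toy sketch of the BundleCE aggregation step (hypothetical names).
def bundle(applications, cores_per_allocation):
    """Split applications into bundles of at most cores_per_allocation."""
    return [
        applications[i:i + cores_per_allocation]
        for i in range(0, len(applications), cores_per_allocation)
    ]

apps = ["app-%d" % i for i in range(10)]
bundles = bundle(apps, 4)
# each bundle would then be submitted as one 4-core job to the LRMS
print([len(b) for b in bundles])  # prints [4, 4, 2]
```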
Tackling the distributed computing challenges
Push model: No external connectivity, no CVMFS

In progress, v7r2; VO action

As already said, SCs do not provide CVMFS by default, and cvmfsexec cannot be used in this context.

Subset-CVMFS-Builder
• Run the given jobs and extract their CVMFS dependencies
• Use CVMFS Shrinkwrap [1] to make a subset of CVMFS
• Test the subset and deploy it on the SC shared FS
[Diagram: no external connectivity, no CVMFS; a CVMFS Subset Builder builds a subset of CVMFS from the CVMFS endpoint and deploys it on the shared FS; DIRAC submits through the Computing Element to the edge node]
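The shrinkwrap step is driven by a specification file listing the paths the traced jobs actually touched; a hedged sketch (the repository paths below are hypothetical):

```
# cvmfs_shrinkwrap specification file: keep only what the jobs used
/lib/lhcb/GAUSS/GAUSS_v55r1/*
/lib/lcg/releases/gcc/9.2.0/*
```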
Conclusion

Status
• Support many SCs with external connectivity (multi-core allocations)
• Tools to exploit SCs with no external connectivity are in progress

Next Steps
• Provide the push-model solution and its variations
• Work on DB12 (CPU power computation) support for multi-core allocations
• Provide complete documentation about SC integration
• Provide side projects to minimize VO actions (Subset-CVMFS-Builder)
Thanks

Any questions? Comments?
[1] CVMFS cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html. Online; accessed 4 May 2021.
[2] CVMFS cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online; accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 [hep-ex, physics] (June 2020). URL: http://arxiv.org/abs/2006.13603
Backup Slides
Real use-cases we have in LHCb
LHCb-supercomputers collaboration
Available HPCs

• Piz Daint at CSCS (Switzerland)
• Marconi-A2 at CINECA (Italy) - not used anymore
• SDumont at LNCC (Brazil)
• MareNostrum at BSC (Spain)
LHCb-supercomputers collaboration
Piz Daint, CSCS

• Ranked 12th in the Top500 (Nov 2020)
• 387,872 cores (Nov 2020)
+ Collaboration with the local system administrators allows a traditional Grid Site usage
⇒ No change required
LHCb-supercomputers collaboration
Marconi-A2, CINECA

• Ranked 19th in the Top500 (Nov 2019)
• 348,000 cores (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
- Multi-core allocations: 272 logical cores per node (Intel KNL)
- Low memory/core
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Development

• External connectivity: use the pull model
• Multi-core allocations: use the fat-node partitioning variation

[Diagram: the Pilot-Factory in the DIRAC services (1) submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS), where they (2) fetch waiting jobs]
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Status

⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a cluster of V100 GPUs and Power9 CPUs

Done
• Exploited 68/272 cores per node; not enough memory for more jobs

To be done
• Nothing to do: Marconi-A2 disappeared
• LHCb software not ready for GPUs
LHCb-supercomputers collaboration
SDumont, LNCC

• Ranked 277th in the Top500 (Nov 2020)
• 33,856 cores (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
- Multi-core allocations: 24 or 48 logical cores per node
- Multi-node allocations: 21 nodes per allocation required by some queues
LHCb-supercomputers collaboration
SDumont, LNCC: Development

• External connectivity: use the pull model
• Multi-core allocations: use the fat-node partitioning variation
• Multi-node allocations: use the sub-pilots variation

[Diagram: the Pilot-Factory in the DIRAC services (1) submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS), where they (2) fetch waiting jobs]
LHCb-supercomputers collaboration
SDumont, LNCC: Status

Done
• Exploit 24/24 and 48/48 cores per node

To be done
• Multi-node allocations: should have results soon
• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time
LHCb-supercomputers collaboration
MareNostrum, BSC

• Ranked 42nd in the Top500 (Nov 2020)
• 153,216 cores (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocations possible but not recommended
- No network connectivity
- CVMFS not mounted on the WNs
LHCb-supercomputers collaboration
MareNostrum, BSC: Development

• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation

[Diagram: (1) fetch waiting jobs from the DIRAC services; a BundleCE (2) submits the applications through an ARC CE to the MareNostrum worker nodes; the CVMFS Subset Builder deploys a CVMFS subset for the nodes]
LHCb-supercomputers collaboration
MareNostrum, BSC: Status

Done
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder

To be done
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
- References
- LHCb-supercomputers collaboration
-
Tackling the distributed computing challenges
Pull model: external connectivity only from the edge node
[Not integrated – VO action required]
Some supercomputers only provide external connectivity from the edge node; in this case, pilots cannot directly interact with the DIRAC services.
Gateway
• Would be installed on the edge node (if possible)
• Would capture the pilot and job calls and redirect them
[Diagram: the LRMS is accessible from outside, so pilots are pushed from the DIRAC server; a gateway service on the edge node captures their calls and redirects them to the Internet (fetch jobs, upload outputs); a shared FS and a CVMFS endpoint are mounted on the nodes, which have external connectivity only through the edge node.]
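The pull-model flow above can be sketched as a small in-process simulation. The class names and the call flow are illustrative assumptions, not the actual DIRAC interfaces; the point is only that worker nodes never talk to the outside directly, since every call crosses the edge-node gateway.

```python
# Minimal sketch of the pull model through an edge-node gateway.
# Names (MatcherService, Gateway, Pilot) are illustrative, not the DIRAC API.
from collections import deque

class MatcherService:
    """Stands in for the DIRAC WMS: holds the waiting jobs."""
    def __init__(self, jobs):
        self.waiting = deque(jobs)
    def request_job(self):
        return self.waiting.popleft() if self.waiting else None

class Gateway:
    """Runs on the edge node, the only host with external connectivity.
    It captures pilot calls from the WNs and redirects them to DIRAC."""
    def __init__(self, matcher):
        self.matcher = matcher
        self.forwarded = 0
    def request_job(self):
        self.forwarded += 1              # one WN call crossed the edge node
        return self.matcher.request_job()

class Pilot:
    """Runs on a worker node with no direct external connectivity."""
    def __init__(self, gateway):
        self.gateway = gateway
        self.executed = []
    def run(self):
        while (job := self.gateway.request_job()) is not None:
            self.executed.append(job)    # execute the payload

matcher = MatcherService(["job-1", "job-2", "job-3"])
gw = Gateway(matcher)
pilot = Pilot(gw)
pilot.run()
print(pilot.executed)                    # ['job-1', 'job-2', 'job-3']
```

Every job reaches the pilot, but only through the gateway: it is called once per job plus once for the final empty reply.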
Tackling the distributed computing challenges
Push model: no external connectivity
[In progress – v7r2]
Some supercomputers do not provide any external connectivity at all, neither on the WNs nor on the edge node.
PushJobAgent
• Works like a pilot outside of the SC
• Fetches jobs, deals with inputs and outputs, submits the application part to a SC
• Requires direct access to the LRMS
[Diagram: 1. the agent fetches a job from DIRAC and gets its inputs; 2. it submits the application through the Computing Element to the edge node; 3. it gets the outputs and stores them. A shared FS and a CVMFS endpoint are mounted on the nodes, which have no external connectivity.]
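A minimal sketch of the three push-model steps, using simplified stand-ins for the DIRAC services and the LRMS. The function names, the job structure and the two application names are illustrative assumptions, not the PushJobAgent implementation.

```python
# Hypothetical sketch of the push model: an agent running outside the
# supercomputer fetches a job, pushes only its application part to the
# LRMS, and retrieves the outputs. Not the actual DIRAC code.

def fetch_job(queue):
    """1. Fetch a waiting job and its inputs (would call DIRAC services)."""
    return queue.pop(0) if queue else None

def submit_application(lrms, job):
    """2. Submit only the application part through the CE / LRMS."""
    lrms.append({"app": job["app"], "inputs": job["inputs"], "status": "done"})

def collect_outputs(lrms):
    """3. Get the outputs back and store them (would upload to storage)."""
    return [entry["app"] for entry in lrms if entry["status"] == "done"]

# Illustrative payloads (Gauss and Boole are LHCb applications).
waiting_jobs = [{"app": "Gauss", "inputs": ["cfg-1"]},
                {"app": "Boole", "inputs": ["cfg-2"]}]
lrms = []  # stands in for the supercomputer batch system

while (job := fetch_job(waiting_jobs)) is not None:
    submit_application(lrms, job)

print(collect_outputs(lrms))   # ['Gauss', 'Boole']
```

The split matters: only the application runs inside the machine, while all network interactions (job matching, input download, output upload) stay with the agent outside.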
Tackling the distributed computing challenges
Push model: no external connectivity, multi-core/node
[In progress – v7r2]
BundleCE
• Would aggregate multiple applications into one allocation
[Diagram: 1. DIRAC submits applications to the BundleCE; 2. the BundleCE aggregates them; 3. it submits them as one multi-core job through the Computing Element. A shared FS and a CVMFS endpoint are mounted on the nodes; allocations are multi-core and multi-node; the nodes have no external connectivity.]
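The aggregation step can be sketched as a simple packing function. The fixed-size, first-come packing policy shown here is an assumption for illustration; the real BundleCE was still work in progress at the time of the talk.

```python
# Hypothetical sketch of the BundleCE idea: aggregate several single-core
# applications into one multi-core allocation. The packing policy is an
# assumption, not the actual BundleCE algorithm.

def bundle(applications, cores_per_allocation):
    """Group single-core applications into multi-core allocations,
    one application per core, in submission order."""
    allocations = []
    for i in range(0, len(applications), cores_per_allocation):
        allocations.append(applications[i:i + cores_per_allocation])
    return allocations

apps = [f"app-{n}" for n in range(10)]
allocations = bundle(apps, cores_per_allocation=4)
print(len(allocations))    # 3 allocations: 4 + 4 + 2 applications
print(allocations[-1])     # ['app-8', 'app-9']
```

Each inner list would then be submitted as a single multi-core batch job, so ten single-core payloads cost only three allocations on a machine that refuses single-core submissions.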
Tackling the distributed computing challenges
Push model: no external connectivity, no CVMFS
[In progress – v7r2 – VO action required]
As already mentioned, supercomputers do not provide CVMFS by default, and cvmfsexec [2] cannot be used in this context.
Subset-CVMFS-Builder
• Run jobs and extract their CVMFS dependencies
• Use CVMFS Shrinkwrap [1] to make a subset of CVMFS
• Test it and deploy it on the SC shared FS
[Diagram: the CVMFS Subset Builder reads from a CVMFS endpoint, builds a subset of CVMFS, and deploys it on the shared FS behind the edge node and Computing Element; the nodes have no external connectivity and no CVMFS.]
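The first two steps amount to turning traced file accesses into a specification that cvmfs_shrinkwrap [1] can export to a plain directory tree. The sketch below is a hypothetical illustration of that translation; the trace input and the wildcard convention are simplifications, and the exact specification-file syntax is the one documented for the shrinkwrap utility.

```python
# Hypothetical sketch of the Subset-CVMFS-Builder idea: collapse the CVMFS
# paths a job actually touched into spec lines for cvmfs_shrinkwrap.
# The trace format and wildcard convention are assumptions for illustration.

def build_spec(accessed_paths):
    """Turn traced file accesses into spec lines: one path per line,
    directories (trailing '/') exported as a whole with '/*'."""
    lines = set()                         # duplicates collapse naturally
    for path in accessed_paths:
        if path.endswith("/"):
            lines.add(path.rstrip("/") + "/*")   # whole directory
        else:
            lines.add(path)                      # single file
    return sorted(lines)

traced = ["/lhcb.cern.ch/lib/lhcb/GAUSS/",
          "/lhcb.cern.ch/etc/setup.sh",
          "/lhcb.cern.ch/etc/setup.sh"]   # same file touched twice
spec = build_spec(traced)
print(spec)
# ['/lhcb.cern.ch/etc/setup.sh', '/lhcb.cern.ch/lib/lhcb/GAUSS/*']
```

The resulting spec file, once validated, drives the shrinkwrap export whose output is copied to the supercomputer's shared FS.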
Conclusion
Status
• Support many SC with external connectivity (multi-core allocations)
• Tools to exploit SC with no external connectivity are in progress
Next steps
• Provide the push model solution and its variations
• Work on DB12 (CPU power computation) support for multi-core allocations
• Provide complete documentation about SC integration
• Provide side projects to minimize VO actions (Subset-CVMFS-Builder)
Thanks
Any questions? Comments?
References
[1] CVMFS cvmfs-shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html. Online, accessed 4 May 2021.
[2] CVMFS cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online, accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 [hep-ex] (June 2020). http://arxiv.org/abs/2006.13603
Backup Slides
Real use-cases we have in LHCb
LHCb-supercomputers collaboration
Available HPCs
• Piz Daint at CSCS (Switzerland)
• Marconi-A2 at CINECA (Italy) – not used anymore
• SDumont at LNCC (Brazil)
• MareNostrum at BSC (Spain)
LHCb-supercomputers collaboration
Piz Daint, CSCS
• Ranked 12th in the Top500 (Nov 2020)
• 387,872 cores (Nov 2020)
+ Collaboration with the local system administrators allows a traditional Grid Site usage
⇒ No change required
LHCb-supercomputers collaboration
Marconi-A2, CINECA
• Ranked 19th in the Top500 (Nov 2019)
• 348,000 cores (Nov 2019)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via a CE
− Multi-core allocations: 272 logical cores per node (Intel KNL)
− Low memory per core
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation
[Diagram: 1. the Pilot-Factory submits pilots through an HTCondorCE to the Marconi-A2 worker nodes (LRMS, CVMFS); 2. the pilots fetch waiting jobs from the DIRAC services.]
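Fat node partitioning means one pilot takes the whole node and splits it into independent job slots. How many slots fit is a min over the per-resource limits; the memory figures below are assumptions for illustration, but they show why Marconi-A2 ended up memory-bound rather than core-bound (68 of 272 logical cores exploited per node).

```python
# Hypothetical sketch of fat node partitioning: one pilot per node,
# split into single-core job slots. The 2 GB/job requirement and the
# node memory sizes are illustrative assumptions, not Marconi-A2 specs.

def job_slots(logical_cores, node_memory_gb, memory_per_job_gb):
    """Each slot needs one logical core and memory_per_job_gb of RAM;
    the binding resource decides the slot count."""
    return min(logical_cores, node_memory_gb // memory_per_job_gb)

print(job_slots(272, 96, 2))    # memory-bound: 48 slots, cores left idle
print(job_slots(48, 192, 2))    # core-bound: all 48 cores usable
```

On a KNL-style node the first case applies: plenty of logical cores, but too little memory to fill them, exactly the "low memory per core" limitation listed above.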
LHCb-supercomputers collaboration
Marconi-A2, CINECA: Status
⇒ More details about the LHCb work on CINECA in [3]
⇒ Marconi-A2 has been replaced by Marconi-100, a V100 GPU and Power9 CPU cluster
Done:
• Exploited 68/272 cores per node (not enough memory for more jobs)
To be done:
• Nothing: Marconi-A2 disappeared
• LHCb software not ready for GPUs
LHCb-supercomputers collaboration
SDumont, LNCC
• Ranked 277th in the Top500 (Nov 2020)
• 33,856 cores (Nov 2020)
+ External connectivity from the WNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (special access)
− Multi-core allocations: 24 or 48 logical cores per node
− Multi-node allocations: 21 nodes per allocation required by some queues
LHCb-supercomputers collaboration
SDumont, LNCC: Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat node partitioning variation
• Multi-node allocations: use the sub-pilots variation
[Diagram: 1. the Pilot-Factory submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); 2. the pilots fetch waiting jobs from the DIRAC services.]
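The sub-pilots variation handles queues that force multi-node allocations (21 nodes here): a single submission gets all the nodes, and the main pilot then starts one sub-pilot per node, each behaving like an ordinary single-node pilot. The identifier scheme below is an illustrative assumption, not the actual DIRAC implementation.

```python
# Hypothetical sketch of the sub-pilots idea for multi-node allocations:
# one submission, N nodes, one sub-pilot derived per node.
# Identifiers and hostnames are illustrative.

def spawn_sub_pilots(allocation_nodes, pilot_id):
    """Derive one sub-pilot per node of the allocation; each sub-pilot
    inherits the parent pilot id so accounting can group them."""
    return [{"id": f"{pilot_id}.{n}", "node": node}
            for n, node in enumerate(allocation_nodes)]

nodes = [f"sdumont{n:04d}" for n in range(21)]   # 21 nodes per allocation
subs = spawn_sub_pilots(nodes, pilot_id="pilot-42")
print(len(subs))        # 21 sub-pilots, one per node
print(subs[0]["id"])    # pilot-42.0
```

Combined with fat node partitioning, each sub-pilot can then split its 24 or 48 logical cores into job slots, so one 21-node allocation serves hundreds of single-core payloads.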
LHCb-supercomputers collaboration
SDumont, LNCC: Status
Done:
• Exploit 24/24 and 48/48 cores per node
To be done:
• Multi-node allocations: should have results soon
• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time
LHCb-supercomputers collaboration
MareNostrum, BSC
• Ranked 42nd in the Top500 (Nov 2020)
• 153,216 cores (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocations possible but not recommended
− No network connectivity
− CVMFS not mounted on the WNs
LHCb-supercomputers collaboration
MareNostrum, BSC: Development
• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation
[Diagram: 1. the BundleCE fetches waiting jobs from the DIRAC services; 2. it submits the applications through an ARC CE to the MareNostrum worker nodes; the CVMFS Subset Builder deploys a subset of CVMFS for the nodes.]
LHCb-supercomputers collaboration
MareNostrum, BSC: Status
Done:
• Prototype to run simple submissions ("Hello World")
• First version of the Subset-CVMFS-Builder
To be done:
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Tackling the distributed computing challengesPush model No Ext connectivity
In Progress v7r2
Some SC do not provide any external connectivity at all neither on theWNs or the edge node
PushJobAgent
bull Works like a pilotoutside of the SC
bull Fetches jobs deals withinputs and outputssubmits the applicationpart to a SC
bull Require a direct accessto the LRMS
Shared FS
No externalConnectivity
CVMFS endpointmounted on the nodes
Internet
1 Fetch a joband get inputs
3 Get outputsand store them
DIRAC 2 Submitthe
Application
Computing Element
Edge Node
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1621
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Tackling the distributed computing challengesPush model No Ext connectivity Multi-corenode
In Progress v7r2
BundleCE
bull Would aggregatemultiple applicationsinto one allocation
Shared FS
No externalConnectivity
Multi-core Multi-node
Internet
DIRAC
1 SubmitApplications
2 AggregateApplications
3 Submitthem as 1multi-core
job
Computing Element
Bundle CE
CVMFS endpointmounted on the nodes
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1721
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Tackling the distributed computing challenges
Push model No Ext connectivity No CVMFS
In Progress v7r2 VO action
As it was already said SC do not provide CVMFS by defaultCVMFS-exec cannot be used in this context
Subset-CVMFS-Builder
bull Run amp extract CVMFSdependencies of givenjobs
bull UseCVMFS-Shrinkwrapper[1] to make a subset ofCVMFS
bull Test it amp deploy it onthe SC shared FS
No externalConnectivity
No CVMFS
Internet
Build and deploy asubset of CVMFS on
the shared FS
CVMFS SubsetBuilder
CVMFS Subset
CVMFSendpoint
Edge NodeDIRAC
Computing Element
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1821
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Conclusion
Status
bull Support many SC with external connectivity (multi-core allocations)
bull Tools to exploit SC with no external connectivity are in progress
Next Steps
bull Provide the push model solution and its variations
bull Work on DB12 (CPU Power computation) support for multi-coreallocations
bull Provide a complete documentation about SC integration
bull Provide side projects to minimize VO actions(Subset-CVMFS-Builder)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 1921
Thanks
Any questions Comments
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2021
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
LHCb-supercomputers collaboration
SDumont, LNCC: Development
• External connectivity: use the pull model
• Multi-core allocations: use the fat-node-partitioning variation
• Multi-node allocations: use the sub-pilots variation
[Diagram: the Pilot-Factory in the DIRAC services (1) submits pilots via SSH to the SDumont worker nodes (LRMS, CVMFS); the pilots then (2) fetch the waiting jobs.]
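The sub-pilots variation for the mandatory 21-node allocations can be sketched like this. This is a hypothetical illustration (the script and option names are not the actual DIRAC code): the submission side builds one Slurm batch job covering all 21 nodes, then fans out one sub-pilot per node, each of which pulls jobs independently.

```python
# Hypothetical sketch of the SDumont sub-pilots idea: one batch job for
# the whole multi-node allocation, one sub-pilot per node via srun.

def make_batch_script(nodes, pilot_cmd):
    """Build a Slurm batch script that starts one sub-pilot per node."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        # Each sub-pilot then fetches jobs on its own (pull model).
        f"srun --ntasks={nodes} {pilot_cmd}",
    ]
    return "\n".join(lines)

# The pilot command name is illustrative.
script = make_batch_script(21, "./dirac-pilot.sh")
print(script)
```

In the real setup this script would be submitted over SSH by the Pilot-Factory, since SDumont is reached through special SSH access rather than a CE.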
LHCb-supercomputers collaboration
SDumont, LNCC: Status
Done:
• Exploit 24/24 and 48/48 cores per node
To be done:
• Multi-node allocations: should have results soon
• DIRAC Benchmark not adapted to multi-core allocations: 20% of the jobs run out of time
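The benchmark problem above can be illustrated numerically (the numbers are made up for illustration): the DIRAC Benchmark measures per-core power, but on a fully packed multi-core node the effective per-core power is lower, so wall-time limits derived from the benchmark score are too short and some jobs are killed before finishing.

```python
# Illustrative sketch of why multi-core allocations break time limits
# derived from a per-core benchmark score (numbers are invented).

def wall_time_limit(cpu_work, per_core_power):
    """Seconds of wall time needed to do cpu_work units of work
    at a given per-core benchmark power."""
    return cpu_work / per_core_power

idle_score = 10.0    # score measured on a lightly loaded core
loaded_score = 8.0   # effective power with every logical core busy

granted = wall_time_limit(36000, idle_score)   # 3600 s allocated
needed = wall_time_limit(36000, loaded_score)  # 4500 s actually needed
print(granted < needed)  # True: the job runs out of time
```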
LHCb-supercomputers collaboration
MareNostrum, BSC
• Ranked 42nd in the Top500 (Nov 2020)
• 153,216 cores (Nov 2020)
+ Accessible via a CE (ARC) and also SSH
+ Single-core allocation possible but not recommended
- No network connectivity
- CVMFS not mounted on the WNs
LHCb-supercomputers collaboration
MareNostrum, BSC: Development
• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation
[Diagram: a BundleCE (1) fetches the waiting jobs from the DIRAC services; the Subset-CVMFS-Builder prepares a CVMFS subset, and (2) the applications are submitted through an ARC CE to the MareNostrum worker nodes together with the subset.]
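The Subset-CVMFS-Builder idea can be sketched as follows; this is an illustrative sketch, not the real tool. Since the MareNostrum WNs have neither network connectivity nor CVMFS, a service that does have CVMFS access resolves the paths a given workflow needs into a deduplicated specification (similar in spirit to a cvmfs_shrinkwrap spec file [1]), and only that subset is packed and shipped with the job.

```python
# Illustrative sketch of building a CVMFS-subset specification: collect
# the paths a workflow depends on, deduplicate, and emit a sorted spec.

def build_spec(dependency_paths):
    """Turn a raw list of CVMFS paths into a deduplicated, sorted spec."""
    return sorted(set(p.rstrip("/") for p in dependency_paths))

# Example dependency list (paths are illustrative):
deps = [
    "/cvmfs/lhcb.cern.ch/lib/lhcb/GAUSS/",
    "/cvmfs/lhcb.cern.ch/lib/lcg/releases/gcc/",
    "/cvmfs/lhcb.cern.ch/lib/lhcb/GAUSS/",   # duplicate is dropped
]
print(build_spec(deps))
```

The actual packing and unpacking on the WN side would then rely on tools such as cvmfs_shrinkwrap [1] or cvmfsexec [2].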
LHCb-supercomputers collaboration
MareNostrum, BSC: Status
Done:
• Prototype to run simple submissions ("Hello World")
• First version of the Subset-CVMFS-Builder
To be done:
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
Thanks!
Any questions? Comments?
References
[1] CVMFS cvmfs_shrinkwrap utility. https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html#cpt-shrinkwrap. Online, accessed 4 May 2021.
[2] CVMFS cvmfsexec. https://github.com/cvmfs/cvmfsexec. Online, accessed 4 May 2021.
[3] Federico Stagni, Andrea Valassi, and Vladimir Romanovskiy. "Integrating LHCb workflows on HPC resources: status and strategies". arXiv:2006.13603 [hep-ex, physics] (June 2020). http://arxiv.org/abs/2006.13603
Backup Slides
Real use-cases we have in LHCb
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
CVMFS cvmfs-shrinkwrap utilityhttpscvmfsreadthedocsioenstablecpt-
shrinkwraphtmlcpt-shrinkwrap Online accessed 4May 2021 2021
CVMFS cvmfsexechttpsgithubcomcvmfscvmfsexec Online accessed4 May 2021 2021
Federico Stagni Andrea Valassi and Vladimir RomanovskiyldquoIntegrating LHCb workflows on HPC resources status andstrategiesrdquo In arXiv200613603 [hep-ex physicsphysics](June 2020) arXiv 200613603 urlhttparxivorgabs200613603
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
Backup Slides
Real use-cases we have in LHCb
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
LHCb-supercomputers collaboration
Available HPCs
bull Piz Daint in CSCS (Suisse)
bull Marconi-A2 in CINECA (Italy) ndash not used anymore
bull SDumont in LNCC (Brazil)
bull MareNostrum in BSC (Spain)
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
LHCb-supercomputers collaboration
Piz Daint CSCS
bull Ranked 12th in Top500 (Nov 2020)
bull 387872 cores (Nov 2020)
+ Collaboration with the local System Administrators allows atraditional Grid Site usage
rArr No change required
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
LHCb-supercomputers collaboration
Marconi-A2 CINECA
bull Ranked 19th in Top500 (Nov2019)
+ External connectivity from theWNs
+ CVMS mounted on the WNs
+ Accessible via a CE
bull 348000 cores (Nov 2019)
- Multi-core allocations 272logical cores per node (IntelKNL)
- Low memorycore
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
LHCb-supercomputers collaboration
Marconi-A2 CINECA Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
HTCondorCE
1 Submit pilots
Marconi-A2Worker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Development
bull No external connectivity Usethe push model
bull No CVMFS mounted on theWNs Use theSubset-CVMFS-Buildervariation
bull To get multi-core allocationsUse the BundleCE variation ARCCE
MareNostrumWorker nodes
DIRAC
Services Waiting
Jobs
1 Fetchjobs BundleCE
CVMFS SubsetBuilder
2 Submitapps
Subset CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC Status
Done
bull Prototype to run simplesubmissions (Hello World)
bull First version of theSubset-CVMFS-Builder
To be done
bull CE configuration to run jobswithin Singularity
bull BundleCE to aggregatemultiple jobs in an allocation
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
- Introduction
- Table of Contents
- DIRAC WMS on Grid Sites
- DIRAC WMS and Supercomputers
- Tackling the distributed computing challenges
- References
- LHCb-supercomputers collaboration
-
LHCb-supercomputers collaboration
Marconi-A2 CINECA Status
rArr More details about LHCb work on CINECA [3]
rArr Marconi-A2 has been replaced by Marconi-100 V100 GPUs andPower9 CPUs cluster
Done
bull Exploited 68272 cores pernode not enough memory formore jobs
To be done
bull Nothing to do Marconi-A2disappeared
bull LHCb software not ready forGPUs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC
bull Ranked 277th in Top500 (Nov2020)
+ External connectivity from theWNs
+ CVMFS mounted on the WNs
+ Accessible via SSH (specialaccess)
bull 33856 cores (Nov 2020)
- Multi-core allocations 24 or 48logical cores per node
- Multi-node allocations 21nodes per allocation requiredby some queues
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Development
bull External Connectivity Use thepull model
bull Multi-core allocations Use thefat node partitioning variation
bull Multi-node allocations Use thesub-pilots variation
1 Submit pilots via ssh
SDumontWorker nodes
DIRAC
Services
Pilot-Factory
WaitingJobs
2 Fetchjobs
LRMS CVMFS
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
SDumont LNCC Status
Done
bull Exploit 2424 and 4848 coresper node
To be done
bull Multi-node allocation shouldhave results soon
bull Dirac Benchmark not adaptedto multi-core allocations 20of the jobs run out of time
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
Mare Nostrum BSC
bull Ranked 42nd in Top500 (Nov2020)
+ Accessible via a CE (ARC) andalso SSH
+ Single-core allocation possiblebut not recommended
bull 153216 cores (Nov 2020)
- No network connectivity
- CVMFS not mounted on theWNs
Virtual DIRAC Usersrsquo Workshop - Monday 10th May 2021 2121
LHCb-supercomputers collaboration
MareNostrum, BSC: Development
• No external connectivity: use the push model
• No CVMFS mounted on the WNs: use the Subset-CVMFS-Builder variation
• To get multi-core allocations: use the BundleCE variation
[Diagram: the BundleCE (1) fetches waiting jobs from the DIRAC services and (2) submits the applications through the ARC CE to the MareNostrum worker nodes; the Subset-CVMFS-Builder provides a subset of CVMFS to the nodes.]
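The Subset-CVMFS-Builder idea — ship only the slice of CVMFS a payload actually needs to nodes that have no mount — can be sketched as packing a known list of paths into a relocatable tarball. This is a simplified sketch under the assumption that the needed paths are already known (e.g. recorded from a traced reference run); the real Subset-CVMFS-Builder may work differently.

```python
import tarfile
from pathlib import Path

def build_cvmfs_subset(cvmfs_root: str, needed: list[str],
                       out_tar: str) -> int:
    """Pack only the CVMFS-relative paths a payload touches into a
    gzipped tarball that can be staged to worker nodes without a
    CVMFS mount. Returns the number of paths actually packed."""
    root = Path(cvmfs_root)
    count = 0
    with tarfile.open(out_tar, "w:gz") as tar:
        for rel in needed:
            src = root / rel
            if src.exists():
                # Preserve the relative layout so jobs can point
                # their software paths at the unpacked subset.
                tar.add(src, arcname=rel)
                count += 1
    return count
```

On the worker node side the tarball would be unpacked once per allocation, and the job environment redirected from `/cvmfs/...` to the local subset directory.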
LHCb-supercomputers collaboration
MareNostrum, BSC: Status
Done:
• Prototype to run simple submissions (Hello World)
• First version of the Subset-CVMFS-Builder
To be done:
• CE configuration to run jobs within Singularity
• BundleCE to aggregate multiple jobs in an allocation
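The remaining BundleCE work — aggregating multiple jobs into one multi-core allocation — amounts, in its simplest form, to running several independent payloads concurrently up to the allocation's core budget. The sketch below illustrates that idea only; it is an assumption about the approach, not the DIRAC BundleCE code.

```python
import subprocess

def bundle_payloads(commands: list[list[str]],
                    cores_per_job: int, total_cores: int) -> list[int]:
    """Run several payload commands concurrently inside a single
    allocation, never exceeding the allocation's core budget, and
    return their exit codes in submission order."""
    max_parallel = max(1, total_cores // cores_per_job)
    codes: list[int | None] = [None] * len(commands)
    running: list[tuple[int, subprocess.Popen]] = []
    i = 0
    while i < len(commands) or running:
        # Fill free slots with the next queued payloads.
        while i < len(commands) and len(running) < max_parallel:
            running.append((i, subprocess.Popen(commands[i])))
            i += 1
        # Wait for the oldest running payload to finish.
        idx, proc = running.pop(0)
        codes[idx] = proc.wait()
    return codes
```

Aggregation like this is what makes large minimum allocations (whole nodes, or 21-node blocks) usable for workloads made of many smaller jobs.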