opportunistic cloud computing for german hep
TRANSCRIPT
Opportunistic Cloud Computing for German HEP
R. F. von Cube, G. Quast, R. Caspart, M. Fischer, M. Giffels, C. Heidecker, R. Hofsaess,M. Horzela, E. Kuehn, M. J. Schnepf | ErUM Data IDT Collaboration Meeting, May 11, 2021
KIT – The Research University in the Helmholtz Association www.kit.edu
Clear challengeExpect exploding demand forcomputing infrastructure
Proposed solutionsSoftware improvementsIntegration of additionalnon-HEP resourcesOptimization of existingworkflows
2020 2022 2024 2026 2028 2030Year
0
10000
20000
30000
40000
50000
60000
Tota
l CPU
[kHS
06-y
ears
] CMSPublicTotal CPU
2020 estimates
Run 3 Run 4
Run4: 200PU and 275fb 1/yr, 7.5 kHz, no on-going R&D includedRun4: 200PU and 500fb 1/yr, 10 kHz, no on-going R&D included10 to 20% annual resource increase
CMSOfflineComputingResults
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
2/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
The HEP Computing Challenge I
Access to multiple, heterogeneous resourcesDifficult for experiments and users:
Multiple identities, “submission types”
Assessment which resource is available and suitable
Experiments can’t negotiate with each resource provider
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
3/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
The HEP Computing Challenge II
Multiple experiments
on heterogeneous resourcesOpportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Computing in German (Astro-)ParticlePhysics
HPCs
TIER 3s
TIER 1
Clouds
TIER 2s
Storage
Storage
Multiple experiments on heterogeneous resources
Opportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Computing in German (Astro-)ParticlePhysics
HPCs
TIER 3s
TIER 1
Clouds
TIER 2s
Storage
Storage
Multiple experiments on heterogeneous resources
Opportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Computing in German (Astro-)ParticlePhysics
HPCs
TIER 3s
TIER 1
Clouds
TIER 2s
Storage
Storage
HTCondor CE(s)
Data Lake Entry Point
Single Point(s) of EntryHGF
HPCs
TIER 3s
TIER 1
Clouds
TIER 2s
Storage
Storage
HTCondorBatch System
HTCondor CE(s)
Data Lake Entry Point
Single Point(s) of EntryHGF HTCondorBatch System
Multiple experiments on heterogeneous resourcesOpportunistic, transparent resource integration and lightweight siteoperation
Data locality for efficient usage
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Computing in German (Astro-)ParticlePhysics
HPCs
TIER 3s
TIER 1
Clouds
TIER 2s
Storage
Storage
HTCondor CE(s)
Data Lake Entry Point
Single Point(s) of EntryHGF
HPCs
TIER 3s
TIER 1
Clouds
TIER 2s
Storage
Storage
HTCondorBatch System
HTCondor CE(s)
Data Lake Entry Point
Single Point(s) of EntryHGF HTCondorBatch System
HPCs
TIER 3s
TIER 1
Clouds
TIER 2s
Storage
Storage
Cache
Cache
Cache
Cache
HTCondor CE(s)
Data Lake Entry Point
Single Point(s) of EntryHGF
HTCondor CE(s)
Data Lake Entry Point
Single Point(s) of EntryHGF HTCondorBatch System
In the future
Multiple experiments on heterogeneous resourcesOpportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Computing in German (Astro-)ParticlePhysics
Resource integration A1
HTCondor-CE cloud-htcondor-ce-1-kit
HTCondor OBS GridKa
Single Point of Entry
Bonn Tier 3(BAF)
KIT HPC(ForHLR2)
KIT Tier 3(TOpAS)
LMU Munich OpenStack
Bonn HPC(BONNA)
KIT HPC(HoreKa)
Data locality A2
Batch System
Job Flow
CacheInformation
WorkerNode
WorkerNode
Job To DataMatching
JobSubmission
Central Management Opportunistic Resource
GridStorageElement
Resource Pool
Cache
XRootDProxy
HEPUser
DataFlow
WorkerNode
Coordination
Service
Lightweight A1/B2site operation
GSI
HTCondor CEgrid-ce-1-rwth.gridka.de
TIER 2
A lot of progress was made in the scope of ErUM Data!⇒ always looking for new collaborators. Contact us!
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
5/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Areas of Development at KIT
Resource integration A1
HTCondor-CE cloud-htcondor-ce-1-kit
HTCondor OBS GridKa
Single Point of Entry
Bonn Tier 3(BAF)
KIT HPC(ForHLR2)
KIT Tier 3(TOpAS)
LMU Munich OpenStack
Bonn HPC(BONNA)
KIT HPC(HoreKa)
Data locality A2
Batch System
Job Flow
CacheInformation
WorkerNode
WorkerNode
Job To DataMatching
JobSubmission
Central Management Opportunistic Resource
GridStorageElement
Resource Pool
Cache
XRootDProxy
HEPUser
DataFlow
WorkerNode
Coordination
Service
Lightweight A1/B2site operation
GSI
HTCondor CEgrid-ce-1-rwth.gridka.de
TIER 2
A lot of progress was made in the scope of ErUM Data!⇒ always looking for new collaborators. Contact us!
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
5/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Areas of Development at KIT
Dynamic integration through single point of entryProvide resources transparent to experiments and users
React on resource demand
Integration of resources into common overlay batch system (OBS)Experiments and users authenticate only against single point of entry usingstandard grid technology
Assessment of resource fit to current job mix through metrics allocation andutilization
COBALD/TARDIS decision based on feedback not predictionIncrease well-used resources, release badly used
Allocation of resources through generalized pilots, so-called dronesDrone can be any executable, container, or virtual machine, depending onresource
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
6/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
COBALD/TARDIS Resource Manager
allocate opportunistic resources: Cloud, HPC, university clustersprovide opportunistic resources as extension to community-specificcomputing resourcesintegrate resources transparently and based on current demand
External SiteExternal SiteResource Provider
integrate
into OBS
Local Site
Overlay
Batch System
(OBS)
job
submission
jobs
usage
monitoring
Access Point
schedule request and start drones
Resource Pool
dronedrone
drone
users
increase
resources
decrease
resources
utilization
usage
monitoring
TA ISCOBalD
TA ISCOBalD
TA ISCOBalD
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
7/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Resource Integration
TARDISDynamically provisions and integratesresources into overlay batch system.
COBALDAssesses the suitability of resources to thecurrent job mix with metrics allocation andutilization.
notused
Utilization
Allo
catio
n
Mem
ory
Frac
tion
CPU Fraction
1
1
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
8/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Ressource Assessment with TARDIS andCOBALD
TARDISDynamically provisions and integratesresources into overlay batch system.
COBALDAssesses the suitability of resources to thecurrent job mix with metrics allocation andutilization.
notused
Utilization
Allo
catio
n
Mem
ory
Frac
tion
CPU Fraction
1
1
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
8/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Ressource Assessment with TARDIS andCOBALD
TARDISDynamically provisions and integratesresources into overlay batch system.
COBALDAssesses the suitability of resources to thecurrent job mix with metrics allocation andutilization.
notused
Utilization
Allo
catio
n
Mem
ory
Frac
tion
CPU Fraction
1
1
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
8/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Ressource Assessment with TARDIS andCOBALD
HTCondor-CE cloud-htcondor-ce-1-kit
HTCondor OBS GridKa
Single Point of Entry
Bonn Tier 3(BAF)
KIT HPC(ForHLR2)
KIT Tier 3(TOpAS)
LMU Munich OpenStack
Bonn HPC(BONNA)
KIT HPC(HoreKa)
built prototype of federated infrastructure with U Bonn, LMU, and KITserves multiple VOs / Experiments
→ wide-range and beneficial collaborations across communities
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
9/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Resource Integration and Lightweight SiteOperation
HTCondor-CE cloud-htcondor-ce-1-kit
HTCondor OBS GridKa
Single Point of Entry
Bonn Tier 3(BAF)
KIT HPC(ForHLR2)
KIT Tier 3(TOpAS)
LMU Munich OpenStack
Bonn HPC(BONNA)
KIT HPC(HoreKa)
built prototype of federated infrastructure with U Bonn, LMU, and KITserves multiple VOs / Experiments
→ wide-range and beneficial collaborations across communities
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
9/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Resource Integration and Lightweight SiteOperation
Resource integration A1
HTCondor-CE cloud-htcondor-ce-1-kit
HTCondor OBS GridKa
Single Point of Entry
Bonn Tier 3(BAF)
KIT HPC(ForHLR2)
KIT Tier 3(TOpAS)
LMU Munich OpenStack
Bonn HPC(BONNA)
KIT HPC(HoreKa)
Data locality A2
Batch System
Job Flow
CacheInformation
WorkerNode
WorkerNode
Job To DataMatching
JobSubmission
Central Management Opportunistic Resource
GridStorageElement
Resource Pool
Cache
XRootDProxy
HEPUser
DataFlow
WorkerNode
Coordination
Service
Lightweight A1/B2site operation
GSI
HTCondor CEgrid-ce-1-rwth.gridka.de
TIER 2
A lot of progress was made in the scope of ErUM Data!⇒ always looking for new collaborators. Contact us!
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
10/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Areas of Development at KIT
Compute elements (CEs) are entry points for users and experiments to gridsites
Necessary to be operated for each site
GridKa operates multiple CEs for the Karlsruhe Tier 1
Deployment automated with puppet modules
Operate CE as a service for other sitesOverhead is minimal to run additional CE
Only minor changes in “standard” HTCONDOR-CE configuration
“Remote CE” implemented for the Aachen Tier 2 site⇒ Aachen doesn’tneed to operate CE anymore
Interested Tier 2 sites may contact us
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
11/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Lightweight Tier 2 Operations
Aachen physics department operates standard WLCG tier 2 site∼ 5100 cores pledged to CMS plus storage and grid services
Aachen researchers have access to university HPC cluster CLAIXResources integrated into WLCG tier 2 site
Bash script submitted to CLAIX’ SLURM workload manager sets up and startsHTCONDOR in unprivileged user accountJobs are started in SINGULARITY containers, providing WLCG environmentUsage completely transparent to experiments and users
CLAIX dynamically made available through COBALD/TARDIS
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
12/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
First Integration in aCMS Tier 2 WLCG Site
NetworkingHTCONDOR daemons communicate using the “Condor Connection Broker”
SingularitySupport for nested SINGULARITY containers for GlideInWMS-pilots
Activation of user namespaces, usage of sandbox-image
Singularity bind mounts unset in container
CVMFSConfig CMS_LOCAL_SITE is dangling symbolic link on host
Local SITECONF is provided through bind mount within container
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
13/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Integration of CLAIX:Challenges and Solutions
NetworkingHTCONDOR daemons communicate using the “Condor Connection Broker”
SingularitySupport for nested SINGULARITY containers for GlideInWMS-pilots
Activation of user namespaces, usage of sandbox-image
Singularity bind mounts unset in container
CVMFSConfig CMS_LOCAL_SITE is dangling symbolic link on host
Local SITECONF is provided through bind mount within container
To overcome some challenges, close communicationwith CLAIX was essential, however, very productive!
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
13/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Integration of CLAIX:Challenges and Solutions
Smooth start of test setup after close communication with CLAIX andAachen grid teamVirtually doubled number of cores, available to experiment, for a week
Submission blocked by HPC center after using ∼ 7 times monthly quota
Thanks to Aachen grid team and CLAIX service personnel!
computing project of 7.5 Mio core-h for the next year just got approved
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
14/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Aachen: Scaling up
Start withtest setup
Increase no. ofcores to 5000
Further sub-mission blocked
Smooth start of test setup after close communication with CLAIX andAachen grid teamVirtually doubled number of cores, available to experiment, for a weekSubmission blocked by HPC center after using ∼ 7 times monthly quota
Thanks to Aachen grid team and CLAIX service personnel!
computing project of 7.5 Mio core-h for the next year just got approved
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
14/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Aachen: Scaling up
Start withtest setup
Increase no. ofcores to 5000
Further sub-mission blocked
Smooth start of test setup after close communication with CLAIX andAachen grid teamVirtually doubled number of cores, available to experiment, for a week
Submission blocked by HPC center after using ∼ 7 times monthly quota
Thanks to Aachen grid team and CLAIX service personnel!
computing project of 7.5 Mio core-h for the next year just got approved
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
14/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Aachen: Scaling up
COBALD/TARDIS allows for transparent usage of opportunistic resources
Resources are integrated dynamically into one overlay batch system
Aachen tier 2 CMS pledge was doubled for a week
Join us on on
GitHub: github.com/MatterMiners
Gitter: gitter.im/MatterMiners/community
Twitter: twitter.com/matterminers
Motivation COBALD/TARDIS CMS Tier 2 Setup Summary
15/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT
Summary