opportunistic cloud computing for german hep

26
Opportunistic Cloud Computing for German HEP R. F. von Cube , G. Quast, R. Caspart, M. Fischer, M. Giffels, C. Heidecker, R. Hofsaess, M. Horzela, E. Kuehn, M. J. Schnepf | ErUM Data IDT Collaboration Meeting, May 11, 2021 KIT – The Research University in the Helmholtz Association www.kit.edu

Upload: others

Post on 23-Apr-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Opportunistic Cloud Computing for German HEP

Opportunistic Cloud Computing for German HEP

R. F. von Cube, G. Quast, R. Caspart, M. Fischer, M. Giffels, C. Heidecker, R. Hofsaess,M. Horzela, E. Kuehn, M. J. Schnepf | ErUM Data IDT Collaboration Meeting, May 11, 2021

KIT – The Research University in the Helmholtz Association www.kit.edu

Page 2: Opportunistic Cloud Computing for German HEP

Clear challengeExpect exploding demand forcomputing infrastructure

Proposed solutionsSoftware improvementsIntegration of additionalnon-HEP resourcesOptimization of existingworkflows

2020 2022 2024 2026 2028 2030Year

0

10000

20000

30000

40000

50000

60000

Tota

l CPU

[kHS

06-y

ears

] CMSPublicTotal CPU

2020 estimates

Run 3 Run 4

Run4: 200PU and 275fb 1/yr, 7.5 kHz, no on-going R&D includedRun4: 200PU and 500fb 1/yr, 10 kHz, no on-going R&D included10 to 20% annual resource increase

CMSOfflineComputingResults

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

2/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

The HEP Computing Challenge I

Page 3: Opportunistic Cloud Computing for German HEP

Access to multiple, heterogeneous resourcesDifficult for experiments and users:

Multiple identities, “submission types”

Assessment which resource is available and suitable

Experiments can’t negotiate with each resource provider

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

3/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

The HEP Computing Challenge II

Page 4: Opportunistic Cloud Computing for German HEP

Multiple experiments

on heterogeneous resourcesOpportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Computing in German (Astro-)ParticlePhysics

Page 5: Opportunistic Cloud Computing for German HEP

HPCs

TIER 3s

TIER 1

Clouds

TIER 2s

Storage

Storage

Multiple experiments on heterogeneous resources

Opportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Computing in German (Astro-)ParticlePhysics

Page 6: Opportunistic Cloud Computing for German HEP

HPCs

TIER 3s

TIER 1

Clouds

TIER 2s

Storage

Storage

Multiple experiments on heterogeneous resources

Opportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Computing in German (Astro-)ParticlePhysics

Page 7: Opportunistic Cloud Computing for German HEP

HPCs

TIER 3s

TIER 1

Clouds

TIER 2s

Storage

Storage

HTCondor CE(s)

Data Lake Entry Point

Single Point(s) of EntryHGF

HPCs

TIER 3s

TIER 1

Clouds

TIER 2s

Storage

Storage

HTCondorBatch System

HTCondor CE(s)

Data Lake Entry Point

Single Point(s) of EntryHGF HTCondorBatch System

Multiple experiments on heterogeneous resourcesOpportunistic, transparent resource integration and lightweight siteoperation

Data locality for efficient usage

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Computing in German (Astro-)ParticlePhysics

Page 8: Opportunistic Cloud Computing for German HEP

HPCs

TIER 3s

TIER 1

Clouds

TIER 2s

Storage

Storage

HTCondor CE(s)

Data Lake Entry Point

Single Point(s) of EntryHGF

HPCs

TIER 3s

TIER 1

Clouds

TIER 2s

Storage

Storage

HTCondorBatch System

HTCondor CE(s)

Data Lake Entry Point

Single Point(s) of EntryHGF HTCondorBatch System

HPCs

TIER 3s

TIER 1

Clouds

TIER 2s

Storage

Storage

Cache

Cache

Cache

Cache

HTCondor CE(s)

Data Lake Entry Point

Single Point(s) of EntryHGF

HTCondor CE(s)

Data Lake Entry Point

Single Point(s) of EntryHGF HTCondorBatch System

In the future

Multiple experiments on heterogeneous resourcesOpportunistic, transparent resource integration and lightweight siteoperationData locality for efficient usage

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

4/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Computing in German (Astro-)ParticlePhysics

Page 9: Opportunistic Cloud Computing for German HEP

Resource integration A1

HTCondor-CE cloud-htcondor-ce-1-kit

HTCondor OBS GridKa

Single Point of Entry

Bonn Tier 3(BAF)

KIT HPC(ForHLR2)

KIT Tier 3(TOpAS)

LMU Munich OpenStack

Bonn HPC(BONNA)

KIT HPC(HoreKa)

Data locality A2

 Batch System

 

Job Flow

CacheInformation

WorkerNode

 

WorkerNode

Job To DataMatching

JobSubmission

 

Central Management Opportunistic Resource

GridStorageElement

Resource Pool

Cache

 XRootDProxy

HEPUser

DataFlow

WorkerNode

 

 Coordination

Service

Lightweight A1/B2site operation

GSI

HTCondor CEgrid-ce-1-rwth.gridka.de

TIER 2

A lot of progress was made in the scope of ErUM Data!⇒ always looking for new collaborators. Contact us!

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

5/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Areas of Development at KIT

Page 10: Opportunistic Cloud Computing for German HEP

Resource integration A1

HTCondor-CE cloud-htcondor-ce-1-kit

HTCondor OBS GridKa

Single Point of Entry

Bonn Tier 3(BAF)

KIT HPC(ForHLR2)

KIT Tier 3(TOpAS)

LMU Munich OpenStack

Bonn HPC(BONNA)

KIT HPC(HoreKa)

Data locality A2

 Batch System

 

Job Flow

CacheInformation

WorkerNode

 

WorkerNode

Job To DataMatching

JobSubmission

 

Central Management Opportunistic Resource

GridStorageElement

Resource Pool

Cache

 XRootDProxy

HEPUser

DataFlow

WorkerNode

 

 Coordination

Service

Lightweight A1/B2site operation

GSI

HTCondor CEgrid-ce-1-rwth.gridka.de

TIER 2

A lot of progress was made in the scope of ErUM Data!⇒ always looking for new collaborators. Contact us!

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

5/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Areas of Development at KIT

Page 11: Opportunistic Cloud Computing for German HEP

Dynamic integration through single point of entryProvide resources transparent to experiments and users

React on resource demand

Integration of resources into common overlay batch system (OBS)Experiments and users authenticate only against single point of entry usingstandard grid technology

Assessment of resource fit to current job mix through metrics allocation andutilization

COBALD/TARDIS decision based on feedback not predictionIncrease well-used resources, release badly used

Allocation of resources through generalized pilots, so-called dronesDrone can be any executable, container, or virtual machine, depending onresource

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

6/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

COBALD/TARDIS Resource Manager

Page 12: Opportunistic Cloud Computing for German HEP

allocate opportunistic resources: Cloud, HPC, university clustersprovide opportunistic resources as extension to community-specificcomputing resourcesintegrate resources transparently and based on current demand

External SiteExternal SiteResource Provider

integrate

into OBS

Local Site

Overlay

Batch System

(OBS)

job

submission

jobs

usage

monitoring

Access Point

schedule request and start drones

Resource Pool

dronedrone

drone

users

increase

resources

decrease

resources

utilization

usage

monitoring

TA ISCOBalD

TA ISCOBalD

TA ISCOBalD

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

7/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Resource Integration

Page 13: Opportunistic Cloud Computing for German HEP

TARDISDynamically provisions and integratesresources into overlay batch system.

COBALDAssesses the suitability of resources to thecurrent job mix with metrics allocation andutilization.

notused

Utilization

Allo

catio

n

Mem

ory

Frac

tion

CPU Fraction

1

1

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

8/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Ressource Assessment with TARDIS andCOBALD

Page 14: Opportunistic Cloud Computing for German HEP

TARDISDynamically provisions and integratesresources into overlay batch system.

COBALDAssesses the suitability of resources to thecurrent job mix with metrics allocation andutilization.

notused

Utilization

Allo

catio

n

Mem

ory

Frac

tion

CPU Fraction

1

1

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

8/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Ressource Assessment with TARDIS andCOBALD

Page 15: Opportunistic Cloud Computing for German HEP

TARDISDynamically provisions and integratesresources into overlay batch system.

COBALDAssesses the suitability of resources to thecurrent job mix with metrics allocation andutilization.

notused

Utilization

Allo

catio

n

Mem

ory

Frac

tion

CPU Fraction

1

1

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

8/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Ressource Assessment with TARDIS andCOBALD

Page 16: Opportunistic Cloud Computing for German HEP

HTCondor-CE cloud-htcondor-ce-1-kit

HTCondor OBS GridKa

Single Point of Entry

Bonn Tier 3(BAF)

KIT HPC(ForHLR2)

KIT Tier 3(TOpAS)

LMU Munich OpenStack

Bonn HPC(BONNA)

KIT HPC(HoreKa)

built prototype of federated infrastructure with U Bonn, LMU, and KITserves multiple VOs / Experiments

→ wide-range and beneficial collaborations across communities

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

9/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Resource Integration and Lightweight SiteOperation

Page 17: Opportunistic Cloud Computing for German HEP

HTCondor-CE cloud-htcondor-ce-1-kit

HTCondor OBS GridKa

Single Point of Entry

Bonn Tier 3(BAF)

KIT HPC(ForHLR2)

KIT Tier 3(TOpAS)

LMU Munich OpenStack

Bonn HPC(BONNA)

KIT HPC(HoreKa)

built prototype of federated infrastructure with U Bonn, LMU, and KITserves multiple VOs / Experiments

→ wide-range and beneficial collaborations across communities

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

9/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Resource Integration and Lightweight SiteOperation

Page 18: Opportunistic Cloud Computing for German HEP

Resource integration A1

HTCondor-CE cloud-htcondor-ce-1-kit

HTCondor OBS GridKa

Single Point of Entry

Bonn Tier 3(BAF)

KIT HPC(ForHLR2)

KIT Tier 3(TOpAS)

LMU Munich OpenStack

Bonn HPC(BONNA)

KIT HPC(HoreKa)

Data locality A2

 Batch System

 

Job Flow

CacheInformation

WorkerNode

 

WorkerNode

Job To DataMatching

JobSubmission

 

Central Management Opportunistic Resource

GridStorageElement

Resource Pool

Cache

 XRootDProxy

HEPUser

DataFlow

WorkerNode

 

 Coordination

Service

Lightweight A1/B2site operation

GSI

HTCondor CEgrid-ce-1-rwth.gridka.de

TIER 2

A lot of progress was made in the scope of ErUM Data!⇒ always looking for new collaborators. Contact us!

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

10/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Areas of Development at KIT

Page 19: Opportunistic Cloud Computing for German HEP

Compute elements (CEs) are entry points for users and experiments to gridsites

Necessary to be operated for each site

GridKa operates multiple CEs for the Karlsruhe Tier 1

Deployment automated with puppet modules

Operate CE as a service for other sitesOverhead is minimal to run additional CE

Only minor changes in “standard” HTCONDOR-CE configuration

“Remote CE” implemented for the Aachen Tier 2 site⇒ Aachen doesn’tneed to operate CE anymore

Interested Tier 2 sites may contact us

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

11/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Lightweight Tier 2 Operations

Page 20: Opportunistic Cloud Computing for German HEP

Aachen physics department operates standard WLCG tier 2 site∼ 5100 cores pledged to CMS plus storage and grid services

Aachen researchers have access to university HPC cluster CLAIXResources integrated into WLCG tier 2 site

Bash script submitted to CLAIX’ SLURM workload manager sets up and startsHTCONDOR in unprivileged user accountJobs are started in SINGULARITY containers, providing WLCG environmentUsage completely transparent to experiments and users

CLAIX dynamically made available through COBALD/TARDIS

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

12/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

First Integration in aCMS Tier 2 WLCG Site

Page 21: Opportunistic Cloud Computing for German HEP

NetworkingHTCONDOR daemons communicate using the “Condor Connection Broker”

SingularitySupport for nested SINGULARITY containers for GlideInWMS-pilots

Activation of user namespaces, usage of sandbox-image

Singularity bind mounts unset in container

CVMFSConfig CMS_LOCAL_SITE is dangling symbolic link on host

Local SITECONF is provided through bind mount within container

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

13/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Integration of CLAIX:Challenges and Solutions

Page 22: Opportunistic Cloud Computing for German HEP

NetworkingHTCONDOR daemons communicate using the “Condor Connection Broker”

SingularitySupport for nested SINGULARITY containers for GlideInWMS-pilots

Activation of user namespaces, usage of sandbox-image

Singularity bind mounts unset in container

CVMFSConfig CMS_LOCAL_SITE is dangling symbolic link on host

Local SITECONF is provided through bind mount within container

To overcome some challenges, close communicationwith CLAIX was essential, however, very productive!

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

13/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Integration of CLAIX:Challenges and Solutions

Page 23: Opportunistic Cloud Computing for German HEP

Smooth start of test setup after close communication with CLAIX andAachen grid teamVirtually doubled number of cores, available to experiment, for a week

Submission blocked by HPC center after using ∼ 7 times monthly quota

Thanks to Aachen grid team and CLAIX service personnel!

computing project of 7.5 Mio core-h for the next year just got approved

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

14/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Aachen: Scaling up

Page 24: Opportunistic Cloud Computing for German HEP

Start withtest setup

Increase no. ofcores to 5000

Further sub-mission blocked

Smooth start of test setup after close communication with CLAIX andAachen grid teamVirtually doubled number of cores, available to experiment, for a weekSubmission blocked by HPC center after using ∼ 7 times monthly quota

Thanks to Aachen grid team and CLAIX service personnel!

computing project of 7.5 Mio core-h for the next year just got approved

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

14/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Aachen: Scaling up

Page 25: Opportunistic Cloud Computing for German HEP

Start withtest setup

Increase no. ofcores to 5000

Further sub-mission blocked

Smooth start of test setup after close communication with CLAIX andAachen grid teamVirtually doubled number of cores, available to experiment, for a week

Submission blocked by HPC center after using ∼ 7 times monthly quota

Thanks to Aachen grid team and CLAIX service personnel!

computing project of 7.5 Mio core-h for the next year just got approved

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

14/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Aachen: Scaling up

Page 26: Opportunistic Cloud Computing for German HEP

COBALD/TARDIS allows for transparent usage of opportunistic resources

Resources are integrated dynamically into one overlay batch system

Aachen tier 2 CMS pledge was doubled for a week

Join us on on

GitHub: github.com/MatterMiners

Gitter: gitter.im/MatterMiners/community

Twitter: twitter.com/matterminers

Motivation COBALD/TARDIS CMS Tier 2 Setup Summary

15/15 May 11, 2021 R. Florian von Cube: Opportunistic Cloud Computing for German HEP ETP, KIT

Summary