
Page 1

Sim@P1: Using Cloudscheduler for offline processing on the ATLAS HLT farm

F Berghaus for the Sim@P1 team

on behalf of the ATLAS Collaboration

CHEP 2018, 10 July 2018

Page 2: Outline

- Definition: What is Sim@P1?
- Status: Current operation
- Plan: Integration of Cloudscheduler

Page 3: What is Sim@P1?

[Diagram: data flow from the ATLAS detector through the Level-1 Trigger and the High-Level Trigger to the CERN Data Centre.]

HLT farm resources:

Racks              Servers per rack  Cores per node  RAM per node  RAM per core  Total cores
1-4, 6-13, 94, 95  32                16              ~24 GByte     ~1.5 GByte    10K
64-69              40                16
16-26, 75-77       32                48              ~64 GByte     ~1.3 GByte    64K
70-74, 79-90       40                48
44-54              40                56              64 GByte      ~1.1 GByte
Total: 58                                                                        74K

Sim@P1 = Simulation at Point 1: ATLAS simulation jobs run on the High-Level Trigger (HLT) farm when it is not needed for data taking.

Page 4: Sim@P1: Current Operation

- Dedicated VLAN for offline access to a list of hosts in the CERN General Purpose Network (GPN)
- Compute resources isolated by virtualization

[Diagram: P1 network. Control Network core routers (modular, multiple line cards) connect the OpenStack controllers (2x GbE), Glance services (2x GbE), DHCP + Ganglia servers (2x GbE), CVMFS + Frontier Squid proxies (4x GbE), RabbitMQ/MySQL, and racks of 32 or 40 HLT servers (1x GbE each, 2x GbE per rack) to the Castor router (modular, multiple line cards) over 8x 10 Gbps, with up to 40 Gbps available; the GPN uplink runs through a non-modular switch at 1 Gbps and the Point 1 gateways.]

Page 5: Sim@P1: Current Operation

Boot:
- Puppet launches nova on the worker nodes
- Puppet executes scripts to launch the instances
- Instances connect to an HTCondor central manager (CM) and two schedds in the GPN
- AutoPyFactory (APF) submits to each schedd

Shutdown:
- Puppet kills nova on the worker nodes
- Puppet calls cleanup scripts

(A minimal sketch of this cycle follows below.)

[Plot: running job slots from Jan 2017 to Jun 2018, peaking near 30k.]
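For illustration only, here is a minimal sketch of that boot/shutdown cycle using the OpenStack SDK for Python rather than the actual Puppet-driven scripts; the cloud name, image, flavor, and network labels are hypothetical placeholders.

```python
# Illustrative sketch only: the real Sim@P1 boot is driven by Puppet-managed
# scripts on the worker nodes. Cloud/image/flavor/network names are invented.
import openstack

def boot_workers(count):
    conn = openstack.connect(cloud="simatp1")  # hypothetical clouds.yaml entry
    servers = []
    for i in range(count):
        server = conn.create_server(
            name=f"simatp1-worker-{i:04d}",
            image="cernvm3-micro",     # CernVM micro-kernel served by Glance
            flavor="hlt.full-node",    # hypothetical full-node flavor
            network="p1-data",         # hypothetical P1 data network
            wait=False,
        )
        servers.append(server)
    return conn, servers

def shutdown_workers(conn):
    # Tear every instance in the project down before data taking resumes.
    for server in conn.compute.servers():
        conn.compute.delete_server(server)

if __name__ == "__main__":
    conn, _ = boot_workers(4)
    # ... instances contextualize, join the HTCondor pool in the GPN, run jobs ...
    shutdown_workers(conn)
```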

Page 6: CernVM at Point 1

- 20 MB CernVM3 micro-kernel distributed from Glance
- CernVM3 caches in the ATLAS software and operating system
- Two Squid servers in P1 are sufficient to provide the software (a sketch of the client-side proxy setting follows the plot)

[Plot: outbound traffic (Gbps) of Proxy 1 and Proxy 2 and the number of hosts (up to ~2500), 11-17 June.]
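The software caching behind those two squids is done by CVMFS. As a hypothetical illustration (the squid hostnames are invented, not taken from the slide), the CVMFS client setting inside each instance would point at the two local proxies roughly like this:

```python
# Hypothetical sketch of the client-side CVMFS proxy setup inside each
# CernVM instance; the squid hostnames are invented placeholders.
# In CVMFS_HTTP_PROXY, "|" separates load-balanced proxies within one group.
proxies = ["http://squid1.p1.example:3128", "http://squid2.p1.example:3128"]

cvmfs_local_config = (
    "CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch\n"
    f"CVMFS_HTTP_PROXY=\"{'|'.join(proxies)}\"\n"
)

# Would be written to /etc/cvmfs/default.local during contextualization.
print(cvmfs_local_config)
```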

Page 7: Issues with current operation

Hard to maintain:
- Many undocumented scripts
- Scripts spread over many servers in P1 and in the GPN
- No error handling for running instances
- Hard to update or modify

Page 8: Proposal for Sim@P1

Page 9: Cloudscheduler

Batch system on distributed cloud infrastructure.

In production for offline processing for:
- ATLAS (2012 - present)
- Belle-II (2014 - present)

[Diagram: users submit jobs to a job scheduler (HTCondor, Torque, etc.); Cloud Scheduler communicates with the scheduler about its status and boots virtual machines on physical hosts through the cloud interfaces (Nova, Boto, OCCI) of university and institution clusters, and through commercial cloud interfaces (Amazon, Google, Microsoft).]
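As a rough illustration of the idea, not Cloudscheduler's actual implementation: the core loop matches idle jobs in the batch queue against clouds with free capacity and boots VMs to serve them. All interfaces below are hypothetical stand-ins.

```python
# Toy sketch of the cloudscheduler idea: watch the batch queue, boot VMs on
# whichever cloud has room, let them join the pool. Interfaces are invented.
import time
from dataclasses import dataclass

@dataclass
class Cloud:
    name: str
    free_slots: int

    def boot_vm(self):                # stand-in for Nova/Boto/OCCI calls
        self.free_slots -= 1
        print(f"booting VM on {self.name}")

def idle_jobs(condor_q):
    # Stand-in for an HTCondor query, e.g. condor_q -constraint 'JobStatus==1'
    return [j for j in condor_q if j["status"] == "idle"]

def schedule(condor_q, clouds):
    for job in idle_jobs(condor_q):
        for cloud in clouds:
            if cloud.free_slots > 0:
                cloud.boot_vm()       # VM joins the condor pool and runs jobs
                job["status"] = "matched"
                break

if __name__ == "__main__":
    queue = [{"id": i, "status": "idle"} for i in range(3)]
    clouds = [Cloud("university-cluster", 2), Cloud("commercial-cloud", 8)]
    while idle_jobs(queue):
        schedule(queue, clouds)
        time.sleep(1)
```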

Page 10: Cloudscheduler at Point 1

Proposal for Long Shutdown 2 (LS2):
- Cloudscheduler and OpenStack run in the P1 network
- The polling thread and HTCondor run in the CERN GPN
- Cloudscheduler and the polling thread interact through a database

[Diagram: inside P1, the OpenStack controller, Cloudscheduler, DHCP + Ganglia, and MariaDB with the HTCondor poller serve racks of 32 or 40 HLT servers; HTCondor, Harvester, and PanDA sit in the GPN.]

See also: ATLAS Distributed Computing: Its Central Services, Chris Lee, Track 3 @ 16:00

Page 11: Cloudscheduler at Point 1

Communication flow for Cloudscheduler: requires a channel to the database between P1 and the GPN.

[Diagram: the negotiator, collector, and three schedds in the CERN GPN feed the Cloudscheduler poller, which writes to MySQL; Cloudscheduler reads the database and drives OpenStack from the P1 control network; CernVM instances, configured by Puppet, run on the TPUs in the P1 data network. Status: proposal, under evaluation.]
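A minimal sketch of that database-mediated split, with sqlite3 standing in for the shared MariaDB/MySQL database and with an invented schema: the poller on the GPN side records HTCondor demand, and Cloudscheduler on the P1 side reads it to decide how many VMs to boot.

```python
# Sketch of the P1/GPN split, with sqlite3 standing in for the MariaDB/MySQL
# database the two sides would share. Table and column names are invented.
import sqlite3

def setup(db):
    db.execute("CREATE TABLE IF NOT EXISTS condor_status "
               "(queue TEXT PRIMARY KEY, idle_jobs INTEGER)")

def gpn_poller(db, condor_queues):
    # Runs in the GPN: polls the schedds and records idle-job counts.
    for queue, idle in condor_queues.items():
        db.execute("REPLACE INTO condor_status VALUES (?, ?)", (queue, idle))
    db.commit()

def p1_cloudscheduler(db, max_vms):
    # Runs in P1: reads demand from the database, boots VMs via OpenStack.
    (demand,) = db.execute("SELECT SUM(idle_jobs) FROM condor_status").fetchone()
    to_boot = min(demand or 0, max_vms)
    print(f"booting {to_boot} VMs on the HLT farm")  # OpenStack calls go here

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")  # one shared DB reachable from both sides
    setup(db)
    gpn_poller(db, {"sched1": 120, "sched2": 40, "sched3": 0})
    p1_cloudscheduler(db, max_vms=100)
```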

Page 12: Harvester Job Submission

- Harvester's pull mechanism allows job-specific resource requests
- Condor reports resource availability to Harvester to improve PanDA job brokering

(A toy sketch of the pull model follows below.)

See also: Harvester: an edge service harvesting heterogeneous resources for ATLAS, T Maeno, Track 3 @ 11:15 on Thursday

[Diagram: as on the previous slide, with Harvester and PanDA attached to the schedds in the GPN. Status: proposal, under evaluation.]
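To illustrate the pull model (a hypothetical sketch, not Harvester's real API): the edge service first measures what resources are free, then requests jobs shaped to fit them, instead of pushing fixed-size pilots blindly.

```python
# Toy sketch of pull-mode job submission: report free resources, then fetch
# jobs that fit them. The PanDA/Harvester interfaces here are invented.
from dataclasses import dataclass

@dataclass
class Slot:
    cores: int
    ram_gb: float

def free_slots_from_condor():
    # Stand-in for a condor_status query: unclaimed slots with their sizes.
    return [Slot(8, 12.0), Slot(48, 62.0), Slot(56, 64.0)]

def pull_jobs(panda, slots):
    # Ask the workload manager for jobs matching each free slot, so brokering
    # can hand out e.g. whole-node jobs for the 56-core machines.
    jobs = []
    for slot in slots:
        jobs += panda.get_jobs(max_cores=slot.cores, max_ram_gb=slot.ram_gb)
    return jobs

class FakePanda:
    def get_jobs(self, max_cores, max_ram_gb):
        return [{"cores": max_cores, "ram_gb": max_ram_gb, "task": "simulation"}]

if __name__ == "__main__":
    for job in pull_jobs(FakePanda(), free_slots_from_condor()):
        print("submitting", job)
```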

Page 13: Summary

- Sim@P1 is operating successfully
- A Cloudscheduler setup to ease operation is under evaluation
- A PanDA Harvester setup for more flexible job submission is under evaluation

Page 14: Thanks to many contributors

Cloudscheduler Team: K Casteels, C Driemel, M Ebert, C Leavett-Brown, M Paterson, R Seuster, R Sobie, R P Taylor, T Weiss-Gibbons

Sim@P1 Team: A Di Girolamo, C Lee, P Love, J Schovancova, R Walker

TDAQ Team: F Brasolin, D A Scannicchio, M E Pozo Astigarraga

Page 15: P1 Network Upgrades

Tentative future:
- Data and control share one link
- Follow the rest of the P1 network
- Hiding some heterogeneity

[Diagram: upgraded P1 network. As on page 4, but with the OpenStack controllers and Glance services on 2x 10GbE, the CVMFS + Frontier Squid proxies on 2x GbE, HLT servers on 10GbE links with 2x 10GbE per rack of 32 or 40, and a 2x 100 Gbps uplink toward the GPN through the Point 1 gateways.]