
Peter Kreuzer, RWTH Aachen/CERN; Oliver Gutsche, Fermilab

CMS Computing Shift Personnel

(CSP) Tutorial

10. January 2011

CMS Computing Shift Personnel (CSP) Tutorial, 01/10/11

Tutorial Structure

‣ Today :

‣ Brief Introduction to CMS Computing

‣ General Description of Computing Shift Procedure

‣ Subscription to the CMS Computing E-Log

‣ Organization of Vidyo access from local CMS center

‣ Questions

‣ After this tutorial and >= 2 months prior to 1st shift :

‣ New shifters go through the Shift Procedure and shadow experienced CSP by taking „passive“ shifts (only E-log reports, NO alarms)

‣ After 2 „passive“ shifts :

‣ Sign off by Peter/Oli

‣ Full participation as CSP

‣ Possibility to sign-up via the WEB

Brief Introduction to CMS Computing

Overview of the CMS Distributed Computing System

‣ Multi-tiered distributed computing infrastructure based on GRID technologies for resource access and data movement

‣ Many new challenges compared to established HEP experiments:

‣ Data distribution, user localization, site monitoring, support responsibilities

Overview of the CMS Distributed Computing System: Tier-0 / CAF

‣ Data archival (cold copy)

‣ Prompt reconstruction

‣ Time critical calibration & alignment

Overview of the CMS Distributed Computing System: Tier-1

‣ Data archival (hot copy)

‣ Reprocessing, skimming, MC production

‣ Data serving

Overview of the CMS Distributed Computing System: Tier-2

‣ Centralized Simulation

‣ Distributed Data Analysis

Overview of the CMS Distributed Computing System: Resources

‣ Processing resources

‣ Tier-1 level: ~35k jobs/day

‣ Tier-2 level: ~100k jobs/day

‣ Transfer rates (values from the slide diagram)

‣ 300 MB/s

‣ 600 MB/s

‣ Down: 50-500 MB/s bursts; Up: 20 MB/s sustained

Overview of the CMS Distributed Computing System

In total: 7 Tier-1 across 3 continents, ~50 Tier-2 across 4 continents

CSP introduction

CSP Role and Required Expertise

‣ The CSP is mainly monitoring systems and raising alarms

‣ monitor computing infrastructure and services at checkpoint hours by going through a set of checklists

‣ identify problems

‣ Create E-Log reports

‣ trigger actions

‣ open Savannah tickets, in particular to CMS Sites

‣ contact CRC, Core Computing Operators & Experts, Computing Experts On Call

➞ We are working on giving the CSP a more active role in troubleshooting problems

‣ Required expertise of the CSP

‣ Fair understanding of CMS distributed computing infrastructure + services required for data processing, transfers and analysis

‣ Physicist or technician from a collaborating CMS institute

‣ Tutorial + 2-3 assisted “passive” shifts

CMS Policy for Computing Shifts

‣ Computing shifts are accounted for as standard MoA service work defined by CMS (as Central CMS Shifts); see http://cms.cern.ch/iCMS/admin/moamanagement

‣ Standard requirement : 8 points per author per institute

‣ 1 CSP shift = 0.75 points on a weekday / 1.25 points on a weekend

‣ no extra credit for night shifts, since shifters cover all time zones

‣ (special arrangements not excluded)

‣ During Data taking computing shifts are carried out :

‣ From Main CMS Centres : CMS CC or FNAL/ROC

‣ From Remote CMS Centres : see http://lucas-nice.web.cern.ch/lucas-nice/cms-centre/www/CMS-Centres-Worldwide.pdf

‣ In 8-hour shifts (09-17/17-01/01-09), with 1 CSP per shift

‣ With the support of a Computing Run Coordinator, who is on duty at CERN for 1-week periods

‣ With the support of CMS Core Computing Operators & Experts
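As a rough illustration of the credit scheme and shift slots above, the sketch below computes the points for one shift and the slot a given hour falls into. The function names are hypothetical. Treating Saturday and Sunday as "week-end" and reading the slot boundaries as hours on a single clock are assumptions, since the slide specifies neither.

```python
from datetime import date
from math import ceil

# Values quoted on the slide; everything else here is illustrative.
POINTS_WEEKDAY = 0.75
POINTS_WEEKEND = 1.25
REQUIRED_POINTS = 8.0


def shift_points(day: date) -> float:
    """Credit for one CSP shift on the given calendar day."""
    # Assumption: "week-end" means Saturday (5) or Sunday (6).
    return POINTS_WEEKEND if day.weekday() >= 5 else POINTS_WEEKDAY


def shift_slot(hour: int) -> str:
    """Map an hour (0-23) to one of the three 8-hour slots."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if 9 <= hour < 17:
        return "09-17"
    if hour >= 17 or hour < 1:
        return "17-01"
    return "01-09"


def weekday_shifts_needed(points: float = REQUIRED_POINTS) -> int:
    """Number of weekday shifts needed to reach the given number of points."""
    return ceil(points / POINTS_WEEKDAY)
```

With these numbers, reaching 8 points through weekday CSP shifts alone would take 11 shifts (8 / 0.75 ≈ 10.7, rounded up).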

Other Roles & Interactions with CSP

‣ Computing Run Coordinator (CRC)

‣ Subscribes to all CSP E-log sub-sections

‣ Assists CSP in raising alarms/tickets for complex cases

‣ Calls EOC during off-working hours (see below)

‣ Core Computing Operator or Expert (FacOps, DataOps, AnaOps)

‣ Subscribes to relevant CSP E-Log sub-sections

‣ Supports CSP during working hours

‣ Computing Expert On Call (EOC)

‣ Responsible for a particular service

‣ Alarmed by CSP via Email/IM/Tel during working hours

‣ Alarmed by CRC, if really needed, outside working hours

‣ CMS Site Contact Person

‣ Responds to alarms (e.g. Savannah, GGUS tickets)

‣ Other shifters (DQM, Online, Detector, …)

‣ In temporary absence of CRC, the CSP is the Core Computing contact for any shifter at P5/CMS Center/FNAL ROC

‣ CSP procedure responsible

‣ Assigns CSP shifts

CSP tools

Prerequisites

‣ The CSP should

‣ be a CMS member

‣ if you are not, please fill in the web registration form http://cms.cern.ch/iCMS/jsp/secr/reg/reg.jsp

‣ After the form has been submitted, an email is sent to your Institute Representative (Team Leader) for approval

‣ If you have never been to CERN, it is necessary to send a copy of your passport to Anastasia Dolya, CMS Secretariat, CERN - PH Department, CH -1211 Geneva 23, Switzerland

‣ have a CMS Computer account

‣ for the Computer account, please contact [email protected]

‣ a Hypernews account

‣ a GRID certificate + CMS VO registration

‣ Please follow the link https://twiki.cern.ch/twiki/bin/view/CMS/WorkBookRunningGrid#Get_a_Grid_certificate_and_the_r for a guideline on how to proceed

Most Important CSP Tools

‣ Main CSP Shift Instructions

‣ https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShifts

‣ Vidyo connection to the Tandberg system (other CMS Centres)

‣ https://vidyoportal.cern.ch/

‣ Shift Sign-Up tool

‣ http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Shiftlist/daily

‣ Instant Messenger under “FacOpsShifter” account

‣ https://twiki.cern.ch/twiki/bin/view/CMS/InstructionsForAIMForComputingShifters

‣ Computing Plan of the Day

‣ http://cmsdoc.cern.ch/cmscc/shift/today.jsp

‣ Account in the CSP E-log

‣ https://prod-grid-logger.cern.ch/elog/

‣ Savannah account ( “cmscompinfrasup” member) for opening tickets

‣ https://savannah.cern.ch/projects/cmscompinfrasup/

‣ Membership in e-group [email protected]

‣ subscribe via https://e-groups.cern.ch/e-groups/EgroupsSearch.do

Shift Subscription Tool

‣ http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Shiftlist/ShiftSelection?shift_type=25

‣ Shift selection : Blue == available on any slot that day / Green == available on a particular slot that day

‣ Please always check the Green box corresponding to your time zone slot, to avoid being approved for other time zones

‣ Warning : selecting Green automatically selects Blue as well, so please deselect Blue to avoid confusion

Shift Subscription policies

As of the end of 2010, we have more demand for shifts than available slots (95 potential shifters !), so approvals need to follow stricter policies :

➡ Shift requests can be made anytime for any open shift period

➡ Shift approvals follow a monthly schedule: shifts are approved two months in advance to give all shifters a reasonable planning horizon

- Example : all shift requests for January are reviewed at the beginning of November; the requests are balanced between the different groups/regions and shifts are approved

➡ In the monthly approval process, we would like to follow this procedure:

- shift requests from shifters in their own time zone have priority

- within a time zone, shift requests are balanced first at the group/institute level, then at the level of individual shifters

➡ We also regularly publish the CSP shift planning and accounting tables, per time zone, per group and per shifter; see next slide.
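The two-month approval horizon can be written down as simple month arithmetic. This is only a sketch of the schedule described above; the function name is hypothetical, and it assumes "two months in advance" always means exactly two calendar months before the shift month, as in the January/November example.

```python
def review_month(shift_year: int, shift_month: int) -> tuple[int, int]:
    """Return (year, month) in which requests for a shift month are reviewed."""
    if not 1 <= shift_month <= 12:
        raise ValueError("shift_month must be in 1..12")
    # Count months from year 0, step back two months, convert back.
    index = shift_year * 12 + (shift_month - 1) - 2
    return index // 12, index % 12 + 1
```

For example, `review_month(2011, 1)` gives `(2010, 11)`, matching the January/November example on the slide.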

CSP Planning and Accounting

‣ https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShiftContacts#CSP_Planning_and_Accounting

Example for European time zone :

The CMS Computing Logbook

‣ https://prod-grid-logger.cern.ch/elog/

‣ 2 (unpleasant) features :

‣ need to enter your elog password the first time you access a given section

‣ need to regularly reload your browser to see updates

The Savannah ticketing tool

‣ main tool to communicate with sites and DataOps/FacOps/AnaOps to solve infrastructure problems

‣ Savannah Instructions for CSP :

‣ https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShifts

Submit a ticket

Savannah

‣ Category: mostly SAM tests, Job Robot, Data transfers, ...

‣ Severity: You judge !

‣ Privacy: “Public”

‣ Assigned to: either DataOps, FacOps, AnaOps or T1/T2 site squad

‣ Use GGUS: YES for T1s, NO for T2s

‣ Site: T1/T2 site squad

‣ Subject: if connected to a specific site, begin with [SITE]

‣ Example: [T1_US_FNAL]

‣ For Tier-1, please systematically bridge to GGUS (WLCG ticketing) via Use GGUS: Yes

‣ More information about that here : https://twiki.cern.ch/twiki/bin/view/CMS/FacOpsSavannahGGUS
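The ticket conventions above can be summarised in a small helper. This is a sketch only: it builds a plain dictionary and does not talk to Savannah or GGUS. The function name and the dictionary layout are hypothetical, and it assumes site names always start with the tier, as in the `[T1_US_FNAL]` example.

```python
def ticket_fields(site: str, problem: str) -> dict:
    """Collect the Savannah form values suggested on the slide for a site issue."""
    # Assumption: the tier is the part of the site name before the first "_".
    tier = site.split("_")[0]
    if tier not in ("T1", "T2"):
        raise ValueError(f"unexpected site name: {site!r}")
    return {
        "Subject": f"[{site}] {problem}",  # subject begins with [SITE]
        "Privacy": "Public",
        "Use GGUS": "YES" if tier == "T1" else "NO",  # bridge Tier-1 tickets to GGUS
    }
```

Category, Severity and "Assigned to" are left out on purpose, since the slide leaves them to the shifter's judgement.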

The Vidyo Interface

‣ We have set up a permanent Vidyo ➞ MCU video bridge

‣ Connects to the permanent video feed between the main CMS Centers and P5

‣ Remote shifters can be in direct contact with shifters at the CMS CC, P5 and FNAL ROC

‣ To avoid having too many connections, only one CSP shifter is allowed to connect at any one time

‣ CSP has to log on at the beginning of shift and log off at end

‣ Every remote CMS Center needs a Remote Video Admin (to connect to MCU) :

‣ Responsible for checking that the system is used properly and for holding the connection details

‣ Vidyo-capable PC (Windows and Mac clients OK, Linux client still in beta)

‣ Sites with existing “Tandberg” or “Polycom” devices will be connected to MCU directly

CSP procedures

General

Checklist 1 : Core

‣ CERN/Core infrastructure monitoring :

‣ Main checks: CERN/IT SSB, CMS Service Gridmaps, CMS Services scheduled upgrade, CASTORCMS instances

Checklist 2 : Tier-0

‣ Tier-0 workflows monitoring :

‣ Main checks: Storage Manager, T0Mon, tier0export pool, networking, batch/LSF farm, jobs

Checklist 3 : CAF

‣ CAF workflows monitoring :

‣ Main checks: free space/usage per CAF stakeholder on cmscaf pool, networking, batch/LSF farm, jobs

Checklist 4 : Data Transfers

‣ Distributed Data Transfer monitoring :

‣ Main checks: Queue-based monitoring for Tier-1s (not for T2s), Status of PhEDEx agents at sites

(Soon obsolete, see next slide)

New Checklist 4 : Data Transfers

‣ Distributed Data Transfer monitoring. Main checks :

‣ Status of PhEDEx agents at sites

‣ Queue-based monitoring for Tier-1s (not for T2s)

‣ This new tool will be tested with shifters during November and deployed by end of 2010, replacing the existing tool.

Checklist 5 : Grid Sites

‣ Distributed Grid sites monitoring :

‣ Main checks: SAM, JobRobot, Downtimes, Commissioning links, Savannah

Checklist 5 : Grid Sites

‣ Important

‣ CSP is asked to investigate the problem in as much detail as possible

‣ This helps the admin who receives the Savannah ticket to solve the problem quickly and easily

‣ DON’T REPORT THAT SITE X HAS A MEDIUM SIZE RED BALL!!!

‣ Report that site x shows failures in the <to be filled> SAM test

‣ In the body, investigate further what the problem is by clicking through the information provided till you reach the detailed error report

Checklists 6&7 : T1/T2 workflows

‣ Tier-1 workflows monitoring :

‣ Main checks: not covered so far, currently relying on T1 admins, T1 coordinators, DataOps

‣ Plan to add ProdMon/Dashboard monitoring + GlideIn Fabric monitoring

‣ Tier-2 workflows monitoring :

‣ Main checks: not covered so far, currently relying on T2 admins, T2 coordinators and CRAB support team

‣ Plan to collaborate with AnalysisOps monitoring

‣ Plan to add ProdMon/Dashboard monitoring

Some real examples

CAF monitoring

• Free space on CMS CAF disk starts to shrink for an unexpected reason

• CSP instructions (CAF) : If the fraction of free space on cmscaf as shown in URL1 goes below 10%, and this was not already mentioned in the Computing Plan of the Day, and there is no Savannah ticket already open, open an ELOG in the "CAF" category

• If the CSP does not detect this and raise the alarm, the free space might shrink to 0, with the consequence that the critical Tier-0 to CAF data flow breaks

• This really happened ! …and some uncontrolled emergency data flushing on the CAF had to be done ➞ WORST CASE SCENARIO !
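The CAF instruction above is a three-part condition, which can be spelled out as follows. This is a minimal sketch: the function and parameter names are hypothetical, and the inputs are assumed to be read off the monitoring page, the Computing Plan of the Day and Savannah by the shifter.

```python
def should_open_caf_elog(
    free_fraction: float,
    mentioned_in_plan_of_day: bool,
    savannah_ticket_open: bool,
    threshold: float = 0.10,  # the 10% limit quoted in the CSP instructions
) -> bool:
    """True if the CSP should open an ELOG entry in the "CAF" category."""
    return (
        free_fraction < threshold
        and not mentioned_in_plan_of_day
        and not savannah_ticket_open
    )
```

For instance, 8% free space with no mention in the Plan and no open ticket calls for an ELOG entry, while the same 8% with a ticket already open does not.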

Computing Plan of the Day

• Note : 3 Russian sites in downtime !

Grid Site Monitoring

• Example CMS Site Status Board :

JINR in Scheduled downtime Ignore Waiting Room

T2_CN_Beijing shows a red ball ! Known by Comp. Plan of Day ? No ! So what to do ?

Grid Site Monitoring

‣ Investigate further:

‣ Click on link next to “red ball”

‣ Check the different problem categories and even drill further down to check for the real problem

‣ Report in E-log

‣ Advanced CSP can open Savannah ticket to site

‣ Subject should include [SITE] and a short description of the problem that is as specific as possible

‣ Do not only mention that the site has a “red ball” !!!

‣ Ticket should contain as many of the details found during the investigation as possible

Other news on GRID site monitoring

• “lens symbol” == already known issue. NO Elog/ticket needed (but check that it is still the same problem)

• “At work symbol” == Site scheduled downtime. NO Elog/ticket needed

Note : Unscheduled downtimes are not yet marked with the “At work symbol”, so double-check with the Computing Plan of the Day and with the CMS Google Downtime Calendar (see next slide) before opening an Elog/ticket.

• If a T1 shows a red, small ball, the CSP should open an Elog/Savannah almost immediately (within 1-2h)

• For a T2, follow the instructions on when and how to open an Elog/Savannah
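The symbol rules above amount to a small decision table, sketched below. The function name, the symbol strings and the returned wording are hypothetical. Treating a downtime found in the Plan of the Day or the Downtime Calendar as "no Elog/ticket" is an assumption drawn from the note, not an explicit rule on the slide.

```python
def site_action(tier: str, symbol: str, known_downtime: bool = False) -> str:
    """Suggest what the CSP does for a site, given its Site Status Board symbol.

    symbol: "lens", "at work" or "red" (illustrative labels).
    known_downtime: the site is listed as down in the Computing Plan of the
    Day or the downtime calendar (assumed to mean no new report is needed).
    """
    if symbol == "lens":
        return "no Elog/ticket; check it is still the same problem"
    if symbol == "at work" or known_downtime:
        return "no Elog/ticket"
    if symbol == "red" and tier == "T1":
        return "open Elog/Savannah within 1-2h"
    if symbol == "red" and tier == "T2":
        return "follow the T2 instructions for Elog/Savannah"
    raise ValueError(f"no rule for tier={tier!r}, symbol={symbol!r}")
```

The checks are ordered so that the "no action" cases are ruled out first, mirroring the advice to double-check known issues and downtimes before opening anything.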

Other news on GRID site monitoring

CMS Google Downtime Calendar

PhEDEx Components Status Page

All Russian T2s have had their PhEDEx components down for ~3h. What to do ?

Check Computing Plan of the Day!

Evolution of CSP procedure

Where we stand and where we go

‣ Summer 08: CMS Computing shift procedures created

‣ Fall 08: introduced the concept of Computing Shift Person (CSP) and Computing Run Coordinator (CRC)

‣ Winter 08: ~100 shifts done by pool of ~30 computing experts at CMS Centre@CERN & FNAL/ROC

‣ 2009: CSP shifts covered by CMS collaborators at remote CMS Centres

‣ Pool of 45 CSPs from 3 time-zones (Asia, America, Europe)

‣ CMS Centres : Beijing, Rio, Sao Paulo, Texas Tech, Univ. of Florida, Aachen, DESY, FNAL, CERN

‣ 2010: extend above philosophy

‣ Pool of 70 CSPs (new remote Centres: GridKa, INFN Bologna, ... )

‣ Encourage strong remote teams who can provide local CSP support

‣ Strengthen role of CSP in trouble-shooting issues

‣ Enforce 24/7 coverage of critical services in shift procedures

‣ Move away from “Twiki” to DQM-like monitoring (in progress)

Critical Services and Sites

• We are currently revising the Criticality Level of all CMS services

• CSP instructions will be adapted accordingly

– Frequency of checks

– List of experts to contact

– Type of alarm : Elog, Savannah, telephone to CRC (who might raise GGUS alarm or call Expert on Call)

• As a general rule : the closer you are to the detector data stream, the more critical :

– Tier-0 : processing and storage

– CAF : processing and storage

– Central Services at CERN (Core) : DBS, PhEDEx, …

– Tier-0 – Tier-1 transfers

– Tier-1 Site Availability

➞ Please pay special attention to these workflows

• And always read the Computing Plan of the Day carefully

24/7 Critical Services & Sites Coverage (II)

Flow chart (box labels in slide order):

‣ Service/Facilities Monitoring: CSP checks every 2 hours

‣ Status Green ? (Yes / No)

‣ E-LogBook & Ticketing tool

‣ Expert answer within 1 hour ? (Yes / No)

‣ Service/Site Alarm Procedure

‣ Expert Computing Operations

‣ Problem solved ? (Yes / No)

‣ Core System Alarming

‣ Actors shown on the chart: CSP, CRC, CERN/IT

Computing Run Coordinator (CRC), reachable 24/7 for :

- Critical Service recovery procedure

- Priority (GGUS-Team) ticket to site

CMS Core Computing experts / CMS Site admins (*) :

- Apply routine service / infrastructure operations and monitoring

- Respond as On-Call Experts to Alarms

(*) CMS has dedicated site-contacts and site-admins

(**) highly critical alarms to Tier-0/1s are sent via GGUS-Alarm tickets and can trigger phone calls

(***) CRC, Service Expert or Site Admin actions are systematically reported back to the E-LogBook or Savannah or GGUS, for transparency purposes.

What should the CSP always do ?

• Subscribe to CSP shifts well in advance (> 1 week). If you need to cancel, consult P.Kreuzer/O.Gutsche AND remove your shift subscription

• Carefully read the Computing Plan of the Day and keep an eye on it during the whole shift. If the Plan is missing, read the report by the previous shifter and complain to the CRC via AIM or email

• Always connect to the instant messenger CSP account “FacOpsShifter”. When leaving the shift desk, inform the outside world by changing the messenger status (e.g. to “away for lunch”)

• When reporting an issue in the proper Elog section, provide details of the observed problem (not just the link)

• Regularly read Elog responses or announcements by CRC or Computing Experts, in all Elog sections (reload browser !)

• Write detailed final shift reports in Elog; even if nothing new has occurred during shift, report on main open issues

• Once trained (2-3 passive shifts), open Savannah tickets in case of well identified site issue, by carefully following the instructions http://twiki.ihep.ac.cn/twiki/bin/view/CMS/Savannah

What should the CSP never do ?

• Ignore a suspicious problem because it is too complex to understand. Solution : inform CRC or Computing experts via Elog

• Open a Savannah ticket without following the CSP instructions to identify a site problem (PhEDEx Component, SAM), or when confused about an observed problem. Solution : consult CRC or Computing Experts via Elog

• Cancel a shift or arrange a replacement without reporting it. Solution : inform the shift responsible in advance and cancel the subscription in the shiftlist

Last steps

Passive shifts

‣ Passive shifts

‣ Go through already signed up shifts and determine CSP time slot for doing passive shifts

‣ Contact CSP shifter and check if she/he is willing to act as passive shift host

‣ Confirm with O.Gutsche/P.Kreuzer

‣ Shift Subscription

‣ Once passive shifts done, subscribe to shifts (ideally 2 months in advance) via http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Shiftlist/ShiftSelection?shift_type=25

Subscriptions

‣ Assumption:

‣ Shifter already has CERN account and HyperNews account

‣ Sign up for elog access:

‣ https://prod-grid-logger.cern.ch/elog/

‣ Sign up for e-group [email protected]

‣ https://e-groups.cern.ch/e-groups/EgroupsSearch.do

‣ Sign up for correct Savannah access to write tickets:

‣ Login to Savannah (CERN afs login)

‣ https://savannah.cern.ch/my/groups.php

‣ under "Request for inclusion" type "CMS" and "search", this will display all groups, then click on "CMS Computing Infrastructure Support"

‣ Peter & Oli will approve the request

‣ Get a valid Grid Certificate and CMS VO registration

‣ https://twiki.cern.ch/twiki/bin/view/CMS/WorkBookRunningGrid#Get_a_Grid_certificate_and_the_r

And now we can practice more if you wish

Simply open https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShifts

Many thanks for your attention; we are looking forward to working with you !