
Page 1: AMOD Report  February 11-17 2013

AMOD Report February 11-17 2013

Torre Wenaus, BNL

February 19, 2013

Page 2: AMOD Report  February 11-17 2013


Activities

• Datataking until the 14th
• Tail end of high priority Moriond processing
• MC not quite keeping the grid full
• Sustained high levels of user analysis
• ~1.1M production jobs (group, MC, validation, reprocessing)
• ~3.5M analysis jobs
• ~680 analysis users

Page 3: AMOD Report  February 11-17 2013


Sayonara T0 prompt reco

Page 4: AMOD Report  February 11-17 2013


Production & Analysis

Production

Analysis: ~24k min – 43k max

Page 5: AMOD Report  February 11-17 2013


Data transfers

Page 6: AMOD Report  February 11-17 2013


Concurrent jobs, daily completed jobs

Page 7: AMOD Report  February 11-17 2013


Tier 0, Central Services, ADC

• Hammercloud SLS grey through the week. HC OK, problem is with SLS. SLS server being replaced.

• The usual occasional T0 bsub submit time spikes
• Tue: CMS usage visible in monitoring. ANALY_T2_CH_CERN queue had 306k jobs in transferring – not our issue, fortunately! Ale will take up monitoring changes with Valeri. Need VO attribute on site
• Wed evening: T0 ALARM ticket submitted by T0 team, pending job accumulation. When reviewed by LSF expert 90 min later, the accumulation was gone and the system had been operating normally. Was really below the threshold for an alarm ticket. GGUS:91501
• Thu: T2 transfer monitor plots briefly missing in part, password issue, quickly fixed
• Fri: reported recurrence of lcg-utils problem in the 3pm meeting at Rod's request. First ticketed at TW Feb 6, appeared Feb 14 at RAL. Seen also by LHCb on SL6. lcg-cp with the --srm-timeout option sleeps for the full timeout period if completion takes more than ~2 sec (see the timing sketch after this list). GGUS:91223
  – Reported in 3pm meeting to prompt some developer action; developer responded, ticket is now in 'waiting for [our] reply'
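A minimal timing sketch of the reported lcg-cp symptom, for illustration only: it assumes lcg-utils is installed and a valid grid proxy is available, and the source SURL, destination path and timeout value are hypothetical placeholders; only the lcg-cp command and its --srm-timeout option come from the ticket above.

```python
#!/usr/bin/env python3
"""Sketch for observing the lcg-cp / --srm-timeout behaviour reported in
GGUS:91223. Assumes lcg-utils and a valid grid proxy; the SURL, destination
and timeout below are placeholders, not values taken from the report."""
import subprocess
import time

SRC = "srm://some-se.example.org/atlas/path/to/file"  # hypothetical source SURL
DST = "file:///tmp/lcgcp-test-copy"                   # hypothetical destination
SRM_TIMEOUT = 180                                     # seconds, illustrative value

start = time.time()
# Reported symptom: if the SRM operation needs more than ~2 s to complete,
# lcg-cp sleeps for the full --srm-timeout period before returning, even
# though the copy itself finished much earlier.
rc = subprocess.call(["lcg-cp", "--srm-timeout", str(SRM_TIMEOUT), SRC, DST])
elapsed = time.time() - start
print(f"lcg-cp exit code {rc}, wall time {elapsed:.1f} s "
      f"(the bug shows as ~{SRM_TIMEOUT} s for a copy that completes quickly)")
```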

Page 8: AMOD Report  February 11-17 2013


Tier 1 Centers

• CNAF, FZK, NIKHEF out of T0 export through the week, DATADISKs almost full

• Thu am: Taiwan-LCG2: transfer errors to DATATAPE. Castor stager DB deadlock. Recovered in ~6hrs. GGUS:91505

• Fri am: BNL: fraction of transfers failing from several sources to BNL-OSG2_DATADISK. Fixed by rebalancing load across dCache pools. GGUS:91548

• Sat pm: transfer failures to Taiwan. Attributed by site to busy disk servers, OK again and ticket closed Sun night. GGUS:91581

• Sun pm: Source errors in transfers from TRIUMF-LCG2 and other CA sites. FTS cannot contact non-CA FTS servers. Resolved Mon pm. GGUS:91588

Page 9: AMOD Report  February 11-17 2013


Tier 2 calibration centers

• Mon am: IFIC-LCG2: resolved SRM server glitch causing transfer failures since Sun. GGUS:91327

• Sun am: IFIC-LCG2: SRM down, CALIBDISK failures of functional test transfers, all file transfers failing. Failure in one RAID group, taken offline, restored Lustre and SRM. GGUS:91586

Page 10: AMOD Report  February 11-17 2013


Frequent bouts of T2 transfer stalls

FTS congestion was an issue during the week, e.g. this on a Moriond task with jobs stuck in transferring: http://savannah.cern.ch/support/?135872. Tomas talked to Cedric – Rucio will use point-to-point FTS, so it shouldn't see this problem.

Page 11: AMOD Report  February 11-17 2013


Other

• Clouds were running out of assigned tasks during the week. Would be very desirable to sustain a deeper todo queue of tasks.

• More clarity in monitoring and/or documentation needed for shifters on which sites are Tier 3s and how they should be treated
  – Shifter asked: how do we identify which sites are Tier 3s?
  – Answer offered by an expert shifter (a rough heuristic; see the sketch after this list):
    • Look at the number of WNs in Panda. If it's 5-10, then it's a Tier 3
    • Look at the space tokens. If there are only ~3, like Scratch, Localgroup, ..., then it's a Tier 3. A Tier 2 is required to have more space tokens, like Datadisk etc.
  – Is there someplace that makes this clear? It would be better if it were obvious – apparent in the same monitoring that leads shifters to conclude there's a site problem, which maybe should be treated differently (low priority, if addressed at all?) if it's a Tier 3
  – A suggestion from a shifter: add to https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCoS#How_to_Submit_GGUS_Team_Tickets some words about how to handle Tier-3s
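A minimal sketch of the shifter heuristic above, purely illustrative: the thresholds and token names are the ones quoted in the discussion, while the function name, inputs and example values are hypothetical; this is not an official ADC site classification.

```python
def looks_like_tier3(num_worker_nodes, space_tokens):
    """Rough Tier-3 check following the expert-shifter heuristic quoted above.

    Illustrative only: the thresholds and token names come from the discussion,
    not from any official site classification.
    """
    # "Look at number of WN in Panda. If it's 5-10 then it's Tier 3"
    few_worker_nodes = num_worker_nodes <= 10
    # "If there are only 3 [space tokens], like Scratch, Localgroup, ... then it's
    # Tier 3. Tier 2 required to have more space tokens like Datadisk etc."
    tokens = {t.upper() for t in space_tokens}
    minimal_tokens = len(tokens) <= 3 and "DATADISK" not in tokens
    # Combine the two indicators conservatively: flag only if both point to Tier 3.
    return few_worker_nodes and minimal_tokens


# Hypothetical examples:
print(looks_like_tier3(8, ["SCRATCHDISK", "LOCALGROUPDISK"]))           # True
print(looks_like_tier3(200, ["DATADISK", "SCRATCHDISK", "PRODDISK"]))   # False
```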

Page 12: AMOD Report  February 11-17 2013


Thanks

• Thanks to all shifters and helpful experts!