ATLAS DC2 seen from Prague Tier2 center - some remarks

ATLAS software workshop

September 2004

Hardware in Prague available for ATLAS

• Golias:
  • 32 dual-CPU nodes, PIII 1.13 GHz, 1 GB RAM
  • upgraded since July: + 49 dual-CPU nodes, Xeon 3.06 GHz, 2 GB RAM (WN)
  • 3 TB disk space reserved for ATLAS
  • PBSPro batch system
  • lcgatlasprod queue reserved for ATLAS VO members, high priority

• Skurut:
  • 16 dual-CPU nodes, PIII 700 MHz, 1 GB RAM
  • OpenPBS batch system
  • queues: lcgpbs-short, long, infinite, used mainly by ATLAS

• 2 independent CEs in LCG2

Jobs waiting for input or output replication, sometimes hanging ‘forever’. Example:

Job Id        Queue         User      Node       CPUTime   WallTime
34031.golias  lcgatlasprod  atlas001  golias30   03:09:28  43:30:39
34035.golias  lcgatlasprod  atlas002  golias03   04:17:38  43:19:18
34113.golias  lcgatlasprod  atlas002  golias10   03:00:41  41:52:11
34127.golias  lcgatlasprod  atlas001  golias11   04:19:11  41:21:46
34583.golias  lcgatlasprod  atlassgm  goliasx56  00:00:17  26:01:14
...

Not yet cured: running jobs, 20.9.2004:

Job Id        Queue         User      Node       CPUTime   WallTime
55162.golias  lcgatlasprod  atlassgm  goliasx42  00:00:03  102:19:45
58528.golias  lcgatlasprod  atlas001  golias02   11:22:40  11:33:13
58529.golias  lcgatlasprod  atlas001  golias03   00:00:16  11:33:49
...

Usually such long jobs are killed either by the administrator or by the PBS time limit.
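A minimal sketch of how such hanging jobs could be spotted automatically, assuming a listing in the six-column layout shown above. The function names and both thresholds (`min_wall_hours`, `max_ratio`) are illustrative assumptions, not part of the site's tooling; the sample rows are the 20.9.2004 jobs from the slide.

```python
def to_seconds(hms: str) -> int:
    """Convert 'H:MM:SS' (hours may exceed 99) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def stalled_jobs(listing: str, min_wall_hours: float = 1.0, max_ratio: float = 0.25):
    """Return job ids whose CPU time is a small fraction of their wall time.

    Both thresholds are hypothetical and would need tuning per site.
    """
    flagged = []
    for line in listing.strip().splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skips the header row and '...' continuation lines
        job_id, _queue, _user, _node, cpu, wall = fields
        try:
            cpu_s, wall_s = to_seconds(cpu), to_seconds(wall)
        except ValueError:
            continue  # not a time column, ignore the row
        if wall_s >= min_wall_hours * 3600 and cpu_s / wall_s < max_ratio:
            flagged.append(job_id)
    return flagged


# Rows copied from the 20.9.2004 listing.
LISTING = """
Job Id Queue User Node CPUTime WallTime
55162.golias lcgatlasprod atlassgm goliasx42 00:00:03 102:19:45
58528.golias lcgatlasprod atlas001 golias02 11:22:40 11:33:13
58529.golias lcgatlasprod atlas001 golias03 00:00:16 11:33:49
...
"""

if __name__ == "__main__":
    for job in stalled_jobs(LISTING):
        print("possibly hanging:", job)
```

A low CPU-to-wall ratio only suggests that a job is waiting; a job that has just started staging its input would look the same, which is why the sketch also requires a minimum wall time.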

July 1 – September 21

GOLIAS            jobs   CPU (days)   Elapsed (days)
all               4811   1653         1992
long (cpu>100s)   2377   1653         1881
short             2434   0.4          111

SKURUT            jobs   CPU (days)   Elapsed (days)
all               1446   1507         1591
long (cpu>100s)   870    1507         1554
short             576    0.2          37

number of jobs in DQ: 1349 done + 1231 failed = 2580 jobs

number of jobs in DQ: 362 done + 572 failed = 934 jobs
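One way to read the job tables is as the ratio of CPU days to elapsed days per cluster. The sketch below computes that ratio from the "all" rows and checks that the long and short job counts add up to the total. The ratio is not a figure given on the slide, and the names `STATS`, `cpu_efficiency` and `summarize` are illustrative assumptions.

```python
# Figures copied from the tables: (jobs, CPU days, elapsed days).
STATS = {
    "GOLIAS": {
        "all": (4811, 1653, 1992),
        "long": (2377, 1653, 1881),
        "short": (2434, 0.4, 111),
    },
    "SKURUT": {
        "all": (1446, 1507, 1591),
        "long": (870, 1507, 1554),
        "short": (576, 0.2, 37),
    },
}


def cpu_efficiency(cpu_days: float, elapsed_days: float) -> float:
    """Fraction of elapsed time that was spent on the CPU."""
    return cpu_days / elapsed_days


def summarize(stats: dict) -> dict:
    """Return the overall CPU/elapsed ratio per cluster, rounded to 3 digits."""
    summary = {}
    for cluster, rows in stats.items():
        jobs_all, cpu_all, elapsed_all = rows["all"]
        # Consistency check: long + short job counts must equal the total.
        if rows["long"][0] + rows["short"][0] != jobs_all:
            raise ValueError(f"job counts do not add up for {cluster}")
        summary[cluster] = round(cpu_efficiency(cpu_all, elapsed_all), 3)
    return summary


if __name__ == "__main__":
    for cluster, ratio in summarize(STATS).items():
        print(f"{cluster}: CPU/elapsed = {ratio}")
```

Short jobs (cpu<=100s) contribute almost no CPU time but a noticeable share of elapsed time, so they pull the overall ratio down.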

Job distribution

• almost always not enough jobs on GOLIAS

• SKURUT usage much better

Memory usage

ATLAS jobs on GOLIAS, July – September (part) 2004

CPU Time

[Figure: three panels, PIII 1.13 GHz, Xeon 3.06 GHz and PIII 700 MHz; axis in hours]

queue limit: 48 hours, later changed to 72 hours

Miscellaneous

• no job name in the local batch system – difficult to identify

• no (?) documentation where to look for log files, which logs are relevant

• lost jobs due to CPU time limit - no warning

• lost jobs due to one misconfigured node - spotted from local logs and by Simone too

• some jobs loop forever - where to send this information?