London Tier2 Status
O. van der Aa
21/03/2007
Current Resource Status
• 7 GOC sites using SGE, PBS, and PBSPro
– UCL: Central, HEP
– Imperial: HEP, LeSC, ICT
– Queen Mary
– Royal Holloway
– Brunel
• Total
– CPU: 2.6 MSI2K
– Disk: 94 TB (DPM and dCache)
MoU: where are we?
[Chart: London KSI2K delivered per quarter, 1Q05 to 4Q06, by site (UCL, RHUL, QMUL, Imperial, Brunel), with the Sept 2007 CPU target marked]
For disk we are at 48% of what was promised, but the KSI2K/TB ratio is 28!
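The quoted ratio can be cross-checked against the totals on the "Current Resource Status" slide (2.6 MSI2K of CPU, 94 TB of disk); a quick sketch:

```python
# Cross-check of the KSI2K/TB ratio quoted above, using the totals
# from the "Current Resource Status" slide (2.6 MSI2K, 94 TB).
cpu_ksi2k = 2600  # 2.6 MSI2K expressed in KSI2K
disk_tb = 94      # total disk (DPM and dCache)

ratio = cpu_ksi2k / disk_tb
print(f"KSI2K/TB ratio = {ratio:.0f}")  # -> 28
```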
London CPU Load
[Chart: CPU usage in the London Grid per month, Sep 2005 to Feb 2007, ranging roughly 0–80%]
• Usage = (APEL CPU time) / (potential CPU time)
• Potential CPU time = (KSI2K online) × (hours in a month)
• Monthly potential = 1.7 MSI2K*hours
This gives a view of how well we perform with respect to CPU.
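The usage metric defined above can be sketched directly; the job figures in the example are hypothetical, not numbers from these slides.

```python
# Sketch of the usage metric defined above:
#   usage = (APEL CPU time) / (potential CPU time)
#   potential CPU time = (KSI2K online) * (hours in the month)
def cpu_usage(apel_ksi2k_hours: float, ksi2k_online: float,
              hours_in_month: int = 720) -> float:
    potential = ksi2k_online * hours_in_month
    return apel_ksi2k_hours / potential

# Hypothetical month: 1.1 million KSI2K*hours delivered with 2600 KSI2K online.
print(f"{cpu_usage(1_100_000, 2600):.0%}")  # -> 59%
```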
CPU Time per VO
[Chart: London Grid delivered MSI2K*hours per month by VO (alice, atlas, cms, lhcb, cdf, babar, dzero, biomed, other grid VOs), Sep 2005 to Feb 2007]
1) Biomed stopped in December.
2) Recovered with LHCb/CMS.
• Supporting 21 VOs helps to keep your CPU busy.
[Pie chart: London Grid delivered CPU fractions by VO: lhcb 34%, biomed 23%, atlas 19%, cms 19%, dzero 3%, alice/babar/cdf 0%, other grid VOs the remainder]
CPU Time: Site contributions
[Chart: site contributions to London Grid delivered CPU time, Sep 2005 to Feb 2007, by site: IC-LeSC, IC-HEP, UCL-HEP, UCL-CCC, QMUL, RHUL, Brunel]
What is our contribution among the UK Tier2s?
[Chart: fraction of UK Tier2 CPU time by Tier2 (SouthGrid, ScotGrid, NorthGrid, LondonT2, GridIreland), Sep 2005 to Feb 2007]
New resources online
In the last quarter, both Imperial and Brunel brought new resources online:
• 440 KSI2K and 60 TB (dCache)
• 208 KSI2K and 6 TB (DPM)
• A second 1 Gb connection is now in place.
New resources to come
• Imperial ICT shared resources: we will get 300 KSI2K out of them.
– It runs PBSPro.
– We will use the IC-HEP SE.
– What is currently there?
• One frontend with a VM running RHEL3/i386 for the CE installation.
• All RPMs installed.
– What needs to be done?
• Accounting.
• Adapt the GIP plugins.
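For context on the GIP plugin work: a GIP plugin is simply an executable that emits LDIF attribute updates for the Generic Information Provider to merge into the static site information. The sketch below is purely illustrative; the DN and values are invented, not the real ICT configuration.

```shell
#!/bin/sh
# Illustrative minimal GIP plugin: print LDIF attribute updates that the
# Generic Information Provider merges into the published CE information.
# The DN and the numbers here are placeholders, not real ICT values.
gip_plugin() {
    echo "dn: GlueCEUniqueID=ce.example.ac.uk:2119/jobmanager-pbspro-long,mds-vo-name=resource,o=grid"
    echo "GlueCEStateFreeCPUs: 42"
    echo "GlueCEStateTotalJobs: 7"
}

gip_plugin
```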
New resources to come
• RHUL new cluster
– Will be located at
– 265 kSI2K of CPU.
– 126 TB of storage.
– Remotely managed, but there will be staff on site who can reboot machines and change disks.
– The existing resources will also move there.
– UL-CC is the SJ5 PoP.
New VOs
• NGS enabled at Imperial LeSC, but:
– The test suite failed Globus submission without a queue parameter.
– This does not seem to be a problem on the SGE jobmanager side.
• Camont and Total enabled on our RB.
– The RB has difficulty coping with the CMS production load.
What do we need to improve?
• Storage is our weak point.
– "Tune" the DPM installs at all London sites.
– Start with the biggest sites (QMUL).
• Install more pools to distribute the load.
• Make sure we use the latest kernels.
• Allocate individual pools for the big VOs.
– Stress the SE using CMS merge jobs or the ATLAS equivalent.
• Cross-site support
– Becoming more and more important.
• Example: helping to get ATLAS data out of UCL.
– Almost all sites agreed to give access to others.
• But the level of access is not uniform.
• It still needs to be implemented.
• How do we handle tickets?
What to improve: better monitoring
• Every site admin has too many sources of monitoring:
– SAM, GStat, CMS Dashboard,
– GridLoad, log watches, DIRAC monitoring.
• We need to aggregate the different sources in one place.
– Nagios is a good candidate, possibly with one instance for London.
• Example
[Screenshot: number of aborted jobs over time; a full home directory caused the aborts until the problem was solved]
Conclusion
• CPU
– Monthly delivery > 1 MSI2K*hours.
– Utilization around 65%.
– Will get an additional 565 KSI2K.
• Disk
– We really need more focus here.
– Tune our DPM setups.
– Increase our disk relative to CPU (bring down the KSI2K/TB ratio).
– Test with real CMS/ATLAS jobs.
• Availability
– Cross-site support.
– Integrate the existing monitoring tools within Nagios.

16 April: LT2 workshop at Imperial to encourage non-HEP users onto the Grid.
Thanks to all of the Team
M. Aggarwal, D. Colling, A. Chamberlin, S. George, K. Georgiou, M. Green, W. Hay, P. Hobson, P. Kyberd, A. Martin, G. Mazza, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh.
Backup slide: LCG RB backlog
• Matching is too slow.
• A lot of jobs are waiting to be matched.
• What is the cure? Move to the gLite WMS?