London Tier2 Status
O. van der Aa
21/03/2007
Current Resource Status
• 7 GOC sites using SGE, PBS, and PBSPro
– UCL: Central, HEP
– Imperial: HEP, LeSC, ICT
– Queen Mary
– Royal Holloway
– Brunel
• Total
– CPU: 2.6 MSI2K
– Disk: 94 TB (DPM and dCache)
MoU: where are we?
[Chart: London KSI2K delivered per quarter, 1Q05 to 4Q06, by site (UCL, RHUL, QMUL, Imperial, Brunel), with the Sept 2007 CPU target marked]
For disk we are at 48% of what was promised, but the KSI2K/TB ratio is 28!
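The quoted ratio can be cross-checked against the totals on the "Current Resource Status" slide (2.6 MSI2K of CPU, 94 TB of disk); a quick sketch:

```python
# Cross-check of the KSI2K/TB ratio quoted above, using the totals
# from the "Current Resource Status" slide (2.6 MSI2K, 94 TB).
cpu_ksi2k = 2600  # 2.6 MSI2K expressed in KSI2K
disk_tb = 94      # total disk (DPM and dCache)

ratio = cpu_ksi2k / disk_tb
print(f"KSI2K/TB ratio = {ratio:.0f}")  # -> 28
```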
London CPU Load
[Chart: CPU usage in the London Grid per month, Sep 2005 to Feb 2007, ranging roughly 0–80%]
• Usage = (APEL CPU time) / (potential CPU time)
• Potential CPU time = (KSI2K online) × (hours in a month)
• Monthly potential = 1.7 MSI2K*hours
This gives a view of how well we perform with respect to CPU.
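The usage metric defined above can be sketched directly; the job figures in the example are hypothetical, not numbers from these slides.

```python
# Sketch of the usage metric defined above:
#   usage = (APEL CPU time) / (potential CPU time)
#   potential CPU time = (KSI2K online) * (hours in the month)
def cpu_usage(apel_ksi2k_hours: float, ksi2k_online: float,
              hours_in_month: int = 720) -> float:
    potential = ksi2k_online * hours_in_month
    return apel_ksi2k_hours / potential

# Hypothetical month: 1.1 million KSI2K*hours delivered with 2600 KSI2K online.
print(f"{cpu_usage(1_100_000, 2600):.0%}")  # -> 59%
```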
CPU Time per VO
[Chart: London Grid delivered MSI2K*hours per month by VO (alice, atlas, cms, lhcb, cdf, babar, dzero, biomed, other grid VOs), Sep 2005 to Feb 2007]
1) Biomed stopped in December.
2) Recovered with LHCb/CMS.
• Supporting 21 VOs helps to keep your CPU busy.
[Pie chart: London Grid delivered CPU fractions by VO: lhcb 34%, biomed 23%, atlas 19%, cms 19%, dzero 3%, alice/babar/cdf 0%, other grid VOs the remainder]
CPU Time: Site contributions
[Chart: site contributions to London Grid delivered CPU time, Sep 2005 to Feb 2007, by site: IC-LeSC, IC-HEP, UCL-HEP, UCL-CCC, QMUL, RHUL, Brunel]
What is our contribution among the UK Tier2s?
[Chart: fraction of UK Tier2 CPU time by Tier2 (SouthGrid, ScotGrid, NorthGrid, LondonT2, GridIreland), Sep 2005 to Feb 2007]
New resources online
In the last quarter, both Imperial and Brunel brought new resources online:
• 440 KSI2K and 60 TB (dCache)
• 208 KSI2K and 6 TB (DPM)
• A second 1 Gb connection is now in place.
New resources to come
• Imperial ICT shared resources: we will get 300 KSI2K out of them.
– It runs PBSPro.
– We will use the IC-HEP SE.
– What is currently there?
• One frontend with a VM running RHEL3/i386 for the CE installation.
• All RPMs installed.
– What needs to be done?
• Accounting.
• Adapt the GIP plugins.
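For context on the GIP plugin work: a GIP plugin is simply an executable that emits LDIF attribute updates for the Generic Information Provider to merge into the static site information. The sketch below is purely illustrative; the DN and values are invented, not the real ICT configuration.

```shell
#!/bin/sh
# Illustrative minimal GIP plugin: print LDIF attribute updates that the
# Generic Information Provider merges into the published CE information.
# The DN and the numbers here are placeholders, not real ICT values.
gip_plugin() {
    echo "dn: GlueCEUniqueID=ce.example.ac.uk:2119/jobmanager-pbspro-long,mds-vo-name=resource,o=grid"
    echo "GlueCEStateFreeCPUs: 42"
    echo "GlueCEStateTotalJobs: 7"
}

gip_plugin
```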
New resources to come
• RHUL new cluster
– Will be located at
– 265 kSI2K of CPU.
– 126 TB of storage.
– Remotely managed, but there will be staff on site who can reboot machines and change disks.
– The existing resources will also move there.
– UL-CC is the SJ5 PoP.
New VOs
• NGS enabled at Imperial LeSC, but:
– The test suite failed Globus submission without a queue parameter.
– This does not seem to be a problem on the SGE jobmanager side.
• Camont and Total enabled on our RB.
– The RB has difficulty coping with the CMS production load.
What do we need to improve?
• Storage is our weak point.
– "Tune" the DPM installs at all London sites.
– Start with the biggest sites (QMUL).
• Install more pools to distribute the load.
• Make sure we use the latest kernels.
• Allocate individual pools for the big VOs.
– Stress the SE using CMS merge jobs or the ATLAS equivalent.
• Cross-site support
– Becoming more and more important.
• Example: helping to get ATLAS data out of UCL.
– Almost all sites agreed to give access to others.
• But the level of access is not uniform.
• It still needs to be implemented.
• How do we handle tickets?
What to improve: better monitoring
• Every site admin has too many sources of monitoring:
– SAM, GStat, CMS Dashboard,
– GridLoad, log watches, DIRAC monitoring.
• We need to aggregate the different sources in one place.
– Nagios is a good candidate, possibly with one instance for London.
• Example
[Screenshot: number of aborted jobs over time; a full home directory caused the aborts until the problem was solved]
Conclusion
• CPU
– Monthly delivery > 1 MSI2K*hours.
– Utilization around 65%.
– Will get an additional 565 KSI2K.
• Disk
– We really need more focus here.
– Tune our DPM setups.
– Increase our disk relative to CPU (bring down the KSI2K/TB ratio).
– Test with real CMS/ATLAS jobs.
• Availability
– Cross-site support.
– Integrate the existing monitoring tools within Nagios.

16 April: LT2 workshop at Imperial to encourage non-HEP users onto the Grid.
Thanks to all of the Team
M. Aggarwal, D. Colling, A. Chamberlin, S. George, K. Georgiou, M. Green, W. Hay, P. Hobson, P. Kyberd, A. Martin, G. Mazza, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh.
Backup slide: LCG RB backlog
• Matching is too slow.
• A lot of jobs are waiting to be matched.
• What is the cure? Move to the gLite WMS?