GridPP: Executive Summary
Tony Doyle
Oversight Committee, 11 October 2007, Tony Doyle - University of Glasgow
Exec2 Summary
Grid Status:
Geographical View: GridMap
High-level View: ProjectMap
Topical View: CASTOR
Performance Monitoring
Disaster Planning
Transition Point
The Icemen Cometh
Outline
• 2007 is the third full year for the Production Grid
• More than 10,000 kSI2k and 1 Petabyte of disk storage
• The UK is the largest CPU provider on the EGEE Grid
• Total CPU used of 25 GSI2k-hours in the last year to Sept.
• The GridPP2 project has met 86% of its targets with 93% of the metrics within specification (up to 07Q2)
• The GridPP2 project has been extended by 7 months to April 2008
  – The LCG (full) Grid Service is underway
  – The aim is to continue to improve reliability and performance
• The GridPP3 proposal has been approved for 3 years through to March 2011 [total cost of £29.5m]
  – The aim is to provide a performant service to the experiments
• We anticipate a challenging period especially for the support of experiment applications running on the Grid
Exec2 Summary
• To create a UK Particle Physics Grid and the computing technologies required for the Large Hadron Collider (LHC) at CERN
• To place the UK in a leadership position in the international development of an EU Grid infrastructure
Context
• View can be geographical, high-level or topical
VO Views, cross-location
Top-level View
Geographical Views
Federation, Partner, Site, etc.
Next level of GridMaps
Large-scale Federated Grid
Services Infrastructure
Global GridMap
Application Domain GridMap
Local GridMap Local GridMap Local GridMap
Alert Corrective action effect
Views of the Grid
1. Geographical Status
http://gridmap.cern.ch/gm/
“A Leadership Position”
Resource Status
The latest availability figures are (approx. in case of Tier-1):
              Tier-1    Tier-2    Total
CPU [kSI2k]   1500      8588      10,088
Disk [TB]     750       743       1,493
Tape [TB]     >800                >800
• GridPP2 capacity targets met
• Combined effort from all Institutions
Aim: by 2008 (full year’s data taking)
- CPU ~100 MSI2k (100,000 CPUs)
- Storage ~80 PB
- Involving >100 institutes worldwide
- Build on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT)
1. Prototype went live in September 2003 in 12 countries
2. Extensively tested by the LHC experiments in September 2004
3. February 2006: 25,547 CPUs, 4398 TB storage
Status in Oct 2007: 245 sites, 40,518 CPUs, 24,135 TB storage
Grid Status
Resource Accounting
CPU resources at ~required levels (just in time delivery)
[Figure: CPU versus time; labels: 100,000 3GHz CPUs, LHC start-up]
Grid-accessible disk accounting being improved
Grid Operations Centre
2. High-Level Status
[Figure: ProjectMap, status date 30/Jun/07 (+ next 90 days).
Areas: LCG, M/S/N, LHC Apps, Non-LHC Apps, Management, External.
Elements: Design, Service Challenges, Development, Production Grid Milestones, Production Grid Metrics, Metadata, Storage, Workload, Security, InfoMon, Network, ATLAS, GANGA, LHCb, CMS, LHC Deployment, BaBar, SamGrid, Portal, UKQCD, PhenoGrid, Project Planning, Project Execution, Dissemination, Interoperability, Engagement, Knowledge Transfer.
Legend: Monitor OK, Monitor not OK, Milestone complete, Milestone overdue, Milestone due soon, Milestone not due soon, Item not Active.]
Production Grid project nearing successful completion…
Tape oriented, Disk oriented, Request oriented
Castor
3. Topical Status
Castor
Experiments
• migration to Castor is an important milestone
• weekly technical meetings set up
• deployment of separate instances of Castor 2.1.3 for ATLAS, CMS and LHCb
• The current progress, next steps and concerns of the experiments in this area are provided in the User Board report.
Tier-1
• http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Experiments_Technical_Issues
• CASTOR 2.1.3 has recently proven to be robust under test loads and early service challenge trials
• CASTOR 2.1.4 ready for deployment (“disk1” storage classes)
• Tier-1 review planned for November
• Bridging the Experiment-Grid Gap..
Availability Status
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
90% max, 80% typical
c.f. 95% T2 target, 98% T1 target
80% max, 70% typical
c.f. ~95% target
http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
Resources
Accumulated EGEE CPU Usage: 102,191,758 kSI2k-hours, or >100 GSI2k-hours (!)
Via APEL accounting
UKI: 24,788,212 kSI2k-hours
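The two accounting figures above can be put on the same scale with a short calculation. The sketch below is illustrative only: it converts kSI2k-hours to GSI2k-hours using the SI prefixes (1 GSI2k = 10^6 kSI2k) and computes the UKI fraction of the EGEE total, assuming both figures cover the same accounting period. The function and variable names are hypothetical.

```python
# Figures quoted on the slide (APEL accounting), in kSI2k-hours.
EGEE_TOTAL_KSI2K_HOURS = 102_191_758
UKI_KSI2K_HOURS = 24_788_212

# SI prefixes: giga / kilo = 10**6.
KSI2K_PER_GSI2K = 1_000_000


def to_gsi2k_hours(ksi2k_hours: float) -> float:
    """Convert kSI2k-hours to GSI2k-hours."""
    return ksi2k_hours / KSI2K_PER_GSI2K


def share(part: float, whole: float) -> float:
    """Return part as a fraction of whole."""
    return part / whole


egee_gsi2k_hours = to_gsi2k_hours(EGEE_TOTAL_KSI2K_HOURS)
uki_gsi2k_hours = to_gsi2k_hours(UKI_KSI2K_HOURS)
# Assumption: both totals refer to the same accounting period.
uki_share = share(UKI_KSI2K_HOURS, EGEE_TOTAL_KSI2K_HOURS)

print(f"EGEE total: {egee_gsi2k_hours:.1f} GSI2k-hours")
print(f"UKI:        {uki_gsi2k_hours:.1f} GSI2k-hours")
print(f"UKI share:  {uki_share:.1%}")
```

On these numbers the UKI figure is about 24.8 GSI2k-hours, consistent with the ~25 GSI2k-hours quoted in the summary, and corresponds to roughly a quarter of the accumulated EGEE usage.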
Past year’s CPU Usage
by experiment
UK Resources
Past year’s CPU Usage
by Region
UK Resources
[Figure: UK total job slots. Published job slots (0 to 9000) against date, 2004 to 2007.]
[Figure: % job slots used (0% to 120%) against date, 2004 to 2007. Series: % EGEE slots used, % UK slots used.]
Job Slots and Use
Currently ~51% which falls short of the 70% target
Tier-1
• CPU, disk and tape resources being built up according to plan
• 2008 procurement well underway
(measured by UK Tier-1 and Tier-2 for all VOs)
~90% CPU efficiency due to i/o bottlenecks
Concern that this is currently ~70% at the Tier-1
Efficiency
Each experiment needs to work to improve their system/deployment practice anticipating e.g. hanging gridftp connections during batch work
A big issue for the Tier-2s..
A bigger issue for the Tier-1..
[Figure label: target]
http://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.pdf
Intervention Policy
All UK sites are given flexibility to deal with stalled jobs (in order that their CPUs are occupied more fully overall) according to the following stalled job definition:
Any job consuming <10 minutes CPU over a given 6 hour period (efficiency < 0.027) is considered stalled
There is a recognised intervention scheme for UK sites
Stalled Jobs
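The stalled job definition can be expressed as a simple check. The sketch below is a minimal illustration, not the sites' actual intervention tooling: the `JobSample` record and function names are hypothetical, and it assumes the CPU minutes consumed in the most recent 6 hour window are already known for each job. Note that 10 minutes in 360 minutes is an efficiency of about 0.0278, which the slide quotes as 0.027.

```python
from dataclasses import dataclass

# Thresholds taken from the stalled job definition on the slide.
WINDOW_HOURS = 6
CPU_MINUTES_THRESHOLD = 10


@dataclass
class JobSample:
    """Hypothetical record: CPU minutes a job consumed in the last window."""
    job_id: str
    cpu_minutes_in_window: float


def cpu_efficiency(cpu_minutes: float, window_hours: float = WINDOW_HOURS) -> float:
    """CPU time divided by elapsed wall-clock time for the window."""
    return cpu_minutes / (window_hours * 60)


def is_stalled(sample: JobSample) -> bool:
    """A job is stalled if it used less than 10 CPU minutes in the 6 hour window."""
    return sample.cpu_minutes_in_window < CPU_MINUTES_THRESHOLD


# Illustrative data only, not real job records.
samples = [
    JobSample("job-a", 2.5),    # almost idle, e.g. a hanging transfer
    JobSample("job-b", 10.0),   # exactly at the threshold, so not stalled
    JobSample("job-c", 340.0),  # healthy CPU-bound job
]

for s in samples:
    eff = cpu_efficiency(s.cpu_minutes_in_window)
    print(f"{s.job_id}: efficiency={eff:.3f} stalled={is_stalled(s)}")
```

Comparing CPU minutes directly against the 10 minute threshold avoids rounding questions around the quoted efficiency value; the efficiency is printed only for information.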
SAM site testing
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/sam.html
• Performance over past 6 months to be used for Tier-2 hardware allocations..
The metric is to be based on SAM Test Efficiency x (CPU Delivered + Disk Available)
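As a rough illustration of how such a metric might be evaluated, the sketch below computes SAM Test Efficiency x (CPU Delivered + Disk Available) per site and turns the results into proportional shares. Several points are assumptions rather than anything stated here: that CPU and disk are already expressed in comparable, normalised units, that SAM efficiency is a fraction between 0 and 1, and that allocations would be proportional to the metric. The site names and figures are invented.

```python
def allocation_metric(sam_efficiency: float,
                      cpu_delivered: float,
                      disk_available: float) -> float:
    """SAM Test Efficiency x (CPU Delivered + Disk Available).

    Assumes sam_efficiency is a fraction in [0, 1] and that CPU and disk
    are given in comparable (normalised) units.
    """
    if not 0.0 <= sam_efficiency <= 1.0:
        raise ValueError("sam_efficiency must be between 0 and 1")
    return sam_efficiency * (cpu_delivered + disk_available)


def proportional_shares(metrics: dict[str, float]) -> dict[str, float]:
    """Hypothetical step: share an allocation in proportion to each metric."""
    total = sum(metrics.values())
    if total == 0:
        return {site: 0.0 for site in metrics}
    return {site: value / total for site, value in metrics.items()}


# Invented example sites: (SAM efficiency, CPU delivered, disk available).
sites = {
    "site-1": (0.90, 100.0, 50.0),
    "site-2": (0.60, 100.0, 50.0),
}

metrics = {name: allocation_metric(*values) for name, values in sites.items()}
shares = proportional_shares(metrics)

for name in sites:
    print(f"{name}: metric={metrics[name]:.1f} share={shares[name]:.1%}")
```

With identical resources, the site with the higher SAM test efficiency receives the larger share, which is the behaviour the metric is intended to reward.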
SAM site testing
http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/sam.html
30/05/07 27/08/07
[Figure: Overall SAM Test Success Rate. %age success (0 to 120) against date, weekly from 30/05/2007 to 26/09/2007.]
Experiments categorisation:
• Non-scalability or general failure of the Grid data transfer / placement system.
• Non-scalability or general failure of the Grid workload management system.
• Non-scalability or general failure of the metadata / bookkeeping system.
• Medium-term loss of data storage resources.
• Medium-term loss of CPU resources.
• Long-term loss of data or data storage resources.
• Long-term loss of CPU resources.
• Medium- or long-term loss of wide area network.
• Grid security incident.
• Mis-estimation of resource requirements.
Disaster Planning
Disaster Modes:
Importance of Communication:
Work in progress..
Disaster Planning
Transition Point
From UK Particle Physics perspective the Grid is the basis for computing in the 21st Century:
1. needed to utilise computing resources efficiently and securely
2. uses gLite middleware (with evolving standards for interoperation)
3. required significant investment from PPARC (STFC) – (£100m) over 10 yrs - including support from HEFCE/SFC
4. required 3 years’ prototype testbed development [GridPP1]
5. provides a working production system that has been running for three years in build-up to LHC data-taking [GridPP2]
6. enables seamless discovery of computing resources: utilised to good effect across the UK – internationally significant
7. not (yet) as efficient as end-user analysts require: ongoing work to improve performance
8. ready for LHC – just in time delivery
9. future operations-led activity as part of LCG, working with EGEE/EGI (EU) and NGS (UK) [GridPP3]
10. future challenge is to exploit this infrastructure to perform (previously impossible) physics analyses from the LHC (and ILC and Fact and..)
Proposal Writing, Proposal Defence, Proposal Approval
31st March 2006 – PPARC Call
31st October 2007? – Grants implemented
Transition Point
• Planning..
• Good things take time.. ~20 months
Implementation
Security
Network Monitoring
Information Services
Grid Data Management
Storage Interfaces
Workload Management
Transition Point
GridPP would like to thank all the middleware developers who have contributed to the establishment of the Production Grid