INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
SA1
Ian Bird
SA1 Activity Leader
CERN IT Department
EGEE Final Review 23rd – 24th May 2006
Outline
• Recommendations from intermediate focused review
• Highlights of last 3 months of the project
• Summary of SA1 achievements and open issues
[Figures: number of sites and CPUs over time; jobs per day, Jan-05 to Apr-06 (y-axis 0 to 35,000)]
SA1 Achievements
• Scale of the infrastructure
– Has grown steadily during the project
– Now slowed – expansion with related projects
• Sustained real production use of the infrastructure
– Which is supported by the operations teams
• Maturing but evolving operations procedures
– Dealing with all aspects of operations
• User support – GGUS is becoming the central coordination point, use is growing
• Middleware distribution
– Now clear how to evolve the production service
– Convergence between existing LCG-2.x and gLite-1.x
• Progress on interoperability and interoperation
– With OSG significant progress, progress with ARC
– Related projects
Recommendation 16 – i
“Plan the migration procedure of service support for gLite in full production service more clearly with precise dates and mandates for each site, and advertise to the users well in advance.”
& comment:
“Pre-production service must not take on a life of its own…”
• Early set-up of TCG
– Forum for agreeing schedules across the technical and application activities
– Schedule proposed and agreed for 2006 – see next slide
Recommendation 16 – ii
• Deliver and deploy LCG-2.7.0 end January 2006
– Bug fixes, patches, etc. accumulated since last major release in August.
Delivered on time and deployed
• Prepare gLite-3.0 for initial deployment in May 2006.
– Convergence of LCG-2.x and gLite-1.x
– Evolutionary from deployment point of view – will not be a big-bang change of production service
– Schedule driven by LCG service challenges
• Foresee second major “release” on October/November timescale
– Added functionality – driven by apps via TCG
• Quickfixes, security patches
– May be produced at any time, deployed with agreement of TCG
• Client tools
– May be updated more frequently, and can be deployed rapidly without need for major upgrades
• Other stand-alone services may be deployed centrally or at a few sites
– To demonstrate functionality or provide new facilities
– Usually need by-hand installation
Deployment schedule for 2006
In general we try to move away from big-bang releases:
• Focus on service/component upgrades where possible
• Check-point releases to consolidate changes and to provide new sites a starting point
• See this more like a Linux distribution – major releases with continual component updates, security patches, etc.
• Pre-production service – now integral part of the release process – should demonstrate new releases
• Continuous process of integration, certification, pre-production testing, and eventual deployment
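The continuous release process named above (integration, certification, pre-production, deployment) can be pictured as an ordered pipeline in which an update only moves forward after passing the current stage. The Python sketch below is illustrative only: the stage names come from the slide, while `Stage`, `advance`, `run`, and the rule that a failure sends an update back to integration are assumptions, not the project's documented procedure.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Release stages named on the slide, in order."""
    INTEGRATION = 0
    CERTIFICATION = 1
    PRE_PRODUCTION = 2
    PRODUCTION = 3


def advance(stage: Stage, passed: bool) -> Stage:
    """Move one stage forward on success.

    Assumed policy (not from the slides): a failure at any stage sends
    the update back to integration.
    """
    if not passed:
        return Stage.INTEGRATION
    if stage is Stage.PRODUCTION:
        return stage  # already deployed, nothing further to do
    return Stage(stage + 1)


def run(results: list[bool]) -> Stage:
    """Apply a sequence of pass/fail outcomes, starting at integration."""
    stage = Stage.INTEGRATION
    for passed in results:
        stage = advance(stage, passed)
    return stage
```

Three consecutive passes take an update from integration to production; a failure in certification returns it to integration, which matches the idea that pre-production is a step in the release process rather than a separate service.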
Recommendation 17 – i
“Help to establish exemplary procedures for interoperations of more divergent infrastructures and take the lead in such activities.”
• Several avenues
– Collaborative activities – security and operational policy
– Interoperability
– Interoperation / shared operation – workshops
– Other projects
• Joint collaborative activities:
– Security – JSPG, MWSG, GridPMAs
– Grid Interoperability Now (GIN) group – many projects
Very active in GGF17
Recommendation 17 – ii
Interoperability
Several initiatives at various stages
• With OSG
– Most advanced – cross job submission has been put in place for WLCG
Used in production by US-CMS for several months
– EGEE Generic Info Provider installed on OSG site (now in VDT)
Allows all sites to be seen in info system
– GStat and SFT can run on OSG sites
– EGEE clients installed on OSG-LCG sites
– Inversely – EGEE sites can run OSG jobs
– All use SRM SEs;
– File catalogues are application choice – LFC widely used
Interoperability – cont.
• With ARC/NorduGrid
– Strategies:
1. Agree standard interfaces at site level & evolve services for these interfaces
2. Present these interfaces at Grid boundary portal to forward and translate
3. Deploy EGEE and ARC CE in parallel (large sites for LCG)
– 1 is long-term goal; 2 is medium-term solution
– Several workshops to follow progress
– Work on information system (GLUE)
– EGEE → ARC submission works
• With NAREGI
– First workshop in March
– Several joint activities agreed; work just starting
Information system translators (GLUE ↔ CIM)
Data management tools – NAREGI will test EGEE LFC, FTS, DPM
Job management: JDL ↔ JSDL etc.
Security
Recommendation 17 – iii
Operations (Interoperations)
• Joint operations:
– WLCG is a strong driver – bring together EGEE and OSG grid operations
– Extend ROC concept
Structures for routing tickets – prototype to be demonstrated in June
Use of GOC-DB for OSG sites
OSG sites join weekly operations meeting
Run SFTs on LCG production sites in OSG
Agreed ops VO for joint operations
– Accounting – for LCG – use GGF usage record
• Related projects
– EUMedGrid, BalticGrid, EELA, EUChinaGrid, SEE-Grid: implement EGEE operational concepts and procedures
• Operations workshops
– Explicitly joint with OSG, ensure related projects attend
Recommendation 17 – iv
• Future:
– Shared operations will be a reality – required for LCG
EGEE, OSG, ARC, NAREGI
– EGEE-II
Explicit tasks on interoperability: ARC and UNICORE
– Expectation is for coexisting campus, local, regional, national, international grid infrastructure
Coexistence, interoperability, interoperations, common policies will be a way of life
– Long term sustainable infrastructure after EGEE-II will be built on this work
Recommendation 18
“Move away from present primary dependence on particular flavours of both processors and Linux and provide support for more heterogeneous resources, including supercomputers, to allow increased collaborative adoption at major computing centres.”
• Current porting status:
– Several ports to other architectures: IA64, several Linux flavours. Available a few months after main release
– Done by partners; outside of main build and integration system
• Future:
– Important to have several important ports close to or part of main integration and testing
– Include 64-bit cleanliness as part of build test – will flag as failure
– Move to ETICS to provide distributed build system to support many platforms; helps tie porting partners into central process
Partner interested in a particular port can provide build and test hardware and ETICS can help integrate this into the process
– TCG should agree a reasonable/realistic set of standard primary platforms to be provided as part of base release
E.g. SL4 + Debian on 32 and 64 bit
Other ports can be asynchronous and should be certified by partners providing resources
– Supercomputers – should be supported by ports to relevant OS, MPI
Collaboration with DEISA in EGEE-II
SA1 Highlights
EGEE Grid Sites: Q1 2006
EGEE: > 180 sites, 40 countries; > 24,000 processors; ~ 5 PB storage
[Figure: number of sites and CPUs over time]
EGEE: Steady growth over the lifetime of the project
A global, federated e-Infrastructure
EGEE infrastructure
~ 200 sites in 39 countries
~ 20 000 CPUs
> 5 PB storage
> 20 000 concurrent jobs per day
> 60 Virtual Organisations
[Map labels: EUIndiaGrid, EUMedGrid, SEE-GRID, EELA, BalticGrid, EUChinaGrid, OSG, NAREGI]
Related projects & collaborations are where the future expansion of resources will come from
Project        Anticipated resources (initial estimates)
Related infrastructure projects:
SEE-grid       6 countries, 17 sites, 150 cpu
EELA           5 countries, 8 sites, 300 cpu
EUMedGrid      6 countries
BalticGrid     3 countries, few × 100 cpu
EUChinaGrid    TBC
Collaborations:
OSG            30 sites, 10000 cpu
ARC            15 sites, 5000 cpu
DEISA          Supercomputing resources
Use of the infrastructure
[Figure: number of jobs per day, total and non-LCG, Jan-05 to Apr-06 (y-axis 0 to 35,000)]
[Figure: CPU in cpu-years/month, Jun-05 to Apr-06 (y-axis 0 to 300)]
[Figure: CPU time delivered in SI2K-hours/month, Jun-05 to Apr-06 (y-axis 0 to 3,000,000); legend: lhcb, geant4, cms, biomed, atlas, alice]
Sustained & regular workloads of >30K jobs/day
• spread across full infrastructure
• doubling/tripling in last 6 months – no effect on operations
Use of the infrastructure
Massive data transfers > 1.5 GB/s
• Several applications now depend on EGEE as their primary computing resource
Sustainability:
• Usage can (and does) grow without need for additional operational effort
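To put the quoted rates in perspective, the short Python calculation below converts a transfer rate into a daily volume and a daily job count into a per-second rate. It is a back-of-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and, hypothetically, that the 1.5 GB/s rate were held for a full 24 hours, which the slides do not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_gb_per_s: float) -> float:
    """Data moved in one day at a constant rate, in TB (1 TB = 1000 GB)."""
    return rate_gb_per_s * SECONDS_PER_DAY / 1000.0


def jobs_per_second(jobs_per_day: float) -> float:
    """Average job arrival rate implied by a daily job count."""
    return jobs_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{daily_volume_tb(1.5):.1f} TB/day at 1.5 GB/s")
    print(f"{jobs_per_second(30_000):.3f} jobs/s at 30K jobs/day")
```

Under these assumptions, 1.5 GB/s corresponds to roughly 130 TB per day, and 30K jobs/day to about one new job every three seconds on average.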
EGEE Operations Process
• Grid operator on duty
– 6 teams working in weekly rotation
CERN, IN2P3, INFN, UK/I, Ru, Taipei
– Crucial in improving site stability and management
– Expanding to all ROCs in EGEE-II
• Operations coordination
– Weekly operations meetings
– Regular ROC managers meetings
– Series of EGEE Operations Workshops
Nov 04, May 05, Sep 05, June 06
• Geographically distributed responsibility for operations:
– There is no “central” operation
– Tools are developed/hosted at different sites:
GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
• Procedures described in Operations Manual
– Introducing new sites
– Site downtime scheduling
– Suspending a site
– Escalation procedures
– etc.
Highlights:
• Distributed operation
• Evolving and maturing procedures
• Procedures being introduced into and shared with the related infrastructure projects
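The weekly operator-on-duty rotation can be sketched as a simple modular schedule. The example below is a minimal illustration in Python: the six team names are taken from the slide, but the order of the rotation, the start date, and the `team_on_duty` function are assumptions made for the example.

```python
from datetime import date

# Team names from the slide; the ordering here is an assumption.
TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Ru", "Taipei"]

# Hypothetical Monday on which the first team in the list starts its shift.
ROTATION_START = date(2006, 1, 2)


def team_on_duty(day: date, start: date = ROTATION_START) -> str:
    """Return the team covering the week (Monday to Sunday) containing `day`."""
    weeks_elapsed = (day - start).days // 7
    return TEAMS[weeks_elapsed % len(TEAMS)]
```

With six teams, each team is on duty one week in six. Expanding the rotation to all ROCs, as planned for EGEE-II, would in this sketch only mean extending the `TEAMS` list.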
Site Functional Tests
• Site Functional Tests (SFT)
– Framework to test (sample) services at all sites
– Shows results matrix
– Detailed test log available for troubleshooting and debugging
– History of individual tests is kept
– Can include VO-specific tests (e.g. sw environment)
– Normally >80% of sites pass SFTs
NB: of 180 sites, some are not well managed
Very important in stabilising sites:
• Apps use only good sites
• Bad sites are automatically excluded
• Sites work hard to fix problems
Extending to service availability:
• measure availability by service, site, VO
• each service has associated service class defining required availability (Critical, highly available, etc.)
First approach to SLA
Use to generate alarms
• generate trouble tickets
• call out support staff
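The step from test results to availability and alarms can be sketched as follows. This Python example is an assumption-laden illustration, not the SFT implementation: `SftResult`, `availability`, `needs_alarm`, and `good_sites` are hypothetical names, and the numeric targets per service class are invented, since the slide names the classes but gives no thresholds.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical targets: the slide names service classes but gives no numbers.
REQUIRED_AVAILABILITY = {"critical": 0.99, "highly_available": 0.95}


@dataclass(frozen=True)
class SftResult:
    """One test outcome for one service at one site (illustrative record)."""
    site: str
    service: str
    passed: bool


def availability(results: list[SftResult], site: str, service: str) -> Optional[float]:
    """Fraction of passed tests for a service at a site, or None if untested."""
    relevant = [r for r in results if r.site == site and r.service == service]
    if not relevant:
        return None
    return sum(r.passed for r in relevant) / len(relevant)


def needs_alarm(results: list[SftResult], site: str, service: str,
                service_class: str) -> bool:
    """True when measured availability is below the target for the class."""
    measured = availability(results, site, service)
    return measured is not None and measured < REQUIRED_AVAILABILITY[service_class]


def good_sites(results: list[SftResult]) -> set[str]:
    """Sites with no failed test in the sample; others would be excluded."""
    all_sites = {r.site for r in results}
    failed_sites = {r.site for r in results if not r.passed}
    return all_sites - failed_sites
```

In this sketch an alarm is only a boolean; in line with the slide, it would be the trigger for generating a trouble ticket or calling out support staff.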
Middleware Distributions and Stacks
• Terminology:
– EGEE deploys a middleware distribution
Drawn from various middleware products, stacks, etc.
Do not confuse the distribution with development projects or with software packages
Count on 6 months from software developer “release” to production deployment
– The EGEE distribution:
Current production version labelled: LCG-2.7.0
New production version labelled: gLite-3.0
Name change to hopefully reduce confusion
• EGEE distribution contents:
LCG-2.7.0:
– VDT – packaging Globus 2.4, Condor, MyProxy
– EDG workload management
– LCG components:
BDII (info sys), catalogue (LFC), DPM, data management libraries and CLI tools, monitoring tools
– gLite: R-GMA, VOMS, FTS
gLite-3.0:
– Based on LCG-2.7.0, and
– gLite workload management
– Other gLite components (not in the distribution but provided as services):
AMGA, Hydra, Fireman, gLite-IO
[Diagram label: evolution, LCG-2.7.0 → gLite-3.0]
Process to deployment
[Diagram labels: Middleware providers (VDT/OSG, OMII-Europe, JRA1, …); Integration; Testing & Certification; SA3; Pre-production service; Production service; SA1; Support, analysis, debugging; Certification activities SA3+SA1]
The Support Model
“Regional Support with Central Coordination”
The ROCs, VOs and other project-wide groups such as the middleware groups (JRA), network groups (NA), service groups (SA) are connected via a central integration platform provided by GGUS.
[Diagram labels: Central Application (GGUS); Web portal; Interface; TPM; Deployment Support; Middleware Support; Network Support; Operations Support; VO Support; ROC 1 … ROC 10; Regional Support units; User Support units; Technical Support units]
• GGUS is now being used for all problem reporting:
• Operational, deployment and user support
• VOs are using it for their support system
• The use is growing steadily
[Figure: Tickets per month (amount, y-axis 0 to 500), January through April of the following year]
Security & Policy
Collaborative policy development
– Many policy aspects are collaborative works; e.g.:
• Joint Security Policy Group
• Certification Authorities
– EUGridPMA, IGTF, etc.
• Grid Acceptable Use Policy (AUP)
– common, general and simple AUP
– for all VO members using many Grid infrastructures
EGEE, OSG, SEE-GRID, DEISA, national Grids…
• Incident Handling and Response
– defines basic communications paths
– defines requirements (MUSTs) for IR
– not to replace or interfere with local response plans
[Diagram labels: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security]
SA1 goals for EGEE-II
Key goal:
– We have a large running production infrastructure; but EGEE-II MUST take what we have now and make it:
• Reliable
– Middleware components fail, error reporting is missing, …
– There is an application responsibility here too – needs effort
– … but! The service has been running non-stop for > 2 years
• Robust
– Must continue to address service aspects – move away from prototypes
• Usable
– It is still hard to use for many users; still too slow to introduce new VOs
• Acceptable
– It must be easy to deploy in a wide variety of environments and coexist with other grid infrastructures
• Sustainable
– The infrastructure must become sustainable for the long term
SA1 Outlook
• LHC VOs must achieve reliable production and analysis in 2006
– Will be making significant use of resources
– Applications must bring resources, show commitment
• Consolidate and improve existing services: Focus on
– Reliability, robustness, manageability, performance, scalability, etc.
– Evolution or replacement of services driven by needs of application (or operations/security/manageability)
TCG has key role here
• Expand grid operations
– Spread expertise to ROCs
– Collaboration with OSG, A-P, etc. and related projects
– Start to negotiate SLAs
– Sustainability: processes evolving, spread of expertise and tasks
– Resource sharing and negotiation – must become streamlined
Will need a mechanism for cost/credit for use of resources
Summary
• SA1 has built a large production grid infrastructure
• In constant and extensive daily production use
– Several applications depend on it for resources
• Tools and processes are maturing and evolving
• Security and usage policies also evolving
• We have a basic set of middleware that addresses most requirements
• Production middleware is now converged: LCG-2 + gLite → gLite 3
• EGEE-II will focus on making this sustainable and really usable