
INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

SA1

Ian Bird

SA1 Activity Leader

CERN IT Department

EGEE Final Review 23rd – 24th May 2006


Outline

• Recommendations from intermediate focused review

• Highlights of last 3 months of the project

• Summary of SA1 achievements and open issues

[Figures: number of sites and CPUs; jobs per day, Jan-05 to Apr-06]


SA1 Achievements

• Scale of the infrastructure
– Has grown steadily during the project
– Now slowed – expansion with related projects

• Sustained real production use of the infrastructure
– Which is supported by the operations teams

• Maturing but evolving operations procedures
– Dealing with all aspects of operations

• User support
– GGUS is becoming the central coordination point, use is growing

• Middleware distribution
– Now clear how to evolve the production service
– Convergence between existing LCG-2.x and gLite-1.x

• Progress on interoperability and interoperation
– With OSG significant progress, progress with ARC
– Related projects


Recommendation 16 – i

“Plan the migration procedure of service support for gLite in full production service more clearly with precise dates and mandates for each site, and advertise to the users well in advance.”

& comment:

“Pre-production service must not take on a life of its own…”

• Early set up of TCG
– Forum for agreeing schedules across the technical and application activities
– Schedule proposed and agreed for 2006 – see next slide


Recommendation 16 – ii

• Deliver and deploy LCG-2.7.0 end January 2006

– Bug fixes, patches, etc. accumulated since last major release in August.

Delivered on time and deployed

• Prepare gLite-3.0 for initial deployment in May 2006.

– Convergence of LCG-2.x and gLite-1.x

– Evolutionary from deployment point of view – will not be a big-bang change of production service

– Schedule driven by LCG service challenges

• Foresee second major “release” on October/November timescale

– Added functionality – driven by apps via TCG

• Quickfixes, security patches
– May be produced at any time, deployed with agreement of TCG

• Client tools
– May be updated more frequently, and can be deployed rapidly without need for major upgrades

• Other stand-alone services may be deployed centrally or at a few sites

– To demonstrate functionality or provide new facilities

– Usually need by-hand installation

Deployment schedule for 2006

In general we try to move away from big-bang releases:
• Focus on service/component upgrades where possible
• Check-point releases to consolidate changes and to provide new sites a starting point
• See this more like a Linux distribution – major releases with continual component updates, security patches, etc.
• Pre-production service – now integral part of the release process – should demonstrate new releases
• Continuous process of integration, certification, pre-production testing and eventual deployment


Recommendation 17 – i

“Help to establish exemplary procedures for interoperations of more divergent infrastructures and take the lead in such activities.”

• Several avenues
– Collaborative activities – security and operational policy
– Interoperability
– Interoperation / shared operation – workshops
– Other projects

• Joint collaborative activities:
– Security – JSPG, MWSG, GridPMAs
– Grid Interoperability Now (GIN) group – many projects
  Very active in GGF17


Recommendation 17 – ii

Interoperability

Several initiatives at various stages

• With OSG
– Most advanced – cross job submission has been put in place for WLCG
  Used in production by US-CMS for several months
– EGEE Generic Info Provider installed on OSG site (now in VDT)
  Allows all sites to be seen in info system
– GStat and SFT can run on OSG sites
– EGEE clients installed on OSG-LCG sites
– Inversely – EGEE sites can run OSG jobs
– All use SRM SEs
– File catalogues are application choice – LFC widely used


Interoperability – cont.

• With ARC/NorduGrid
– Strategies:
  1. Agree standard interfaces at site level & evolve services for these interfaces
  2. Present these interfaces at Grid boundary portal to forward and translate
  3. Deploy EGEE and ARC CE in parallel
     • Large sites for LCG
– 1 is long-term goal; 2 is medium term solution
– Several workshops to follow progress
  • Work on information system (GLUE)
  • EGEE → ARC submission works

• With NAREGI
– First workshop in March
– Several joint activities agreed; work just starting
  Information system translators (GLUE ↔ CIM)
  Data management tools – NAREGI will test EGEE LFC, FTS, DPM
  Job management JDL ↔ JSDL etc.
  Security


Recommendation 17 – iii

Operations (Interoperations)

• Joint operations:
– WLCG is a strong driver – bring together EGEE and OSG grid operations
– Extend ROC concept
  Structures for routing tickets – prototype to be demonstrated in June
  Use of GOC-DB for OSG sites
  OSG sites join weekly operations meeting
  Run SFTs on LCG production sites in OSG
  Agreed ops VO for joint operations
– Accounting – for LCG – use GGF usage record

• Related projects
– EUMedGrid, BalticGrid, EELA, EUChinaGrid, SEE-Grid:
– implement EGEE operational concepts and procedures

• Operations workshops
– Explicitly joint with OSG, ensure related projects attend


Recommendation 17 – iv

• Future:
– Shared operations will be a reality – required for LCG
  EGEE, OSG, ARC, NAREGI
– EGEE-II
  Explicit tasks on interoperability
  ARC and UNICORE
– Expectation is for coexisting campus, local, regional, national, international grid infrastructure
  Coexistence, interoperability, interoperations, common policies will be a way of life
– Long term sustainable infrastructure after EGEE-II will be built on this work


Recommendation 18

“Move away from present primary dependence on particular flavours of both processors and Linux and provide support for more heterogeneous resources, including supercomputers, to allow increased collaborative adoption at major computing centres.”

• Current porting status:
– Several ports to other architectures: IA64, several Linux flavours. Available a few months after main release
– Done by partners; outside of main build and integration system

• Future:
– Important to have several important ports close to or part of main integration and testing
– Include 64-bit cleanliness as part of build test – will flag as failure
– Move to ETICS to provide distributed build system to support many platforms; helps tie porting partners into central process
  Partner interested in a particular port can provide build and test hardware and ETICS can help integrate this into the process
– TCG should agree a reasonable/realistic set of standard primary platforms to be provided as part of base release
  E.g. SL4 + Debian on 32 and 64 bit
  Other ports can be asynchronous and should be certified by partners providing resources
– Supercomputers – should be supported by ports to relevant OS, MPI
  Collaboration with DEISA in EGEE-II


SA1 Highlights


EGEE Grid Sites: Q1 2006

EGEE: > 180 sites, 40 countries; > 24,000 processors; ~ 5 PB storage

[Figure: number of sites and CPUs]

EGEE: Steady growth over the lifetime of the project


A global, federated e-Infrastructure

EGEE infrastructure
~ 200 sites in 39 countries
~ 20 000 CPUs
> 5 PB storage
> 20 000 concurrent jobs per day
> 60 Virtual Organisations

[Map labels: EUIndiaGrid, EUMedGrid, SEE-GRID, EELA, BalticGrid, EUChinaGrid, OSG, NAREGI]

Related projects & collaborations are where the future expansion of resources will come from

Project          Anticipated resources (initial estimates)

Related infrastructure projects
  SEE-GRID       6 countries, 17 sites, 150 CPUs
  EELA           5 countries, 8 sites, 300 CPUs
  EUMedGrid      6 countries
  BalticGrid     3 countries, a few × 100 CPUs
  EUChinaGrid    TBC

Collaborations
  OSG            30 sites, 10000 CPUs
  ARC            15 sites, 5000 CPUs
  DEISA          Supercomputing resources


Use of the infrastructure

[Chart: number of jobs per day, total and non-LCG, Jan-05 to Apr-06]

[Chart: CPU delivered in cpu-years/month, Jun-05 to Apr-06]

[Chart: CPU time delivered in SI2K-hours/month, Jun-05 to Apr-06, by VO: lhcb, geant4, cms, biomed, atlas, alice]

Sustained & regular workloads of >30K jobs/day
• spread across full infrastructure
• doubling/tripling in last 6 months – no effect on operations


Use of the infrastructure

Massive data transfers > 1.5 GB/s

• Several applications now depend on EGEE as their primary computing resource

Sustainability:
• Usage can (and does) grow without need for additional operational effort


EGEE Operations Process

• Grid operator on duty
– 6 teams working in weekly rotation: CERN, IN2P3, INFN, UK/I, Ru, Taipei
– Crucial in improving site stability and management
– Expanding to all ROCs in EGEE-II

• Operations coordination
– Weekly operations meetings
– Regular ROC managers meetings
– Series of EGEE Operations Workshops: Nov 04, May 05, Sep 05, June 06

• Geographically distributed responsibility for operations:
– There is no “central” operation
– Tools are developed/hosted at different sites: GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)

• Procedures described in Operations Manual
– Introducing new sites
– Site downtime scheduling
– Suspending a site
– Escalation procedures
– etc.

Highlights:
• Distributed operation
• Evolving and maturing procedures
• Procedures being introduced into and shared with the related infrastructure projects
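As an illustration of the weekly operator-on-duty rotation, here is a minimal Python sketch. The team names come from the slide; the rotation order, the start date and the function name `team_on_duty` are assumptions made for the example, not details of the actual EGEE schedule.

```python
from datetime import date, timedelta

# Team names are taken from the slide; their order here is an assumption.
TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Ru", "Taipei"]


def team_on_duty(day, rotation_start, teams=TEAMS):
    """Return the team on duty for the week containing `day`.

    Weeks are counted in 7-day blocks from `rotation_start`, and the
    teams take turns in round-robin order.
    """
    weeks_elapsed = (day - rotation_start).days // 7
    return teams[weeks_elapsed % len(teams)]


# Hypothetical start of the rotation (a Monday).
start = date(2006, 1, 2)
for week in range(7):
    monday = start + timedelta(weeks=week)
    print(monday.isoformat(), team_on_duty(monday, start))
```

With six teams, each team is on duty one week in six, and the seventh week returns to the first team.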


Site Functional Tests

• Site Functional Tests (SFT)
– Framework to test (sample) services at all sites
– Shows results matrix
– Detailed test log available for troubleshooting and debugging
– History of individual tests is kept
– Can include VO-specific tests (e.g. sw environment)
– Normally >80% of sites pass SFTs
  NB of 180 sites, some are not well managed

• Very important in stabilising sites:
– Apps use only good sites
– Bad sites are automatically excluded
– Sites work hard to fix problems

Extending to service availability:
• measure availability by service, site, VO
• each service has associated service class defining required availability (Critical, highly available, etc.)
• First approach to SLA

Use to generate alarms
• generate trouble tickets
• call out support staff
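The idea of measuring availability per service and site and comparing it with a service class can be sketched as follows. This is only an illustration: the record layout, the numeric targets in `TARGETS` and the function names are assumptions, since the slide names the service classes but gives no thresholds, and this is not how the SFT framework itself is implemented.

```python
from collections import defaultdict

# Hypothetical availability targets per service class (not from the slide).
TARGETS = {"critical": 0.99, "highly_available": 0.95}


def availability(results):
    """Compute the fraction of passed tests per (site, service).

    `results` is an iterable of (site, service, passed) records.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, service, ok in results:
        total[(site, service)] += 1
        if ok:
            passed[(site, service)] += 1
    return {key: passed[key] / total[key] for key in total}


def alarms(results, service_class):
    """Return sorted (site, service) pairs whose availability is below target.

    `service_class` maps a service name to one of the keys of TARGETS.
    """
    avail = availability(results)
    return sorted(
        key for key, value in avail.items()
        if value < TARGETS[service_class[key[1]]]
    )


# Made-up sample data for illustration only.
results = [
    ("site-a", "CE", True), ("site-a", "CE", True),
    ("site-a", "SE", True), ("site-a", "SE", False),
    ("site-b", "CE", False), ("site-b", "CE", True),
]
service_class = {"CE": "critical", "SE": "highly_available"}
print(alarms(results, service_class))
```

Each pair returned by `alarms` is a candidate for a trouble ticket or a call-out, in the sense described on the slide.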


Middleware Distributions and Stacks

• Terminology:
– EGEE deploys a middleware distribution
  Drawn from various middleware products, stacks, etc.
  Do not confuse the distribution with development projects or with software packages
  Count on 6 months from software developer “release” to production deployment
– The EGEE distribution:
  Current production version labelled: LCG-2.7.0
  New production version labelled: gLite-3.0
  Name change to hopefully reduce confusion

• EGEE distribution contents:

LCG-2.7.0:
– VDT – packaging Globus 2.4, Condor, MyProxy
– EDG workload management
– LCG components: BDII (info sys), catalogue (LFC), DPM, data management libraries and CLI tools, monitoring tools
– gLite: R-GMA, VOMS, FTS

[Arrow label: evolution]

gLite-3.0:
– Based on LCG-2.7.0, and
– gLite workload management
– Other gLite components (not in the distribution but provided as services): AMGA, Hydra, Fireman, gLite-IO


Process to deployment

[Diagram labels: Middleware providers (JRA1, VDT/OSG, OMII-Europe); Integration (SA3); Testing & Certification (SA3); Pre-production service (SA1); Production service; Certification activities SA3+SA1; Support, analysis, debugging]


The Support Model

“Regional Support with Central Coordination”

The ROCs, VOs and other project-wide groups such as the middleware groups (JRA), network groups (NA), service groups (SA) are connected via a central integration platform provided by GGUS.

[Diagram labels: Central Application (GGUS); Interface; Web portal; TPM; ROC 1 … ROC 10; VO Support; Deployment Support; Middleware Support; Network Support; Operations Support; grouped as Regional Support units, User Support units, Technical Support units]

• GGUS is now being used for all problem reporting:
• Operational, deployment and user support
• VOs are using it for their support system
• The use is growing steadily

[Chart: tickets per month, axis 0 to 500, January to April of the following year]


Security & Policy

Collaborative policy development
– Many policy aspects are collaborative works; e.g.:

• Joint Security Policy Group

• Certification Authorities
– EUGridPMA, IGTF, etc.

• Grid Acceptable Use Policy (AUP)
– Common, general and simple AUP
– For all VO members using many Grid infrastructures: EGEE, OSG, SEE-GRID, DEISA, national Grids…

• Incident Handling and Response
– Defines basic communications paths
– Defines requirements (MUSTs) for IR
– Not to replace or interfere with local response plans

[Diagram labels: Security & Availability Policy; Usage Rules; Certification Authorities; Audit Requirements; Incident Response; User Registration & VO Management; Application Development & Network Admin Guide; VO Security]


SA1 goals for EGEE-II

Key goal:
– We have a large running production infrastructure, but EGEE-II MUST take what we have now and make it:

Reliable
– Middleware components fail, error reporting is missing, …
– There is an application responsibility here too – needs effort
– … but! The service has been running non-stop for > 2 years

Robust
– Must continue to address service aspects – move away from prototypes

Usable
– It is still hard to use for many users; still too slow to introduce new VOs

Acceptable
– It must be easy to deploy in a wide variety of environments and coexist with other grid infrastructures

Sustainable
– The infrastructure must become sustainable for the long term


SA1 Outlook

• LHC VOs must achieve reliable production and analysis in 2006
– Will be making significant use of resources
– Applications must bring resources and show commitment

• Consolidate and improve existing services: Focus on
– Reliability, robustness, manageability, performance, scalability, etc.
– Evolution or replacement of services driven by needs of application (or operations/security/manageability)
  TCG has key role here

• Expand grid operations
– Spread expertise to ROCs
– Collaboration with OSG, A-P, etc. and related projects
– Start to negotiate SLAs
– Sustainability: processes evolving, spread of expertise and tasks
– Resource sharing and negotiation – must become streamlined
  Will need a mechanism for cost/credit for use of resources


Summary

• SA1 has built a large production grid infrastructure

• In constant and extensive daily production use
– Several applications depend on it for resources

• Tools and processes are maturing and evolving

• Security and usage policies also evolving

• We have a basic set of middleware that addresses most requirements

• Production middleware is converged now: LCG-2 + gLite → gLite 3

• EGEE-II will focus on making this sustainable and really usable