TRANSCRIPT
Enabling Grids for E-sciencE
www.eu-egee.org
gLite for ATLAS Production
Simone Campana, CERN/INFN
ATLAS production meeting, May 2, 2005
Outline
• Contents of gLite release 1.0
  – Components, architecture, and service interplay
  – Major differences to LCG-2
  – Major open issues
  – Future plans
• Deployment plan
  – LCG-2 vs gLite
  – gLite/LCG-2 coexistence
  – Current status of certification
  – Preproduction service
WMS
• Re-engineering of the LCG-2 Workload Management System
  – Supports partitioned jobs and jobs with dependencies
    Might be considered as the setup for the production machinery; would help
    solve problems such as pre-staging. Side effects should be investigated
    and considered as well.
  – Task Queue: persistent queue for submitted jobs
  – Information Supermarket: read-only information-system cache, updated by
    • information systems (CE in push mode)
    • CEMon (CE in pull mode)
    • a combination of both
    Allows the WMS to work in both push and pull mode
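As an illustrative sketch (not gLite code), the Task Queue / Information Supermarket interplay described above can be mimicked like this: jobs wait in a persistent queue, the matchmaker only reads a CE-state cache, and the cache is fed either by pushes from the information system or by pulled CEMon snapshots. All class, field, and CE names here are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Job:
    job_id: str
    requirements: Callable[[dict], bool]  # matchmaking predicate on CE state

class InformationSupermarket:
    """Read-only cache of CE information, as seen by the matchmaker."""
    def __init__(self):
        self._ce_state = {}

    def push_update(self, ce_name, state):
        # CE in push mode: the information system pushes fresh state
        self._ce_state[ce_name] = state

    def pull_update(self, cemon_snapshot):
        # CE in pull mode: a snapshot pulled via CEMon
        self._ce_state.update(cemon_snapshot)

    def view(self):
        return dict(self._ce_state)

class TaskQueue:
    """Persistent queue: jobs wait here until a matching CE shows up."""
    def __init__(self):
        self.pending = []

    def submit(self, job):
        self.pending.append(job)

    def match(self, supermarket):
        assignments, remaining = {}, []
        for job in self.pending:
            ce = next((name for name, state in supermarket.view().items()
                       if job.requirements(state)), None)
            if ce is None:
                remaining.append(job)  # stays queued; retried on next update
            else:
                assignments[job.job_id] = ce
        self.pending = remaining
        return assignments
```

A job whose requirements no CE currently satisfies simply stays in the queue, which is the point of making the queue persistent rather than rejecting unmatched jobs at submission time.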
WMS Cont’d
– Interface to Data Management: EDG-RLS, StorageIndex, DLI
– Condor-C: job submission mechanism between the WM and the CE
  Improved reliability with respect to Globus GRAM
– CE moving towards a VO-based scheduler

• Future:
  – Web services
  – Bulk submission
WMS Cont’d
• Major problems
  – Failure rate ~12% (retrycount = 0); otherwise 100% success
    Several causes are being investigated (e.g. race conditions); shallow
    re-submission (i.e. retry of the submission, not of the execution)
    might help
  – Matchmaking sometimes blocks; fix provided for Release 1.1 (end of April)
  – Condor as backend not yet working
  – Not yet the final architecture of the CE:
    one schedd per local user id; needs setuid services and head-node
    monitoring (Globus + JRA3)
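The distinction behind "shallow" re-submission can be shown with a small sketch (our own, not WMS code): only failures that happen before the job ever starts on the CE are retried, while failures after the payload has started are propagated, since automatically re-running them could duplicate side effects such as output files. The exception names and retry count are illustrative.

```python
class SubmissionError(Exception):
    """Failure before the job ever started on the CE (safe to retry)."""

class ExecutionError(Exception):
    """Failure after the job started (re-running could redo side effects)."""

def submit_with_shallow_retry(submit, shallow_retry_count=3):
    """Shallow re-submission: retry the submission, never the execution."""
    attempts = 0
    while True:
        try:
            return submit()
        except SubmissionError:
            attempts += 1
            if attempts > shallow_retry_count:
                raise
        # ExecutionError is deliberately NOT caught here: a deep
        # resubmission would re-run the payload, which this scheme avoids.
```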
Data Management Services

• Mostly new developments based on (and re-using some) AliEn services
• Storage Element
  – Storage Resource Manager: relies on existing implementations
  – POSIX I/O: gLite I/O
  – Access protocols: gsiftp, gsidcap, rfio, …
• Catalogs
  – File Catalog, Replica Catalog, File Authorization Service:
    gLite FiReMan catalog (MySQL and Oracle)
  – Metadata Catalog: gLite Metadata Catalog
• File Transfer
  – Data Scheduler: planned for Release 2
  – File Transfer Service: gLite FTS and glite-url-copy
  – File Placement Service: gLite FPS
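How the catalog layers above fit together can be sketched in a few lines (a toy model, not a gLite API): the file catalog maps a logical file name (LFN) to a GUID, and the replica catalog maps the GUID to the physical replicas (SURLs). All names, paths, and endpoints below are invented for illustration.

```python
# Toy in-memory catalogs; a real deployment would be a database-backed service.
file_catalog = {
    "/grid/atlas/dataset1/file001": "guid-0001",   # File Catalog: LFN -> GUID
}
replica_catalog = {
    "guid-0001": [                                  # Replica Catalog: GUID -> SURLs
        "srm://castor.example.ch/atlas/file001",
        "srm://dcache.example.nl/atlas/file001",
    ],
}

def resolve_replicas(lfn):
    """Resolve a logical name to its physical replicas via the GUID."""
    guid = file_catalog[lfn]
    return replica_catalog[guid]
```

The indirection through the GUID is what lets replicas be added or removed without touching the logical namespace.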
Data Management Cont'd

• Addressing shortcomings of LCG-2 data management
  – RLS performance
  – Lack of consistent grid-storage interfaces
  – Unreliable data transfer layer
• Fireman Catalog
  – Hierarchical name space
  – Bulk operations
  – ACLs
  – Web services interface
  – POOL interface
  – Performance/scalability
• gLite I/O
  – Support for ACLs
  – Support for the Fireman catalog in addition to RLS
• File Transfer Service
  – Did not exist in LCG-2; to be evaluated in SC3
Catalogs
• Currently: global catalog
• Future: could consider local file catalogs
  – A central location index (Storage Index) would be necessary
  – Lightweight and therefore more scalable
  – Yet to be demonstrated
• Comparison Fireman vs. LFC
  – Single files: LFC faster
  – Fireman offers bulk capabilities; LFC does not
• "Within the LHC experiments there is yet no decision - they really want to test both LFC and Fireman and select based on their needs. It is likely that different applications will choose a different catalogue. Since these are application dependent we will probably be in a situation where we have different VOs using different catalogues."
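Why bulk capability matters can be made concrete with a toy sketch (ours, not Fireman or LFC code): with per-entry calls each registration pays one network round trip, while a bulk call pays one round trip for the whole batch. The class and the round-trip counter are invented stand-ins for the real client/server cost.

```python
class ToyCatalog:
    """Toy catalog counting round trips as a stand-in for network cost."""
    def __init__(self):
        self.entries = {}
        self.round_trips = 0

    def register(self, lfn, guid):
        """Single-entry style: one round trip per file."""
        self.round_trips += 1
        self.entries[lfn] = guid

    def register_bulk(self, batch):
        """Bulk style: one round trip for the whole batch."""
        self.round_trips += 1
        self.entries.update(batch)
```

Registering 100 files one by one costs 100 round trips; the same batch in one bulk call costs a single round trip, which is why bulk operations dominate for production-scale registrations even if a single-file lookup is faster elsewhere.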
Information and Monitoring Services
• R-GMA (Relational Grid Monitoring Architecture)
  – Implements the GGF GMA standard
  – Development started in EDG; deployed on the production infrastructure for accounting
[Diagram: R-GMA architecture. A producer application publishes tuples through
the Producer Service (SQL "CREATE TABLE" via the Schema Service, then SQL
"INSERT"); a consumer application sends SQL "SELECT" queries through the
Consumer Service and receives the matching tuples. Producers and consumers
register with, and locate each other through, the Registry Service, with a
Mediator and the producer/consumer APIs in between.]
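The relational flow in the diagram above can be mimicked with SQLite standing in for the distributed schema/producer/consumer services: the producer publishes tuples with SQL INSERT, the consumer queries with SQL SELECT against the shared schema. The table and column names are invented for illustration; R-GMA itself is a distributed system, not a single local database.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Schema service: defines the shared relational view
db.execute("CREATE TABLE JobStatus (job_id TEXT, site TEXT, state TEXT)")

# Producer application: publishes tuples
db.executemany("INSERT INTO JobStatus VALUES (?, ?, ?)", [
    ("job-001", "CERN", "Running"),
    ("job-002", "CNAF", "Done"),
])

# Consumer application: sends a query, receives the matching tuples
rows = db.execute(
    "SELECT job_id FROM JobStatus WHERE state = 'Running'").fetchall()
```

The virtue of the relational model is exactly this: monitoring consumers need to know only the schema and SQL, not where each producer lives — the registry and mediator handle the routing.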
Job Monitoring
• Currently ATLAS relies on the ProdDB and the GridICE server.
• R-GMA and GridICE can coexist
  – At the moment, information from the GridICE sensors can be published by R-GMA
  – Monitoring tools (Magoo) can therefore interface to R-GMA directly
  – The GridICE server is still quite useful to quickly visualize the status of resources
• R-GMA can be used for application monitoring
• R-GMA has already been used for several months for accounting and operations.
• Don't forget Laurence Field's mini tutorial on R-GMA architecture and usage
  – End of May (final date not yet fixed).
VOMS
• Derived from DataTag and EDG
• Used for VO management
  – VOMS certificates understood by the WMS and Data Management (as of release 1.1)
• RFC compliance
• Major problems
  – Incompatibility with previous VOMS versions, due to the RFC compliance
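The VO-management information a VOMS certificate carries is expressed as attributes (FQANs) of the usual /&lt;vo&gt;/&lt;group&gt;/Role=&lt;role&gt; shape. The small parser below is our own illustration of reading one, not a VOMS API; the ATLAS group and role names are made up.

```python
def parse_fqan(fqan):
    """Split a VOMS FQAN like /atlas/production/Role=lcgadmin into parts.

    Illustrative helper only; real clients use the VOMS libraries.
    """
    parts = fqan.strip("/").split("/")
    role = "NULL"   # VOMS convention for "no role requested"
    groups = []
    for p in parts:
        if p.startswith("Role="):
            role = p.split("=", 1)[1]
        else:
            groups.append(p)
    return {"vo": groups[0], "groups": groups, "role": role}
```

Services such as the WMS or gLite I/O can then map the VO, group, and role onto local authorization decisions (e.g. ACL entries).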
Future Plans - WMS
• Move CE to final architecture
  – One scheduler per VO (can be provided by the VO)
  – Head-node monitor and fork/set-uid service
• WS interface to WMS
  – With better support for bulk job submission
• Support for pilot jobs
• Integration of network information (JRA4)
• Closer integration with Data Management
  – Common job and data-transfer DAGs
  – Data matchmaking for ranking
• "Shallow" job resubmission
• CE history used for ranking
• Use information from R-GMA in the Information Supermarket
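What "data matchmaking for ranking" could look like is sketched below: candidate CEs are scored by how much of the job's input data already sits at a close SE, combined with CE load. This is purely a hypothetical illustration of the idea; the 0.7/0.3 weights, field names, and CE names are all invented.

```python
def rank_ce(ce, input_files):
    """Higher is better: weight data locality heavily, free slots lightly.

    The weights are arbitrary illustration values, not a gLite policy.
    """
    if input_files:
        local = sum(1 for f in input_files if f in ce["close_se_files"])
        data_fraction = local / len(input_files)
    else:
        data_fraction = 0.0
    load_score = min(ce["free_slots"], 10) / 10  # saturate at 10 free slots
    return 0.7 * data_fraction + 0.3 * load_score

def best_ce(ces, input_files):
    """Pick the highest-ranked CE for a job with the given input files."""
    return max(ces, key=lambda ce: rank_ce(ce, input_files))["name"]
```

With a ranking like this, a lightly loaded but data-less CE can lose to a busier CE that already holds the inputs, which is exactly the behavior that avoids unnecessary pre-staging.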
Future Plans - DM
• Security in the DM chain
  – Delegation
  – Work out the security model (local vs. Grid)
  – Support SRMs with native ACL support
  – VOMS roles for ACLs
• Distributed/partitioned catalogs
  – Need to define the model
• Integration of network information (JRA4)
• Explore XROOTD
• Harmonize the metadata interface (ARDA, PTF)
• Data Scheduler (equivalent of the WMS for data transfer requests)
Outline
• Contents of gLite release 1.0
  – Components, architecture, and service interplay
  – Major differences to LCG-2
  – Major open issues
  – Future plans
• Deployment plan
  – LCG-2 vs gLite
  – gLite/LCG-2 coexistence
  – Current status of certification
  – Preproduction service
LCG2 Services

[Diagram: LCG-2 service deployment. Services per VO: VOMS, VO (LDAP), RLS,
LFC. Site services: CE, R-GMA MonBox, classic SE, and SRM SEs (CASTOR,
dCache, DPM), plus LCG worker nodes and LCG UIs with client libraries. CIC
services (replicated): MyProxy, BDII, RB. Global services: R-GMA registries,
monitoring (SFT, FCR), APEL.]
gLite 1

[Diagram: gLite 1 service deployment. Services per VO: VOMS, FireMan. Site
services: CE (push/pull), R-GMA MonBox, and SRM SEs (CASTOR, dCache, DPM)
fronted by gLite I/O, plus gLite worker nodes and gLite UIs with client
libraries. CIC services (replicated): MyProxy, WLM. Global services: R-GMA
registries, DGAS(?). Notes: services are generic, but file ownership limits
access to the gLite I/O service; the link between CEs and SEs is via CEMon
(later by R-GMA, temporarily the BDII).]
gLite - LCG2 Coexistence

[Diagram: the gLite and LCG-2 stacks deployed side by side. gLite side:
VOMS and FireMan per VO; CE (push/pull), R-GMA MonBox, and SRM SEs (CASTOR,
dCache, DPM) with gLite I/O; CIC services MyProxy and WLM; R-GMA registries
and DGAS as global services; gLite UIs. LCG-2 side: VOMS, VO (LDAP), RLS,
and LFC per VO; CE, R-GMA MonBox, and SRM SEs; CIC services MyProxy, BDII,
RB; monitoring (SFT, FCR) and APEL as global services; LCG UIs. The worker
nodes run both gLite and LCG-2 middleware. Notes: services are generic, but
file ownership limits access to the gLite I/O service (this could be worked
around); the two stacks can be hosted on the same nodes; the WLM can get
information on CEs and SEs through CEMon or the BDII; the WLM can interface
via DLI to the LFC, but gLite I/O cannot.]
Model
• Pros: the real gLite experience
  – No mixing, except for the WNs and UIs
  – No intense interoperability testing needed
  – New versions can be released more quickly after they become available
  – LCG-2 can be evolved independently
• Cons:
  – Many additional services, with more to come; not a valid model for small sites
  – Complex trickery needed to allow both systems access to new/old data
  – LCG-2 can be evolved independently
gLite - LCG2 step 1

[Diagram: as the coexistence picture, but with a single shared set of site
services: gLite CE (push/pull) alongside the LCG CE, R-GMA MonBox, a classic
SE, and SRM SEs (CASTOR, dCache, DPM). Services per VO: VOMS, VO (LDAP),
RLS, LFC. CIC services: MyProxy, BDII, RB on the LCG-2 side and MyProxy, WLM
on the gLite side. Global services: R-GMA registries, monitoring (SFT, FCR),
APEL. The worker nodes run both gLite and LCG-2 middleware; both gLite and
LCG UIs remain.]
gLite - LCG2 step 2

[Diagram: the CIC services converge to MyProxy, BDII, and WLM (the RB is
gone). Services per VO: VOMS, LFC. Site services: CE (push/pull), R-GMA
MonBox, and SRM SEs (CASTOR, dCache, DPM). Global services: R-GMA
registries, monitoring (SFT, FCR), APEL. The worker nodes still run both
gLite and LCG-2 middleware; only gLite UIs remain.]
gLite - LCG2 step 2

[Diagram: CIC services: MyProxy, WLM, and the BDII (possibly dropped).
Services per VO: VOMS, LFC, and FireMan (the FireMan catalog contains the
LFC data). Site services: CE (push/pull), R-GMA MonBox, and SRM SEs (CASTOR,
dCache, DPM) behind gLite I/O, with file ownership adjusted for access via
gLite I/O. Global services: R-GMA registries, monitoring (SFT, FCR(?)),
DGAS. The worker nodes are gLite WNs, plus LCG-utils and GFAL.]

Users can decide to use different (multiple) catalogues, as long as they
provide a DLI interface. Users can opt for direct access to the storage via
the LCG tools; however, there will be limitations concerning the
accessibility of the data.
Finer Points
• Interoperation with other grids
  – Mechanism to select between different client libs on the WN (done)
  – Need to keep an information system that can be used by LCG-3 and OSG (Grid3)
• Operations
  – We have to adapt our operations services to gLite production
    Substantial effort; needed for the preproduction service
  – Monitoring has to be ported (partially done already)
• Time?
  – Should be driven by the VOs and the experience gained
  – Fixed schedules tend not to be workable
To be clear on what is being certified
• Release 1.0 of gLite
  http://glite.web.cern.ch/glite/packages/R1.0/R20050331/default.asp
• The information system is a combination of the BDII and CEMon
  – Locations of SEs can only be stored in the BDII. This data is entered
    manually into the BDII and is static.
  – In this release no gLite services use R-GMA as the information system,
    although the R-GMA services are in the release and will be tested.
• No File Placement / File Transfer service
  – Using GridFTP on the certification test bed
• No accounting
When will gLite 1.0 be certified?
• The criteria for certification still need to be agreed, but will probably include items such as:
  – Meeting SA1's requirements on "deployability"
  – Successful completion of the gLite certification test suite
  – Job failure rate ≤ the LCG-2 job failure rate
  – Acceptable stability of the system
  – An acceptable number of outstanding critical/major bugs
• This is the first attempt at certifying gLite; it is therefore impossible to predict when it will be finished
Pre-production Service
• Pre-production planned in two phases:
  – Phase I:
    Started when deployment of the certification testbed was successfully
    completed (middle of last week).
    Sites: CNAF, NIKHEF, PIC(?), CESGA and CERN. These sites will install
    all gLite site components + some gLite core service(s) (for resilience)
    + LCG-2 site components (for investigating migration).
    The CESGA site is already being used by ARDA CMS!
  – Phase II:
    Will start when the phase I sites are fully operational. >12 sites in
    phase II. All sites will install gLite; some will also install LCG-2.
INFN-ECGI activity

• ECGI = Experiment Computing Grid Integration
• Working group created to
  – help the LHC experiments understand and experiment with the
    functionalities and features offered by the new gLite middleware
  – provide adequate user documentation
• Will work in close collaboration and coordination with
  – the Experiment Integration Support (EIS) team at CERN
  – the developers of the EGEE/LCG Grid middleware
• Whenever possible and needed, the group will also
  – create system-administration documentation
  – use dedicated resources for testing purposes
• http://infn-ecgi.pi.infn.it/index.html