INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
Challenges of Analysis for Grid Computing
Charles Loomis (LAL-Orsay)
University College London
November 25, 2005
Challenges of Analysis… – Nov. 25, 2005 – C. Loomis 2
Contents
• Introduction
  – What is grid computing?
  – Why is it useful for the LHC?
• LCG/EGEE production service
  – Middleware services
  – Resources available
  – Current usage
• Supporting analysis on the grid
  – Development needed to meet expectations
  – Use of grid in other application domains
• Summary
• Opinions are those of the author and may not reflect those of the LCG or EGEE projects!
What is the Grid?
• “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.” (The Grid, I. Foster and C. Kesselman, 1998)
• Characteristics:
  – Critical part of the grid is the “middleware”.
  – Transparent access to all available resources.
  – Secure access across administrative boundaries.
  – Enables sharing of resources.
Why the Grid?
• User
  – Reduced (or no) porting to take advantage of remote resources.
  – More available resources, less time waiting for answers.
• Experiment
  – No reinventing the wheel: reuse of high-level grid services.
  – Means of coordinating global computing resources.
• Institute
  – More efficient use of hardware.
  – Reduced outlay for hardware through sharing.
What the Grid is Not!
• Unlimited, free resources
  – Sharing is expected to make more resources available at lower cost, but… sharing is a two-way street.
  – Users (or their institutes) must still provide resources equivalent to their average consumption.
• The Borg
  – Making resources available in the grid is always voluntary.
  – Administrators can set policies on who can access those resources, when, with what priority, etc.
• Magic
  – Cannot divine the needs of your applications.
  – Provides mechanisms for creating generally useful services, but users must still write application-level code or a layer to bind to grid services.
LHC and Grid Computing
• The computing needs of the LHC and goals of grid computing are a good match.
• Users and resources are globally distributed.
• Scale of storage and computing resources requires federations of diverse resources.
  – 43 PB of mass storage, 37 PB of disk storage
  – 105k SI2000 of computing
• Needs correspond well to base-level grid services.
  – Batch-like access to computing resources.
  – Storage of large data sets.
  – Metadata management for finding data.
LCG
• LHC Computing Grid:
  – Prepare, deploy, and operate the computing environment to allow the physicists to analyze the data from LHC detectors.
• Requires:
  – Storage and management of large amounts of data.
  – Easy access to data and associated metadata.
  – Access to local and remote computing resources.
  – Stable, reliable system for long periods of time:
    § Large productions of simulation.
    § Chaotic access for data analysis.
• Goals are similar to those of grid computing.
EGEE
• Enabling Grids for E-sciencE:
  – Provide and manage a European grid infrastructure to support researchers from many disciplines.
• LCG and EGEE have similar aims:
  – LCG: worldwide collaboration; one field.
    § Lifetime: ~20 years.
  – EGEE: European grid; many fields.
    § Lifetime: 2+2 years.
  – EGEE-II: Proposed project to maintain infrastructure.
    § Lifetime: 2 years.
• Division of Labor:
  – LCG: Provides and operates infrastructure.
  – EGEE: Re-engineers grid software.
Many Other Projects!
• Interoperability between middleware and infrastructures is a real concern.
Job Submission
[Diagram: job submission flow. Components shown: User Interface, Resource Broker, Information System, Replica Catalogs, and a Storage Element and Computing Element at each of Site 1 and Site 2. Labeled arrows: 1. submit, 2. query, 3. query, 4. submit, 5. retrieve, 6. retrieve, plus an unnumbered “publish status” arrow.]
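The matchmaking step implied by the diagram (the broker querying the information system and the replica catalogs before submitting to a site) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Site` class, the `match_site` function, the file names, and especially the ranking rule (most free CPUs among sites that hold the file) are hypothetical and are not taken from the LCG/EGEE middleware.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_cpus: int  # hypothetical status value a site might publish


def match_site(required_file, information_system, replica_catalog):
    """Return the site with the most free CPUs that holds the file, or None.

    Assumed rule for illustration only; a real broker applies its own
    requirements and ranking.
    """
    # Query the replica catalog: which sites hold a copy of the file?
    holders = replica_catalog.get(required_file, set())
    # Query the information system: which of those sites have free CPUs?
    candidates = [
        site for site in information_system
        if site.name in holders and site.free_cpus > 0
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda site: site.free_cpus)


# Hypothetical published status and replica locations.
information_system = [Site("Site 1", free_cpus=4), Site("Site 2", free_cpus=12)]
replica_catalog = {
    "lfn:/example/data.root": {"Site 1", "Site 2"},
    "lfn:/example/other.root": {"Site 1"},
}

chosen = match_site("lfn:/example/data.root", information_system, replica_catalog)
print(chosen.name if chosen else "no viable site")
```

With this toy data the first file can run at either site and the busier one loses; a file with no registered replica yields no viable site, which is the case a user sees as a job that cannot be scheduled.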
Security
• Public Key Infrastructure
  – Uses Grid Security Infrastructure (GSI) from Globus.
• Authentication (i.e. Who are you?)
  – Certificate Authorities (CA)
    § More than 30 CAs.
    § Covers Europe, North America, and Asia.
  – Principals: Hosts, People, Services.
  – Single sign-on:
    § User generates time-limited proxy.
    § Proxy used to delegate authority.
• Authorization (i.e. What can you do?)
  – Done by Virtual Organization (VO).
  – Resources query VO membership server for membership list.
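The split between authentication and authorization can be sketched in a few lines. This is only a toy model under stated assumptions: the `Proxy` class, the 12-hour default lifetime, the subject strings, and the membership set are hypothetical, and the validity check below stands in for real certificate-chain verification against trusted CAs, which GSI performs and this sketch does not.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Proxy:
    subject: str       # identity taken from the user's certificate
    expires: datetime  # proxies are time-limited


def make_proxy(subject, now, lifetime_hours=12):
    """Create a time-limited proxy (lifetime is an assumed default)."""
    return Proxy(subject, now + timedelta(hours=lifetime_hours))


def is_authorized(proxy, vo_members, now):
    """Accept a request only if the proxy is still valid and the subject
    appears in the VO membership list fetched by the resource."""
    still_valid = now < proxy.expires          # stand-in for authentication
    is_member = proxy.subject in vo_members    # authorization via the VO
    return still_valid and is_member


# Hypothetical user and membership list.
start = datetime(2005, 11, 25, 9, 0)
alice = "/C=FR/O=Example/CN=Alice"
vo_members = {alice}
proxy = make_proxy(alice, start)

print(is_authorized(proxy, vo_members, start + timedelta(hours=1)))
print(is_authorized(proxy, vo_members, start + timedelta(hours=13)))
```

The second call fails only because the proxy has expired, which is the point of single sign-on with short-lived credentials: a stolen proxy is useful for hours, not years.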
Information System
• Information System is the backbone of the grid:
  – Used as a service index.
  – Transports status information to broker.
• MDS (Monitoring and Discovery Service)
  – LDAP-based system provided by Globus.
  – Augmented by plain-vanilla LDAP for performance (BDII).
  – Hierarchy of all grid information.
• R-GMA (Relational Grid Monitoring Architecture)
  – Consumer/Producer model.
  – Backed by a relational database.
  – Uses same information providers as MDS.
Data Management
• Data Management
  – Storage Services
    § GridFTP (gsiftp) servers being phased out.
    § Transition to SRM-based services.
  – Transport protocols:
    § gsiftp (remote, local access)
    § rfio, posix (local access)
    § http, https (limited support)
• VO Replica Catalog
  – Locations of replicated files.
  – RB uses these catalogs to find viable sites for jobs.
• VO Metadata Catalog
  – Information about data files on grid.
  – Accessed directly by end-users.
LCG/EGEE Production Service
• > 200 sites
• > 20 kCPU
• > 13 PB
http://goc03.grid-support.ac.uk/googlemaps/lcg.html
ATLAS Data Challenge
• ATLAS data challenge (Rome, June 2005)
  – 200 CPU-years used
  – 380k jobs in total
  – 1.4M data files, 45 TB
  – 10 people running production
• Total success rate: 52%
https://edms.cern.ch/document/641261/18
WISDOM Data Challenge
• WISDOM: Wide In Silico Docking on Malaria
  – 67 CPU-years in 37 days
  – 73k jobs in total
  – 947 GB of data
  – 5 people running production
• Total success rate: 47%
• Without license failures: 65%
http://wisdom.eu-egee.fr/
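The ATLAS and WISDOM totals quoted on these slides can be turned into per-job and per-day figures with simple arithmetic. The sketch below assumes a 365-day year and assumes the quoted CPU-years are spread over all submitted jobs (the slides do not say whether failed jobs are included); the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year
DAYS_PER_YEAR = 365


def cpu_hours_per_job(cpu_years, jobs):
    """Average CPU time per job, in hours."""
    return cpu_years * HOURS_PER_YEAR / jobs


def average_busy_cpus(cpu_years, days):
    """Average number of CPUs kept busy over the whole campaign."""
    return cpu_years * DAYS_PER_YEAR / days


atlas_hours = cpu_hours_per_job(200, 380_000)   # 200 CPU-years, 380k jobs
wisdom_hours = cpu_hours_per_job(67, 73_000)    # 67 CPU-years, 73k jobs
wisdom_cpus = average_busy_cpus(67, 37)         # 67 CPU-years in 37 days

print(f"ATLAS:  {atlas_hours:.1f} CPU-hours per job")
print(f"WISDOM: {wisdom_hours:.1f} CPU-hours per job")
print(f"WISDOM: {wisdom_cpus:.0f} CPUs busy on average")
```

Under these assumptions the jobs average roughly 4.6 (ATLAS) and 8.0 (WISDOM) CPU-hours each, and WISDOM kept about 660 CPUs busy for the 37 days.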
Better Reliability
• Success rate ~60% is not adequate.
  – Painful, but workable for large productions.
  – Too frustrating for analysis.
• Certification
  – Avoid landing on a “bad” site, but reduces available resources.
  – Must make software easier to install and configure.
  – Current ad hoc solution for Site Functionality Tests needs to be generalized and integrated with the matchmaking.
  – Examples:
    § SFT (and other batteries of tests)
    § Application software validation
    § Site security validation
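A short calculation shows why a ~60% success rate is tolerable for productions but frustrating for analysis. The sketch assumes every job attempt succeeds independently with the same probability, which real grid failures (for example, a whole site being down) do not strictly satisfy; the function names and the choice of 10 jobs are illustrative.

```python
def expected_attempts(p):
    """Mean number of submissions until one job succeeds (geometric law)."""
    return 1 / p


def prob_all_succeed(p, n):
    """Chance that n independent jobs all succeed on the first try."""
    return p ** n


def prob_done_within(p, attempts):
    """Chance that one job has succeeded after at most `attempts` tries."""
    return 1 - (1 - p) ** attempts


p = 0.6  # success rate quoted on the slide
print(f"expected submissions per job: {expected_attempts(p):.2f}")
print(f"10 jobs all succeed first time: {prob_all_succeed(p, 10):.4f}")
print(f"one job done within 3 tries: {prob_done_within(p, 3):.3f}")
```

A production system that resubmits automatically pays about 1.7 submissions per job, which is painful but workable. An analyst waiting on 10 jobs sees all of them succeed on the first pass well under 1% of the time, so manual follow-up becomes the norm.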
Chaotic Access
• Service challenges and large scale productions stress the grid, but in a very organized manner.
• The large-scale analysis which will appear with real LHC data will be much more chaotic.
• Need to test how services will respond to this:
  – Batch systems with thousands of different users.
  – Storage systems caching large numbers of different files.
  – Metadata catalogs with large numbers of varied requests.
  – etc.
Accessible Grid Software
• Grid clients required for all common platforms:
  – People are more efficient working in their usual environment.
  – Normal test progression is efficient; don’t interfere with this.
• Lightweight services for the laptop/workstation:
  – Changing analysis software or scripts to work in different environments is error-prone and frustrating.
  – Allow users to see one environment by running lightweight services on their laptop.
  – Ideally these would be visible in the grid, so that the user only needs to indicate that jobs need more or different resources.
Access Control Lists
• Large experiments are always a balance between collaboration and competition.
• Analysis tends to be competitive:
  – Need to use common resources,
  – But keep certain things private.
  – Fine-grained Access Control Lists (ACLs) will need to be supported by nearly all services. E.g.
    § Analysis jobs: who can kill them, reschedule them, …?
    § Analysis software: who can read the code?
    § Produced data: who can read, delete, list, … the data?
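A fine-grained ACL of the kind described here amounts to a per-resource, per-action list of who is allowed. The sketch below is a hypothetical data model, not an interface of any grid service: the `Resource` class, the action names, and the `group:` naming are assumptions, and the owner-can-do-anything rule and default-deny behaviour are design choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    owner: str
    # Maps an action ("read", "delete", "list", "kill", ...) to the set
    # of users or groups allowed to perform it.
    acl: dict = field(default_factory=dict)

    def allow(self, action, principal):
        """Grant one action to one user or group."""
        self.acl.setdefault(action, set()).add(principal)

    def permits(self, user, action, groups=()):
        """Default deny: allow only the owner or an explicitly listed
        user or group for this specific action."""
        if user == self.owner:
            return True
        allowed = self.acl.get(action, set())
        return user in allowed or any(group in allowed for group in groups)


# Hypothetical produced data set: readable by one analysis group,
# listable by the whole experiment, deletable only by its owner.
data = Resource(name="ntuple-v1", owner="alice")
data.allow("read", "group:higgs-analysis")
data.allow("list", "group:experiment")

print(data.permits("bob", "read", groups=["group:higgs-analysis"]))
print(data.permits("carol", "read", groups=["group:experiment"]))
```

Keeping the action as part of the key is what makes the list fine-grained: a collaborator can see that a data set exists without being able to read or delete it.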
Priorities
• The fair amount of excess capacity on the production service means that most jobs are not significantly delayed.
• With large-scale analysis and production in parallel, this will change.
• Priorities will be needed:
  – For computational, storage, and network resources.
  – Must seamlessly incorporate policies from:
    § User: e.g. mix of analysis jobs and “service” jobs
    § Experiment: e.g. critical realignment jobs before analysis jobs
    § Sites: e.g. local users run with higher priority
  – Must resolve conflicts between policies.
    § E.g. high-priority access to CPU, but low-priority access to storage.
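One possible way to combine user, experiment, and site policies, and to resolve a conflict such as high CPU priority with low storage priority, is sketched below. The slides do not propose a resolution rule, so everything here is an assumption: the 0 to 10 scale, the neutral default, the rule that the most restrictive layer wins per resource, and the rule that a job runs at the priority of its most constrained resource.

```python
DEFAULT_PRIORITY = 5  # assumed neutral value on a hypothetical 0-10 scale


def effective_priority(needed_resources, policies):
    """Combine per-resource priorities from several policy layers.

    policies: list of dicts, one per layer (user, experiment, site),
    each mapping a resource name to a priority. Assumed rules: the most
    restrictive layer wins per resource, and the job as a whole gets
    the priority of its most constrained resource.
    """
    per_resource = {}
    for resource in needed_resources:
        limits = [layer[resource] for layer in policies if resource in layer]
        per_resource[resource] = min(limits) if limits else DEFAULT_PRIORITY
    overall = min(per_resource.values()) if per_resource else DEFAULT_PRIORITY
    return overall, per_resource


# Hypothetical policies: high CPU priority everywhere, but the
# experiment gives this kind of job low priority on storage.
user_policy = {"cpu": 8}
experiment_policy = {"cpu": 9, "storage": 3}
site_policy = {"cpu": 6}

overall, detail = effective_priority(
    ["cpu", "storage"], [user_policy, experiment_policy, site_policy]
)
print(overall, detail)
```

Taking the minimum makes the conflict explicit: the job in the example is effectively a low-priority job because storage is its bottleneck, regardless of how favourably the CPU policies treat it. Other rules (weighted sums, site overrides) are equally plausible.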
Database Issues
• Users will need to store information about their analyses in databases.
  – Location of produced data files.
  – Metadata concerning those files.
• Common services:
  – Privacy and namespace issues must be resolved.
• Private services:
  – Federation issues must be resolved.
Communication
• Effective communication is vital for analysis.
• The grid should incorporate communication tools:
  – e-mail and mailing lists
  – chat
  – phone
  – video
• And facilitate their use. For example:
  – “single sign-on” for all services
  – automatic management of lists with VO authorization groups
  – management of MCU for video
Other Applications
• Biomedical applications
  – Public database usage
  – Large resource needs
  – Privacy concerns
  – Quasi-realtime response
• Earth sciences
  – Widely distributed data
  – “Complex” metadata searches
  – Commercial software
  – Quasi-realtime response
• Astrophysics
  – Sharing data between VOs
• Computational Chemistry
  – Large, parallel algorithms
Summary
• Grid technology fits well with the needs and constraints of the high-energy physics community.
• LCG/EGEE production service
  – Large number of globally-distributed resources available.
  – Successfully used by many experiments for large productions.
  – Will need to grow by a factor of 5 to meet needs of LHC.
• Supporting analysis is challenging for the grid:
  – Reliability must increase significantly.
  – Better availability of the software on different platforms.
  – Finer-grained control over access to and use of resources.
  – Incorporation of new services into the grid.