The Grid: Enabling Resource Sharing within Virtual Organizations
Ian Foster
Mathematics and Computer Science Division
Argonne National Laboratory
and
Department of Computer Science
The University of Chicago
http://www.mcs.anl.gov/~foster
Invited Talk, Research Library Group Annual Conference, Amsterdam, April 22, 2002
Overview
The technology landscape
– Living in an exponential world
Grid concepts
– Resource sharing in virtual organizations
Petascale Virtual Data Grids
– Data, programs, computers, computations as community resources
Living in an Exponential World: (1) Computing & Sensors
Moore's Law: transistor count doubles every 18 months
(Figure: magnetohydrodynamics simulation of star formation)
Living in an Exponential World: (2) Storage
Storage density doubles every 12 months
Dramatic growth in online data (1 petabyte = 1000 terabytes = 1,000,000 gigabytes); the implied growth rates are sketched below
– 2000: ~0.5 petabyte
– 2005: ~10 petabytes
– 2010: ~100 petabytes
– 2015: ~1000 petabytes?
Transforming entire disciplines in the physical and, increasingly, biological sciences; humanities next?
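As a quick aid to those numbers, the snippet below (illustrative arithmetic only; the volumes are the ones quoted above) computes the growth factor implied by each five-year step.

```python
# Illustrative arithmetic: growth implied by the online-data projections above.
projected_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}  # petabytes

years = sorted(projected_pb)
for a, b in zip(years, years[1:]):
    factor = projected_pb[b] / projected_pb[a]
    per_year = factor ** (1 / (b - a))
    print(f"{a}-{b}: x{factor:g} overall, ~x{per_year:.2f} per year")
```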
Data Intensive Physical Sciences
High energy & nuclear physics
– Including new experiments at CERN
Gravity wave searches
– LIGO, GEO, VIRGO
Time-dependent 3-D systems (simulation, data)
– Earth observation, climate modeling
– Geophysics, earthquake modeling
– Fluids, aerodynamic design
– Pollutant dispersal scenarios
Astronomy: digital sky surveys
Ongoing Astronomical Mega-Surveys
Large number of new surveys
– Multi-TB in size, 100M objects or larger
– In databases
– Individual archives planned and under way
Multi-wavelength view of the sky
– > 13-wavelength coverage within 5 years
Impressive early discoveries
– Finding exotic objects by unusual colors
> L, T dwarfs, high-redshift quasars
– Finding objects by time variability
> Gravitational micro-lensing
Surveys include MACHO, 2MASS, SDSS, DPOSS, GSC-II, COBE, MAP, NVSS, FIRST, GALEX, ROSAT, OGLE, ...
Coming Floods of Astronomy Data
The planned Large Synoptic Survey Telescope will produce over 10 petabytes per year by 2008!
– An all-sky survey every few days, giving a fine-grained time series for the first time
Data Intensive Biology and Medicine
Medical data
– X-ray, mammography data, etc. (many petabytes)
– Digitizing patient records (ditto)
X-ray crystallography
Molecular genomics and related disciplines
– Human Genome, other genome databases
– Proteomics (protein structure, activities, …)
– Protein interactions, drug delivery
Virtual Population Laboratory (proposed)
– Simulate likely spread of disease outbreaks
Brain scans (3-D, time dependent)
A Brain is a Lot of Data! (Mark Ellisman, UCSD)
And comparisons must be made among many.
We need to get to one micron to know the location of every cell. We are just now starting to get to 10 microns; Grids will help get us there and further.
An Exponential World: (3) Networks (Or, Coefficients Matter …)
Network vs. computer performance
– Computer speed doubles every 18 months
– Network speed doubles every 9 months
– Difference = order of magnitude per 5 years (see the quick check below)
1986 to 2000
– Computers: x 500
– Networks: x 340,000
2001 to 2010
– Computers: x 60
– Networks: x 4,000
Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner, Caufield and Perkins.
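A back-of-the-envelope check of that "order of magnitude per 5 years" claim, using only the doubling periods quoted above (a minimal sketch):

```python
# Minimal sketch: relative improvement over 5 years if computer speed
# doubles every 18 months and network speed doubles every 9 months.
def growth(months, doubling_period):
    return 2 ** (months / doubling_period)

five_years = 60  # months
computers = growth(five_years, 18)  # ~10x
networks = growth(five_years, 9)    # ~100x
print(f"computers: x{computers:.0f}, networks: x{networks:.0f}, "
      f"gap: ~x{networks / computers:.0f} every 5 years")
```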
Evolution of the Scientific Process
Pre-electronic
– Theorize &/or experiment, alone or in small teams; publish paper
Post-electronic
– Construct and mine very large databases of observational or simulation data
– Develop computer simulations & analyses
– Exchange information quasi-instantaneously within large, distributed, multidisciplinary teams
The Grid
“Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations”
An Example Virtual Organization: CERN’s Large Hadron Collider
1800 Physicists, 150 Institutes, 32 Countries
100 PB of data by 2010; 50,000 CPUs?
Grid Communities & Applications: Data Grids for High Energy Physics
(Diagram: the tiered LHC data grid)
– Tier 0: the Online System feeds the CERN Computer Centre and its Offline Processor Farm (~20 TIPS)
– Tier 1: regional centres (FermiLab ~4 TIPS; France, Italy, and Germany regional centres)
– Tier 2: centres of ~1 TIPS each (e.g., Caltech)
– Institutes (~0.25 TIPS) with a physics data cache, plus physicist workstations
– Link capacities range from ~PBytes/sec off the detector and ~100 MBytes/sec out of the online system, through ~622 Mbits/sec (or air freight, deprecated) between tiers, down to ~1 MBytes/sec at the workstations
– 1 TIPS is approximately 25,000 SpecInt95 equivalents
There is a "bunch crossing" every 25 nsecs
There are 100 "triggers" per second
Each triggered event is ~1 MByte in size (see the arithmetic check after this list)
Physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server
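The arithmetic check referenced above (illustrative only; it uses just the trigger rate and event size quoted on the slide):

```python
# Illustrative arithmetic from the figures quoted above.
trigger_rate_hz = 100        # triggered events per second
event_size_mb = 1.0          # ~1 MByte per triggered event
seconds_per_year = 3600 * 24 * 365

stream_mb_per_s = trigger_rate_hz * event_size_mb
pb_per_year = stream_mb_per_s * seconds_per_year / 1e9  # 1 PB = 1e9 MB

print(f"trigger stream: ~{stream_mb_per_s:.0f} MBytes/sec")
print(f"~{pb_per_year:.1f} PB/year if the detector ran continuously")
```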
www.griphyn.org www.ppdg.net www.eu-datagrid.org
The Grid Opportunity: eScience and eBusiness
Physicists worldwide pool resources for peta-op analyses of petabytes of data
Civil engineers collaborate to design, execute, & analyze shake table experiments
An insurance company mines data from partner hospitals for fraud detection
An application service provider offloads excess load to a compute cycle provider
An enterprise configures internal & external resources to support eBusiness workload
The Grid: A Brief History
Early 90s
– Gigabit testbeds, metacomputing
Mid to late 90s
– Early experiments (e.g., I-WAY), academic software projects (e.g., Globus, Legion), application experiments
2002
– Dozens of application communities & projects
– Major infrastructure deployments
– Significant technology base (esp. Globus Toolkit™)
– Growing industrial interest
– Global Grid Forum: ~500 people, 20+ countries
Challenging Technical Requirements
Dynamic formation and management of virtual organizations
Online negotiation of access to services: who, what, why, when, how
Establishment of applications and systems able to deliver multiple qualities of service
Autonomic management of infrastructure elements
Open Grid Services Architecture
http://www.globus.org/ogsa
Data Intensive Science: 2000-2015
Scientific discovery increasingly driven by IT
– Computationally intensive analyses
– Massive data collections
– Data distributed across networks of varying capability
– Geographically distributed collaboration
Dominant factor: data growth (1 petabyte = 1000 TB)
– 2000: ~0.5 petabyte
– 2005: ~10 petabytes
– 2010: ~100 petabytes
– 2015: ~1000 petabytes?
How to collect, manage, access, and interpret this quantity of data?
Drives demand for "Data Grids" to handle an additional dimension of data access & movement
Data Grid Projects
Particle Physics Data Grid (US, DOE)
– Data Grid applications for HENP experiments
GriPhyN (US, NSF)
– Petascale Virtual-Data Grids
iVDGL (US, NSF)
– Global Grid lab
TeraGrid (US, NSF)
– Distributed supercomputing resources (13 TFlops)
European Data Grid (EU, EC)
– Data Grid technologies, EU deployment
CrossGrid (EU, EC)
– Data Grid technologies, EU emphasis
DataTAG (EU, EC)
– Transatlantic network, Grid applications
Japanese Grid Projects (APGrid) (Japan)
– Grid deployment throughout Japan
Collaborations of application scientists & computer scientists
Infrastructure development & deployment
Globus based
Biomedical Informatics Research Network (BIRN)
Evolving reference set of brains provides essential data for developing therapies for neurological disorders (multiple sclerosis, Alzheimer's, etc.)
Today
– One lab, small patient base
– 4 TB collection
Tomorrow
– 10s of collaborating labs
– Larger population sample
– 400 TB data collection: more brains, higher resolution
– Multiple-scale data integration and analysis
GriPhyN = App. Science + CS + Grids
GriPhyN = Grid Physics Network
– US-CMS (high energy physics)
– US-ATLAS (high energy physics)
– LIGO/LSC (gravity wave research)
– SDSS (Sloan Digital Sky Survey)
– Strong partnership with computer scientists
Design and implement production-scale grids
– Develop common infrastructure, tools and services
– Integration into the 4 experiments
– Application to other sciences via the "Virtual Data Toolkit"
Multi-year project
– R&D for grid architecture (funded at $11.9M + $1.6M)
– Integrate Grid infrastructure into experiments through the VDT
GriPhyN Institutions
– U Florida
– U Chicago
– Boston U
– Caltech
– U Wisconsin, Madison
– USC/ISI
– Harvard
– Indiana
– Johns Hopkins
– Northwestern
– Stanford
– U Illinois at Chicago
– U Penn
– U Texas, Brownsville
– U Wisconsin, Milwaukee
– UC Berkeley
– UC San Diego
– San Diego Supercomputer Center
– Lawrence Berkeley Lab
– Argonne
– Fermilab
– Brookhaven
GriPhyN: PetaScale Virtual Data Grids
(Architecture diagram: ~1 Petaop/s, ~100 Petabytes. Production teams, individual investigators, and workgroups use interactive user tools; these drive virtual data tools, request planning & scheduling tools, and request execution & management tools, which in turn build on resource management services, security and policy services, and other Grid services. Transforms connect raw data sources to distributed resources: code, storage, CPUs, networks.)
GriPhyN Research Agenda
Virtual Data technologies
– Derived data, calculable via algorithm
– Instantiated 0, 1, or many times (e.g., caches)
– "Fetch value" vs. "execute algorithm"
– Potentially complex (versions, cost calculation, etc.)
– E.g., LIGO: "Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year"
For each requested data value, need to (see the sketch below)
– Locate item materialization, location, and algorithm
– Determine costs of fetching vs. calculating
– Plan data movements and computations to obtain results
– Execute the plan
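A minimal sketch of that decision loop (hypothetical names throughout; this is not GriPhyN's actual planner, just an illustration of choosing between fetching an existing materialization and re-executing the deriving algorithm):

```python
# Hypothetical illustration of the "fetch value vs. execute algorithm" choice.
from dataclasses import dataclass

@dataclass
class Replica:
    site: str
    transfer_cost: float   # e.g., estimated seconds to move the data here

@dataclass
class Derivation:
    algorithm: str
    compute_cost: float    # estimated seconds to recompute
    replicas: list         # existing materializations, possibly empty

def plan_request(d: Derivation):
    """Return a cheap plan: either fetch an existing copy or recompute."""
    options = [("fetch", r.site, r.transfer_cost) for r in d.replicas]
    options.append(("compute", d.algorithm, d.compute_cost))
    return min(options, key=lambda opt: opt[2])

strain = Derivation("ligo_strain_extractor", compute_cost=900.0,
                    replicas=[Replica("tier2-caltech", 120.0)])
print(plan_request(strain))   # ('fetch', 'tier2-caltech', 120.0)
```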
Virtual Data in Action
A data request may
– Compute locally
– Compute remotely
– Access local data
– Access remote data
Scheduling based on
– Local policies
– Global policies
– Cost
(Diagram: requests flow across major facilities and archives, regional facilities and caches, and local facilities and caches; an item may be fetched from any level.)
GriPhyN Research Agenda (cont.)
Execution management
– Co-allocation (CPU, storage, network transfers)
– Fault tolerance, error reporting
– Interaction, feedback to planning
Performance analysis (with PPDG)
– Instrumentation, measurement of all components
– Understand and optimize grid performance
Virtual Data Toolkit (VDT)
– VDT = virtual data services + virtual data tools
– One of the primary deliverables of the R&D effort
– Technology transfer to other scientific domains
iVDGL: A Global Grid Laboratory
International Virtual-Data Grid Laboratory
– A global Grid lab (US, EU, South America, Asia, …)
– A place to conduct Data Grid tests "at scale"
– A mechanism to create common Grid infrastructure
– A production facility for LHC experiments
– An experimental laboratory for other disciplines
– A focus of outreach efforts to small institutions
Funded for $13.65M by NSF
"We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science."
From NSF proposal, 2001
Initial US-iVDGL Data Grid
(Map: Tier1 (FNAL), proto-Tier2, and Tier3 university sites, including UCSD, Florida, Wisconsin, Fermilab, BNL, Indiana, BU, SKC, Brownsville, Hampton, PSU, JHU, and Caltech; other sites to be added in 2002.)
iVDGL Map (2002-2003)
(Map legend: Tier0/1, Tier2, and Tier3 facilities; 10 Gbps, 2.5 Gbps, 622 Mbps, and other links; DataTAG and Surfnet connections. Later: Brazil, Pakistan, Russia, China.)
Programs as Community Resources: Data Derivation and Provenance
Most scientific data are not simple "measurements"; essentially all are
– Computationally corrected/reconstructed
– And/or produced by numerical simulation
And thus, as data and computers become ever larger and more expensive
– Programs are significant community resources
– So are the executions of those programs
(Diagram: Transformations, Derivations, and Data, linked by "execution-of", "created-by", and "consumed-by"/"generated-by" relationships.)
"I've detected a calibration error in an instrument and want to know which derived data to recompute."
"I've come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes."
"I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won't have to write one from scratch."
"I want to apply an astronomical analysis program to millions of objects. If the results already exist, I'll save weeks of computation."
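The first of those questions is essentially a traversal of a provenance graph. A minimal sketch under assumed names (hypothetical datasets; this is not the actual Chimera schema):

```python
# Hypothetical provenance sketch: which derived datasets depend on data
# produced by a miscalibrated instrument, and so need recomputation?
derived_from = {                     # dataset -> datasets it was derived from
    "raw_run42": [],
    "calibrated_run42": ["raw_run42"],
    "strain_run42": ["calibrated_run42"],
    "catalog_2002": ["strain_run42", "raw_run41"],
    "raw_run41": [],
}

def affected_by(bad_dataset, provenance):
    """All datasets that transitively depend on `bad_dataset`."""
    hit, changed = {bad_dataset}, True
    while changed:
        changed = False
        for ds, parents in provenance.items():
            if ds not in hit and any(p in hit for p in parents):
                hit.add(ds)
                changed = True
    return hit - {bad_dataset}

print(sorted(affected_by("raw_run42", derived_from)))
# ['calibrated_run42', 'catalog_2002', 'strain_run42']
```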
The Chimera Virtual Data System (GriPhyN Project)
Virtual data catalog
– Transformations, derivations, data (a toy sketch follows below)
Virtual data language (VDL)
– Data definition + query
Applications include browsers and data analysis applications
(Architecture diagram: Virtual Data Applications use the Virtual Data Language for definition and query; a VDL Interpreter manipulates derivations and transformations; the Virtual Data Catalog implements the Chimera Virtual Data Schema and is accessed via SQL; Task Graphs, i.e. compute and data movement tasks with dependencies, are handed to Data Grid Resources for distributed execution and data management.)
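To make the catalog idea concrete, here is a toy sketch (invented names; it mimics the structure, not the real VDL syntax or Chimera schema): transformations and derivations are recorded, and a requested dataset is expanded into a dependency-ordered list of compute tasks.

```python
# Toy illustration (not actual Chimera/VDL): a catalog of transformations
# and derivations, expanded into an ordered list of compute tasks.
transformations = {"calibrate": "calib.sh", "reconstruct": "reco.sh"}

derivations = {   # output dataset -> (transformation, input datasets)
    "calibrated_run42": ("calibrate", ["raw_run42"]),
    "reco_run42": ("reconstruct", ["calibrated_run42"]),
}

def task_graph(target, existing=("raw_run42",)):
    """Return compute tasks, dependencies first, needed to produce `target`."""
    tasks = []
    def expand(ds):
        if ds in existing or ds not in derivations:
            return                      # already materialized, or raw input
        tr, inputs = derivations[ds]
        for inp in inputs:
            expand(inp)
        tasks.append((transformations[tr], inputs, ds))
    expand(target)
    return tasks

for executable, inputs, output in task_graph("reco_run42"):
    print(f"run {executable} on {inputs} -> {output}")
```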
Early GriPhyN Challenge Problem: CMS Data Reconstruction
(Scott Koranda, Miron Livny, and others)
(Diagram nodes: a master Condor job running at Caltech on a Caltech workstation; a secondary Condor job on the Wisconsin (WI) pool; an NCSA Linux cluster; NCSA UniTree, a GridFTP-enabled FTP server.)
2) Launch secondary job on the WI pool; input files staged via Globus GASS
3) 100 Monte Carlo jobs run on the Wisconsin Condor pool
4) 100 data files transferred via GridFTP, ~1 GB each
5) Secondary job reports complete to master
6) Master starts reconstruction jobs via the Globus jobmanager on the cluster
7) GridFTP fetches data from UniTree
8) Processed Objectivity database stored to UniTree
9) Reconstruction job reports complete to master
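One way to view this pipeline is as a small dependency graph of steps; the sketch below (hypothetical step names, not the actual Condor-G submit files) simply encodes the ordering above and replays it in dependency order.

```python
# Hypothetical sketch: the reconstruction pipeline as a dependency graph,
# executed in an order that respects the dependencies (not real Condor-G).
deps = {
    "launch_secondary_wi": [],
    "run_100_montecarlo": ["launch_secondary_wi"],
    "gridftp_100_files": ["run_100_montecarlo"],
    "secondary_reports": ["gridftp_100_files"],
    "start_reconstruction": ["secondary_reports"],
    "fetch_from_unitree": ["start_reconstruction"],
    "store_objectivity_db": ["fetch_from_unitree"],
    "reco_reports_done": ["store_objectivity_db"],
}

done = []
while len(done) < len(deps):
    for step, needs in deps.items():
        if step not in done and all(n in done for n in needs):
            print("run:", step)
            done.append(step)
```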
Trace of a Condor-G Physics Run
(Figure: trace over time, axis values 0 to 120, of Pre/Simulation/Post jobs on the UW Condor pool and of ooHits and ooDigis jobs at NCSA; a delay in the run was due to a script error.)
Knowledge-based Data Grid Roadmap (Reagan Moore, SDSC)
(Diagram: a layered model in which Data, Information, and Knowledge are linked by attributes and semantics, with ingest, management, and access services at each layer. The data layer provides storage with replicas and persistent IDs via a data handling system; the information repository organizes fields, containers, and folders and supports attribute-based and feature-based (model-based) query; the knowledge repository for rules captures relationships between concepts and supports knowledge- or topic-based query and browse. Technologies named include MCAT/HDF, Grids, XML DTD, SDLIP, XTM DTD, and rules/KQL.)
New Programs
U.K. eScience program
EU 6th Framework
U.S. Committee on Cyberinfrastructure
Japanese Grid initiative
U.S. Cyberinfrastructure: Draft Recommendations
A new INITIATIVE to revolutionize science and engineering research at NSF and worldwide, capitalizing on new computing and communications opportunities
21st-century cyberinfrastructure includes supercomputing, but also massive storage, networking, software, collaboration, visualization, and human resources
– Current centers (NCSA, SDSC, PSC) are a key resource for the INITIATIVE
– Budget estimate: incremental $650M/year (continuing)
An INITIATIVE OFFICE with a highly placed, credible leader empowered to
– Initiate competitive, discipline-driven, path-breaking applications of cyberinfrastructure within NSF that contribute to the shared goals of the INITIATIVE
– Coordinate policy and allocations across fields and projects, with participants across NSF directorates, federal agencies, and international e-science
– Develop high-quality middleware and other software that is essential and special to scientific research
– Manage individual computational, storage, and networking resources at least 100x larger than individual projects or universities can provide
Summary
Technology exponentials are changing the shape of scientific investigation & knowledge
– More computing, even more data, yet more networking
The Grid: resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
Petascale Virtual Data Grids represent the future of science infrastructure
– Data, programs, computers, computations as community resources
For More Information
Grid Book
– www.mkp.com/grids
The Globus Project™
– www.globus.org
Global Grid Forum
– www.gridforum.org
TeraGrid
– www.teragrid.org
EU DataGrid
– www.eu-datagrid.org
GriPhyN
– www.griphyn.org
iVDGL
– www.ivdgl.org
Background papers
– www.mcs.anl.gov/~foster