Global Data Grids for 21st Century Science

Paul Avery
University of Florida
http://www.phys.ufl.edu/~avery/
[email protected]

GriPhyN/iVDGL Outreach Workshop
University of Texas, Brownsville
March 1, 2002
What is a Grid?

Grid: geographically distributed computing resources configured for coordinated use
- Physical resources & networks provide the raw capability
- "Middleware" software ties it together
What Are Grids Good For?

Climate modeling
- Climate scientists visualize, annotate, & analyze terabytes of simulation data

Biology
- A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

High energy physics
- 3,000 physicists worldwide pool petaflops of CPU resources to analyze petabytes of data

Engineering
- Civil engineers collaborate to design, execute, & analyze shake-table experiments
- A multidisciplinary analysis in aerospace couples code and data in four companies
From Ian Foster
What Are Grids Good For?

Application service providers
- A home user invokes architectural design functions at an application service provider...
- ...which purchases computing cycles from cycle providers

Commercial
- Scientists at a multinational toy company design a new product

Cities, communities
- An emergency response team couples real-time data, weather models, and population data
- A community group pools members' PCs to analyze alternative designs for a local road

Health
- Hospitals and international agencies collaborate on stemming a major disease outbreak

From Ian Foster
Proto-Grid: SETI@home

- Community: SETI researchers + enthusiasts
- Arecibo radio data sent to users (250 KB data chunks)
- Over 2M PCs used
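As an illustration of the model (hypothetical names and types, not SETI@home's actual protocol), here is a server-side sketch of cutting a recording into ~250 KB work units and handing them out to volunteer PCs:

```python
# Hypothetical sketch of SETI@home-style work distribution; not the real protocol.
from dataclasses import dataclass
from typing import Iterator

CHUNK_SIZE = 250 * 1024  # ~250 KB work units, as on the slide

@dataclass
class WorkUnit:
    unit_id: int
    payload: bytes  # a slice of raw radio data

def split_into_work_units(raw_data: bytes) -> Iterator[WorkUnit]:
    """Cut a large recording into fixed-size chunks that volunteers can analyze."""
    for i in range(0, len(raw_data), CHUNK_SIZE):
        yield WorkUnit(unit_id=i // CHUNK_SIZE, payload=raw_data[i:i + CHUNK_SIZE])

def assign_next_unit(pending: list[WorkUnit]) -> WorkUnit | None:
    """Hand the next pending unit to a volunteer PC; None when the queue is empty."""
    return pending.pop(0) if pending else None
```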
More Advanced Proto-Grid: Evaluation of AIDS Drugs

Community
- Research group (Scripps)
- 1000s of PC owners
- Vendor (Entropia)

Common goal
- Drug design
- Advance AIDS research
Why Grids?

Resources for complex problems are distributed
- Advanced scientific instruments (accelerators, telescopes, ...)
- Storage and computing
- Groups of people

Communities require access to common services
- Scientific collaborations (physics, astronomy, biology, engineering, ...)
- Government agencies
- Health care organizations, large corporations, ...

Goal is to build "Virtual Organizations"
- Make all community resources available to any VO member
- Leverage strengths at different institutions
- Add people & resources dynamically
Grids: Why Now?

Moore's law improvements in computing
- Highly functional end systems

Burgeoning wired and wireless Internet connections
- Universal connectivity

Changing modes of working and problem solving
- Teamwork, computation

Network exponentials
- (Next slide)
Network Exponentials & Collaboration

Network vs. computer performance
- Computer speed doubles every 18 months
- Network speed doubles every 9 months
- Difference = order of magnitude per 5 years

1986 to 2000
- Computers: x 500
- Networks: x 340,000

2001 to 2010?
- Computers: x 60
- Networks: x 4,000

Scientific American (Jan. 2001)
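A quick check (not on the slide) of the "order of magnitude per 5 years" claim, using the two doubling times quoted above:

```python
# Gap between network and computer performance implied by the doubling times above.
years = 5
computer_growth = 2 ** (12 * years / 18)   # computer speed doubles every 18 months
network_growth = 2 ** (12 * years / 9)     # network speed doubles every 9 months
print(network_growth / computer_growth)    # ~10, i.e. one order of magnitude per 5 years
```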
Grid Challenges

Overall goal: coordinated sharing of resources

Technical problems to overcome
- Authentication, authorization, policy, auditing
- Resource discovery, access, allocation, control
- Failure detection & recovery
- Resource brokering

Additional issue: lack of central control & knowledge
- Preservation of local site autonomy
- Policy discovery and negotiation important
Layered Grid Architecture (Analogy to Internet Architecture)

Application
- User: specialized services; application-specific distributed services
- Collective: managing multiple resources; ubiquitous infrastructure services
- Resource: sharing single resources; negotiating access, controlling use
- Connectivity: talking to things; communications, security
- Fabric: controlling things locally; accessing, controlling resources

(Figure also shows the Internet Protocol Architecture for comparison: Application, Transport, Internet, Link.)
From Ian Foster
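For reference, the layers read naturally bottom-up; a trivial Python rendering of the stack as data (the contents are taken directly from the slide):

```python
# The five Grid layers from the slide, listed bottom-up with their one-line roles.
GRID_LAYERS = [
    ("Fabric",       "controlling things locally: accessing, controlling resources"),
    ("Connectivity", "talking to things: communications, security"),
    ("Resource",     "sharing single resources: negotiating access, controlling use"),
    ("Collective",   "managing multiple resources: ubiquitous infrastructure services"),
    ("User",         "specialized services: application-specific distributed services"),
]

for name, role in GRID_LAYERS:
    print(f"{name:<12} {role}")
```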
Globus Project and Toolkit

Globus Project™ (Argonne + USC/ISI)
- O(40) researchers & developers
- Identify and define core protocols and services

Globus Toolkit™ 2.0
- A major product of the Globus Project
- Reference implementation of core protocols & services
- Growing open source developer community

Globus Toolkit used by all Data Grid projects today
- US: GriPhyN, PPDG, TeraGrid, iVDGL
- EU: EU DataGrid and national projects

Recent announcement of applying "web services" to Grids (GT 3.0)
- Keeps Grids in the commercial mainstream
Globus General Approach

Define Grid protocols & APIs
- Protocol-mediated access to remote resources
- Integrate and extend existing standards

Develop reference implementation
- Open source Globus Toolkit
- Client & server SDKs, services, tools, etc.

Grid-enable a wide variety of tools (see the sketch below)
- Globus Toolkit
- FTP, SSH, Condor, SRB, MPI, ...

Learn about real-world problems
- Deployment
- Testing
- Applications

(Figure: applications and diverse global services built on core services over diverse resources.)
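As an example of such Grid-enabled tooling, file movement typically goes through GridFTP; a hedged sketch of driving a transfer from Python by shelling out to the Globus Toolkit's globus-url-copy client (the endpoints below are placeholders, and a valid proxy credential is assumed):

```python
# Sketch: GridFTP transfer via the Globus Toolkit 2 command-line client.
# Host names and paths below are placeholders, not real endpoints.
import subprocess

src = "gsiftp://source.example.edu/data/run42/events.dat"
dst = "gsiftp://dest.example.edu/archive/run42/events.dat"

# Assumes a valid proxy credential already exists (e.g. created with grid-proxy-init).
subprocess.run(["globus-url-copy", src, dst], check=True)
```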
Data Intensive Science: 2000-2015

Scientific discovery increasingly driven by IT
- Computationally intensive analyses
- Massive data collections
- Data distributed across networks of varying capability
- Geographically distributed collaboration

Dominant factor: data growth (1 Petabyte = 1000 TB)
- 2000: ~0.5 Petabyte
- 2005: ~10 Petabytes
- 2010: ~100 Petabytes
- 2015: ~1000 Petabytes?

How to collect, manage, access and interpret this quantity of data?

Drives demand for "Data Grids" to handle the additional dimension of data access & movement
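As a quick check (not on the slide), the projected jump from ~0.5 PB in 2000 to ~1000 PB in 2015 corresponds to roughly a 1.7x increase per year, i.e. data volumes doubling about every 16 months:

```python
# Implied growth rate of the projection: ~0.5 PB in 2000 to ~1000 PB in 2015.
import math

factor = 1000 / 0.5                    # ~2000x over 15 years
annual = factor ** (1 / 15)            # ~1.66, i.e. roughly +66% per year
doubling_years = math.log(2) / math.log(annual)
print(f"{annual:.2f}x per year, doubling every {doubling_years * 12:.0f} months")
# -> about 1.66x per year, doubling roughly every 16 months
```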
Data Intensive Physical Sciences

High energy & nuclear physics
- Including new experiments at CERN's Large Hadron Collider

Gravity wave searches
- LIGO, GEO, VIRGO

Astronomy: digital sky surveys
- Sloan Digital Sky Survey, VISTA, other gigapixel arrays
- "Virtual" observatories (multi-wavelength astronomy)

Time-dependent 3-D systems (simulation & data)
- Earth observation, climate modeling
- Geophysics, earthquake modeling
- Fluids, aerodynamic design
- Pollutant dispersal scenarios
Data Intensive Biology and Medicine

Medical data
- X-ray, mammography data, etc. (many petabytes)
- Digitizing patient records (ditto)

X-ray crystallography
- Bright X-ray sources, e.g. Argonne Advanced Photon Source

Molecular genomics and related disciplines
- Human Genome, other genome databases
- Proteomics (protein structure, activities, ...)
- Protein interactions, drug delivery

Brain scans (3-D, time dependent)

Virtual Population Laboratory (proposed)
- Database of populations, geography, transportation corridors
- Simulate likely spread of disease outbreaks

Craig Venter keynote @ SC2001
Example: High Energy Physics

"Compact" Muon Solenoid at the LHC (CERN)
(Figure: the CMS detector, with a Smithsonian standard man shown for scale.)
LHC Computing Challenges

- 1800 physicists, 150 institutes, 32 countries
- Complexity of the LHC interaction environment & resulting data
- Scale: petabytes of data per year (100 PB by ~2010-12)
- Global distribution of people and resources
Global LHC Data Grid

- Tier0: CERN
- Tier1: National Lab
- Tier2: Regional Center (University, etc.)
- Tier3: University workgroup
- Tier4: Workstation

(Figure: the Tier 0 center at CERN fans out to Tier 1 national centers, each serving several Tier 2 centers, which in turn serve Tier 3 and Tier 4 sites.)

Key ideas:
- Hierarchical structure
- Tier2 centers
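As an illustration, the tier hierarchy can be modeled as a simple tree; a minimal Python sketch in which the node names simply echo the tier definitions above:

```python
# Illustrative sketch of the tiered LHC Data Grid as a tree; the structure follows
# the slide, and the node names simply repeat the tier definitions.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    tier: int                      # 0 = CERN, 1 = national lab, 2 = regional center, ...
    children: list["Site"] = field(default_factory=list)

tier2 = Site("Regional Center", 2, [Site("University workgroup", 3, [Site("Workstation", 4)])])
tier1 = Site("National Lab", 1, [tier2])
cern = Site("CERN", 0, [tier1])

def walk(site: Site, depth: int = 0) -> None:
    """Print the hierarchy, one tier per indentation level."""
    print("  " * depth + f"Tier{site.tier}: {site.name}")
    for child in site.children:
        walk(child, depth + 1)

walk(cern)
```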
Global LHC Data Grid

(Figure, reconstructed from the diagram labels:)
- Online System -> CERN Computer Center (Tier 0 +1, >20 TIPS) at ~100 MBytes/sec; raw detector output is ~PBytes/sec before triggering
- CERN -> Tier 1 national centers (USA, France, Italy, UK) at 2.5 Gbits/sec
- Tier 1 -> Tier 2 centers at ~622 Mbits/sec
- Tier 2 -> Tier 3 institutes (~0.25 TIPS each, with physics data caches) at 100-1000 Mbits/sec
- Tier 4: workstations and other portals

Event characteristics
- Bunch crossing every 25 nsec
- 100 triggers per second
- Each event is ~1 MByte in size

Analysis model
- Physicists work on analysis "channels"
- Each institute has ~10 physicists working on one or more channels

Resource balance
- Experiment CERN/outside resource ratio ~1:2
- Tier0 : (all Tier1) : (all Tier2) ~ 1:1:1
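These numbers hang together: 100 triggers per second at ~1 MB per event is ~100 MB/s into the Tier 0 center, and, assuming ~1e7 seconds of data-taking per year (a common rule of thumb, not stated on the slide), about a petabyte of raw triggered data per year:

```python
# Rough consistency check of the rates in the figure.
event_size_mb = 1.0          # ~1 MByte per event
trigger_rate_hz = 100        # 100 triggers per second

rate_mb_per_s = event_size_mb * trigger_rate_hz
print(rate_mb_per_s)         # ~100 MB/s, matching the link into the Tier 0 center

# Assuming ~1e7 seconds of data-taking per year (an assumption, not on the slide):
yearly_pb = rate_mb_per_s * 1e7 / 1e9
print(yearly_pb)             # ~1 PB/year of raw triggered data
```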
Sloan Digital Sky Survey Data Grid
LIGO (Gravity Wave) Data Grid

(Figure: the Hanford and Livingston observatories, Caltech, and MIT, linked by OC3, OC12, and OC48 circuits over Internet2/Abilene; a Tier1 center plus LSC Tier2 sites.)
Data Grid Projects

- Particle Physics Data Grid (US, DOE): Data Grid applications for HENP experiments
- GriPhyN (US, NSF): petascale Virtual-Data Grids
- iVDGL (US, NSF): global Grid laboratory
- TeraGrid (US, NSF): distributed supercomputing resources (13 TFlops)
- European Data Grid (EU, EC): Data Grid technologies, EU deployment
- CrossGrid (EU, EC): Data Grid technologies, EU
- DataTAG (EU, EC): transatlantic network, Grid applications
- Japanese Grid Project (APGrid?) (Japan): Grid deployment throughout Japan

Common threads:
- Collaborations of application scientists & computer scientists
- Infrastructure development & deployment
- Globus based
Coordination of U.S. Grid Projects

Three U.S. projects
- PPDG: HENP experiments, short-term tools, deployment
- GriPhyN: Data Grid research, Virtual Data, VDT deliverable
- iVDGL: global Grid laboratory

Coordination of PPDG, GriPhyN, iVDGL
- Common experiments + personnel, management integration
- iVDGL as "joint" PPDG + GriPhyN laboratory
- Joint meetings (Jan. 2002, April 2002, Sept. 2002)
- Joint architecture creation (GriPhyN, PPDG)
- Adoption of VDT as common core Grid infrastructure
- Common Outreach effort (GriPhyN + iVDGL)

New TeraGrid project (Aug. 2001)
- 13 TFlops across 4 sites, 40 Gb/s networking
- Goal: integrate into iVDGL, adopt VDT, common Outreach
Worldwide Grid Coordination

Two major clusters of projects
- "US based": GriPhyN Virtual Data Toolkit (VDT)
- "EU based": different packaging of similar components
GriPhyN = App. Science + CS + Grids

GriPhyN = Grid Physics Network
- US-CMS (high energy physics)
- US-ATLAS (high energy physics)
- LIGO/LSC (gravity wave research)
- SDSS (Sloan Digital Sky Survey)
- Strong partnership with computer scientists

Design and implement production-scale grids
- Develop common infrastructure, tools and services (Globus based)
- Integration into the 4 experiments
- Broad application to other sciences via the "Virtual Data Toolkit"
- Strong outreach program

Multi-year project
- R&D for grid architecture (funded at $11.9M + $1.6M)
- Integrate Grid infrastructure into experiments through VDT
GriPhyN Institutions

U Florida, U Chicago, Boston U, Caltech, U Wisconsin-Madison, USC/ISI, Harvard, Indiana, Johns Hopkins, Northwestern, Stanford, U Illinois at Chicago, U Penn, U Texas Brownsville, U Wisconsin-Milwaukee, UC Berkeley, UC San Diego, San Diego Supercomputer Center, Lawrence Berkeley Lab, Argonne, Fermilab, Brookhaven
GriPhyN: PetaScale Virtual-Data Grids

(Figure: production teams, individual investigators, and workgroups use interactive user tools that drive virtual data tools, request planning & scheduling tools, and request execution & management tools. These sit on top of resource management, security and policy, and other Grid services, which tie together distributed resources (code, storage, CPUs, networks), transforms, and raw data sources, at a scale of ~1 Petaflop and ~100 Petabytes.)
GriPhyN Research Agenda

Virtual Data technologies (see figure)
- Derived data, calculable via algorithm
- Instantiated 0, 1, or many times (e.g., caches)
- "Fetch value" vs. "execute algorithm"
- Very complex (versions, consistency, cost calculation, etc.)

LIGO example
- "Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year"

For each requested data value, need to
- Locate the item's location and algorithm
- Determine the costs of fetching vs. calculating (sketched below)
- Plan the data movements & computations required to obtain results
- Execute the plan
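A minimal sketch of the "fetch value vs. execute algorithm" decision for one requested data value; the class, function names, and cost figures are hypothetical, not GriPhyN's actual planner interface:

```python
# Hypothetical sketch of a virtual-data decision: is it cheaper to fetch an
# existing instance of a derived product or to re-run its transformation?
from dataclasses import dataclass

@dataclass
class DerivedItem:
    name: str
    replica_cost: float | None   # estimated cost to fetch a cached copy; None if none exists
    compute_cost: float          # estimated cost to re-derive it from raw data

def plan(item: DerivedItem) -> str:
    """Return 'fetch' or 'compute' for one requested data value."""
    if item.replica_cost is not None and item.replica_cost < item.compute_cost:
        return "fetch"
    return "compute"

# E.g. a strain time series around one gamma-ray burst, with made-up costs:
print(plan(DerivedItem("strain[GRB-017, +/-2min]", replica_cost=None, compute_cost=40.0)))
# -> 'compute' (never materialized, so it must be derived)
```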
Virtual Data in Action

A data request may
- Compute locally
- Compute remotely
- Access local data
- Access remote data

Scheduling based on (illustrated below)
- Local policies
- Global policies
- Cost

(Figure: requests flow among major facilities/archives, regional facilities/caches, and local facilities/caches; a "fetch item" can be satisfied at any level.)
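Continuing the sketch above, the placement decision weighs where a request should run, subject to policy; all sites, policies, and costs below are invented for illustration:

```python
# Hypothetical sketch of cost- and policy-based placement across the facility tiers
# shown in the figure; every site name and number here is made up.
candidates = [
    # (site,              allowed by policy?, estimated cost)
    ("local cache",       True,               5.0),   # data not present locally, so cost includes staging
    ("regional facility", True,               3.5),
    ("major archive",     False,              1.0),   # cheapest, but local policy forbids running there
]

allowed = [(site, cost) for site, ok, cost in candidates if ok]
best_site, best_cost = min(allowed, key=lambda pair: pair[1])
print(best_site, best_cost)   # -> 'regional facility' 3.5
```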
GriPhyN Research Agenda (cont.)

Execution management
- Co-allocation of resources (CPU, storage, network transfers)
- Fault tolerance, error reporting
- Interaction, feedback to planning

Performance analysis (with PPDG)
- Instrumentation and measurement of all grid components
- Understand and optimize grid performance

Virtual Data Toolkit (VDT)
- VDT = virtual data services + virtual data tools
- One of the primary deliverables of the R&D effort
- Technology transfer mechanism to other scientific domains
GriPhyN/PPDG Data Grid Architecture

The Application submits a request to the Planner, which produces a DAG; the Executor runs that DAG, drawing on Catalog Services, Info Services, Policy/Security, Monitoring, Replica Management, and a Reliable Transfer Service, on top of Compute and Storage Resources.

Initial solutions (operational today) come largely from Globus and related tools: DAGMan and Kangaroo for execution, GRAM for compute resources, GridFTP and SRM (with GRAM) for storage and reliable transfer, GSI and CAS for security/policy, MDS for information and monitoring, GDMP for replica management, and MCAT plus the GriPhyN catalogs for cataloging. (A small illustration of the planner-to-executor DAG follows below.)
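The request flowing from Planner to Executor is a directed acyclic graph of jobs (run by DAGMan in the initial solution). Below is a minimal illustration of such a dependency graph and the execution order it implies; the job names and the plain-dict representation are placeholders, not the DAGMan submit-file format:

```python
# Illustrative sketch of the DAG a planner might hand to the executor.
dag = {
    # job               depends on
    "stage_input":      [],
    "transform_F":      ["stage_input"],
    "transform_G":      ["stage_input"],
    "merge_results":    ["transform_F", "transform_G"],
    "register_output":  ["merge_results"],
}

def topological_order(graph: dict[str, list[str]]) -> list[str]:
    """Order jobs so every job runs after all of its dependencies."""
    done, order = set(), []
    def visit(job: str) -> None:
        if job in done:
            return
        for dep in graph[job]:
            visit(dep)
        done.add(job)
        order.append(job)
    for job in graph:
        visit(job)
    return order

print(topological_order(dag))
# -> ['stage_input', 'transform_F', 'transform_G', 'merge_results', 'register_output']
```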
Catalog Architecture

Transparency with respect to materialization
- Derived Data Catalog (Id, Trans, Param, Name): e.g. i1 / F / X / F.X; i2 / F / Y / F.Y; i10 / G / P / G(P).Y
- Transformation Catalog (Trans, Prog, Cost): e.g. F / URL:f / 10; G / URL:g / 20; the URLs give program locations in program storage
- Derived Metadata Catalog: application-specific attributes -> derived item ids (e.g. i2, i10); updated upon materialization

Transparency with respect to location
- Metadata Catalog (Name, Logical Object Name): e.g. X / logO1; Y / logO2; F.X / logO3; G(1).Y / logO4
- Replica Catalog (Logical Container Name, Physical File Names): e.g. logC1 -> URL1; logC2 -> URL2, URL3; logC3 -> URL4; logC4 -> URL5, URL6; the URLs give physical file locations
- Object names pass through GCMS, linking the metadata and replica catalogs
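A hedged sketch of how the two kinds of transparency compose: the metadata and derived-data catalogs resolve an application-level name to a logical object and, if needed, a transformation, while the replica catalog maps logical names to physical URLs. The entries below reuse the figure's examples; the lookup functions are invented for illustration:

```python
# Toy versions of the catalogs in the figure; entries mirror the slide's examples,
# while the lookup functions themselves are invented for illustration.
metadata_catalog = {"X": "logO1", "Y": "logO2", "F.X": "logO3", "G(1).Y": "logO4"}
derived_data_catalog = {"i2": ("F", "Y", "F.Y"), "i10": ("G", "P", "G(P).Y")}
transformation_catalog = {"F": ("URL:f", 10), "G": ("URL:g", 20)}  # program URL, cost
replica_catalog = {"logC1": ["URL1"], "logC2": ["URL2", "URL3"],
                   "logC3": ["URL4"], "logC4": ["URL5", "URL6"]}

def physical_urls(logical_container: str) -> list[str]:
    """Location transparency: logical container name -> candidate physical URLs."""
    return replica_catalog.get(logical_container, [])

def derivation(item_id: str) -> tuple[str, int]:
    """Materialization transparency: derived item -> (program URL, estimated cost)."""
    trans, _param, _name = derived_data_catalog[item_id]
    return transformation_catalog[trans]

print(physical_urls("logC2"))   # ['URL2', 'URL3']
print(derivation("i10"))        # ('URL:g', 20)
```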
iVDGL: A Global Grid Laboratory

International Virtual-Data Grid Laboratory
- A global Grid laboratory (US, EU, South America, Asia, ...)
- A place to conduct Data Grid tests "at scale"
- A mechanism to create common Grid infrastructure
- A facility to perform production exercises for LHC experiments
- A laboratory for other disciplines to perform Data Grid tests
- A focus of outreach efforts to small institutions

Funded for $13.65M by NSF

"We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science."
From NSF proposal, 2001
iVDGL Components

Computing resources
- Tier1, Tier2, Tier3 sites

Networks
- USA (TeraGrid, Internet2, ESnet), Europe (Géant, ...)
- Transatlantic (DataTAG), transpacific, AMPATH, ...

Grid Operations Center (GOC)
- Indiana (2 people)
- Joint work with TeraGrid on GOC development

Computer science support teams
- Support, test, upgrade GriPhyN Virtual Data Toolkit

Outreach effort
- Integrated with GriPhyN

Coordination, interoperability
Current iVDGL Participants

Initial experiments (funded by NSF proposal)
- CMS, ATLAS, LIGO, SDSS, NVO

U.S. universities and laboratories
- (Next slide)

Partners
- TeraGrid
- EU DataGrid + EU national projects
- Japan (AIST, TITECH)
- Australia

Complementary EU project: DataTAG
- 2.5 Gb/s transatlantic network
U.S. iVDGL Proposal Participants

- U Florida: CMS
- Caltech: CMS, LIGO
- UC San Diego: CMS, CS
- Indiana U: ATLAS, GOC
- Boston U: ATLAS
- U Wisconsin-Milwaukee: LIGO
- Penn State: LIGO
- Johns Hopkins: SDSS, NVO
- U Chicago/Argonne: CS
- U Southern California: CS
- U Wisconsin-Madison: CS
- Salish Kootenai: Outreach, LIGO
- Hampton U: Outreach, ATLAS
- U Texas Brownsville: Outreach, LIGO
- Fermilab: CMS, SDSS, NVO
- Brookhaven: ATLAS
- Argonne Lab: ATLAS, CS

(Figure legend: T2 / Software; CS support; T3 / Outreach; T1 / Labs (funded elsewhere).)
Initial US-iVDGL Data Grid

(Map of initial sites: Fermilab, BNL, UCSD, Florida, Wisconsin, Indiana, BU, Caltech, JHU, SKC, Brownsville, Hampton, PSU. Legend: Tier1 (FNAL), proto-Tier2, Tier3 university. Other sites to be added in 2002.)
iVDGL Map (2002-2003)

(Map legend: Tier0/1 facility, Tier2 facility, Tier3 facility; 10 Gbps, 2.5 Gbps, 622 Mbps, and other links; DataTAG and SURFnet connections.)

Later: Brazil, Pakistan, Russia, China
Summary

- Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing
- The iVDGL will provide vast experience for new collaborations
- Many challenges during the coming transition
- New grid projects will provide rich experience and lessons
- Difficult to predict the situation even 3-5 years ahead
Grid References

- Grid Book: www.mkp.com/grids
- Globus: www.globus.org
- Global Grid Forum: www.gridforum.org
- TeraGrid: www.teragrid.org
- EU DataGrid: www.eu-datagrid.org
- PPDG: www.ppdg.net
- GriPhyN: www.griphyn.org
- iVDGL: www.ivdgl.org