
Page 1: Global Data Grids for 21st Century Science

Paul Avery, University of Florida
http://www.phys.ufl.edu/~avery/
[email protected]

GriPhyN/iVDGL Outreach Workshop
University of Texas, Brownsville
March 1, 2002

Page 2: What is a Grid?

Grid: geographically distributed computing resources configured for coordinated use
Physical resources & networks provide raw capability
“Middleware” software ties it together

Page 3: What Are Grids Good For?

Climate modeling
Climate scientists visualize, annotate, & analyze Terabytes of simulation data

Biology
A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

High energy physics
3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data

Engineering
Civil engineers collaborate to design, execute, & analyze shake table experiments
A multidisciplinary analysis in aerospace couples code and data in four companies

From Ian Foster

Page 4: What Are Grids Good For? (cont.)

Application Service Providers
A home user invokes architectural design functions at an application service provider…
…which purchases computing cycles from cycle providers

Commercial
Scientists at a multinational toy company design a new product

Cities, communities
An emergency response team couples real-time data, weather models, and population data
A community group pools members’ PCs to analyze alternative designs for a local road

Health
Hospitals and international agencies collaborate on stemming a major disease outbreak

From Ian Foster

Page 5: Proto-Grid: SETI@home

Community: SETI researchers + enthusiasts
Arecibo radio data sent to users (250 KB data chunks)
Over 2M PCs used

Page 6: More Advanced Proto-Grid: Evaluation of AIDS Drugs

Community
Research group (Scripps)
1000s of PC owners
Vendor (Entropia)

Common goal
Drug design
Advance AIDS research

Page 7: Why Grids?

Resources for complex problems are distributed
Advanced scientific instruments (accelerators, telescopes, …)
Storage and computing
Groups of people

Communities require access to common services
Scientific collaborations (physics, astronomy, biology, engineering, …)
Government agencies
Health care organizations, large corporations, …

Goal is to build “Virtual Organizations”
Make all community resources available to any VO member
Leverage strengths at different institutions
Add people & resources dynamically

Page 8: Grids: Why Now?

Moore’s law improvements in computing
Highly functional end systems

Burgeoning wired and wireless Internet connections
Universal connectivity

Changing modes of working and problem solving
Teamwork, computation

Network exponentials (next slide)

Page 9: Network Exponentials & Collaboration

Network vs. computer performance
Computer speed doubles every 18 months
Network speed doubles every 9 months
Difference = order of magnitude per 5 years

1986 to 2000: computers × 500, networks × 340,000
2001 to 2010?: computers × 60, networks × 4000

Scientific American (Jan-2001)
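A quick sanity check of these factors (a minimal Python sketch; the 18-month and 9-month doubling times are the figures quoted above):

    # Growth factor after a given number of years, for a doubling time in months.
    def growth(years, doubling_months):
        return 2 ** (years * 12 / doubling_months)

    print(round(growth(9, 18)))                 # computers, 2001-2010: 64 (slide: x 60)
    print(round(growth(9, 9)))                  # networks, 2001-2010: 4096 (slide: x 4000)
    print(round(growth(5, 9) / growth(5, 18)))  # gap after 5 years: ~10, one order of magnitude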

Page 10: Grid Challenges

Overall goal: coordinated sharing of resources

Technical problems to overcome
Authentication, authorization, policy, auditing
Resource discovery, access, allocation, control
Failure detection & recovery
Resource brokering (see the sketch below)

Additional issue: lack of central control & knowledge
Preservation of local site autonomy
Policy discovery and negotiation important
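To make “resource brokering” concrete, here is a toy Python sketch (all site names, policy sets, and job fields are hypothetical; real brokers layered on Globus services are far richer):

    # Toy resource broker: discover sites, filter by each site's own
    # admission policy (preserving local autonomy), pick the cheapest match.
    sites = [
        {"name": "site_a", "free_cpus": 64,  "admits": {"cms", "atlas"}, "cost": 1.0},
        {"name": "site_b", "free_cpus": 512, "admits": {"ligo"},         "cost": 0.5},
        {"name": "site_c", "free_cpus": 128, "admits": {"cms"},          "cost": 0.8},
    ]

    def broker(job):
        ok = [s for s in sites
              if job["vo"] in s["admits"] and s["free_cpus"] >= job["cpus"]]
        if not ok:
            raise RuntimeError("no site satisfies policy and resource requirements")
        return min(ok, key=lambda s: s["cost"])

    print(broker({"vo": "cms", "cpus": 100})["name"])  # -> site_c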

Page 11: Layered Grid Architecture (Analogy to Internet Architecture)

Grid layers, bottom to top:
Fabric (controlling things locally): accessing, controlling resources
Connectivity (talking to things): communications, security
Resource (sharing single resources): negotiating access, controlling use
Collective (managing multiple resources): ubiquitous infrastructure services
User (specialized services): application-specific distributed services
Application

[Diagram: these Grid layers drawn alongside the Internet Protocol Architecture layers: Link, Internet, Transport, Application]

From Ian Foster

Page 12: Globus Project and Toolkit

Globus Project™ (Argonne + USC/ISI)
O(40) researchers & developers
Identify and define core protocols and services

Globus Toolkit™ 2.0
A major product of the Globus Project
Reference implementation of core protocols & services
Growing open source developer community

Globus Toolkit used by all Data Grid projects today
US: GriPhyN, PPDG, TeraGrid, iVDGL
EU: EU DataGrid and national projects

Recent announcement of applying “web services” to Grids (GT 3.0)
Keeps Grids in the commercial mainstream

Page 13: Globus General Approach

Define Grid protocols & APIs
Protocol-mediated access to remote resources
Integrate and extend existing standards

Develop reference implementation
Open source Globus Toolkit
Client & server SDKs, services, tools, etc.

Grid-enable wide variety of tools
FTP, SSH, Condor, SRB, MPI, …

Learn about real world problems
Deployment
Testing
Applications

[Diagram: applications on top of diverse global services and core services, running over diverse resources]
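For a flavor of “protocol-mediated access to remote resources”: the Globus Toolkit ships a GridFTP client, globus-url-copy, which the minimal Python sketch below drives for a site-to-site transfer (the gsiftp:// endpoints are hypothetical, and a valid GSI proxy credential is assumed to exist):

    import subprocess

    # Copy a file between two GridFTP servers; the protocol, not a
    # site-specific API, mediates access to both storage resources.
    src = "gsiftp://tier1.example.org/data/run42/events.dat"
    dst = "gsiftp://tier2.example.edu/cache/run42/events.dat"
    subprocess.run(["globus-url-copy", src, dst], check=True)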

Page 14: Data Intensive Science: 2000-2015

Scientific discovery increasingly driven by IT
Computationally intensive analyses
Massive data collections
Data distributed across networks of varying capability
Geographically distributed collaboration

Dominant factor: data growth (1 Petabyte = 1000 TB)
2000: ~0.5 Petabyte
2005: ~10 Petabytes
2010: ~100 Petabytes
2015: ~1000 Petabytes?

How to collect, manage, access and interpret this quantity of data?

Drives demand for “Data Grids” to handle the additional dimension of data access & movement
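Taken at face value, those projections amount to a doubling of total data roughly every 16 months; a quick Python check using only the endpoints above:

    import math

    # 0.5 PB in 2000 growing to ~1000 PB in 2015.
    doublings = math.log2(1000 / 0.5)           # ~11 doublings in 15 years
    months_per_doubling = 15 * 12 / doublings   # ~16.4 months
    print(round(doublings, 1), round(months_per_doubling, 1))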

Page 15: Data Intensive Physical Sciences

High energy & nuclear physics
Including new experiments at CERN’s Large Hadron Collider

Gravity wave searches
LIGO, GEO, VIRGO

Astronomy: digital sky surveys
Sloan Digital Sky Survey, VISTA, other Gigapixel arrays
“Virtual” observatories (multi-wavelength astronomy)

Time-dependent 3-D systems (simulation & data)
Earth observation, climate modeling
Geophysics, earthquake modeling
Fluids, aerodynamic design
Pollutant dispersal scenarios

Page 16: Data Intensive Biology and Medicine

Medical data
X-ray, mammography data, etc. (many petabytes)
Digitizing patient records (ditto)

X-ray crystallography
Bright X-ray sources, e.g. Argonne Advanced Photon Source

Molecular genomics and related disciplines
Human Genome, other genome databases
Proteomics (protein structure, activities, …)
Protein interactions, drug delivery

Brain scans (3-D, time dependent)

Virtual Population Laboratory (proposed)
Database of populations, geography, transportation corridors
Simulate likely spread of disease outbreaks

Craig Venter keynote @ SC2001

Page 17: Example: High Energy Physics

“Compact” Muon Solenoid at the LHC (CERN)

[Image: the CMS detector shown to scale against the Smithsonian standard man]

Page 18: LHC Computing Challenges

1800 physicists, 150 institutes, 32 countries

Complexity of LHC interaction environment & resulting data
Scale: Petabytes of data per year (100 PB by ~2010-12)
Global distribution of people and resources

Page 19: Global LHC Data Grid

Tier 0: CERN
Tier 1: national lab
Tier 2: regional center (university, etc.)
Tier 3: university workgroup
Tier 4: workstation

[Diagram: Tier 0 (CERN) feeding Tier 1 centers, each serving several Tier 2 centers, with Tier 3 and Tier 4 sites below]

Key ideas:
Hierarchical structure
Tier2 centers

Page 20: Global LHC Data Grid (cont.)

[Diagram: data rates and flows through the tier hierarchy]
Detector → Online System: ~PBytes/sec
(bunch crossings every 25 nsec; 100 triggers per second; each event ~1 MByte)
Online System → CERN Computer Center (Tier 0 +1, > 20 TIPS): ~100 MBytes/sec
Tier 0 → Tier 1 centers (France, Italy, UK, USA): 2.5 Gbits/sec
Tier 1 → Tier 2 centers: ~622 Mbits/sec
Tier 2 → institutes (Tier 3, ~0.25 TIPS each, with physics data caches): 100 - 1000 Mbits/sec
Institutes → workstations, other portals (Tier 4)

Physicists work on analysis “channels”
Each institute has ~10 physicists working on one or more channels

Experiment CERN/Outside resource ratio ~1:2
Tier0 : (sum of Tier1) : (sum of Tier2) ~ 1:1:1
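The trigger parameters above fix the offline data rate; a minimal arithmetic check in Python (the 10^7 live seconds per year is a common accelerator rule of thumb, not stated on the slide):

    trigger_rate_hz = 100      # events written per second
    event_size_mb = 1.0        # ~1 MByte per event
    live_seconds = 1e7         # assumed running time per year

    rate_mb_per_s = trigger_rate_hz * event_size_mb
    print(rate_mb_per_s)                        # 100 MB/s into Tier 0
    print(rate_mb_per_s * live_seconds / 1e9)   # ~1 PB of raw data per year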

Page 21: Sloan Digital Sky Survey Data Grid

[Diagram]

Page 22: LIGO (Gravity Wave) Data Grid

[Diagram: Hanford and Livingston observatories feeding Caltech (Tier1) and MIT over OC3, OC12, and OC48 links via Internet2/Abilene, with LSC Tier2 sites attached]

Page 23: Data Grid Projects

Particle Physics Data Grid (US, DOE): Data Grid applications for HENP experiments
GriPhyN (US, NSF): Petascale Virtual-Data Grids
iVDGL (US, NSF): global Grid laboratory
TeraGrid (US, NSF): distributed supercomputing resources (13 TFlops)
European Data Grid (EU, EC): Data Grid technologies, EU deployment
CrossGrid (EU, EC): Data Grid technologies, EU
DataTAG (EU, EC): transatlantic network, Grid applications
Japanese Grid Project (APGrid?) (Japan): Grid deployment throughout Japan

Common threads:
Collaborations of application scientists & computer scientists
Infrastructure development & deployment
Globus based

Page 24: Coordination of U.S. Grid Projects

Three U.S. projects
PPDG: HENP experiments, short term tools, deployment
GriPhyN: Data Grid research, Virtual Data, VDT deliverable
iVDGL: global Grid laboratory

Coordination of PPDG, GriPhyN, iVDGL
Common experiments + personnel, management integration
iVDGL as “joint” PPDG + GriPhyN laboratory
Joint meetings (Jan. 2002, April 2002, Sept. 2002)
Joint architecture creation (GriPhyN, PPDG)
Adoption of VDT as common core Grid infrastructure
Common Outreach effort (GriPhyN + iVDGL)

New TeraGrid project (Aug. 2001)
13 TFlops across 4 sites, 40 Gb/s networking
Goal: integrate into iVDGL, adopt VDT, common Outreach

Page 25: Worldwide Grid Coordination

Two major clusters of projects
“US based”: GriPhyN Virtual Data Toolkit (VDT)
“EU based”: different packaging of similar components

Page 26: GriPhyN = App. Science + CS + Grids

GriPhyN = Grid Physics Network
US-CMS: high energy physics
US-ATLAS: high energy physics
LIGO/LSC: gravity wave research
SDSS: Sloan Digital Sky Survey
Strong partnership with computer scientists

Design and implement production-scale grids
Develop common infrastructure, tools and services (Globus based)
Integration into the 4 experiments
Broad application to other sciences via “Virtual Data Toolkit”
Strong outreach program

Multi-year project
R&D for grid architecture (funded at $11.9M + $1.6M)
Integrate Grid infrastructure into experiments through VDT

Page 27: GriPhyN Institutions

U Florida, U Chicago, Boston U, Caltech, U Wisconsin (Madison), USC/ISI, Harvard, Indiana, Johns Hopkins, Northwestern, Stanford, U Illinois at Chicago, U Penn, U Texas (Brownsville), U Wisconsin (Milwaukee), UC Berkeley, UC San Diego, San Diego Supercomputer Center, Lawrence Berkeley Lab, Argonne, Fermilab, Brookhaven

Page 28: GriPhyN: PetaScale Virtual-Data Grids

[Diagram: production teams and individual investigator workgroups drive interactive user tools, backed by virtual data tools, request planning & scheduling tools, and request execution & management tools; resource management, security and policy, and other Grid services mediate access to distributed resources (code, storage, CPUs, networks) and raw data sources, at the scale of ~1 Petaflop and ~100 Petabytes]

Page 29: GriPhyN Research Agenda

Virtual Data technologies (fig.)
Derived data, calculable via algorithm
Instantiated 0, 1, or many times (e.g., caches)
“Fetch value” vs. “execute algorithm”
Very complex (versions, consistency, cost calculation, etc.)

LIGO example
“Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year”

For each requested data value, need to (see the sketch below):
Locate item and algorithm
Determine costs of fetching vs. calculating
Plan data movements & computations required to obtain results
Execute the plan
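A minimal Python sketch of that fetch-vs-compute decision (the catalog contents, cost units, and function names are hypothetical, meant only to illustrate the planning step):

    # Toy virtual-data planner: materialize a derived item either by
    # fetching a cached replica or by executing its transformation.
    replicas = {"strain.grb17": {"site": "tier2_cache", "fetch_cost": 5}}
    transforms = {"strain.grb17": {"program": "compute_strain", "exec_cost": 20}}

    def plan(item):
        fetch, derive = replicas.get(item), transforms.get(item)
        if fetch and (not derive or fetch["fetch_cost"] <= derive["exec_cost"]):
            return ("fetch", fetch["site"])
        if derive:
            return ("execute", derive["program"])
        raise KeyError(f"{item}: no replica and no known transformation")

    print(plan("strain.grb17"))  # ('fetch', 'tier2_cache'): cheaper than recomputing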

Page 30: Virtual Data in Action

A data request may:
Compute locally
Compute remotely
Access local data
Access remote data

Scheduling based on:
Local policies
Global policies
Cost

[Diagram: requests resolved against local facilities & caches first, then regional facilities & caches, then major facilities & archives (“fetch item” at the nearest level)]

Page 31: GriPhyN Research Agenda (cont.)

Execution management
Co-allocation of resources (CPU, storage, network transfers)
Fault tolerance, error reporting
Interaction, feedback to planning

Performance analysis (with PPDG)
Instrumentation and measurement of all grid components
Understand and optimize grid performance

Virtual Data Toolkit (VDT)
VDT = virtual data services + virtual data tools
One of the primary deliverables of the R&D effort
Technology transfer mechanism to other scientific domains

Page 32: GriPhyN/PPDG Data Grid Architecture

[Diagram: the Application hands a DAG to the Planner, which hands a DAG to the Executor; the Executor drives compute and storage resources through a Reliable Transfer Service, and both consult shared services]

Initial operational solutions noted in the diagram:
Executor: DAGMan, Kangaroo
Catalog Services: MCAT; GriPhyN catalogs
Info Services: MDS
Policy/Security: GSI, CAS
Monitoring: MDS
Replica Management: GDMP
Compute Resource: GRAM
Storage Resource: GridFTP; GRAM; SRM
(Globus)

Page 33: Catalog Architecture

Transparency wrt materialization:

Derived Data Catalog (updated upon materialization)
Id | Trans | Param | Name
i1 | F | X | F.X
i2 | F | Y | F.Y
i10 | G | Y, P | G(P).Y

Transformation Catalog (transformation name → URLs for program location, in program storage)
Trans | Prog | Cost
F | URL:f | 10
G | URL:g | 20

Derived Metadata Catalog (application-specific attributes → derived-data ids)
App-specific attr. | id
… | i2, i10

Transparency wrt location:

Metadata Catalog (object name → logical object name)
Name | LObjN
X | logO1
Y | logO2
F.X | logO3
G(1).Y | logO4

Replica Catalog (logical container name → URLs for physical file location)
LCN | PFNs
logC1 | URL1
logC2 | URL2, URL3
logC3 | URL4
logC4 | URL5, URL6

(GCMS appears in the diagram at the object-name interfaces between catalogs)
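A minimal Python sketch of the location lookup these catalogs provide (collapsing the logical-object to logical-container step for brevity; the entries are the toy values from the diagram, not a real API):

    # Metadata Catalog: object name -> logical name (location-independent).
    metadata = {"X": "logC1", "Y": "logC2", "F.X": "logC3", "G(1).Y": "logC4"}
    # Replica Catalog: logical name -> physical file URLs (replicas).
    replica = {"logC1": ["URL1"], "logC2": ["URL2", "URL3"],
               "logC3": ["URL4"], "logC4": ["URL5", "URL6"]}

    def locate(obj_name):
        # Two catalogs give location transparency: the object name stays
        # stable while replicas move or multiply behind the logical name.
        return replica[metadata[obj_name]]

    print(locate("Y"))  # ['URL2', 'URL3']: fetch from whichever replica is best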

Page 34: iVDGL: A Global Grid Laboratory

International Virtual-Data Grid Laboratory
A global Grid laboratory (US, EU, South America, Asia, …)
A place to conduct Data Grid tests “at scale”
A mechanism to create common Grid infrastructure
A facility to perform production exercises for LHC experiments
A laboratory for other disciplines to perform Data Grid tests
A focus of outreach efforts to small institutions

Funded for $13.65M by NSF

“We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science.”
(From NSF proposal, 2001)

Page 35: iVDGL Components

Computing resources
Tier1, Tier2, Tier3 sites

Networks
USA (TeraGrid, Internet2, ESNET), Europe (Géant, …)
Transatlantic (DataTAG), Transpacific, AMPATH, …

Grid Operations Center (GOC)
Indiana (2 people)
Joint work with TeraGrid on GOC development

Computer science support teams
Support, test, upgrade GriPhyN Virtual Data Toolkit

Outreach effort
Integrated with GriPhyN

Coordination, interoperability

Page 36: Current iVDGL Participants

Initial experiments (funded by NSF proposal)
CMS, ATLAS, LIGO, SDSS, NVO

U.S. universities and laboratories
(Next slide)

Partners
TeraGrid
EU DataGrid + EU national projects
Japan (AIST, TITECH)
Australia

Complementary EU project: DataTAG
2.5 Gb/s transatlantic network

Page 37: U.S. iVDGL Proposal Participants

U Florida: CMS
Caltech: CMS, LIGO
UC San Diego: CMS, CS
Indiana U: ATLAS, GOC
Boston U: ATLAS
U Wisconsin, Milwaukee: LIGO
Penn State: LIGO
Johns Hopkins: SDSS, NVO
U Chicago/Argonne: CS
U Southern California: CS
U Wisconsin, Madison: CS
Salish Kootenai: Outreach, LIGO
Hampton U: Outreach, ATLAS
U Texas, Brownsville: Outreach, LIGO
Fermilab: CMS, SDSS, NVO
Brookhaven: ATLAS
Argonne Lab: ATLAS, CS

(Site categories in the original figure: T2 / Software; CS support; T3 / Outreach; T1 / Labs, funded elsewhere)

Page 38: Initial US-iVDGL Data Grid

[Map: Tier1 (FNAL), proto-Tier2, and Tier3 university sites: UCSD, Florida, Wisconsin, Fermilab, BNL, Indiana, BU, SKC, Brownsville, Hampton, PSU, JHU, Caltech. Other sites to be added in 2002.]

Page 39: iVDGL Map (2002-2003)

[Map: Tier0/1, Tier2, and Tier3 facilities linked at 10 Gbps, 2.5 Gbps, 622 Mbps, and other speeds, including DataTAG and SURFnet links]

Later: Brazil, Pakistan, Russia, China

Page 40: Summary

Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing

The iVDGL will provide vast experience for new collaborations

Many challenges during the coming transition
New grid projects will provide rich experience and lessons
Difficult to predict the situation even 3-5 years ahead

Page 41: Grid References

Grid Book: www.mkp.com/grids
Globus: www.globus.org
Global Grid Forum: www.gridforum.org
TeraGrid: www.teragrid.org
EU DataGrid: www.eu-datagrid.org
PPDG: www.ppdg.net
GriPhyN: www.griphyn.org
iVDGL: www.ivdgl.org