Grids for 21st Century Data Intensive Science

Paul Avery
University of Florida
http://www.phys.ufl.edu/~avery/
[email protected]

University of Michigan, May 8, 2003

Grids and Science
The Grid Concept

Grid: Geographically distributed computing resources configured for coordinated use
Fabric: Physical resources & networks provide raw capability
Middleware: Software ties it all together (tools, services, etc.)
Goal: Transparent resource sharing

Fundamental Idea: Resource Sharing

Resources for complex problems are distributed
  Advanced scientific instruments (accelerators, telescopes, …)
  Storage, computing, people, institutions
Communities require access to common services
  Research collaborations (physics, astronomy, engineering, …)
  Government agencies, health care organizations, corporations, …
"Virtual Organizations"
  Create a "VO" from geographically separated components
  Make all community resources available to any VO member
  Leverage strengths at different institutions
Grids require a foundation of strong networking
  Communication tools, visualization
  High-speed data transmission, instrument operation

Some (Realistic) Grid Examples

High energy physics
  3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data
Fusion power (ITER, etc.)
  Physicists quickly generate 100 CPU-years of simulations of a new magnet configuration to compare with data
Astronomy
  An international team remotely operates a telescope in real time
Climate modeling
  Climate scientists visualize, annotate, & analyze Terabytes of simulation data
Biology
  A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

Grids: Enhancing Research & Learning

Fundamentally alters the conduct of scientific research
  Central model: people and resources flow inward to labs
  Distributed model: knowledge flows between distributed teams
Strengthens universities
  Couples universities to data intensive science
  Couples universities to national & international labs
  Brings front-line research and resources to students
  Exploits intellectual resources of formerly isolated schools
  Opens new opportunities for minority and women researchers
Builds partnerships to drive advances in IT/science/engineering
  "Application" sciences <-> Computer Science
  Physics <-> Astronomy, biology, etc.
  Universities <-> Laboratories
  Scientists <-> Students
  Research Community <-> IT industry

Grid Challenges

Operate a fundamentally complex entity
  Geographically distributed resources
  Each resource under different administrative control
  Many failure modes
Manage workflow across the Grid
  Balance policy vs. instantaneous capability to complete tasks
  Balance effective resource use vs. fast turnaround for priority jobs
  Match resource usage to policy over the long term
  Goal-oriented algorithms: steering requests according to metrics
Maintain a global view of resources and system state
  Coherent end-to-end system monitoring
  Adaptive learning for execution optimization
Build high-level services & an integrated user environment

Data Grids
Data Intensive Science: 2000-2015

Scientific discovery increasingly driven by data collection
  Computationally intensive analyses
  Massive data collections
  Data distributed across networks of varying capability
  Internationally distributed collaborations
Dominant factor: data growth (1 Petabyte = 1000 TB; rough rate check below)
  2000: ~0.5 Petabyte
  2005: ~10 Petabytes
  2010: ~100 Petabytes
  2015: ~1000 Petabytes?
How to collect, manage, access and interpret this quantity of data?
Drives demand for "Data Grids" to handle the additional dimension of data access & movement
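
For scale, the projected volumes above correspond to data roughly doubling every year and a half. A minimal back-of-envelope sketch, using only the figures quoted on the slide (the annual rate and doubling time are derived estimates, not slide numbers):

```python
# Rough growth-rate check on the projected data volumes above.
import math

volumes_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}  # Petabytes

years = 2015 - 2000
factor = volumes_pb[2015] / volumes_pb[2000]            # 2000x over 15 years
annual = factor ** (1 / years)                          # ~1.66, i.e. ~66% per year
doubling_months = 12 * math.log(2) / math.log(annual)   # ~16 months

print(f"{factor:.0f}x growth, ~{(annual - 1) * 100:.0f}%/yr, "
      f"doubling every ~{doubling_months:.0f} months")
```
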
Data Intensive Physical Sciences

High energy & nuclear physics
  Including new experiments at CERN's Large Hadron Collider
Astronomy
  Digital sky surveys: SDSS, VISTA, other Gigapixel arrays
  VLBI arrays: multiple-Gbps data streams
  "Virtual" Observatories (multi-wavelength astronomy)
Gravity wave searches
  LIGO, GEO, VIRGO, TAMA
Time-dependent 3-D systems (simulation & data)
  Earth observation, climate modeling
  Geophysics, earthquake modeling
  Fluids, aerodynamic design
  Dispersal of pollutants in atmosphere

Data Intensive Biology and Medicine

Medical data
  X-ray, mammography data, etc. (many petabytes)
  Radiation oncology (real-time display of 3-D images)
X-ray crystallography
  Bright X-ray sources, e.g. Argonne Advanced Photon Source
Molecular genomics and related disciplines
  Human Genome, other genome databases
  Proteomics (protein structure, activities, …)
  Protein interactions, drug delivery
Brain scans (1-10m, time dependent)

Driven by LHC Computing Challenges

Complexity: Millions of individual detector channels
Scale: PetaOps (CPU), Petabytes (Data)
Distribution: Global distribution of people & resources
  1800 physicists, 150 institutes, 32 countries

CMS Experiment at LHC

"Compact" Muon Solenoid at the LHC (CERN)
[Figure: CMS detector, with a Smithsonian "standard man" shown for scale]

LHC Data Rates: Detector to Storage

Detector output: 40 MHz, ~1000 TB/sec
Level 1 Trigger (special hardware): 75 kHz, 75 GB/sec
Level 2 Trigger (commodity CPUs): 5 kHz, 5 GB/sec
Level 3 Trigger (commodity CPUs, physics filtering): 100 Hz, 100-1500 MB/sec of raw data to storage (back-of-envelope check below)
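
A quick consistency check on these throughputs: with a nominal raw event size of about 1 MB (an assumption for illustration, not a figure from the slide), rate times event size reproduces the quoted numbers at each level.

```python
# Throughput = event rate x event size at each trigger level.
# The ~1 MB nominal event size is an illustrative assumption, not from the slide.
EVENT_SIZE_MB = 1.0

levels = [
    ("Level 1 output",              75_000),  # Hz -> 75 GB/s
    ("Level 2 output",               5_000),  # Hz ->  5 GB/s
    ("Level 3 output (to storage)",    100),  # Hz -> ~100 MB/s (low end of 100-1500 MB/s)
]

for name, rate_hz in levels:
    gb_per_s = rate_hz * EVENT_SIZE_MB / 1000
    print(f"{name}: {rate_hz:>6} Hz x {EVENT_SIZE_MB} MB = {gb_per_s:g} GB/s")
```
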
LHC: Higgs Decay into 4 Muons

10^9 events/sec, selectivity: 1 in 10^13
[Event display: all charged tracks with pt > 2 GeV vs. reconstructed tracks with pt > 25 GeV (+30 minimum bias events)]

CMS Experiment: Hierarchy of LHC Data Grid Resources

[Diagram: tiered distribution of CMS computing resources]
  Online System feeds Tier 0 (CERN Computer Center, > 20 TIPS) at 100-1500 MBytes/s
  Tier 1: national centers (USA, UK, Korea, Russia, …)
  Tier 2: regional centers
  Tier 3: institute clusters with physics caches
  Tier 4: individual PCs
  Wide-area links between tiers range from 1-2.5 Gbps up to 10-40 Gbps
  Resource split: Tier 0 : (all Tier 1) : (all Tier 2) ~ 1:1:1
~10s of Petabytes by 2007-8; ~1000 Petabytes in 5-7 years

Digital Astronomy

Future dominated by detector improvements
  Moore's Law growth in CCDs
  Gigapixel arrays on horizon
  Growth in CPU/storage tracking data volumes
[Chart: total area of 3m+ telescopes in the world (m^2 of "glass") vs. total number of CCD pixels (Mpixels)]
  25-year growth: 30x in glass, 3000x in pixels (annualized below)
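
Annualizing the two 25-year factors quoted in the chart caption (a small sketch using only those numbers; the per-year rates are derived, not from the slide):

```python
# Annualized growth implied by "30x in glass, 3000x in pixels" over 25 years.
import math

YEARS = 25
for name, factor in [("telescope area (m^2)", 30), ("CCD pixels (Mpixels)", 3000)]:
    annual = factor ** (1 / YEARS)
    doubling_yr = math.log(2) / math.log(annual)
    print(f"{name}: ~{(annual - 1) * 100:.0f}%/yr, doubling every ~{doubling_yr:.1f} yr")
```
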
The Age of Astronomical Mega-Surveys

Next generation mega-surveys will change astronomy
  Large sky coverage
  Sound statistical plans, uniform systematics
The technology to store and access the data is here
  Following Moore's Law
Integrating these archives for the whole community
  Astronomical data mining will lead to stunning new discoveries
  "Virtual Observatory" (next slides)

Virtual Observatories

Multi-wavelength astronomy across multiple surveys, tied together by standards:
  Source catalogs, image data
  Specialized data: spectroscopy, time series, polarization
  Information archives: derived & legacy data (NED, Simbad, ADS, etc.)
  Discovery tools: visualization, statistics

Virtual Observatory Data Challenge

Digital representation of the sky
  All-sky + deep fields
  Integrated catalog and image databases
  Spectra of selected samples
Size of the archived data
  40,000 square degrees
  Resolution < 0.1 arcsec -> > 50 trillion pixels
  One band (2 bytes/pixel): 100 Terabytes (arithmetic sketched below)
  Multi-wavelength: 500-1000 Terabytes
  Time dimension: many Petabytes
Large, globally distributed database engines
  Multi-Petabyte data size, distributed widely
  Thousands of queries per day, GByte/s I/O speed per site
  Data Grid computing infrastructure
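
The pixel and storage figures follow directly from the sky area and resolution. A minimal sketch of the arithmetic, taking 0.1 arcsec as the pixel scale (the slide quotes resolution "< 0.1 arcsec"):

```python
# Reproduce the archive-size estimate: 40,000 deg^2 at ~0.1 arcsec/pixel, 2 bytes/pixel.
SQ_DEG = 40_000
ARCSEC_PER_DEG = 3600
PIXEL_ARCSEC = 0.1        # assumed pixel scale
BYTES_PER_PIXEL = 2

pixels_per_sq_deg = (ARCSEC_PER_DEG / PIXEL_ARCSEC) ** 2
total_pixels = SQ_DEG * pixels_per_sq_deg            # ~5.2e13, i.e. > 50 trillion
one_band_tb = total_pixels * BYTES_PER_PIXEL / 1e12  # ~100 TB for one band

print(f"{total_pixels:.1e} pixels, ~{one_band_tb:.0f} TB per band")
# 5-10 bands then gives the quoted 500-1000 TB for multi-wavelength coverage.
```
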
Sloan Sky Survey Data Grid
International Grid/Networking Projects
US, EU, E. Europe, Asia, S. America, …
Global Context: Data Grid Projects

U.S. projects
  Particle Physics Data Grid (PPDG) (DOE)
  GriPhyN (NSF)
  International Virtual Data Grid Laboratory (iVDGL) (NSF)
  TeraGrid (NSF)
  DOE Science Grid (DOE)
  NSF Middleware Initiative (NMI) (NSF)
EU, Asia major projects
  European Data Grid (EU, EC)
  LHC Computing Grid (LCG) (CERN)
  EU national projects (UK, Italy, France, …)
  CrossGrid (EU, EC)
  DataTAG (EU, EC)
  Japanese projects
  Korea project

Particle Physics Data Grid

Funded 2001-2004 at US$9.5M (DOE)
Driven by HENP experiments: D0, BaBar, STAR, CMS, ATLAS

PPDG Goals

Serve high energy & nuclear physics (HENP) experiments
  Unique challenges, diverse test environments
Develop advanced Grid technologies
  Focus on end-to-end integration
Maintain practical orientation
  Networks, instrumentation, monitoring
  DB file/object replication, caching, catalogs, end-to-end movement
Make tools general enough for the wider community
  Collaboration with GriPhyN, iVDGL, EDG, LCG
  ESNet Certificate Authority work, security

GriPhyN and iVDGL

Both funded through the NSF ITR program
  GriPhyN: $11.9M (NSF) + $1.6M (matching) (2000-2005)
  iVDGL: $13.7M (NSF) + $2M (matching) (2001-2006)
Basic composition
  GriPhyN: 12 funded universities, SDSC, 3 labs (~80 people)
  iVDGL: 16 funded institutions, SDSC, 3 labs (~80 people)
  Experiments: US-CMS, US-ATLAS, LIGO, SDSS/NVO
  Large overlap of people, institutions, management
Grid research vs. Grid deployment
  GriPhyN: CS research, Virtual Data Toolkit (VDT) development
  iVDGL: Grid laboratory deployment
  4 physics experiments provide frontier challenges
  VDT in common

GriPhyN Computer Science Challenges

Virtual data (more later)
  Data + programs (content) + program executions
  Representation, discovery, & manipulation of workflows and associated data & programs
Planning
  Mapping workflows in an efficient, policy-aware manner onto distributed resources
Execution
  Executing workflows, including data movements, reliably and efficiently
Performance
  Monitoring system performance for scheduling & troubleshooting

Goal: "PetaScale" Virtual-Data Grids

[Architecture diagram: PetaOps, Petabytes, Performance]
  Users: production teams, single researchers, workgroups, with interactive user tools
  Virtual data tools; request planning & scheduling tools; request execution & management tools
  Services: resource management, security and policy, other Grid services
  Transforms applied to raw data sources
  Distributed resources (code, storage, CPUs, networks)

GriPhyN/iVDGL Science Drivers

US-CMS & US-ATLAS
  HEP experiments at LHC/CERN
  100s of Petabytes
LIGO
  Gravity wave experiment
  100s of Terabytes
Sloan Digital Sky Survey
  Digital astronomy (1/4 sky)
  10s of Terabytes
Common requirements: massive CPU, large distributed datasets, large distributed communities
[Chart: the experiments plotted by data growth vs. community growth, spanning 2001, 2002, 2007]

Virtual Data: Derivation and Provenance

Most scientific data are not simple "measurements"
  They are computationally corrected/reconstructed
  They can be produced by numerical simulation
  Science & engineering projects are increasingly CPU and data intensive
Programs are significant community resources (transformations)
  So are the executions of those programs (derivations)
Management of dataset transformations is important!
  Derivation: instantiation of a potential data product
  Provenance: exact history of any existing data product
We already do this, but manually! (A small illustrative data model follows below.)
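
The relationships behind these terms, and behind the diagram on the next slide, can be made concrete with a small data model. This is an illustrative sketch only, not the Chimera API; the class and function names are hypothetical.

```python
# Minimal sketch of transformations, derivations, and data products,
# with provenance recovered by walking the generated-by / consumed-by links.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Transformation:          # a program, registered as a community resource
    name: str

@dataclass
class DataProduct:
    name: str
    produced_by: Optional["Derivation"] = None   # generated-by link

@dataclass
class Derivation:              # one execution of a transformation
    transformation: Transformation               # execution-of link
    inputs: List[DataProduct]                    # consumed-by links
    outputs: List[DataProduct] = field(default_factory=list)

    def produce(self, name: str) -> DataProduct:
        out = DataProduct(name, produced_by=self)   # product-of link
        self.outputs.append(out)
        return out

def provenance(product: DataProduct) -> List[str]:
    """Walk back from a data product to the raw data, listing the steps."""
    steps = []
    d = product.produced_by
    while d is not None:
        steps.append(d.transformation.name)
        # follow the first input's history; a real system tracks every branch
        d = d.inputs[0].produced_by if d.inputs else None
    return steps

raw = DataProduct("raw_events")
reco = Derivation(Transformation("reconstruct"), [raw]).produce("reco_events")
ntuple = Derivation(Transformation("make_ntuple"), [reco]).produce("ntuple")
print(provenance(ntuple))   # ['make_ntuple', 'reconstruct']
```
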
Virtual Data Motivations (1)

[Diagram: transformations, derivations, and data linked by "product-of", "execution-of", and "consumed-by/generated-by" relationships]

"I've detected a muon calibration error and want to know which derived data products need to be recomputed."
"I've found some interesting data, but I need to know exactly what corrections were applied before I can trust it."
"I want to search a database for 3 muon SUSY events. If a program that does this analysis exists, I won't have to write one from scratch."
"I want to apply a forward jet analysis to 100M events. If the results already exist, I'll save weeks of computation."

Virtual Data Motivations (2)

Data track-ability and result audit-ability
  Universally sought by scientific applications
Facilitates tool and data sharing and collaboration
  Data can be sent along with its recipe
Repair and correction of data
  Rebuild data products, cf. "make" (see the sketch below)
Workflow management
  Organizing, locating, specifying, and requesting data products
Performance optimizations
  Ability to re-create data rather than move it
Manual/error-prone -> Automated/robust
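
A minimal sketch of the "cf. make" point above: re-derive a data product only when it is missing or older than its inputs, otherwise reuse it. The file names and the rebuild command are hypothetical examples, not part of any actual toolkit.

```python
# Make-style rematerialization: rebuild a derived product only if it is stale.
import os
import subprocess
from typing import List

def up_to_date(target: str, inputs: List[str]) -> bool:
    if not os.path.exists(target):
        return False
    t = os.path.getmtime(target)
    return all(os.path.getmtime(i) <= t for i in inputs)

def materialize(target: str, inputs: List[str], command: List[str]) -> str:
    """Return the target file, re-deriving it from its recipe only if needed."""
    if not up_to_date(target, inputs):
        subprocess.run(command, check=True)   # re-run the transformation
    return target

# Hypothetical usage: rebuild a corrected ntuple only if the calibration changed.
# materialize("ntuple.root", ["raw.dat", "calib.db"],
#             ["apply_corrections", "raw.dat", "calib.db", "-o", "ntuple.root"])
```
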
"Chimera" Virtual Data System

Virtual Data API
  A Java class hierarchy to represent transformations & derivations
Virtual Data Language
  Textual, for people & illustrative examples
  XML, for machine-to-machine interfaces
Virtual Data Database
  Makes the objects of a virtual data definition persistent
Virtual Data Service (future)
  Provides a service interface (e.g., OGSA) to persistent objects
Version 1.0 available
  To be put into VDT 1.1.7

Chimera Application: SDSS Analysis

Chimera Virtual Data System
  + GriPhyN Virtual Data Toolkit
  + iVDGL Data Grid (many CPUs)
Galaxy cluster data
[Plot: cluster size distribution, number of clusters vs. number of galaxies per cluster]

Virtual Data and LHC Computing

US-CMS
  Chimera prototype tested with CMS Monte Carlo production (~200K events)
  Currently integrating Chimera into standard CMS production tools
  Integrating virtual data into Grid-enabled analysis tools
US-ATLAS
  Integrating Chimera into ATLAS software
HEPCAL document includes first virtual data use cases
  Very basic cases, need elaboration
  Discuss with LHC experiments: requirements, scope, technologies
New proposal to NSF ITR program ($15M)
  Dynamic Workspaces for Scientific Analysis Communities
Continued progress requires collaboration with CS groups
  Distributed scheduling, workflow optimization, …
  Need collaboration with CS to develop robust tools

iVDGL Goals and Context

International Virtual-Data Grid Laboratory
  A global Grid laboratory (US, EU, E. Europe, Asia, S. America, …)
  A place to conduct Data Grid tests "at scale"
  A mechanism to create common Grid infrastructure
  A laboratory for other disciplines to perform Data Grid tests
  A focus of outreach efforts to small institutions
Context of iVDGL in the US-LHC computing program
  Develop and operate proto-Tier2 centers
  Learn how to do Grid operations (GOC)
International participation
  DataTag
  UK e-Science programme: supports 6 CS Fellows per year in the U.S.

US-iVDGL Sites (Spring 2003)

[Map of Tier1, Tier2, and Tier3 sites: Fermilab, BNL, Argonne, LBL, Caltech, UCSD/SDSC, NCSA, UF, FIU, FSU, Wisconsin, Michigan, Indiana, Boston U, SKC, J. Hopkins, PSU, Hampton, Vanderbilt, Oklahoma, Arlington, Brownsville, …]
Prospective partners: EU, CERN, Brazil, Australia, Korea, Japan

US-CMS Grid Testbed

[Map of testbed sites: Caltech, UCSD, Florida, Wisconsin, Fermilab, FIU, FSU; links to Brazil, Korea, and CERN]

US-CMS Testbed Success Story

Production run for Monte Carlo data production
  Assigned 1.5 million events for "eGamma Bigjets"
  ~500 sec per event on a 750 MHz processor; all production stages from simulation to ntuple (scale estimated below)
  2 months of continuous running across 5 testbed sites
  Demonstrated at Supercomputing 2002
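
As a rough sense of scale for this run: the event count and per-event time are from the slide, while the 60-day wall-clock figure and the assumption of full efficiency are mine.

```python
# Back-of-envelope scale of the eGamma Bigjets production run described above.
EVENTS = 1_500_000
SEC_PER_EVENT = 500        # on a 750 MHz processor
WALL_DAYS = 60             # "~2 months continuous running" (assumed 60 days)

cpu_seconds = EVENTS * SEC_PER_EVENT                      # 7.5e8 CPU-seconds
cpu_years = cpu_seconds / (365 * 24 * 3600)               # ~24 CPU-years
concurrent_cpus = cpu_seconds / (WALL_DAYS * 24 * 3600)   # ~145 CPUs busy on average

print(f"~{cpu_years:.0f} CPU-years of 750 MHz processing, "
      f"~{concurrent_cpus:.0f} CPUs kept busy across the 5 sites")
```
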
Creation of WorldGrid

Joint iVDGL/DataTag/EDG effort
  Resources from both sides (15 sites)
  Monitoring tools (Ganglia, MDS, NetSaint, …)
  Visualization tools (Nagios, MapCenter, Ganglia)
Applications: ScienceGrid
  CMS: CMKIN, CMSIM
  ATLAS: ATLSIM
Submit jobs from the US or EU; jobs can run on any cluster
  Demonstrated at IST2002 (Copenhagen)
  Demonstrated at SC2002 (Baltimore)

WorldGrid Sites
Grid Coordination
U.S. Project Coordination: Trillium

Trillium = GriPhyN + iVDGL + PPDG
  Large overlap in leadership, people, experiments
  Driven primarily by HENP, particularly the LHC experiments
Benefits of coordination
  Common software base + packaging: VDT + PACMAN
  Collaborative / joint projects: monitoring, demos, security, …
  Wide deployment of new technologies, e.g. Virtual Data
  Stronger, broader outreach effort
Forum for US Grid projects
  Joint view, strategies, meetings and work
  Unified entity to deal with EU & other Grid projects

International Grid Coordination

Global Grid Forum (GGF)
  International forum for general Grid efforts
  Many working groups, standards definitions
Close collaboration with EU DataGrid (EDG)
  Many connections with EDG activities
HICB: HEP Inter-Grid Coordination Board
  Non-competitive forum, strategic issues, consensus
  Cross-project policies, procedures and technology, joint projects
HICB-JTB Joint Technical Board
  Definition, oversight and tracking of joint projects
  GLUE interoperability group
Participation in LHC Computing Grid (LCG)
  Software Computing Committee (SC2)
  Project Execution Board (PEB)
  Grid Deployment Board (GDB)

HEP and International Grid Projects

HEP continues to be the strongest science driver (in collaboration with computer scientists)
  Many national and international initiatives
  LHC a particularly strong driving function
US-HEP committed to working with international partners
  Many networking initiatives with EU colleagues
  Collaboration on the LHC Grid Project
Grid projects driving & linked to network developments
  DataTAG, SCIC, US-CERN link, Internet2
New partners being actively sought
  Korea, Russia, China, Japan, Brazil, Romania, …
  Participate in US-CMS and US-ATLAS Grid testbeds
  Link to WorldGrid, once some software is fixed

New Grid Efforts

CHEPREO: An Inter-Regional Center for High Energy Physics Research and Educational Outreach at Florida International University
  E/O Center in the Miami area
  iVDGL Grid activities
  CMS research
  AMPATH network (S. America)
  International activities (Brazil, etc.)
Status
  Proposal submitted Dec. 2002
  Presented to NSF review panel
  Project Execution Plan submitted
  Funding in June?

A Global Grid Enabled Collaboratory for Scientific Research (GECSR)

$4M ITR proposal
  Caltech (HN PI, JB Co-PI), Michigan (Co-PIs), Maryland (Co-PI)
  Plus senior personnel from Lawrence Berkeley Lab, Oklahoma, Fermilab, Arlington (U. Texas), Iowa, Florida State
First Grid-enabled Collaboratory
  Tight integration between the science of collaboratories and a globally scalable work environment
  Sophisticated collaborative tools (VRVS, VNC; next generation)
  Agent-based monitoring & decision support system (MonALISA)
Initial targets are the global HENP collaborations, but GECSR is expected to be widely applicable to other large-scale collaborative scientific endeavors
"Giving scientists from all world regions the means to function as full partners in the process of search and discovery"

Large ITR Proposal: $15M

Dynamic Workspaces: Enabling Global Analysis Communities

UltraLight Proposal to NSF

10 Gb/s+ network
  Caltech, UF, FIU, UM, MIT
  SLAC, FNAL
  International partners
  Cisco
Applications
  HEP
  VLBI
  Radiation oncology
  Grid projects

GLORIAD

New 10 Gb/s network linking US-Russia-China
  Plus Grid component linking science projects
  H. Newman, P. Avery participating
Meeting at NSF April 14 with US-Russia-China representatives
  HEP people (Hesheng, et al.)
  Broad agreement that HEP can drive the Grid portion
  More meetings planned

Summary

Progress on many fronts in PPDG/GriPhyN/iVDGL
  Packaging: Pacman + VDT
  Testbeds (development and production)
  Major demonstration projects
  Productions based on Grid tools using iVDGL resources
WorldGrid providing excellent experience
  Excellent collaboration with EU partners
  Building links to our Asian and other partners
  Excellent opportunity to build lasting infrastructure
Looking to collaborate with more international partners
  Testbeds, monitoring, deploying VDT more widely
New directions
  Virtual data: a powerful paradigm for LHC computing
  Emphasis on Grid-enabled analysis

Grid References

Grid Book: www.mkp.com/grids
Globus: www.globus.org
Global Grid Forum: www.gridforum.org
PPDG: www.ppdg.net
GriPhyN: www.griphyn.org
iVDGL: www.ivdgl.org
TeraGrid: www.teragrid.org
EU DataGrid: www.eu-datagrid.org