Welcome and Cyberinfrastructure Overview: MSI Cyberinfrastructure Institute, June 26-30, 2006
UCSD
SAN DIEGO SUPERCOMPUTER CENTER
Anke Kamrath, Division Director, San Diego Supercomputer Center
Welcome and Cyberinfrastructure Overview
MSI Cyberinfrastructure Institute, June 26-30, 2006
The Digital World
Shopping
Entertainment
Information
Science is a Team Sport
Astronomy
Physics
Life Sciences
Modeling and Simulation
Data Management and Mining
GAMESS
QCD
Geosciences
Cyberinfrastructure – A Unifying Concept
Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + “glue” (integrating software, systems, and organizations).
NSF’s “Atkins Report” provided a compelling vision for integrated Cyberinfrastructure
A Deluge of Data
• Today data comes from everywhere
  • “Volunteer” data
  • Scientific instruments
  • Experiments
  • Sensors and sensornets
  • Computer simulations
  • New devices (personal digital devices, computer-enabled clothing, cars, …)
• And is used by everyone
  • Researchers, educators
  • Consumers
  • Practitioners
  • General public
• Turning the deluge of data into usable information for the research and education community requires an unprecedented level of integration, globalization, scale, and access
Data from sensors
Data from simulations
Data from instruments
Data from analysis
Volunteer data
Using Data as a Driver: SDSC Cyberinfrastructure
SDSC Data Cyberinfrastructure
• Data-oriented HPC, resources, high-end storage, large-scale data analysis, simulation, modeling
• Community databases and data collections, data management, mining and preservation
• Data-oriented tools, SW applications, and community codes (SRB, Biology Workbench)
• Data- and computational science education and training (Summer Institute)
• Collaboration, service and community leadership for data-oriented projects
IT
Impact on Technology: Data and Storage are Integral to Today’s Information Infrastructure
• Today’s “computer” is a coordinated set of hardware, software, and services providing an “end-to-end” resource.
• Cyberinfrastructure captures how the research and education community has redefined “computer”
[Figure: an end-to-end “computer” linking field instruments, field sensors, wireless and wired networks, computers, data, storage, and visualization]
Data and storage are an integral part of today’s “computer”
Goal: SDSC’s Data Cyberinfrastructure should “extend the reach” of the local research and education environment.
• Access to community and reference data collections
• More capable and/or higher capacity computational resources
• Multi-disciplinary expertise
• Community codes, middleware, software tools and toolkits
Building a National Data Cyberinfrastructure Center
Long-term Scientific Data Preservation
Impact on Applications: Data-oriented Research Driving the Next Generation of Technology Challenges
[Figure: applications plotted by compute need (more FLOPS) against data need (more BYTES). Regions: Home, Lab, Campus, Desktop Applications; Traditional HPC Applications; Data-oriented Research Applications]
Today’s Research Applications Span the Spectrum
[Figure: applications plotted by compute need (more FLOPS) against data need (more BYTES)]
Regions: Home, Lab, Campus, Desktop; Traditional HPC environment; Data-oriented environment (extreme I/O environment, data management environment)
Suitability for the Grid: lends itself to Grid; could be targeted efficiently on Grid; difficult to target efficiently on Grid
Applications shown: NVO, EOL, CiPres, SCEC visualization, ENZO visualization, CFD, turbulence field, climate, SCEC simulation, ENZO simulation, QCD, protein folding/MD, turbulence reattachment length, CPMD, MCell, GridSAT, Seti@Home, EverQuest, GAMESS
Working with Compute and Data – Simulation, Analysis, Modeling
Simulation of a magnitude 7.7 Southern California earthquake on the lower San Andreas Fault
• Physics-based dynamic source model – simulation of mesh of 1.8 billion cubes with spatial resolution of 200 m
• Builds on 10 years of data and models from the Southern California Earthquake Center
• Simulated first 3 minutes of a magnitude 7.7 earthquake, 22,728 time steps of 0.011 second each
• Simulation generates 45+ TB data
Resources Required
Computers and Systems
• 80,000 hours on DataStar
• 256 GB memory p690 used for testing, p655s used for production run, TeraGrid (TG) used for porting
• 30 TB global parallel file system (GPFS)
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours post-processing for high resolution rendering
People
• 20+ people for IT support
• 20+ people in domain research
Storage
• SAM-QFS archival storage
• HPSS backup
• SRB Collection with 1,000,000 files
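As a rough back-of-the-envelope check on the figures above, the sketch below estimates how long it would take to move the simulation output at the quoted run-time transfer rate. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly sustained rate; the function name is illustrative and not part of any SDSC tool.

```python
def transfer_time_days(data_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to move data_tb terabytes at a sustained rate in MB/s."""
    data_bytes = data_tb * 10**12               # assumption: decimal terabytes
    rate_bytes_per_s = rate_mb_per_s * 10**6    # assumption: decimal megabytes
    seconds = data_bytes / rate_bytes_per_s
    return seconds / 86_400                     # 86,400 seconds in a day


if __name__ == "__main__":
    # 45 TB of output moved from GPFS to archive at 100 MB/s
    print(f"{transfer_time_days(45, 100):.1f} days")
```

Under these assumptions, 45 TB at 100 MB/s takes a little over five days to move, which is one plausible reason to stream output to archival storage while the simulation is still running rather than afterwards.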
Simulating an earthquake 1:
1. Divide up Southern California into “blocks”
2. For each block, get all the data on ground surface composition, geological structures, fault information, etc.
Big Data & Big Compute: The Southern San Andreas Fault
Simulating an earthquake 2:
3. Map the blocks onto the processors (“brains”) of the computer
SDSC’s DataStar: one of the 25 fastest computers in the world
Big Data & Big Compute
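Steps 1 to 3 amount to a domain decomposition: the region is cut into blocks and each block is handed to one processor. The sketch below shows one simple way to do this for a regular 3D mesh. The mesh dimensions (3000 x 1500 x 400) and the 8 x 6 x 5 processor layout are assumptions chosen only so that the totals match the 1.8 billion cubes and 240 processors quoted in this talk; this is an illustration, not the actual TeraShake code.

```python
from itertools import product


def split(n: int, parts: int) -> list[tuple[int, int]]:
    """Cut the index range [0, n) into `parts` contiguous half-open ranges."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        # The first `extra` pieces get one additional cell each
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def decompose(nx, ny, nz, px, py, pz):
    """Map processor rank -> ((x0, x1), (y0, y1), (z0, z1)) block of the mesh."""
    blocks = product(split(nx, px), split(ny, py), split(nz, pz))
    return dict(enumerate(blocks))


def cells(block) -> int:
    """Number of mesh cells inside one block."""
    (x0, x1), (y0, y1), (z0, z1) = block
    return (x1 - x0) * (y1 - y0) * (z1 - z0)


if __name__ == "__main__":
    # Hypothetical mesh and processor grid (see the assumptions above)
    blocks = decompose(3000, 1500, 400, 8, 6, 5)
    print(len(blocks), "blocks;", cells(blocks[0]), "cells in block 0")
```

With an even split like this, every processor holds the same number of cells, which keeps the work balanced; a real code also has to exchange boundary values between neighboring blocks at every time step.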
Simulating an earthquake 3:
4. Run the simulation using current information on fault activity and the physics of earthquakes
Big Data & Big Compute:
Simulating an earthquake 4:
5. The simulation outputs data on seismic wave velocity, earthquake magnitude, and other characteristics
Managing the data
• How much data was output?
• 47 TeraBytes which is
• 2+ times the printed materials in the Library of Congress! or
• The amount of music in 2000+ iPods! or
• About 10,000 copies of a typical (4.7 GB) DVD movie!
Where to store the data?
• In HPSS, a tape storage library that can hold 10 PetaBytes (10,000 TeraBytes) -- 500 times the printed materials in the Library of Congress
How long will TeraShake take on your desktop computer?
Computing Platform | Number of Processors | Floating Point (arithmetic) Operations per Second | Can Run TeraShake in
Desktop | 1 | 5.3 billion | 72 centuries! (approximate)
DataStar at SDSC | 1024 (240 used for TeraShake) | 10.4 trillion | 5 days
Better Neurosurgery Through Cyberinfrastructure
• PROBLEM: Neurosurgeons seek to remove as much tumor tissue as possible while minimizing removal of healthy brain tissue
• Brain deforms during surgery
• Surgeons must align the preoperative brain image with intra-operative images to have the best opportunity for intra-surgical navigation
Radiologists and neurosurgeons at Brigham and Women’s Hospital (BWH), Harvard Medical School, are exploring transmission of 30-40 MB brain images (generated during surgery) to SDSC for analysis and alignment
Finite element simulation on a biomechanical model for volumetric deformation is performed at SDSC; output results are sent to BWH, where updated images are shown to surgeons
Transmission is repeated every hour during a 6-8 hour surgery.
Transmission and output must take on the order of minutes
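To see what "on the order of minutes" implies, the sketch below adds up a round trip: upload of the image, the simulation, and return of the result. The link speed and compute time in the example are hypothetical values, and the sketch assumes the returned result is the same size as the uploaded image; none of these figures come from the BWH/SDSC project.

```python
def turnaround_seconds(image_mb: float, link_mbit_per_s: float, compute_s: float) -> float:
    """Upload + compute + download time for one intra-operative update.

    Assumptions: decimal megabytes (8 megabits each), a sustained link rate,
    and a result the same size as the uploaded image.
    """
    one_way_s = image_mb * 8 / link_mbit_per_s
    return 2 * one_way_s + compute_s


if __name__ == "__main__":
    # Hypothetical case: 40 MB image, 100 Mbit/s link, 2 minutes of simulation
    total = turnaround_seconds(40, 100, 120)
    print(f"{total:.1f} s (~{total / 60:.1f} min)")
```

In this hypothetical case the network accounts for only a few seconds, so the time budget is dominated by the finite element simulation rather than by moving the images.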
Community Data Repository: SDSC DataCentral
• Provides “data allocations” on SDSC resources to national science and engineering community
  • Data collection and database hosting
  • Batch oriented access
  • Collection management services
• First broad program of its kind to support research and community data collections and databases
• Comprehensive resources
  • Disk: 400 TB accessible via HPC systems, Web, SRB, GridFTP
  • Databases: DB2, Oracle, MySQL
  • SRB: Collection management
  • Tape: 6 PB, accessible via file system, HPSS, Web, SRB, GridFTP
• 24/7 operations, collection specialists
Example Allocated Data Collections include
• Bee Behavior (Behavioral Science)
• C5 Landscape DB (Art)
• Molecular Recognition Database (Pharmaceutical Sciences)
• LIDAR (Geoscience)
• AMANDA (Physics)
• SIO_Explorer (Oceanography)
• Tsunami and Landsat Data (Earthquake Engineering)
• Terabridge (Structural Engineering)
DataCentral infrastructure includes: Web-based portal, security, networking, UPS systems, web services and software tools
Public Data Collections Hosted in SDSC’s DataCentral
Seismology: 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences: 50 year Downscaling of Global Analysis over California Region
Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics: AMANDA data
Biology: AfCS Molecule Pages
Biomedical Neuroscience: BIRN
Networking: Backbone Header Traces
Networking: Backscatter Data
Biology: Bee Behavior
Biology: Biocyc (SRI)
Art: C5 landscape Database
Geology: Chronos
Biology: CKAAPS
Biology: DigEmbryo
Earth Science Education: ERESE
Earth Sciences: UCI ESMF
Earth Sciences: EarthRef.org
Earth Sciences: ERDA
Earth Sciences: ERR
Biology: Encyclopedia of Life
Life Sciences: Protein Data Bank
Geosciences: GEON
Geosciences: GEON-LIDAR
Geochemistry: Kd
Biology: Gene Ontology
Geochemistry: GERM
Networking: HPWREN
Ecology: HyperLter
Networking: IMDC
Biology: Interpro Mirror
Biology: JCSG Data
Government: Library of Congress Data
Geophysics: Magnetics Information Consortium data
Education: UC Merced Japanese Art Collections
Geochemistry: NAVDAT
Earthquake Engineering: NEESIT data
Education: NSDL
Astronomy: NVO
Government: NARA
Anthropology: GAPP
Neurobiology: Salk data
Seismology: SCEC TeraShake
Seismology: SCEC CyberShake
Oceanography: SIO Explorer
Networking: Skitter
Astronomy: Sloan Digital Sky Survey
Geology: Sensitive Species Map Server
Geology: SD and Tijuana Watershed data
Oceanography: Seamount Catalogue
Oceanography: Seamounts Online
Biodiversity: WhyWhere
Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering: TeraBridge
Various: TeraGrid data collections
Biology: Transporter Classification Database
Biology: TreeBase
Art: Tsunami Data
Education: ArtStor
Biology: Yeast regulatory network
Biology: Apoptosis Database
Cosmology: LUSciD
Data Cyberinfrastructure Requires a Coordinated Approach
Components:
• Storage hardware
• Networked Storage (SAN)
• Grid Storage, Filesystems, Database Systems
• Data Mining, Simulation Modeling, Analysis, Data Fusion
• Applications: Medical informatics, Biosciences, Ecoinformatics, …
• Knowledge-Based Integration, Advanced Query Processing
• Visualization
• High speed networking, sensornets, instruments, HPC
• Integration, interoperability
Questions:
• How do we configure computer architectures to optimally support data-oriented computing?
• How do we collect, access and organize data?
• How do we obtain usable information from data?
• How do we detect trends and relationships in data?
• How do we represent data, information and knowledge to the user?
• How do we combine data, knowledge and information management with simulation and modeling?
Working with Data: Data Integration for New Discovery
Data Integration in the Biosciences
• Users, software to access data, software to federate data, disciplinary databases
• Scales: Atoms, Biopolymers, Organelles, Cells, Organs, Organisms
• Disciplines: Medicinal Chemistry, Genomics, Proteomics, Cell Biology, Physiology, Anatomy
Data Integration in the Geosciences
• Where can we most safely build a nuclear waste dump?
• Where should we drill for oil?
• What are the distribution and U/Pb zircon ages of A-type plutons in VA?
• How does it relate to host rock structures?
• Data integration across: Geologic Map, Geo-Chemical, Geo-Physical, Geo-Chronologic, Foliation Map
• Complex “multiple-worlds” mediation
Preserving Data over the Long-Term
Data Preservation
• Many Science, Cultural, and Official Collections must be sustained for the foreseeable future
• Critical collections must be preserved:
• community reference data collections (e.g. Protein Data Bank)
• irreplaceable collections (e.g. field data – tsunami recon)
• longitudinal data (e.g. PSID – Panel Study of Income Dynamics)
• No plan for preservation often means that data is lost or damaged
“… the progress of science and useful arts … depends on the reliable preservation of knowledge and information for generations to come.”
“Preserving Our Digital Heritage”, Library of Congress
How much Digital Data*?
Kilo = 10^3
Mega = 10^6
Giga = 10^9
Tera = 10^12
Peta = 10^15
Exa = 10^18
1 low resolution photo = 100 KiloBytes
1 novel = 1 MegaByte
iPod Shuffle (up to 120 songs) = 512 MegaBytes
Printed materials in the Library of Congress = 10 TeraBytes
1 human brain at the micron level = 1 PetaByte
SDSC HPSS tape archive = 6 PetaBytes
All worldwide information in one year = 2 ExaBytes
* Rough/average estimates
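The prefixes make it easy to compare these quantities directly. The sketch below encodes the rough estimates from this slide and computes how many of one item fit into another; the dictionary keys and the helper name `how_many` are illustrative choices, and the sizes are the slide's own rough estimates, not measurements.

```python
# Decimal prefixes, as listed on the slide
PREFIX = {
    "Kilo": 10**3, "Mega": 10**6, "Giga": 10**9,
    "Tera": 10**12, "Peta": 10**15, "Exa": 10**18,
}

# Rough sizes in bytes, taken from the slide's estimates
SIZES = {
    "low resolution photo": 100 * PREFIX["Kilo"],
    "novel": 1 * PREFIX["Mega"],
    "iPod Shuffle": 512 * PREFIX["Mega"],
    "Library of Congress (print)": 10 * PREFIX["Tera"],
    "human brain at micron level": 1 * PREFIX["Peta"],
    "SDSC HPSS tape archive": 6 * PREFIX["Peta"],
    "worldwide information per year": 2 * PREFIX["Exa"],
}


def how_many(small: str, large: str) -> int:
    """How many whole copies of `small` fit into `large`."""
    return SIZES[large] // SIZES[small]


if __name__ == "__main__":
    print(how_many("novel", "Library of Congress (print)"))
    print(how_many("Library of Congress (print)", "SDSC HPSS tape archive"))
```

With these estimates, the printed Library of Congress corresponds to about ten million novels, and the 6 PB tape archive to about 600 Libraries of Congress.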
Key Challenges for Digital Preservation
• What should we preserve?
  • What materials must be “rescued”?
  • How to plan for preservation of materials by design?
• How should we preserve it?
  • Formats
  • Storage media
  • Stewardship – who is responsible?
• Who should pay for preservation?
  • The content generators?
  • The government?
  • The users?
• Who should have access?
Print media provides easy access for long periods of time but is hard to data-mine
Digital media is easier to data-mine but requires management of evolution of media and resource planning over time
What can go wrong
Entity at risk | Problem | Frequency
File | Corrupted media, disk failure | 1 year
Tape+ | Simultaneous failure of 2 copies | 5 years
System+ | Systemic errors in vendor SW, or malicious user, or operator error that deletes multiple copies | 15 years
Archive+ | Natural disaster, obsolescence of standards | 50-100 years
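One way to read the table is to ask how likely each kind of incident is over a given preservation horizon. The sketch below treats each "Frequency" as the mean interval between incidents and models incidents as a Poisson process; both are modeling assumptions made here for illustration, not something the slide states. For the archive row the lower bound of 50 years is used.

```python
import math

# Mean interval between incidents, in years (assumed reading of "Frequency")
MEAN_INTERVAL_YEARS = {
    "File": 1,
    "Tape": 5,
    "System": 15,
    "Archive": 50,   # lower bound of the 50-100 year range
}


def prob_at_least_one(mean_interval_years: float, horizon_years: float) -> float:
    """Probability of one or more incidents within the horizon (Poisson model)."""
    rate = 1.0 / mean_interval_years
    return 1.0 - math.exp(-rate * horizon_years)


if __name__ == "__main__":
    horizon = 10  # years; an arbitrary example horizon
    for entity, interval in MEAN_INTERVAL_YEARS.items():
        p = prob_at_least_one(interval, horizon)
        print(f"{entity:8s} {p:.0%} chance of an incident in {horizon} years")
```

Under this model even the rarest event class has a noticeable chance of occurring within a decade, which is the argument for keeping multiple independent copies and planning preservation in advance.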
SDSC Cyberinfrastructure Community Resources
COMPUTE SYSTEMS
• DataStar
  • 2396 Power4+ processors, IBM p655 and p690 nodes
  • 10 TB total memory
  • Up to 2 GBps I/O to disk
• TeraGrid Cluster
  • 512 Itanium2 IA-64 processors
  • 1 TB total memory
• Intimidata
  • Only academic IBM Blue Gene system
  • 2,048 PowerPC processors
  • 128 I/O nodes
http://www.sdsc.edu/user_services/
SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES
• User Services
• Application/Community Collaborations
• Education and Training
• SDSC Synthesis Center
• Community SW, toolkits, portals, codes
• http://www.sdsc.edu/
DATA ENVIRONMENT
• 1 PB Storage-area Network (SAN)
• 10 PB StorageTek tape library
• DB2, Oracle, MySQL
• Storage Resource Broker
• HPSS
• 72-CPU Sun Fire 15K
• 96-CPU IBM p690s
• http://datacentral.sdsc.edu/
Support for 60+ community data collections and databases
Data management, mining, analysis, and preservation