Post on 20-Dec-2015
The Future of Scientific Computing at Harvard
Alyssa A. Goodman
Professor of Astronomy
Director, Initiative in Innovative Computing
“The Heavy Red Bag”
How can computers advance (my) science?
A new collaborative scientific initiative at Harvard.
Computational challenges are common across scientific disciplines
How to:
Acquire, transmit, organize, and query new kinds of data?
Apply distributed computing resources to solve complex problems?
Derive meaningful insight from large datasets?
Share, integrate and analyze knowledge across geographically dispersed researchers?
Visually represent scientific results so as to maximize understanding?
Opportunity to collaborate and apply insights from one field to another
Filling the “Gap” between Science and Computer Science
Scientific disciplines
Increasingly, core problems in science require computational solutions
Typically hire/“home grow” computationalists, but often lack the expertise or funding to go beyond the immediate pressing need
Computer Science departments
Focused on finding elegant solutions to basic computer science challenges
Often see specific, “applied” problems as outside their interests
“Workflow” & “Continuum”
Workflow Examples | Astronomy | Public Health
“Collect” | Telescope | Microscope, Stethoscope, Survey
COLLECT | “National Virtual Observatory”/COMPLETE | CDC Wonder
“Analyze” | Study the density structure of a star-forming glob of gas | Find a link between one factory’s chlorine runoff & disease
ANALYZE | Study the density structure of all star-forming gas in… | Study the toxic effects of chlorine runoff in the U.S.
“Collaborate” | Work with your student
COLLABORATE | Work with 20 people in 5 countries, in real-time
“Respond” | Write a paper for a Journal.
RESPOND | Write a paper, the quantitative results of which are shared globally, digitally.
IIC contact: AG, FAS
Workflow
a.k.a. The Scientific Method (in the Age of High-Speed Networks, Fast Processors, Mass Storage, and Miniature Devices)
IIC contact: Matt Welsh, FAS
Workflow: The Harvard Virtual Brain
[Figure: Left Hippocampal Volume (1000 to 6000) versus CVLT Discriminability Score (0.3 to 1.0); BWH/MGH and UCSD Data]
Faculty of Arts and Sciences Harvard College Division of Engineering
Harvard School of Public Health
Faculty of Medicine Harvard Medical School Affiliated Teaching Hospitals
Data Acquisition MRI PET Microscopy etc.
Distributed Data Storage
Data Processing Analysis Visualization Integration etc.
Information Access Query Statistical Analysis Knowledge Management etc.
Establishing a Harvard-wide Neuroscience Infrastructure
Harvard IIC
IIC contact: David Kennedy, HMS/MGH
New technologies for measurement and simulation are transforming the “workflow.”
Biomedicine: pre-genomics
• Manual/low throughput
• Solitary
• Limited by two hands
• Analog
Biomedicine: genomics era
• High throughput
• Automated/networked
• Highly scalable
• Digital
Continuum
“Pure” Discipline Science
(e.g. Galileo)
“Pure” Computer Science
(e.g. Turing)
“Computational Science”
Missing at Most Universities
Workflow & Continuum
For any particular scientific investigation:
Where does, and could, “computational science” make improvements in this cycle?
Harvard Public Health “NOW” (Oct. 2004)
"In the past, experiments did not involve such large data sets," observed Dyann Wirth, professor of infectious diseases in the Department of Immunology and Infectious Diseases and member of the advisory group for the core. "There has been a dramatic change in the past five to 10 years in the amount and availability of genomic data [or the DNA sequences themselves] and functional genomic data, [or the sequences’ purpose]." In the past five years alone, the genomes of humans, rats, and the malaria parasite Plasmodium Falciparum have been published, for example.
"One of the purposes of bioinformatics is to reduce the number of experiments that need to be done to achieve reliable information," said L.J. Wei, professor of biostatistics in the Department of Biostatistics and member of the advisory group for the core. "However, an issue right now is that there are huge data sets that can be run through different kinds of software programs, ending up with many data points. Unless we understand and use bioinformatics well, we may not even know which of those data points are important."
Filling the “computational science” gap: IIC
Problem-driven approach
…focusing effort on solving problems that will have greatest impact & educational value
Collaborative projects
…combining disciplinary knowledge with computer science expertise
Interdisciplinary effort
…to ensure that best practices are shared across fields and that new tools and methodologies will be broadly applicable
Links with industry
…to draw on and learn from experience in applied computation
Institutional funding
…to ensure effort is directed towards key needs and not driven solely by narrow priorities of funding agencies
IIC at Harvard
Numerical Simulation of Star Formation
Bate, Bonnell & Bromm 2002 (UKAFF)
• MHD turbulence gives “t=0” conditions; Jeans mass = 1 Msun
• 50 Msun, 0.38 pc, n_avg = 3 x 10^5 ptcls/cc
• forms ~50 objects
• T = 10 K
• SPH, no B or
• movie = 1.4 free-fall times
Simulations & Public Health
Goal: Statistical Comparison of “Real” and “Synthesized” Star Formation
Figure based on work of Padoan, Nordlund, Juvela, et al.
Excerpt from realization used in Padoan & Goodman 2002.
Spectral Line Observations
Measuring Motions: Molecular Line Maps
Alves, Lada & Lada 1999
Radio Spectral-Line Survey
Radio Spectral-line Observations of Interstellar Clouds
Velocity from Spectroscopy
[Figure: Observed Spectrum, Intensity versus "Velocity", recorded with a Telescope and Spectrometer]
All thanks to Doppler
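The conversion behind "All thanks to Doppler" can be sketched in a few lines. The slides do not say which velocity convention is used; the sketch below assumes the radio convention, v = c (f_rest - f_obs) / f_rest, and the function name `radio_velocity` and the approximate 13CO J=1-0 rest frequency are illustrative choices, not taken from the talk.

```python
C_KM_S = 299_792.458  # speed of light in km/s

# Approximate rest frequency of the 13CO J=1-0 line in Hz (assumed for illustration).
REST_13CO_HZ = 110.201354e9


def radio_velocity(observed_freq_hz, rest_freq_hz):
    """Line-of-sight velocity in km/s under the radio Doppler convention.

    Positive values mean the gas is moving away from the observer
    (the line is observed at a frequency below its rest frequency).
    """
    return C_KM_S * (rest_freq_hz - observed_freq_hz) / rest_freq_hz


def channel_velocities(first_freq_hz, channel_width_hz, n_channels, rest_freq_hz):
    """Velocity of each spectrometer channel, given evenly spaced channel frequencies."""
    return [
        radio_velocity(first_freq_hz + i * channel_width_hz, rest_freq_hz)
        for i in range(n_channels)
    ]


if __name__ == "__main__":
    # A line observed slightly below its rest frequency is redshifted (receding gas).
    shifted = REST_13CO_HZ * (1 - 1e-5)
    print(radio_velocity(shifted, REST_13CO_HZ))
```

Applying `channel_velocities` to a spectrometer's frequency channels yields the "velocity" axis of an observed spectrum.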
Barnard’s Perseus
COMPLETE/FCRAO W(13CO)
“Astronomical Medicine”
Excerpts from Junior Thesis of Michelle Borkin (Harvard College); IIC Contacts: AG (FAS) & Michael Halle (HMS/BWH/SPL)
IC 348
Before “Medical Treatment”
After “Medical Treatment”
3D Slicer Demo (available after talk)
IIC contacts: Michael Halle & Ron Kikinis
IIC: Five Research Branches
Visualization
• Physically meaningful combination of diverse data types.
Distributed Computing
• e-Science aspects of large collaborations.
• Sharing of data and computational resources and tools in real-time.
Databases/Provenance
• Management, and rapid retrieval, of data.
• “Research reproducibility” …where did the data come from? How?
Analysis & Simulations
• Development of efficient algorithms.
• Cross-disciplinary comparative tools (e.g. statistical).
Instrumentation
• Improved data acquisition.
• Novel hardware approaches (e.g. GPUs, sensors).
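The “research reproducibility” question (where did the data come from, and how?) can be made concrete with a small provenance record. This is only a minimal sketch under stated assumptions: the record layout, the function names `provenance_record` and `matches`, and the source label are hypothetical and do not describe any IIC system.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data, source, steps):
    """Describe a dataset: where it came from, how it was processed, and its checksum."""
    return {
        "source": source,                            # hypothetical label for the origin
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the exact bytes
        "steps": list(steps),                        # processing applied, in order
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def matches(record, data):
    """Check that some bytes are the same data the record describes."""
    return record["sha256"] == hashlib.sha256(data).hexdigest()


raw = b"1.0,2.0,3.0\n"  # stand-in for a real data file
record = provenance_record(raw, "hypothetical-observing-run", ["calibrated", "smoothed"])

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```

Storing such a record next to each derived file lets a collaborator confirm later that a result was computed from the data it claims to use.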
IIC: Innovative Organizational Model
Culture: No “class” distinctions made between teaching and non-teaching faculty, scientists and engineers, artists and designers working in the visualization program
Staffing: Highly accomplished academics and senior experts whose careers have been primarily in industry, working together
Promotion/career path: Criteria for promotion will give equal weight to scholarly activities and to technological invention
How IIC will Function: Overview
IIC Objectives
Identify and fund projects that are likeliest to have the greatest and broadest impact
Pursue projects in a way that will yield the best outcome, enable shared learning, etc.
Enable new research for specific scientific disciplines
Generate new computational tools for broader application
Project execution
Dissemination of knowledge
Project selection
Project Selection
Project proposals
Role: Submit proposal in response to call for ideas
Who participates: Any Harvard researcher (e.g., in genomics, fluid dynamics, epidemiology, neuroscience, nanoscience, comp bio, chemical biology, optics, geology, astronomy, quantum mechanics, etc.)
Program Advisory Committee
Role: Evaluate/rank proposals for scientific merit: should this be a priority for IIC?
Who participates: Harvard researchers representing broad interests of IIC stakeholders plus IIC Director & Dir. of Research
IIC Management Team
Role: Evaluate/prioritize proposals according to technical feasibility, assess resource needs
Consists of:
• IIC Director
• Dirs. of Res. & Adm/Ops
• Heads of IIC branches
Project Execution
Project Manager: Responsible for project execution and metrics for tracking progress/performance; interfaces with IIC branch heads
Discipline scientists: Scientists who “own” the problem and are committed to working with IIC staff to tackle it
IIC staff: IIC staff scientists assigned to work on project by relevant IIC branch heads. The same IIC staff member may serve on multiple IIC project teams
IIC Project Team A: Project Manager, Discipline scientists, IIC staff
IIC Project Team B: Project Manager, Discipline scientists, IIC staff
IIC Project Team C, etc.: Project Manager, Discipline scientists, IIC staff
Dissemination of Knowledge
Seminars/colloquia
Publications
Knowledge management system
Communities of practice
• Scientific journals
• IIC white papers
• Internal...
• External…
• New tools
• IIC process
Education is central to IIC’s mission
At Harvard:
Undergraduate & graduate courses focused on “data-intensive science”
New graduate certificate program, within existing Ph.D. programs
Research opportunities at undergraduate, graduate, and postdoctoral levels
Beyond Harvard:
New museum, highlighting the kind of science done at the IIC
IIC organization: research and education
Assoc Dir, Instrumentation
Assoc Dir, Visualization
Assoc Dir, Analysis & Simulation
Provost
IIC Director
Assoc Provost
Dir of Admin & Operations
Project 1 (Proj Mgr 1)
Project 2 (Proj Mgr 2)
Project 3 (Proj Mgr 3)
Dir of Education & Outreach
Etc.
CIO (systems)
Knowledge mgmt
Education & Outreach staff
Dean, Physical Sciences
Dir of Research
Assoc Dir, Databases/Data Provenance
Assoc Dir, Distributed Computing
IIC: Examples
Visualization: 3D Slicer (BWH Surgical Planning Lab)
IIC contacts: Michael Halle & Ron Kikinis
IIC contact: Felice Frankel (MIT)
Work: Garstecki/Whitesides (FAS)
“Image and Meaning” (Visualization)
Distributed Computing: Semantics, Ontologies
IIC Contact: Tim Clark (HMS/MGH)
Distributed Computing & Large Databases: Large Synoptic Survey Telescope
Optimized for time domain
scan mode
deep mode
7 square degree field
6.5m effective aperture
24th mag in 20 sec
> 5 Tbyte/night
Real-time analysis
Simultaneous multiple science goals
IIC contact: Christopher Stubbs (FAS)
Astronomy: LSST, SDSS, 2MASS, MACHO, DLS. High Energy Physics: BaBar, Atlas, RHIC.
Quantity | LSST | SDSS | 2MASS | MACHO | DLS | BaBar | Atlas | RHIC
First year of operation | 2011 | 1998 | 2001 | 1992 | 1999 | 1998 | 2007 | 1999
Run-time data rate to storage (MB/sec) | 5000 Peak, 500 Avg | 8.3 | 1 | 1 | 2.7 | 60 (zero-suppressed), 6* | 540* | 120* (’03), 250* (’04)
Daily average data rate (TB/day) | 20 | 0.02 | 0.016 | 0.008 | 0.012 | 0.6 | 60.0 | 3 (’03), 10 (’04)
Annual data store (TB) | 2000 | 3.6 | 6 | 1 | 0.25 | 300 | 7000 | 200 (’03), 500 (’04)
Total data store capacity (TB) | 20,000 (10 yrs) | 200 | 24.5 | 8 | 2 | 10,000 | 100,000 (10 yrs) | 10,000 (10 yrs)
Peak computational load (GFLOPS) | 140,000 | 100 | 11 | 1.00 | 0.600 | 2,000 | 100,000 | 3,000
Average computational load (GFLOPS) | 140,000 | 10 | 2 | 0.700 | 0.030 | 2,000 | 100,000 | 3,000
Data release delay acceptable | 1 day moving, 3 months static | 2 months | 6 months | 1 year | 6 hrs (trans), 1 yr (static) | 1 day (max), <1 hr (typ) | Few days | 100 days
Real-time alert of event | 30 sec | none | none | <1 hour | 1 hr | none | none | none
Type/number of processors | TBD | 1GHz Xeon / 18 | 450MHz Sparc / 28 | 60-70MHz Sparc / 10 | 500MHz Pentium / 5 | Mixed / 5000 | 20GHz / 10,000 | Pentium / 2500
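The LSST figures in this comparison can be cross-checked with simple unit arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and uses hypothetical helper names; it only relates the quoted 20 TB/day, 500 MB/sec average, and 2000 TB/year numbers to one another.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units, an assumption


def average_rate_mb_per_s(tb_per_day):
    """Sustained rate needed to move a daily volume if spread over 24 hours."""
    return tb_per_day * MB_PER_TB / SECONDS_PER_DAY


def hours_to_accumulate(volume_tb, rate_mb_per_s):
    """Hours of data taking needed to reach a volume at a given rate."""
    return volume_tb * MB_PER_TB / rate_mb_per_s / 3600


lsst_daily_tb = 20     # daily average data volume quoted for LSST
lsst_avg_rate = 500    # average run-time rate to storage, MB/sec
lsst_annual_tb = 2000  # annual data store, TB

print(round(average_rate_mb_per_s(lsst_daily_tb), 1))               # 231.5
print(round(hours_to_accumulate(lsst_daily_tb, lsst_avg_rate), 1))  # 11.1
print(lsst_annual_tb * 10)                                          # 20000
```

Under these assumptions, 20 TB/day corresponds to roughly 11 hours of data taking at the 500 MB/sec average rate, and ten years at 2000 TB/year gives the quoted 20,000 TB total capacity.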
Analysis & Simulations
Figure based on work of Padoan, Nordlund, Juvela, et al.
Excerpt from realization used in Padoan & Goodman 2002.
Network Architecture
• (Asymmetric) Fully Connected Networks
– Every node is connected to every other node
– Connection may be excitatory (positive), inhibitory (negative), or irrelevant (≈0).
– Most general
– Symmetric fully connected nets: weights are symmetric (w_ij = w_ji)
Input nodes: receive input from the environment
Output nodes: send signals to the environment
Hidden nodes: no direct interaction with the environment
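A fully connected network of this kind can be written down as a weight matrix, where entry w[i][j] is the connection from node j to node i. The sketch below is illustrative only: the weights are made up, self-connections are assumed to be zero, and `classify`, `is_symmetric`, and `symmetrize` are hypothetical helper names.

```python
def classify(weight, tol=1e-9):
    """Label a connection by the sign of its weight."""
    if weight > tol:
        return "excitatory"
    if weight < -tol:
        return "inhibitory"
    return "irrelevant"


def is_symmetric(w, tol=1e-9):
    """True when w[i][j] equals w[j][i] for every pair of nodes."""
    n = len(w)
    return all(abs(w[i][j] - w[j][i]) <= tol for i in range(n) for j in range(i + 1, n))


def symmetrize(w):
    """One simple way to build a symmetric net: average each pair of weights."""
    n = len(w)
    return [[(w[i][j] + w[j][i]) / 2 for j in range(n)] for i in range(n)]


# Made-up asymmetric weights for three nodes; the diagonal (self-connection) is zero.
asymmetric = [
    [0.0, 0.8, -0.5],
    [0.2, 0.0, 0.0],
    [-0.5, 0.4, 0.0],
]

if __name__ == "__main__":
    print(is_symmetric(asymmetric), is_symmetric(symmetrize(asymmetric)))
    print([[classify(x) for x in row] for row in asymmetric])
```

The asymmetric case is the most general because w[i][j] and w[j][i] are free to differ; the symmetric case constrains them to be equal.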
Analysis & Simulations: Neural Net Models of Intelligence
Does Speed of Convergence in Neural Nets Predict Scores on Measures of “General Intelligence”?
Select from the lower 8 the one that completes the pattern in the top 9
IIC contact: Stephen Kosslyn (Psychology)
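“Speed of convergence” can be illustrated as the number of update sweeps a network needs before its state stops changing. The slides do not specify the model, so this sketch assumes a small Hopfield-style network (symmetric weights, states of +1 or -1, nodes updated one at a time); the function name and the stored pattern are hypothetical.

```python
def sweeps_to_converge(weights, state, max_sweeps=100):
    """Update nodes one at a time until a full sweep changes nothing.

    Returns (number of sweeps, final state); the count includes the last,
    unchanged sweep. Returns (None, state) if max_sweeps is exceeded.
    """
    state = list(state)
    n = len(state)
    for sweep in range(1, max_sweeps + 1):
        changed = False
        for i in range(n):
            # Net input to node i from all other nodes.
            field = sum(weights[i][j] * state[j] for j in range(n) if j != i)
            if field == 0:
                continue  # no net input: keep the current state
            new_value = 1 if field > 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            return sweep, state
    return None, state


# Store one pattern with the Hebbian rule w_ij = p_i * p_j (zero diagonal).
pattern = [1, -1, 1, -1]
weights = [
    [0 if i == j else pattern[i] * pattern[j] for j in range(4)]
    for i in range(4)
]

# Start from the pattern with one node flipped and count sweeps to settle.
sweeps, final_state = sweeps_to_converge(weights, [1, 1, 1, -1])

if __name__ == "__main__":
    print(sweeps, final_state)
```

In this toy case the sweep count is a crude stand-in for convergence speed; any comparison with test scores would need the actual model used in the research.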
(Easier) Analysis of Large Data Sets: Mendelian Disease Genes
[Figure: “OMIM on the genome”, Chromosome versus Position (MB), 0 to 250]
Large data files
reformat, merge, and filter
Can a biologist get from here to there?
Location of every known disease gene on the human genome
Without programming?
IIC contact: Eitan Rubin (FAS/CGR)
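For comparison, the reformat, merge, and filter steps look roughly like this when written by hand. This is a sketch under stated assumptions: the two inline tables, their column names, and the gene and disease names are all hypothetical placeholders rather than real OMIM or genome data.

```python
import csv
import io

# Hypothetical tab-separated gene positions (base pairs).
POSITIONS_TSV = (
    "gene\tchrom\tstart_bp\n"
    "GENE_A\t7\t117120016\n"
    "GENE_B\t4\t3074877\n"
    "GENE_C\t12\t102836889\n"
)

# Hypothetical comma-separated gene-to-disease table.
DISEASES_CSV = (
    "gene,disease\n"
    "GENE_A,hypothetical disorder 1\n"
    "GENE_B,hypothetical disorder 2\n"
)


def load_rows(text, delimiter):
    """Parse delimited text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def disease_gene_positions(positions_text, diseases_text):
    """Reformat positions to megabases, merge on gene name, keep disease genes only."""
    # Reformat: convert base pairs to megabases and index by gene.
    positions = {
        row["gene"]: (row["chrom"], int(row["start_bp"]) / 1e6)
        for row in load_rows(positions_text, "\t")
    }
    merged = []
    for row in load_rows(diseases_text, ","):
        # Filter: skip disease entries that have no known position.
        if row["gene"] not in positions:
            continue
        chrom, position_mb = positions[row["gene"]]
        # Merge: combine the position and disease information.
        merged.append({
            "gene": row["gene"],
            "chrom": chrom,
            "position_mb": round(position_mb, 2),
            "disease": row["disease"],
        })
    return merged


if __name__ == "__main__":
    for entry in disease_gene_positions(POSITIONS_TSV, DISEASES_CSV):
        print(entry)
```

Even this small case requires knowledge of file formats, joins, and unit conversion, which is the barrier the “Without programming?” question points at.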
Instrumentation
IIC contact: Matt Welsh, FAS
IIC: Mission
The Institute for Innovative Computing (IIC) will make Harvard a world leader in the innovative and creative use of computational resources to address forefront scientific problems.
We will focus on developing capabilities that are applicable to multiple disciplines, by undertaking specific, well-defined projects, thereby developing tools and approaches that can be generalized and shared.
We will foster the flow of ideas and inventions along the continuum from basic science to scientific computation to computational science to computer science.
We will train a next generation of creative and computationally capable scientists, build linkages to industry, and communicate with the public at large.
Why Here?
Diverse group of senior faculty and accomplished scientists…
…spanning a wide range of relevant disciplines, e.g.,
Computer science
Physics, Chemistry, Astronomy, Statistics, Biology, Medicine, etc.
Psychology, Graphic Design
…with backgrounds in both academia and industry…
…deeply committed to the vision of a collaborative approach to solving the most compelling computing challenges facing scientists today