e-Science and its Implications for the Library Community
Tony Hey
Corporate Vice President, Technical Computing
Microsoft Corporation
Multidisciplinary Research
  Computer and Information Sciences
  Life Sciences
  Earth Sciences
  Social Sciences
  New Materials, Technologies and Processes
Licklider’s Vision
“Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job”
Larry Roberts – Principal Architect of the ARPANET
Physics and the Web
Tim Berners-Lee developed the Web at CERN as a tool for exchanging information between the partners in physics collaborations
The first Web Site in the USA was a link to the SLAC library catalogue
It was the international particle physics community who first embraced the Web
‘Killer’ application for the Internet
Transformed modern world – academia, business and leisure
Beyond the Web?
Scientists developing collaboration technologies that go far beyond the capabilities of the Web
  To use remote computing resources
  To integrate, federate and analyse information from many disparate, distributed data resources
  To access and control remote experimental equipment
Capability to access, move, manipulate and mine data is the central requirement of these new collaborative science applications
  Data held in file or database repositories
  Data generated by accelerators or telescopes
  Data gathered from mobile sensor networks
What is e-Science?
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it’
John Taylor
Director General of Research Councils
UK, Office of Science and Technology
The e-Science Vision
e-Science is about multidisciplinary science and the technologies to support such distributed, collaborative scientific research
  Many areas of science are in danger of being overwhelmed by a ‘data deluge’ from new high-throughput devices, sensor networks, satellite surveys …
  Areas such as bioinformatics, genomics, drug design, engineering, healthcare … require collaboration between different domain experts
‘e-Science’ is a shorthand for a set of technologies to support collaborative networked science
e-Science – Vision and Reality
Vision
  Oceanographic sensors - Project Neptune
  Joint US-Canadian proposal
Reality
  Chemistry – The Comb-e-Chem Project
  Annotation, Remote Facilities and e-Publishing
http://www.neptune.washington.edu/
Undersea Sensor Network
Connected & Controllable Over the Internet
Data Provenance
Visual Programming
Persistent Distributed Storage
Distributed Computation
Interoperability & Legacy Support via Web Services
Live Documents
Searching & Visualization
Reputation & Influence
Reproducible Research
Collaboration
Handwriting
Dynamic Documents
Interactive Data
The Comb-e-Chem Project
[Diagram labels: National X-Ray Service; Data Mining and Analysis; Automatic Annotation; Combinatorial Chemistry Wet Lab; HPC Simulation; Video Data Stream; Diffractometer; Middleware; Structures Database]
National Crystallographic Service
[Diagram labels: X-Ray e-Laboratory; Structures Database; Computation Service]
Send sample material to NCS service
Search materials database and predict properties using Grid computations
Download full data on materials of interest
Collaborate in e-Lab experiment and obtain structure
A digital lab book replacement that chemists were able to use, and liked
Monitoring laboratory experiments using a broker delivered over GPRS on a PDA
Crystallographic e-Prints
Direct Access to Raw Data from scientific papers
Raw data sets can be very large - stored at UK National Datastore using SRB software
eBank Project
Entire E-Science Cycle
Encompassing experimentation, analysis, publication, research, learning
[Diagram labels: Grid; E-Scientists; E-Experimentation; Institutional Archive; Local Web; Publisher Holdings; Digital Library; Virtual Learning Environment; Graduate Students; Undergraduate Students; Technical Reports; Reprints; Peer-Reviewed Journal & Conference Papers; Preprints & Metadata; Certified Experimental Results & Analyses; Data, Metadata & Ontologies]
Support for e-Science
Cyberinfrastructure and e-Infrastructure
In the US, Europe and Asia there is a common vision for the ‘cyberinfrastructure’ required to support the e-Science revolution
  Set of Middleware Services supported on top of high bandwidth academic research networks
Similar to vision of the Grid as a set of services that allows scientists – and industry – to routinely set up ‘Virtual Organizations’ for their research – or business
  Many companies emphasize computing cycle aspect of Grids
  The ‘Microsoft Grid’ vision is more about data management than about compute clusters
Six Key Elements for a Global Cyberinfrastructure for e-Science
1. High bandwidth Research Networks
2. Internationally agreed AAA Infrastructure
3. Development Centers for Open Standard Grid Middleware
4. Technologies and standards for Data Provenance, Curation and Preservation
5. Open access to Data and Publications via Interoperable Repositories
6. Discovery Services and Collaborative Tools
The Web Services ‘Magic Bullet’
[Diagram labels: Company A (J2EE); Open Source (OMII); Company C (.Net); Web Services]
[Diagram labels: Computational Modeling; Real-world Data; Interpretation & Insight; Persistent Distributed Data; Workflow, Data Mining & Algorithms]
Technical Computing in Microsoft
Radical Computing
  Research in potential breakthrough technologies
Advanced Computing for Science and Engineering
  Application of new algorithms, tools and technologies to scientific and engineering problems
High Performance Computing
  Application of high performance clusters and database technologies to industrial applications
New Science Paradigms
Thousand years ago: Experimental Science
  - description of natural phenomena
Last few hundred years: Theoretical Science
  - Newton’s Laws, Maxwell’s Equations …
Last few decades: Computational Science
  - simulation of complex phenomena
Today: e-Science or Data-centric Science
  - unify theory, experiment, and simulation
  - using data exploration and data mining
    Data captured by instruments
    Data generated by simulations
    Processed by software
    Scientist analyzes databases/files
(With thanks to Jim Gray)
CONTENT Scholarly Communication, Institutional Repositories
DATA Acquisition, Storage, Annotation, Provenance, Curation, Preservation
TOOLS Workflow, Collaboration, Visualization, Data Mining
Advanced Computing for Science and Engineering
Top 500 Supercomputer Trends
Industry usage rising
Clusters over 50%
x86 is winning
GigE is gaining
Key Issues for e-Science
Workflows
  The LEAD Project
The Data Chain
  From Acquisition to Preservation
Scholarly Communication
  Open Access to Data and Publications
The LEAD Project
Better predictions for Mesoscale weather
Analysis/Assimilation
  Quality Control
  Retrieval of Unobserved Quantities
  Creation of Gridded Fields
Prediction/Detection
  PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users
  NWS
  Private Companies
  Students
The LEAD Vision
DYNAMIC OBSERVATIONS
Models and Algorithms Driving Sensors
The CS challenge: Build a virtual “eScience” laboratory to support experimentation and education leading to this vision.
Composing LEAD Services
Need to construct workflows that are:
Data Driven
  The weather input stream defines the nature of the computation
Persistent and Agile
  An agent mines a data stream and notices an “interesting” feature. This event may trigger a workflow scenario that has been waiting for months
Adaptive
  The weather changes
  Workflow may have to change on-the-fly
  Resources
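The three workflow properties listed for LEAD can be illustrated with a minimal Python sketch. It is not LEAD code: the class, the step names, the wind-speed thresholds and the sample readings are all hypothetical, chosen only to show a workflow that waits on a stream, fires on an "interesting" value, and changes its plan according to the data.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class PersistentWorkflow:
    """A workflow that stays registered until a matching event arrives."""
    name: str
    trigger: Callable[[float], bool]   # decides whether a reading is "interesting"
    steps: List[str]                   # ordered service names (hypothetical)
    runs: List[List[str]] = field(default_factory=list)

    def adapt(self, reading: float) -> List[str]:
        # Adaptive: the plan depends on the data that triggered it.
        # The 40.0 threshold is an illustrative assumption.
        if reading > 40.0:
            return self.steps + ["refine_forecast_grid"]
        return list(self.steps)

    def on_reading(self, reading: float) -> bool:
        # Data driven: the input stream decides whether anything runs at all.
        if not self.trigger(reading):
            return False
        self.runs.append(self.adapt(reading))
        return True


def mine_stream(readings: Iterable[float], workflow: PersistentWorkflow) -> int:
    """Feed each reading to the waiting workflow; return how many runs fired."""
    return sum(1 for r in readings if workflow.on_reading(r))


# Persistent: the workflow is defined once and may wait indefinitely.
storm_watch = PersistentWorkflow(
    name="storm_watch",
    trigger=lambda wind_speed: wind_speed >= 30.0,
    steps=["assimilate_observations", "run_forecast_model", "publish_products"],
)
fired = mine_stream([5.0, 12.0, 31.0, 8.0, 45.0], storm_watch)
print(fired, storm_watch.runs)
```

Only the readings 31.0 and 45.0 cross the trigger, and the second run gains an extra step because the reading is above the adaptation threshold.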
Example LEAD Workflow
The e-Science Data Chain
  Data Acquisition
  Data Ingest
  Metadata
  Annotation
  Provenance
  Data Storage
  Curation
  Preservation
The Data Deluge
In the next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history
Some normalizations:
  The Bible = 5 Megabytes
  Annual refereed papers = 1 Terabyte
  Library of Congress = 20 Terabytes
  Internet Archive (1996 – 2002) = 100 Terabytes
In many fields new high throughput devices, sensors and surveys will be producing Petabytes of scientific data
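To put the normalizations in proportion, the short Python calculation below expresses one petabyte in units of each quoted collection. The sizes are the slide's own figures. Decimal (SI) prefixes are an assumption, since the slide does not say which convention it uses.

```python
# Assumption: decimal (SI) prefixes; the slide does not state a convention.
MB = 10**6
TB = 10**12
PB = 10**15

# Sizes as quoted on the slide.
sizes = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996-2002)": 100 * TB,
}


def copies_per_petabyte(size_bytes: int) -> float:
    """How many collections of this size fit into one petabyte."""
    return PB / size_bytes


ratios = {name: copies_per_petabyte(size) for name, size in sizes.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f} per petabyte")
```

On these figures a single petabyte corresponds to 50 times the quoted size of the Library of Congress.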
The Problem for the e-Scientist
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist & cooperate with others?
Data Query and Visualization tools
Support/training
Performance
  Execute queries in a minute
  Batch (big) query scheduling
[Diagram: facts flow from Experiments & Instruments, Simulations, Literature and Other Archives; the scientist poses questions and receives answers]
Digital Curation?
In 20 years can guarantee that the operating system and spreadsheet program and the hardware used to store data will not exist
  Need research ‘curation’ technologies such as workflow, provenance and preservation
  Need to liaise closely with individual research communities, data archives and libraries
The UK has set up the ‘Digital Curation Centre’ in Edinburgh with Glasgow, UKOLN and CCLRC
  Attempt to bring together skills of scientists, computer scientists and librarians
Digital Curation Centre
Actions needed to maintain and utilise digital data and research results over entire life-cycle
  For current and future generations of users
Digital Preservation
  Long-run technological/legal accessibility and usability
Data curation in science
  Maintenance of body of trusted data to represent current state of knowledge
Research in tools and technologies
  Integration, annotation, provenance, metadata, security …
Berlin Declaration 2003
‘To promote the Internet as a functional instrument for a global scientific knowledge base and for human reflection’
Defines open access contributions as including:
  ‘original scientific research results, raw data and metadata, source materials, digital representations of pictorial and graphical materials and scholarly multimedia material’
NSF ‘Atkins’ Report on Cyberinfrastructure
‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’
‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’
MIT DSpace Vision
‘Much of the material produced by faculty, such as datasets, experimental results and rich media data as well as more conventional document-based material (e.g. articles and reports) is housed on an individual’s hard drive or department Web server. Such material is often lost forever as faculty and departments change over time.’
Publishing Data & Analysis Is Changing
Roles       Traditional   Emerging
Authors     Scientists    Collaborations
Publishers  Journals      Project web site
Curators    Libraries     Data+Doc Archives
Archives    Archives      Digital Archives
Consumers   Scientists    Scientists
Data Publishing: The Background
In some areas – notably biology – databases are replacing (paper) publications as a medium of communication
  These databases are built and maintained with a great deal of human effort
  They often do not contain source experimental data - sometimes just annotation/metadata
  They borrow extensively from, and refer to, other databases
  You are now judged by your databases as well as your (paper) publications
  Upwards of 1000 (public databases) in genetics
Data Publishing: The issues
Data integration
  Tying together data from various sources
Annotation
  Adding comments/observations to existing data
  Becoming a new form of communication
Provenance
  ‘Where did this data come from?’
Exporting/publishing in agreed formats
  To other programs as well as people
Security
  Specifying/enforcing read/write access to parts of your data
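One way to make the provenance question concrete is to store, with each record, the records it was derived from, and then walk that chain. The Python sketch below is purely illustrative: the `Record` fields, the record identifiers and the source labels are hypothetical and do not describe any particular database mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    record_id: str
    source: str                          # instrument, program or database (hypothetical labels)
    derived_from: Tuple[str, ...] = ()   # ids of the records this one was built from
    annotations: Tuple[str, ...] = ()    # comments added to existing data


def lineage(record_id: str, store: Dict[str, Record]) -> List[str]:
    """Answer 'where did this data come from?' by walking derived_from links.

    Returns the record itself followed by its ancestors, depth first.
    """
    seen, order, stack = set(), [], [record_id]
    while stack:
        current = store[stack.pop()]
        if current.record_id in seen:
            continue
        seen.add(current.record_id)
        order.append(f"{current.record_id} <- {current.source}")
        stack.extend(reversed(current.derived_from))
    return order


store = {
    "raw": Record("raw", "diffractometer"),
    "structure": Record("structure", "refinement program", derived_from=("raw",)),
    "entry": Record("entry", "curated database", derived_from=("structure",),
                    annotations=("checked by curator",)),
}
chain = lineage("entry", store)
print(chain)
```

Keeping annotations on the same record as the derivation links is a design choice for the sketch; a real system might store them separately.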
Interoperable Repositories?
Paul Ginsparg’s arXiv at Cornell has demonstrated new model of scientific publishing
  Electronic version of ‘preprints’ hosted on the Web
David Lipman of the NIH National Library of Medicine has developed PubMedCentral as repository for NIH funded research papers
  Microsoft funded development of ‘portable PMC’ now being deployed in UK and other countries
Stevan Harnad’s ‘self-archiving’ EPrints project in Southampton provides a basis for OAI-compliant ‘Institutional Repositories’
  Many national initiatives around the world moving towards mandating deposition of ‘full text’ of publicly funded research papers in repositories
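‘OAI-compliant’ here refers to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), in which a harvester sends HTTP requests such as `verb=ListRecords` and receives XML records, commonly in Dublin Core. The Python sketch below builds such a request URL and parses a response. The repository address and the sample record are made up for illustration, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request for a repository's OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def records(response_xml: str) -> list:
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    out = []
    for rec in root.iter(OAI_NS + "record"):
        ident = rec.find(f"{OAI_NS}header/{OAI_NS}identifier")
        title = rec.find(f".//{DC_NS}title")  # absent for deleted records
        out.append((ident.text, title.text if title is not None else None))
    return out


# Hypothetical response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Crystal structure data set</dc:title>
          <dc:creator>A. Chemist</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")
harvested = records(SAMPLE)
print(url)
print(harvested)
```

Because every compliant repository answers the same requests in the same format, one harvester can collect metadata from many institutional repositories.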
Microsoft Strategy for e-Science
Microsoft intends to work with the scientific and library communities:
  to define open standard and/or interoperable high-level services, work flows and tools
  to assist the community in developing open scholarly communication and interoperable repositories
Acknowledgements
With special thanks to Kelvin Droegemeier, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim Gray, Yike Guo, Liz Lyon and Beth Plale