e-Science and its Implications for the Library Community

Tony Hey, Corporate Vice President, Technical Computing, Microsoft Corporation

Post on 14-Jan-2016

TRANSCRIPT

Page 1: Tony Hey  Corporate Vice President  Technical Computing  Microsoft Corporation

e-Science and its Implications for the Library Community

Tony Hey
Corporate Vice President, Technical Computing
Microsoft Corporation

Diagram labels: Multidisciplinary Research; Computer and Information Sciences; Life Sciences; Earth Sciences; Social Sciences; New Materials, Technologies and Processes

Page 2

Licklider’s Vision

“Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job”

Larry Roberts – Principal Architect of the ARPANET

Page 3

Physics and the Web

Tim Berners-Lee developed the Web at CERN as a tool for exchanging information between the partners in physics collaborations
The first Web site in the USA was a link to the SLAC library catalogue
It was the international particle physics community who first embraced the Web
‘Killer’ application for the Internet
Transformed modern world – academia, business and leisure

Page 4

Beyond the Web?

Scientists developing collaboration technologies that go far beyond the capabilities of the Web
  To use remote computing resources
  To integrate, federate and analyse information from many disparate, distributed data resources
  To access and control remote experimental equipment
Capability to access, move, manipulate and mine data is the central requirement of these new collaborative science applications
  Data held in file or database repositories
  Data generated by accelerators or telescopes
  Data gathered from mobile sensor networks

Page 5

What is e-Science?

‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it’

John Taylor
Director General of Research Councils
UK, Office of Science and Technology

Page 6

The e-Science Vision

e-Science is about multidisciplinary science and the technologies to support such distributed, collaborative scientific research
  Many areas of science are in danger of being overwhelmed by a ‘data deluge’ from new high-throughput devices, sensor networks, satellite surveys …
  Areas such as bioinformatics, genomics, drug design, engineering, healthcare … require collaboration between different domain experts
‘e-Science’ is a shorthand for a set of technologies to support collaborative networked science

Page 7

e-Science – Vision and Reality

Vision
  Oceanographic sensors - Project Neptune
  Joint US-Canadian proposal
Reality
  Chemistry – The Comb-e-Chem Project
  Annotation, Remote Facilities and e-Publishing

Page 8

http://www.neptune.washington.edu/

Page 9

Undersea Sensor Network
Connected & Controllable Over the Internet

Page 10

Data Provenance

Page 11

Visual Programming
Persistent Distributed Storage

Page 12

Distributed Computation
Interoperability & Legacy Support via Web Services

Page 13

Live Documents
Searching & Visualization
Reputation & Influence

Page 14

Reproducible Research

Page 15

Collaboration

Page 16

Handwriting

Page 17

Dynamic Documents
Interactive Data

Page 18

The Comb-e-Chem Project

Diagram labels: National X-Ray Service; Data Mining and Analysis; Automatic Annotation; Combinatorial Chemistry Wet Lab; HPC Simulation; Video Data Stream; Diffractometer; Middleware; Structures Database

Page 19

National Crystallographic Service

Diagram labels: X-Ray e-Laboratory; Structures Database; Computation Service

Send sample material to NCS service
Search materials database and predict properties using Grid computations
Download full data on materials of interest
Collaborate in e-Lab experiment and obtain structure

Page 20

A digital lab book replacement that chemists were able to use, and liked

Page 21

Monitoring laboratory experiments using a broker delivered over GPRS on a PDA

Page 22

Crystallographic e-Prints

Direct access to raw data from scientific papers
Raw data sets can be very large - stored at UK National Datastore using SRB software

Page 23

eBank Project

Entire e-Science cycle, encompassing experimentation, analysis, publication, research, learning

Diagram labels: Grid; E-Scientists; E-Experimentation; Institutional Archive; Local Web; Publisher Holdings; Digital Library; Graduate Students; Undergraduate Students; Virtual Learning Environment; Technical Reports; Reprints; Peer-Reviewed Journal & Conference Papers; Preprints & Metadata; Certified Experimental Results & Analyses; Data, Metadata & Ontologies

Page 24

Support for e-Science

Cyberinfrastructure and e-Infrastructure
  In the US, Europe and Asia there is a common vision for the ‘cyberinfrastructure’ required to support the e-Science revolution
  Set of Middleware Services supported on top of high bandwidth academic research networks
Similar to vision of the Grid as a set of services that allows scientists – and industry – to routinely set up ‘Virtual Organizations’ for their research – or business
  Many companies emphasize computing cycle aspect of Grids
  The ‘Microsoft Grid’ vision is more about data management than about compute clusters

Page 25

Six Key Elements for a Global Cyberinfrastructure for e-Science

1. High bandwidth Research Networks
2. Internationally agreed AAA Infrastructure
3. Development Centers for Open Standard Grid Middleware
4. Technologies and standards for Data Provenance, Curation and Preservation
5. Open access to Data and Publications via Interoperable Repositories
6. Discovery Services and Collaborative Tools

Page 26

The Web Services ‘Magic Bullet’

Diagram labels: Company A (J2EE); Open Source (OMII); Company C (.Net); Web Services

Page 27

Diagram labels: Computational Modeling; Real-world Data; Interpretation & Insight; Persistent Distributed Data; Workflow, Data Mining & Algorithms

Page 28

Technical Computing in Microsoft

Radical Computing
  Research in potential breakthrough technologies
Advanced Computing for Science and Engineering
  Application of new algorithms, tools and technologies to scientific and engineering problems
High Performance Computing
  Application of high performance clusters and database technologies to industrial applications

Page 29

New Science Paradigms

Thousand years ago: Experimental Science
  - description of natural phenomena
Last few hundred years: Theoretical Science
  - Newton’s Laws, Maxwell’s Equations …
Last few decades: Computational Science
  - simulation of complex phenomena
Today: e-Science or Data-centric Science
  - unify theory, experiment, and simulation
  - using data exploration and data mining
    Data captured by instruments
    Data generated by simulations
    Processed by software
    Scientist analyzes databases/files

(With thanks to Jim Gray)

Page 30

Advanced Computing for Science and Engineering

CONTENT: Scholarly Communication, Institutional Repositories
DATA: Acquisition, Storage, Annotation, Provenance, Curation, Preservation
TOOLS: Workflow, Collaboration, Visualization, Data Mining

. . .

Page 32

Key Issues for e-Science

Workflows
  The LEAD Project
The Data Chain
  From Acquisition to Preservation
Scholarly Communication
  Open Access to Data and Publications

Page 33

The LEAD Project

Better predictions for mesoscale weather

Page 34

The LEAD Vision

DYNAMIC OBSERVATIONS

Analysis/Assimilation: Quality Control; Retrieval of Unobserved Quantities; Creation of Gridded Fields
Prediction/Detection: PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users: NWS; Private Companies; Students
Models and Algorithms Driving Sensors

The CS challenge: Build a virtual “eScience” laboratory to support experimentation and education leading to this vision.

Page 35

Composing LEAD Services

Need to construct workflows that are:
  Data Driven
    The weather input stream defines the nature of the computation
  Persistent and Agile
    An agent mines a data stream and notices an “interesting” feature. This event may trigger a workflow scenario that has been waiting for months
  Adaptive
    The weather changes
    Workflow may have to change on-the-fly
    Resources
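
The "Persistent and Agile" requirement can be sketched as a dormant workflow that an agent wakes when a reading in the data stream looks interesting. This is only an illustration, not LEAD code: the class and function names, the wind-speed readings and the 30 m/s threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class DormantWorkflow:
    """A workflow that waits until a trigger condition is observed."""
    name: str
    trigger: Callable[[float], bool]  # decides whether a reading is "interesting"
    runs: List[float] = field(default_factory=list)

    def launch(self, reading: float) -> None:
        # Stand-in for starting the real forecast computation.
        self.runs.append(reading)


def monitor(stream: Iterable[float], workflows: List[DormantWorkflow]) -> None:
    """Agent loop: scan the stream and wake any workflow whose trigger fires."""
    for reading in stream:
        for wf in workflows:
            if wf.trigger(reading):
                wf.launch(reading)


# Hypothetical wind-speed readings in m/s; the threshold is arbitrary.
storm_watch = DormantWorkflow("storm_forecast", trigger=lambda v: v >= 30.0)
monitor([4.2, 11.0, 31.5, 8.7, 42.0], [storm_watch])
print(storm_watch.runs)
```

Keeping the trigger as a plain callable means the same agent loop can serve many waiting workflows, each with its own notion of an interesting feature.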

Page 36

Example LEAD Workflow

Page 37

The e-Science Data Chain

Data Acquisition
Data Ingest
Metadata
Annotation
Provenance
Data Storage
Curation
Preservation

Page 38

The Data Deluge

In the next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history
Some normalizations:
  The Bible = 5 Megabytes
  Annual refereed papers = 1 Terabyte
  Library of Congress = 20 Terabytes
  Internet Archive (1996 – 2002) = 100 Terabytes
In many fields new high throughput devices, sensors and surveys will be producing Petabytes of scientific data
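
To put the normalizations on one scale, the short calculation below works out how many copies of each collection would fit in a single petabyte. It uses the sizes quoted on the slide and assumes decimal units (1 Terabyte = 10^12 bytes), which the slide does not specify.

```python
# Decimal units are an assumption; the slide does not say which convention it uses.
MB, TB, PB = 10**6, 10**12, 10**15

sizes = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996-2002)": 100 * TB,
}


def copies_per_petabyte(size_bytes: int) -> int:
    """Number of whole copies of a collection that fit in one petabyte."""
    return PB // size_bytes


ratios = {name: copies_per_petabyte(size) for name, size in sizes.items()}
for name, count in ratios.items():
    print(f"1 PB holds about {count:,} x {name}")
```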

Page 39

The Problem for the e-Scientist

Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist & cooperate with others?
Data Query and Visualization tools
Support/training
Performance
  Execute queries in a minute
  Batch (big) query scheduling

Diagram labels: Experiments & Instruments; Simulations; Literature; Other Archives; facts; questions; answers

Page 40

Digital Curation?

In 20 years can guarantee that the operating system and spreadsheet program and the hardware used to store data will not exist
  Need research ‘curation’ technologies such as workflow, provenance and preservation
  Need to liaise closely with individual research communities, data archives and libraries
The UK has set up the ‘Digital Curation Centre’ in Edinburgh with Glasgow, UKOLN and CCLRC
  Attempt to bring together skills of scientists, computer scientists and librarians

Page 41

Digital Curation Centre

Actions needed to maintain and utilise digital data and research results over entire life-cycle
  For current and future generations of users
Digital Preservation
  Long-run technological/legal accessibility and usability
Data curation in science
  Maintenance of body of trusted data to represent current state of knowledge
Research in tools and technologies
  Integration, annotation, provenance, metadata, security …

Page 42

Berlin Declaration 2003

‘To promote the Internet as a functional instrument for a global scientific knowledge base and for human reflection’

Defines open access contributions as including:
  ‘original scientific research results, raw data and metadata, source materials, digital representations of pictorial and graphical materials and scholarly multimedia material’

Page 43

NSF ‘Atkins’ Report on Cyberinfrastructure

‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’

‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’

Page 44

MIT DSpace Vision

‘Much of the material produced by faculty, such as datasets, experimental results and rich media data as well as more conventional document-based material (e.g. articles and reports) is housed on an individual’s hard drive or department Web server. Such material is often lost forever as faculty and departments change over time.’

Page 45

Publishing Data & Analysis Is Changing

Roles       Traditional   Emerging
Authors     Scientists    Collaborations
Publishers  Journals      Project web site
Curators    Libraries     Data+Doc Archives
Archives    Archives      Digital Archives
Consumers   Scientists    Scientists

Page 46

Data Publishing: The Background

In some areas – notably biology – databases are replacing (paper) publications as a medium of communication
  These databases are built and maintained with a great deal of human effort
  They often do not contain source experimental data - sometimes just annotation/metadata
  They borrow extensively from, and refer to, other databases
  You are now judged by your databases as well as your (paper) publications
  Upwards of 1000 (public databases) in genetics

Page 47

Data Publishing: The Issues

Data integration
  Tying together data from various sources
Annotation
  Adding comments/observations to existing data
  Becoming a new form of communication
Provenance
  ‘Where did this data come from?’
Exporting/publishing in agreed formats
  To other programs as well as people
Security
  Specifying/enforcing read/write access to parts of your data
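
As a rough illustration of the annotation and provenance issues listed above, a published data record might carry its own comments and a link back to the record it was derived from. The `DataRecord` class, its fields and the sample identifiers are hypothetical and do not come from any project named in this talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataRecord:
    identifier: str
    source: str  # instrument, program or person that produced the record
    derived_from: Optional["DataRecord"] = None
    annotations: List[str] = field(default_factory=list)

    def annotate(self, comment: str) -> None:
        """Attach a comment or observation to existing data."""
        self.annotations.append(comment)

    def lineage(self) -> List[str]:
        """Answer 'where did this data come from?' by walking back to the origin."""
        chain, record = [], self
        while record is not None:
            chain.append(f"{record.identifier} ({record.source})")
            record = record.derived_from
        return chain


# Hypothetical identifiers and sources, for illustration only.
raw = DataRecord("raw-001", "diffractometer")
refined = DataRecord("struct-001", "refinement software", derived_from=raw)
refined.annotate("Checked against a second sample")
print(refined.lineage())
```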

Page 48

Interoperable Repositories?

Paul Ginsparg’s arXiv at Cornell has demonstrated new model of scientific publishing
  Electronic version of ‘preprints’ hosted on the Web
David Lipman of the NIH National Library of Medicine has developed PubMedCentral as repository for NIH funded research papers
  Microsoft funded development of ‘portable PMC’ now being deployed in UK and other countries
Stevan Harnad’s ‘self-archiving’ EPrints project in Southampton provides a basis for OAI-compliant ‘Institutional Repositories’
  Many national initiatives around the world moving towards mandating deposition of ‘full text’ of publicly funded research papers in repositories
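
In practice, ‘OAI-compliant’ usually means the repository answers requests in the OAI Protocol for Metadata Harvesting (OAI-PMH), so that other services can harvest its metadata. The sketch below builds a `ListRecords` request asking for Dublin Core (`oai_dc`) records; it assumes that reading of the slide, and the endpoint URL and helper function name are hypothetical. It only constructs the URL and makes no network call.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # optional lower bound on record datestamps
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint; a real repository publishes its own base URL.
url = list_records_url("https://repository.example.org/oai", from_date="2005-01-01")
print(url)
```

Because every compliant repository accepts the same verbs and parameters, one harvester can collect metadata from many institutions, which is the interoperability this slide is about.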

Page 49

Microsoft Strategy for e-Science

Microsoft intends to work with the scientific and library communities:
  to define open standard and/or interoperable high-level services, work flows and tools
  to assist the community in developing open scholarly communication and interoperable repositories

Page 50

Acknowledgements

With special thanks to Kelvin Droegemeier, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim Gray, Yike Guo, Liz Lyon and Beth Plale