From Bio-Informatics towards e-BioScience L.O. (Bob) Hertzberger Computer Architecture and Parallel Systems Group Department of Computer Science Universiteit van Amsterdam [email protected]

Upload: melinda-carpenter

Post on 11-Jan-2016


Page 1

From Bio-Informatics towards e-BioScience

L.O. (Bob) Hertzberger

Computer Architecture and Parallel Systems Group
Department of Computer Science

Universiteit van Amsterdam

[email protected]

Page 2

Background information: experimental sciences

• There is a tendency to look ever deeper into:
  - Matter, e.g. physics
  - The universe, e.g. astronomy
  - Life, e.g. life sciences
• The instrumental consequences are increases in detector:
  - Resolution & sensitivity
  - Automation & robotization
• Experiments therefore change in nature and become increasingly complex

Page 3

Impact in the life sciences

• Impact of high-throughput methods, e.g. omics experimentation
  - genome ===> genomics

Page 4

New technologies in Life Sciences research

[Figure (University of Amsterdam): methodology/technology applied to the cell: genomics (DNA), transcriptomics (RNA), proteomics (protein), metabolomics (metabolites)]

Page 5

Omics impact

Page 6

Impact in the life sciences

• Impact of high-throughput methods, e.g. omics experimentation
  - genome ===> genomics
• Instrumentation being used in omics experimentation:
  - Transcriptomics via, among others, micro-arrays
  - Proteomics via, among others, Mass Spectroscopy (MS)
  - Metabolomics via, among others, MS & Nuclear Magnetic Resonance (NMR)

Page 7

Results in Paradigm shift in Life sciences

• Past experiments were hypothesis driven
  - Evaluate hypothesis
  - Complement existing knowledge
• Present experiments are data driven
  - Discover knowledge from large amounts of data

Page 8

Life sciences research: from gene to function

[Figure: from gene to function. In the cell nucleus a gene (DNA) is expressed by RNA synthesis into mRNA (with poly-A tail); mRNA translation by protein synthesis yields a protein (NH2 ... COOH) with function-1, function-2, ..., function-n. Annotated technologies: whole-genome sequence projects, genome-wide micro-array analysis, “high-throughput” protein analysis. Protein function: prediction by bioinformatics, proof by laboratory research.]

Page 9

Developments towards Bio-informatics & e-Science

• Experiments become increasingly complex
• Driven by an increase in detector developments
• This results in an increase in the amount and complexity of data
• Something has to be done to harness this development
  - Bio-informatics to translate data into useful biological, medical, pharmaceutical & agricultural knowledge

Page 10

The what of Bioinformatics

Bioinformatics is redefining rules and scientific approaches, resulting in the ‘new biology’. Within this new paradigm the traditional scientific boundaries are blurred, leaving no clear line between ‘dry or computational’ and ‘wet-based’ approaches

Page 11

Role of bioinformatics

[Figure: role of bioinformatics. Methodology applied to the cell: genomics (DNA), transcriptomics (RNA), proteomics (protein), metabolomics (metabolites), together with integrative/system biology. Bioinformatics covers data generation/validation, data integration/fusion, and data usage/user interfacing.]

Page 12

Two sides of Bioinformatics

• The scientific responsibility to develop the underlying computational concepts and models to convert complex biological data into useful biological and chemical knowledge

• The technological responsibility to manage and integrate huge amounts of heterogeneous data sources from high-throughput experimentation
  - Need for e-Science support

 

Page 13

Developments towards Bio-informatics & e-Science

• Experiments become increasingly complex
• Driven by an increase in detector developments
• This results in an increase in the amount and complexity of data
• Something has to be done to harness this development
  - Bio-informatics to translate data into useful biological, medical, pharmaceutical & agricultural knowledge
  - Virtualization of experimental resources, enabling sharing & leading to e-BioScience

Page 14

[Figure: e-BioScience landscape. Labels: life science/genomics research consortia and industry; life science application areas; bioinformatics; e-Bioscience and life science innovation domain; e-Bioscience & research infrastructure; generic e-Science ICT development and support; e-Science & research infrastructure; Grid infrastructure; network infrastructure and computing capacity.]

Page 15

Why e-BioScience

• There is an increasing necessity to use results from other scientists, e.g. to share data & information:

Page 16

Re-use and sharing of biological data (2)

Information content of omics data is extremely high; however:
• Data are subject to noise and to biological and technical variation
• How to induce biological principles from these genome-wide data sets?
  - Approach: develop methodology for “reverse engineering” of biological mechanisms.
• Biggest challenge in bioinformatics today.

Need for external data sources for in-silico experimentation
• Two practices for re-use and sharing of data:
  - Collectively compile huge amounts of relevant data and make these available to the community. Examples: bio-banking, compendia (e.g. NIH’s Affymetrix SNP repository).
  - Re-use information from different and diverse experiments to discover phenomena

Page 17

Re-use and sharing of biological data (2)

Compendium example: re-use and sharing of Huntington data
• Datasets: 404 Affymetrix Gene chips of measurements on extremely rare human brain samples (Hodges et al., Hum. Mol. Genetics, 2006)
• Available from NCBI GEO database (MIAME)
• Goal: find genes involved in Huntington’s Disease
• Approach:
  - Reanalyze gene expression data
  - Combine genotype data and clinical data (e.g. using SigWin)
  - Extend experiments with own ChIP-on-chip data

Page 18

Resource Identification software

Repository of relevant meta-information from:
• Data warehouses, e.g. GEO, ArrayExpress, Protein Interaction database
• Literature (mining of PubMed using Collexis)
• Information resources specialized in diseases, genes and proteins, e.g. OMIM, GenBank, Ensembl

Page 19

Why e-BioScience

• There is an increasing necessity to use results from other scientists, e.g. to share data & information:
  - Data repositories
    - Cohort studies in bio-banking
    - Biodiversity
  - Expensive and complex equipment
    - Mass Spectroscopy
    - MRI
    - Other

Page 20

Problems for the realization of e-BioScience

• The life science field is still at an early stage of development:
  - First principles are not understood at all
• As a consequence, experimental methods are not well established and will not be for some time to come
• Because of the new forms of omics instrumentation there is a need for design-of-experimentation methods
  - Correct logging of the conditions under which experiments are done is lacking
  - Large amounts of data are produced that require, among others, statistical techniques for interpretation
• As a consequence, results are open to multiple interpretations

Page 21

Problems for the realization of e-BioScience

• Problems for bioinformatics & e-Bioscience:
  - Rationalisation at this early stage is almost impossible
  - Pre-standardization & standardization are almost non-existent
  - Where there are standards, they are inadequate because they are open to multiple interpretations (like MIAME for micro-arrays)
• In addition, there are commercial end-user products that are difficult to integrate
• Users lack the training necessary to handle these complex experimental situations
• The only possible solution is to create a flexible experimentation environment for the end-users

Page 22

Role of ICT in e-BioScience

• e-Science is a new form of science methodology complementing theoretical and experimental sciences.
• It uses generic methods and an ICT infrastructure to support this methodology.
  - Web services as a paradigm/way of using/accessing information
  - Grid as a method of accessing & sharing computing resources by virtualization
• What is missing in e-BioScience:
  - Connection between biological problem & e-Bioscience
  - User oriented tools that can be re-used and extended
  - General model of ICT based integration
  - Semantic support: ontologies and semantic support for workflows to make user knowledge explicit

Page 23

Consequences for bio-informatics & e-BioScience

• A considerable amount of experimentation is necessary before a well-established methodology will emerge
• The VL-e approach might be a good model: it produces an environment in which the necessary experimentation can be realized

Page 24

Enhancing the scientific process: e-BioLab

• Problem domain experts can focus on the biology because they are shielded from technical details by e-scientists.

• Viewpoints on the research question and the data can be expressed and visualized semi-instantaneously.
• Ideas and analyses can be retained and documented.

• Facilities for remote collaboration are present*.

* Rauwerda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)

[Figure: e-BioLab process. Labels: readily accessible data + models (data mining); small integration experiments + integration methods; easy visualization; vague results; basic model of problem area. Roles: biologists, e-BioScientist, e-BioOperator.]

Motivation:
• Interacting with the problem domain requires an environment in which the domain can be opened up and ideas, hunches and notions on the data and crude models of the biology can be visualized
• A tangible space in which biologists, aided by e-scientists, will have the full potential of VL-e at their disposal.

Basic concept of e-BioLab: an actual laboratory in which:
• Problem domain experts (biologists, medical doctors) and scientists from enabling disciplines jointly and creatively work on the analysis and design of -omics experiments.

Page 25

Enhancing the scientific process: e-BioLab (2)

Realization:
• Large high resolution display (26.2 Mpixel) with high bandwidth (10 Gbit/s) connection to render cluster
• Full access to computational facilities and GRID middleware of VL-e
• e-whiteboards and tablet PCs to share and store ideas
• High definition video cameras for remote collaboration
• Highly adaptable lab configuration.

Research into:
• Problem Solving Environments for the biology under study
• Formulation of scientific workflows that allow for sufficient interactivity and guarantee reproducibility
• Maintaining an electronic lab journal for e-science experimentation
• Methods for:
  - Information management of omics data
  - Biological domain interaction / resource identification
  - Modeling of biological information and knowledge
• Remote scientific co-operation
• Man-machine interaction

Page 26

High resolution displays in e-bioscience

[Figure: display wall with panels for clustering (SOM; clusters 1, 2, 3), gene lists, interesting pathways, GO categories, literature mining, GSEA, video remote collaboration and remote whiteboard]

Example: concurrently display in a discussion with a remote partner
• Clustering results of microarray experiments
• Interesting pathways that are predominant in certain clusters
• Gene Ontology categories
• Results from literature mining
• Gene Set Enrichment of categories identified in literature mining
• Notions depicted on the e-whiteboards
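One simple way to ask whether a Gene Ontology category is predominant in a cluster is an over-representation test. The sketch below uses the hypergeometric tail probability, which is a simpler technique than Gene Set Enrichment Analysis proper and is not taken from the slides; the function name and its parameters are illustrative assumptions.

```python
from math import comb  # requires Python 3.8+


def enrichment_p_value(population, annotated, cluster_size, overlap):
    """Hypergeometric tail P(X >= overlap).

    population   -- number of genes measured in total
    annotated    -- genes in the population that carry the category
    cluster_size -- genes in the cluster under inspection
    overlap      -- genes in the cluster that carry the category
    (All four names are hypothetical, chosen for this sketch.)
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 20 genes in a cluster, 5 of them in a category that covers
    # 50 of 1000 measured genes.
    print(enrichment_p_value(1000, 50, 20, 5))
```

A small p-value suggests the category occurs in the cluster more often than chance would predict; in practice a multiple-testing correction over all categories would still be needed.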

Page 27

Virtual Lab for e-Science research Philosophy

• Multidisciplinary research and development of related ICT infrastructure
• Generic application support
  - Application cases are drivers for computer & computational science and engineering research
  - Problem solving partly generic and partly specific
  - Re-use of components via generic solutions whenever possible

Page 28

Generic e-Science services

[Figure: service layers. Grid services (harness multi-domain distributed resources); generic e-Science services; domain generic e-BioScience services; domain specific tools (microarray pipeline, mass spectroscopy pipeline, pathway visualization, protein annotation). Arrows: technology push, application pull.]

Page 29

Generic e-Science services

[Figure: service layers. Grid services (harness multi-domain distributed resources); generic e-Science services; domain generic e-Science services; domain specific tools (micro-array transcriptomics pipeline, mass spectroscopy proteomics pipeline). Arrows: technology push, application pull.]

Page 30

Bioinformatics methods in VL-e (1)

Example 1 – An application specific method modified by e-science into a generic one: SigWin*

• Starting point: application specific method for detecting windows of increased gene expression on chromosomes** (implemented in C and Perl for SAGE technology)
• Motivation: broad interest from molecular biology in positional behaviour of any measurement data that can be mapped onto DNA sequences
• SigWin e-Science version:
  - GRID-based modular workflow for detecting windows of significance in any sequence of values
  - Widely applicable, from gene expression to meteorology data
  - Modules reusable for alternative workflows, e.g. protein modification
  - Scalable to very large datasets

* Inda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
** Versteeg et al., Genome Research, 2003
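The idea of detecting windows of significance in any sequence of values can be illustrated with a small sliding-window sketch. This is not the SigWin implementation: the use of window medians, the permutation-based threshold, and all function and parameter names are assumptions made for illustration only.

```python
import random
from statistics import median


def sliding_medians(values, window):
    """Median of every run of `window` consecutive values."""
    return [median(values[i:i + window])
            for i in range(len(values) - window + 1)]


def significant_windows(values, window, n_perm=200, alpha=0.05, seed=0):
    """Start indices of windows whose median exceeds a permutation threshold.

    The null distribution is built by shuffling the positions of the values,
    so only the positional clustering of high values is tested.
    """
    rng = random.Random(seed)
    observed = sliding_medians(values, window)
    null = []
    shuffled = list(values)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.extend(sliding_medians(shuffled, window))
    null.sort()
    # (1 - alpha) quantile of the pooled null medians.
    threshold = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
    return [i for i, m in enumerate(observed) if m > threshold]


if __name__ == "__main__":
    # Any ordered measurements work: expression along a chromosome,
    # or daily temperatures.
    data = [1.0] * 60
    for i in range(20, 26):
        data[i] = 10.0
    print(significant_windows(data, window=5))
```

Because the function only needs an ordered list of numbers, the same code applies to gene expression mapped onto chromosome positions and to a time series, which is the sense in which the slide calls the method generic.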

Page 31

Bioinformatics methods: SigWin

Significant window detector: generalisation of the RIDGE method

[Figure panels: human gene expression; temperature in Amsterdam; DNA curvature of the Escherichia coli chromosome]

Page 32

Bioinformatics methods in VL-e (2)

Example 2 – An application specific method composed of generic and specific modules in a workflow: OligoRAP*

• Purpose: a re-annotation workflow for oligo libraries
• Motivation: rapidly evolving knowledge in genome analysis requires frequent re-assessment of the molecules that are used to measure gene expression.
• OligoRAP:
  - Uses a set of application generic (BIOMOBY) BLAT and BLAST sequence alignment (web) services.
  - Uses an application specific (BIOMOBY) annotation analysis service
  - BIOMOBY: de-facto standard for bio-informatics web services.
  - Joint work of sequence analysis lab and micro-array lab
• Workflow:
  - Adjustable filtering criteria make quality level of oligos explicit
  - Workflow provenance makes re-annotation reproducible.

* P. Neerincx, H. Rauwerda, F. Verster, A. Kommadath, T.M. Breit, J.A.M. Leunissen, Poster ISMB 2006
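How adjustable filtering criteria and provenance might fit together can be sketched as follows. This does not reproduce OligoRAP or any BIOMOBY service: the `Hit` record, the two thresholds, the labels and the provenance fields are all hypothetical, chosen only to show the criteria being made explicit and recorded next to the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment hit of an oligo against a target (hypothetical record)."""
    oligo_id: str
    target: str
    identity: float       # fraction of matching bases, 0..1
    aligned_length: int   # number of aligned bases


def classify_oligos(hits, oligo_length, min_identity=0.95, min_coverage=0.9):
    """Label each oligo and return the criteria used as provenance."""
    good_targets = {}
    for h in hits:
        targets = good_targets.setdefault(h.oligo_id, set())
        coverage = h.aligned_length / oligo_length
        if h.identity >= min_identity and coverage >= min_coverage:
            targets.add(h.target)
    labels = {}
    for oligo_id, targets in good_targets.items():
        if len(targets) == 1:
            labels[oligo_id] = "unique"
        elif targets:
            labels[oligo_id] = "ambiguous"
        else:
            labels[oligo_id] = "no_hit"
    # Storing the criteria with the labels lets the run be repeated exactly.
    provenance = {
        "min_identity": min_identity,
        "min_coverage": min_coverage,
        "oligo_length": oligo_length,
        "n_hits": len(hits),
    }
    return labels, provenance


if __name__ == "__main__":
    hits = [
        Hit("o1", "geneA", 1.00, 60),
        Hit("o2", "geneA", 0.98, 60),
        Hit("o2", "geneB", 0.97, 58),
        Hit("o3", "geneC", 0.80, 60),
    ]
    print(classify_oligos(hits, oligo_length=60))
```

Tightening `min_identity` changes which oligos count as uniquely annotated, and the returned provenance shows which setting produced a given annotation.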

Page 33

Virtual Lab for e-Science research Philosophy

• Multidisciplinary research and development of related ICT infrastructure
• Generic application support
  - Application cases are drivers for computer & computational science and engineering research
  - Problem solving partly generic and partly specific
  - Re-use of components via generic solutions whenever possible
• Rationalization of experimental process
  - Reproducible & comparable
• Two research experimentation environments
  - Proof of concept for application experimentation
  - Rapid prototyping for computer & computational science experimentation

Page 34

Medical Diagnosis and Imaging Problem Solving Environment

Partners:
• Universiteit van Amsterdam (UvA)
• Academisch Medisch Centrum (AMC)
• Vrije Universiteit Medisch Centrum (VUMC)
• Philips Research
• Philips Medical Systems
• TU Delft
• IBM

Applications:

1. Eddy current reduction

2. Matched Masked Bone Elimination

3. Functional brain imaging, DWI and fiber tracking

4. MR virtual colonoscopy

5. Parallel MEG data analyses

6. Grid-based data storage, retrieval and sharing

7. Interactive 3D medical visualization

Objective:

To study the design and implementation of a PSE for medical diagnosis and imaging to support and enhance the clinical diagnostic and therapeutic decision process.

[Figure: images illustrating applications 1, 3, 4, 5 and 7]

Page 35

Brain Imaging and Fiber Tractography

• Diffusion Weighted Imaging (DWI)
  - Restricted Brownian motion results in anisotropy that can be measured
  - >= 6 measurements, reduced to a tensor per voxel
  - The eigenvector of the largest eigenvalue gives the diffusion direction
• Whole volume fiber tracking can take many hours
  - Depends on size of volume and number of measurements per voxel
  - Suitable for parallelization
• Visualization techniques
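The per-voxel step can be made concrete with a small sketch that takes an already fitted 3x3 diffusion tensor and extracts the principal diffusion direction and the fractional anisotropy. The tensor fit from the >= 6 measurements is not shown, the tensor is assumed symmetric and positive semi-definite, and the function names are illustrative rather than taken from the VL-e software.

```python
import math


def principal_direction(D, iterations=100):
    """Unit eigenvector of the largest eigenvalue of a symmetric 3x3 tensor.

    Uses plain power iteration; assumes D is positive semi-definite, so the
    dominant eigenvalue is also the largest one.
    """
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(D[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def fractional_anisotropy(D):
    """FA from tensor invariants, without computing the eigenvalues.

    For a symmetric tensor the trace is the sum of the eigenvalues and the
    sum of squared entries is the sum of the squared eigenvalues.
    """
    trace = D[0][0] + D[1][1] + D[2][2]
    sum_sq = sum(D[i][j] ** 2 for i in range(3) for j in range(3))
    if sum_sq == 0.0:
        return 0.0
    mean = trace / 3.0
    spread = max(sum_sq - 3.0 * mean * mean, 0.0)  # sum of (lambda - mean)^2
    return math.sqrt(1.5 * spread / sum_sq)


if __name__ == "__main__":
    # Illustrative tensor (mm^2/s) with strongest diffusion along x.
    D = [[1.7e-3, 0.0, 0.0],
         [0.0, 0.3e-3, 0.0],
         [0.0, 0.0, 0.3e-3]]
    print(principal_direction(D), fractional_anisotropy(D))
```

Every voxel is processed independently of its neighbours in this step, which is one reason the slide can describe the computation as suitable for parallelization.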

Page 36

Medical Diagnosis and Imaging Problem Solving Environment

VL-e generic services:
• Provides:
  - Scientific visualization techniques
  - Image processing algorithms
• Uses:
  - Experiment editor
  - Parallel processing techniques

Application specific services:
• Access to PACS, DICOM
• Interfaces to medical scanners (MRI)
• In-house developed algorithms:
  - Eddy Current Reduction
  - Matched Masked Bone Elimination
• Patient privacy

Grid services:
• Storage facilities (SRB)
• High Performance Computing platforms
• High Performance Visualization platforms

[Figure: VL-e layer stack. Labels: medical applications (among others), VL-e Environment, Virtual Laboratory, Grid Middleware, Surfnet.]

Page 37

Eddy current reduction

• Shear, magnification and translation as a result of residual currents in DWI
  - 2D matching to correct
  - Computationally expensive
• Parallelization through domain decomposition
  - Computing cycles via Grid
  - Integrated PACS solution

Figure: effects of residual eddy currents on Philips 3T Intera with DWI. Figure by Erik-Jan Vlieger, AMC.
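Domain decomposition here means that the volume is split into parts that can be corrected independently. The sketch below decomposes by slice and corrects only a horizontal translation by brute-force matching against a reference slice; shear and magnification are left out, a local thread pool stands in for Grid job submission, and every function name is a hypothetical choice for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def shift_rows(image, dx):
    """Shift every row of a 2D image by dx pixels, padding with zeros."""
    width = len(image[0])
    out = []
    for row in image:
        new = [0.0] * width
        for x, value in enumerate(row):
            if 0 <= x + dx < width:
                new[x + dx] = value
        out.append(new)
    return out


def ssd(a, b):
    """Sum of squared differences between two images of equal size."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def correct_slice(task):
    """Try every shift in range and keep the one that best matches the reference."""
    reference, distorted, max_shift = task
    best = min(range(-max_shift, max_shift + 1),
               key=lambda dx: ssd(reference, shift_rows(distorted, dx)))
    return shift_rows(distorted, best)


def correct_volume(reference_slices, distorted_slices, max_shift=3, workers=4):
    """Correct all slices independently; pool.map preserves slice order."""
    tasks = [(r, d, max_shift)
             for r, d in zip(reference_slices, distorted_slices)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_slice, tasks))


if __name__ == "__main__":
    ref = [[0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]] * 2
    distorted = [shift_rows(ref, k) for k in (0, 1, 2)]
    print(correct_volume([ref] * 3, distorted) == [ref] * 3)
```

Threads in CPython do not speed up this pure-Python arithmetic; the point of the sketch is only that each slice is a self-contained task, so the same decomposition could be handed to separate processes or Grid nodes.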

Page 38

Medical Diagnosis and Imaging Problem Solving Environment

[Figure: VL experiment topology. Stages: data retrieval, acquisition; filtering, analyses, simulation; image processing, data storage; 2D/3D visualization.]

Page 39

The situation in the Netherlands

• The Netherlands Bio-Informatics Center (NBIC) was set up as part of the Dutch genomics initiative, the Netherlands Genomics Initiative (NGI)
• Its aim was to organize bio-informatics in the Netherlands, to generate sufficient critical mass, and to support the other genomics initiatives as a technology center
• Organizational structure:
  - Board of directors
    - Dr van Kampen, scientific director
    - Drs R. Kok, executive director
    - Prof. Dr. Hertzberger, adjunct scientific director
  - Board of overseers
  - International Advisory board
  - Scientific Committee
  - Program Steering Group

Page 40

Current NBIC activities

• Currently NBIC runs three programs; it also took the initiative for and participates in another three joint activities, besides collaborations such as with SURF (networking) and VL-e (e-Science):
• NBIC programs:
  - BioRange: a bio-informatics research program of 25 M$ & 25 M$ matching
  - BioAssist: a 10 M$ support program
  - BioWise: a 3 M$ education program
• Participation in:
  - Computational life sciences: a 5 M$ program with, among others, physics, chemistry and computational science
  - Pilot grid roll out: a 3 M$ Grid rollout & support with the Dutch Foundation for computing (NCF) and others
  - BIG GRID: a 35 M$ GRID and e-Science program in the Netherlands together with NCF, physics, VL-e and others

Page 41

Program activities

• BioRange has four program lines:
  - Micro-array related bio-informatics
  - Proteomics related bio-informatics
  - Integrated bio-informatics
  - Informatics research for bio-informatics
• All program lines comprise a number of collaborative projects with participation of groups all over the Netherlands
• BioAssist runs two program lines:
  - Establishment of e-bioscience support environment
  - Establishment of generic e-science infrastructure
• In future, also an extension towards biomedical applications, as was illustrated

Page 42

The VL-e infrastructure

[Figure: the VL-e infrastructure. Layers: application specific service (telescience, medical application, bio-informatics applications); generic service & Virtual Lab. services (Virtual Laboratory); Grid & network services (Grid Middleware, Surfnet). Environments: VL-e Proof of Concept Environment; VL-e Experimental Environment (Virtual Lab. rapid prototyping (interactive simulation), additional Grid services (OGSA services), network service (lambda networking)); VL-e Certification Environment (test & certification of compatibility, Grid middleware and VL-software). Additional label: application potential.]

Page 43

[Figure: e-Science roll out. Labels: VL-E Experimental Environment (rapid prototyping (interactive simulation), additional Grid services (OGSA services), network service (lambda networking)); VL-E Proof of concept Environment (telescience, medical application, bio applications); Virtual Laboratory, Grid Middleware, Surfnet; stable application & VL-e component; unstable application & VL-e component; e-Science roll out; application feedback; Big Grid; BioAssist; xxxx xxxx; total 25M$ support + 25M$ matching; total 35 M$ support.]

Page 44

Conclusions

• Omics experiments change the face of life sciences
• Bioinformatics can be considered an essential enabler and is a form of e-Science
• It will help to realize the necessary paradigm shift in life science experimentation
• Better support of experimentation & optimal use of ICT infrastructure require rationalization of the experimentation process
• Information management is an essential technology
• Bioinformatics cannot be decoupled from e-Bioscience applications
• e-Bioscience also has to comprise biomedical applications