From Bio-Informatics towards e-BioScience
L.O. (Bob) Hertzberger
Computer Architecture and Parallel Systems Group
Department of Computer Science
Universiteit van Amsterdam
Background information: experimental sciences
• There is a tendency to look ever deeper into:
  - Matter, e.g. physics
  - The universe, e.g. astronomy
  - Life, e.g. life sciences
• Instrumental consequences are increases in detector:
  - Resolution & sensitivity
  - Automation & robotization
• Therefore experiments change in nature & become increasingly complex
Impact in the life sciences
• Impact of high-throughput methods, e.g. omics experimentation: genome ===> genomics
New technologies in Life Sciences research
[Figure: omics impact on methodology/technology. Within the cell, DNA, RNA, protein and metabolites are studied by genomics, transcriptomics, proteomics and metabolomics respectively.]
Impact in the life sciences
• Impact of high-throughput methods, e.g. omics experimentation: genome ===> genomics
• Instrumentation being used in omics experimentation:
  - Transcriptomics via, among others, micro-arrays
  - Proteomics via, among others, Mass Spectroscopy (MS)
  - Metabolomics via, among others, MS & Nuclear Magnetic Resonance (NMR)
Results in Paradigm shift in Life sciences
• Past experiments were hypothesis driven:
  - Evaluate hypothesis
  - Complement existing knowledge
• Present experiments are data driven:
  - Discover knowledge from large amounts of data
Life sciences research: from gene to function
[Figure: in the cell nucleus, a gene (DNA) is expressed by RNA synthesis into mRNA, and the mRNA is translated by protein synthesis into a protein (NH2 to COOH) with function-1, function-2, ..., function-n. Labels: whole-genome sequence projects; genome-wide micro-array analysis; “high-throughput” protein analysis.]
Protein function:
  - Prediction by bioinformatics
  - Proof by laboratory research
Developments towards Bio-informatics & e-Science
• Experiments become increasingly complex
• Driven by an increase in detector developments
• Results in an increase in the amount and complexity of data
• Something has to be done to harness this development:
  - Bio-informatics to translate data into useful biological, medical, pharmaceutical & agricultural knowledge
The what of Bioinformatics
Bioinformatics is redefining rules and scientific approaches, resulting in the ‘new biology’. Within this new paradigm the traditional scientific boundaries are blurred, leaving no clear line between ‘dry or computational’ and ‘wet-based’ approaches
Role of bioinformatics
[Figure: the cell (DNA, RNA, protein, metabolites) is studied through genomics, transcriptomics, proteomics and metabolomics, leading to integrative/systems biology. Bioinformatics methodology covers data generation/validation, data integration/fusion, and data usage/user interfacing.]
Two sides of Bioinformatics
• The scientific responsibility to develop the underlying computational concepts and models to convert complex biological data into useful biological and chemical knowledge
• The technological responsibility to manage and integrate huge amounts of heterogeneous data sources from high-throughput experimentation:
  - Need for e-Science support
Developments towards Bio-informatics & e-Science
• Experiments become increasingly complex
• Driven by an increase in detector developments
• Results in an increase in the amount and complexity of data
• Something has to be done to harness this development:
  - Bio-informatics to translate data into useful biological, medical, pharmaceutical & agricultural knowledge
  - Virtualization of experimental resources, enabling sharing & leading to e-BioScience
[Figure: e-Bioscience & research infrastructure. Labels: life science/genomics research consortia and industry (life science application areas); bioinformatics, e-Bioscience and the life science innovation domain; e-Science & research infrastructure (generic e-Science ICT development and support); Grid infrastructure (network infrastructure and computing capacity).]
Why e-BioScience
• There is an increasing necessity to use results from other scientists, e.g. to share data & information:
Re-use and sharing of biological data (2)
Information content of omics data is extremely high, however:
• Data are subject to noise and to biological and technical variation
• How to induce biological principles from these genome-wide data sets?
Approach: develop methodology for “reverse engineering” of biological mechanisms.
• Biggest challenge in bioinformatics today.
Need for external data sources for in-silico experimentation
• Two practices for re-use and sharing of data:
  - Collectively compile huge amounts of relevant data and make these available to the community. Examples: bio-banking, compendia (e.g. NIH’s Affymetrix SNP repository).
  - Re-use information from different and diverse experiments to discover phenomena
Re-use and sharing of biological data (2)
Compendium example: re-use and sharing of Huntington data
• Datasets: 404 Affymetrix gene chips of measurements on extremely rare human brain samples (Hodges et al., Hum. Mol. Genetics, 2006)
• Available from NCBI GEO database (MIAME)
• Goal: find genes involved in Huntington’s Disease
• Approach:
  - Reanalyze gene expression data
  - Combine genotype data and clinical data (e.g. using SigWin)
  - Extend experiments with own ChIP-on-chip data
Resource Identification software
Repository of relevant meta-information from:
• Data warehouses, e.g. GEO, ArrayExpress, Protein Interaction database
• Literature (mining of PubMed using Collexis)
• Information resources specialized on diseases, genes, proteins, e.g. OMIM, GenBank, Ensembl
Why e-BioScience
• There is an increasing necessity to use results from other scientists, e.g. to share:
  - Data & information: data repositories, cohort studies in bio-banking, biodiversity
  - Expensive and complex equipment: Mass Spectroscopy, MRI, other
Problems for the realization of e-BioScience
• The life science field is still in an early stage of development:
  - First principles are not understood at all
• As a consequence, experimental methods are not well established and will not be for some time to come
• Because of the new forms of omics instrumentation there is a need for design-of-experiment methods:
  - Correct logging of the conditions under which experiments are done is lacking
  - Large amounts of data are produced that require, among others, statistical techniques for interpretation
• As a consequence, results are open to multiple interpretations
Problems for the realization of e-BioScience
• Problems for bioinformatics & e-Bioscience:
  - Rationalisation at this early stage is almost impossible
  - Pre-standardization & standardization are almost non-existent
  - Where there are standards they are inadequate because they are open to multiple interpretations (like MIAME for micro-arrays)
• In addition there are commercial end-user products that are difficult to integrate
• Users lack the training necessary to handle these complex experimental situations
• The only possible solution is to create a flexible experimentation environment for the end-users
Role of ICT in e-BioScience
• e-Science is a new form of science methodology complementing theoretical and experimental sciences.
• It uses generic methods and an ICT infrastructure to support this methodology:
  - Web services as a paradigm/way of using/accessing information
  - Grid as a method of accessing & sharing computing resources by virtualization
• What is missing in e-BioScience:
  - Connection between biological problem & e-Bioscience
  - User-oriented tools that can be re-used and extended
  - General model of ICT-based integration
  - Semantic support: ontologies and semantic support for workflows to make user knowledge explicit
Consequences for bio-informatics & e-BioScience
• A considerable amount of experimentation is necessary before a well-established methodology will emerge
• The VL-e approach might be a good model & produces an environment in which the necessary experimentation can be realized
Enhancing the scientific process: e-BioLab
• Problem domain experts can focus on the biology because they are shielded from technical details by e-scientists.
• Viewpoints on the research question and the data can be expressed and visualized semi-instantaneously.
• Ideas and analyses can be retained and documented.
• Facilities for remote collaboration are present*.
* Rauwerda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
Motivation:
• Interacting with the problem domain requires an environment in which the domain can be opened up and ideas, hunches and notions on the data and crude models of the biology can be visualized
• A tangible space in which biologists, aided by e-scientists, will have the full potential of VL-e at their disposal.
An actual laboratory in which:
• Problem domain experts (biologists, medical doctors) and scientists from enabling disciplines jointly and in a creative manner work on the analysis and design of -omics experiments.
Basic concept of e-BioLab:
[Figure: basic concept of e-BioLab. Labels: biologists, e-BioScientist, e-BioOperator; readily accessible data + models (data mining); small integration experiments + integration methods; easy visualization; vague results; basic model of problem area.]
Enhancing the scientific process: e-BioLab (2)
Realization:
• Large high-resolution display (26.2 Mpixel) with high-bandwidth (10 Gbit/s) connection to render cluster
• Full access to computational facilities and GRID middleware of VL-e
• e-whiteboards and tablet PCs to share and store ideas
• High-definition video cameras for remote collaboration
• Highly adaptable lab configuration.
Research into:
• Problem Solving Environments for the biology under study
• Formulation of scientific workflows that allow for sufficient interactivity and guarantee reproducibility
• Maintaining an electronic lab journal for e-science experimentation
• Methods for:
  - Information management of omics data
  - Biological domain interaction / resource identification
  - Modeling of biological information and knowledge
• Remote scientific co-operation
• Man-machine interaction
High resolution displays in e-bioscience
[Figure: display wall layout with panels for clustering (SOM), gene lists, interesting pathways, GO categories, literature mining, GSEA, video remote collaboration and remote whiteboard.]
Example: concurrently display in a discussion with a remote partner
• Clustering results of microarray experiments
• Interesting pathways that are predominant in certain clusters
• Gene Ontology categories
• Results from literature mining
• Gene Set Enrichment of categories identified in literature mining
• Notions depicted on the e-whiteboards
Virtual Lab for e-Science research Philosophy
• Multidisciplinary research and development of related ICT infrastructure
• Generic application support:
  - Application cases are drivers for computer & computational science and engineering research
  - Problem solving partly generic and partly specific
  - Re-use of components via generic solutions whenever possible
[Figure: VL-e service layers. Labels: Grid services (harness multi-domain distributed resources); generic e-Science services; domain-generic e-Science and e-BioScience services; domain-specific tools, such as a micro-array (transcriptomics) pipeline, a mass spectroscopy (proteomics) pipeline, pathway visualization and protein annotation. Technology push comes from the Grid side and application pull from the domain side.]
Bioinformatics methods in VL-e (1)
Example 1: An application-specific method modified by e-science into a generic one: SigWin*
• Starting point: application-specific method for detecting windows of increased gene expression on chromosomes** (implemented in C and Perl for SAGE technology)
• Motivation: broad interest from molecular biology in positional behaviour of any measurement data that can be mapped onto DNA sequences
• SigWin e-Science version:
  - GRID-based modular workflow for detecting windows of significance in any sequence of values
  - Widely applicable, from gene expression to meteorology data
  - Modules reusable for alternative workflows, e.g. protein modification
  - Scalable to very large datasets
* Inda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
** Versteeg et al., Genome Research, 2003
Bioinformatics methods: SigWin
Significant window detector: generalisation of RIDGE method
[Figure: example applications of SigWin: human gene expression; temperature in Amsterdam; DNA curvature of the Escherichia coli chromosome.]
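The idea of detecting windows of significance in any sequence of values can be illustrated with a small sketch. This is not the SigWin or RIDGE implementation: it assumes, purely for illustration, that a window counts as significant when its median exceeds a quantile of window medians obtained from random permutations of the same values. The function names, parameters and data are hypothetical.

```python
import random
from statistics import median


def sliding_medians(values, window):
    """Median of every contiguous window of the given size."""
    return [median(values[i:i + window]) for i in range(len(values) - window + 1)]


def significant_windows(values, window=5, n_perm=200, alpha=0.05, seed=0):
    """Return start indices of windows whose median is unusually high.

    Assumption (not from the slides): a window is significant when its
    median exceeds the (1 - alpha) quantile of the window medians found
    in random permutations of the same values.
    """
    rng = random.Random(seed)
    null = []
    shuffled = list(values)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.extend(sliding_medians(shuffled, window))
    null.sort()
    threshold = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
    observed = sliding_medians(values, window)
    return [i for i, m in enumerate(observed) if m > threshold]


# Hypothetical data: low background with one cluster of high values.
signal = [1, 2, 1, 3, 2, 1, 2, 9, 10, 11, 9, 10, 2, 1, 3, 2, 1, 2, 1, 3]
hits = significant_windows(signal)
```

Because the sketch only needs an ordered list of numbers, the same function could be applied to expression values along a chromosome or to a temperature series, which is the kind of generality the slides describe.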
Bioinformatics methods in VL-e (2)
Example 2: An application-specific method composed of generic and specific modules in a workflow: OligoRAP*
• Purpose: a re-annotation workflow for oligo libraries
• Motivation: rapidly evolving knowledge in genome analysis requires frequent re-assessment of the molecules which are used to measure gene expression.
• OligoRAP:
  - Uses a set of application-generic (BIOMOBY) BLAT and BLAST sequence alignment (web) services.
  - Uses an application-specific (BIOMOBY) annotation analysis service
  - BIOMOBY: de-facto standard for bio-informatics web services.
  - Joint work of sequence analysis lab and micro-array lab
Workflow:
• Adjustable filtering criteria make quality level of oligos explicit
• Workflow provenance makes re-annotation reproducible.
* P. Neerincx, H. Rauwerda, F. Verster, A. Kommadath, T.M. Breit, J.A.M. Leunissen, Poster ISMB 2006
Virtual Lab for e-Science research Philosophy
• Multidisciplinary research and development of related ICT infrastructure
• Generic application support:
  - Application cases are drivers for computer & computational science and engineering research
  - Problem solving partly generic and partly specific
  - Re-use of components via generic solutions whenever possible
• Rationalization of experimental process:
  - Reproducible & comparable
• Two research experimentation environments:
  - Proof of concept for application experimentation
  - Rapid prototyping for computer & computational science experimentation
Medical Diagnosis and Imaging Problem Solving Environment
Partners:
• Universiteit van Amsterdam (UvA)
• Academisch Medisch Centrum (AMC)
• Vrije Universiteit Medisch Centrum (VUMC)
• Philips Research
• Philips Medical Systems
• TU Delft
• IBM
Applications:
1. Eddy current reduction
2. Matched Masked Bone Elimination
3. Functional brain imaging, DWI and fiber tracking
4. MR virtual colonoscopy
5. Parallel MEG data analyses
6. Grid-based data storage, retrieval and sharing
7. Interactive 3D medical visualization
Objective:
To study the design and implementation of a PSE for medical diagnosis and imaging to support and enhance the clinical diagnostic and therapeutic decision process.
[Figure: images illustrating applications 1, 3, 4, 5 and 7.]
Brain Imaging and Fiber Tractography
• Diffusion Weighted Imaging (DWI):
  - Restricted Brownian motion results in anisotropy that can be measured
  - >= 6 measurements, reduced to a tensor per voxel
  - The eigenvector with the largest eigenvalue gives the principal diffusion direction
• Whole-volume fiber tracking can take many hours:
  - Depends on size of volume and number of measurements per voxel
  - Suitable for parallelization
• Visualization techniques
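The step from a per-voxel tensor to a diffusion direction can be sketched as follows. This is a minimal illustration, not the VL-e fiber tracking code: it assumes a symmetric 3x3 tensor with a unique largest eigenvalue and uses plain power iteration. The function name and the tensor values are hypothetical.

```python
import math


def principal_direction(tensor, iterations=100):
    """Approximate the eigenvector with the largest eigenvalue of a
    symmetric 3x3 diffusion tensor by power iteration."""
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        # Multiply the tensor by the current vector, then renormalize.
        w = [sum(tensor[r][c] * v[c] for c in range(3)) for r in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("tensor maps the start vector to zero")
        v = [x / norm for x in w]
    return v


# Hypothetical voxel: diffusion is strongest along the x axis.
tensor = [
    [1.7e-3, 0.0, 0.0],
    [0.0, 0.3e-3, 0.0],
    [0.0, 0.0, 0.3e-3],
]
direction = principal_direction(tensor)
```

Each voxel is handled independently of its neighbours in this step, which is one reason the slide can call the computation suitable for parallelization.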
Medical Diagnosis and Imaging Problem Solving Environment
VL-e generic services:
• Provides:
  - Scientific visualization techniques
  - Image processing algorithms
• Uses:
  - Experiment editor
  - Parallel processing techniques
Application specific services:
• Access to PACS, DICOM
• Interfaces to medical scanners (MRI)
• In-house developed algorithms:
  - Eddy Current Reduction
  - Matched Masked Bone Elimination
• Patient privacy
Grid services:
• Storage facilities (SRB)
• High Performance Computing platforms
• High Performance Visualization platforms
[Figure: VL-e environment stack. Labels: medical applications; Virtual Laboratory; Grid middleware; Surfnet.]
Eddy current reduction
• Shear, magnification and translation as a result of residual currents in DWI:
  - 2D matching to correct
  - Computationally expensive
• Parallelization through domain decomposition:
  - Computing cycles via Grid
  - Integrated PACS solution
[Figure: effects of residual eddy currents on Philips 3T Intera with DWI. Figure by Erik-Jan Vlieger, AMC.]
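Parallelization through domain decomposition can be sketched as splitting the image volume into chunks of slices that are corrected independently and then reassembled in order. The sketch below is only an illustration under stated assumptions: `correct_slice` is a hypothetical placeholder (it subtracts the slice mean) and stands in for the real 2D matching, and local threads stand in for Grid worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def correct_slice(image_slice):
    """Hypothetical placeholder for a per-slice 2D correction: it only
    subtracts the slice mean so that the example stays runnable."""
    flat = [p for row in image_slice for p in row]
    mean = sum(flat) / len(flat)
    return [[p - mean for p in row] for row in image_slice]


def split_domain(slices, parts):
    """Split the list of slices into at most `parts` contiguous chunks."""
    if not slices:
        return []
    size = -(-len(slices) // parts)  # ceiling division
    return [slices[i:i + size] for i in range(0, len(slices), size)]


def correct_chunk(chunk):
    """Correct every slice of one chunk; chunks share no state."""
    return [correct_slice(s) for s in chunk]


def correct_volume(slices, workers=4):
    """Correct all chunks independently and reassemble them in order."""
    chunks = split_domain(slices, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(correct_chunk, chunks))
    return [s for chunk in results for s in chunk]


# Hypothetical volume: 10 slices of 4 x 4 pixels.
volume = [[[float(z + x + y) for x in range(4)] for y in range(4)] for z in range(10)]
corrected = correct_volume(volume)
```

The decomposition works here because each slice is corrected without reference to any other slice, so the chunk boundaries do not affect the result.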
Medical Diagnosis and Imaging Problem Solving Environment
[Figure: VL experiment topology. Labels: data retrieval, acquisition; filtering, analyses, simulation; image processing, data storage; 2D/3D visualization.]
The situation in the Netherlands
• The Netherlands Bio-Informatics Center (NBIC) was set up as part of the Dutch genomics initiative, the Netherlands Genomics Initiative (NGI)
• Its aim was to organize bio-informatics in the Netherlands and to generate sufficient critical mass, also to support the other genomics initiatives as a technology center
• Organizational structure:
  - Board of directors: Dr van Kampen (scientific director), Drs R. Kok (executive director), Prof. Dr. Hertzberger (adjunct scientific director)
  - Board of overseeing
  - International Advisory board
  - Scientific Committee
  - Program Steering Group
Current NBIC activities
• Currently NBIC runs three programs and took the initiative for, and participates in, another three joint activities, besides collaborations such as with SURF (networking) and VL-e (e-Science)
• NBIC programs:
  - BioRange: a bio-informatics research program of 25 M$ & 25 M$ matching
  - BioAssist: a 10 M$ support program
  - BioWise: a 3 M$ education program
• Participation in:
  - Computational life sciences: a 5 M$ program with, among others, physics, chemistry and computational science
  - Pilot grid roll-out: a 3 M$ Grid roll-out & support with the Dutch Foundation for computing (NCF) and others
  - BIG GRID: a 35 M$ GRID and e-Science program in the Netherlands together with NCF, physics, VL-e and others
Program activities
• BioRange has four program lines:
  - Micro-array related bio-informatics
  - Proteomics related bio-informatics
  - Integrated bio-informatics
  - Informatics research for bio-informatics
• All program lines comprise a number of collaborative projects with participation of groups all over the Netherlands
• BioAssist runs two program lines:
  - Establishment of e-bioscience support environment
  - Establishment of generic e-science infrastructure
• In future also an extension towards biomedical applications, as was illustrated
The VL-e infrastructure
[Figure: VL-e infrastructure. Layers, from top to bottom: application-specific services (telescience, medical and bio-informatics applications); generic services & Virtual Laboratory services; Grid & network services (Grid middleware, Surfnet). Three environments are shown: the VL-e Proof of Concept Environment; the VL-e Experimental Environment, with Virtual Lab. rapid prototyping (interactive simulation), additional Grid services (OGSA services) and network service (lambda networking); and the VL-e Certification Environment, with test & certification of compatibility, Grid middleware and VL-software. Further labels: application potential; e-Science roll-out; application feedback; stable application & VL-e component; unstable application & VL-e component.]
[Figure: funding overview. Labels: Big Grid; BioAssist; total 25 M$ support + 25 M$ matching; total 35 M$ support.]
Conclusions
• Omics experiments change the face of life sciences
• Bioinformatics can be considered an essential enabler and is a form of e-Science
• It will help to realize the necessary paradigm shift in life science experimentation
• Better support of experimentation & optimal use of ICT infrastructure requires rationalization of the experimentation process
• Information management is an essential technology
• Bioinformatics cannot be decoupled from e-Bioscience applications
• e-Bioscience also has to comprise biomedical applications