
Page 1: Master’s course Bioinformatics Data Analysis and Tools

Master’s course

Bioinformatics Data Analysis and Tools

Lecture 1: Introduction

Centre for Integrative Bioinformatics, FEW/FALW

[email protected]

[Logo: Centre for Integrative Bioinformatics VU]

Page 2: Master’s course Bioinformatics Data Analysis and Tools

Course objectives

• There are two extremes in bioinformatics work

– Tool users (biologists): know how to press the buttons and know the biology but have no clue what happens inside the program

– Tool shapers (informaticians): know the algorithms and how the tool works but have no clue about the biology

Both extremes can be dangerous at times; we need a breed that can do both

Page 3: Master’s course Bioinformatics Data Analysis and Tools

Course objectives

• How do you become a good bioinformatics problem solver?

– You need to know basic analysis and data mining modes

– You need to know some important backgrounds of analysis and prediction techniques (e.g. statistical thermodynamics)

– You need to have knowledge of what has been done and what can be done (and what not)

• Is this enough to become a creative tool developer?

– Need to like doing it

– Experience helps

Page 4: Master’s course Bioinformatics Data Analysis and Tools

Course objectives

The most important thing in creating tools (and in science in general):

• Be able to ask the proper question!

– Should address a real problem

– Should be targeted

– Should be solvable

• Bioinformatics challenge: from Genomics to Systems Biology

– Bottom up: start at the components, assemble and learn the system

– Top down: observe the system behaviour, model and learn the details

– What about bottom down or top up questions?

Page 5: Master’s course Bioinformatics Data Analysis and Tools

Contents (tentative dates)

No.  Week     Date      Lecture title                                  Lecturer
1    [wk 19]  01/04/08  Introduction                                   Jaap Heringa
2    [wk 19]  03/04/08  Microarray data analysis                       Jaap Heringa
3    [wk 20]  08/04/08  Machine learning                               Jaap Heringa
4    [wk 21]  10/04/08  Clustering algorithms                          Bart van Houte
5    [wk 21]  15/04/08  Feature Selection                              Bart van Houte
6    [wk 23]  17/04/08  Molecular Simulation & Sampling Techniques     Anton Feenstra
7    [wk 23]  22/04/08  Introduction to Statistical Thermodynamics I   Anton Feenstra
8    [wk 24]  24/04/08  Introduction to Statistical Thermodynamics II  Anton Feenstra
9    [wk 24]  06/05/08  Databases and parsing                          Sandra Smit
10   [wk 24]  08/05/08  Semantic Web and Ontologies                    Frank van Harmelen
11   [wk 25]  13/05/08  Parallelisation & Grid Computing               Thilo Kielmann
12   [wk 25]  15/05/08  Application area I: Protein Domain Prediction  Jaap Heringa
13   [wk 25]  20/05/08  Application area II: Repeats Detection         Jaap Heringa

Page 6: Master’s course Bioinformatics Data Analysis and Tools

At the end of this course…

• You will have seen a couple of algorithmic examples

• You will have got an idea about methods used in the field

• You will have a firm basis of the physics and thermodynamics behind a lot of processes and methods

• You will have learned about state-of-the-art computational issues, such as Semantic Web and HTP Computing

• You will have an idea of and some experience as to what it takes to shape a bioinformatics tool

Page 7: Master’s course Bioinformatics Data Analysis and Tools

Bioinformatics

“Studying informatic processes in biological systems” (Hogeweg)

Applying algorithms and mathematical formalisms to biology (genomics)

“Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

Page 8: Master’s course Bioinformatics Data Analysis and Tools

This course

• General theory of crucial algorithms (GA, NN, HMM, SVM, etc.)

• Method examples

• Research projects within own group

– Repeats

– Domain boundary prediction

• Physical basis of biological processes and of (stochastic) tools

Page 9: Master’s course Bioinformatics Data Analysis and Tools

Bioinformatics

[Figure: scientific disciplines ordered by scale, from large/external (integrative) to small/internal (individual): Planetary Science, Cultural Anthropology, Population Biology, Sociology, Sociobiology, Psychology, Systems Biology, Biology, Medicine, Molecular Biology, Chemistry, Physics. Further labels in the figure: Science, Human.]

Page 10: Master’s course Bioinformatics Data Analysis and Tools

Genomic Data Sources

• DNA/protein sequence

• Expression (microarray)

• Proteome (X-ray, NMR, mass spectrometry)

• PPI

• Metabolome

• Physiome (spatial, temporal)

Integrative bioinformatics

Page 11: Master’s course Bioinformatics Data Analysis and Tools

Protein structural data explosion

Protein Data Bank (PDB): 14500 structures (6 March 2001)

10900 X-ray crystallography, 1810 NMR, 278 theoretical models, others...

Page 12: Master’s course Bioinformatics Data Analysis and Tools

Bioinformatics inspiration and cross-fertilisation

[Figure: Bioinformatics at the centre, surrounded by Mathematics/Statistics, Computer Science/Informatics, Biology/Molecular biology, Medicine, Chemistry and Physics]

Page 13: Master’s course Bioinformatics Data Analysis and Tools

Joint international programming initiatives

• Bioperl

http://www.bioperl.org/wiki/Main_Page

http://bioperl.org/wiki/How_Perl_saved_human_genome

• Biopython

http://www.biopython.org/

• BioTcl

http://wiki.tcl.tk/12367

• BioJava

www.biojava.org/wiki/Main_Page

Page 14: Master’s course Bioinformatics Data Analysis and Tools

Algorithms in bioinformatics

• string algorithms

• dynamic programming

• machine learning (NN, k-NN, SVM, GA, ..)

• Markov chain models

• hidden Markov models

• Markov Chain Monte Carlo (MCMC) algorithms

• stochastic context free grammars

• EM algorithms

• Gibbs sampling

• clustering

• tree algorithms (suffix trees)

• graph algorithms

• text analysis

• hybrid/combinatorial techniques

and more…

Some techniques can be reapplied to many different problems, e.g. clustering, NN, etc.

Page 15: Master’s course Bioinformatics Data Analysis and Tools

Algorithms in bioinformatics

• Approaches in methods can be reapplied, e.g. knowledge-based (inverse) prediction

• Different approaches can be combined, e.g. consensus prediction, metaservers

Page 16: Master’s course Bioinformatics Data Analysis and Tools

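PSI-BLAST iteration

[Figure: PSI-BLAST cycle. Run the query sequence (Q) against the database (DB); the hits are used to build a PSSM and the other sequences are discarded; run the PSSM against the database and repeat. The label T also appears in the diagram.]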

Page 17: Master’s course Bioinformatics Data Analysis and Tools

Fold recognition by threading: THREADER and GenTHREADER

[Figure: the query sequence is compared with each library fold (Fold 1, Fold 2, Fold 3, … Fold N), giving one compatibility score per fold]

Page 18: Master’s course Bioinformatics Data Analysis and Tools

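Pollutant recognition by microarray mapping:

[Figure: the query array is compared with reference conditions (Cond. 1, Cond. 2, Cond. 3, … Cond. N), each linked to a contaminant (Contaminant 1, Contaminant 2, Contaminant 3, … Contaminant N), giving one compatibility score per condition]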

Page 19: Master’s course Bioinformatics Data Analysis and Tools

Protein-protein interaction prediction

• Two extreme approaches: phylogenetic prediction and Molecular Dynamics (MD) simulation

• Mesoscopic modelling

• Soft-core Molecular Dynamics (MD)

– Fuzzy residues

– Fuzzy (surface) locations

Page 20: Master’s course Bioinformatics Data Analysis and Tools

ENFIN WP5 - BioRange (Anton Feenstra)

• Protein-protein interaction prediction

• Mesoscopic modelling

• Soft-core Molecular Dynamics (MD)

– Fuzzy residues

– Fuzzy (surface) locations

Page 21: Master’s course Bioinformatics Data Analysis and Tools

Where are important new questions?

Page 22: Master’s course Bioinformatics Data Analysis and Tools

New neighbouring disciplines

• Translational Medicine

A branch of medical research that attempts to more directly connect basic research to patient care. Translational medicine is growing in importance in the healthcare industry, and is a term whose precise definition is in flux. In particular, in drug discovery and development, translational medicine typically refers to the "translation" of basic research into real therapies for real patients. The emphasis is on the linkage between the laboratory and the patient's bedside, without a real disconnect. This is often called the "bench to bedside" definition.

• Computational Systems Biology

Computational systems biology aims to develop and use efficient algorithms, data structures and communication tools to orchestrate the integration of large quantities of biological data with the goal of modeling dynamic characteristics of a biological system. Modeled quantities may include steady-state metabolic flux or the time-dependent response of signaling networks. Algorithmic methods used include related topics such as optimization, network analysis, graph theory, linear programming, grid computing, flux balance analysis, sensitivity analysis, dynamic modeling, and others.

• Neuro-informatics

Neuroinformatics combines neuroscience and informatics research to develop and apply the advanced tools and approaches that are essential for major advances in understanding the structure and function of the brain.

Page 23: Master’s course Bioinformatics Data Analysis and Tools

Translational Medicine

• “From bench to bedside”

• Genomics data to patient data

• Integration

Page 24: Master’s course Bioinformatics Data Analysis and Tools

Natural progression of a gene

Page 25: Master’s course Bioinformatics Data Analysis and Tools

Functional Genomics: from gene to function

Genome

Expressome

Proteome

Metabolome

TERTIARY STRUCTURE (fold)

Page 26: Master’s course Bioinformatics Data Analysis and Tools
Page 27: Master’s course Bioinformatics Data Analysis and Tools
Page 28: Master’s course Bioinformatics Data Analysis and Tools

Systems Biology

is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behaviour of that system (for example, the enzymes and metabolites in a metabolic pathway). The aim is to understand the system quantitatively and to be able to predict how it behaves over time.

• the interactions are nonlinear

• the interactions give rise to emergent properties, i.e. properties that cannot be explained by the individual components of the system alone

Page 29: Master’s course Bioinformatics Data Analysis and Tools

Systems Biology

Understanding is often achieved through modeling and simulation of the system’s components and interactions.

Many times, the ‘four Ms’ cycle is adopted:

Measuring

Mining

Modeling

Manipulating

Page 30: Master’s course Bioinformatics Data Analysis and Tools

Neuroinformatics

• Understanding the human nervous system is one of the greatest challenges of 21st century science.

• Its abilities dwarf any man-made system - perception, decision-making, cognition and reasoning.

• Neuroinformatics spans many scientific disciplines - from molecular biology to anthropology.

Page 31: Master’s course Bioinformatics Data Analysis and Tools

Neuroinformatics

• Main research question: How do the brain and nervous system work?

• Main research activity: gathering neuroscience data and knowledge, and developing computational models and analytical tools for the integration and analysis of experimental data, leading to improvements in existing theories about the nervous system and brain.

• Results for the clinic: Neuroinformatics provides tools, databases, network technologies and models for clinical and research purposes in the neuroscience community and related fields.

Page 32: Master’s course Bioinformatics Data Analysis and Tools
Page 33: Master’s course Bioinformatics Data Analysis and Tools

Bioinformatics algorithms

For problems such as alignment, secondary/tertiary structure prediction, phylogenetic tree determination, etc.

Algorithmic main components:

• Search function

• Scoring function

Page 34: Master’s course Bioinformatics Data Analysis and Tools

Bioinformatics algorithms

• Search function

– Search space can be large

– High time complexity of algorithms, or even NP-complete/NP-hard problems

• Scoring function

– Also called the Objective Function

– Often the most important

Page 35: Master’s course Bioinformatics Data Analysis and Tools

Pair-wise alignment: complexity of the problem

Combinatorial explosion

- 1 gap in 1 sequence: n+1 possibilities

- 2 gaps in 1 sequence: (n+1)n

- 3 gaps in 1 sequence: (n+1)n(n-1), etc.

Number of alignments of two sequences of length n:

\[ \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}} \]

2 sequences of 300 a.a.: ~$10^{179}$ alignments

2 sequences of 1000 a.a.: ~$10^{600}$ alignments!

T D W V T A L K
T D W L - - I K
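The growth of the binomial coefficient on this slide can be checked numerically. The sketch below is an illustration only (the function names are not from the lecture): it evaluates C(2n, n) exactly with Python's integer arithmetic and compares its order of magnitude with the Stirling-type estimate 2^(2n) / sqrt(pi n).

```python
import math


def alignment_count(n: int) -> int:
    """Exact value of C(2n, n) for two sequences of length n."""
    return math.comb(2 * n, n)


def alignment_count_log10_approx(n: int) -> float:
    """log10 of the estimate 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


def order_of_magnitude(x: int) -> int:
    """Power of ten of a positive integer (number of digits minus one)."""
    return len(str(x)) - 1


if __name__ == "__main__":
    for n in (300, 1000):
        exact = order_of_magnitude(alignment_count(n))
        approx = alignment_count_log10_approx(n)
        print(f"n = {n}: exact ~10^{exact}, estimate ~10^{approx:.1f}")
```

Working with log10 for the estimate avoids overflowing floating-point numbers, while the exact count is safe because Python integers have arbitrary precision.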

Page 36: Master’s course Bioinformatics Data Analysis and Tools

Levinthal’s paradox (1969)

• Denatured protein refolds in ~0.1 – 1000 seconds

• Protein with e.g. 100 amino acids, each with 2 torsions ($\phi$ and $\psi$)

Each can assume 3 conformations (1 trans, 2 gauche): $3^{100 \times 2} \approx 10^{95}$ possible conformations!

• Or: 100 amino acids with 3 possibilities in Ramachandran plot (coil): $3^{100} \approx 10^{47}$ conformations

• If the protein can visit one conformation in one ps ($10^{-12}$ s), exhaustive search costs $10^{47} \times 10^{-12}$ s $= 10^{35}$ s $\approx 10^{27}$ years!

(the lifetime of the universe $\approx 10^{10}$ years…)
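The arithmetic behind the paradox is easy to reproduce. The following sketch is a back-of-the-envelope check, not part of the lecture; the function names and the choice of a 365.25-day year are assumptions. It works in log10 so the huge numbers stay manageable.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 s (assumed year length)


def log10_conformations(residues: int, states: int, torsions: int = 1) -> float:
    """log10 of states^(residues * torsions)."""
    return residues * torsions * math.log10(states)


def log10_search_years(log10_confs: float, seconds_per_conf: float = 1e-12) -> float:
    """log10 of the years needed to visit every conformation exactly once."""
    return log10_confs + math.log10(seconds_per_conf) - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    two_torsions = log10_conformations(100, 3, torsions=2)
    ramachandran = log10_conformations(100, 3)
    print(f"3^(100x2) ~ 10^{two_torsions:.1f}")
    print(f"3^100     ~ 10^{ramachandran:.1f}")
    print(f"search    ~ 10^{log10_search_years(ramachandran):.1f} years")
```

With unrounded intermediate values the search time comes out near 10^28 years, close to the slide's rounded estimate and still vastly longer than the age of the universe.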

Page 37: Master’s course Bioinformatics Data Analysis and Tools

Phylogenetic tree combinatorial explosion

Number of unrooted trees (n ≥ 3):

\[ \frac{(2n-5)!}{2^{n-3}\,(n-3)!} \]

Number of rooted trees (n ≥ 2):

\[ \frac{(2n-3)!}{2^{n-2}\,(n-2)!} \]

Page 38: Master’s course Bioinformatics Data Analysis and Tools

Combinatoric explosion

# sequences   # unrooted trees   # rooted trees
2             1                  1
3             1                  3
4             3                  15
5             15                 105
6             105                945
7             945                10,395
8             10,395             135,135
9             135,135            2,027,025
10            2,027,025          34,459,425
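The table can be regenerated from the two tree-counting formulas, (2n-5)! / (2^(n-3) (n-3)!) for unrooted trees and (2n-3)! / (2^(n-2) (n-2)!) for rooted trees. The sketch below is a minimal illustration with hypothetical function names; returning 1 for the single unrooted tree on two sequences is an added convention, since the unrooted formula is only defined for n ≥ 3.

```python
from math import factorial


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n labelled sequences."""
    if n < 3:
        return 1  # convention: one trivial tree joins two sequences
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 labelled sequences."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


if __name__ == "__main__":
    print("# sequences  # unrooted  # rooted")
    for n in range(2, 11):
        print(f"{n:<12} {unrooted_trees(n):<11,} {rooted_trees(n):,}")
```

Note that the number of rooted trees for n sequences equals the number of unrooted trees for n + 1 sequences, which is why the two columns of the table are shifted copies of each other.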

Page 39: Master’s course Bioinformatics Data Analysis and Tools

A recap on Scoring and Benchmarking

[Figure: a QUERY is scored against a DATABASE. A threshold T divides the database entries into POSITIVES (true positives and false positives) and NEGATIVES (true negatives and false negatives).]
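The four outcome classes in this recap can be counted directly once a score threshold is chosen. The sketch below is an illustration under stated assumptions: each database hit is represented as a (score, is_true_member) pair, scores at or above the threshold are called positive, and the sensitivity and specificity helpers use the standard textbook definitions, which are not spelled out on the slide.

```python
from typing import Iterable, NamedTuple, Tuple


class Confusion(NamedTuple):
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def classify_hits(hits: Iterable[Tuple[float, bool]], threshold: float) -> Confusion:
    """Count outcomes when hits scoring >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, is_true in hits:
        if score >= threshold:
            if is_true:
                tp += 1
            else:
                fp += 1
        elif is_true:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


def sensitivity(c: Confusion) -> float:
    """Fraction of true members that are found: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def specificity(c: Confusion) -> float:
    """Fraction of non-members that are rejected: TN / (TN + FP)."""
    return c.tn / (c.tn + c.fp) if (c.tn + c.fp) else 0.0


if __name__ == "__main__":
    # Made-up scores for illustration only.
    hits = [(9.0, True), (7.5, True), (6.0, False), (4.0, True), (2.0, False)]
    result = classify_hits(hits, threshold=5.0)
    print(result, sensitivity(result), specificity(result))
```

Moving the threshold trades false positives against false negatives, which is why benchmarks usually report results over a range of thresholds rather than at a single cut-off.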

Page 53: Master’s course Bioinformatics Data Analysis and Tools

Evaluating multiple alignments (MSAs)

• Conflicting standards of truth

– evolution

– structure

– function

• With orphan sequences no additional information

• Benchmarks depending on reference alignments

• Quality issue of available reference alignment databases

• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)

• “Charlie Chaplin” problem

Charlie Chaplin once joined a Charlie Chaplin competition in disguise and came third. What does this tell you about the jury’s ‘objective function’?

Page 54: Master’s course Bioinformatics Data Analysis and Tools

Evaluation measures

[Figure: Query MSA compared with Reference MSA]

Column score: ‘strict’ measure

Sum-of-Pairs score: more lenient measure

What fraction of the matched amino acid pairs (or alignment columns) in the reference MSA are recreated in the query MSA?
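Both measures can be written down in a few lines. The sketch below is a simplified illustration, not the scoring program of any particular benchmark: it assumes both MSAs are lists of gapped strings holding the same sequences in the same order, uses '-' as the gap symbol, and counts every reference column in the column score. All names are hypothetical.

```python
from itertools import combinations
from typing import List, Optional, Set, Tuple

Column = Tuple[Optional[int], ...]


def columns(msa: List[str]) -> List[Column]:
    """Turn an MSA into columns of residue indices (None marks a gap)."""
    counters = [0] * len(msa)
    result = []
    for col in zip(*msa):
        entry = []
        for seq_index, symbol in enumerate(col):
            if symbol == "-":
                entry.append(None)
            else:
                entry.append(counters[seq_index])
                counters[seq_index] += 1
        result.append(tuple(entry))
    return result


def aligned_pairs(msa: List[str]) -> Set[Tuple[int, int, int, int]]:
    """All matched residue pairs as (seq_i, residue_i, seq_j, residue_j)."""
    pairs = set()
    for col in columns(msa):
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sum_of_pairs_score(query: List[str], reference: List[str]) -> float:
    """Fraction of reference residue pairs that the query recreates."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(query)) / len(ref) if ref else 0.0


def column_score(query: List[str], reference: List[str]) -> float:
    """Fraction of reference columns that the query recreates exactly."""
    ref_cols = columns(reference)
    query_cols = set(columns(query))
    matched = sum(1 for col in ref_cols if col in query_cols)
    return matched / len(ref_cols) if ref_cols else 0.0


if __name__ == "__main__":
    reference = ["TDWVTALK", "TDWL--IK"]
    query = ["TDWVTALK", "TDW--LIK"]
    print(sum_of_pairs_score(query, reference), column_score(query, reference))
```

For this small example the query recovers 5 of the 6 reference pairs but only 6 of the 8 reference columns, which shows the column score acting as the stricter of the two measures.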

Page 55: Master’s course Bioinformatics Data Analysis and Tools

Scoring a single MSA with the Sum-of-pairs (SP) score

Sum-of-Pairs score

• Calculate the sum of all pairwise alignment scores

• This is equivalent to taking the sum of all matched a.a. pairs

• This can be done using gap penalties or not

Good alignments should have a high SP score, but it is not always the case that the true biological alignment has the highest score.
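As a concrete illustration of scoring a single MSA, the sketch below sums a pair score over every pair of symbols in every column. The scoring scheme (match, mismatch and gap values) is a placeholder assumption; in practice a substitution matrix and a proper gap model would be used. Function names are hypothetical.

```python
from itertools import combinations
from typing import List


def pair_score(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = 0) -> int:
    """Score one pair of aligned symbols; gap-gap pairs contribute nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(msa: List[str], match: int = 1, mismatch: int = 0, gap: int = 0) -> int:
    """Sum-of-pairs score: add pair_score over all symbol pairs in all columns."""
    total = 0
    for column in zip(*msa):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, match, mismatch, gap)
    return total


if __name__ == "__main__":
    msa = ["TDWVTALK", "TDWL--IK"]
    print(sp_score(msa))                        # gaps not penalised
    print(sp_score(msa, mismatch=-1, gap=-2))   # with simple penalties
```

Setting gap=0 corresponds to scoring without gap penalties, while a negative gap value gives the penalised variant mentioned on the slide.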

Page 56: Master’s course Bioinformatics Data Analysis and Tools

BAliBASE benchmark alignments

Thompson et al. (1999) NAR 27, 2682.

8 categories:

• cat. 1 - equidistant

• cat. 2 - orphan sequence

• cat. 3 - 2 distant groups

• cat. 4 - long overhangs

• cat. 5 - long insertions/deletions

• cat. 6 - repeats

• cat. 7 - transmembrane proteins

• cat. 8 - circular permutations

Page 57: Master’s course Bioinformatics Data Analysis and Tools

Evaluating multiple alignments

You can score a single MSA by summing the scores of all matched amino acid pairs. This is also referred to as the Sum-of-Pairs (SP) score, which is a bit confusing given the SP score used for comparing a query alignment with a reference alignment.

Page 58: Master’s course Bioinformatics Data Analysis and Tools

Evaluating multiple alignments

[Plot: SP score against BAliBASE alignment size (nseq * len)]

Many test alignments have a higher SP score than the reference alignment (“Charlie Chaplin problem”)

Page 59: Master’s course Bioinformatics Data Analysis and Tools

Evaluating multiple alignments

Many test alignments have a higher SP score than the reference alignment (“Charlie Chaplin problem”)

Page 60: Master’s course Bioinformatics Data Analysis and Tools

Comparing T-coffee with other methods

Column scores are used here

Page 61: Master’s course Bioinformatics Data Analysis and Tools

BAliBASE benchmark alignments

A program that is better on average does not win in all cases…

How do you know what the situation is in your case? Even with a better program you can be unlucky.

Page 62: Master’s course Bioinformatics Data Analysis and Tools

Example from course DPSFAP: inverse folding

Top-scoring structural 20 a.a. fragments in the high-specificity regions. Sequence: 3icb (residues 31–50)

Protein  Starting position  Score  Cr.m.s.d. to native (Å)  Secondary structure (DSSP)
3icb     31                 -7.36  0.00                     HHHHH TTTSSSSS HHHHH
1bbk B   32                 -6.18  5.65                     GGT SSS TT EE S E
1ezm     254                -5.93  4.61                     HHHHT TT HHHHHHHHH
8cat A   73                 -5.84  8.68                     SEEEEEEEEEE S TTT
3enl     196                -5.84  3.82                     HHHHHH GGGG B TTS B
1tie     59                 -5.75  6.17                     EESS SS TT EEEEES
3gap A   97                 -5.73  3.11                     EEHHHHHHHTTT TTTHHHH
1tfd     71                 -5.59  6.50                     EEEEEEE S SSS S E
1gsr A   159                -5.54  2.93                     HHHHH TTTTTT HHHHHHH
1apb     149                -5.53  4.14                     HHHHHHHHHHHHTT GGGE

Random: 5.88 Å

The native structure is on top

Page 63: Master’s course Bioinformatics Data Analysis and Tools

Top-scoring structural 20 a.a. fragments in regions where the native state does not have the lowest score but the CRMSDs are low. Sequence: 3icb (residues 36–55)

Protein  Starting position  Score  Cr.m.s.d. to native (Å)  Secondary structure (DSSP)
1mba     75                 -9.54  3.16                     HHHHTT HHHHHHHHHHHHH
1mbc     72                 -8.59  3.84                     HHHHTTT TTTHHHHHHHHH
3gap A   102                -8.43  3.54                     HHHHTTT TTTHHHHHHHHH
1ezm     186                -7.83  5.44                     ETTTTBSSS SEESSSGGG
1hmd A   67                 -7.47  4.76                     TTHHHHHHHHHHHHHHHHT
1sdh A   37                 -7.42  4.65                     HHHHHHH GGGGGGGGGG
2ccy A   36                 -7.34  4.38                     TTHHHHHHHHHHHHHHGGG
1ama     298                -7.11  2.67                     HHHHHHSHHHHHHHHHHHHH
3icb     36                 -7.08  0.00                     TTTSSSSS HHHHHHHH S
1pbx A   30                 -7.06  4.79                     HHHHHHH GGGGGGSTTSS

Random RMSD: 5.79 Å

The native structure is not on top

Page 64: Master’s course Bioinformatics Data Analysis and Tools

Inaccurate scoring functions

• Since most scoring functions do not rank solutions properly, often ensembles of predictions are taken

– For example, take the top 10 (or top 5%) predicted structures: is the true structure among those?

– In alignment: next to the optimal (highest scoring) alignment, the ensemble of suboptimal alignments is taken

• Biology is also capricious, e.g. the native protein structure does not always have the lowest internal energy
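The "is the true structure in the top k?" check is simple to express in code. The sketch below uses the scores of the ten fragments listed for 3icb residues 36–55 earlier in this lecture, where lower scores are better; the function names are hypothetical and the code is only an illustration of the ensemble idea.

```python
from typing import List, Tuple


def rank_of_native(candidates: List[Tuple[str, float]], native: str) -> int:
    """1-based rank of the native structure, sorted by ascending score."""
    ordered = sorted(candidates, key=lambda item: item[1])
    for rank, (name, _score) in enumerate(ordered, start=1):
        if name == native:
            return rank
    raise ValueError(f"{native} is not among the candidates")


def native_in_top_k(candidates: List[Tuple[str, float]], native: str, k: int) -> bool:
    """True if the native structure is among the k best-scoring candidates."""
    return rank_of_native(candidates, native) <= k


if __name__ == "__main__":
    # Scores for the 3icb (residues 36-55) fragments from the inverse folding example.
    fragments = [
        ("1mba", -9.54), ("1mbc", -8.59), ("3gap A", -8.43), ("1ezm", -7.83),
        ("1hmd A", -7.47), ("1sdh A", -7.42), ("2ccy A", -7.34), ("1ama", -7.11),
        ("3icb", -7.08), ("1pbx A", -7.06),
    ]
    print(rank_of_native(fragments, "3icb"))
    print(native_in_top_k(fragments, "3icb", k=10))
```

Here the native fragment ranks ninth, so a top-10 ensemble would still contain it even though a top-1 prediction would miss it.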

Page 65: Master’s course Bioinformatics Data Analysis and Tools

• Integrate data sources

• Integrate methods

• Integrate data through method integration (biological model)

Integrative bioinformatics

Page 66: Master’s course Bioinformatics Data Analysis and Tools

Integrative bioinformatics: data integration

[Figure: diagram with the labels Data, Algorithm (tool) and Biological Interpretation (model)]

Page 67: Master’s course Bioinformatics Data Analysis and Tools

Integrative bioinformatics: data integration

Data 1 Data 2 Data 3

Page 68: Master’s course Bioinformatics Data Analysis and Tools

Integrative bioinformatics: data integration

[Figure: three data sources (Data 1, Data 2, Data 3), each with its own algorithm (Algorithm 1, 2, 3) and biological interpretation (model 1, 2, 3), combined in one tool]

Page 69: Master’s course Bioinformatics Data Analysis and Tools

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in Bioinformatics makes sense except in the light of Biology”

Bioinformatics