what is bioinformatics?. what are bioinformaticians up to, actually? manage molecular biological...

Post on 26-Dec-2015

215 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

What is bioinformatics?

What are bioinformaticians up to, actually?

• Manage molecular biological data– Store in databases, organise, formalise, describe...

• Compare molecular biological data• Find patterns in molecular biological data

– phylogenies

– correlations (sequence / structure / expression / function / disease)

Goals:• characterise biological patterns & processes• predict biological properties

– low level data high level properties ⇒(eg., sequence function)⇒

Bioinformatics: neighbour disciplines

• Computational biology– Broader concept: includes computational ecology,

physiology, neurology etc...

• -omics:– Genomics– Transcriptomics– Proteomics

• Systems biology– Putting it all together...– Building models, identify control & regulation

Bioinformatics: prerequisites

• Bio- side:– Molecular biology– Cell biology – Genetics– Evolutionary theory

• -informatics side:– Computer science– Statistics– Theoretical physics

Molecular biology data...

>alpha-DATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCAAGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAGCACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGACGGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCCGGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA>alpha-AATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCCATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTGTCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTCTGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCAGTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACCATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAGTACCGTTAA

• DNA sequences

Molecular biology data...

• Amino acid sequences

• Protein structure:– X-ray crystallography

– NMR

Cell biology & proteomics data...

• Subcellular localization

Cell biology & proteomics data...

protein-protein interactions

DNA microarray technologyTranscriptomics: DNA microarray technology

Proteomics & transcriptomics data

Proteins encoded by periodically expressed genes: a functionally diverse protein category

Cell cycle regulation

Multiprotein complexes in yeast (Saccharomyces cerevisiae).

Coloured components are periodically expressed (dynamic). White components are always expressed (static).

Complexes contain both kinds of components (just-in-time assembly).

Dynamic complex formation during the yeast cell cycle.Science. 2005 Feb 4;307(5710):724-7.

de Lichtenberg U, Jensen LJ, Brunak S, Bork P.

Cell cycle regulation 2

Transcriptional regulation is poorly conserved!

Lars Juhl Jensen, Thomas Skøt Jensen, Ulrik de Lichtenberg, Søren Brunak and Peer Bork, Nature, Oct 5, 2006

Dynamic components overlap:

Phenotype data: human diseases

Prediction methods

• Homology / Alignment

• Simple pattern (“word”) recognition • Statistical methods

– Weight matrices: calculate amino acid probabilities– Other examples: Regression, variance analysis, clustering

• Machine learning– Like statistical methods, but parameters are estimated by

iterative training rather than direct calculation– Examples: Neural Networks (NN), Hidden Markov Models

(HMM), Support Vector Machines (SVM)

• Combinations

Simple pattern (“word”) recognition

Example: PROSITE entry PS00014, ER_TARGET:

Endoplasmic reticulum targeting sequence (”KDEL-signal”).

Pattern: [KRHQSA]-[DENQ]-E-L>

NB: only yes/no answers!

Statistical methods

ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV

• Estimate probabilities for nucleotides / amino acids• Information content in sequences; logos; Position-

Weight Matrices.• Quantitative answers.

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---IIE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD----TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---VASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Sequence profiles

Conserved

Non-conserved

Matching anything but G => large negative score

Anything can match

TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Unlike Position-Weight Matrices, sequence profiles are able to handle gaps

HMM (Hidden Markov Models)

Which parts of the sequence are

– in the cytoplasm?

– embedded in the membrane (TM-helices)?

– outside?

Rule-of-thumb says amino acids in the TM helices are hydrophobic – but there are exceptions

TMHMM predicts topology from sequence using a statistical model

TMHMM: predicting the topology of α-helix transmembrane proteins

IKEEHVI IQAE

HEC

IKEEHVIIQAEFYLNPDQSGEF…..Window

Input Layer

Hidden Layer

Output Layer

Weights

Neural Networks: ”Architecture”

Neural networks

• Neural networks can learn higher order correlations!– What does this

mean?

A A => 0A C => 1C A => 1C C => 0

No linear function can learn this pattern

Neural Networks: advantages

How to estimate parameters for prediction?

Model selection

Linear Regression Quadratic Regression Join-the-dots

The test set method

The test set method

The test set method

The test set method

The test set method

Problem: sequences are related

• If the sequences in the test set are closely related to those in the training set, we can not measure true generalization performanceALAKAAAAM

ALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV

Protein sorting in eukaryotes

• Proteins belong in different organelles of the cell – and some even have their function outside the cell

• Günter Blobel was in 1999 awarded The Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Secretory proteins have a signal peptide

Initially, they are transported across the ER membrane

Protein sorting: secretory pathway / ER

Signal peptides

A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region.

Signal peptides differ between proteins, and can be hard to recognize.

SignalP

http://www.cbs.dtu.dk/services/SignalP/

Method for prediction of signal peptides from the amino acid sequence

Based on Neural Networks

The first SignalP paper (Nielsen H, et al., Protein Eng. 10[1]: 1-6, January 1997) was cited 3144 times in a 10 year period

top related