JM - http://folding.chmcc.org 1
Introduction to Bioinformatics: Lecture II
From Molecular Processes to String Matching
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
Outline of the lecture
• Sequence approximation in computational molecular biology: the premise and the limits
• Getting ready for analysis of exact string matching and sequence alignment algorithms: some definitions and interplay with biology
• The notion of string/sequence similarity
• Substitution matrices for sequence alignment
Before we start: literature watch
A draft of the rat genome has been published! (RGSPC, Nature 428)
What are the first conclusions from the comparison with other mammalian genomes?
What approaches and tools have been used to perform this comparative analysis?
H (human): 2.9 Gb
M (mouse): 2.5 Gb
R (rat): 2.75 Gb
R: unique: 0.7 Gb; common with both H and M: 1.1 Gb
Biological Polymers and Central Dogma
Bio-polymer (alphabet): process (algorithm)
DNA (A, T, G, C): replication, transcription
mRNA (U, A, C, G): splicing, translation
Proteins (20 a.a.): folding, interactions
Lipids, polysaccharides, membranes, signal transduction, environmental signals etc.
Complexity of “DNA computing”
http://www.genecrc.org/site/lc/lc2d.htm
Get the relevant sequences to compare them: conservation and differences
| Problem | Algorithms | Programs |
| --- | --- | --- |
| Sequencing: fragment assembly problem | The Shortest Superstring Problem | Phrap (Green, 1994) |
| Gene finding | Hidden Markov Models, pattern recognition methods | GenScan (Burge & Karlin, 1997) |
| Sequence comparison: pairwise and multiple sequence alignments | dynamic programming, heuristic methods | BLAST (Altschul et al., 1990) |
Redundancy in biological systems
An example: sperm whale vs. human myoglobin:

Query: 1   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE 60
           M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
Sbjct: 1   MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60

Query: 61  DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH 120
           DLKKHG TVLTALG ILKKKGHHEAE+KP AQSHATKHKIP+KYLEFISE II VL S+H
Sbjct: 61  DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120

Query: 121 PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG 154
           PG+FGADAQGAMNKALELFRKD+A+ YKELG+QG
Sbjct: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154

Ex. Find the sequence of 1mba in the PDB and “blast” against nr using NCBI
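One simple way to quantify the conservation visible in the alignment above is percent identity. The sketch below is not part of the original lecture: the function names `count_identities` and `percent_identity` are hypothetical, the alignment is assumed to be ungapped, and the two strings are copied from the Query and Sbjct rows of the BLAST output.

```python
def count_identities(a: str, b: str) -> int:
    """Count positions at which two equal-length, ungapped sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (ungapped alignment)")
    return sum(x == y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Identical positions as a percentage of the alignment length."""
    if not a:
        raise ValueError("sequences must be non-empty")
    return 100.0 * count_identities(a, b) / len(a)


# Query and Sbjct rows from the BLAST output (154 residues each).
query = (
    "MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE"
    "DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH"
    "PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG"
)
sbjct = (
    "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE"
    "DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH"
    "PGDFGADAQGAMNKALELFRKDMASNYKELGFQG"
)

print(count_identities(query, sbjct), "identical positions out of", len(query))
print(f"{percent_identity(query, sbjct):.1f}% identity")
```

For these two sequences the count is 128 identical positions out of 154, roughly 83% identity.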
Limits of the sequence approximation
• All the information and various fingerprints of information processing at the molecular level (via interactions etc.), including adjustment to physiologically relevant external signals, seem to be included in nucleotide and protein sequences.
• However, there are limits to this simple approximation: actual understanding of molecular processes requires structure, chemistry, kinetics and thermodynamics.
• On the other hand, a deeper understanding of the nature of biological objects and processes greatly facilitates sequence-based studies by suggesting critical features, similarity measurements etc.
Strings, sequences and string operations
String vs. sequence duality will be important for exact vs. inexact string matching
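The distinction can be made concrete with a short sketch. It assumes the usual convention of the string-matching literature, since the slide's own definitions are not reproduced in this transcript: a substring is a contiguous block of characters, while a subsequence only has to preserve order. The helper names `find_all`, `is_substring` and `is_subsequence` are illustrative, and `find_all` is the naive exact-matching scan rather than any optimized algorithm.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Naive exact matching: every 0-based start position of pattern in text."""
    n, m = len(pattern), len(text)
    return [i for i in range(m - n + 1) if text[i:i + n] == pattern]


def is_substring(pattern: str, text: str) -> bool:
    """True if pattern occurs in text as a contiguous block."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if pattern's characters occur in text in order, gaps allowed."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to the matched character.
    return all(ch in remaining for ch in pattern)


text = "GATTACA"
print(find_all("TA", text))         # [3]
print(is_substring("GTC", text))    # False
print(is_subsequence("GTC", text))  # True
```

Exact matching works with substrings; inexact matching and alignment allow gaps, which is where subsequences come in.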
Beyond the letters: how to find better models (e.g. GC content for gene finding)
http://www.imb-jena.de/IMAGE_BPDIR.html
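GC content is one of the simplest features that goes beyond individual letters. The sketch below, which is not from the lecture, computes it over sliding windows; the function names, the window and step sizes, and the toy sequence are all assumptions chosen for illustration, and no particular threshold for gene finding is implied.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C letters in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int = 1) -> list[tuple[int, float]]:
    """GC content of each full window, paired with the window's 0-based start."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


dna = "ATATATGCGCGCGCATATAT"  # made-up sequence with a GC-rich core
for start, gc in windowed_gc(dna, window=8, step=4):
    print(start, f"{gc:.2f}")
```

On the toy sequence the two middle windows stand out with 75% GC against 25% at the flanks.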
Another example: active sites, functional motifs and multiple alignment
Distance and similarity measures
Edit distance vs. substitution score
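To make the contrast concrete, a minimal sketch follows. It assumes unit-cost edit (Levenshtein) distance, computed by dynamic programming, and a made-up match/mismatch scheme for the substitution score of an ungapped alignment; neither the costs nor the function names come from the lecture.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))       # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def substitution_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score of an ungapped alignment of two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs equal-length strings")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


print(edit_distance("ACGTT", "AGTTC"))       # 2: delete C, then append C
print(substitution_score("ACGTT", "AGTTC"))  # -1: two matches, three mismatches
```

The pair is close in edit distance because one deletion realigns most letters, yet it scores poorly without gaps. Distance is minimized while a similarity score is maximized.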
Substitution matrices for protein sequence alignment: learning and extrapolating from examples
PAM matrices (Dayhoff et al.): extrapolating to longer evolutionary times from data for very similar proteins with more than 85% sequence identity (short evolutionary time),
$$s(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}, \quad \text{e.g.} \quad P(b \mid a,2) = \sum_c P(b \mid c,1)\,P(c \mid a,1)$$
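The extrapolation step amounts to multiplying the one-step conditional probability matrix by itself. The sketch below uses a made-up two-letter alphabet and made-up probabilities purely for illustration (real PAM matrices are 20 x 20 and estimated from data). The names `compose` and `log_odds` are hypothetical, and the score is assumed to be normalized by the background frequency of the target letter b, the standard log-odds convention.

```python
import math

# P1[a][b] = P(b | a, 1): made-up one-step substitution probabilities.
P1 = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.2, "B": 0.8},
}
# Assumed background frequencies (the stationary distribution of P1).
q = {"A": 2 / 3, "B": 1 / 3}


def compose(P, Q):
    """R[a][b] = sum_c P[a][c] * Q[c][b]: apply step P, then step Q."""
    letters = list(P)
    return {
        a: {b: sum(P[a][c] * Q[c][b] for c in letters) for b in letters}
        for a in letters
    }


def log_odds(P, q):
    """s(a, b) = log(P(b | a) / q_b) for every ordered pair of letters."""
    return {a: {b: math.log(P[a][b] / q[b]) for b in P[a]} for a in P}


P2 = compose(P1, P1)  # P(b | a, 2) = sum_c P(b | c, 1) * P(c | a, 1)
print(P2["A"])
print(log_odds(P1, q)["A"]["A"], log_odds(P2, q)["A"]["A"])
```

As the evolutionary time grows, the conditional probabilities drift toward the background frequencies, so the score for an identical pair shrinks.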
BLOSUM matrices (Henikoff & Henikoff): multiple alignments of more distantly related proteins (e.g. BLOSUM50 with 50% sequence identity),
$$s(a,b) = \log \frac{p_{ab}}{q_a q_b}, \quad \text{where} \quad p_{ab} = \frac{F_{ab}}{\sum_{cd} F_{cd}}$$
Expected score:
$$\sum_{ab} q_a q_b\, s(a,b) = -\sum_{ab} q_a q_b \log \frac{q_a q_b}{p_{ab}} = -H(q \,\|\, p)$$
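A small numerical sketch may help to see why the expected score is negative. It assumes a toy two-letter alphabet with made-up pair counts, treats the counts as ordered pairs, and takes the background frequencies as the marginals of the pair frequencies. This is a simplification of the actual BLOSUM procedure, which also clusters sequences, and the function names are hypothetical.

```python
import math

# Made-up symmetric counts F[a][b] of aligned pairs; all must be positive.
F = {
    "A": {"A": 40, "B": 10},
    "B": {"A": 10, "B": 40},
}


def log_odds_scores(F):
    """Return pair frequencies p, marginals q, and scores s = log(p / (q q))."""
    letters = list(F)
    total = sum(F[a][b] for a in letters for b in letters)
    p = {a: {b: F[a][b] / total for b in letters} for a in letters}
    q = {a: sum(p[a].values()) for a in letters}
    s = {
        a: {b: math.log(p[a][b] / (q[a] * q[b])) for b in letters}
        for a in letters
    }
    return p, q, s


def expected_score(q, s):
    """Expected score of a random pair: sum over a, b of q_a * q_b * s(a, b)."""
    return sum(q[a] * q[b] * s[a][b] for a in q for b in q)


p, q, s = log_odds_scores(F)
print(s["A"]["A"], s["A"]["B"])  # positive for the match, negative for the mismatch
print(expected_score(q, s))      # negative: minus the relative entropy H(q||p)
```

A negative expected score means that random, unrelated pairs score below zero on average, so high-scoring alignments are informative.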
Summary
Web resources and materials for the course
http://folding.chmcc.org
http://folding.chmcc.org/protlab/protlab.html
http://folding.chmcc.org/intro2bioinfo/intro2bioinfo.html
Protein Modeling Lab
Remote access to PML and the Citrix software
All lectures and other materials available electronically from the PML servers
Electronic tests and homework, web submission interfaces
The web site for the Introduction to Bioinformatics course
Updates