reminder -structure of a genome human 3x10 9 bp genome: ~30,000 genes ~200,000 exons ~23 mb coding...
Post on 15-Jan-2016
![Page 1: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/1.jpg)
Reminder - Structure of a genome
Human genome: 3×10^9 bp
~30,000 genes
~200,000 exons
~23 Mb coding
~15 Mb noncoding
a gene → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein
![Page 2: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/2.jpg)
Sequence Alignment
We assume a link between the linear information stored in a DNA, RNA or amino-acid sequence and the protein function determined by its three-dimensional structure.
We want to compare the linear sequences of various genes in order to deduce function, phylogeny, structure, origin…
The level of similarity is taken as evidence of homology
![Page 3: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/3.jpg)
The Problem
Biological problem: Finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid)
Computational problem: Finding a way to perform inexact or approximate matching of subsequences within strings of characters
Statistical problem: How to estimate the validity of our results
![Page 4: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/4.jpg)
Course plan (for the next three weeks)
Details of biology
Estimate of computation time
Dynamic programming algorithm for full and local alignment
Statistical analysis of results
Dot matrices and heuristics for alignment
Distance matrices and information theory (MSA)
![Page 5: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/5.jpg)
Homology
Similarity due to descent from a common ancestor
Homologous sequences can be identified through sequence alignment
Thus, it is possible to predict or infer structure or function from primary sequence analysis
![Page 6: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/6.jpg)
Gaps
Sequences may have diverged from a common ancestor through mutations:
Substitution (AAGC → AAGT)
Insertion (AAG → AAGT)
Deletion (AAGC → AAG)
The latter two operations result in gaps ( _ )
K contiguous spaces = gap of length K (K > 0)
![Page 7: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/7.jpg)
Similarity and Alignment
Similarity has two aspects:
Quantitative aspect: a similarity measure
A number that represents the degree of similarity. Example: a score indicating a 10% match between 2 DNA sequences.
Qualitative aspect: an alignment
A mutual arrangement of two sequences that shows where the two sequences are similar and where they differ. An optimal alignment is one that exhibits the most correspondences and the fewest differences.
Example:
a b c d e _ h z
a b w d e f h _
![Page 8: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/8.jpg)
The Edit Distance between two strings
Definition: The edit distance between two strings is the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.
Example: AATT and AATG
Distance = 1 (edit operation of substitution)
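The definition above can be turned into a short computation. The following is a minimal sketch, not part of the original slides: it fills a table of distances between prefixes, a technique the later slides introduce as dynamic programming. The function name `edit_distance` is an illustrative choice.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))  # empty prefix of a: j insertions
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] versus the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (ca != cb)  # a match costs nothing
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("AATT", "AATG"))  # 1
```

Only two rows of the table are kept at a time, which is enough when the distance alone is wanted and the edit operations themselves are not.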
![Page 9: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/9.jpg)
String alignment
An edit transcript is a way to represent a particular transformation of one string into another
Emphasizes point mutations in the model
An alignment displays a relationship between two strings
Global alignment means that, for each string, the entire string is involved in the alignment
Examples:
(1)
A A G C A
A A _ C _
(2)
GSAQVKGHGKKVADAL ….
++ ++++H+ KV +   ….
NNPELQAHAGKVFKLV ….
![Page 10: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/10.jpg)
Alignment vs. Edit Transcript
Essentially equivalent:
Two opposing, mismatched characters in an alignment ↔ a substitution in the edit transcript
A gap or space in the first string of an alignment ↔ an insertion of the opposing character
A gap or space in the second string ↔ a deletion of the opposing character
Alignment vs. edit transcript: product vs. process
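The correspondence above can be sketched in a few lines. This is an illustration only: the one-letter codes M (match), S (substitution), I (insertion) and D (deletion) are labels chosen here, not notation from the slides, and `_` is assumed to be the gap symbol.

```python
def edit_transcript(first: str, second: str, gap: str = "_") -> str:
    """Read an edit transcript off the columns of a two-row alignment."""
    if len(first) != len(second):
        raise ValueError("aligned rows must have equal length")
    ops = []
    for a, b in zip(first, second):
        if a == gap and b == gap:
            raise ValueError("a column cannot hold two gaps")
        if a == gap:
            ops.append("I")  # gap in first string: insert the opposing character
        elif b == gap:
            ops.append("D")  # gap in second string: delete the opposing character
        elif a == b:
            ops.append("M")  # match, not counted as an edit
        else:
            ops.append("S")  # two differing characters: substitution
    return "".join(ops)


print(edit_transcript("AAGCA", "AA_C_"))  # MMDMD
```

Counting the letters other than M gives the number of edit operations implied by that particular alignment.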
![Page 11: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/11.jpg)
Gap cost or penalty functions
Observation:
A gap of length k is more probable than k gaps of length 1
The cause might be a single mutational event
Separated gaps probably arose from different events
Gap penalty functions:
Linear cost: treats both cases uniformly
It is common to use a higher cost h for opening a gap and a lower cost g for extending a gap
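A small sketch may make the difference between the two penalty functions concrete. It assumes one common convention, in which the opening cost h is charged for the first gap position and g for each further position; the slides do not fix the exact formula, and the numeric values are illustrative.

```python
def linear_gap(k: int, g: float = 8.0) -> float:
    """Linear cost: every gap position costs the same."""
    return g * k


def affine_gap(k: int, h: float = 8.0, g: float = 3.0) -> float:
    """Assumed convention: h opens the gap, g extends it by one position."""
    if k <= 0:
        return 0.0
    return h + g * (k - 1)


# One gap of length 3 versus three separate gaps of length 1.
print(linear_gap(3), 3 * linear_gap(1))  # 24.0 24.0
print(affine_gap(3), 3 * affine_gap(1))  # 14.0 24.0
```

Under the linear cost both cases are charged the same, whereas the open/extend scheme makes the single long gap cheaper, in line with the observation that it is the more probable event.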
![Page 12: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/12.jpg)
Pairwise Sequence Alignment
Example
Which one is better?
Sequences:
HEAGAWGHEE
PAWHEAE

Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE

Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
![Page 13: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/13.jpg)
Example
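
| | A | E | G | H | W |
|---|---|---|---|---|---|
| A | 5 | -1 | 0 | -2 | -3 |
| E | -1 | 6 | -3 | 0 | -3 |
| H | -2 | 0 | -2 | 10 | -3 |
| P | -1 | -1 | -2 | -2 | -4 |
| W | -3 | -3 | -3 | -3 | 15 |

• Gap penalty: -8
• Gap extension: -3

HEAGAWGHE-E
P-A--W-HEAE

HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
(every gap position is charged -8 here; the extension penalty is not used)
Exercise: Calculate the score for the first alignment (HEAGAWGHE-E over P-A--W-HEAE)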
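The column-by-column sum can be checked mechanically. The sketch below is an illustration under stated assumptions: the score table is the one shown on this slide, every gap position is charged the flat gap penalty of -8 (the extension penalty is ignored), and the names `COLS`, `ROWS`, `s` and `alignment_score` are hypothetical.

```python
COLS = "AEGHW"  # column letters of the slide's score table
ROWS = {        # row letter -> scores against each column letter
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}


def s(top: str, bottom: str) -> int:
    """Score for a column with `top` over `bottom` (rows index the bottom letter)."""
    return ROWS[bottom][COLS.index(top)]


def alignment_score(top: str, bottom: str, gap: int = -8) -> int:
    """Sum the column scores of a gapped alignment, flat penalty per gap position."""
    total = 0
    for a, b in zip(top, bottom):
        total += gap if "-" in (a, b) else s(a, b)
    return total


print(alignment_score("HEAGAWGHE-E", "--P-AW-HEAE"))  # 1
```

Calling the same function on the other alignment gives the answer to the exercise.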
![Page 14: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/14.jpg)
Formal Description
Problem: PairSeqAlign
Input: Two sequences x, y
Scoring matrix s
Gap penalty d
Gap extension penalty e
Output: The optimal sequence alignment
![Page 15: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/15.jpg)
How Difficult Is This?
Given two sequences of length m and n:
How many alignments are there? f(m,n)
How many non-equivalent alignments are there? g(m,n)
![Page 16: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/16.jpg)
F(n,m)
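f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1)

| n \ m | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 3 | 5 | 7 | 9 | 11 |
| 2 | 1 | 5 | 13 | 25 | 41 | 61 |
| 3 | 1 | 7 | 25 | 63 | 129 | 231 |
| 4 | 1 | 9 | 41 | 129 | 321 | 681 |
| 5 | 1 | 11 | 61 | 231 | 681 | 1683 |
| 6 | 1 | 13 | 85 | 377 | 1289 | 3653 |
| 7 | 1 | 15 | 113 | 575 | 2241 | 7183 |
| 8 | 1 | 17 | 145 | 833 | 3649 | 13073 |
| 9 | 1 | 19 | 181 | 1159 | 5641 | 22363 |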
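The table can be regenerated directly from the recurrence. A minimal sketch, assuming the boundary values f(n,0) = f(0,m) = 1 (every cell in the first row and column of the table is 1); the function name is illustrative.

```python
def count_alignments(n: int, m: int) -> list[list[int]]:
    """Table of f(i, j) for 0 <= i <= n, 0 <= j <= m, filled row by row."""
    f = [[1] * (m + 1) for _ in range(n + 1)]  # boundary: f(i,0) = f(0,j) = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = f[i - 1][j] + f[i][j - 1] + f[i - 1][j - 1]
    return f


table = count_alignments(9, 5)
for row in table:
    print(row)
print(table[9][5])  # 22363
```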
![Page 17: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/17.jpg)
F(n,m)
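$$f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1)$$
$$f(0,0) = f(n,0) = f(0,m) = 1$$
$$f(n,m) = \sum_{l=0}^{\min(n,m)} \frac{(n+m-l)!}{(n-l)!\,(m-l)!\,l!}$$
As n → ∞, f(n,n) grows like
$$(1+\sqrt{2})^{2n+1}\, n^{-1/2}$$
up to a constant factor.
Each cell f(n,m) is the sum of its three neighbours f(n-1,m-1), f(n,m-1) and f(n-1,m).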
![Page 18: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/18.jpg)
G(n,m)
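
| n \ m | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 3 | 4 | 5 | 6 |
| 2 | 1 | 3 | 6 | 10 | 15 | 21 |
| 3 | 1 | 4 | 10 | 20 | 35 | 56 |
| 4 | 1 | 5 | 15 | 35 | 70 | 126 |
| 5 | 1 | 6 | 21 | 56 | 126 | 252 |
| 6 | 1 | 7 | 28 | 84 | 210 | 462 |
| 7 | 1 | 8 | 36 | 120 | 330 | 792 |
| 8 | 1 | 9 | 45 | 165 | 495 | 1287 |
| 9 | 1 | 10 | 55 | 220 | 715 | 2002 |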
![Page 19: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/19.jpg)
G(n,m)
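$$g(n,m) = g(n-1,m) + g(n,m-1)$$
$$g(0,0) = g(n,0) = g(0,m) = 1$$
$$g(n,m) = \binom{n+m}{n}$$
$$g(n,n) = \binom{2n}{n} \sim \frac{2^{2n}}{\sqrt{\pi n}} \quad (n \to \infty)$$
Each cell g(n,m) is the sum of its neighbours g(n-1,m) and g(n,m-1); the diagonal neighbour g(n-1,m-1) is not used.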
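As a check on the formulas for g, the recurrence can be compared with the binomial coefficient and with the large-n estimate. This is a sketch for illustration; `g_table` is a hypothetical name, and `math.comb` needs Python 3.8 or later.

```python
from math import comb, pi, sqrt


def g_table(n: int, m: int) -> list[list[int]]:
    """Table of g(i, j) built from g(i,j) = g(i-1,j) + g(i,j-1)."""
    g = [[1] * (m + 1) for _ in range(n + 1)]  # boundary: g(i,0) = g(0,j) = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = g[i - 1][j] + g[i][j - 1]
    return g


g = g_table(20, 20)
exact = g[20][20]
estimate = 2 ** 40 / sqrt(pi * 20)  # 2^(2n) / sqrt(pi * n) with n = 20

print(exact == comb(40, 20))  # True
print(exact)                  # 137846528820
print(estimate / exact)       # ratio close to 1
```

Even for two sequences of length 20 the count is already above 10^11, which is why enumerating alignments one by one is not a practical strategy.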
![Page 20: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/20.jpg)
So what?
So at n = 20, we have over 120 billion non-equivalent alignments (g(20,20) = C(40,20) ≈ 1.4 × 10^11)
We want to be able to align much, much longer sequences
Some proteins have 1000 amino acids
Genes can have several thousand base pairs
![Page 21: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/21.jpg)
Dynamic Programming
General algorithmic development technique
Reuses the results of previous computations
Store intermediate results in a table for reuse
Look up in table for earlier result to build from
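The store-and-look-up idea can be illustrated on the alignment-counting recurrence f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1). The sketch below is one possible illustration rather than the course's own code: it uses Python's `functools.lru_cache` as the table of intermediate results, and the name `f_memo` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # the cache plays the role of the results table
def f_memo(n: int, m: int) -> int:
    """Number of alignments of sequences of length n and m, computed top-down."""
    if n == 0 or m == 0:
        return 1
    # Each sub-result is computed once, then looked up on every later request.
    return f_memo(n - 1, m) + f_memo(n, m - 1) + f_memo(n - 1, m - 1)


print(f_memo(9, 5))                 # 22363
print(f_memo.cache_info().currsize)  # number of stored intermediate results
```

Without the cache the same sub-problems would be recomputed an exponential number of times; with it, at most (n+1)(m+1) values are ever computed.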
![Page 22: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/22.jpg)
Global Alignment
Needleman-Wunsch 1970
Idea: Build up an optimal alignment from optimal alignments of subsequences

HEAG
--P-
Score: -25

Three ways to extend it by one column:

Top and bottom (add score from table):
HEAGA
--P-A
Score: -20

Gap with bottom:
HEAGA
--P--
Score: -33

Gap with top:
HEAG-
--P-A
Score: -33
![Page 23: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/23.jpg)
Global Alignment
Notation
xi – ith letter of string x
yj – jth letter of string y
x1..i – Prefix of x from letters 1 through i
F – matrix of optimal scores
F(i,j) represents the optimal score lining up x1..i with y1..j
d – gap penalty
s – scoring matrix
![Page 24: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/24.jpg)
Global Alignment
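The work is to build up F
Initialize: F(0,0) = 0, F(i,0) = id, F(0,j) = jd
Fill from top left to bottom right using the recursive relation
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ \max_{k \ge 1} \left[ F(i-k,j) + \mathrm{gap}(k) \right] \\ \max_{k \ge 1} \left[ F(i,j-k) + \mathrm{gap}(k) \right] \end{cases}$$
Here gap(k) is the (negative) score of a gap of length k. With a linear gap score, gap(k) = kd (d = -8 in the example), the inner maxima can be taken at k = 1, so the last two cases reduce to F(i-1,j) + d and F(i,j-1) + d.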
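The fill step can be sketched as follows for the linear-gap case, where each gap position scores d. This is a minimal illustration under assumptions: d = -8 and the score table are taken from the worked example (HEAGAWGHEE against PAWHEAE), F[i][j] indexes x by i and y by j as in the notation slide, and the helper names are hypothetical.

```python
COLS = "AEGHW"  # column letters of the example score table
ROWS = {        # row letter -> scores against each column letter
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}


def s(a: str, b: str) -> int:
    """Score for aligning letter a of x with letter b of y."""
    return ROWS[b][COLS.index(a)]


def fill_table(x: str, y: str, d: int = -8) -> list[list[int]]:
    """F[i][j] = best score for aligning x[:i] with y[:j] (linear gap score d)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d  # x[:i] against nothing but gaps
    for j in range(1, m + 1):
        F[0][j] = j * d  # y[:j] against nothing but gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                F[i - 1][j] + d,                          # x_i aligned to a gap
                F[i][j - 1] + d,                          # y_j aligned to a gap
            )
    return F


F = fill_table("HEAGAWGHEE", "PAWHEAE")
print(F[10][7])  # 1
```

With x indexed by i, this table is the transpose of one drawn with the letters of y as rows and the letters of x as columns.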
![Page 25: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/25.jpg)
Global Alignment
F(i,j) is computed from three neighbouring cells:
From F(i-1,j-1), add s(xi,yj): move ahead in both
From F(i-1,j), add d: xi aligned to gap
From F(i,j-1), add d: yj aligned to gap
While building the table, keep track of where the optimal score came from (reverse arrows)
![Page 26: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/26.jpg)
Example
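
| | | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 | -9 | -17 | -25 | -33 | -41 | -49 | -57 | -65 | -73 |
| A | -16 | | | | | | | | | | |
| W | -24 | | | | | | | | | | |
| H | -32 | | | | | | | | | | |
| E | -40 | | | | | | | | | | |
| A | -48 | | | | | | | | | | |
| E | -56 | | | | | | | | | | |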
![Page 27: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/27.jpg)
Completed Table
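
| | | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 | -9 | -17 | -25 | -33 | -41 | -49 | -57 | -65 | -73 |
| A | -16 | -10 | -3 | -4 | -12 | -20 | -28 | -36 | -44 | -52 | -60 |
| W | -24 | -18 | -11 | -6 | -7 | -15 | -5 | -13 | -21 | -29 | -37 |
| H | -32 | -14 | -18 | -13 | -8 | -9 | -13 | -7 | -3 | -11 | -19 |
| E | -40 | -22 | -8 | -16 | -16 | -9 | -12 | -15 | -7 | 3 | -5 |
| A | -48 | -30 | -16 | -3 | -11 | -11 | -12 | -12 | -15 | -5 | 2 |
| E | -56 | -38 | -24 | -11 | -6 | -12 | -14 | -15 | -12 | -9 | 1 |

Score: Gap -8, Error -2, Fit +6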
![Page 28: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/28.jpg)
Traceback
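
| | | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 | -9 | -17 | -25 | -33 | -41 | -49 | -57 | -65 | -73 |
| A | -16 | -10 | -3 | -4 | -12 | -20 | -28 | -36 | -44 | -52 | -60 |
| W | -24 | -18 | -11 | -6 | -7 | -15 | -5 | -13 | -21 | -29 | -37 |
| H | -32 | -14 | -18 | -13 | -8 | -9 | -13 | -7 | -3 | -11 | -19 |
| E | -40 | -22 | -8 | -16 | -16 | -9 | -12 | -15 | -7 | 3 | -5 |
| A | -48 | -30 | -16 | -3 | -11 | -11 | -12 | -12 | -15 | -5 | 2 |
| E | -56 | -38 | -24 | -11 | -6 | -12 | -14 | -15 | -12 | -9 | 1 |

HEAGAWGHE-E
--P-AW-HEAE

Trace arrows back from the lower right to the top left
• Diagonal – both
• Up – upper gap
• Left – lower gap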
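The traceback can be sketched without storing arrows, by re-checking at each cell which neighbour produces its score. This is an illustrative sketch under assumptions, not the slides' own code: it uses a linear gap score d = -8 and the score table of the worked example, prefers the diagonal move when several moves tie, and indexes F[i][j] with i over x (top sequence) and j over y (bottom sequence).

```python
COLS = "AEGHW"
ROWS = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}


def s(a: str, b: str) -> int:
    """Score for aligning letter a of x with letter b of y."""
    return ROWS[b][COLS.index(a)]


def needleman_wunsch(x: str, y: str, d: int = -8) -> tuple[int, str, str]:
    """Return (score, aligned x, aligned y) for one optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)

    # Walk back from the last cell to F[0][0], emitting columns in reverse.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])  # diagonal: both letters
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-")       # gap in lower sequence
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])       # gap in upper sequence
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top_row, bottom_row = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)       # 1
print(top_row)     # HEAGAWGHE-E
print(bottom_row)  # --P-AW-HEAE
```

Where two moves give the same score, a different tie-breaking rule would return a different alignment with the same optimal score.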
![Page 29: Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation](https://reader036.vdocuments.us/reader036/viewer/2022062314/56649d4d5503460f94a2be8b/html5/thumbnails/29.jpg)
Summary
Uses recursion to fill in intermediate results table
Uses O(nm) space and time
O(n^2) algorithm when both sequences have length about n
Feasible for moderate-sized sequences, but not for aligning whole genomes.