computational genomics (0382.3102) lecture 3 sequence ...bchor/cg/lecture3.pdfcomputational genomics...
TRANSCRIPT
Computational Genomics (0382.3102)
Lecture 3
Sequence Similarity and Pairwise Alignment II:
Affine Gap Penalties, Local Alignment,
BLAST and FASTA Hueristics
Prof. Benny Chor
School of Computer Science
Tel-Aviv University
Based in part on chapter *** in Gusfield’s book, chapter 3 in Kanehisa’s book,
and on a ppt presentation by Terry Speed (UC Berkeley)c
�
Benny Chor – p.1
DistancesLet
�
be a (finite or infinite) set. A distance on
�
is afunction � � � ��� � �
satisfying the followingthree properties:
� Symmetry:
��� � �
,
� �� � � � � �
.
Non-negativity: , , andif and only if .
Triangle inequality: ,
c
�
Benny Chor – p.2
DistancesLet
�
be a (finite or infinite) set. A distance on
�
is afunction � � � ��� � �
satisfying the followingthree properties:
� Symmetry:
��� � �
,
� �� � � � � �
.
� Non-negativity:
��� � �,
� �� � �
, and� �� � � �
if and only if � .
Triangle inequality: ,
c
�
Benny Chor – p.2
DistancesLet
�
be a (finite or infinite) set. A distance on
�
is afunction � � � ��� � �
satisfying the followingthree properties:
� Symmetry:
��� � �
,
� �� � � � � �
.
� Non-negativity:
��� � �,
� �� � �
, and� �� � � �
if and only if � .
� Triangle inequality:� �� �� � �
,� �� � � �� � � �� ��
c
�
Benny Chor – p.2
Famous DistancesA distance on
�
, � � � ��� � �
is also called anorm in math jargon. Example of norms (for some ofthese it is not immediate to verify that triangleinequality holds).
� � �� � � �
if � (this norm is a bit boring).
Let ( -dim. real vectors) and .
In math jargon, this is known as the norm.
c
�
Benny Chor – p.3
Famous DistancesA distance on
�
, � � � ��� � �
is also called anorm in math jargon. Example of norms (for some ofthese it is not immediate to verify that triangleinequality holds).
� � �� � � �
if � (this norm is a bit boring).
� Let
� �
�
(
�
-dim. real vectors) and � �
.� � �� � � � � � �� � � �� � � � � � � � �
� � ��� �
� � � � � �
In math jargon, this is known as the
� norm.
c
�
Benny Chor – p.3
More Famous Distances
� � with � � �
is the “regular” Euclidean distance.
with (sum of absolute values ofdifferences).
The norm: (the limit of as).
Let be a finite, undirected, connectedgraph,with positive edges’ lengths. Letbe a pair of vertices.
Define the length of the shortest pathfrom to in . This is a norm.
c
�
Benny Chor – p.4
More Famous Distances
� � with � � �
is the “regular” Euclidean distance.
� � with � � �
(sum of absolute values ofdifferences).
The norm: (the limit of as).
Let be a finite, undirected, connectedgraph,with positive edges’ lengths. Letbe a pair of vertices.
Define the length of the shortest pathfrom to in . This is a norm.
c
�
Benny Chor – p.4
More Famous Distances
� � with � � �
is the “regular” Euclidean distance.
� � with � � �
(sum of absolute values ofdifferences).
� The
� norm: � �� ��� �
� � � � (the limit of
� as
� ).
Let be a finite, undirected, connectedgraph,with positive edges’ lengths. Letbe a pair of vertices.
Define the length of the shortest pathfrom to in . This is a norm.
c
�
Benny Chor – p.4
More Famous Distances
� � with � � �
is the “regular” Euclidean distance.
� � with � � �
(sum of absolute values ofdifferences).
� The
� norm: � �� ��� �
� � � � (the limit of
� as
� ).
� Let � ��
�
be a finite, undirected, connectedgraph,with positive edges’ lengths. Let �� �
be a pair of vertices.
Define the length of the shortest pathfrom to in . This is a norm.
c
�
Benny Chor – p.4
More Famous Distances
� � with � � �
is the “regular” Euclidean distance.
� � with � � �
(sum of absolute values ofdifferences).
� The
� norm: � �� ��� �
� � � � (the limit of
� as
� ).
� Let � ��
�
be a finite, undirected, connectedgraph,with positive edges’ lengths. Let �� �
be a pair of vertices.
� Define
� �� � � the length of the shortest pathfrom to � in . This is a norm.
c
�
Benny Chor – p.4
Distance vs. Similarity
� Distance and similarity are dual notions. If andare highly similar objects, than intuitively they
have small distance.
This intuition indeed holds for pairwise globalsequence alignment (see prob. 6 in assignment 1).
We can replace sequence similarity by distanceand obtain qualitatively similar results forpairwise global sequence alignment.
Dynamic programming can be used to findminimum distance alignment in time .
c
�
Benny Chor – p.5
Distance vs. Similarity
� Distance and similarity are dual notions. If andare highly similar objects, than intuitively they
have small distance.
� This intuition indeed holds for pairwise globalsequence alignment (see prob. 6 in assignment 1).
We can replace sequence similarity by distanceand obtain qualitatively similar results forpairwise global sequence alignment.
Dynamic programming can be used to findminimum distance alignment in time .
c
�
Benny Chor – p.5
Distance vs. Similarity
� Distance and similarity are dual notions. If andare highly similar objects, than intuitively they
have small distance.
� This intuition indeed holds for pairwise globalsequence alignment (see prob. 6 in assignment 1).
� We can replace sequence similarity by distanceand obtain qualitatively similar results forpairwise global sequence alignment.
Dynamic programming can be used to findminimum distance alignment in time .
c
�
Benny Chor – p.5
Distance vs. Similarity
� Distance and similarity are dual notions. If andare highly similar objects, than intuitively they
have small distance.
� This intuition indeed holds for pairwise globalsequence alignment (see prob. 6 in assignment 1).
� We can replace sequence similarity by distanceand obtain qualitatively similar results forpairwise global sequence alignment.
� Dynamic programming can be used to findminimum distance alignment in time
��� � � � .
c
�
Benny Chor – p.5
Local Sequence Alignment
� In local sequence alignment, we have two inputsequences (strings): The query
�
, and the text .
The goal is to find two substring: – substring ofand – substring of such that the (global)
alignment score between and is maximized(over all choices of pairs of substrings).
Notice that if , is an optimal alignment ofand , then , can contain indels.
c
�
Benny Chor – p.6
Local Sequence Alignment
� In local sequence alignment, we have two inputsequences (strings): The query
�
, and the text .
� The goal is to find two substring: – substring of�
and – substring of such that the (global)alignment score between and is maximized(over all choices of pairs of substrings).
Notice that if , is an optimal alignment ofand , then , can contain indels.
c
�
Benny Chor – p.6
Local Sequence Alignment
� In local sequence alignment, we have two inputsequences (strings): The query
�
, and the text .
� The goal is to find two substring: – substring of�
and – substring of such that the (global)alignment score between and is maximized(over all choices of pairs of substrings).
� Notice that if
�
,�
is an optimal alignment ofand , then
�
,�
can contain indels.
c
�
Benny Chor – p.6
Local Alignment DP Algorithm
� The global and local alignment problems seemsvery different. But a rather small change in the”global” DP algorithm yields an efficient��� � � � ”local” DP algorithm.
Goal: Fill the matrix by values , thevalue of the best (global) alignment between allsuffixes of -prefix of and suffixes of -prefix of
.
c
�
Benny Chor – p.7
Local Alignment DP Algorithm
� The global and local alignment problems seemsvery different. But a rather small change in the”global” DP algorithm yields an efficient��� � � � ”local” DP algorithm.
� Goal: Fill the � � � matrix by values
� ��
� �
, thevalue of the best (global) alignment between allsuffixes of
�
-prefix of�
and suffixes of
�
-prefix of.
c
�
Benny Chor – p.7
Local Alignment DP Algorithm
� Initialize all boundary values (upper row, leftcolumn) to
�
.
Update rule:
Keep pointers like before (no pointer if).
Pick the highest entry in the matrix. Trace backto recover optimal local alignment(s).
c
�
Benny Chor – p.8
Local Alignment DP Algorithm
� Initialize all boundary values (upper row, leftcolumn) to
�
.
� Update rule:
� � �
�
� � � � � � �
� � ��
� � � � � � � � ��
� � � � ��
� � �
�
� � � � ��
� � � � ��
� ��
� � � � � � � � � �� � ��
� ��
Keep pointers like before (no pointer if).
Pick the highest entry in the matrix. Trace backto recover optimal local alignment(s).
c
�
Benny Chor – p.8
Local Alignment DP Algorithm
� Initialize all boundary values (upper row, leftcolumn) to
�
.
� Update rule:
� � �
�
� � � � � � �
� � ��
� � � � � � � � ��
� � � � ��
� � �
�
� � � � ��
� � � � ��
� ��
� � � � � � � � � �� � ��
� ��
� Keep pointers like before (no pointer if
� �� � �
).
Pick the highest entry in the matrix. Trace backto recover optimal local alignment(s).
c
�
Benny Chor – p.8
Local Alignment DP Algorithm
� Initialize all boundary values (upper row, leftcolumn) to
�
.
� Update rule:
� � �
�
� � � � � � �
� � ��
� � � � � � � � ��
� � � � ��
� � �
�
� � � � ��
� � � � ��
� ��
� � � � � � � � � �� � ��
� ��
� Keep pointers like before (no pointer if
� �� � �
).
� Pick the highest entry in the matrix. Trace backto recover optimal local alignment(s).
c
�
Benny Chor – p.8
Local Sequence Alignment IIFor example, suppose our sequences are� � �� � �� �� � �� �� �
and
� �� � � � � �� � �� � �� � �
.
Our scoring gives +2 for a match, -1 for amismatch,and -2 for indel.
A reasonable (best?) choice would be the substringsand , with the (global) alignment
between them being
_
In the context of the original sequences, we haveand
c
�
Benny Chor – p.9
Local Sequence Alignment IIFor example, suppose our sequences are� � �� � �� �� � �� �� �
and
� �� � � � � �� � �� � �� � �
.
Our scoring gives +2 for a match, -1 for amismatch,and -2 for indel.
A reasonable (best?) choice would be the substringsand , with the (global) alignment
between them being
_
In the context of the original sequences, we haveand
c
�
Benny Chor – p.9
Local Sequence Alignment IIFor example, suppose our sequences are� � �� � �� �� � �� �� �
and
� �� � � � � �� � �� � �� � �
.
Our scoring gives +2 for a match, -1 for amismatch,and -2 for indel.
A reasonable (best?) choice would be the substrings�� �� � ��
and
�� � �� � ��
, with the (global) alignmentbetween them being
_
In the context of the original sequences, we haveand
c
�
Benny Chor – p.9
Local Sequence Alignment IIFor example, suppose our sequences are� � �� � �� �� � �� �� �
and
� �� � � � � �� � �� � �� � �
.
Our scoring gives +2 for a match, -1 for amismatch,and -2 for indel.
A reasonable (best?) choice would be the substrings�� �� � ��
and
�� � �� � ��
, with the (global) alignmentbetween them being
�
_
� �� � ��
�� � �� � ��
In the context of the original sequences, we haveand
c
�
Benny Chor – p.9
Local Sequence Alignment IIFor example, suppose our sequences are� � �� � �� �� � �� �� �
and
� �� � � � � �� � �� � �� � �
.
Our scoring gives +2 for a match, -1 for amismatch,and -2 for indel.
A reasonable (best?) choice would be the substrings�� �� � ��
and
�� � �� � ��
, with the (global) alignmentbetween them being
�
_
� �� � ��
�� � �� � ��In the context of the original sequences, we have�� � �� �� � �� �� �
and�� � � � � �� � �� � �� � �
c
�
Benny Chor – p.9
Other Versions of Alignment
� Global in
�
, local in .
Affine gap penalties.
Linear space algorithm.
c
�
Benny Chor – p.10
Other Versions of Alignment
� Global in
�
, local in .
� Affine gap penalties.
Linear space algorithm.
c
�
Benny Chor – p.10
Other Versions of Alignment
� Global in
�
, local in .
� Affine gap penalties.
� Linear
��� � � space algorithm.
c
�
Benny Chor – p.10