Computational Genomics (0382.3102)
Lecture 4
Local Sequence Similarity Heuristics:
FASTA and BLAST
PAM and BLAST AAs Scoring Matrices
Prof. Benny Chor
School of Computer Science
Tel-Aviv University
Based in part on sections 15.4-15.10 in Gusfield’s book, chapter 3 in Kanehisa’s book, and on a ppt presentation by Terry Speed (UC Berkeley)
© Benny Chor – p.1
Local Alignment Heuristics
• The $O(n \cdot m)$ time DP local alignment algorithm is efficient, but not efficient enough.
• For example, if $n = 10^3$ (putative gene length) and $m = 10^{10}$ (≈ GenBank size), then $n \cdot m = 10^{13}$. The NCBI server would collapse trying to respond to tens of thousands of daily queries.
• Possible remedies:
  1. Devise a more efficient algorithm.
  2. Special-purpose hardware (CGEN).
  3. Resort to heuristics.
• Most popular – FASTA and BLAST.
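To make the quadratic cost concrete, here is a minimal Python sketch of the dynamic program for the best local alignment score. The scoring values (match +1, mismatch -1, linear gap -1) and the function name `local_alignment_score` are illustrative assumptions, not taken from the lecture.

```python
def local_alignment_score(query, text, match=1, mismatch=-1, gap=-1):
    """Best local alignment score, computed by an O(n*m) dynamic program.

    The scoring scheme (match/mismatch/gap) is an assumed toy scheme.
    Only two rows of the DP table are kept, so memory is O(m).
    """
    n, m = len(query), len(text)
    prev = [0] * (m + 1)  # DP row for the previous query position
    best = 0
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            sub = match if query[i - 1] == text[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a fresh local alignment
                prev[j - 1] + sub,  # align query[i-1] with text[j-1]
                prev[j] + gap,      # gap in the text
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The inner statement runs once per (i, j) pair, so with a query of length 10^3 against a database of length 10^10 it would be executed 10^13 times for a single query.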
Dot Matrices
• A method for visual detection of local similarities (by human eyes).
• Basis for the FASTA heuristic.
• Consider the query
ATCACACGGG
and the text
TATCGCAGTCAATTC
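Dot matrix (∗ marks a match, . a mismatch):
  A T C A C A G G G
T . ∗ . . . . . . .
A ∗ . . ∗ . ∗ . . .
T . ∗ . . . . . . .
C . . ∗ . ∗ . . . .
G . . . . . . ∗ ∗ ∗
C . . ∗ . ∗ . . . .
A ∗ . . ∗ . ∗ . . .
G . . . . . . ∗ ∗ ∗
T . ∗ . . . . . . .
C . . ∗ . ∗ . . . .
...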
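A dot matrix is simple to compute: cell (i, j) is marked exactly when the i-th row character equals the j-th column character. The following Python sketch rebuilds the matrix for the column labels `ATCACAGGG` and the first ten row labels `TATCGCAGTC` shown on the slide; the function names `dot_matrix` and `render` are hypothetical.

```python
def dot_matrix(query, text):
    """Boolean matrix: rows follow the text, columns follow the query."""
    return [[t == q for q in query] for t in text]


def render(query, text):
    """Plain-text rendering with '*' for a match and '.' for a mismatch."""
    lines = ["  " + " ".join(query)]
    for t, row in zip(text, dot_matrix(query, text)):
        lines.append(t + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("ATCACAGGG", "TATCGCAGTC"))
```

Building the full matrix costs the same O(n·m) as the DP algorithm, so it is a visualization aid for short sequences rather than a way to search a database.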
Dot Matrices (cont.)
• A method for visual detection of local similarities (by human eyes).
• Biologists used to have expertise in identifying long diagonal stretches in dot matrices.
• Such stretches serve as seeds for good local alignments.
• Basis for understanding the FASTA heuristic.
The FASTA Heuristic
Basic intuition: A good local alignment between two sequences usually contains intervals with perfect matches.
Hot Spots: Consecutive runs of ≥ ktup matches (typically ktup = 6 for DNA, ktup = 2 for AAs).
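Two hot spots (ktup = 3 here):
  A T C A C A G G G
T . . . . . . . . .
A ∗ . . . . . . . .
T . ∗ . . . . . . .
C . . ∗ . . . . . .
G . . . . . . . . .
C . . . . ∗ . . . .
A . . . . . ∗ . . .
G . . . . . . ∗ . .
T . . . . . . . . .
C . . . . . . . . .
...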
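The hot spots in this example can be recovered mechanically. The Python sketch below walks every diagonal of the dot matrix and reports maximal runs of at least ktup consecutive matches. It is a direct diagonal scan written for illustration, not necessarily how FASTA itself locates hot spots; the function name `hot_spots` and the 0-based `(text_start, query_start, length)` output format are assumptions.

```python
def hot_spots(query, text, ktup):
    """Return (text_start, query_start, length) for every maximal diagonal
    run of at least ktup consecutive matches (0-based positions)."""
    spots = []
    n, m = len(text), len(query)
    # Diagonal d contains the cells (i, j) with i - j == d.
    for d in range(-(m - 1), n):
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < n and j < m:
            if text[i] == query[j]:
                run += 1
            else:
                if run >= ktup:
                    spots.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= ktup:  # a run that reaches the matrix border
            spots.append((i - run, j - run, run))
    return spots


print(hot_spots("ATCACAGGG", "TATCGCAGTC", ktup=3))
# prints [(1, 0, 3), (5, 4, 3)]
```

The two reported runs are `ATC` (text positions 1-3, query positions 0-2) and `CAG` (text positions 5-7, query positions 4-6). Both lie on the same diagonal, which is the kind of situation where nearby hot spots suggest a single underlying local alignment.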
BLAST
Statistical Significance of Local Alignments
• What significance means.
• Parameters affecting significance: length of text, length of query, length of local alignment, low-complexity regions.
Famous Distances
A distance on $S$, $D : S \times S \to \mathbb{R}_{\ge 0}$, is also called a metric in math jargon. Examples of metrics (for some of these it is not immediate to verify that the triangle inequality holds):
• $D(v, w) = 1$ if $v \ne w$, and $D(v, v) = 0$ (this metric is a bit boring).
• Let $S = \mathbb{R}^d$ ($d$-dimensional real vectors) and $p \ge 1$. Then
$$D(\langle v_1, \ldots, v_d \rangle, \langle u_1, \ldots, u_d \rangle) = \sqrt[p]{\sum_{i=1}^{d} |v_i - u_i|^p}.$$
In math jargon, this is the distance induced by the $\ell_p$ norm.
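As a small illustration, the two distances above can be written directly in Python. The function names `discrete_distance` and `lp_distance` are hypothetical, and the input checks are an added assumption rather than part of the definitions.

```python
def discrete_distance(v, w):
    """The 'boring' distance: 0 for equal points, 1 otherwise."""
    return 0 if v == w else 1


def lp_distance(v, u, p=2.0):
    """p-th root of the sum of |v_i - u_i|^p, for p >= 1."""
    if len(v) != len(u):
        raise ValueError("vectors must have the same dimension")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(v, u)) ** (1.0 / p)
```

For instance, between the points (0, 0) and (3, 4) the value for p = 1 is 3 + 4 = 7, while for p = 2 it is the Euclidean distance, the square root of 9 + 16, which is 5.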