
Computational Genomics (0382.3102)

Lecture 4

Local Sequence Similarity Heuristics:

FASTA and BLAST

PAM and BLAST AAs Scoring Matrices

Prof. Benny Chor

School of Computer Science

Tel-Aviv University

Based in part on sections 15.4-15.10 in Gusfield’s book, chapter 3 in Kanehisa’s book, and on a ppt presentation by Terry Speed (UC Berkeley)

© Benny Chor


Local Alignment Heuristics

• The O(n · m) time DP local alignment algorithm is efficient, but not efficient enough.

• For example, if n = 10^3 (putative gene length) and m = 10^10 (≈ GenBank size), then n · m = 10^13. The NCBI server would collapse trying to respond to tens of thousands of daily queries.

• Possible remedies:
1. Devise a more efficient algorithm.
2. Special purpose hardware (CGEN).
3. Resort to heuristics.

• Most popular: FASTA and BLAST.
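To make the O(n · m) cost concrete, here is a minimal Python sketch of the DP local alignment recurrence (Smith-Waterman style, score only). The scoring values (match +1, mismatch -1, linear gap -1) and the function name are illustrative assumptions, not values taken from the lecture. The two nested loops fill one cell per (query position, text position) pair, which is where the n · m count comes from.

```python
def local_alignment_score(query, text, match=1, mismatch=-1, gap=-1):
    """Best local alignment score via the O(n*m) dynamic program.

    The scoring values are illustrative assumptions, not from the lecture.
    """
    n, m = len(query), len(text)
    prev = [0] * (m + 1)   # DP row for the previous query position
    best = 0
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            same = query[i - 1] == text[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A local alignment may restart anywhere, hence the 0 option.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Cell count for the sizes quoted on the slide.
n, m = 10**3, 10**10
cell_updates = n * m   # one DP cell per (query, text) position pair
```

With these sizes `cell_updates` is 10^13 per query, which is the figure the slide uses to motivate heuristics.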


Dot Matrices

• A method for visual detection of local similarities (by human eyes).

• Basis for the FASTA heuristic.

• Consider the query

ATCACACGGG

and the text

TATCGCAGTCAATTC


Dot Matrices

[Figure: dot matrix with the query (A T C A C A G G G) labelling the columns and the text (T A T C G C A G T C ...) labelling the rows; a ∗ marks every cell whose row and column characters match.]
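A dot matrix like the one in the figure can be computed directly from the two sequences. The following Python sketch uses the query and text given on the earlier slide; the function names `dot_matrix` and `render` are hypothetical helpers introduced here for illustration.

```python
def dot_matrix(query, text):
    """One row per text character; '*' where it equals the query character."""
    return [["*" if t == q else " " for q in query] for t in text]


def render(query, text):
    """Plain-text rendering: query across the top, text down the side."""
    lines = ["  " + " ".join(query)]
    for t, row in zip(text, dot_matrix(query, text)):
        lines.append(t + " " + " ".join(row))
    return "\n".join(lines)


query = "ATCACACGGG"
text = "TATCGCAGTCAATTC"
print(render(query, text))
```

Diagonal runs of `*` in the printed grid correspond to stretches where the query and the text agree character by character.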


Dot Matrices (cont.)

• A method for visual detection of local similarities (by human eyes).

• Biologists used to have expertise in identifying long diagonal stretches in dot matrices.

• Such stretches serve as seeds for good local alignments.

• Basis for understanding the FASTA heuristic.


The FASTA Heuristic

Basic intuition: A good local alignment between two sequences usually contains intervals with perfect matches.

Hot Spots: Consecutive runs of ≥ ktup matches (typically ktup = 6 for DNA, ktup = 2 for AAs).

[Figure: dot matrix of the query A T C A C A G G G (columns) against the text T A T C G C A G T C ... (rows), keeping only the matches that lie on two diagonal runs, labelled "two hotspots (ktup=3 here)".]
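The hot-spot step can be sketched in Python as follows. This covers only the detection of runs of at least ktup consecutive matches, not the full FASTA program. The lookup-table approach, the function name `hot_spots`, and the return format are assumptions made for illustration. The sequences are the row and column labels as they appear in the figure.

```python
from collections import defaultdict


def hot_spots(query, text, ktup):
    """Find maximal runs of >= ktup consecutive matches, grouped by diagonal.

    Returns {diagonal: [(query_start, text_start, length), ...]} where
    diagonal = text index - query index.
    """
    # Lookup table: each ktup-long word of the query -> its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the text once and record word hits by diagonal.
    hits = defaultdict(set)
    for j in range(len(text) - ktup + 1):
        for i in index.get(text[j:j + ktup], ()):
            hits[j - i].add(i)

    # Consecutive word starts on one diagonal form a single longer run.
    runs = defaultdict(list)
    for diag, starts in hits.items():
        ordered = sorted(starts)
        run_start = prev = ordered[0]
        for i in ordered[1:]:
            if i != prev + 1:
                runs[diag].append((run_start, run_start + diag, prev - run_start + ktup))
                run_start = i
            prev = i
        runs[diag].append((run_start, run_start + diag, prev - run_start + ktup))
    return dict(runs)


print(hot_spots("ATCACAGGG", "TATCGCAGTC", ktup=3))
# → {1: [(0, 1, 3), (4, 5, 3)]}
```

Both runs (ATC and CAG) fall on the same diagonal, which is the situation the figure labels as two hot spots. Indexing the query words first means the text is scanned only once, rather than comparing every pair of positions.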


BLAST


Statistical Significance of Local Alignments

• What significance means

• Parameters affecting it (length of text, length of query, length of local alignment, low complexity regions)


Famous Distances

A distance on S, D : S × S → ℝ≥0, is also called a metric in math jargon. Examples of metrics (for some of these it is not immediate to verify that the triangle inequality holds):

• D(v, w) = 1 if v ≠ w, and D(v, v) = 0 (this metric is a bit boring).

• Let S = ℝ^d (d-dim. real vectors) and p ≥ 1.

$$D(\langle v_1, \ldots, v_d \rangle, \langle u_1, \ldots, u_d \rangle) = \sqrt[p]{\sum_{i=1}^{d} |v_i - u_i|^p}.$$

In math jargon, this is the distance induced by the ℓ_p norm.
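As a small numeric illustration, both distances above can be written in a few lines of Python. The function names are hypothetical, and vectors are assumed to be given as equal-length sequences of numbers.

```python
def discrete_distance(v, w):
    """The 'boring' distance: 0 for identical points, 1 otherwise."""
    return 0 if v == w else 1


def lp_distance(v, u, p):
    """p-th root of the sum of |v_i - u_i|^p, for p >= 1."""
    if len(v) != len(u):
        raise ValueError("vectors must have the same dimension")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(v, u)) ** (1.0 / p)


print(lp_distance((0, 0), (3, 4), p=1))   # sum of coordinate differences
print(lp_distance((0, 0), (3, 4), p=2))   # Euclidean distance
```

For p = 1 this sums the absolute coordinate differences, and for p = 2 it is the familiar Euclidean distance.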
