sequence alignment and database searching 2 biological motivation u inference of homology two genes...
TRANSCRIPT
![Page 1: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/1.jpg)
.
Sequence Alignment and Database Searching
![Page 2: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/2.jpg)
2
Biological Motivation
Inference of Homology Two genes are homologous if they share a
common evolutionary history. Evolutionary history can tell us a lot about
properties of a given gene Homology can be inferred from similarity
between the genes Searching for Proteins with same or similar
functions
![Page 3: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/3.jpg)
3
Orthologous and Paralogous
Orthologous sequences differ because they are found in different species (a speciation event)
Paralogous sequences differ due to a gene duplication event
Sequences may be both orthologous and paralogous
![Page 4: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/4.jpg)
4
Protein Evolution
“For many protein sequences, evolutionary history can be traced back 1-2 billion years”
-William Pearson When we align sequences, we assume that they share a common ancestor
They are then homologous Protein fold is much more conserved than protein sequence DNA sequences tend to be less informative than protein sequences
![Page 5: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/5.jpg)
5
Sequence Alignment
Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length(S’1, S’2 are S1, S2 with possibly additional gaps)
Example: S1= GCGCATGGATTGAGCGA S2= TGCGCCATTGATGACC A possible alignment:
S’1= -GCGC-ATGGATTGAGCGA
S’2= TGCGCCATTGAT-GACC--
Goal: How similar are two sequences S1 and S2
Global Alignment:
![Page 6: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/6.jpg)
6
Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length(S’1, S’2 are substrings of S1, S2 with possibly additional
gaps)
Example: S1= GCGCATGGATTGAGCGA S2= TGCGCCATTGATGACC A possible alignment:
S’1= ATTGA-G
S’2= ATTGATG
Goal: Find the pair of substrings in two input sequences which have the highest similarity
Local Alignment:
Sequence Alignment
![Page 7: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/7.jpg)
7
Sequence Alignment
-GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A
Three elements:
Perfect matches
Mismatches
Insertions & deletions (indel)
Score each position independently Score of an alignment is sum of position scores
![Page 8: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/8.jpg)
8
Global vs. Local Alignments
Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
Local alignment algorithms finds the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
![Page 9: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/9.jpg)
9
![Page 10: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/10.jpg)
10
Global Alignment (Needleman -Wunsch)
The the Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle)
Global algorithms are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence. Sometimes two sequences may be derived from ancient
recombination events where only a single functional domain is shared.
Global methods are useful when you want to force two sequences to align over their entire length
![Page 11: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/11.jpg)
11
Local Alignment (Smith-Waterman)
Local alignment Identify the most similar sub-region shared between
two sequences Smith-Waterman
![Page 12: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/12.jpg)
12
Properties of Sequence Alignment and Search
DNA Should use evolution sensitive measure of similarity Should allow for alignment on exons => searching for local
alignment as opposed to global alignment Proteins
Should allow for mutations => evolution sensitive measure of similarity
Many proteins do not display global patterns of similarity, but instead appear to be built from functional modules => searching for local alignment as opposed to global alignment
![Page 13: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/13.jpg)
13
Parameters of Sequence Alignment
Scoring Systems:
• Each symbol pairing is assigned a numerical value, based on a symbol comparison table.
Gap Penalties:
• Opening: The cost to introduce a gap
• Extension: The cost to elongate a gap
![Page 14: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/14.jpg)
14
Pairwise Alignment The alignment of two sequences (DNA or protein)
is a relatively straightforward computational problem. There are lots of possible alignments.
Two sequences can always be aligned. Sequence alignments have to be scored. Often there is more than one solution with the
same score.
•
![Page 15: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/15.jpg)
15
Methods of Alignment
By hand - slide sequences on two lines of a word processor
Dot plot with windows
Rigorous mathematical approach Dynamic programming (slow, optimal)
Heuristic methods (fast, approximate) BLAST and FASTA
Word matching and hash tables0
![Page 16: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/16.jpg)
16
Align by Hand
GATCGCCTA_TTACGTCCTGGAC <--
--> AGGCATACGTA_GCCCTTTCGC
You still need some kind of scoring system to find the best alignment
![Page 17: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/17.jpg)
17
Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant
A C C T G A G – A G A C G T G – G C A G
70% identical
mismatchindel
![Page 18: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/18.jpg)
18
Considerations
• The window/stringency method is more sensitive than the wordsize method (ambiguities are permitted).
• The smaller the window, the larger the weight of statistical (unspecific) matches.
• With large windows the sensitivity for short sequences is reduced.
• Insertions/deletions are not treated explicitly.
![Page 19: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/19.jpg)
19
Alignment methods
Rigorous algorithms = Dynamic Programming
Needleman-Wunsch (global) Smith-Waterman (local)
Heuristic algorithms (faster but approximate)
BLASTFASTA
![Page 20: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/20.jpg)
20
Dynamic Programming
Dynamic Programming is a very general programming technique.
It is applicable when a large search space can be structured into a succession of stages, such that: the initial stage contains trivial solutions to sub-problems
each partial solution in a later stage can be calculated by recurring a fixed number of partial solutions in an earlier stage
the final stage contains the overall solution
![Page 21: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/21.jpg)
21
• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
• Three possibilities:
• xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj)
• xi is aligned to a gap, F(i,j) = F(i-1,j) - d
• yj is aligned to a gap, F(i,j) = F(i,j-1) - d
• The best score up to (i,j) will be the largest of the three options
Creation of an alignment path matrix
![Page 22: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/22.jpg)
22
Creation of an alignment path matrix
Idea:Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences
• Construct matrix F indexed by i and j (one index for each sequence)
• F(i,j) is the score of the best alignment between the initial segment x1...i of x up to xi and the initial segment y1...j of y up to yj
• Build F(i,j) recursively beginning with F(0,0) = 0
![Page 23: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/23.jpg)
23
![Page 24: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/24.jpg)
24
H E A G A W G H E E 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
Backtracking
-5
1
-A
EEHHG-WWAA
G-AP
E-H-
0
-25
-5
-20
-13
-3
3
-8 -16
-17
Optimal global alignment: EE
![Page 25: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/25.jpg)
25
DNA Scoring Systems-very simple
actaccagttcatttgatacttctcaaa
taccattaccgtgttaactgaaaggacttaaagact
Sequence 1
Sequence 2
A G C T
A 1 0 0 0
G 0 1 0 0
C 0 0 1 0
T 0 0 0 1
Match: 1Mismatch: 0Score = 5
![Page 26: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/26.jpg)
26
Protein Scoring Systems
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Sequence 1
Sequence 2
Scoringmatrix
T:G = -2 T:T = 5Score = 48
C S T P A G N D . .
C 9
S -1 4
T -1 1 5
P -3 -1 -1 7
A 0 1 0 -1 4
G -3 0 -2 -2 0 6
N -3 1 0 -2 -2 0 5
D -3 0 -1 -1 -2 -1 1 6
.
.
C S T P A G N D . .
C 9
S -1 4
T -1 1 5
P -3 -1 -1 7
A 0 1 0 -1 4
G -3 0 -2 -2 0 6
N -3 1 0 -2 -2 0 5
D -3 0 -1 -1 -2 -1 1 6
.
.
![Page 27: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/27.jpg)
27
• Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.
CP
GGAVI
L
MF
Y
W HK
RE Q
DN
S
T
CSH
S+S
positive
chargedpolar
aliphatic
aromatic
small
tiny
hydrophobic
Protein Scoring Systems
![Page 28: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/28.jpg)
28
• Scoring matrices reflect:
– # of mutations to convert one to another– chemical similarity– observed mutation frequencies– the probability of occurrence of each amino acid
• Widely used scoring matrices:
• PAM
• BLOSUM
Protein Scoring Systems
![Page 29: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/29.jpg)
29
PAM matricesPAM matrices
Family of matrices PAM 80, PAM 120, PAM 250Family of matrices PAM 80, PAM 120, PAM 250
The number with a PAM matrix represents the evolutionary The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is baseddistance between the sequences on which the matrix is based
Greater numbers denote greater distancesGreater numbers denote greater distances
![Page 30: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/30.jpg)
30
.
PAM (Percent Accepted Mutations) matrices
• The numbers of replacements were used to compute a so-called PAM-1 matrix.
• The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix.
• PAM250 = 250 mutations per 100 residues.
• Greater numbers mean bigger evolutionary distance
![Page 31: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/31.jpg)
31
PAM (Percent Accepted Mutations) matrices
• Derived from global alignments of protein families . Family members share at least 85% identity (Dayhoff et al., 1978).
• Construction of phylogenetic tree and ancestral sequences of each protein family
• Computation of number of replacements for each pair of amino acids
![Page 32: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/32.jpg)
32
PAM - limitations PAM - limitations
Based on only one original datasetBased on only one original dataset
Examines proteins with few differences (85% Examines proteins with few differences (85% identity)identity)
Based mainly on small globular proteins so the Based mainly on small globular proteins so the matrix is biased matrix is biased
![Page 33: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/33.jpg)
33
• Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff,1992).
• Occurrences of each amino acid pair in each column of each block alignment is counted.
• The numbers derived from all blocks were used to compute the BLOSUM matrices.
A
A
C
E
C
A - C = 4A - E = 2C - E = 2A - A = 1C - C = 1
BLOSUM (Blocks Substitution Matrix)
AACEC
![Page 34: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/34.jpg)
34
BLOSUM matricesBLOSUM matrices
Different BLOSUMDifferent BLOSUMnn matrices are calculated independently from matrices are calculated independently from BLOCKS (ungapped local alignments)BLOCKS (ungapped local alignments)
BLOSUMBLOSUMnn is based on a cluster of BLOCKS of sequences that is based on a cluster of BLOCKS of sequences that share at least share at least nn percent identity percent identity
BLOSUMBLOSUM6262 represents closer sequences than BLOSUM represents closer sequences than BLOSUM4545
![Page 35: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/35.jpg)
35
BLOSUM (Blocks Substitution Matrix)
• Sequences within blocks are clustered according to their level of identity.
• Clusters are counted as a single sequence. • Different BLOSUM matrices differ in the percentage of sequence identity used in clustering.
• The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix. • Greater numbers mean smaller evolutionary distance.
![Page 36: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/36.jpg)
36
The Blosum50 Scoring Matrix
![Page 37: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/37.jpg)
37
PAM Vs. BLOSUMPAM Vs. BLOSUM
PAM100 = BLOSUM90PAM100 = BLOSUM90 PAM120 = BLOSUM80PAM120 = BLOSUM80 PAM160 = BLOSUM60PAM160 = BLOSUM60 PAM200 = BLOSUM52PAM200 = BLOSUM52 PAM250 = BLOSUM45PAM250 = BLOSUM45
More distant sequences
BLOSUM62 for general useBLOSUM62 for general useBLOSUM80 for close relationsBLOSUM80 for close relationsBLOSUM45 for distant relationsBLOSUM45 for distant relations
PAM120 for general usePAM120 for general usePAM60 for close relations PAM60 for close relations PAM250 for distant relationsPAM250 for distant relations
![Page 38: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/38.jpg)
38
PAM Matricies
PAM 40 - prepared by multiplying PAM 1 by itself a total of 40 times best for short alignments with high similarity
PAM 120 - prepared by multiplying PAM 1 by itself a total of 120 times best for general alignment
PAM 250 - prepared by multiplying PAM 1 by itself a total of 250 times best for detecting distant sequence similarity
![Page 39: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/39.jpg)
39
BLOSUM Matricies
BLOSUM 90 - prepared from BLOCKS sequences with >90% sequence ID best for short alignments with high similarity
BLOSUM 62 - prepared from BLOCKS sequences with >62% sequence ID best for general alignment (default)
BLOSUM 30 - prepared from BLOCKS sequences with >30% sequence ID best for detecting weak local alignments
![Page 40: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/40.jpg)
40
TIPS on choosing a scoring matrix
• Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
• When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices.
• For database searching the commonly used matrix is BLOSUM62.
![Page 41: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/41.jpg)
41
T A T G T G G A A T G A
Scoring Insertions and Deletions
A T G T - - A A T G C A
A T G T A A T G C A
T A T G T G G A A T G A
The creation of a gap is penalized with a negative score value.
insertion / deletion
![Page 42: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/42.jpg)
42
1 GTGATAGACACAGACCGGTGGCATTGTGG 29 ||| | | ||| | || || |1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
Why Gap Penalties?
Gaps allowed but not penalized Score: 88
Gaps not permitted Score: 0
1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | ||| || | | || || |1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
Match = 5Mismatch = -4
![Page 43: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/43.jpg)
43
• The optimal alignment of two similar sequences is usually that which
• maximizes the number of matches and• minimizes the number of gaps.•There is a tradeoff between these two
- adding gaps reduces mismatches
• Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences.
• Penalizing gaps forces alignments to have relatively few gaps.
Why Gap Penalties?
![Page 44: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/44.jpg)
44
Gap Penalties
•How to balance gaps with mismatches?
•Gaps must get a steep penalty, or else you’ll end up with nonsense alignments.
•In real sequences, muti-base (or amino acid) gaps are quit common
•genetic insertion/deletion events
•“Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.
![Page 45: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/45.jpg)
45
45
BLOSUM vs PAM
BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.
BLOSUM 45 BLOSUM 62 BLOSUM 90
PAM 250 PAM 160 PAM 100
More Divergent Less Divergent
![Page 46: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/46.jpg)
46
46
What do the Score and the e-value really mean?
The quality of the alignment is represented by the Score (S).
The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically .
The significance of each alignment is computed as an E value (E).
Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
![Page 47: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/47.jpg)
47
47
Notes on E-values
Low E-values suggest that sequences are homologous๏ Can’t show non-homology
Statistical significance depends on both the size of the alignments and the size of the sequence database‣ Important consideration for comparing results across
different searches
‣ E-value increases as database gets bigger
‣ E-value decreases as alignments get longer
![Page 48: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/48.jpg)
48
48
Homology: Some Guidelines
Similarity can be indicative of homology Generally, if two sequences are
significantly similar over entire length they are likely homologous
Low complexity regions can be highly similar without being homologous
Homologous sequences not always highly similar
![Page 49: Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology Two genes are homologous if they share a common evolutionary](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e4f5503460f94b46013/html5/thumbnails/49.jpg)
49
49
Suggested BLAST Cutoffs
For nucleotide based searches, one should look for hits with E-values of 10-6 or less and sequence identity of 70% or more
For protein based searches, one should look for hits with E-values of 10-3 or less and sequence identity of 25% or more
Take Home
Message:
Always loo
k at your
alignments