class 3: sequence similarity
![Page 1: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/1.jpg)
Class 3: Sequence similarity
![Page 2: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/2.jpg)
Motivation
• Same gene, or similar gene
• Suffix of A similar to prefix of B?
• Suffix of A similar to prefix of B..Z?
• Longest similar substring of A, B
• Longest similar substring of A, B..Z
• For each, How big? How similar?
![Page 3: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/3.jpg)
Define alignment
• Align these two sequences optimally
GACGGATT
GATCGGTT
• Define precisely what an alignment is
![Page 4: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/4.jpg)
Definition of alignment
• Insert spaces so that the letters line up, or letters align with spaces
GA-CGGATT
GATCGG-TT
• Don’t allow spaces to line up
• Allow spaces even at beginning and end
GCAT-
-CATG
![Page 5: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/5.jpg)
Define similarity
• Given an alignment, compute a similarity score
• Three possibilities for each column
letter-letter match
letter-letter mismatch
letter-space mismatch
![Page 6: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/6.jpg)
Optimal alignment
• Create score function
• Conventionally:
+1 bonus for match
-1 penalty for letter-letter mismatch
-2 penalty for letter-space mismatch
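The column-by-column scoring above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `score_alignment` and its keyword parameters are assumptions, and `-` stands for a space.

```python
def score_alignment(a, b, match=1, mismatch=-1, space=-2):
    """Score two gapped strings of equal length, one column at a time."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            # The definition of alignment forbids two spaces lining up.
            raise ValueError("two spaces may not line up")
        if x == "-" or y == "-":
            total += space       # letter-space column
        elif x == y:
            total += match       # letter-letter match
        else:
            total += mismatch    # letter-letter mismatch
    return total


print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 7 matches, 2 spaces: 7 - 4 = 3
print(score_alignment("GCAT-", "-CATG"))          # 3 matches, 2 spaces: 3 - 4 = -1
```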
![Page 7: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/7.jpg)
Dynamic programming solution
• Given sequences s,t of length m,n
• Strategy: build up optimal alignment of prefixes
• Base case?
• Recurrence relation?
![Page 8: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/8.jpg)
Recurrence
• Given opt alignment of prefixes of s,t shorter than i,j, find opt of s[1..i], t[1..j]
• Three possibilities:
– extend s by a letter, t by a space
– extend s by a letter, t by a letter
– extend s by a space, t by a letter
![Page 9: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/9.jpg)
Tiny instance -- AGC, AAAC
0 -2 -4 -6 -8
-2
-4
-6
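One way to complete the table for this tiny instance is sketched below, assuming the conventional scores from the earlier slide (+1, -1, -2) and taking AGC as the row sequence and AAAC as the column sequence. The name `similarity_table` is made up for this example.

```python
def similarity_table(s, t, match=1, mismatch=-1, space=-2):
    """Fill the (len(s)+1) x (len(t)+1) table of optimal prefix scores."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base case: a prefix aligned against the empty string is all spaces.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    # Recurrence: the last column is letter/space, letter/letter or space/letter.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,   # s[i] over a space
                          a[i - 1][j - 1] + p,   # s[i] over t[j]
                          a[i][j - 1] + space)   # space over t[j]
    return a


for row in similarity_table("AGC", "AAAC"):
    print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right entry is the similarity of the two full sequences, which is -1 for this instance.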
![Page 10: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/10.jpg)
Some dp details
• What is a good order to fill the array?
• How do you recover the opt alignment?
• What do you do about ties?
• What is the space complexity of this algorithm?
• What is the time complexity of this algorithm?
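As one possible answer to the recovery question, the sketch below fills the table row by row and then walks back from the bottom-right corner, re-checking which of the three cases produced each entry. The function name `align` and the tie-breaking order (diagonal first, then up, then left) are assumptions of this example; any fixed order yields some optimal alignment.

```python
def align(s, t, match=1, mismatch=-1, space=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):            # row by row, left to right
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + space,
                          a[i][j - 1] + space)

    # Traceback: rebuild the alignment from the last column to the first.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, x, y = align("AGC", "AAAC")
print(score)
print(x)
print(y)
```

This version stores the whole table, so it uses time and space proportional to m times n.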
![Page 11: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/11.jpg)
The gap penalty
• Model above assumes two gaps of size 1 are equivalent to one gap of size 2
• Is this realistic? Why or why not?
![Page 12: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/12.jpg)
General gap penalties
• Alignments can no longer be scored as the sum of their parts
• They still are the sum of blocks with one matched letter or one gap each
• Blocks are: matched letters, s-gap, t-gap
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T
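Block-wise scoring can be illustrated with a short sketch. The function `score_blocks` and the gap-cost function passed to it are assumptions of this example. Each letter-letter column is scored on its own, while each maximal run of spaces is charged once, according to its length.

```python
from itertools import groupby


def column_type(x, y):
    """Classify a column (assumes the two sides are never both spaces)."""
    if x == "-":
        return "s-gap"      # space in the top sequence
    if y == "-":
        return "t-gap"      # space in the bottom sequence
    return "letters"


def score_blocks(a, b, gap_cost, match=1, mismatch=-1):
    """Score an alignment where a gap of length k costs gap_cost(k)."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    # groupby yields maximal runs of columns of the same type.
    for kind, cols in groupby(zip(a, b), key=lambda c: column_type(*c)):
        cols = list(cols)
        if kind == "letters":
            total += sum(match if x == y else mismatch for x, y in cols)
        else:
            total -= gap_cost(len(cols))   # one charge per whole gap
    return total


# The alignment from the slide, with a hypothetical cost of 4 + k per gap.
print(score_blocks("AAC---AGATAAC", "ACTCGGT---AAT", lambda k: 4 + k))  # -15
```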
![Page 13: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/13.jpg)
DP for general gaps
• Requires three arrays, one for each block type
• Time complexity is cubic
• This is expensive at best, prohibitive for large problems
• See Setubal/Meidanis 3.3.2 for details
![Page 14: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/14.jpg)
Affine gap penalty
• Charge h for each gap, plus g * (len(gap))
• This still has quadratic complexity!
• See Setubal/Meidanis
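A minimal sketch of a quadratic-time DP for the affine penalty follows. It keeps three tables, one per block type, in the style usually attributed to Gotoh. It is written for this note rather than copied from Setubal/Meidanis, and the names `affine_score`, `M`, `X` and `Y` are assumptions. A gap of length k costs h + g*k.

```python
NEG = float("-inf")


def affine_score(s, t, h=3, g=1, match=1, mismatch=-1):
    """Best global alignment score when a gap of length k costs h + g*k."""
    m, n = len(s), len(t)
    # M: last column is letter/letter; X: s letter over space; Y: space over t letter.
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(h + g * i)
    for j in range(1, n + 1):
        Y[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = p + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap pays h + g; extending the same gap pays only g.
            X[i][j] = max(M[i - 1][j] - h - g, X[i - 1][j] - g, Y[i - 1][j] - h - g)
            Y[i][j] = max(M[i][j - 1] - h - g, Y[i][j - 1] - g, X[i][j - 1] - h - g)
    return max(M[m][n], X[m][n], Y[m][n])


print(affine_score("ACGT", "AT", h=3, g=1))    # one gap of length 2: 1 + 1 - 5 = -3
print(affine_score("AGC", "AAAC", h=0, g=2))   # h = 0 reduces to the -2-per-space model: -1
```

Each cell needs a constant number of comparisons, so the running time stays proportional to m times n.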
![Page 15: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/15.jpg)
Point accepted mutations
• Some mutations are more likely than others
• In proteins, some amino acids are more similar than others (size, charge, hydrophobicity)
• A point accepted mutation matrix is a table with the probability of each transition in a fixed time
![Page 16: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/16.jpg)
PAM matrices
• Each row of the matrix sums to 1 (row a gives the probability of a becoming each letter b, or staying a)
• A ‘unit of evolution’ (1 PAM) is the time in which 1 in 100 amino acids is expected to change
![Page 17: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/17.jpg)
Scoring matrix
• Consider aligned letters a,b
• Pr(b is a mutation of a) = Mab
• Pr(b is a random occurrence) = pb
• Score(a,b) = 10 log10(Mab / pb)
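The log-odds score is easy to try out numerically. In the sketch below the function name `log_odds_score` and the probabilities are made up for illustration. They are not taken from any real PAM matrix.

```python
import math


def log_odds_score(m_ab, p_b):
    """10 * log10 of (probability b arose from a) / (probability b occurs by chance)."""
    return 10 * math.log10(m_ab / p_b)


# Hypothetical values: background frequency of b is 0.05.
print(log_odds_score(0.05, 0.05))    # as likely as chance: score 0
print(log_odds_score(0.5, 0.05))     # ten times likelier than chance: score about +10
print(log_odds_score(0.005, 0.05))   # ten times less likely than chance: score about -10
```

Positive scores mark pairs that are seen more often than chance would predict, and negative scores mark pairs seen less often.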
![Page 18: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/18.jpg)
Blast
• Basic Local Alignment Search Tool
• Def: ‘segment’ is a contiguous substring (no gaps)
• Def: ‘segment pair’ is two segments of equal length
• Rem: the score of a segment pair is the sum of the scores of its aligned letter pairs
![Page 19: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/19.jpg)
What Blast does
• Input:
– a PAM matrix
– a database of sequences B
– a query sequence A
– a threshold S
• Output:
– all segment pairs (A,B) with score > S
![Page 20: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/20.jpg)
How Blast works
• Compile short, high-scoring strings (words)
• Search for hits -- each hit gives a seed
• Extend seeds
![Page 21: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/21.jpg)
Blast on proteins
• Words are w-mers which score at least T against A
• Use hashing or a DFA (deterministic finite automaton) to search for hits
• Extend seed until heuristically determined limit is reached
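The word-compilation step for proteins can be sketched as a brute-force enumeration. Everything here is an assumption made for illustration: the function `neighborhood_words`, the three-letter alphabet, and `toy_score`, which stands in for a real PAM-style scoring matrix. Real implementations avoid enumerating every candidate.

```python
from itertools import product


def neighborhood_words(query, w, T, alphabet, score):
    """Map each w-mer scoring at least T against some query w-mer to its query positions."""
    words = {}
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            # Ungapped score of the candidate word against this query word.
            if sum(score(a, b) for a, b in zip(q_word, cand)) >= T:
                words.setdefault("".join(cand), []).append(i)
    return words


def toy_score(a, b):
    """Hypothetical stand-in for a scoring matrix lookup."""
    return 2 if a == b else -1


print(sorted(neighborhood_words("AC", 2, 1, "ACG", toy_score)))
# ['AA', 'AC', 'AG', 'CC', 'GC']
```

The resulting dictionary plays the role of the hash table: a database word found among its keys is a hit, and the stored positions say where in the query to seed.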
![Page 22: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/22.jpg)
Blast on nucleic acids
• Words are w-mers in query A
• Letters are compressed, four to a byte
• Filter database B for very common words to avoid false positives
• Extend seeds as in proteins
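The seed-and-extend idea for nucleic acids can be sketched as follows. This is a toy under stated assumptions, not how Blast is actually implemented: the function names, the +1/-1 scores and the `x_drop` stopping rule (give up once the running score falls more than `x_drop` below the best seen) are choices made for this example. Byte compression and common-word filtering are left out.

```python
from collections import defaultdict


def index_words(query, w):
    """Hash table from each w-mer of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def extend_one_way(query, target, qi, ti, step, match, mismatch, x_drop):
    """Walk from (qi, ti) in direction step; return (best gain, columns kept)."""
    run = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        run += match if query[qi] == target[ti] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break                      # heuristic limit reached
        qi += step
        ti += step
    return best, best_len


def seed_and_extend(query, target, w=4, threshold=5, match=1, mismatch=-1, x_drop=2):
    """Return sorted (score, query_start, target_start, length) with score > threshold."""
    index = index_words(query, w)
    found = set()                      # several seeds can grow into the same segment pair
    for ti in range(len(target) - w + 1):
        for qi in index.get(target[ti:ti + w], ()):
            r_gain, r_len = extend_one_way(query, target, qi + w, ti + w, 1,
                                           match, mismatch, x_drop)
            l_gain, l_len = extend_one_way(query, target, qi - 1, ti - 1, -1,
                                           match, mismatch, x_drop)
            score = w * match + l_gain + r_gain
            if score > threshold:
                found.add((score, qi - l_len, ti - l_len, l_len + w + r_len))
    return sorted(found, reverse=True)


print(seed_and_extend("ACGTACGT", "TTACGTACGTTT"))  # [(8, 0, 2, 8)]
```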
![Page 23: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/23.jpg)
What does Blast give you?
• Efficiency
• A rigorous statistical theory which gives the probability of a segment pair occurring by chance
![Page 24: Class 3: Sequence similarity](https://reader035.vdocuments.us/reader035/viewer/2022062805/56814d79550346895dbad7fd/html5/thumbnails/24.jpg)
Homework
• Given sequences s,t of length m,n, how many alignments do they have?
• Setubal/Meidanis, pp. 101, 102. Problems 2, 3, 4, 8, 16.