sequence similarity
DESCRIPTION
Sequence Similarity. Why sequence similarity. structural similarity >25% sequence identity similar structure evolutionary relationship all proteins come from < 2000 (super) families related functional role similar structure similar function functional modules are often preserved. - PowerPoint PPT PresentationTRANSCRIPT
Sequence Similarity
Why sequence similarity
structural similarity
>25% sequence identity similar structure
evolutionary relationship
all proteins come from < 2000 (super)families
related functional role
similar structure similar function
functional modules are often preserved
Muscle cells and contraction
Actin and myosin during muscle movement
Actin structure
Actin sequence
• Actin is ancient and abundant Most abundant protein in cells 1-2 actin genes in bacteria, yeasts, amoebas Humans: 6 actin genes
-actin in muscles; -actin, -actin in non-muscle cells• ~4 amino acids different between each version
MUSCLE ACTIN Amino Acid Sequence
1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI 121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG 181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL 241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG 301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY 361 DESGPSIVHR KCF
A related protein in bacteria
Relation between sequence and structure
A multiple alignment of actins
Gene expression
Protein
RNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
Biomolecules as Strings
• Macromolecules are the chemical building blocks of cells
Proteins• 20 amino acids
Nucleic acids• 4 nucleotides {A, C, G, ,T}
Polysaccharides
The information is in the sequence
• Sequence Structure Function
• Sequence similarity
Structural and/or Functional similarity
• Nucleic acids and proteins are related by molecular evolution Orthologs: two proteins in animals X and Y that evolved from one
protein in immediate ancestor animal Z Paralogs: two proteins that evolved from one protein through
duplication in some ancestor Homologs: orthologs or paralogs that exhibit sequence similarity
Protein Phylogenies
• Proteins evolve by both duplication and species divergence
orthologs
paralogs
duplication
Evolution
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC - - - - CAGTCACCA…
Mutation
SEQUENCE EDITS
REARRANGEMENTS
Deletion
InversionTranslocationDuplication
Evolutionary Rates
OK
OK
OK
X
X
Still OK?
next generation
Changes in non-functional sites are OK, so will be propagated
Most changes in functional sites are deleterious and will be rejected
Sequence conservation implies function
Proteins between humans and rodents are on average 85% identical
Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
DefinitionGiven two strings x = x1x2...xM, y
= y1y2…yN,
an alignment is an assignment of gaps to positions
0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap
in the other sequence
AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC
What is a good alignment?
Alignment: The “best” way to match the letters of one sequence with those of the other
How do we define “best”?
Alignment:A hypothesis that the two sequences come from a common ancestor through sequence edits
Parsimonious explanation:Find the minimum number of edits that transform one sequence into the other
Scoring Function
• Sequence edits:AGGCCTC
Mutations AGGACTC
InsertionsAGGGCCTC
DeletionsAGG–CTC
Scoring Function:Match: + mMismatch: – sGap: – d
Score F = (# matches) m – (# mismatches) s – (#gaps) d
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments:
O( 2M+N)
Alignment is additive
Observation:The score of aligning x1……xM
y1……yN
is additive
Say that x1…xi xi+1…xM
aligns to y1…yj yj+1…yN
The two scores add up:
F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
Key property: optimal solution to the entire problem is composed of optimal solutions to subproblems– Dynamic Programming
Dynamic Programming
Construct a DP matrix F: MxN:
Suppose we wish to alignx1……xM
y1……yN
Let F(i, j) = optimal score of aligning
x1……xi
y1……yj
Dynamic Programming (cont’d)
Notice three possible cases:
1. xi aligns to yj
x1……xi-1 xi
y1……yj-1 yj
2. xi aligns to a gap
x1……xi-1 xi
y1……yj -
3. yj aligns to a gap
x1……xi -
y1……yj-1 yj
m, if xi = yj
F(i, j) = F(i-1, j-1) + -s, if not
F(i, j) = F(i-1, j) – d
F(i, j) = F(i, j-1) – d
Dynamic Programming (cont’d)
• How do we know which case is correct?
Inductive assumption:F(i, j – 1), F(i – 1, j), F(i – 1, j – 1) are optimal
Then,F(i – 1, j – 1) + s(xi, yj)
F(i, j) = max F(i – 1, j) – dF(i, j – 1) – d
Where s(xi, yj) = m, if xi = yj; -s, if not
Example
x = AGTA m = 1y = ATA s = -1
d = -1
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
F(i,j) i = 0 1 2 3 4
j = 0
1
2
3
Optimal Alignment:
F(4, 3) = 2
AGTAA - TA
The Needleman-Wunsch Algorithm
1. Initialization.a. F(0, 0) = 0b. F(0, j) = - j dc. F(i, 0) = - i d
2. Main Iteration. Filling-in partial alignmentsa. For each i = 1……M
For each j = 1……N F(i-1,j-1) + s(xi, yj)
[case 1]F(i, j) = max F(i-1, j) – d
[case 2] F(i, j-1) – d
[case 3]
DIAG, if [case 1]Ptr(i,j) = LEFT, if [case 2]
UP, if [case 3]
3. Termination. F(M, N) is the optimal score, andfrom Ptr(M, N) can trace back optimal alignment
Performance
• Time:O(NM)
• Space:O(NM)
• Possible to reduce space to O(N+M) using Hirschberg’s divide & conquer algorithm
Substitutions of Amino Acids
Mutation rates between amino acids have dramatic differences!
How can we quantify the differencesin rates by which one amino acidreplaces another across related proteins?
Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)2. Cluster sequences according to > X% identity
3. Calculate Aab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate
P(a) = (b Aab)/(c≤d Acd); P(a, b) = Aab/(c≤d Acd)
Gaps are not inserted uniformly
A state model for alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACCIMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
M(+1,+1)
I(+1, 0)
J(0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J
Let’s score the transitions
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACCIMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
M(+1,+1)
I(+1, 0)
J(0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J
s(xi, yj)
s(xi, yj) s(xi, yj)
-d -d
-e -e
A probabilistic model for alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACCIMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
M(+1,+1)
I(+1, 0)
J(0, +1)
Assign probabilities to every transition (arrow), and emission (pair of letters or gaps)
• Probabilities of mutation reflect amino acid similarities
• Different probabilities for opening and extending gap
A Pair HMM for alignments
MP(xi, yj)
IP(xi)
JP(yj)
log(1 – 2)
log(1 – )
log log log log
log(1 – )
log Prob(xi, yj)
Highest scoring path corresponds to the most likely alignment!
How do we find the highest scoring path?
• Compute the following matrices (DP) M(i, j): most likely alignment of x1…xi with y1…yj ending in state M
I(i, j): most likely alignment of x1…xi with y1…yj ending in state I
J(i, j): most likely alignment of x1…xi with y1…yj ending in state J
M(i, j) = log( Prob(xi, yj) ) +
max{ M(i-1, j-1) + log(1-2),
I(i-1, j) + log(1-), J(i, j-1) + log(1-) }
I(i, j) = max{ M(i-1, j) + log ,
I(i-1, j) + log }
MP(xi, yj)
IP(xi)
JP(yj)
log(1 – 2)
log(1 – )
log log log log
log(1 – )
log Prob(xi, yj)
The Viterbi algorithm for alignment
• For each i = 1, …, M For each j = 1, …, N
M(i, j) = log( Prob(xi, yj) ) +
max { M(i-1, j-1) + log(1-2),
I(i-1, j) + log(1-), J(i, j-1) + log(1-)
}
I(i, j) = max { M(i-1, j) + log ,
I(i-1, j) + log }
J(i, j) = max { M(i-1, j) + log ,
I(i-1, j) + log }
When matrices are filled, we can trace back from (M, N) the likeliest alignment
One way to view the state paths – State M
x1
xm
y1 yn……
……
State I
x1
xm
y1 yn……
……
State J
x1
xm
y1 yn……
……
Putting it all together
States I(i, j) are connected with states J and M (i-1, j)
States J(i, j) are connected with states I and M (i-1, j)
States M(i, j) are connected with states J and I (i-1, j-1)
x1
xm
y1 yn……
……
Putting it all together
States I(i, j) are connected with states J and M (i-1, j)
States J(i, j) are connected with states I and M (i-1, j)
States M(i, j) are connected with states J and I (i-1, j-1)
Optimal solution is the best scoring path from top-left to bottom-right corner
This gives the likeliest alignment according to our HMM
x1
xm
y1 yn……
……
Yet another way to represent this model
Mx1 Mxm
Sequence X
BEGIN Iy Iy
Ix Ix
END
We are aligning, or threading, sequence Y through sequence X
Every time yj lands in state xi, we get substitution score s(xi, yj)
Every time yj is gapped, or some xi is skipped, we pay gap penalty