Post on 18-Dec-2015
Definitions
Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically meaningful.
Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
Local alignment - Smith-Waterman (1981) gives the highest scoring local match between two sequences.
Pairwise Global Alignment
Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
Reasons for making a global alignment:
Checking minor differences between two sequences
Analyzing polymorphisms (e.g. SNPs) between closely related sequences
…
Pairwise Global Alignment
Computationally:
Given: a pair of sequences (strings of characters)
Output: an alignment that maximizes the similarity
How can we find an optimal alignment?
ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
1                        27
How many possible alignments?
C(27,7) gap positions = ~888,000 possibilities
Dynamic programming: The Needleman & Wunsch algorithm
Time Complexity
Consider two sequences:
AAGT
AGTC
How many possible alignments do the 2 sequences have?
C(2n, n) = (2n)!/(n!)^2 = Ω(2^(2n)/n) = Ω(2^n)
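The counts quoted above can be checked numerically. The following is a minimal sketch, assuming Python 3.8+ for `math.comb`; the helper name `central_binomial` and the Stirling estimate 4^n / sqrt(pi n) are illustrative additions, not part of the slides.

```python
from math import comb, pi, sqrt

# Ways to choose which 7 of the 27 alignment columns hold a gap.
print(comb(27, 7))  # 888030

def central_binomial(n: int) -> int:
    """C(2n, n) = (2n)! / (n!)^2 (hypothetical helper name)."""
    return comb(2 * n, n)

for n in (4, 10, 20, 50):
    exact = central_binomial(n)
    estimate = 4 ** n / sqrt(pi * n)  # Stirling approximation, an added detail
    # The last column confirms the exponential lower bound 2^n.
    print(n, exact, f"{estimate:.3e}", exact >= 2 ** n)
```

Even for n = 50 the count is already far too large to enumerate, which is the motivation for dynamic programming.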
Scoring a sequence alignment
Match/mismatch score: +1/+0
Open/extension penalty: -2/-1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Open: 2 × (-2)
Extension: 5 × (-1)
Score = +9
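The arithmetic on this slide can be reproduced with a short scoring routine. This is only a sketch: the function name `score_alignment` is hypothetical, and it assumes the convention implied by the slide's counts, namely that the first column of each gap run is charged the open penalty and every further column of that run the extension penalty.

```python
def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = 0,
                    gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score two equal-length aligned strings that use '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    prev_gap = None  # which row held the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row costs gap_extend, else gap_open.
            score += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score

print(score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))  # 9
```

The two gap runs have lengths 4 and 3, giving 2 opens and 5 extensions, so the total is 18 - 4 - 5 = 9.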
Needleman & Wunsch
Place each sequence along one axis
Place score 0 at the upper-left corner
Fill in the 1st row and column with multiples of the gap penalty
Fill in the matrix with the max value of 3 possible moves:
  Vertical move: Score + gap penalty
  Horizontal move: Score + gap penalty
  Diagonal move: Score + match/mismatch score
The optimal alignment score is in the lower-right corner
To reconstruct the optimal alignment, trace back where the max at each step came from; stop when the origin is reached.
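The steps above can be sketched in Python as follows. This is a minimal illustration with a linear gap penalty, not a reference implementation: the function name, the default scores (match = 1, mismatch = -1, gap = -2) and the tie-breaking order in the traceback (diagonal, then vertical, then horizontal) are assumptions made for the example.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    n, m = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] (rows) with t[:j] (columns).
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # diagonal move
                          F[i - 1][j] + gap,      # vertical move
                          F[i][j - 1] + gap)      # horizontal move
    # Trace back from the lower-right corner until the origin is reached.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal = False
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            diagonal = F[i][j] == F[i - 1][j - 1] + sub
        if diagonal:
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))

print(needleman_wunsch("AGC", "AAAC"))  # (-1, '-AGC', 'AAAC')
```

Because ties are broken in a fixed order, the traceback returns only one of possibly several optimal alignments.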
Example
Let gap = -2, match = 1, mismatch = -1.

        empty    A    A    A    C
empty       0   -2   -4   -6   -8
A          -2    1   -1   -3   -5
G          -4   -1    0   -2   -4
C          -6   -3   -2   -1   -1

Two of the optimal alignments (score -1):
AAAC
A-GC

AAAC
-AGC
Time Complexity Analysis
Initialize matrix values: O(n), O(m)
Filling in rest of matrix: O(nm)
Traceback: O(n+m)
If strings are same length, total time O(n^2)
Local Alignment
Problem first formulated: Smith and Waterman (1981)
Problem: Find an optimal alignment between a substring of s and a substring of t
Algorithm: a variant of the basic algorithm for global alignment
Motivation
Searching for unknown domains or motifs within proteins from different families
Proteins encoded from Homeobox genes (only conserved in 1 region called the Homeo domain, 60 amino acids long)
Identifying active sites of enzymes
Comparing long stretches of anonymous DNA
Querying databases where the query word is much smaller than the sequences in the database
Analyzing repeated elements within a single sequence
Local Alignment
Let gap = -2, match = 1, mismatch = -1.
GATCACCT
GATACCC

        empty  G  A  T  C  A  C  C  T
empty       0  0  0  0  0  0  0  0  0
G           0  1  0  0  0  0  0  0  0
A           0  0  2  0  0  1  0  0  0
T           0  0  0  3  1  0  0  0  1
A           0  0  1  1  2  2  0  0  0
C           0  0  0  2  1  3  1  0  0
C           0  0  0  1  1  2  4  2  0
C           0  0  0  1  0  2  3  3  0

GATCACCT
GAT-ACCC
Smith & Waterman
Place each sequence along one axis
Place score 0 at the upper-left corner
Fill in the 1st row and column with 0s
Fill in the matrix with the max value of 4 possible values:
  0
  Vertical move: Score + gap penalty
  Horizontal move: Score + gap penalty
  Diagonal move: Score + match/mismatch score
The optimal alignment score is the max in the matrix
To reconstruct the optimal alignment, trace back where the max at each step came from; stop when a zero is hit.
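The local variant differs from the global procedure only in the zero floor, the zero borders and where the traceback starts and stops. A minimal sketch under the same assumptions as before (linear gap penalty, match = 1, mismatch = -1, gap = -2; the function name and tie-breaking order are illustrative choices):

```python
def smith_waterman(s: str, t: str, match: int = 1, mismatch: int = -1,
                   gap: int = -2):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    n, m = len(s), len(t)
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1];
    # the first row and column stay at 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # diagonal move
                          H[i - 1][j] + gap,      # vertical move
                          H[i][j - 1] + gap)      # horizontal move
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the maximum cell and stop when a zero is hit.
    out_s, out_t = [], []
    i, j = best_pos
    while H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))

print(smith_waterman("GATACCC", "GATCACCT"))  # (4, 'GAT-ACC', 'GATCACC')
```

The returned strings cover only the aligned substrings, so the trailing T of GATCACCT and the final C of GATACCC are not part of the local alignment.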
Exercise
Let gap = -2, match = 1, mismatch = -1.
Find the best local alignment:
CGATGAAATGGA
Semi-global Alignment
Example:
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG
We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.
Global Alignment
Example:
AAACCC
A  CCC
Prefer to see:
AAACCC
  ACCC
Do not want to penalize the end spaces

        empty    A    A    A    C    C    C
empty       0   -2   -4   -6   -8  -10  -12
A          -2    1   -1   -3   -5   -7   -9
C          -4   -1    0   -2   -2   -4   -6
C          -6   -3   -2   -1   -1   -1   -3
C          -8   -5   -4   -3    0    0    0
SemiGlobal Alignment
Example:
s = AAACCC
t = ACCC

        empty    A    A    A    C    C    C
empty       0    0    0    0    0    0    0
A          -2    1    1    1   -1   -1   -1
C          -4   -1    0    0    2    0    0
C          -6   -3   -2   -1    1    3    1
C          -8   -5   -4   -3    0    2    4
SemiGlobal Alignment
Example:
s = AAACCCG
t = ACCC

        empty    A    A    A    C    C    C    G
empty       0    0    0    0    0    0    0    0
A          -2    1    1    1   -1   -1   -1   -1
C          -4   -1    0    0    2    0    0   -2
C          -6   -3   -2   -1    1    3    1   -1
C          -8   -5   -4   -3    0    2    4    2
SemiGlobal Alignment
Summary of end space charging procedures:

Place where spaces are not penalized    Action
Beginning of 1st sequence               Initialize 1st row with zeros
End of 1st sequence                     Look for max in last row
Beginning of 2nd sequence               Initialize 1st column with zeros
End of 2nd sequence                     Look for max in last column
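The four actions in this summary can be combined freely in one routine. The sketch below is phrased purely in terms of the matrix (which sequence labels the rows and which the columns), so the hypothetical function name `semiglobal_score` and its four flags are assumptions for illustration; scores default to match = 1, mismatch = -1, gap = -2.

```python
def semiglobal_score(rows: str, cols: str, *, zero_first_row: bool = False,
                     zero_first_col: bool = False, max_last_row: bool = False,
                     max_last_col: bool = False, match: int = 1,
                     mismatch: int = -1, gap: int = -2) -> int:
    """Alignment score with optional free end spaces (score only, no traceback)."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = 0 if zero_first_col else i * gap
    for j in range(1, m + 1):
        F[0][j] = 0 if zero_first_row else j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if rows[i - 1] == cols[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # The lower-right corner is always a candidate; the flags widen the search.
    candidates = [F[n][m]]
    if max_last_row:
        candidates.extend(F[n])
    if max_last_col:
        candidates.extend(row[m] for row in F)
    return max(candidates)

# Plain global alignment of ACCC against AAACCC.
print(semiglobal_score("ACCC", "AAACCC"))                       # 0
# First row initialized with zeros.
print(semiglobal_score("ACCC", "AAACCC", zero_first_row=True))  # 4
# Trailing G: zero first row plus max over the last row.
print(semiglobal_score("ACCC", "AAACCCG", zero_first_row=True,
                       max_last_row=True))                      # 4
```

With all four flags off the routine reduces to ordinary global alignment, which makes it easy to see how much each kind of end space was costing.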
Pairwise Sequence Comparison over Internet

lalign       www.ch.embnet.org/software/LALIGN_form.html          Global/Local
lalign       fasta.bioch.virginia.edu/fasta_www/plalign.htm       Global/Local
USC          www-hto.usc.edu/software/seqaln/seqaln-query.html    Global/Local
alion        fold.stanford.edu/alion                              Global/Local
             genome.cs.mtu.edu/align.html                         Global/Local
align        www.ebi.ac.uk/emboss/align                           Global/Local
xenAliTwo    www.soe.ucsc.edu/~kent/xenoAli/xenAliTwo.html        Local for DNA
blast2seqs   www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html           Local BLAST
blast2seqs   web.umassmed.edu/cgi-bin/BLAST/blast2seqs            Local BLAST
lalnview     www.expasy.ch/tools/sim-prot.html                    Visualization
prss         www.ch.embnet.org/software/PRSS_form.html            Evaluation
prss         Fasta.bioch.virginia.edu/fasta/prss.htm              Evaluation
graph-align  Darwin.nmsu.edu/cgi-bin/graph_align.cgi              Evaluation
Bioinformatics for Dummies
Significance of Sequence Alignment
Consider randomly generated sequences. What distribution do you think the best local alignment score of two sequences of the same length should follow?
1. Uniform distribution
2. Normal distribution
3. Binomial distribution (n Bernoulli trials)
4. Poisson distribution (n → ∞, np = λ)
5. Others
Extreme Value Distribution
Y_ev = exp(-x - e^(-x))
[Plot: Y_ev for x from -5 to 5; vertical axis from 0 to 0.4]
Extreme Value Distribution vs. Normal Distribution
[Plot: extreme value and normal densities for x from -5 to 5; vertical axis from 0 to 0.4]
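The two curves can be compared numerically without the plot. This is a small sketch that evaluates the density Y_ev = exp(-x - e^(-x)) given earlier next to a normal density; comparing against the standard normal (mean 0, variance 1) is an assumption, since the slide does not state which normal curve was drawn, and both function names are illustrative.

```python
import math

def extreme_value_pdf(x: float) -> float:
    """Density Y_ev = exp(-x - e^(-x))."""
    return math.exp(-x - math.exp(-x))

def normal_pdf(x: float) -> float:
    """Standard normal density (assumed comparison curve)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Tabulate both densities over the plotted range.
for x in range(-5, 6):
    print(f"{x:3d}  {extreme_value_pdf(x):.4f}  {normal_pdf(x):.4f}")
```

The extreme value density peaks at x = 0 with height e^(-1), about 0.37, and its right tail decays much more slowly than the normal tail, which is why high alignment scores are less surprising than a normal model would suggest.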
“Twilight Zone”
[Plot: x from -5 to 5; vertical axis from 0 to 0.4]
Some proteins with less than 15% similarity have exactly the same 3-D structure while some proteins with 20% similarity have different structures. Homology/non-homology is never granted in the twilight zone.