definitions optimal alignment - one that exhibits the most correspondences. it is the alignment with...


Post on 18-Dec-2015

TRANSCRIPT

Page 1:

Definitions

Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically meaningful.

Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.

Local alignment - Smith-Waterman (1981) gives the highest scoring local match between two sequences.

Page 2:

Pairwise Global Alignment

Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.

Reasons for making a global alignment:
- Checking minor differences between two sequences
- Analyzing polymorphisms (e.g. SNPs) between closely related sequences
- ...

Page 3:

Pairwise Global Alignment

Computationally:

Given: a pair of sequences (strings of characters)

Output: an alignment that maximizes the similarity

Page 4:

How can we find an optimal alignment?

ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT
(positions 1 to 27)

How many possible alignments?

C(27,7) gap positions = ~888,000 possibilities

Dynamic programming: the Needleman & Wunsch algorithm
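The count on this slide can be checked directly. The short sketch below assumes the count means "choose which 7 of the 27 alignment columns hold a gap in the shorter sequence"; the variable names are illustrative only.

```python
import math

# The gapped (shorter) sequence from the slide, 27 columns wide.
aligned = "CTGAT---TCG-CATCGTC--T-ATCT"

length = len(aligned)        # number of alignment columns
gaps = aligned.count("-")    # number of gap characters to place

# Ways to choose which columns hold the gaps: C(27, 7).
placements = math.comb(length, gaps)
print(length, gaps, placements)  # prints: 27 7 888030
```

Enumerating and scoring every placement is already impractical at this size, which is the motivation for dynamic programming.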

Page 5:

Time Complexity

Consider two sequences:
AAGT
AGTC

How many possible alignments do the 2 sequences have?

$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} = \Theta\left(\frac{2^{2n}}{\sqrt{n}}\right) = \Omega(2^n)$
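To get a feel for this growth, the sketch below compares the exact central binomial coefficient with the Stirling-based estimate 4^n / sqrt(pi n). The function name `central_binomial` is an illustrative choice, and the estimate is the standard asymptotic form rather than something stated on the slide.

```python
import math

def central_binomial(n):
    """Exact value of C(2n, n), the slide's count for two length-n sequences."""
    return math.comb(2 * n, n)

def stirling_estimate(n):
    """Asymptotic estimate 4^n / sqrt(pi * n), from Stirling's formula."""
    return 4 ** n / math.sqrt(math.pi * n)

for n in (4, 10, 20, 40):
    exact = central_binomial(n)
    ratio = exact / stirling_estimate(n)
    # The ratio approaches 1 from below as n grows.
    print(f"n={n:3d}  C(2n,n)={exact}  exact/estimate={ratio:.4f}")
```

Even for n = 40 the count has more than twenty digits, so brute-force enumeration is not an option for realistic sequence lengths.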

Page 6:

Scoring a sequence alignment

Match/mismatch score: +1/+0
Open/extension penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Open: 2 × (-2)
Extension: 5 × (-1)

Score = +9
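The arithmetic on this slide can be reproduced with a small scoring routine. This is a minimal sketch, assuming the first gap character of each run is charged the open penalty and every following gap character the extension penalty; `score_alignment` is a hypothetical helper name, and it treats any consecutive gap columns as one run regardless of which sequence holds the gap.

```python
def score_alignment(a, b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two equal-length gapped strings with open/extension gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # First column of a run pays the open penalty, the rest extend it.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # prints: 9
```

The result is 18 matches, 2 mismatches, 2 gap opens and 5 gap extensions: 18 + 0 - 4 - 5 = 9.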

Page 7:

Pairwise Global Alignment

Computationally:

Given: a pair of sequences (strings of characters)

Output: an alignment that maximizes the similarity

Page 8:

Needleman & Wunsch

- Place each sequence along one axis
- Place score 0 at the upper-left corner
- Fill in the 1st row & column with gap penalty multiples
- Fill in the matrix with the max value of 3 possible moves:
  - Vertical move: Score + gap penalty
  - Horizontal move: Score + gap penalty
  - Diagonal move: Score + match/mismatch score
- The optimal alignment score is in the lower-right corner
- To reconstruct the optimal alignment, trace back where the max at each step came from; stop when the origin is reached.
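The steps above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the slides: the name `needleman_wunsch`, the default scores (match = 1, mismatch = -1, gap = -2) and the tie-breaking order in the traceback (diagonal, then vertical, then horizontal) are all assumptions.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, s_aln, t_aln)."""
    n, m = len(s), len(t)
    # F[i][j] = best score for aligning s[:i] with t[:j]; F[0][0] = 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # first column: multiples of the gap penalty
    for j in range(1, m + 1):
        F[0][j] = j * gap          # first row: multiples of the gap penalty

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),  # diagonal move
                          F[i - 1][j] + gap,            # vertical move
                          F[i][j - 1] + gap)            # horizontal move

    # Trace back from the lower-right corner until the origin is reached.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))

score, s_aln, t_aln = needleman_wunsch("AAAC", "AGC")
print(score)
print(s_aln)
print(t_aln)
```

Several alignments can share the optimal score; which one the traceback returns depends only on the tie-breaking order chosen here.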

Page 9:

Example

Let gap = -2, match = 1, mismatch = -1.

          empty    A     A     A     C
  empty     0     -2    -4    -6    -8
  A        -2      1    -1    -3    -5
  G        -4     -1     0    -2    -4
  C        -6     -3    -2    -1    -1

Optimal alignments (score -1):

AAAC
A-GC

AAAC
-AGC

Page 10:

Time Complexity Analysis

- Initialize matrix values: O(n), O(m)
- Filling in rest of matrix: O(nm)
- Traceback: O(n+m)
- If strings are the same length, total time is O(n^2)

Page 11:

Local Alignment

Problem first formulated: Smith and Waterman (1981)

Problem: Find an optimal alignment between a substring of s and a substring of t

Algorithm: a variant of the basic algorithm for global alignment

Page 12:

Motivation

- Searching for unknown domains or motifs within proteins from different families
  - Proteins encoded from Homeobox genes (only conserved in 1 region called the Homeo domain, 60 amino acids long)
  - Identifying active sites of enzymes
- Comparing long stretches of anonymous DNA
- Querying databases where the query word is much smaller than the sequences in the database
- Analyzing repeated elements within a single sequence

Page 13:

Local Alignment

Let gap = -2, match = 1, mismatch = -1.

Sequences: GATCACCT and GATACCC

          empty   G   A   T   C   A   C   C   T
  empty     0     0   0   0   0   0   0   0   0
  G         0     1   0   0   0   0   0   0   0
  A         0     0   2   0   0   1   0   0   0
  T         0     0   0   3   1   0   0   0   1
  A         0     0   1   1   2   2   0   0   0
  C         0     0   0   0   2   1   3   1   0
  C         0     0   0   0   1   1   2   4   2
  C         0     0   0   0   1   0   2   3   3

GATCACCT
GAT-ACCC

Page 14:

Smith & Waterman

- Place each sequence along one axis
- Place score 0 at the upper-left corner
- Fill in the 1st row & column with 0s
- Fill in the matrix with the max value of 4 possible values:
  - 0
  - Vertical move: Score + gap penalty
  - Horizontal move: Score + gap penalty
  - Diagonal move: Score + match/mismatch score
- The optimal alignment score is the max in the matrix
- To reconstruct the optimal alignment, trace back where the max at each step came from; stop when a zero is hit.
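A minimal Python sketch of these steps follows. As with any such sketch, the function name `smith_waterman`, the default scores (match = 1, mismatch = -1, gap = -2) and the tie-breaking order in the traceback are assumptions made for illustration.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty; returns (score, s_aln, t_aln)."""
    n, m = len(s), len(t)
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    # The first row and column stay at 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,                            # start a new alignment
                          H[i - 1][j - 1] + sub(i, j),  # diagonal move
                          H[i - 1][j] + gap,            # vertical move
                          H[i][j - 1] + gap)            # horizontal move
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum cell and stop when a zero is hit.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

score, s_aln, t_aln = smith_waterman("GATCACCT", "GATACCC")
print(score)   # prints: 4
print(s_aln)   # prints: GATCACC
print(t_aln)   # prints: GAT-ACC
```

Compared with the global version, only three things change: the zero floor in each cell, the zero-filled first row and column, and the traceback that starts at the maximum cell instead of the lower-right corner.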

Page 15:

Exercise

Let gap = -2, match = 1, mismatch = -1.

Find the best local alignment:

CGATGAAATGGA

Page 16:

Semi-global Alignment

Example:

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG

We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.

Page 17:

Global Alignment

Example:

AAACCC
A--CCC

Prefer to see:

AAACCC
  ACCC

Do not want to penalize the end spaces.

          empty    A     A     A     C     C     C
  empty     0     -2    -4    -6    -8   -10   -12
  A        -2      1    -1    -3    -5    -7    -9
  C        -4     -1     0    -2    -2    -4    -6
  C        -6     -3    -2    -1    -1    -1    -3
  C        -8     -5    -4    -3     0     0     0

Page 18:

SemiGlobal Alignment

Example:

s = AAACCC
t = ACCC

          empty    A     A     A     C     C     C
  empty     0      0     0     0     0     0     0
  A        -2      1     1     1    -1    -1    -1
  C        -4     -1     0     0     2     0     0
  C        -6     -3    -2    -1     1     3     1
  C        -8     -5    -4    -3     0     2     4

Page 19:

SemiGlobal Alignment

Example:

s = AAACCCG
t = ACCC

          empty    A     A     A     C     C     C     G
  empty     0      0     0     0     0     0     0     0
  A        -2      1     1     1    -1    -1    -1    -1
  C        -4     -1     0     0     2     0     0    -2
  C        -6     -3    -2    -1     1     3     1    -1
  C        -8     -5    -4    -3     0     2     4     2

Page 20:

SemiGlobal Alignment

Summary of end space charging procedures:

  Place where spaces are not penalized    Action
  Beginning of 1st sequence               Initialize 1st row with zeros
  End of 1st sequence                     Look for max in last row
  Beginning of 2nd sequence               Initialize 1st column with zeros
  End of 2nd sequence                     Look for max in last column
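Two of the four procedures in this summary (zero-filled first row, max over the last row) are enough to reproduce the AAACCC / ACCC examples. The sketch below is a score-only illustration under assumed names: `semiglobal_score`, `row_seq`, `col_seq` and the two flags are hypothetical, and the default scores (match = 1, mismatch = -1, gap = -2) are assumptions as well.

```python
def semiglobal_score(row_seq, col_seq, free_leading=True, free_trailing=True,
                     match=1, mismatch=-1, gap=-2):
    """Score an alignment where unaligned ends of col_seq may be free of charge.

    row_seq runs down the matrix rows, col_seq across the columns.
    free_leading:  initialize the first row with zeros.
    free_trailing: take the max over the last row instead of the corner.
    """
    n, m = len(row_seq), len(col_seq)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                         # first column is always charged here
    for j in range(1, m + 1):
        F[0][j] = 0 if free_leading else j * gap  # first row: zeros or gap multiples
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return max(F[n]) if free_trailing else F[n][m]

print(semiglobal_score("ACCC", "AAACCC"))                # prints: 4
print(semiglobal_score("ACCC", "AAACCCG"))               # prints: 4
print(semiglobal_score("ACCC", "AAACCC", False, False))  # prints: 0
```

With both flags off the function reduces to the global score, which makes the effect of each end-space rule easy to compare. The first-column and last-column rules would follow the same pattern with rows and columns swapped.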

Page 21:

Pairwise Sequence Comparison over Internet

  Name          URL                                              Type
  lalign        www.ch.embnet.org/software/LALIGN_form.html      Global/Local
  lalign        fasta.bioch.virginia.edu/fasta_www/plalign.htm   Global/Local
  USC           www-hto.usc.edu/software/seqaln/seqaln-query.html Global/Local
  alion         fold.stanford.edu/alion                          Global/Local
                genome.cs.mtu.edu/align.html                     Global/Local
  align         www.ebi.ac.uk/emboss/align                       Global/Local
  xenAliTwo     www.soe.ucsc.edu/~kent/xenoAli/xenAliTwo.html    Local for DNA
  blast2seqs    www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html       Local BLAST
  blast2seqs    web.umassmed.edu/cgi-bin/BLAST/blast2seqs        Local BLAST
  lalnview      www.expasy.ch/tools/sim-prot.html                Visualization
  prss          www.ch.embnet.org/software/PRSS_form.html        Evaluation
  prss          Fasta.bioch.virginia.edu/fasta/prss.htm          Evaluation
  graph-align   Darwin.nmsu.edu/cgi-bin/graph_align.cgi          Evaluation

Source: Bioinformatics for Dummies

Page 22:

Significance of Sequence Alignment

Consider randomly generated sequences. What distribution do you think the best local alignment score of two sequences of the same length should follow?

1. Uniform distribution
2. Normal distribution
3. Binomial distribution (n Bernoulli trials)
4. Poisson distribution (n → ∞, np = λ)
5. Others

Page 23:

Extreme Value Distribution

$Y_{ev} = \exp(-x - e^{-x})$

[Figure: plot of Y_ev for x from -5 to 5; vertical axis from 0 to 0.4]
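The density on this slide is easy to evaluate numerically. The sketch below uses only the formula Y_ev = exp(-x - e^(-x)); the helper names `gumbel_pdf` and `trapezoid_area` and the integration range are illustrative choices.

```python
import math

def gumbel_pdf(x):
    """Standard extreme value density: exp(-x - exp(-x))."""
    return math.exp(-x - math.exp(-x))

def trapezoid_area(f, lo, hi, steps=20000):
    """Trapezoidal estimate of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

# The peak sits at x = 0 with height 1/e, consistent with a 0-to-0.4 axis.
print(round(gumbel_pdf(0.0), 4))
# The curve is skewed: the right tail is much heavier than the left tail.
print(gumbel_pdf(3.0) > gumbel_pdf(-3.0))
# The density integrates to (approximately) 1 over a wide range.
print(round(trapezoid_area(gumbel_pdf, -10.0, 30.0), 6))
```

The heavier right tail is what matters for significance: high alignment scores between random sequences are more likely under this distribution than a symmetric bell curve with the same peak would suggest.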

Page 24:

Extreme Value Distribution vs. Normal Distribution

[Figure: two density plots, each for x from -5 to 5 with a vertical axis from 0 to 0.4]

Page 25:

"Twilight Zone"

[Figure: density plot for x from -5 to 5 with a vertical axis from 0 to 0.4]

Some proteins with less than 15% similarity have exactly the same 3-D structure, while some proteins with 20% similarity have different structures. Homology or non-homology is never guaranteed in the twilight zone.