pairwise sequence alignment part 2
Post on 11-Jan-2016
Pairwise Sequence Alignment Part 2
Outline
• Global alignments (continuation)
• Local versus Global
• BLAST algorithms
• Evaluating significance of alignments
Global Alignment (cont.)
Needleman-Wunsch Alignment
• Global alignment between sequences
– Compare entire sequence against another
• Create scoring table
– Sequence A across top, B down left
• Cell at column i and row j contains the score of best alignment between the first i elements of A and the first j elements of B
– Global alignment score is bottom right cell
Initializing the table: the top-left cell scores 0, and each further cell in row 0 or column 0 adds one more gap (-1).

          A   C   G   C   T   G
      0   1   2   3   4   5   6
  0   0  -1  -2  -3  -4  -5  -6
C 1  -1
A 2  -2
T 3  -3
G 4  -4
T 5  -5

Column 1, row 0 (score -1):
A
-

Column 6, row 0 (score -6):
ACGCTG
------

Column 0, row 5 (score -5):
-----
CATGT
Filling the table: row 1 complete, row 2 in progress.

          A   C   G   C   T   G
      0   1   2   3   4   5   6
  0   0  -1  -2  -3  -4  -5  -6
C 1  -1  -1   1   0  -1  -2  -3
A 2  -2   1   0   0
T 3  -3
G 4  -4
T 5  -5

Column 1, row 1 (score -1):
A
C

Column 2, row 1 (score 1):
AC
-C

Column 3, row 1 (score 0):
ACG
-C-

Column 4, row 1 (score -1), two equally good alignments:
ACGC    ACGC
-C--    ---C

Column 3, row 2 (score 0):
ACG
-CA
Completed table: the global alignment score is the bottom right cell (2).

          A   C   G   C   T   G
      0   1   2   3   4   5   6
  0   0  -1  -2  -3  -4  -5  -6
C 1  -1  -1   1   0  -1  -2  -3
A 2  -2   1   0   0  -1  -2  -3
T 3  -3   0   0  -1  -1   1   0
G 4  -4  -1  -1   2   1   0   3
T 5  -5  -2  -2   1   1   3   2
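The table-filling procedure can be sketched in a few lines of Python. The slides never state the scoring scheme; the values used here (match +2, mismatch -1, gap -1) are an assumption inferred from the numbers in the table, and the function name is illustrative only.

```python
def needleman_wunsch_table(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the global-alignment table; a runs across the top, b down the left.

    Scoring defaults are assumptions inferred from the slide's table.
    """
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]
    # Row 0 and column 0: a prefix aligned entirely against gaps.
    for i in range(1, cols):
        table[0][i] = table[0][i - 1] + gap
    for j in range(1, rows):
        table[j][0] = table[j - 1][0] + gap
    # Every other cell takes the best of three predecessors.
    for j in range(1, rows):
        for i in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            diag = table[j - 1][i - 1] + sub   # align a[i-1] with b[j-1]
            up = table[j - 1][i] + gap         # gap in a
            left = table[j][i - 1] + gap       # gap in b
            table[j][i] = max(diag, up, left)
    return table


table = needleman_wunsch_table("ACGCTG", "CATGT")
for row in table:
    print(" ".join(f"{cell:3d}" for cell in row))
print("global score:", table[-1][-1])
```

With these assumed scores the printed rows should match the completed table, and the bottom right cell gives the global alignment score.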
Traceback: cells on the optimal paths from the bottom right cell back to the top left cell.

          A   C   G   C   T   G
      0   1   2   3   4   5   6
  0   0  -1
C 1  -1       1   0
A 2       1       0  -1
T 3           0           1
G 4               2   1       3
T 5                       3   2

Optimal global alignments (score 2):

ACGCTG-    ACGCTG-    -ACGCTG
-C-ATGT    -CA-TGT    CATG-T-
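A minimal sketch of the traceback step, assuming the same inferred scoring (match +2, mismatch -1, gap -1; not stated on the slides). Starting at the bottom right cell, it follows every predecessor that could have produced the current score, so it returns all co-optimal alignments. The function names are hypothetical, and the number of co-optimal alignments can grow very quickly for long sequences, so this is meant for small examples only.

```python
def fill_table(a, b, match=2, mismatch=-1, gap=-1):
    """Global-alignment table with a across the top and b down the left."""
    table = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        table[0][i] = i * gap
    for j in range(1, len(b) + 1):
        table[j][0] = j * gap
        for i in range(1, len(a) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            table[j][i] = max(table[j - 1][i - 1] + sub,
                              table[j - 1][i] + gap,
                              table[j][i - 1] + gap)
    return table


def all_optimal_alignments(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, [(aligned_a, aligned_b), ...]) for every optimal path."""
    table = fill_table(a, b, match, mismatch, gap)
    results = []

    def walk(j, i, top, bottom):
        if i == 0 and j == 0:
            results.append((top, bottom))
            return
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if table[j][i] == table[j - 1][i - 1] + sub:   # diagonal move
                walk(j - 1, i - 1, a[i - 1] + top, b[j - 1] + bottom)
        if j > 0 and table[j][i] == table[j - 1][i] + gap:  # gap in a
            walk(j - 1, i, "-" + top, b[j - 1] + bottom)
        if i > 0 and table[j][i] == table[j][i - 1] + gap:  # gap in b
            walk(j, i - 1, a[i - 1] + top, "-" + bottom)

    walk(len(b), len(a), "", "")
    return table[-1][-1], results


score, alignments = all_optimal_alignments("ACGCTG", "CATGT")
print("score:", score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```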
Global Alignment versus Local Alignment
Global Alignment
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT

Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
Global vs. Local alignment
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN

Local alignment:
DOROTHY    HODGKIN
DOROTHY    HODGKIN
Local Alignment
• Best score for aligning part of sequences
– Often beats global alignment score
• Similar algorithm: Smith-Waterman
– Table cells never score below zero
          T   A   C   T   A   A
      0   1   2   3   4   5   6
  0   0   0   0   0   0   0   0
T 1   0   1   0   0   1   0   0
A 2   0   0   2   0   0   2   1
A 3   0   0   1   1   0   1   3
T 4   0   1   0   0   2   0   1
A 5   0   0   2   0   0   3   1

Best local alignments (score 3):

TACTA    TAA
TAATA    TAA
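A minimal Smith-Waterman sketch for the example above. The scoring values (match +1, mismatch -1, gap -2) are an assumption inferred from the table, not something the slides state. The only changes from the global algorithm are that every cell is floored at zero and the traceback starts at the highest-scoring cell and stops at the first zero. When several cells tie for the best score, this sketch reports only the first one it meets.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    a runs across the top of the table, b down the left.
    Scoring defaults are assumptions inferred from the slide's table.
    """
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for j in range(1, rows):
        for i in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            table[j][i] = max(0,                        # never below zero
                              table[j - 1][i - 1] + sub,
                              table[j - 1][i] + gap,
                              table[j][i - 1] + gap)
            if table[j][i] > best:
                best, best_cell = table[j][i], (j, i)

    # Trace back from the best cell until a zero cell is reached.
    j, i = best_cell
    top, bottom = "", ""
    while table[j][i] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if table[j][i] == table[j - 1][i - 1] + sub:
            top, bottom = a[i - 1] + top, b[j - 1] + bottom
            j, i = j - 1, i - 1
        elif table[j][i] == table[j - 1][i] + gap:
            top, bottom = "-" + top, b[j - 1] + bottom
            j -= 1
        else:
            top, bottom = a[i - 1] + top, "-" + bottom
            i -= 1
    return best, top, bottom


score, top, bottom = smith_waterman("TACTAA", "TAATA")
print(score)
print(top)
print(bottom)
```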
Problems with DP for sequence alignments
• The computational complexity is very high
• Given a score, how do we evaluate the significance of the alignment?
Complexity
• Complexity is determined by size of table
– Aligning a sequence of length m against one of length n requires calculating (m × n) cells
• Time of calculation
– Let's say we calculate 10^8 cells per second on a one-processor PC
– Aligning two mRNA sequences of 8,000 bp requires 64,000,000 cells → 0.64 seconds
– Aligning an mRNA and a 10^7 bp chromosome requires ~10^11 cells → 1,000 secs ≈ 15 minutes
Complexity for large databases
• Let's say a database contains 3 × 10^10 base pairs
– Searching an mRNA against the database will require ~2.5 × 10^14 cells → 2.5 × 10^6 secs ≈ 1 month!
• We need an efficient algorithm to cut down on alignment time
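The estimates above are simple arithmetic on the table size. The sketch below reproduces them, assuming the slide's rate of 10^8 cells per second and an 8,000 bp mRNA as the query in every case; the function name is illustrative. The exact products come out slightly below the rounded figures quoted on the slides (for example 8 × 10^10 cells rather than ~10^11).

```python
CELLS_PER_SECOND = 1e8  # rate assumed on the slide for a one-processor PC


def dp_cost(m, n, cells_per_second=CELLS_PER_SECOND):
    """Return (cells, seconds) for a full m x n dynamic-programming table."""
    cells = m * n
    return cells, cells / cells_per_second


cases = [
    ("mRNA vs mRNA", 8_000, 8_000),
    ("mRNA vs chromosome", 8_000, 10**7),
    ("mRNA vs database", 8_000, 3 * 10**10),
]
for label, m, n in cases:
    cells, seconds = dp_cost(m, n)
    days = seconds / 86_400
    print(f"{label}: {cells:.2e} cells, {seconds:,.2f} s ({days:.2f} days)")
```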
BLAST
• Basic Local Alignment Search Tool
• A set of tools developed at NCBI (BLASTN, BLASTP, ...)
• BLAST benefits
– Search speed
– Ease of use
– Statistical rigor
BLAST
• A good alignment contains subsequences of absolute identity:
– First, identify very short (almost) exact matches.
– Next, the best short hits from the first step are extended to longer regions of similarity.
– Finally, the best hits are optimized using the Smith-Waterman algorithm.
BLAST Algorithm
(1) Split the query sequence into words of length W (W default = 11)
(2) Compare the word list to the database and identify exact matches
(3) For each word match, extend alignment in both directions
(4) Score the alignments using dynamic programming
(5) Evaluate the statistical significance
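Steps (1) to (3) can be illustrated with a toy seed-and-extend search. This is a heavily simplified sketch, not NCBI's implementation: it handles a single subject sequence, extends without gaps, and stops extending once the running score falls a fixed amount below the best seen. The scoring values, the `drop` threshold, and all function names are assumptions made for illustration. Steps (4) and (5) are not covered.

```python
from collections import defaultdict


def build_word_index(subject, w):
    """Map every length-w word in the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - w + 1):
        index[subject[pos:pos + w]].append(pos)
    return index


def extend(query, subject, q, s, w, match=1, mismatch=-1, drop=3):
    """Ungapped extension of the seed query[q:q+w] == subject[s:s+w].

    Returns (score, query_start, query_end, subject_start) of the best
    segment found; extension stops when the score drops `drop` below best.
    """
    score = best = w * match
    right = 0
    k = 0
    while q + w + k < len(query) and s + w + k < len(subject):
        score += match if query[q + w + k] == subject[s + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= drop:
            break
    score = best  # restart from the best right-hand extension
    left = 0
    k = 0
    while q - k - 1 >= 0 and s - k - 1 >= 0:
        score += match if query[q - k - 1] == subject[s - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= drop:
            break
    return best, q - left, q + w + right, s - left


def seed_and_extend(query, subject, w=11):
    """Find exact word matches, extend each, return distinct hits by score."""
    index = build_word_index(subject, w)
    hits = set()
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            hits.add(extend(query, subject, q, s, w))
    return sorted(hits, reverse=True)


# Small word size so the toy sequences produce seeds.
print(seed_and_extend("ACGTTGCA", "GGGACGTTGCAGGG", w=4))
```

Indexing the subject once and looking up query words in it is what avoids filling a full table for every database sequence; only regions around a word match are examined further.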
Database Searches
• Using pairwise comparison, each database search normally yields two groups of scores: genuinely related and unrelated sequences, with some overlap between them.
• A good search method should completely separate the two score groups.
[Figure: score distributions of the two groups, labeled Random and Related]
E-value
• The number of hits (with the same similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size.
• Higher E-value → lower similarity
– “sequences with E-value of less than 0.01 are almost always found to be homologous”
• The lower bound is normally 0 (we want to find the best)
Expectation Values
• Increases linearly with length of query sequence
• Increases linearly with length of database
• Decreases exponentially with score of alignment
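These three dependencies are the ones in the Karlin-Altschul form E = K · m · n · e^(−λS), where m is the query length, n the database length, and S the alignment score. The sketch below evaluates that expression. The values of K and λ depend on the scoring system and sequence composition; the defaults used here are arbitrary placeholders chosen only to show the scaling, and the function name is hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S).

    k and lam are placeholder values, not parameters of any real
    scoring matrix.
    """
    return k * query_len * db_len * math.exp(-lam * score)


base = expected_hits(score=50, query_len=1_000, db_len=10**9)
print("baseline:        ", base)
print("query twice as long:", expected_hits(50, 2_000, 10**9))
print("database twice as big:", expected_hits(50, 1_000, 2 * 10**9))
print("score 10 higher:  ", expected_hits(60, 1_000, 10**9))
```

Doubling either length doubles E, while each added point of score multiplies E by e^(−λ).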