PowerPoint Presentation - The Dot Matrix Method

Special Topics BSC5936: An Introduction to Bioinformatics. Florida State University, The Department of Biological Science, www.bio.fsu.edu, Sept. 9, 2003


Page 1: PowerPoint Presentation - the Dot Matrix method

Special Topics BSC5936:

An Introduction to Bioinformatics.

Florida State University

The Department of Biological Science

www.bio.fsu.edu

Sept. 9, 2003

Page 2: PowerPoint Presentation - the Dot Matrix method

The Dot Matrix Method

Steven M. Thompson

Florida State University School of Computational Science and Information Technology (CSIT)

Page 3: PowerPoint Presentation - the Dot Matrix method

The Dot Matrix Method.

A general way to see similarities in pair-wise comparisons:

Gets you started thinking about sequence alignment in general.

Provides a ‘Gestalt’ of all possible alignments between two sequences.

To begin, I will use a very simple 0, 1 (match, no-match) identity scoring function without any windowing. As you will see later today, more complex scoring functions will normally be used in sequence analysis (especially with amino acid sequences). This example is based on an illustration in Sequence Analysis Primer (Gribskov and Devereux, editors, 1991).

The sequences to be compared are written out along the x and y axes of a matrix.

Put a dot wherever symbols match; identities are highlighted.

Page 4: PowerPoint Presentation - the Dot Matrix method

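(• = match, . = no match)

  S E Q U E N C E A N A L Y S I S P R I M E R
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
A . . . . . . . . • . • . . . . . . . . . . .
N . . . . . • . . . • . . . . . . . . . . . .
A . . . . . . . . • . • . . . . . . . . . . .
L . . . . . . . . . . . • . . . . . . . . . .
Y . . . . . . . . . . . . • . . . . . . . . .
S • . . . . . . . . . . . . • . • . . . . . .
I . . . . . . . . . . . . . . • . . . • . . .
S • . . . . . . . . . . . . • . • . . . . . .
P . . . . . . . . . . . . . . . . • . . . . .
R . . . . . . . . . . . . . . . . . • . . . •
I . . . . . . . . . . . . . . • . . . • . . .
M . . . . . . . . . . . . . . . . . . . • . .
E . • . . • . . • . . . . . . . . . . . . • .
R . . . . . . . . . . . . . . . . . • . . . •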

Since this is a comparison between two of the same sequences, an intra-sequence comparison, the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”
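The identity plot described here is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the code behind the slide: the function names `dot_matrix` and `render` are invented for this example, and printing `.` for an empty cell is my own convention.

```python
def dot_matrix(horizontal, vertical):
    """Return one row per vertical symbol; True marks an identity (a dot)."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal, vertical, dot="•", blank="."):
    """Draw the matrix as text, with the sequences written along both axes."""
    lines = ["  " + " ".join(horizontal)]
    for symbol, row in zip(vertical, dot_matrix(horizontal, vertical)):
        cells = " ".join(dot if cell else blank for cell in row)
        lines.append(symbol + " " + cells)
    return "\n".join(lines)


seq = "SEQUENCEANALYSISPRIMER"
matrix = dot_matrix(seq, seq)

# A self comparison always has the main identity diagonal.
has_main_diagonal = all(matrix[i][i] for i in range(len(seq)))

# The palindrome "ANA" (0-based positions 8..10) forms a cross:
# besides the main diagonal, the anti-diagonal cells are also dots.
ana_cross = matrix[8][10] and matrix[10][8]

print(render(seq, seq))
```

Every cell is compared independently, so the work grows with the product of the two sequence lengths, which is fine for plots of this size.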

Page 5: PowerPoint Presentation - the Dot Matrix method

Since your own mind and eyes are still better than computers at discerning complex visual patterns, especially when more than one pattern is being considered, you can see all these ‘less than best’ comparisons as well as the main one, and then you can ‘zoom-in’ on those regions of interest using more detailed procedures.

If the previous plot were a double-stranded DNA or RNA sequence self comparison, the inverted repeat regions would be indicative of potential cruciform structures at that point. Direct internal repeats will appear as parallel diagonals off of the main diagonal.

The biggest asset of dot matrix analysis is that it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.

Page 6: PowerPoint Presentation - the Dot Matrix method

Check out the ‘mutated’ inter-sequence comparison below:

Here you can easily see the effect of a sequence ‘insertion’ or ‘deletion.’ It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an ‘indel.’ A jump or shift in the register of the main diagonal on a dotplot clearly points out the existence of an indel (again, zero:one match score function).
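One way to make the ‘shift in register’ concrete is to count the dots on each diagonal. The sketch below is an illustration under assumptions: the shortened sequence is a hypothetical example of mine (the example sequence with “ANAL” removed), not the one plotted on the slide, and `diagonal_profile` is an invented helper name.

```python
from collections import Counter


def diagonal_profile(horizontal, vertical):
    """Count identity matches per diagonal, keyed by offset (column - row)."""
    profile = Counter()
    for row, v in enumerate(vertical):
        for col, h in enumerate(horizontal):
            if h == v:
                profile[col - row] += 1
    return profile


original = "SEQUENCEANALYSISPRIMER"
shortened = "SEQUENCEYSISPRIMER"  # hypothetical mutant: "ANAL" removed

profile = diagonal_profile(original, shortened)

# The two busiest diagonals: offset 0 before the indel, offset 4 after it.
top_two = sorted(offset for offset, _ in profile.most_common(2))
print(top_two)  # [0, 4]
```

The jump from offset 0 to offset 4 equals the length of the missing segment. Nothing in the profile says whether one sequence lost four symbols or the other gained them, which is exactly why the event is called an indel.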

Page 7: PowerPoint Presentation - the Dot Matrix method

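Another phenomenon that is very easy to visualize with dot matrix analysis is duplication, or direct repeats. These are shown in the following example:

(• = match, . = no match)

  S E Q U E N C E A N A L Y S I S P R I M E R
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .

The ‘duplication’ here is seen as a distinct column of diagonals; whenever you see either a row or column of diagonals in a dotplot, you are looking at direct repeats.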

Page 8: PowerPoint Presentation - the Dot Matrix method

Now consider the more complicated ‘mutation’ in the following comparison:

[Dot plot: SEQUENCEANALYSISPRIMER (horizontal axis) compared with ANALYZESEQUENCES (vertical axis), identity scoring.]

Again, notice the diagonals. However, they have now been displaced off of the center diagonal of the plot and, in fact, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. Inverted repeats still show up as lines perpendicular to the diagonals; they are just no longer on the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.
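The displaced diagonals can also be listed programmatically. The sketch below uses the two sequences from this slide; the helper name `identity_runs` and the minimum run length of three are my own choices for illustration, not part of any package mentioned in this lecture.

```python
def identity_runs(horizontal, vertical, min_len=3):
    """List maximal diagonal runs of identity as (h_start, v_start, length)."""
    runs = []
    for i in range(len(horizontal)):
        for j in range(len(vertical)):
            if horizontal[i] != vertical[j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and horizontal[i - 1] == vertical[j - 1]:
                continue
            length = 0
            while (i + length < len(horizontal)
                   and j + length < len(vertical)
                   and horizontal[i + length] == vertical[j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


h = "SEQUENCEANALYSISPRIMER"
v = "ANALYZESEQUENCES"
runs = identity_runs(h, v)

for h_start, v_start, length in runs:
    print(h[h_start:h_start + length], "offset", h_start - v_start)
```

With 0-based positions this reports “SEQUENCE” at horizontal 0 and vertical 7 (offset -7) and “ANALY” at horizontal 8 and vertical 0 (offset +8). Two runs on opposite sides of the center diagonal are the signature of the transposition, and no run covers “PRIMER.”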

Page 9: PowerPoint Presentation - the Dot Matrix method

Filtered Windowing:

Reconsider the same plot. Notice the extraneous dots that neither indicate runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, the composition of the sequences themselves.

How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met.

What is meant by this is that if, within some defined window size, some defined criterion is met, then and only then will a dot be placed at the middle of that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.
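The window/stringency filter can be sketched as follows. This is a minimal illustration assuming identity scoring and a window that slides along the diagonal; `windowed_dots` is an invented name, and the choice of `window // 2` as the ‘middle’ of an even-sized window is my own assumption. It is not the Compare/DotPlot implementation.

```python
def windowed_dots(horizontal, vertical, window=3, stringency=2):
    """Return the set of (row, col) cells that pass the window/stringency filter.

    For each window position, count identities along the diagonal; a dot is
    placed at the middle of the window only if the count reaches `stringency`.
    """
    dots = set()
    half = window // 2
    for row in range(len(vertical) - window + 1):
        for col in range(len(horizontal) - window + 1):
            score = sum(
                horizontal[col + k] == vertical[row + k] for k in range(window)
            )
            if score >= stringency:
                dots.add((row + half, col + half))
    return dots


h = "SEQUENCEANALYSISPRIMER"
v = "ANALYZESEQUENCES"

unfiltered = windowed_dots(h, v, window=1, stringency=1)  # plain identity plot
filtered = windowed_dots(h, v, window=3, stringency=2)

print(len(unfiltered), "dots before filtering,", len(filtered), "after")
```

A window of one with a stringency of one reproduces the unfiltered identity plot, so the same function covers both cases. Lone matches such as the stray E/E dot at row 6, column 20 (0-based) disappear under the three/two setting, because their neighbors along the diagonal do not match.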

Page 10: PowerPoint Presentation - the Dot Matrix method

In this plot a window of size three and a stringency of two is used to considerably improve the signal-to-noise ratio (remember, I am using a 1:0 identity scoring function).

The only remaining dots indicate the two runs of identity between the two sequences; however, any indication of the palindrome “ANA” has been lost. This is because our filtering approach was too stringent to catch such a short element. In general you need to make your window about the same size as the element you are attempting to locate. In the case of our palindrome, “AN” and “NA” are the inverted repeat sequences, and since our window was set to three, we will not be able to see an element only two letters long. Had we set our stringency filter to one in a window of two, then these would be visible. The Wisconsin Package’s implementation of dot matrix analysis, the paired programs Compare and DotPlot, uses the window/stringency method by default.

Page 11: PowerPoint Presentation - the Dot Matrix method

Filtered dot plot techniques:

You need to be careful with window/stringency dot matrix methods. Default window sizes and stringencies may not be appropriate for the analysis at hand.

The Wisconsin Package default window size and stringency for protein sequences are 30 and 10 respectively (based on BLOSUM scores [soon to be explained in Dr. Quine’s lecture]).

Sometimes this is perfectly reasonable.

Take for instance the next real-life example: the human calmodulin protein sequence compared to itself.

Page 12: PowerPoint Presentation - the Dot Matrix method

Human calmodulin x itself:

What’s your interpretation?

Do you know what the EF-hand is?

Page 13: PowerPoint Presentation - the Dot Matrix method

The calmodulin structure:

The four EF-Hand Helix-Loop-Helix conformations (at positions 20, 56, 93, and 129) bind Ca++ ions to affect several biological systems, including mediating control of a large number of Ca++ dependent enzymes, in particular several protein kinases and phosphatases, many of which affect systems ranging from muscle action and cAMP to insulin release.

Page 14: PowerPoint Presentation - the Dot Matrix method

Calmodulin x alpha actinin:

default parameters: some confusion

window=24/stringency=24: clearer picture

Alpha actinin has two EF-hand motifs to calmodulin’s four.

Page 15: PowerPoint Presentation - the Dot Matrix method

Even more can be done with RNA:

Consider the following set of examples from the phenylalanine transfer RNA (tRNA-Phe) molecule from Baker’s yeast.

The sequence and structure of this molecule are also known; the illustration will show how simple dot-matrix procedures can quickly lead to functional and structural insights (even without complex folding algorithms).

If run with all default settings (including a 0, 1 scoring table) the dotplot from a comparison of this sequence with itself is quite uninformative, only showing the main identity diagonal:

Page 16: PowerPoint Presentation - the Dot Matrix method

Default RNA self comparison (window of 21 and stringency of 14 with the 0, 1 scoring function):

Page 17: PowerPoint Presentation - the Dot Matrix method

However, if you adjust the window size down to find finer features, some elements of symmetry become apparent. Here I have changed the window size to 7 and the stringency value to 5. As a general guide, pick a window size about the same size as the feature that you are trying to recognize and a stringency such that unwanted background noise is just filtered away enough to enable you to see that desired feature.

Several direct repeats are now obvious that remained obscured in the previous analysis.

Page 18: PowerPoint Presentation - the Dot Matrix method

RNA comparisons of the reverse-complement of a sequence to itself can often be very informative. Here the yeast tRNA sequence is compared to its reverse-complement using the same 5 out of 7 stringency setting as previously. The stem-loop inverted repeats of the tRNA clover-leaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal that runs in the opposite direction from before. Take for instance the middle stem: the region of the molecule centered at approximately base number 38 has a clear propensity to base pair with itself without creating a loop, since it crosses the main diagonal, and then just after a small unpaired gap another stem is formed between the region from about base number 24 through 30 and approximately 46 through 40.

Page 19: PowerPoint Presentation - the Dot Matrix method

That same region ‘zoomed in on’ has some small direct repeats seen by comparing the sequence against itself without reversal:

Page 20: PowerPoint Presentation - the Dot Matrix method

But looking at the same region of the sequence against its reverse-complement shows a wealth of potential stem-loop structure in the transfer RNA:

Page 21: PowerPoint Presentation - the Dot Matrix method

22 GAGCGCCAGACT G
   || | ||||| |   A
48 CTGGAGGTCTAG A

Base position 22 through position 33 base pairs with (think: is quite similar to the reverse-complement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding algorithm, uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However, the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows ‘reality’ lies somewhere in between.
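The pairing marks between the two strands of this stem can be rechecked with a short sketch. The two strings are copied from the diagram above (written with T, as on the slide). Treating G-T (G-U) wobble as a pair is an assumption on my part, made because it reproduces the bars drawn on the slide; `pairing_line` is an invented helper name.

```python
# Watson-Crick pairs, written with T as on the slide.
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "T"), ("T", "A")}
# Assumption: G-T (that is, G-U) wobble pairs are also counted as paired.
WOBBLE = {("G", "T"), ("T", "G")}


def pairing_line(top, bottom):
    """Return '|' under each position where the two strands can base pair."""
    marks = []
    for x, y in zip(top, bottom):
        paired = (x, y) in WATSON_CRICK or (x, y) in WOBBLE
        marks.append("|" if paired else " ")
    return "".join(marks)


top = "GAGCGCCAGACT"     # positions 22 through 33, read left to right
bottom = "CTGGAGGTCTAG"  # positions 48 down to 37, read left to right

line = pairing_line(top, bottom)
print(top)
print(line)
print(bottom)
```

Nine of the twelve positions pair: eight Watson-Crick pairs plus one wobble pair at the last position. That is why this stretch shows up as a strong diagonal when the region is compared against its reverse-complement.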

Page 22: PowerPoint Presentation - the Dot Matrix method

Conclusions:

What about these alike areas? What’s the best ‘path’ through the dot matrix? How long do I extend it? How can I ‘zoom-in’ on it to see exactly what’s happening? Where, specifically, is this alignment; how can I see the ‘best’ ones? And, what can I learn from these alignments?

This brings up the alignment problem. It is easy to see that two sequences are aligned when they have identical symbols at identical positions, but what happens when symbols are not identical or the sequences are not the same length? How can we know that the most alike portions of our sequences are aligned, when is an alignment optimal, and does optimal mean biologically correct?

But, how to do all of this?

A ‘brute force’ approach just won’t work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an N^2 problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an N^(4N) problem. Waterman illustrated the problem in 1989, stating that to align two sequences 300 symbols long, 10^88 comparisons would be required, about the same number of elementary particles estimated to exist in the universe!

Part of a better solution . . . enter the dynamic programming algorithm and Dr. Jack Quine’s lecture.

FOR EVEN MORE INFO...

http://bio.fsu.edu/~stevet/workshop.html

Contact me (stevet@[email protected]) for specific bioinformatics assistance and/or collaboration.