The Motif Problem

Paul Tamashiro

School of Mathematics

Georgia Institute of Technology

April 16, 2008


Outline

• Biology
– Background
– Purpose
– What is a motif?
– What is the motif problem?

• Mathematics
– The two types of algorithms
– An example algorithm

• Problems/Review/Questions


Biology Section


Background

• Hamilton Smith discovered the first DNA signal in 1970

• He worked with the Hind II restriction enzyme of Haemophilus influenzae, a bacterium that infects the upper respiratory system

• The Hind II enzyme cuts DNA at specific recognition sites, separating a sequence into specific subsequences

• The sites recognized by restriction enzymes are the easiest signals to locate

(Pevzner 2001)


Purpose of the Motif Problem

• Related to the connection between drugs and their specific targets in the human body
– Drug - any chemical substance used to treat or investigate a disease
– Target - a molecule within the human body that reacts with a drug

• Locating regulatory sites could make it possible to affect the activity of certain proteins or enzymes, and could dramatically increase the potential benefits of drug target identification

(Peter Imming, Christian Sinning, and Achim Meyer, 2006)


Purpose of the Motif Problem

(Peter Imming, Christian Sinning, and Achim Meyer, 2006)


What is a Motif?

• Definition – a section of a DNA sequence (contiguous or sometimes non-contiguous) used for gene sequencing or drug target purposes (Mendes 2008)

• The term is used in two ways:
– The existing substring found in the input sequence
– The pattern produced by the algorithm itself

(Mendes 2008)


What is a Motif?

Very elementary definition of motif and algorithm


What is the Motif Problem?

• Given a set of input sequences S = {S1, S2, …, St}, does a common subsequence of length l, with lmin ≤ l ≤ lmax, occur in at least q ≤ t of the sequences with no more than e mismatches?

• If so, how does one break down the sequences in order to distinguish the relevant signals from randomly recurring patterns?

(Mendes 2008)
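To make the first question concrete, the sketch below checks one candidate pattern against a set of sequences: it asks whether the pattern occurs, with at most e mismatches, in at least q of them. The function names (`hamming`, `occurs_in`, `has_quorum`) and the toy sequences are illustrative assumptions and do not come from the survey; the sketch tests a given pattern and does not search for one.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_in(pattern: str, seq: str, e: int) -> bool:
    """True if some window of seq matches pattern with at most e mismatches."""
    l = len(pattern)
    return any(hamming(pattern, seq[i:i + l]) <= e
               for i in range(len(seq) - l + 1))


def has_quorum(pattern: str, seqs: list[str], q: int, e: int) -> bool:
    """True if pattern occurs (up to e mismatches) in at least q sequences."""
    return sum(occurs_in(pattern, s, e) for s in seqs) >= q


# Toy input, invented for illustration (t = 4 sequences).
S = ["TTACGTAA", "GGACCTCC", "ACGAAAAA", "TTTTTTTT"]

print(has_quorum("ACGT", S, q=3, e=1))  # True: three sequences contain a close match
print(has_quorum("ACGT", S, q=4, e=1))  # False: TTTTTTTT has no window within 1 mismatch
```

Raising e or lowering q makes the answer easier to satisfy, which is why randomly recurring patterns become harder to separate from real signals.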


Mathematics Section


The Two Types of Algorithms

• Combinatorics
– Graph theory, counting
– Enumeration

• Probability/Statistics
– Expectation maximization, probabilistic optimization
– Maximum likelihood estimators (a method that uses statistics to find the best model for a given set of data)

(Mendes 2008)


Notation

• S = {S1, S2, …, St} - set of input sequences

• l-mer - a subsequence of length l.

• mij - an l-mer of the sequence Sj that starts at position i.

• Sj[i] - the i-th symbol in the j-th sequence.

• nj - the length of the j-th sequence

(Mendes 2008)
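The notation can be read off a small example. The sketch below assumes positions and sequence indices are 1-based, as the subscripts suggest; the slides do not state this explicitly. The helper name `lmers` and the two toy sequences are made up for illustration.

```python
def lmers(seq: str, l: int) -> dict[int, str]:
    """Map each 1-based start position i to the l-mer of seq that starts there."""
    return {i + 1: seq[i:i + l] for i in range(len(seq) - l + 1)}


S = ["ACGTAC", "GGTAC"]  # S1 and S2, toy data
l = 3
j = 1                    # work with S1

Sj = S[j - 1]            # Python lists are 0-based, the notation is 1-based
m = lmers(Sj, l)

print(len(Sj))      # nj, the length of S1: 6
print(Sj[2 - 1])    # Sj[2], the 2nd symbol of S1: C
print(m[2])         # m21, the 3-mer of S1 starting at position 2: CGT
print(len(m))       # number of l-mers in S1, nj - l + 1: 4
```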


The WINNOWER Algorithm

• Summary: finds all motifs of length l that occur in the input sequences with no more than e mismatches
– 1. Constructs a graph whose vertices are DNA subsequences (l-mers), with edges between vertices of similar subsequences
– 2. Eliminates unwanted edges
– 3. The remaining graph may contain vertices representing a motif


The WINNOWER Algorithm

Figure 3: the WINNOWER algorithm (Mendes 2008)


The WINNOWER Algorithm

• Vertices represent all of the l-mers in the set of sequences S = {S1, S2, …, St}

• An edge joins two vertices if their l-mers come from different sequences and the Hamming distance between them is less than or equal to 2e

• The Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ
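A minimal sketch of the graph construction follows. The threshold 2e reflects the triangle inequality: two l-mers that each differ from a common motif in at most e positions differ from each other in at most 2e positions. The vertex encoding `(j, i)` (0-based sequence index and start position) and the function name `build_graph` are assumptions made for this example, and the quadratic loop is chosen for clarity, not speed.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(seqs: list[str], l: int, e: int):
    """Return (vertices, edges) of the similarity graph.

    A vertex (j, i) stands for the l-mer starting at 0-based position i
    of sequence j. Edges join l-mers from different sequences whose
    Hamming distance is at most 2e.
    """
    vertices = [(j, i)
                for j, s in enumerate(seqs)
                for i in range(len(s) - l + 1)]
    edges = set()
    for (j1, i1), (j2, i2) in combinations(vertices, 2):
        if j1 == j2:
            continue  # never join two l-mers of the same sequence
        a = seqs[j1][i1:i1 + l]
        b = seqs[j2][i2:i2 + l]
        if hamming(a, b) <= 2 * e:
            edges.add(((j1, i1), (j2, i2)))
    return vertices, edges


# Toy input: each sequence contributes exactly one 4-mer.
seqs = ["ACGT", "ACGA", "TTTT"]
vertices, edges = build_graph(seqs, l=4, e=1)
print(vertices)  # [(0, 0), (1, 0), (2, 0)]
print(edges)     # {((0, 0), (1, 0))}: only ACGT and ACGA are within 2e = 2
```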


The WINNOWER Algorithm

• Graph G = (V, E) is a t-partite graph in which each part consists of the vertices (l-mers) contributed by one of the input sequences

• The algorithm systematically reduces the number of edges by keeping only those that belong to extendable cliques


The WINNOWER Algorithm

• Clique - a subgraph in which every two vertices are connected by an edge

• A clique is extendable if it has at least one neighbor in every partition that does not already contain one of its vertices
– Suppose there exists a clique C with vertices {V1, …, Vk}. A neighbor of the clique C is a vertex u such that {V1, …, Vk, u} is also a clique.

• The algorithm reduces the graph G by deleting spurious edges (edges that do not belong to any extendable clique of size k)

(Mendes 2008)
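These definitions translate directly into set operations. The sketch below stores the graph as an adjacency dictionary and a `part` dictionary that records which input sequence each vertex came from; both representations, the function names, and the four-vertex toy graph are assumptions for illustration. It reads "extendable" as having a neighbor in every partition the clique does not already occupy.

```python
from itertools import combinations


def is_clique(adj: dict, vertices: list) -> bool:
    """True if every two of the given vertices are joined by an edge."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def clique_neighbors(adj: dict, clique: list) -> set:
    """Vertices u outside the (non-empty) clique such that clique + [u] is a clique."""
    common = set.intersection(*(adj[v] for v in clique))
    return common - set(clique)


def is_extendable(adj: dict, part: dict, clique: list, t: int) -> bool:
    """True if the clique has a neighbor in every partition it does not occupy."""
    occupied = {part[v] for v in clique}
    reached = {part[u] for u in clique_neighbors(adj, clique)}
    return occupied | reached == set(range(t))


# Toy 3-partite graph: a0 in part 0, b0 and b1 in part 1, c0 in part 2.
part = {"a0": 0, "b0": 1, "b1": 1, "c0": 2}
adj = {
    "a0": {"b0", "b1", "c0"},
    "b0": {"a0", "c0"},
    "b1": {"a0"},
    "c0": {"a0", "b0"},
}

print(is_clique(adj, ["a0", "b0", "c0"]))            # True
print(clique_neighbors(adj, ["a0", "b0"]))           # {'c0'}
print(is_extendable(adj, part, ["a0", "b0"], t=3))   # True
print(is_extendable(adj, part, ["a0", "b1"], t=3))   # False: edge a0-b1 is spurious for k = 2
```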


The WINNOWER Algorithm

• The value of k is increased after each iteration until only extendable cliques remain and G cannot be altered any further
– If k = 1, the algorithm deletes every vertex that has fewer than t – 1 neighbors, since such a vertex cannot have a neighbor in each of the other t – 1 partitions.
– If k = 2, the algorithm deletes every edge whose two endpoints do not share a neighbor in each of the remaining t – 2 partitions, et cetera.

(Mendes 2008)
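The k = 1 step can be sketched as a loop that runs until the graph stops changing. This is only the first stage of the procedure, written under the same assumed representation as a plain adjacency dictionary plus a `part` lookup; the function name `prune_k1` and the toy graph are hypothetical. The test used here, that a vertex's neighbors must cover all t – 1 other partitions, is a slightly stricter reading than counting raw neighbors.

```python
def prune_k1(adj: dict, part: dict, t: int) -> dict:
    """Repeatedly delete vertices that lack a neighbor in some other partition.

    adj maps each vertex to the set of its neighbors; part maps each
    vertex to the index of its partition. Returns a pruned copy of adj.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            reached = {part[u] for u in adj[v]}
            # In a t-partite graph all neighbors lie in other partitions,
            # so fewer than t - 1 distinct partitions means one is missed.
            if len(reached) < t - 1:
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True  # a deletion may expose further deletions
    return adj


# Toy 3-partite graph: b1 only sees partition 0, so it cannot survive.
part = {"a0": 0, "b0": 1, "b1": 1, "c0": 2}
adj = {
    "a0": {"b0", "b1", "c0"},
    "b0": {"a0", "c0"},
    "b1": {"a0"},
    "c0": {"a0", "b0"},
}

pruned = prune_k1(adj, part, t=3)
print(sorted(pruned))        # ['a0', 'b0', 'c0']
print(sorted(pruned["a0"]))  # ['b0', 'c0']
```

Larger values of k operate on edges and require enumerating cliques of size k, which is where most of the time and space cost arises.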


The WINNOWER Algorithm

• The algorithm does not actually give the motif directly.

• It produces a graph G in which, ideally, only t-cliques remain; even if a t-clique exists, this does not ensure that a motif exists in the set of sequences.

• One must examine all remaining cliques and determine which of them, if any, correspond to actual motifs.
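A small counterexample shows why a t-clique is not enough. The check below assumes a motif is a string of length l that lies within e mismatches of one l-mer from each sequence; that reading, and the three toy l-mers, are assumptions for illustration. The l-mers are pairwise within 2e of each other, so they form a 3-clique, yet an exhaustive search over all 64 candidate strings finds no common pattern within e of all three.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


lmers = ["AAA", "CCA", "GGA"]  # one l-mer from each of t = 3 sequences
e = 1

# Every pair is within 2e = 2 mismatches, so the three vertices form a clique.
pairwise_ok = all(hamming(a, b) <= 2 * e for a, b in combinations(lmers, 2))

# Exhaustively look for a pattern within e mismatches of all three l-mers.
centers = ["".join(p)
           for p in product("ACGT", repeat=3)
           if all(hamming("".join(p), m) <= e for m in lmers)]

print(pairwise_ok)  # True
print(centers)      # []: the clique does not correspond to any motif
```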


The WINNOWER Algorithm

• An empty graph implies that no motif exists.

• A small graph with large cliques implies that some motifs may exist.

• A very large graph means that many spurious edges are still left over, so the algorithm was not effective in isolating a t-clique.

(Mendes 2008)


The WINNOWER Algorithm

• Drawbacks
– It consumes large amounts of time and space
– It is not guaranteed to find motifs accurately
– Even if a t-clique is found, it does not mean that a motif exists


Problems/Review/Questions


General Problems

• Reliability - proficiency in exposing a motif

• Complexity - what it costs to find a motif

• Questions that arise:
– How should we formally define the reliability of a motif finder?
– Should we be content with worst-case time scenarios?

(Mendes 2008)


General Problems

• One of the most recognized ways of testing the reliability of an algorithm is to experiment with input sequences for which the expected motifs are already known.

• In other words, test the new algorithms with input sequences and see if the new algorithms give the same conclusion as pre-existing algorithms.

(Mendes 2008)
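One way to obtain input sequences with a known answer is to plant a motif into random background sequences and check whether a finder recovers it. The sketch below is an assumed setup, not a benchmark from the survey: the helper names, the parameters (t = 5 sequences of length 30, a 6-letter motif, e = 1), and the exhaustive `brute_force_motifs` finder standing in for the algorithm under test are all hypothetical.

```python
import random
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(motif: str, e: int, rng: random.Random) -> str:
    """Copy of motif with exactly e randomly chosen positions changed."""
    chars = list(motif)
    for i in rng.sample(range(len(motif)), e):
        chars[i] = rng.choice([c for c in "ACGT" if c != chars[i]])
    return "".join(chars)


def make_benchmark(motif: str, t: int, n: int, e: int, rng: random.Random) -> list[str]:
    """t random sequences of length n, each containing one mutated copy of motif."""
    seqs = []
    for _ in range(t):
        background = "".join(rng.choice("ACGT") for _ in range(n))
        pos = rng.randrange(n - len(motif) + 1)
        seqs.append(background[:pos] + mutate(motif, e, rng)
                    + background[pos + len(motif):])
    return seqs


def brute_force_motifs(seqs: list[str], l: int, e: int) -> list[str]:
    """All length-l patterns occurring in every sequence with at most e mismatches."""
    found = []
    for p in product("ACGT", repeat=l):
        pattern = "".join(p)
        if all(any(hamming(pattern, s[i:i + l]) <= e
                   for i in range(len(s) - l + 1))
               for s in seqs):
            found.append(pattern)
    return found


rng = random.Random(0)  # fixed seed so the experiment is repeatable
known = "ACGTAC"
seqs = make_benchmark(known, t=5, n=30, e=1, rng=rng)
found = brute_force_motifs(seqs, l=len(known), e=1)

print(known in found)  # True: every sequence holds a copy within e mismatches
print(len(found))      # any additional patterns are chance matches
```

The number of extra patterns reported alongside the planted one gives a rough sense of how many chance matches a finder must be able to rule out.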


General Problems

• Question:
– Once the algorithm finishes, how do scientists distinguish the biologically significant patterns from all of the patterns that the algorithm saw fit to keep after all of the iterations were completed?

• Answer:
– The solution requires a better understanding of how DNA and RNA sequences interact with their targets. This will give a better understanding of the requirements of these algorithms, and this will give biologists better direction on how to interpret the algorithms and data that mathematicians and computer scientists explore.


Review

• We discussed the problem of finding non-trivial motifs in a given set of input sequences allowing some number of mismatches.

• We also explored its general applications in the biological world.

• We saw an example algorithm.

• We discussed problems with current algorithms and asked questions about them.


References

• 1. D'Haeseleer, Patrick. "How does DNA sequence motif discovery work?" Nature Biotechnology 24 (2006): 959-961.

• 2. Eskin, Eleazar, and Pavel A. Pevzner. "Finding Composite Regulatory Patterns in DNA Sequences." Bioinformatics 18 (2002): s354-s363.

• 3. Imming, Peter, Christian Sinning, and Achim Meyer. "Drugs, Their Targets and the Nature and Number of Drug Targets." Nature Reviews Drug Discovery 5 (2006): 821-834.

• 4. Mendes, Nuno D. "Finding Common Motifs in DNA Sequences: A Survey." Instituto Superior Técnico. 1 Apr. 2008 <http://kdbio.inescid.pt/~ndm/documents/minireviews/04_ndm_motifs.pdf>.

• 5. Pevzner, Pavel A., and Sing-Hoi Sze. "Combinatorial Approaches to Finding Subtle Signals in DNA Sequences." International Conference on Intelligent Systems for Molecular Biology 8 (2000): 269-278. 5 Feb. 2008 <http://www.ncbi.nlm.nih.gov/pubmed/10977088?ordinalpos=2&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum>.

• 6. Pevzner, Pavel A. Computational Molecular Biology. Cambridge, Massachusetts: The MIT Press, 2001. 133-151.