Motif Detection in Yeast
Vishakh, Joe Bertolami
Nick Urrea, Jeff Weiss
Overview
1. Problem Statement
2. Motivation
3. History
4. Our Approach
5. Evaluation
6. Results
7. Discussion
8. References
1. The Problem
Find regulatory sequences in the upstream region of yeast DNA.
Regulatory sequences are segments of DNA where proteins can bind to enhance transcription of a gene.
The Problem
We are given:
Upstream genome, which consists of:
  Gene families, each of which consists of:
    Individual genes, each of which consists of:
      Strings like ATGC
We had to find substrings that are unusually frequent in a gene family, given their distribution in the whole upstream genome.
The Problem
We emulated techniques devised by van Helden.
We worked on a similar data set and tried to reproduce, and even improve on, his findings.
2. Motivation
Organisms like yeast share many genes with humans.
As a result, they also share genes linked to disease.
Finding regulatory sequences in yeast might lead to medical advances.
It might lead to therapies for diseases such as cystic fibrosis.
3. History
The previous century saw rapid advances in genetics.
The scientific community is trying to get a better understanding of various genomes.
The particular technique used here was developed by Jacques van Helden.
4. Our Approach
Extract all substrings of lengths 6-8 in the upstream genome.
Calculate the frequency of occurrence of each substring.
Put this data in a table.
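
The extraction and counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function name `count_substrings` and the two toy sequences are assumptions made for the example.

```python
from collections import Counter

def count_substrings(sequences, min_len=6, max_len=8):
    """Count every substring of length min_len..max_len across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for k in range(min_len, max_len + 1):
            # Slide a window of width k over the sequence.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts

# Toy stand-in for the upstream genome (hypothetical data, not real yeast DNA).
upstream = ["ATGCACGTGA", "TTCACGTGAT"]
table = count_substrings(upstream)
print(table["CACGTG"])
```

The resulting `Counter` plays the role of the frequency table: keys are substrings, values are their occurrence counts.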
Our Approach
Consider a gene family.
Find all substrings in it and their frequencies, and build a table.
For each entry, add its probability of occurrence, derived from the upstream genome table.
Use the above data to calculate three scores.
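
A possible shape for the per-family table is sketched below. The slides do not say how the probability of occurrence is computed, so this example assumes it is the substring's relative frequency among all same-length substrings of the upstream genome. The names `kmer_counts` and `family_table`, and the toy sequences, are hypothetical.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count all substrings of length k across the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def family_table(family_seqs, genome_seqs, k=6):
    """Map each substring in the family to (observed count, background probability).

    Assumption: background probability = genome count / total genome substrings.
    """
    genome = kmer_counts(genome_seqs, k)
    total = sum(genome.values())
    family = kmer_counts(family_seqs, k)
    return {kmer: (n, genome[kmer] / total) for kmer, n in family.items()}

# Toy data (hypothetical): the family is a subset of the genome sequences.
genome_seqs = ["ACGTACGTAC", "TTTTTTTTTT"]
family_seqs = ["ACGTACGTAC"]
fam = family_table(family_seqs, genome_seqs)
print(fam["ACGTAC"])
```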
Our Approach
Score 1: Expected Occurrence / Actual Occurrence
Use the probability of occurrence and the size of the gene family to calculate the expected occurrence.
Divide by the actual occurrence.
Low score -> unusually frequent substring.
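
As a rough sketch of Score 1, assuming the "size of the gene family" means the number of substring positions in the family's upstream sequences (the slides do not define it), the calculation might look like this. The function names and the numbers are illustrative only.

```python
def expected_occurrences(probability, family_positions):
    """Expected count = background probability x number of positions in the family."""
    return probability * family_positions

def expected_over_actual(probability, family_positions, actual):
    """Score 1: expected / actual. Lower values mean over-representation."""
    if actual <= 0:
        raise ValueError("actual occurrences must be positive")
    return expected_occurrences(probability, family_positions) / actual

# Hypothetical numbers: background probability 1/1024, 4096 positions, seen 16 times.
score1 = expected_over_actual(1 / 1024, 4096, 16)
print(score1)
```

Here the substring is expected 4 times but seen 16 times, so the score falls well below 1.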
Our Approach
Score 2: Poisson Distribution
Use the expected and actual numbers of occurrences.
If a substring occurs ‘n’ times, calculate the probability of ‘n’ occurrences using the Poisson distribution.
Lower probability -> unusually frequent.
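
The Poisson score can be sketched as follows, taking the expected count as the Poisson mean. `poisson_pmf` follows the slide literally (probability of exactly n occurrences). `poisson_tail`, the probability of n or more occurrences, is an added variant that is not stated on the slides; it is included because a low point probability alone can also indicate an unusually rare substring. All names and numbers are hypothetical.

```python
import math

def poisson_pmf(n, expected):
    """P(X = n) for a Poisson variable with mean `expected` (> 0), in log space."""
    log_p = -expected + n * math.log(expected) - math.lgamma(n + 1)
    return math.exp(log_p)

def poisson_tail(n, expected):
    """P(X >= n). Assumed variant, not taken from the slides."""
    return max(0.0, 1.0 - sum(poisson_pmf(i, expected) for i in range(n)))

# Hypothetical counts: 4 occurrences expected, 16 observed.
score2 = poisson_pmf(16, 4.0)
print(score2, poisson_tail(16, 4.0))
```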
Our Approach
Score 3: Binomial Distribution
Use the probability of occurrence, the sizes of the genome and gene family, and the actual occurrences.
If a substring occurs ‘n’ times, calculate the probability of ‘n’ occurrences using the Binomial distribution.
Lower probability -> unusually frequent.
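
A minimal sketch of the Binomial score, assuming each substring position in the family is treated as one trial with success probability equal to the background probability. That modelling choice, the function name `binomial_pmf`, and the numbers are assumptions for illustration; the slides only name the inputs.

```python
import math

def binomial_pmf(n, trials, p):
    """P(X = n) for X ~ Binomial(trials, p), with 0 < p < 1, in log space."""
    log_comb = (math.lgamma(trials + 1) - math.lgamma(n + 1)
                - math.lgamma(trials - n + 1))
    log_p = log_comb + n * math.log(p) + (trials - n) * math.log1p(-p)
    return math.exp(log_p)

# Hypothetical numbers: 4096 positions, background probability 1/1024, seen 16 times.
score3 = binomial_pmf(16, 4096, 1 / 1024)
print(score3)
```

Working in log space avoids overflow in the binomial coefficient when the number of positions runs into the thousands.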
Our Approach
Sort substrings by a score.
Take the top sequences and create a probability matrix.
Iterate on the probability matrix to get a probabilistic model of the regulatory sequence.
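
The matrix-building step might be sketched as below, assuming the top sequences have the same length and are already aligned. The function name `probability_matrix` and the four input sequences are hypothetical. The iterative refinement is left out because the slides do not describe how it works.

```python
def probability_matrix(sequences, alphabet="ACGT"):
    """Per-position base frequencies for equal-length, pre-aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sequences]
        # Fraction of sequences carrying each base at this position.
        matrix.append({b: column.count(b) / len(sequences) for b in alphabet})
    return matrix

# Hypothetical aligned top-scoring sequences.
top = ["CACGTG", "CACGTT", "CACATG", "TACGTG"]
pwm = probability_matrix(top)
for row in pwm:
    print(row)
```

Each row is one motif position; the values in a row sum to 1.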
5. Evaluation Metrics
Van Helden’s results in his ’98 paper and on his website.
The ’98 paper used old data, so it is not very reliable for evaluation.
The website is very useful since it works on current data and calculates results dynamically.
We compared our output to his.
Evaluation Metrics
Also, compare the three score types to find the best method.
6. Results
Comparison of results for the MET family (rank of each sequence under each method)

Sequence  Van Helden’s site  Binomial Dist  Poisson Dist  Expected / Actual  Old Paper
CACGTG    1                  1              3             4                  1
ACGTGA    2                  2              1             2                  3
TCACGT    3                  3              2             1                  2
ATATAT    4                  4              N/A           N/A                5
TATATA    5                  5              N/A           N/A                10
AACTGT    6                  7              4             28                 4
ACAGTT    7                  6              N/A           29                 N/A
ACACAC    8                  9              7             N/A                N/A
GTGTGT    9                  8              6             N/A                N/A
Results
Probability matrices generated successfully!
7. Discussion
Paper results clearly outdated.
Close correlation with van Helden’s site.
Binomial distribution best, followed by Poisson and Expected/Actual.
Discussion
Why don’t the Binomial results perfectly match van Helden’s site?
Van Helden’s paper only outlines the general method.
He uses many filters and adjustments, and there is limited information about them on the site.
We used similar, but not the same, filters.
Example: Purge sequences that appear twice in a row.
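
The slides do not spell out how the example filter works. One possible reading, shown here purely as an assumption, is that back-to-back overlapping matches of a self-overlapping substring such as ATATAT are counted only once. The function names and the test sequence are hypothetical.

```python
def count_overlapping(sequence, pattern):
    """Count every match, including matches that overlap the previous one."""
    return sum(
        1
        for i in range(len(sequence) - len(pattern) + 1)
        if sequence.startswith(pattern, i)
    )

def count_non_overlapping(sequence, pattern):
    """Count matches left to right, skipping any that overlap a counted match."""
    # str.count already counts non-overlapping occurrences.
    return sequence.count(pattern)

seq = "ATATATATAT"  # hypothetical repetitive stretch
print(count_overlapping(seq, "ATATAT"), count_non_overlapping(seq, "ATATAT"))
```

Without such a filter, repetitive stretches inflate the counts of substrings like ATATAT and ACACAC.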
Discussion
Future work
Find more filters.
Try other similar organisms’ genomes.
Biologically verify results!
Discussion
What we learnt
Biology!
First-hand look at genetic data
Became more familiar with genes
Clearly understood what the fuss about genetics is about
Computer Science
Teamwork
Interfacing CS with other scientific disciplines
References
van Helden, J., André, B. & Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281(5), 827-842.
van Helden, J., Rios, A. F. & Collado-Vides, J. (2000). Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28(8), 1808-1818.