advanced computational biology project presentation
ADVANCED COMPUTATIONAL BIOLOGY
PROJECT PRESENTATION
Team Members: Joshua Wu 11174269
Shuyu (Christine) Xu 11161640
OVERVIEW
Project Description
Introduction
Motivation
Bioinformatics Application
Explicit vs Implicit
Problem Analysis
Implement Files
Experimental Results
Conclusion
Possible Future Work
Project Description: Explicit Suffix Trees
Suppose that we want to store explicitly all the strings that are edge labels of a suffix tree.
The main question of this project is how much space explicit suffix trees require compared to implicit suffix trees.
We implement a suffix tree algorithm and run it on substrings of real data.
Introduction
Any string of length m can be decomposed into m suffixes, and these suffixes can be stored in a suffix tree.
Setup time: O(m), where m is the length of the string
Search time: O(n), where n is the length of the pattern
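As a minimal sketch of this idea (not the project's own code), the Python snippet below lists the m suffixes of a string and stores them in an uncompressed suffix trie made of nested dictionaries. The names suffixes, build_suffix_trie and contains are illustrative assumptions. This naive construction takes O(m^2) time and space; reaching an O(m) setup time requires a linear-time algorithm such as Ukkonen's, which is not shown here. The search does take one step per pattern character, which is O(n).

```python
def suffixes(text):
    """Return the m suffixes of a string of length m."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Insert every suffix into a trie of nested dicts (naive, O(m^2))."""
    root = {}
    for suffix in suffixes(text):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow one trie edge per pattern character: O(n) for a pattern of length n."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie("ABCABC$")
    print(suffixes("ABC$"))       # ['ABC$', 'BC$', 'C$', '$']
    print(contains(trie, "CAB"))  # True
    print(contains(trie, "CBA"))  # False
```

A pattern occurs in the string exactly when it is a prefix of some suffix, which is why a walk from the root is enough.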
Motivation
"Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]" -- Nat Goodman (Genome Technology).
Bioinformatics Application
1. multiple genome alignment (Michael Hohl et al., 2002)
2. selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002)
3. identification of sequence repeats (Kurtz and Schleiermacher, 1999)
Explicit vs Implicit
Example string: ABC$ (positions 1 2 3 4)
Explicit edge labels: ABC$  BC$  C$  $
Implicit edge labels: 1,4  2,4  3,4  4,4
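To make the two representations concrete, here is a small Python sketch. It assumes positions are 1-based and inclusive, as in the ABC$ example, and the helper names are hypothetical. Because every character of ABC$ is different, each suffix is a single edge from the root, so the explicit label is the suffix itself and the implicit label is its (start, end) pair.

```python
def explicit_labels(text):
    """Edge labels stored as strings (valid when all characters differ)."""
    return [text[i:] for i in range(len(text))]


def implicit_labels(text):
    """The same edges stored as 1-based, inclusive (start, end) index pairs."""
    n = len(text)
    return [(i + 1, n) for i in range(n)]


def decode(text, pair):
    """Recover the explicit label from an index pair and the original string."""
    start, end = pair
    return text[start - 1:end]


if __name__ == "__main__":
    text = "ABC$"
    for label, pair in zip(explicit_labels(text), implicit_labels(text)):
        print(label, pair)
```

The implicit form always needs the original string alongside it, since decode reads the label back out of the text.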
Problem Analysis
Best case for explicit and implicit suffix trees: all characters different
The best case is not likely with DNA inputs, which use only 4 characters in total
Worst case: the same character throughout
Assumptions
In implicit trees, each number takes up only one unit of storage, called one bit here (the number 10 takes up 1 bit)
Only alphabetic characters appear in the sequence
Example: all different characters (best case)
String: ABCD$ (positions 1 2 3 4 5)
Implicit edge labels: 1,5  2,5  3,5  4,5  5,5
N: string length, N = 5
Memory = 10
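Example: ABCABC$
String: ABCABC$ (positions 1 2 3 4 5 6 7)
Implicit edge labels from the root: 1,3  2,3  6,6  7,7
Implicit edge labels below each of 1,3, 2,3 and 6,6: 4,7 and 7,7
N: string length, N = 7
Memory = 20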
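Example: all same character (worst case)
String: AAAA$ (positions 1 2 3 4 5)
Implicit edge labels along the chain of A's: 1,1  2,2  3,3  4,5
Implicit edge labels for the $ branches: 5,5  5,5  5,5  5,5
N: string length, N = 5, 6, 7
Memory = 16, 20, 24
Memory = 4N-4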
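The memory figures in these examples can be checked with a short Python sketch. It is an illustration under stated assumptions, not the project's program: the tree is built with a naive O(N^2) insertion of each suffix, the input must end with a single $ terminator, each index counts as one unit (so every edge costs 2), and the explicit cost of one unit per label character is an assumption added here for comparison. All names are hypothetical.

```python
class Node:
    """A suffix tree node; children maps a first character to an edge."""

    def __init__(self):
        # first character -> (start, end, child), 0-based, end exclusive
        self.children = {}


def build_suffix_tree(text):
    """Naive O(N^2) construction: insert each suffix, splitting edges as needed."""
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with a single '$' terminator")
    n = len(text)
    root = Node()
    for i in range(n):
        node, j = root, i
        while True:
            first = text[j]
            if first not in node.children:
                node.children[first] = (j, n, Node())  # new leaf edge
                break
            start, end, child = node.children[first]
            k = 0  # number of characters matched along this edge
            while k < end - start and text[start + k] == text[j + k]:
                k += 1
            if k == end - start:  # whole edge matched: continue below it
                node, j = child, j + k
                continue
            # Mismatch inside the edge: split it and hang a new leaf there.
            middle = Node()
            middle.children[text[start + k]] = (start + k, end, child)
            middle.children[text[j + k]] = (j + k, n, Node())
            node.children[first] = (start, start + k, middle)
            break
    return root


def memory_counts(text):
    """Return (implicit, explicit): 2 numbers per edge vs 1 unit per label character."""
    implicit = explicit = 0
    stack = [build_suffix_tree(text)]
    while stack:
        node = stack.pop()
        for start, end, child in node.children.values():
            implicit += 2            # one (start, end) pair per edge
            explicit += end - start  # the label written out in full
            stack.append(child)
    return implicit, explicit


if __name__ == "__main__":
    for s in ["ABCD$", "ABCABC$", "AAAA$", "AAAAA$", "AAAAAA$"]:
        print(s, memory_counts(s))
```

The implicit counts printed for these strings are 10, 20, 16, 20 and 24, matching the examples. The index pairs chosen by this construction can differ from the ones drawn on the slides (for instance 3,3 instead of 6,6 for the edge labelled C), but the number of edges is the same.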
Program Input Data
DNA and protein sequences from all kinds of organisms:
Homo sapiens, monkeys, chickens, …
Sample input: Homo sapiens
cagctcctgagactgctggcatgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaagtggacctcagacatggctcagccataggacctgccacacaagcagccgtggacacaacgcccactaccacctcccacatggaaatgtatcctcaaaccgtttaatcaataa
Sample result
Sample input 2: plants
EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG
Sample output:
Homo sapiens
Sample Input: Homo Sapiens
atgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaa
Comparisons: Homo Sapiens
Monkey Virus
Sample Input: Monkey Virus
GGSCFKCGKKGHFAKNCHEHAHNNAEPKVPGLCPRCKRGKHWANECKSKTDNQGNPIPPH
Monkey Virus
Plants
Sample Input: Plants
EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG
Plants
Tobacco
Sample input: tobacco
SYSITTPSQFVFLSSAWADPIELINLCTNALGNQFQTQQARTVVQRQFSEVWKPSPQVTVRFPDSDFKVYRYNAVLDPLVTALLGAFDTRNRIIEVENQANPTTAETLDATRRVDDATVAIRSAINNLIVELIRGTGSYNRSSFESSSGLVWTSGPAT
Tobacco
Insects
Sample Input: Insects
DCLSGRYKGPCAVWDNETCRRVCKEEGRSSGHCSPSLKCWCEGC
Insects
Birds
Sample Input: Birds
IDTCRLPSDRGRCKASFERWYFNGRTCAKFIYGGCGGNGNKFPTQEACMKRCAKA
Birds
SARS
Sample Input: SARS
ALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEV
SARS
Fish
Sample Input: Fish
GHHHHHHLEDPSGGTPYIGSKISLISKAEIRYEGILYTIDTENSTVALAKVRSFGTEDRPTDRPIAPRDETFEYIIFRGSDIKDLTVCEPPKPIM
Fish
Chicken
Sample Input: Chicken
RVKRVWPLVIRTVIAGYNLYRAIKKK
Chicken
Implement Files
Code
Results
Conclusion
Explicit suffix trees require more space than implicit suffix trees on real data.
Data comparison: the worst case is DNA input (least variety of characters).
Implicit trees should be used to reduce storage.
[Figure: variety of string vs tree size. Horizontal axis: # of alphabets, 1 to 25. Vertical axis ticks: 0 to 3000.]
Conclusion
Application: it is easier to compare structures for implicit than for explicit suffix trees (number comparisons)
Saves space
Easy to implement
Further improvement?
Possible Future Work
The program is too slow
The interface of our program should be improved (Matlab)
More variety of input
References: Real Data
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=74273665
http://www.rcsb.org/pdb
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=nucleotide
References: Online info
http://en.wikipedia.org/wiki/Suffix_tree
http://marknelson.us/1996/08/01/suffix-trees/
http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/print07.pdf
THANK YOU!