crick’s early hypothesis revisited. or the existence of a universal coding frame axel bernal upenn...
Post on 21-Dec-2015
![Page 1: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/1.jpg)
Crick’s early Hypothesis Revisited
![Page 2: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/2.jpg)
Or The Existence of a Universal Coding Frame
Axel Bernal, UPenn Center for Bioinformatics
Jean-Louis Lassez, Coastal Carolina University
Ryan Ross, Coastal Carolina University
![Page 3: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/3.jpg)
BIOINFORMATICS
The application of computer technology to the management and analysis of biological data
COMPUTATIONAL BIOLOGY
![Page 4: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/4.jpg)
Biology: the study of living organisms
Why should computer scientists be interested in biology?
![Page 5: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/5.jpg)
![Page 6: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/6.jpg)
Genomes and Genes
The language of life
…..catgcctagactgcatcggtaccatgacatgcatttatagaacactacgcgtaatagccatgatcccatagatacatacagagatacactgatagactcgacctcatccgattatatagacctgaaatggctagctggacatgcgatcgaatcgagattagcaccatagagtggcatagccatgcgctgatagcaaaatgccatagctagtgtctaacgtgcattgccctggatgacatggctccgatatggcggctgatcgtcgctgaaatgctcgctgcaatggctaggatacagtaatagacgtaatgccaatggctgctcgctggatagtcgctgacatcgatcgcctgatatgatgcgctagctccgcataagatcgctgatcgcta……..
![Page 7: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/7.jpg)
Genetic Code
![Page 8: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/8.jpg)
Crick’s 1957 Hypothesis
The genetic code has excellent information-theoretic properties: it is
comma free
It does not admit ANY form of parasitism.
![Page 9: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/9.jpg)
Dismissed for the past 35 years
Replaced by “Frozen Accident”
• Renewed interest in comma free and circular codes (DNA computing, Arques/Michel)
• Time to revisit
![Page 10: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/10.jpg)
Coding
0000 = A   1111 = B   0001 = C   1000 = D
0011 = E   1100 = F   0111 = G   1110 = H
0010 = I   0100 = J   0101 = K   1010 = L
1001 = M   0110 = N   1011 = O   1101 = P
![Page 11: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/11.jpg)
Message:
0010111010110010111100011011100011011001110101010010101100110111
I H O I B C O D P M P K I O E G
Communication Error (one bit flipped):
0010111010110010111100011011100011011001110101010010101110110111
Translation Error, Frameshift (one bit lost at X):
I H O X K H E G C O E L L K G N ...
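The two kinds of error above can be replayed with a short sketch. It uses the 4-bit table from the Coding slide; the helper name `decode` and the choice to drop the bit at index 12 (right after the first three blocks) are illustrative assumptions, not something the slides specify.

```python
# 4-bit block code from the "Coding" slide.
CODE = {
    "0000": "A", "1111": "B", "0001": "C", "1000": "D",
    "0011": "E", "1100": "F", "0111": "G", "1110": "H",
    "0010": "I", "0100": "J", "0101": "K", "1010": "L",
    "1001": "M", "0110": "N", "1011": "O", "1101": "P",
}


def decode(bits: str) -> str:
    """Decode complete 4-bit blocks; trailing bits that do not fill a block are ignored."""
    usable = len(bits) - len(bits) % 4
    return "".join(CODE[bits[i:i + 4]] for i in range(0, usable, 4))


message = "0010111010110010111100011011100011011001110101010010101100110111"
corrupted = "0010111010110010111100011011100011011001110101010010101110110111"

# Frameshift (assumed position): lose the single bit at index 12,
# i.e. right after the first three blocks "I H O".
shifted = corrupted[:12] + corrupted[13:]

print(decode(message))    # the original message
print(decode(corrupted))  # a flipped bit changes exactly one letter
print(decode(shifted))    # a lost bit changes every letter after it
```

A flipped bit stays local, while a lost bit shifts the reading frame and garbles everything downstream, which is the problem a comma-free code is meant to prevent.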
![Page 12: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/12.jpg)
Parasite sub Messages
Bounded Parasitism:
…101011100100111010010010001010111…
Spread Parasitism:
…101011100100111010010010001010111…
![Page 13: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/13.jpg)
Biological Implications of comma free
A frameshift will immediately abort the translation
ANY fragment of length 5 in the coding region of ANY gene in ANY organism determines the frame
Universal Frame property
![Page 14: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/14.jpg)
Crick’s Hypothesis Revisited
What is the length of the shortest segment of a coding region that defines the frame independently of the organism it comes from?
IF IT EXISTS
![Page 15: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/15.jpg)
Mathematical Concepts
Comma Free Codes
Codes with Bounded and Spread Parasitism
Circular codes
Locally Testable Languages
Similarity Measures
![Page 16: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/16.jpg)
A Circular Code
1
01
001
0001
00001
000001
0000001
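Every word in the list above ends in its only 1, so a binary word written on a circle can only be cut right after each 1. The sketch below illustrates that forced decomposition; the function name `circular_decomposition` and the convention of starting the reading just after the last 1 of the input string are assumptions made for the example.

```python
# The code from the slide: k zeros followed by a single 1, for k = 0..6.
X = {"0" * k + "1" for k in range(7)}


def circular_decomposition(necklace: str) -> list[str]:
    """Split a binary word read on a circle into words of X."""
    if "1" not in necklace:
        raise ValueError("a word without any 1 cannot be decomposed over X")
    # Rotate so that the linear reading ends on a 1; every cut then falls after a 1.
    cut = necklace.rindex("1") + 1
    rotated = necklace[cut:] + necklace[:cut]
    pieces = [zeros + "1" for zeros in rotated.split("1")[:-1]]
    if any(p not in X for p in pieces):
        raise ValueError("a run of zeros is too long for the words of X")
    return pieces


print(circular_decomposition("0100"))
print(circular_decomposition("1011"))
```

Because the cut points are dictated by the positions of the 1s, no second decomposition exists for the same circular word.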
![Page 17: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/17.jpg)
Unique Decomposition
![Page 18: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/18.jpg)
A Non Circular Code
000111 001 100011 110 101010
![Page 19: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/19.jpg)
Multiple Possible Decompositions
![Page 20: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/20.jpg)
00101110101100101111000110111000110101111101010100101011101101110010111010110010111100011011100011010111110101010010101110110111
0011111011110010011100011011100011011001110001110110100110110111
Locally Testable Events
Σ* \ Σ* 0101 Σ*
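Reading the expression above as the set of all binary words that do not contain 0101 (an interpretation of the slide's notation), membership depends only on which length-4 windows occur in the word. A minimal sketch, with hypothetical helper names:

```python
def window_set(word: str, k: int) -> set[str]:
    """All factors (contiguous substrings) of length k that occur in word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def avoids(word: str, forbidden: str = "0101") -> bool:
    """True if the forbidden factor never appears in any window of the word."""
    return forbidden not in window_set(word, len(forbidden))


print(avoids("0011100"))
print(avoids("1101011"))
```

The test never looks at where a window occurs or how often, only at the set of windows, which is what makes the property local.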
![Page 21: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/21.jpg)
![Page 22: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/22.jpg)
![Page 23: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/23.jpg)
![Page 24: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/24.jpg)
Theorem
Assumption: code X consists of a finite set of words all of the same length
The following are equivalent:
• X has bounded parasitism of degree d
• X^(d+1) is comma free
• X is circular
• X* is strictly locally testable
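Under the theorem's assumption (a finite set of words of equal length), comma-freeness can be checked directly from the usual definition: no out-of-frame window across two concatenated code words may itself be a code word. The sketch below is an illustration of that definition, not the authors' procedure; `is_comma_free` is a hypothetical name.

```python
from itertools import product


def is_comma_free(code) -> bool:
    """Check comma-freeness of a block code whose words all have the same length."""
    words = set(code)
    lengths = {len(w) for w in words}
    if len(lengths) != 1:
        raise ValueError("all code words must have the same length")
    n = lengths.pop()
    for u, v in product(words, repeat=2):
        pair = u + v
        # Offsets 1..n-1 are the out-of-frame readings that straddle u and v.
        if any(pair[i:i + n] in words for i in range(1, n)):
            return False
    return True


print(is_comma_free({"ACG"}))
print(is_comma_free({"AAC", "CAG", "ACC"}))
```

In the second call, reading AAC followed by CAG one letter out of frame yields ACC, which is also a code word, so the set is not comma free.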
![Page 25: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/25.jpg)
Crick’s Hypothesis Revisited Again
Genetic code C
Language of Genes G ≠ C*
If C has good properties, then G has good properties, BUT G may have good properties while C does not.
Shift from comma free to Testable by fragments
![Page 26: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/26.jpg)
Similarity
$$S(X, Y) = e^{-\frac{\lVert X - Y \rVert^{2}}{2\sigma^{2}}}$$

$$S_C(X) = \sum_{X_u \in C} S(X, X_u)$$

$$c(X) = \arg\max_{C} S_C(X)$$
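A small sketch of how such a similarity-based assignment could be computed, assuming a Gaussian similarity S(X, Y) = exp(-‖X − Y‖² / 2σ²), a class score that sums S over the members of a class, and assignment to the best-scoring class. The value of `sigma`, the toy vectors, and the class labels are placeholders, not values from the talk.

```python
import math


def similarity(x, y, sigma=1.0):
    """Gaussian similarity between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def class_score(x, members, sigma=1.0):
    """Sum of similarities between x and every member of one class."""
    return sum(similarity(x, u, sigma) for u in members)


def classify(x, classes, sigma=1.0):
    """Return the label of the class with the highest score for x."""
    return max(classes, key=lambda name: class_score(x, classes[name], sigma))


# Toy data (hypothetical): two classes of 2-dimensional points.
classes = {"C": [[0.0, 0.0], [0.0, 1.0]], "C+": [[5.0, 5.0]]}
print(classify([0.0, 0.5], classes))
```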
![Page 27: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/27.jpg)
![Page 28: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/28.jpg)
![Page 29: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/29.jpg)
![Page 30: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/30.jpg)
Arques/Michel Codes 1998
T0 = X0 ∪ {AAA, TTT}   T1 = X1 ∪ {CCC}   T2 = X2 ∪ {GGG}
X0 = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC}
X1 = {ACA, ATA, CCA, TCA, TTA, AGC, TCC, TGC, AAG, ACG, AGG, ATG, CCG, GCG, GTG, TAG, TCG, TTG, ACT, TCT}
X2 = {CAA, TAA, CAC, CAT, TAT, GCA, CCT, GCT, AGA, CGA, GGA, TGA, CGC, CGG, TGG, AGT, CGT, TGT, CTA, CTT}
![Page 31: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/31.jpg)
T Representations
Frame 0: ATGGGCAAGTAA → 1 0 1 2
Frame 1: ATGGGCAAGTAA → 2 2 2
Frame 2: ATGGGCAAGTAA → 2 2 0
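The T representation can be sketched as a lookup from each trinucleotide to the index of its class. The sketch assumes the classes T0 = X0 ∪ {AAA, TTT}, T1 = X1 ∪ {CCC}, T2 = X2 ∪ {GGG}, and uses the fact that the X1 and X2 lists in the talk are the one- and two-letter rotations of the X0 words. The names `CLASS_OF`, `rotate` and `t_representation` are hypothetical.

```python
# X0 as listed in the talk.
X0 = ("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
      "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC").split()


def rotate(word: str, k: int) -> str:
    """Move the first k letters of word to its end."""
    return word[k:] + word[:k]


# Build the trinucleotide -> class index table.
CLASS_OF = {}
for w in X0:
    CLASS_OF[w] = 0
    CLASS_OF[rotate(w, 1)] = 1  # gives the X1 words
    CLASS_OF[rotate(w, 2)] = 2  # gives the X2 words
# Assumed assignment of the four homopolymers.
CLASS_OF.update({"AAA": 0, "TTT": 0, "CCC": 1, "GGG": 2})


def t_representation(seq: str, frame: int) -> list[int]:
    """Class indices of the complete trinucleotides read from offset `frame`."""
    seq = seq.upper()
    return [CLASS_OF[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3)]


for frame in range(3):
    print(frame, t_representation("ATGGGCAAGTAA", frame))
```

Sequences containing letters other than A, C, G, T would raise a `KeyError` here; a real pipeline would need to decide how to handle them.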
![Page 32: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/32.jpg)
Training set
• DKEYP-117 zebra fish gene.
• KEGG
• 10620 Nucleotides
• Length of windows 200 in T representation
• C is 1671 Windows (Coding frame)
• C++ 1670 Windows
![Page 33: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/33.jpg)
First Experiment
![Page 34: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/34.jpg)
• Consistent with Crick’s hypothesis but for the size of the code.
• Comma-free code (words of length 600)
OR
• G is locally testable
• Robustness with respect to overfitting.
![Page 35: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/35.jpg)
General Experiment
Data sets
• We selected 14 different organisms in all three families and extracted 50 genes from each (E. coli, Pyrococcus, Anopheles gambiae, …).
• 100 genes which were selected from KEGG, NCBI, Weizmann Institute (TP53, Atm, HIV, Breast cancer…).
• 1000 genes with various ranges of GC Contents (Center for Bioinformatics, UPenn).
![Page 36: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/36.jpg)
![Page 37: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/37.jpg)
• Not Comma-free
• Maybe Bounded Parasitism/Circular
• It is testable by fragments
ATG…GGCAA…CACC…TAATGA..AGTG…CCAA..ACCCT…GCAAC..TAG…
![Page 38: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/38.jpg)
![Page 39: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/39.jpg)
ATG…GGCAA…CACC…TAATGA..AGTG…CCAA..ACCCT…GCAAC..TAG…….
• Not Comma-free
• Not Bounded Parasitism/Circular
• Not Locally testable
• But it IS testable by fragments
![Page 40: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/40.jpg)
Interpretation with respect to Crick’s Hypothesis
• Existence of a universal coding frame
• Some families fit the local testability/comma free/BP/circular model
• Some families are more susceptible to alternative splicing; still, they are Testable by Fragments (within the coding sequence)
![Page 41: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/41.jpg)
Strict Algorithm
Fw CCw / Fw CCw /
![Page 42: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/42.jpg)
Relaxed Algorithm
50SS FF
50SSS FFF
&
![Page 43: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/43.jpg)
General Results
• 95.4% success with Strict algorithm
• 94.8% success with Relaxed algorithm
• Distribution of failures (concentrated on some organisms)
• Support the Universal Frame Hypothesis
• Existence of underlying mathematical structures
![Page 44: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/44.jpg)
Smallest fragment size
Relaxed Algorithm
fragment of size 10, window size 2: 74% success
fragment of size 60, window size 25: 90% success
• Keep testable by fragment
• Most probable
![Page 45: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/45.jpg)
Universal Property
Human - TP53 Gene
ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCGCGTGGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCTGTCCCTTCCCAGAAAACCTACCAGGGCAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCACGTACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGAGA………………………………………………………………………….
Using this gene we are able to find the frame of any other gene.
E. coli – dgkA Gene: ……..TCGAATAATACCACTGGATTCACCCGAATTATCAAAGCTTCC…..
Pseudomonas fluorescens – ahcY Gene: ….TACGGCTGCCGTCACAGCCTGAACGACGCCATCAAGCGCGGC……..
Bos taurus – APOE Gene: ………..GCTGGGGCCAGCGAGGGTGCCGAGCGCAGCTTGAGCGCCATC…
Sus scrofa – JAK2 Gene: ……ATTGTAACTATTCATAAGCAAGATGGCAAAAGTCTGGAAAGC……
Pyrococcus – OT3 Gene: ……CATAGCGTTAACCACTACACCAACAGCGTCGGCAAAATCCTC……
Methanococcus maripaludis – comE Gene: ….TTTAACAATTACGCACCTATAACTACAGAACAACAACGTGAT……….
![Page 46: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/46.jpg)
CONCLUSION
• Provided we extend the notion of Comma-Free to the related notion of Testable By Fragment
Crick’s 1957 Hypothesis is vindicated:
• There exists a universal frame based on a mathematical model
![Page 47: Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal](https://reader030.vdocuments.us/reader030/viewer/2022032801/56649d545503460f94a30936/html5/thumbnails/47.jpg)
Coding vs. Non Coding
The algorithm tells us the most likely coding frame under the assumption that we are in the coding region. It is not suitable as such for analyzing the non-coding region; it needs to be adapted and refined.
The non-coding region contains pseudogenes, gene complements, hypothetical genes, other functional regions in the 5’ UTR and 3’ UTR, … repeats, and apparently random sequences. Nevertheless we ran an experiment (Augustus) …. 60 bp of transcription vs. translation