Page 1: Predicting Genes in Eukaryotic Genomes By Computer Hao Bailin ( 郝柏林 ) T-Life Research Center, Fudan University Beijing Genomics Institute, Academia Sinica

Predicting Genes in Eukaryotic Genomes

By Computer

Hao Bailin ( 郝柏林 )

T-Life Research Center, Fudan University

Beijing Genomics Institute, Academia Sinica

Institute of Theoretical Physics, Academia Sinica

(www.itp.ac.cn/~hao/)

Page 2:

The Central Dogma of Molecular Biology

DNA → DNA: replication

DNA → mRNA: transcription

mRNA → cDNA: reverse transcription

mRNA → Protein/Enzyme: translation

Protein/Enzyme → Structure: folding

Structure → Function: interaction

Page 6:

DNA (deoxyribonucleic acid) sequences

• Composed of 4 letters (nucleotides, or bases: a, c, g, t)

• Length: a single chromosome ranges from a few thousand to tens of millions of letters

• Humans have 23 pairs of chromosomes; chimpanzees have 24 pairs; mice have 20 pairs; rice has 12 pairs; kiwifruit has 300 pairs

• Part of a chromosome encodes proteins; the rest consists of control signals, repeated segments, "random" letter strings of unknown meaning, and so on

Page 7:

Large-Scale DNA Sequencing Since 1977

• Sanger method: polymerization stopping (chain termination)

• Maxam-Gilbert: chemical degradation

• Each reaction: 500-600 bp (a single read)

• Clone by clone vs. whole-genome shotgun

• Sequence assembly: reads – contigs – scaffolds – superscaffolds

• Automatic sequencer: MegaBACE, 96 or 384 channels

Page 8:

Letter production at BGI (Beijing + Hangzhou)

Daily: 5 × 10^7

Yearly: 10^10

Page 9:

Sequenced Eukaryotic Genomes

• Baker's yeast (Saccharomyces cerevisiae)

• Fission yeast (Schizosaccharomyces pombe)

• Nematode worm (Caenorhabditis elegans)

• Fruit fly (Drosophila melanogaster)

• Malaria parasite (Plasmodium falciparum)

• Malaria mosquito (Anopheles gambiae)

• Human (Homo sapiens), chimpanzee (Pan troglodytes)

• Mouse (Mus musculus), rat (Rattus norvegicus)

• Dog (Canis familiaris), chicken (Gallus gallus), pig (Sus scrofa)

• Pufferfish (Fugu rubripes)

• Silkworm (Bombyx mori), honeybee (Apis mellifera)

• Thale cress (Arabidopsis thaliana), rice (Oryza sativa)

• Maize (Zea mays)

Page 10:

cccaatatcttgcttcagcaagatattgggtatttctagctttcctttcttcaaaaattgctatatgttagcagaaaagccttatccattaagagatggaacttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtgcattacttccataccaagattagcacggttgatgatatcagcccaagtattaataacgcgaccttggctatcaactacagattggttgaaattgaatccgtttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccctactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaaaactagcatattggaagattaatcggccaaaataaccatgagcggccacaatattataagtttcttcctcttgaccaaatctgtaaccctcattagcagattcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccatgcatagcactgaatagggaaccgccgaatacaccagctacacctaacatgtgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaagttgaaagtaccagatattcctaaaggcataccatcagagaaacttccttgaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagctgaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttcccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaattagctcataaggaccaccattgtataaccactcatcaacagatgcagcttcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataatggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggctcacgaataccatcaatatctactggaggggcagcgatgaaggcgataataaatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaaccatccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgaccccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaagatcttggtttattcaaattgcaaggactcccaagcacacgtattaactagaaagataatagaaggcttgttatttaacagtataatatagactatataccaatgtcaaccaagccagccccgacagttgtatatccatacaacaaaatttaccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcagattgctcctttctagtttccatatgggttgcccgggactcgaacccggaactagtcggatggagtagataattattccttgttacaatagagaaaaaacctctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgtagaaataggctatttctattccgaagaggaagtctactaatttttttagtagtaagttgattcacttactatttattatagtacagagaacatttcagaatggaaactgtgaaagttttaccttgatcatttatcaatcatttctagtttattagttttgtttaatgattaattaagaggattcaccagatcattgatacggagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaaagtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgtaaaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtaccgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagtatatactttagtcgatacaaagtcttcttttttgaagatccactgtgataatgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtacaaaattgggcttttgctaaagatccaatgagaggagtaacagggactttggtatcgaattttttcatttgagtatctattagaaatgaattctccagcatttgattccttactaacaaagaatttattggtacacttgaaaagtaccccagaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggttgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacatttccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttccttgatatcgaacataatgcataaggggatccataacgaaccatatggttttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctagaaaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaagaagattgtttacgaagaaacaacaagaaaaattcatattctgatacataagagttatataggaaccgaaatagtcttttattttcttttttcaaaataaaaatggatttcattgaagtaataaaactattccaattcgagtagtagttgagaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattgaaggagttgaagcaagatatccaaatggataggatagggtatttctatatgtgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattgaatgaatagatcgtaaattctgaaactttggtatttctttttcttccggacaagactgttctcgtagcgagaatgggatttctacaacgatcgcaaacccctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaatccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattctgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaatttcttgttattagaaccaataatttcgacaagttcggaaccatttaatccataatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacgaaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttccatttgtatttctacttgaatcagagagagagaaatatttctcggtttatcaaatggtgatacatagtacaatatggtcagaacagggtgttgcattttttaatacaaacccctggggaagaaaaggagtctaatccacggatctttttccgctccttttctatccaatttgtttatgtttgttctaattacaaaagagaacaaatcctttatttttgcaggccaattgctcttttgactttgggatacagtctctttatcaatatactgcttcttttacacattcaatccataacatccttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaaaatcaaaggtctactcataggaaaaccagcttttccctacatcaggcactaatctatttttaacgtctaattagatcagggagttcttccaattaagaagttaagctcgttgctttttgttttaccagaattggagccaggctctatccatttattcattagacccagaaaatcagaatttttttattccattccaaaaatccaaaataagaaattgattttattacgacatgctattttttccattcattacccttgaggatcagtcgcggtcttatagactctaccaagagtctggacgaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaagccgagtactctaccattgagttagcaacccagataaactaggatcttagatacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaaacactcagataccaaaaggaacgggtctggttaaatttcactaaggttaaaagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttatttaaataaataaataaatcttgtatgagagtacaaacaagagggacaaccctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggataaagagacttatccatctacaaattctagatgttcaatggacctttgtcaatggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaataaaggcttatgttggattggcacgacataaatccagtcaaaaataggattaagaaagaggcaaattatttctaaatagttagacaacaagggatactagtgagcctctcctagttttttattcatttagttcttcaattaactcaaagttctttctttttctttaaagaattccgccttccttaaaatatcagaaacggttcttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatttaaacaagtttgattctttatcggatcataaaaacctacttttcgaagatctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagacagcttattgggatagatgtagataaataaagccccccctagaaacgtataggaggttttctcctcatacggctcgagaatatgacttgcattaatttccgtacagaaaaaacaaatttcatttatactcatgactcaagttgactaattttgattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtctctaaactcttttctttgcctcatctcgaacaaattcacttttattccttattccggtccaattctattgttgagacagttgaaaatcgtgtttacttgttcgggaatcctttatctttgatttgtgaaatccttgggtttaaacattacttcgggaattcttattcttttttctttcaaaagagtagcaacatacccttttttcttatttccttcgataaagcatttccctcttctatagaaatcgaatatgagcgattgattctgatagactttaatcaaaagagttttcccatatcttccaaaattggactttcttcttattttaaccttttgatttctatattatttcgatttctatattaagggtagaatgacaaagttggcctaatttattagttttcactaaccctagattctttcccttgataaaaaataaattctgtcctctcgagctccatcgtgtactatttacttagcttacttacaaacaacccagcgaaaattcggttcgggacgaatagaacagactatgtcgagccaagagcattttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtccttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaacataacattcctctaatttcattgcaaagtgttatagggaattgatccaatatggatggaatcatgaatagtcattagtttcgttttttgtatactaattcaaacttgctttgctatctatggagaaatatgaataaaagaaattaagtatttatcgggaaagactccgcaaagagccaatttatttaaacccatattctatcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgcttaagacttatttattatggaatttccatcctcaacagaggactcgagatgatcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaataaactatcaacctcccgtttaattaatttaattaatatattagattagcaatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaattctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaaaaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttattttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattctatctaacgagcagttcttatcttatctttaccgggatggatcattctggatatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaaaaagaagaaggaaccttttttactaataaaatactataaaaaaaatttatctctatcataaatctatctctaccataaaggaataggtctcgttttttatacaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttcaatttgactggacttgacactggattatgttttctgagacagaaaatgaacgcattaggactgcatcgaatctaagagtttataagagaaaaaaattctctttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgtttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacgggaccaaaacccgctgccttaccacttggccacgccccatttcgggttttatgcgacactaataaacagtattatgtttatttcttattcgtcaatcctacttcaattacataaaaatggggggtattctcttggtaggattctagacatgcgaataatatagaatccaaaaaatgcattgatcattacatggaattctattaagatattatatgaaagtcgaatttcttccactctcatttgagagtgcgaatacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggattttgaatcctccttttcctttttcccttagaaaaataactcaatcaaaatccaattatctactctacaagaacgaaacgcttgttatgcctaatatacttagtttaacctgtatttgttttaattctgttatttatccgactagttttttcttcgccaaattgcccgaagcttatgccattttcaatccaatcgtggattttatgcctgtcatacctgtactcttttttctattagcctttgtttggcaagctgctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatcatgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtcttatattatgaaccttcgattctaaaattcaaattcttctacattgaatgtatagctgcagcaataaatttggatcagcctttctactccctgcatctacgttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattgataagagtgcttattataaatcaattcttgcaatttttttcaaaaattgatttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttgtgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggagattatgtaatgcttactctcaaactttttgtttatacagtagtgatattctttgtttccctctttatctttggattcttatctaatgatccaggacgtaatcctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggatttgtttcatacatttatctacgagaaaatccgggggtcagaattccttccaattcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaaccctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtccactcagccatctctccccgttccaaatcgaaaggtttccgtgatatgacagaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttcttaatgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaaaattggatattggctaaaagacaatcagatagattttctcttcagcaggcatttccatataggacttgttataataaaacaagcaggttatagaaaaaaactcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaaccaacccaccccataaaattggaaagaaagataaagtaagtggacctgactccttgaatgaggcctctatccgctattctgatatataaattcgatgtagatgaaattgtataagtggatttttttgtatttccttagacttagaccacgcaaggcaagaatttctcgctatttactatttcatattcttgttactagatgttctataggaataagaagaaatcgcaacccctttccgctacacataaaaatggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaatcctatttttgttcttatacccatgcaatagagagcgagtgggaaaagggaggttactttttttcattttttccttaaaaaataggctttcttggaaataggaatcatggaataatctgaattccaatgtttatttctatagtataagaaaaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccccatagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggcattgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagtgatataaggtgctcggaaatggttgaagtaattgaataggaggatcactatgactatagcccttggtagagttactaaagaagaaaatgatttatttgatattatggacgactggttacgaagggaccgttttgtttttgtaggatggtctggcctattgctttttccttgtgcttatttcgctttaggaggttggtttacagggacaacttttgtaacttcttggtatacccatggattggcgagttcctatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaatagtttagcacactctttgttgctactatggggcccggaagcacaaggggattttactcgttggtgtcaattaggtggtctgtggacttttgttgctctccatggggcttttgcactaataggtttcatgttacgtcaatttgaacttgctcggtctgttcaattgcggccttataatgcaatttcattctctggcccaatcgctgtttttgtttccgtattcctgatttat

ccactggggcaatccggttggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcctcttcttccaaggatttcataattggacgttgaacccatttcatatgatgggagttgccggagtattaggcgcggctctgctatgcgctattcatggggcaaccgtgga

Page 12:

ID   A1BG_HUMAN     STANDARD;      PRT;   495 AA.
... ... ...
KW   Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal.
... ... ...
SQ   SEQUENCE   495 AA;  54209 MW;  87A50C21CE89459C CRC64;
     MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL
     FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW
     LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY
     SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD
     FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI
     LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE
     LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV
     LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF
     ESELSDPVEL LVAES
//

Protein sequence: 20 letters (amino acids, AA). Length: 50 – 6000 AA

Example: a human immunoglobulin

Page 13:

Gene-Finding by Computer

Starting from the early 1980s:

• "Ab initio" or "de novo" algorithms: GeneMark, GenScan, FgeneSH, Genie, … based on gene-structure models and training data. (Our on-going project: BGF, the BGI Gene Finder)

• Homology methods based on sequence alignment with known genes in databases and comparative genomics of not-too-distant species

• Mixed approach using both strategies: TwinScan

Page 14:

Different Stages of Gene-Finding

• Use all possible existing programs and services on the web with a public-domain or home-made genome viewer

• Write your own gene-finder, trained for the specific organism

• A dream for the time being: design a self-training and self-developing program “for any species” which would improve itself iteratively starting from a few available reads, cDNAs, and ESTs

Page 15:

Performance of Gene-Finders in Eukaryote Genomes

• M. Q. Zhang, Nature Reviews Genetics, 3 (2002) 698-710 (mostly for the human genome):
  Nucleotide level: 80%; Exon level: 45%; Whole gene structure: 20%

• FgeneSH and BGF for rice (our tests on 128 cDNA-confirmed single-gene genomic sequences):
  Nucleotide level: 90%; Exon level: 60%; Whole gene structure: 40%
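The three accuracy levels quoted on this slide can be illustrated with a small sketch. The slides do not define the metrics, so the definitions below are assumptions: nucleotide level is the fraction of true coding positions that were predicted, exon level is the fraction of true exons predicted with both boundaries exact, and a whole gene counts as correct only if every exon matches. The function names and the toy coordinates are hypothetical.

```python
def nucleotide_set(exons):
    """Expand (start, end) exon intervals (end exclusive) into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end)}

def accuracy(true_exons, predicted_exons):
    """Return nucleotide-level, exon-level and whole-gene agreement for one gene."""
    true_nt = nucleotide_set(true_exons)
    pred_nt = nucleotide_set(predicted_exons)
    # Nucleotide level: fraction of true coding positions that were predicted.
    nt_level = len(true_nt & pred_nt) / len(true_nt)
    # Exon level: fraction of true exons predicted with both boundaries exact.
    exon_level = len(set(true_exons) & set(predicted_exons)) / len(true_exons)
    # Whole-gene level: every exon exactly right, with no extra exons.
    gene_correct = set(true_exons) == set(predicted_exons)
    return nt_level, exon_level, gene_correct

# Toy gene: the middle exon is predicted with its start shifted by 10 bases.
true_exons = [(100, 200), (300, 380), (500, 620)]
predicted = [(100, 200), (310, 380), (500, 620)]
nt_level, exon_level, gene_correct = accuracy(true_exons, predicted)
```

One slightly misplaced boundary barely changes the nucleotide-level score, but it costs a whole exon and the whole gene, which is consistent with the steep drop from nucleotide level to whole-gene level in the figures above.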

Page 16:

5' 3'

3' 5'

• Each strand carries the same amount of information, but a different set of genes.

• The two strands are equivalent in information content, but not in gene content.

• Biological processing (duplication, transcription) goes from 5' to 3'.

• Finding genes on both strands at the same time or on one strand at a time: one-pass or two-pass programs.
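A two-pass program can be sketched as running the same search twice, once on the given strand and once on its reverse complement, and then mapping the second set of hits back to forward-strand coordinates. The sketch below uses a plain motif search as a stand-in for a real gene-finder; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_all(seq, motif):
    """All 0-based start positions of motif in seq, overlaps included."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits

def two_pass_search(seq, motif):
    """Search the forward strand, then the reverse strand.

    Reverse-strand hits are reported as the leftmost forward-strand
    coordinate covered by the motif.
    """
    hits = [("+", i) for i in find_all(seq, motif)]
    rc = reverse_complement(seq)
    n, m = len(seq), len(motif)
    hits += [("-", n - i - m) for i in find_all(rc, motif)]
    return hits

hits = two_pass_search("aaatgcccatttt", "atg")
```

Here the forward strand contains `atg` at position 2, and the `cat` at position 7 is an `atg` on the opposite strand.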

Page 17:

Labels along the sequence: 5', 5'-UTR, start, stop, 3'-UTR, 3'

Genomic DNA

↓ transcribe (RNA Pol II + …)

Pre-mRNA

↓ splice (spliceosome: U1, U2, U4, U5, U6 snRNPs)

mRNA

↓ translate (ribosome + initiation, elongation, and termination factors)

AA seq (protein primary seq)

↓ fold (chaperonin)

Protein fold

Page 18:

Three Scales of Search

• Local: signals with minimal signature (start, stop, splicing); movable signals (caps, promoters, poly-A signals, branch points, some very weak) --- clustering, discriminant analysis, various statistical models

• Intermediate: exons, introns, intergenic --- Markov, semi-Markov, Hidden-Markov models; intron length distribution

• Global: optimal combination of the above --- dynamic programming

Page 19:

{()【( . )( . )( . )】()}

Signals:

• { transcription start (downstream of promoters)

• } transcription end (upstream of poly-A)

• 【 translation start (atg, 1/64 in a random seq.)

• 】 translation end (tag, tga, taa, 3/64)

• ( splicing donor site (minimal signal = gt, 1/16)

• ) splicing acceptor site (ag, 1/16)

• · branch point (very weak …a…)

Labels on the structure above, left to right: transcription start, translation start, translation end, transcription end
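The random-sequence frequencies quoted for these signals (1/64 for a start codon, 3/64 for the three stop codons, 1/16 for each dinucleotide splice signal) follow from simple counting under the assumption of four equiprobable, independent bases. A minimal sketch that checks them by enumeration (the function name is hypothetical):

```python
from fractions import Fraction
from itertools import product

def random_frequency(motifs):
    """Probability that a random equiprobable k-mer equals one of `motifs`."""
    k = len(next(iter(motifs)))
    kmers = ["".join(p) for p in product("acgt", repeat=k)]
    return Fraction(sum(kmer in motifs for kmer in kmers), len(kmers))

start = random_frequency({"atg"})
stop = random_frequency({"tag", "tga", "taa"})
donor = random_frequency({"gt"})
acceptor = random_frequency({"ag"})
```

Signals this short occur by chance every few dozen bases, which is why the minimal signatures alone cannot identify genes and must be combined with the intermediate and global scales of search.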

Page 20:

{()【( . )( . )( . )】()}

• 【( First exon

• )( Internal exon

• )】 Last exon

• {( Non-coding 5’ exon

• )【 Non-coding 5’ exon

• ( . ) Intron

• 】( Non-coding 3’ exon (rare)

• )} Non-coding 3’ exon (rare)

• }{ Intergenic region

Labels on the structure above, left to right: transcription start, translation start, translation end, transcription end
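The table of signal pairs on this slide can be read as a lookup: each pair of consecutive signals determines the type of the segment between them. The sketch below applies that lookup to the example structure; the dictionary and function names are hypothetical, and pairs that the slide does not list are simply reported as such.

```python
# Segment type for each pair of consecutive signals, as listed on the slide.
SEGMENT_TYPES = {
    ("{", "("): "non-coding 5' exon",
    (")", "【"): "non-coding 5' exon",
    ("【", "("): "first exon",
    (")", "("): "internal exon",
    (")", "】"): "last exon",
    ("(", ")"): "intron",
    ("】", "("): "non-coding 3' exon",
    (")", "}"): "non-coding 3' exon",
    ("}", "{"): "intergenic region",
}

def label_segments(structure):
    """Label the segment between every pair of neighbouring signals."""
    # Keep only the signal characters; spaces and branch-point dots are skipped.
    signals = [ch for ch in structure if ch in "{}【】()"]
    return [SEGMENT_TYPES.get(pair, "not listed on the slide")
            for pair in zip(signals, signals[1:])]

labels = label_segments("{()【( . )( . )( . )】()}")
```

For the example structure this yields 13 segments: a 5' non-coding exon split by one intron, a first exon, two internal exons, a last exon, three coding-region introns, and a 3' non-coding exon split by one intron.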

Page 21:

Signal and Sequence Models

• eiid: equal probability independently and identically distributed

• niid: non-equal probability independently and identically distributed

• WWM: Windowed weight matrix, etc.

• MMn: Markov chain model of order n; homogeneous and period-3 fifth-order models (MM5) are used in many gene-finders

• Consensus sequence
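As an illustration of the MMn item, here is a minimal sketch of a homogeneous Markov chain model of order n used as a coding/non-coding discriminator. It is a toy under stated assumptions: add-one smoothing, order 2 rather than 5, tiny made-up training strings, and no period-3 structure. All names are hypothetical and nothing here is taken from an actual gene-finder.

```python
import math
from collections import defaultdict

def train_markov(sequences, order):
    """Estimate P(next base | previous `order` bases) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        return (ctx.get(base, 0) + 1) / (sum(ctx.values()) + 4)

    return prob

def log_likelihood(seq, prob, order):
    """Log-probability of seq under the model, ignoring the first `order` bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def log_odds(seq, coding, noncoding, order):
    """Positive when the coding model explains seq better than the non-coding one."""
    return log_likelihood(seq, coding, order) - log_likelihood(seq, noncoding, order)

# Toy training data (made up for illustration only).
coding = train_markov(["gccgccgccgccgcc"], order=2)
noncoding = train_markov(["atatatatatatatat"], order=2)
score_gc = log_odds("gccgccgcc", coding, noncoding, 2)
score_at = log_odds("atatatat", coding, noncoding, 2)
```

A period-3 model would keep three such tables, one per codon position, and choose the table according to the reading frame.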

Page 22:

Consensus Sequences

(A number after a base is the percentage of sequences with that base at that position.)

• TATAAT (Pribnow or -10 box): T80 A95 T45 A60 A50 T96

• TTGACA (-35 box): T82 T84 G78 A65 C54 A45

• CAAT (CAAT or -75 box): GGYCAATCT

• TATA (TATA or Goldberg-Hogness box): TATAWAW

• ATG (translation start point)

However, in Aful: ATG 76%, GTG 22%, TTG 2%
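Consensus strings such as GGYCAATCT and TATAWAW use nucleotide ambiguity codes (Y = C or T, W = A or T). A minimal sketch of searching for such a consensus by translating it into a regular expression; the function name and the test sequence are made up for illustration, and only a subset of the ambiguity codes is included.

```python
import re

# Subset of the standard nucleotide ambiguity codes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]", "N": "[ACGT]"}

def find_consensus(seq, consensus):
    """Return 0-based start positions where seq matches the consensus."""
    body = "".join(IUPAC[ch] for ch in consensus)
    # A lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(seq)]

seq = "CCGGCCAATCTGGTATAAATGC"
caat_hits = find_consensus(seq, "GGYCAATCT")
tata_hits = find_consensus(seq, "TATAWAW")
```

An exact consensus match ignores the per-position percentages shown above; a weight matrix would use them to score partial matches instead.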

Page 24:

GT-AG Rule for Introns

(A number after a base is its percentage at that position; "|" marks the exon-intron boundary.)

5' splicing donor site: exon … A64 G73 | G100 T100 A62 A68 G84 T63 …

3' splicing acceptor site: … 12Py N C65 A100 G100 | N … exon
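Because only GT at the donor site and AG at the acceptor site are invariant, a naive scan produces many candidate introns. The sketch below enumerates every GT…AG span in a sequence; the function name, the `min_len` parameter, and the toy sequence are assumptions made for illustration.

```python
def candidate_introns(seq, min_len=4):
    """List (start, end) spans that begin with gt and end with ag (end exclusive)."""
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptors = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "ag"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]

introns = candidate_introns("ccgtaagaacagcc")
```

Even this 14-base toy sequence has two candidates for a single donor site, which is why gene-finders score the flanking positions and the intron length as well.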

Page 26:

[Figure: Exon and intron size distribution; columns: exon, intron; rows: Arabidopsis, rice, human]

Page 27:

Algorithms

• Sequence models and scores for signals

• Dynamic programming: optimal parse

• Hidden Markov Model: implies a geometric distribution of intron lengths

• Semi-Hidden Markov Model (semi-HMM): needs a sequence-generating model and a length distribution for each node

• Language theory approach
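The "dynamic programming: optimal parse" item can be illustrated with the Viterbi algorithm on a toy two-state HMM. Everything in this sketch is an assumption chosen for illustration (two states, GC-rich "exon" emissions, AT-rich "intron" emissions, a self-transition probability of 0.9); it is not the model of any gene-finder named in these slides.

```python
import math

STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
        "intron": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4}}

def viterbi(seq):
    """Return the most probable state path for seq (log-space dynamic programming)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("gcgcgcatatatgcgcgc")
```

With a self-transition probability p, the time spent in a state is geometrically distributed with mean 1/(1 - p), about 10 bases here. A semi-HMM replaces this implicit geometric law with an explicit length distribution per state.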

Page 28:

Flow Chart of GenScan

• Chris Burge (1996): a 27-state semi-HMM

• A simpler model: 19-state

• A model taking UTR introns into account: 35-state

Page 29:

Figure: N, intergenic region; P, promoter; F, 5' UTR; $E_{\mathrm{sngl}}$, single-exon gene; $E_{\mathrm{init}}$, initial exon; $E_k$ ($0 \le k \le 2$), phase $k$ internal exon; $E_{\mathrm{term}}$, terminal exon; T, 3' UTR; A, polyadenylation signal; and $I_k$ ($0 \le k \le 2$), phase $k$ intron.

Page 30:

Problems: Minor and Major

• Ambiguity symbols (N, W, S, R, …)

• (1-p) at flanking D-type nodes

• Indels and frame-shifts

• Gradient effects in gene structure

• Introns in 5’-UTRs and 3’-UTRs: leading to 35-state Markov Models

• Alternative splicing and sub-optimal paths

• Limit of probabilistic models

• Deterministic approaches

Page 31:

Dyck language: A language of nested parentheses

• Many types of parentheses

• Finite depth of nesting

• Context-free language

Our case:

• Only 3 types of parentheses

• Shallow nesting

• Conjecture: may be a regular language
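To illustrate why shallow, fixed nesting suggests a regular language, the sketch below recognizes the bracket notation used earlier in these slides with an ordinary regular expression, which would be impossible for a Dyck language of unbounded depth. The grammar is an assumption read off the example `{()【( . )( . )( . )】()}`: a gene is `{ … }` with optional UTR introns `()` on either side of a coding region `【 … 】` that contains zero or more introns `( . )`. This is an illustration, not a proof of the conjecture.

```python
import re

# One gene: { UTR introns 【 coding-region introns 】 UTR introns }
GENE = r"\{(?:\(\))*【(?:\(\.\))*】(?:\(\))*\}"
# A sequence of one or more genes; adjacent "}{" is an intergenic region.
GENOME = re.compile("(?:" + GENE + ")+")

def is_valid_structure(text):
    """True if text (whitespace ignored) is a well-formed series of genes."""
    compact = "".join(text.split())
    return GENOME.fullmatch(compact) is not None

ok_single = is_valid_structure("{()【( . )( . )( . )】()}")
ok_two_genes = is_valid_structure("{()【( . )】()}{【】}")
bad_crossing = is_valid_structure("{【( . 】)}")
```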

Page 32:

Two Subspecies of Rice

• Oryza sativa ssp. indica (籼稻)

• Oryza sativa ssp. japonica (粳稻)

The difference was described in Xu Shen's (许慎) Chinese dictionary Shuowen Jiezi (《说文解字》) of the Eastern Han Dynasty (~2nd century AD).

J.H. Zhang et al., Rice cultivation of Jianhu Remains in Henan Province, Science J. (《科学》杂志), 53(4), 2002, 3 (in Chinese)

Page 33:

Two Test Datasets for Rice Gene-Finders

• The 28469 japonica full-length cDNAs (Kikuchi et al., Science 301, 18 July 2003)

• Select a high-quality subset without overlaps with publicly available cDNAs

• A single-gene set: 500 sequences with one gene in each

• A multi-gene set: 46 sequences with 199 genes in total (at least 4 genes in a sequence)

Page 34:

Assessment of Gene-Finders

Tests done between 22 July and 2 August 2003

• FgeneSH (trained on monocotyledons)

• GeneMark.hmm

• RiceHMM

• GlimmerR

• GenScan (trained on maize)

• BGF (rise.genomics.org.cn/bgf/)

Page 35:

Our Ultimate Goal

• An iterative, self-training, self-improving gene-finder "for any species", starting from a small number of reads, with or without EST or cDNA support

• Annotation and re-annotation of the rice genomes

• Plant comparative genomics, especially that of Gramineae and Cruciferae

Page 36:

tRNA features

• tRNA gene → pre-tRNA → mature tRNA

• Mature tRNA: 75 – 95 bases

• Cloverleaf-like structure

• Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm

Page 37:

How many tRNA genes are present in an organism?

• Codon → tRNA → amino acid

• 61 amino-acid-encoding codons

• 20 amino acids

• Are there 61 species of tRNA, with all possible anticodons?

• Met (M) has one codon but two tRNAs

Page 38:

Wobble hypothesis (Crick, 1966)

• Many tRNAs recognize more than one codon

• Through non-Watson-Crick base pairings

• Fewer than 61 tRNAs are needed

Page 39:

The Modified Wobble Hypothesis (Guthrie & Abelson 1982)

• In eukaryotes, 46 different tRNA species would be enough.

• The modified wobble hypothesis holds almost perfectly in H. sapiens, S. cerevisiae, A. thaliana, and C. elegans, whose complete sets of tRNAs are now known.

Page 40:

tRNA copies in Arabidopsis (A), C. elegans (C), and Human (H)

aa codon  A  C  H anti | aa codon  A  C  H anti | aa codon  A  C  H anti | aa codon  A  C  H anti
F  UUU    0  0  0 AAA  | S  UCU   37 14 10 AGA  | Y  UAU    0  0  1 AUA  | C  UGU    0  0  0 ACA
F  UUC   16 16 14 GAA  | S  UCC    1  0  0 GGA  | Y  UAC   76 19 11 GUA  | C  UGC   15 13 30 GCA
L  UUA    6  5  8 UAA  | S  UCA    9  7  5 UGA  | *  UAA    0  0  1 UUA  | *  UGA    0  0  0 UCA
L  UUG   10  7  6 CAA  | S  UCG    4  5  4 CGA  | *  UAG    0  0  1 CUA  | W  UGG   14 11  7 CCA
L  CUU   11 18 13 AAG  | P  CCU   16  6 11 AGG  | H  CAU    0  0  0 AUG  | R  CGU    9 18  9 ACG
L  CUC    1  0  0 GAG  | P  CCC    0  0  0 GGG  | H  CAC   10 17 12 GUG  | R  CGC    0  1  0 GCG
L  CUA   10  3  2 UAG  | P  CCA   39 34 10 UGG  | Q  CAA    8 18 11 UUG  | R  CGA    6 10  7 UCG
L  CUG    3  5  6 CAG  | P  CCG    5  3  4 CGG  | Q  CAG    9  7 21 CUG  | R  CGG    4  3  5 CCG
I  AUU   20 19 13 AAU  | T  ACU   10 17  8 AGU  | N  AAU    0  0  1 AUU  | S  AGU    0  0  0 ACU
I  AUC    0  0  1 GAU  | T  ACC    0  0  0 GGU  | N  AAC   16 20 33 GUU  | S  AGC   13  9  7 GCU
I  AUA    5  8  5 UAU  | T  ACA    8 11 10 UGU  | K  AAA   13 16 16 UUU  | R  AGA    9  7  5 UCU
M  AUG   23 20 17 CAU  | T  ACG    6  7  7 CGU  | K  AAG   18 33 22 CUU  | R  AGG    8  3  4 CCU
V  GUU   15 19 20 AAC  | A  GCU   16 21 25 AGC  | D  GAU    0  0  0 AUC  | G  GGU    1  0  0 ACC
V  GUC    0  0  0 GAC  | A  GCC    0  0  0 GGC  | D  GAC   23 22 10 GUC  | G  GGC   23 14 11 GCC
V  GUA    7  6  5 UAC  | A  GCA   10 10 10 UGC  | E  GAA   12 17 14 UUC  | G  GGA   12 33  5 UCC
V  GUG    8  5 19 CAC  | A  GCG    7  4  5 CGC  | E  GAG   13 20  8 CUC  | G  GGG    5  3  8 CCC

(aa: amino acid, * = stop codon; anti: anticodon)

Page 41:

tRNA Genes in the Rice Genome (Found by tRNAscan-SE + BLASTN)

Chromosome   Indica (BGI)        Japonica/Syngenta (IRGSP)
1            85                  71 (85)
2            57                  59
3            79                  68
4            45                  46 (41)
5            58                  56
6            38                  32
7            34                  35
8            45                  42
9            34                  32
10           28                  23 (28)
11           23                  24
12           38                  36
Total        564 (in 382 Mbp)    519 (in 360 Mbp)

Page 42:

Chloroplast tRNA genes in ssp. indica and japonica

• 33 tRNA genes were found in each of the indica and japonica chloroplast genomes.

• They are completely identical; no mutation was found (E. C. Kemmerer and Ray Wu had found two tRNA genes to be perfectly conserved).

• It is remarkable that, in spite of more than 9000 years of separation, no mutation could be observed in the chloroplast tRNA genes of the two subspecies.

Page 43:

The End

Thank you!

Page 52:

Some Informatics Work Related to the Rice (Oryza sativa L. ssp. indica) Draft Genome

HAO Bailin ( 郝柏林 )

Beijing Genomics Institute (BGI)

Institute of Theoretical Physics (ITP)

T-Life Research Center, Fudan University

http://www.itp.ac.cn/~hao/

Page 53:

Informatics Problems

• Collection and quality control of data

• Assembling of reads, dealing with repeats

• Gene-finding and annotation: RNA genes, protein-coding genes

• Prediction of structure and function

• Connection to gene expression data

Page 55:

Genetic Material

• DNA: linear or circular

• Chromosome: DNA + histones

• Mitochondria

• Chloroplast

• Plasmids: linear or circular

Page 56:

Two Kinds of Tasks

• Developing a new method of gene-finding – a more or less academic job

• Finding genes in a given genomic sequence – a practical job

Page 57:

The Transfer RNA Genes in the Rice (Oryza sativa ssp. indica) Collection of Contigs

WANG Xiyin( 王希胤 ) SHI Xiaoli( 史晓黎 )

(Peking U and BGI)

HAO Bailin( 郝柏林 )

(BGI, Fudan University, and ITP)

Page 58:

tRNA function

• tRNAs are the actual translators from mRNA codons to the amino acids of a protein.

• Bridge between RNA world and protein world

• Naming convention:

trnQ-UUG

tRNA features

• tRNA gene → pre-tRNA → mature tRNA

• Mature tRNA: 75 – 95 bases

• Cloverleaf like structure

• Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm

tRNA structure

[Figure: tRNA structure, with the 5′ and 3′ ends labeled]

How many tRNA genes are present in an organism?

• Codon → tRNA → amino acid

• 61 sense (coding) codons

• 20 amino acids

• Are there 61 species of tRNA with all possible anticodons ?

• Met (M) has one codon but two tRNAs
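
The figure of 61 follows from the 64 possible triplets minus the three stop codons of the standard genetic code. A short sketch of that count (variable names are illustrative only):

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code

# All 4^3 triplets, then the subset that encodes an amino acid.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons), len(sense_codons))
```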

The Wobble Hypothesis

• The Wobble Hypothesis (Crick, 1966)

• The Modified Wobble Hypothesis (1982): 46 tRNA species would be enough

• What has been found in yeast, worm and human:

Wobble hypothesis (Crick, 1966)

• Many tRNAs recognize more than one codon

• Through non-Watson-Crick base pairings

• Fewer than 61 tRNAs are needed

Wobble rules by Crick

I = inosine

Codon (base 3)    Anticodon (base 1)
U                 A, G, I
C                 G, I
A                 U, I
G                 C, U
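
As an illustration, the table above can be turned into a lookup that lists the codons a given anticodon may read. This is only a sketch: the function names are hypothetical, and "I" (inosine) is accepted only as the first anticodon base.

```python
# Wobble table as given above: codon base 3 -> allowed anticodon base 1.
WOBBLE = {"U": "AGI", "C": "GI", "A": "UI", "G": "CU"}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def readable_third_bases(anticodon_base1):
    """Codon third bases that the given anticodon first base can pair with."""
    return sorted(c for c, partners in WOBBLE.items() if anticodon_base1 in partners)


def codons_read(anticodon):
    """All codons (5'->3') an anticodon (5'->3') can read under the table."""
    first, second, third = anticodon
    # Codon bases 1 and 2 pair strictly with anticodon bases 3 and 2.
    stem = COMPLEMENT[third] + COMPLEMENT[second]
    return [stem + base for base in readable_third_bases(first)]


print(codons_read("GAA"))  # anticodon of tRNA-Phe
print(codons_read("IGC"))  # an inosine-containing anticodon
```

Under these rules anticodon GAA covers both phenylalanine codons, and an anticodon starting with inosine covers three codons of its box.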

The Modified Wobble Hypothesis

• In eukaryotes, 46 different tRNA species would be enough.

• The revised wobble hypothesis is almost perfectly obeyed by H. sapiens, S. cerevisiae, A. thaliana, and C. elegans, whose complete collections of tRNAs are now known.

Revised wobble hypothesis in eukaryotes (Guthrie & Abelson, 1982)

Four-codon boxes

Codon (base 3)    Anticodon (base 1)
U                 G, I
C                 G, I
A                 U
G                 C

Two-codon boxes

UUU UCU UAU UGU
UUC UCC UAC UGC
UUA UCA UAA UGA
UUG UCG UAG UGG

CUU CCU CAU CGU
CUC CCC CAC CGC
CUA CCA CAA CGA
CUG CCG CAG CGG

AUU ACU AAU AGU
AUC ACC AAC AGC
AUA ACA AAA AGA
AUG ACG AAG AGG

GUU GCU GAU GGU
GUC GCC GAC GGC
GUA GCA GAA GGA
GUG GCG GAG GGG

Modified wobble hypothesis in eukaryotes

• In two-codon boxes

Codon base 3    Anticodon base 1
U & C           G

• In four-codon boxes

Codon base 3    Anticodon base 1
U & C           A (I)

• One exceptional four-codon box for Gly

Codon base 3    Anticodon base 1
U & C           G
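
One way to arrive at the figure of 46 is sketched below. It assumes that in every box the U- and C-ending codons share a single tRNA, that every A-ending and every G-ending sense codon has its own tRNA, and that the initiator tRNA-Met is counted separately from the elongator. This is a reading of the rules on these slides, not necessarily the original derivation; the function name is hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def minimal_trna_species():
    """Count tRNA species needed under the assumptions stated above."""
    species = 0
    for first, second in product(BASES, repeat=2):
        box = first + second
        # One tRNA reads both the U- and the C-ending codon of the box.
        species += 1
        # Each A- or G-ending codon needs its own tRNA unless it is a stop.
        species += sum(CODE[box + third] != "*" for third in "AG")
    # Initiator tRNA-Met is a separate species with the same anticodon.
    return species + 1


print(minimal_trna_species())
```

The loop gives 16 + 14 + 15 = 45 distinct anticodons; the extra initiator tRNA brings the total to 46.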

Human codon usage and tRNA genes

Columns in each group: codon, codon usage, tRNA gene number, anticodon.

UUU 171  0 AAA   UCU 147 10 AGA   UAU 124  1 AUA   UGU  99  0 ACA
UUC 203 14 GAA   UCC 172  0 GGA   UAC 158 11 GUA   UGC 119 30 GCA
UUA  73  8 UAA   UCA 118  5 UGA   UAA   0  0 UUA   UGA   0  0 UCA
UUG 125  6 CAA   UCG  45  4 CGA   UAG   0  0 CUA   UGG 122  7 CCA

CUU 127 13 AAG   CCU 175 11 AGG   CAU 104  0 AUG   CGU  47  9 ACG
CUC 187  0 GAG   CCC 195  0 GGG   CAC 147 12 GUG   CGC 107  0 GCG
CUA  69  2 UAG   CCA 170 10 UGG   CAA 121 11 UUG   CGA  63  7 UCG
CUG 392  6 CAG   CCG  69  4 CGG   CAG 343 21 CUG   CGG 115  5 CCG

AUU 165 13 AAU   ACU 131  8 AGU   AAU 174  1 AUU   AGU 121  0 ACU
AUC 218  1 GAU   ACC 192  0 GGU   AAC 199 33 GUU   AGC 191  7 GCU
AUA  71  5 UAU   ACA 150 10 UGU   AAA 248 16 UUU   AGA 113  5 UCU
AUG 221 17 CAU   ACG  63  7 CGU   AAG 331 22 CUU   AGG 110  4 CCU

GUU 111 20 AAC   GCU 185 25 AGC   GAU 230  0 AUC   GGU 112  0 ACC
GUC 146  0 GAC   GCC 282  0 GGC   GAC 262 10 GUC   GGC 230 11 GCC
GUA  72  5 UAC   GCA 160 10 UGC   GAA 301 14 UUC   GGA 188  5 UCC
GUG 288 19 CAC   GCG  74  5 CGC   GAG 404  8 CUC   GGG 160  8 CCC

tRNA copies in Arabidopsis (A), C. elegans (C), and Human (H)

aa codon  A  C  H anti   aa codon  A  C  H anti   aa codon  A  C  H anti   aa codon  A  C  H anti
F  UUU    0  0  0 AAA    S  UCU   37 14 10 AGA    Y  UAU    0  0  1 AUA    C  UGU    0  0  0 ACA
F  UUC   16 16 14 GAA    S  UCC    1  0  0 GGA    Y  UAC   76 19 11 GUA    C  UGC   15 13 30 GCA
L  UUA    6  5  8 UAA    S  UCA    9  7  5 UGA    *  UAA    0  0  1 UUA    *  UGA    0  0  0 UCA
L  UUG   10  7  6 CAA    S  UCG    4  5  4 CGA    *  UAG    0  0  1 CUA    W  UGG   14 11  7 CCA

L  CUU   11 18 13 AAG    P  CCU   16  6 11 AGG    H  CAU    0  0  0 AUG    R  CGU    9 18  9 ACG
L  CUC    1  0  0 GAG    P  CCC    0  0  0 GGG    H  CAC   10 17 12 GUG    R  CGC    0  1  0 GCG
L  CUA   10  3  2 UAG    P  CCA   39 34 10 UGG    Q  CAA    8 18 11 UUG    R  CGA    6 10  7 UCG
L  CUG    3  5  6 CAG    P  CCG    5  3  4 CGG    Q  CAG    9  7 21 CUG    R  CGG    4  3  5 CCG

I  AUU   20 19 13 AAU    T  ACU   10 17  8 AGU    N  AAU    0  0  1 AUU    S  AGU    0  0  0 ACU
I  AUC    0  0  1 GAU    T  ACC    0  0  0 GGU    N  AAC   16 20 33 GUU    S  AGC   13  9  7 GCU
I  AUA    5  8  5 UAU    T  ACA    8 11 10 UGU    K  AAA   13 16 16 UUU    R  AGA    9  7  5 UCU
M  AUG   23 20 17 CAU    T  ACG    6  7  7 CGU    K  AAG   18 33 22 CUU    R  AGG    8  3  4 CCU

V  GUU   15 19 20 AAC    A  GCU   16 21 25 AGC    D  GAU    0  0  0 AUC    G  GGU    1  0  0 ACC
V  GUC    0  0  0 GAC    A  GCC    0  0  0 GGC    D  GAC   23 22 10 GUC    G  GGC   23 14 11 GCC
V  GUA    7  6  5 UAC    A  GCA   10 10 10 UGC    E  GAA   12 17 14 UUC    G  GGA   12 33  5 UCC
V  GUG    8  5 19 CAC    A  GCG    7  4  5 CGC    E  GAG   13 20  8 CUC    G  GGG    5  3  8 CCC

Distribution of tRNA genes in a genome

• A kind of repeats

• Usually clustered together

• Distributed unevenly among chromosomes

• For example, in the human genome 140 tRNA genes (up to 25% of the total) form a cluster within a narrow region of only 4 Mbp on chromosome 6

BGI Rice Contigs

• 127 550 contigs of total length 361Mb (from the estimated genome of 466Mb)

• N50 size: 6690bp

• It makes sense to look for tRNAs, since their length is around 75-95bp and it is possible to catch most of them.
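
For reference, N50 is the contig length at which the contigs of that length or longer contain at least half of all assembled bases. A small sketch with made-up contig lengths (not the BGI data; the function name is hypothetical):

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(contig_lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Hypothetical toy lengths in bp, chosen only to show the calculation.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))
```

With an N50 of 6690 bp, a gene of 75-95 bp is far shorter than a typical contig, which is why most tRNA genes can be expected to lie intact within a single contig.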

How many tRNA genes are there in the rice genome?

Is the revised wobble hypothesis obeyed?

Are there 46 species of tRNA genes?

How many copies are there of each tRNA?

With the help of the tRNAscan-SE program and BLASTN, a collection of tRNA genes was obtained:

• 592 canonical tRNA genes

• 3 possible selenocysteine tRNA genes

• 1 possible suppressor tRNA gene

• 27 possible pseudo-tRNA-genes

592 canonical tRNA genes

• BLASTN confirmed tRNA: 467

• Probable novel tRNA: 74

• Putative novel tRNA: 51

• “Novel” means more adapted to rice

27 pseudo-tRNA genes

• Genomic sequences structurally related to tRNA

• Unable to yield active gene products

• May have insertions, deletions

• May lack functional promoters

• Experiments needed to test if they are really functionally inactive

• Divided into four classes:

End-truncated type

Insertion-disrupted

“Non-maintained”

Non-tRNA but pol III-like elements

Rice codons and tRNA genes

Columns in each group: codon, codon usage, tRNA gene number, anticodon.

UUU 137  0 AAA   UCU 100 17 AGA   UAU 102  0 AUA   UGU  61  0 ACA
UUC 262 15 GAA   UCC 150  0 GGA   UAC 159 16 GUA   UGC 128 10 GCA
UUA  59  7 UAA   UCA 110 10 UGA   UAA   6  0 UUA   UGA  10  0 UCA
UUG 152  9 CAA   UCG 115  7 CGA   UAG   3  0 CUA   UGG 138 12 CCA

CUU 153 19 AAG   CCU 114 16 AGG   CAU 120  0 AUG   CGU  73 16 ACG
CUC 276  0 GAG   CCC 106  0 GGG   CAC 157 11 GUG   CGC 141  0 GCG
CUA  83  8 UAG   CCA 144 11 UGG   CAA 124 16 UUG   CGA  77  4 UCG
CUG 216  6 CAG   CCG 153 10 CGG   CAG 225 13 CUG   CGG 106  7 CCG

AUU 140 23 AAU   ACU 105  9 AGU   AAU 134  0 AUU   AGU  72  0 ACU
AUC 229  0 GAU   ACC 161  0 GGU   AAC 198 14 GUU   AGC 166 13 GCU
AUA  89  6 UAU   ACA 120  8 UGU   AAA 144 10 UUU   AGA  97  9 UCU
AUG 249 27 CAU   ACG 113  0 CGU   AAG 325 22 CUU   AGG 142 10 CCU

GUU 171 21 AAC   GCU 187 25 AGC   GAU 241  0 AUC   GGU 155  0 ACC
GUC 223  0 GAC   GCC 279  0 GGC   GAC 292 28 GUC   GGC 340 24 GCC
GUA  66  4 UAC   GCA 196 11 UGC   GAA 205 15 UUC   GGA 159 13 UCC
GUG 226 10 CAC   GCG 264 13 CGC   GAG 393 29 CUC   GGG 158  8 CCC

The wobble hypothesis is perfectly obeyed by the rice genome!

45 species of tRNA genes found so far.
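
The comparison behind this statement can be sketched as follows. The tRNA gene counts per anticodon are copied from the rice table in these slides. The expected anticodon set is derived under an assumed reading of the modified wobble rules: anticodon base 1 is A (read as I) when the U-, C- and A-ending codons of a box share an amino acid, G otherwise and in the Gly box, and each A- or G-ending sense codon has its own tRNA. All names are illustrative.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
GLY = "G"  # one-letter amino acid code for glycine, not the base G

# Rice tRNA gene numbers by anticodon, copied from the rice table.
RICE_TEXT = """
AAA 0  AGA 17 AUA 0  ACA 0   GAA 15 GGA 0  GUA 16 GCA 10
UAA 7  UGA 10 UUA 0  UCA 0   CAA 9  CGA 7  CUA 0  CCA 12
AAG 19 AGG 16 AUG 0  ACG 16  GAG 0  GGG 0  GUG 11 GCG 0
UAG 8  UGG 11 UUG 16 UCG 4   CAG 6  CGG 10 CUG 13 CCG 7
AAU 23 AGU 9  AUU 0  ACU 0   GAU 0  GGU 0  GUU 14 GCU 13
UAU 6  UGU 8  UUU 10 UCU 9   CAU 27 CGU 0  CUU 22 CCU 10
AAC 21 AGC 25 AUC 0  ACC 0   GAC 0  GGC 0  GUC 28 GCC 24
UAC 4  UGC 11 UUC 15 UCC 13  CAC 10 CGC 13 CUC 29 CCC 8
"""
tokens = RICE_TEXT.split()
RICE = {anti: int(n) for anti, n in zip(tokens[0::2], tokens[1::2])}


def expected_anticodons():
    """Anticodons predicted by the assumed reading of the modified rules."""
    expected = set()
    for first, second in product(BASES, repeat=2):
        box = first + second
        tail = COMPLEMENT[second] + COMPLEMENT[first]  # anticodon bases 2 and 3
        aa = CODE[box + "U"]
        shares_a = CODE[box + "A"] == aa
        expected.add(("A" if shares_a and aa != GLY else "G") + tail)
        if CODE[box + "A"] != "*":
            expected.add("U" + tail)
        if CODE[box + "G"] != "*":
            expected.add("C" + tail)
    return expected


observed = {anti for anti, n in RICE.items() if n > 0}
expected = expected_anticodons()
print("expected but not found:", sorted(expected - observed))
print("found but not expected:", sorted(observed - expected))
```

With the table values, the only expected anticodon without a gene is CGU (Thr), and no gene carries an anticodon outside the expected set. Counting the initiator tRNA-Met separately gives the 45 species quoted here.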

On the absence of trnT-CGU gene

• In fact, six possible trnT-CGU genes were found but discarded for low similarity to known tRNA genes.

• The incompleteness of data: only 361Mb in contigs.

• The tendency of tRNA genes to cluster together in a genome.

• Almost sure to be found (3 trnT-CGU genes were found in japonica).

• Rice is not an exception to the wobble hypothesis.

36 tRNA genes have an intron

• 6% of the total

• All have only one intron

• Intron length: generally 12-20 bp; the longest is 38 bp

• All trnY-GUA genes have an intron

• All non-initiator trnM-CAU genes have an intron, while all initiator trnM-CAU genes have no intron

One possible suppressor tRNA gene

A suppressor tRNA is a mutant tRNA that recognizes a stop codon (UAA/UAG) instead of the codon for the cognate amino acid. This is sometimes, but not always, due to a base substitution in the anticodon.

Here, it recognizes the stop codon UAA.

3 possible selenocysteine tRNA genes

• Selenocysteine is the 21st amino acid, found in every domain of life on Earth.

• While there are many more amino acids than those twenty which are part of the standard genetic code, only selenocysteine and pyrrolysine have been discovered to be coded genetically.

• In fact, selenocysteine is encoded by the UGA codon, the umber (opal) termination codon.

• Different mechanisms are adopted in prokaryotes and eukaryotes to tell the translation machinery of the cells that it should continue or terminate the process of translation.

• AIDS patients are found to contain several low molecular mass selenium compounds, which are thought to be selenoproteins encoded by the HIV genome.

46 chloroplast and 10 mitochondrial tRNA genes found

• Due to sequencing contamination

• Some chloroplast tRNA genes must be identical copies

• There are about 33 tRNA genes in rice chloroplast genome as predicted by tRNAscan-SE

Codon Bias in Rice Genome

• Codon bias exists in the rice genome.

• Codon bias in rice resembles that in human; however, codons of the form XCG are used less in the human genome.

• Codons ending with G or C are preferred to those ending with A or U, respectively.
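
The XCG remark can be checked against the two codon usage tables in these slides. The sketch below copies the four XCG entries from each table and assumes both tables use the same normalization, which the slides do not state.

```python
# Codon usage values copied from the human and rice tables in these slides.
XCG_USAGE = {
    "UCG": {"human": 45, "rice": 115},
    "CCG": {"human": 69, "rice": 153},
    "ACG": {"human": 63, "rice": 113},
    "GCG": {"human": 74, "rice": 264},
}

for codon, usage in XCG_USAGE.items():
    ratio = usage["rice"] / usage["human"]
    print(f"{codon}: human {usage['human']:3d}, rice {usage['rice']:3d}, rice/human = {ratio:.2f}")
```

All four XCG codons have a higher usage value in the rice table than in the human table.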

A roughly positive correlation between codon usage and the corresponding tRNA gene number

[Figure: scatter plot. x-axis: tRNA gene copy number (0 to 32); y-axis: corresponding codon number (0 to 600)]

tRNA genes are dispersed in the whole genome

• Based on the public data of japonica, Chr.10, Chr.7, Chr.6 and Chr.3 may have many more tRNA genes than the other chromosomes.

• Many of them may form a few clusters.

• In fact, many tRNA genes are repeats; for example, 8 almost identical trnQ-UUG genes are found on one contig of indica.

• There are many tRNA genes identical in sequence. They may be repeated copies of genes or may be caused by assembly errors.

tRNA genes in eukaryotes

Species           tRNA gene   Genome size   tRNA genes per     CDS size   tRNA genes per
                  number      (Mbp)         Mbp of genome      (Mbp)      Mbp of CDS
S. cerevisiae     273         12            22.75              8.45       32
S. pombe          174         14            12.48              6.9        25
C. elegans        584         100           5.84               26.1       22
A. thaliana       620         125           4.96               33.5       18
D. melanogaster   284         180           1.58               24.1       12
O. sativa         596         464           1.48               --         --
H. sapiens        648         3400          0.19               58.5(?)    11(?)
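
The density column is the tRNA gene number divided by the genome size. A brief sketch that recomputes it from the first two numeric columns of the table (values copied from the slide; the names are illustrative):

```python
# (species, tRNA gene number, genome size in Mbp), copied from the table.
ROWS = [
    ("S. cerevisiae", 273, 12),
    ("S. pombe", 174, 14),
    ("C. elegans", 584, 100),
    ("A. thaliana", 620, 125),
    ("D. melanogaster", 284, 180),
    ("O. sativa", 596, 464),
    ("H. sapiens", 648, 3400),
]


def density(gene_count, genome_mbp):
    """tRNA genes per Mbp of genome."""
    return gene_count / genome_mbp


for species, count, size in ROWS:
    print(f"{species:16s}{density(count, size):8.2f}")
```

For the sizes as listed, the quotients for S. pombe and O. sativa come out at about 12.43 and 1.28 rather than the tabulated 12.48 and 1.48, which may mean the slide used slightly different genome sizes for those rows.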

Chloroplast genome of Oryza sativa ssp. indica and japonica

• Almost the same genome size

indica: 134559 bp (2001 data)

japonica: 134525 bp (1989 data, CHOSXX, X15901)

• Elizabeth C. Kemmerer and Ray Wu (2001):

Very few differences between the sequences of 11 chloroplast genes from indica and japonica, including 2 tRNA genes.

The coding region and flanking region up to 100 bp are highly conserved.

More differences in intron regions than in coding regions.

Chloroplast tRNA genes in ssp. indica and japonica

• 33 tRNA genes were found in each of the indica and japonica chloroplast genomes.

• They are completely identical; no mutation was found (E. C. Kemmerer and Ray Wu found two tRNA genes perfectly conserved).

• It is remarkable that, in spite of more than 7000 years of separation, no mutation could be observed in the chloroplast tRNA genes of the two subspecies.

References

[1] J. Yu et al., A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79 (2002).

[2] S. A. Goff et al., A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92 (2002).

[3] F. Crick, Codon-anticodon pairings: the wobble hypothesis. J. Mol. Biol. 19, 548-555 (1966).

[4] C. Guthrie and J. Abelson, Organization and expression of tRNA genes in Saccharomyces cerevisiae. In: The Molecular Biology of the Yeast Saccharomyces: Metabolism and Gene Expression (ed. J. Strathern et al.), Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, pp. 487-528 (1982).

[5] International Human Genome Sequencing Consortium, Nature 409, 801 (2001).

[6] S. Eddy et al., http://rna.wustl.edu/GtRDB/

Transcriptional or functional efficiency of tRNA genes

                              Codon       tRNA gene   Codon frequency
                              frequency   number      per tRNA gene
Codons ending with U and C    5332        277         19.25
Codons ending with A          1678        132         12.67
Codons ending with G          2978        183         16.26
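
The last column is the codon frequency divided by the tRNA gene number of the same row. A minimal sketch using the totals from the table (names are illustrative):

```python
# (codon frequency, tRNA gene number) per class, copied from the table.
CLASSES = {
    "U and C": (5332, 277),
    "A": (1678, 132),
    "G": (2978, 183),
}


def frequency_per_gene(codon_frequency, gene_number):
    """Codon frequency served by one tRNA gene of the class."""
    return codon_frequency / gene_number


for ending, (frequency, genes) in CLASSES.items():
    print(f"codons ending with {ending}: {frequency_per_gene(frequency, genes):.2f}")

# The three classes together account for the 592 canonical tRNA genes.
print(sum(genes for _, genes in CLASSES.values()))
```

The quotients for the A and G rows can differ from the tabulated values in the second decimal place; one possible reason is whether the stop codons UAA, UGA and UAG were included in the frequency totals.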

tRNA gene sub-species studied

AAA 0   AGA 3   AUA 0   ACA 0
GAA 2   GGA 0   GUA 1   GCA 1
UAA 3   UGA 3   UUA 0   UCA 0
CAA 1   CGA 1   CUA 0   CCA 2

AAG 1   AGG 1   AUG 0   ACG 3
GAG 0   GGG 0   GUG 1   GCG 0
UAG 2   UGG 1   UUG 1   UCG 1
CAG 1   CGG 2   CUG 1   CCG 1

AAU 2   AGU 1   AUU 0   ACU 0
GAU 0   GGU 0   GUU 1   GCU 2
UAU 1   UGU 1   UUU 1   UCU 1
CAU 2   CGU ?   CUU 1   CCU 2

AAC 1   AGC 1   AUC 0   ACC 0
GAC 0   GGC 0   GUC 2   GCC 1
UAC 1   UGC 1   UUC 3   UCC 1
CAC 1   CGC 2   CUC 1   CCC 2

We have to point out ...

• The collection of tRNA genes we obtained may be redundant

• Novel tRNA genes may be found

• Experimental work is needed to prove whether they are genuine tRNA genes or not

• As the scientists sequencing the human genome pointed out, looking for novel ncRNA genes would still be challenging even if the complete finished sequence of the genome were available

GC-Gradient Effect in Rice

Jun Wang, Gane Wong, et al.

Genome Research (2002)

Fly

Huimin Xie, Grammatical Complexity and 1D Dynamical Systems, Vol. 6 in Directions in Chaos, WSPC, 1996.

Huimin Xie, Complexity and Dynamical Systems (in Chinese), Shanghai Scientific and Technological Education Publishing House, 1994.

J. Hopcroft, J. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.

THE END

THANK YOU!