Codons, Genes and Networks Bioinformatics service Math@Bio group of M.Gromov Andrei Zinovyev

Post on 19-Dec-2015


Page 1

Codons, Genes and Networks

Bioinformatics service

Math@Bio group of M.Gromov

Andrei Zinovyev

Page 2

Plan of the talk

Part I: 7-cluster structure of the genome (codons and genes)

Part II: Coding and non-coding DNA scaling laws (genes and networks)

Page 3

Part I: 7-clusters genome structure

Dr. Tatyana Popova

R&D Centre in Biberach, Germany

Prof. Alexander Gorban

Centre for Mathematical Modelling

Page 4

Genomic sequence as a text in an unknown language

tagggacgcacgtggtgagctgatgctaggg

Frequency dictionaries:

t a g g g a c g c a c g t g g t g a g c t g a t g c t a g g g ($N = 4 = 4^1$)

ta gg ga cg ca cg tg gt ga gc tg at gc ta gg ($N = 16 = 4^2$)

tag gga cgc acg tgg tga gct gat gct agg ($N = 64 = 4^3$)

tagg gacg cacg tggt gagc tgat gcta gggr ($N = 256 = 4^4$)

gggrcgccacgttggtgagctgatgctagggrcgacgtgg

tagggrcgcacgtggtgagctgatgctagggrcgacgtgg

agggrcgcacgtggtgagctgatgctagggrcgacgtggc

..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc…
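The frequency dictionaries above can be sketched in a few lines of Python. This is a minimal illustration, assuming the words are read without overlap starting from the first letter and that an incomplete tail is dropped; the function name `frequency_dictionary` is hypothetical and not taken from the talk.

```python
from collections import Counter


def frequency_dictionary(sequence: str, k: int) -> dict[str, float]:
    """Relative frequencies of non-overlapping k-letter words read from position 0."""
    seq = sequence.lower()
    # Cut the text into consecutive words of length k; drop an incomplete tail.
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


text = "tagggacgcacgtggtgagctgatgctaggg"  # the example sequence from the slide
for k in (1, 2, 3, 4):
    freqs = frequency_dictionary(text, k)
    # k, size of the full dictionary (4**k), number of words actually observed
    print(k, 4 ** k, len(freqs))
```

For a short text most of the 4^k possible words never occur, which is one reason the talk works with fragments of a few hundred letters.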

Page 5

From text to geometry

cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc

107

cgtggtgagctgatgctagggacgcacggtgagctgatgctagggacgcacacttgagctgatgctagggacgcacaattcgtgagctgatgctagggacgcacggtg……gagctgatgctagggacgcacaagtga

length~200-400

10000-20000 fragments

$\mathbb{R}^N$
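One way to turn the text into points of $\mathbb{R}^N$ is sketched below. It assumes a fixed window of 300 letters (inside the 200-400 range quoted on the slide), a step equal to the window, and non-overlapping k-letter words inside each fragment; `fragment_vectors`, `window` and `step` are illustrative names and values, not the author's code.

```python
from itertools import product

import numpy as np

ALPHABET = "acgt"


def fragment_vectors(sequence, window=300, step=300, k=3):
    """Map each fragment of `sequence` to a vector of k-mer frequencies in R^(4**k)."""
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    seq = sequence.lower()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        fragment = seq[start:start + window]
        counts = np.zeros(len(words))
        # Non-overlapping k-mers, read from the first letter of the fragment.
        for i in range(0, window - k + 1, k):
            j = index.get(fragment[i:i + k])
            if j is not None:  # skip words with letters outside a, c, g, t
                counts[j] += 1
        total = counts.sum()
        rows.append(counts / total if total else counts)
    # One row per fragment, one column per possible word.
    return np.array(rows).reshape(-1, len(words))
```

Each row of the returned matrix is one fragment, so a genome cut into 10000-20000 fragments becomes a cloud of that many points in 64 dimensions when k = 3.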

Page 6

Method of visualization: principal components analysis

$\mathbb{R}^N \to \mathbb{R}^2$

PCA plot
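The projection to the plane can be sketched with a plain singular value decomposition. This is a generic PCA, assuming the data matrix has one row per fragment and one column per word frequency; the random matrix below is only a stand-in, not genomic data, and `pca_project` is a hypothetical helper name.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto their first principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return coords, explained[:n_components]


rng = np.random.default_rng(0)
X = rng.random((200, 64))  # placeholder for a fragment-by-triplet frequency matrix
coords, explained = pca_project(X)
print(coords.shape, explained)
```

The two columns of `coords` are the coordinates one would scatter-plot; `explained` gives the fraction of total variance carried by each of the two axes.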

Page 7

Caulobacter crescentus

singles N=4

doublets N=16

triplets N=64

quadruplets N=256

!!!

The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

Page 8

First explanation

cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

Page 9

Basic 7-cluster structure

gtgagctgatgctagggrcgcacgtggtgagc

gct gat gct agg grc gca cgt

ctg atg cta ggg rcg cac gtg

tga tgc tag ggr cgc acg tgg

gtgaatcggtgggtgagtgtgctgctatgagc

atc ggt ggg tga gtg tgc tgc

tcg gtg ggt gag tgt gct gct

cgg tgg gtg agt gtg ctg ctg
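The three rows of triplets under each sequence are the three possible frame shifts of the same text. A minimal sketch, with the hypothetical helper `reading_frames`, reproduces the rows shown for the first sequence:

```python
def reading_frames(sequence: str) -> list[list[str]]:
    """Split a sequence into complete triplets for each of the three frame shifts."""
    frames = []
    for shift in range(3):
        shifted = sequence[shift:]
        usable = len(shifted) - len(shifted) % 3  # ignore an incomplete last triplet
        frames.append([shifted[i:i + 3] for i in range(0, usable, 3)])
    return frames


# The stretch of the slide's first sequence that is cut into triplets.
for frame in reading_frames("gctgatgctagggrcgcacgtgg"):
    print(" ".join(frame))
```

A fragment taken from a coding region therefore produces a different triplet distribution depending on where the window happens to start relative to the true codon boundaries.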

Page 10

Non-coding parts

gtgagctgatgctagggr cgcacgaat

Point mutations: insertions, deletions

a

Page 11

The flower-like 7 clusters structure is flat

[Figure: plot with vertical axis from 0 to 0.16 and horizontal axis from 1 to 29]

Page 12

Seven classes vs Seven clusters

Stanford, TIGR, Georgia Institute of Technology

Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194

Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17):10026-31, 1998.

Lomsadze A., Ter-Hovhannisyan V., Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20

Page 13

Computational gene prediction

Accuracy >90%

Page 14

Mean-field approximation for triplet frequencies

$F_{IJK} = P^1_I P^2_J P^3_K$ + correlations

$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$):

$F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers

$P^j_i$: position-specific letter frequency, 12 numbers
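The mean-field idea can be made concrete with a short sketch: the 12 position-specific frequencies are the marginals of the 64 triplet frequencies, and their product is compared with the original table. The function name `mean_field` and the dictionary input format are assumptions made for this illustration.

```python
from itertools import product

import numpy as np

LETTERS = "ACGT"


def mean_field(codon_freq):
    """Return P (3x4 marginals), the product approximation (4x4x4) and the residual."""
    F = np.zeros((4, 4, 4))
    for (i, a), (j, b), (k, c) in product(enumerate(LETTERS), repeat=3):
        F[i, j, k] = codon_freq.get(a + b + c, 0.0)
    F /= F.sum()
    # P[0, i] is the frequency of letter i at codon position 1, and so on.
    P = np.stack([F.sum(axis=(1, 2)), F.sum(axis=(0, 2)), F.sum(axis=(0, 1))])
    approx = np.einsum("i,j,k->ijk", P[0], P[1], P[2])
    return P, approx, F - approx  # the residual holds the correlations


P, approx, residual = mean_field({"ATG": 0.5, "GCT": 0.5})
print(P.shape, approx[0, 3, 2], residual[0, 3, 2])
```

The toy input with only two codons is deliberately far from independent, so the residual is large; for real codon usage the size of the residual measures how much the 12 numbers fail to capture.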

Page 15

Why hexagonal symmetry?

0-+

-+0

+0-

+-0

-0+

0+-

GC-content = $P_C + P_G$

Page 16

Genome codon usage and mean-field approximation

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

correct frameshift

64 frequencies $F_{IJK}$

ggtgaATG gat gct agg … gtc gca cgc TAAtgagct

12 frequencies $P^1_I$, $P^2_J$, $P^3_K$

Page 17

$P^j_i$ are linear functions of GC-content

eubacteria

archaea

Page 18

THE MYSTERY OF TWO STRAIGHT LINES ???

$\mathbb{R}^{12} \to \mathbb{R}^{64}$

$F_{IJK} = P^1_I P^2_J P^3_K$ + correlations

Page 19

Codon usage signature

0-+

Page 20

19 possible eubacterial signatures

Page 21

Example: Palindromic signatures

Page 22

Four symmetry types of the basic 7-cluster structure

eubacteria

flower-like, degenerated, perpendicular triangles, parallel triangles

Page 23

B. halodurans (GC=44%)

S. coelicolor (GC=72%)

F. nucleatum (GC=27%)

E. coli (GC=51%)

Page 24

Using branching principal components to analyze 7-clusters genome structures

Page 25

Streptomyces coelicolor

Bacillus halodurans

Escherichia coli

Fusobacterium nucleatum

Using branching principal components to analyze 7-clusters genome structures

Page 26

Web-site

http://www.ihes.fr/~zinovyev/7clusters

cluster structures in genomic sequences

Page 27

Papers (type Zinovyev in Google)

Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint

Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387

Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025

Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology. V.3, 0039.

Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

Page 28

Part II: Coding and non-coding DNA scaling laws

Dr. Thomas Fink

Bioinformatics service

Dr. Sebastian Ahnert

Cavendish Laboratory, University of Cambridge

Page 29

C-value and G-value paradox

Neither genome length nor gene number accounts for the complexity of an organism

Drosophila melanogaster (fruit fly) C=120Mb

Podisma pedestris (mountain grasshopper) C=1650 Mb

Page 30

Non-linear growth of regulation

Mattick, J. S. Nature Reviews Genetics 5, 316–323 (2004).

“Amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated

[Figure: log number of regulatory genes versus log number of genes, for bacteria and archaea; slopes marked on the plot: 1.96 and 1]
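A slope on such a log-log plot is the exponent of a power law, and it can be estimated by a straight-line fit to the logarithms. The sketch below uses a synthetic, exactly quadratic data set purely to show the mechanics; the numbers are invented for illustration and are not the data behind the figure.

```python
import numpy as np


def loglog_slope(x, y):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)
    return slope, intercept


# Synthetic illustration only: an exact quadratic law, not measured genome data.
genes = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
regulators = 1e-5 * genes ** 2
slope, _ = loglog_slope(genes, regulators)
print(round(slope, 2))
```

A fitted slope close to 2 is what "regulation grows quadratically with gene number" means in practice, while a slope of 1 would correspond to regulation growing in simple proportion to genome size.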

Page 31

Complexity ceiling for prokaryotes

Adding a new function S requires adding a regulatory overhead R; the total increase is N = R + S

Since $R \sim N^2$, at some point R > S, i.e. the gain from a new function is too expensive for an organism: it requires too much regulation to be integrated

There is a maximum possible genome length for prokaryotes (~10Mb)

Page 32

How did eukaryotes bypass this limitation?

Presumably, they invented a cheaper (digital) regulatory system, based on RNA

This regulatory information is stored in the “non-coding” DNA

Page 33

Simple model: Accelerated networks

Node is a gene (c genes)

Edge is a “regulation” (n edges)

$n = c^2$

Connectivity < $k_{max}$: regulators are only proteins

Connectivity > $k_{max}$: deficit of regulations is taken from non-coding DNA

Page 34

How much regulation does a genome need to take from non-coding DNA?

$n_{deficit} = \frac{k_{max}}{2} \frac{c}{c_{max}} (c - c_{max})$

$c_{max}$ (prokaryotic ceiling)

These regulations must be encoded in the non-coding part of genome, therefore

$N = \frac{\beta}{2} \frac{C}{C_{prok}} (C - C_{prok})$

N – non-coding DNA length

C – coding DNA length

Cprok – ceiling for prokaryotes (~10Mb)

$\beta$ – some coefficient
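The deficit of regulations can be recovered in a few lines. This is a sketch under two assumptions not spelled out on the slide: the edge count is written as $n = \alpha c^2$ with a hypothetical constant $\alpha$ (taking $\alpha = 1$ gives the $n = c^2$ of the previous slide), and "connectivity" means the mean degree $k = 2n/c$.

```latex
\begin{align*}
k &= \frac{2n}{c} = 2\alpha c
  \quad\Longrightarrow\quad
  c_{max} = \frac{k_{max}}{2\alpha}
  \quad\text{(the gene number at which } k = k_{max}\text{)}, \\
n_{deficit} &= \alpha c^2 - \frac{k_{max}}{2}\, c
  = \frac{k_{max}}{2}\,\frac{c}{c_{max}}\,(c - c_{max}),
  \qquad c > c_{max}.
\end{align*}
```

The first term is the number of edges the accelerated network needs, the second is the most that protein regulators can supply at the maximal connectivity; substituting $\alpha = k_{max}/(2 c_{max})$ gives the factored form, which vanishes at the ceiling and grows quadratically above it.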

Page 35

Observation: coding length vs non-coding

$\beta = 1$

Minimum non-coding length needed for the «deficit» regulation

Page 36

Hypothesis

Prokaryotes: <Non-coding length> ∝ <Coding length> (little constant add-on, promoters, UTRs…)

15% ≈ 1/7

Eukaryotes: $N_{reg} = \frac{\beta}{2} \frac{C}{C_{maxprok}} (C - C_{maxprok}) \sim C^2$,

Cmaxprok ≈ 10Mb ≈

This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger

Page 37

This is only a hypothesis, but…

Prediction of $N_{reg}$ for human:

$N_{reg}$ = 87 Mb = 3% of genome length

C = 48 Mb = 1.7%

$N_{reg}$ + C = 4.7%
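The order of magnitude of this prediction can be checked with a small calculation. The sketch assumes the quadratic formula from the hypothesis slide, a coefficient of 1, and the rounded ceiling of 10 Mb; the function name `n_reg`, its defaults, and the choice to return zero below the ceiling are assumptions made for illustration.

```python
def n_reg(coding_mb: float, c_prok_mb: float = 10.0, beta: float = 1.0) -> float:
    """Minimum regulatory non-coding length (Mb) for a given coding length (Mb)."""
    if coding_mb <= c_prok_mb:
        return 0.0  # assumption: below the ceiling the model has no deficit
    return 0.5 * beta * (coding_mb / c_prok_mb) * (coding_mb - c_prok_mb)


print(n_reg(48.0))  # coding length quoted on the slide for human
```

With these rounded inputs the result is roughly 91 Mb, close to but not identical to the 87 Mb on the slide, which presumably comes from unrounded parameter values.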

Page 38

Thank you for your attention

Questions?