Download - Analyse statistique des séquences génomiques
Analyse statistique des séquences génomiques
DEA en bioinformatique
Lausanne, 16 mai 2000Laurent Duret
Plan
• Taille des génomes, paradoxe de la valeur C
• Contenu informationnel
• Séquences répétées
• Organisation en isochore des génomes de vertébrés
• Régions régulatrices non-codantes
• Usage des codons synonymes
What is a genome ?1911 - gene:
Elementary unit, responsible for the transmission of hereditary characters
1920 - genome:
Set of genes of an organism
1944 - Avery et al.
DNA is the molecule of heredity
1950-70 :
Double helix
Genetic code
Mycoplasma genitalium0,6 Mb
Genomes sizes E. coli 4,7 Mb
YeastS. cerevisiae
13,5 Mb
NematodeC. elegans
100 Mb
ProtozoaAmoeba dubia
700 000 Mb
PufferfishFugu
rubripes400 Mb
Man3400 Mb
Xenopuslaevis
3100 Mb
AmphibiaNewt
100 000 Mb
DipnoiLungfish
150 000 Mb
ProkaryotesEukaryotes
13%2%85%Drosophila85%2%13%E. coli70%2%28% YeastS. cerevisiae
5%2%93%Man27%2%71%NematodeC. elegansProportion of functional elements within genomes
2%98%0,01%Lunfish (dipnoi)Coding (protein)RNANon-coding
How many genes in the human genome ?Fields et al. 1994
Technique Gene estimate Comments/assumptions
Estimation 100,000 if average size = 30 kb
Estimation 300,000 if average size = 10 kb
Transcribed genome 20,000
RNA reassociation 20,000 - 92,000
Genomic sequencing (1994) 71,000 biased toward gene-rich region?
CpG islands 67,000 assumes 66% human genes have CpG islands
EST analysis (1994) 64,000 matching with GenBank; 50% EST redundancy
Chromosome 22 (1999) 45,000 correction for high gene density on chrom. 22
Chromosome 21 (2000) + 22 40,000
Functional elements in the human genome
3.4 109 nt 50,000-100,000 protein-coding genes
81% no known function43%38%introns4%12%protein-coding regions
centromeres, telomeres,
RNA2%intergenic
Untranslated RNAs: Xist, H19, His-1, bic, etc.
Regulatory elements: promoters, enhancers, etc.
2% ??
Structure of human protein genes
• 1396 complete human genes (exons + introns) from GenBank (1999)
• Average size (25%, 75%)
– Gene 15 kb ± 23 kb (4, 16) (10% > 35 kb)
– CDS 1300 nt ± 1200 (600, 1500)
– Exon (coding) 200 nt ± 180 (110, 200)
– Intron 1800 nt ± 3000 (500, 2000)
– 5'UTR 210 nt (Pesole et al. 1999)
– 3'UTR 740 nt (Pesole et al. 1999)
• Intron/exon
– Number of introns: 6 ±3 introns / kb CDS
– Introns / (introns + CDS): 80%
– 5' introns in 15% of genes (more ?), 3 ’introns very rare
• Alternative splicing in more than 30% of human genes (Hanke et al. 1999)
Structure of human protein genes
• GenBank: bias towards short genes
• 1396 complete human genes (exons + introns)
≤949596979899Publication date48121620Gene size (coding exons+introns) kb
5101520253035≤949596979899Publication dateGene size (coding exons+introns) kb
Structure of human protein genes
• GenBank: bias towards short genes
• 1396 complete human genes (exons + introns)
• 9268 complete human mRNA
Sequence:cDNA
complete gene (exons+introns)
400800120016002000889092949698Average CDS size (nt)Publication date
Overlapping genesN-myc geneN-cym geneIA IB II IIIIII II IPOL II
promoter
polyadenylationsite
transcriptionmaturation5'pm7GpppAAAAAAAAAAsmall nucleolar
RNA
mRNAoverlapping protein genessmall nucleolar RNA genes within intronsof protein genes
Tandem repeatsRepeated sequencesInterspersed repeatsDNA transposons retroelementsRecombination- DNA polymerase slippage:
CACACACACACA
- unequal crossing-over
motif bloc size % humangenome
satellite: 2-2000 nt up to 10 Mb 10%minisatellite: 2-64 nt 100-20,000 bp ?microsatellite: 1-6 nt 10-100 bp 2%
ADN satellite: centromères
CTTCGTTGGAAACGGGAsatellite α (171 )pb
(17 )site CenpB pb répétitions de satelliteα
répétitions d'ordre, supérieur spécifiques
de chaque chromosome( 10 )jusqu'à Mb
centromèrechromosome
Reverse transcriptase:RetroelementsNucleusCellRNADNAtranscriptionreverse transcriptionintegrationLINEs (long interspersed elements): 6-8 kbretroposons
SINEs (short interspersed elements):80-300 bpsmall-RNA-derived retrosequences (tRNA), pol III
Endogenous Retroviruses: 5-10 kb
LTR gag pol env LTRRetrovirusRetrotransposonRetroposonRetroséquenceRetrovirus
NucleusCellLINE reverse transcriptaseRetrosequences:opportunist retroelements
reverse transcriptionDNARNALINERNART protein
Pseudogenes1 - gene duplication generepeated elementinequal crossing-over2 - gene inactivation mutationA - pseudogenes derived from recombinationB - retropseudogenesgenepromoterAAAAAAtranscription + maturationmRNADNAretrotranscription + integrationAAAAAADNA
Retropseudogènes
• 23,000 à 33,000 retropseudogènes dans le génome humain
• Les gènes qui génèrent des retropseudogènes sont généralement de type housekeeping
• Gonçalves et al. 2000
Fréquence des éléments transposables dans le génome humain
• Total = 42% (Smit 1999)
0%4%8%12%AluLINE1MIRLINE2LTR elementsDNAtranposon
Fréquence des éléments transposables dans le génome humain (Smit 1999)
0%5%10%15%20%25%30%AluLINE1AutosomesChromosome X
Isochore organization of vertebrate genomes
Organisation en isochore des génomes de vertébrés: mise en évidence expérimentale
densité à l'équilibre, g/cm31,6901,7001,71030%41%51%satellitetaux C+Gcomposantmajeur
Fractionnement du génome de la souris par centrifugation en gradient de densité (Bernardi et al. 1976)
Analyse statistique des séquences publiées dans les banques de données.
Corrélation entre la composition en base en position 3 des codons et celle de l'envirronement génomique dans lequel se trouve le gène
Analyse statistique des séquences publiées dans les banques de données.
Distribution en fréquence des gènes dans les différentes classes d'isochores
CDS GC3%
0
1
2
3
4
5
6
7
0 20 40 60 80 100
Moy = .612Ecart-t = .1585447 séq
0
1
2
3
4
5
6
7
0 20 40 60 80 100
Moy = .639Ecart-t = .171818 séq
Homme Poulet
Nb
de
gèn
es (
%)
0
2
4
6
8
10
12
0 20 40 60 80 100
Moy = .509Ecart-t = .106703 séq
Xénope0 20 40 60 80 100
Danio
Moy = .580Ecart-t = .103173 séq
0
2
4
6
8
10
12
14
Evolution de la structure en isochore chez les vertébrés
Isochore organization of vertebrate genomes
• Insertion of repeated sequences (A. Smit 1996)
• Recombination frequency (Eyre-Walker 1993)
• Chromosome banding (Saccone, 1993)
• Replication timing (Bernardi, 1998)
• Gene density (Mouchiroud, 1991)
• Gene expression ?? -> No
• Gene structure (Duret, 1995)
isochore %C+G % total genomic DNA
L1+L2 : 33%-44% 62 %
H1+H2 : 44%-51% 31%
H3 : 51%-60% 3-5%
H1+H2L1+L2H3H1+H2L1+L2L1+L2>300 kbBernardi et al. 1985
Isochores and insertion of repeat sequences (Smit 1999)
4%8%12%16%20%AluLINE-1LTR-
elements
Density in repeat sequencesG+C content of genomic sequence:G+C < 39%G+C > 47%G+C 39%-47%
4419 human genomic sequences > 50 kb4419 human genomic sequences > 50 kb
Isochores and gene density
MHC locus (3.6 Mb) MHC locus (3.6 Mb) (The MHC sequencing consortium 1999)(The MHC sequencing consortium 1999)
Class I, class II (H1-H2 isochores): 20 genes/Mb, many pseudogenesClass I, class II (H1-H2 isochores): 20 genes/Mb, many pseudogenesClass III (H3 isochore): 84 genes/Mb, no pseudogeneClass III (H3 isochore): 84 genes/Mb, no pseudogene
Class II boundaries correlate with switching of replication timingClass II boundaries correlate with switching of replication timing
isochore % total genomic DNA %total genes
L1+L2 : 62 % 31%
H1+H2 : 31% 39%
H3 : 3-5% 30%
2060100140Number of genes / MbL1+L2H1+H2H3Mouchiroud et al. 1991
Isochores and introns length
• 760 complete human genes• L1L2: intron G+C content < 46%• H1H2: intron G+C content 46-54%• H3: intron G+C content >54%
Average intron length (bp)Gene compaction (intron length/coding region length)40080012001600200024681012L1L2H1H2H3L1L2H1H2H3
Duret, Mouchiroud and Gautier, 1995
Prediction of functional elements (1) Ab initio methods
Ruled-based or statistical methods e.g.: protein genes prediction, promoter prediction, … Very useful but ...
Limits in sensibility/specificity No method available for many functional elements (non-coding RNA genes, regulatory elements, …)
Large scale transcriptome projects: ESTs, full-length cDNA Identification of transcribed genes (protein or non-coding RNA) Information on alternative splicing, polyadenylation (Hanke et al. 1999, Gautheret et al. 1998),
expression pattern Very useful but ...
Problems with genes expressed at low level, narrow tissue distribution, stage-specific expression, … Limited tissue sampling Artifacts in ESTs (introns, partially matured RNA, …) Limited to polyadenylated RNA
Prediction of functional elements (2) Comparative sequence analysis (phylogenetic footprinting)
Function => selective pressure
Corollary Sequence conservation = selective pressure = function
provided the number of aligned homologous sequences represents enough evolutionary time for the accumulation of mutations at the less constrained (presumably selectively neutral)
base positions.
• Evolutionary rate in non-functional DNA: ~ 0.3% / My (± 0.069)
Man/Mouse: ~ 80 Myrs 46-58% identity
Mammals/Birds: ~ 300 Myr 26-28% identity
Random sequences 25% identity
Analyse comparative des gènes de -actine de l'homme et de la carpe
CarpeHomme5’UTR 3’UTR site polyA échelle de similarité: pas de similarité significative70 - 80% identité80 - 90% identitérégions codantes: éléments régulateurs:introns:ATGcodon stop
Phylogenetic footprinting Advantages
Works for all kinds of functional elements (transcribed or not, coding or not) as far as the information is in the primary sequence
Does not require any a priori knowledge of the functional elements
Limits Absence of evolutionary conservation does not mean absence of function No efficient method to detect unknown conserved secondary structure in RNA Function, but what function ? Depends on the sequencing status of other genomes
Human, mouse, fugu, C. elegans, drosophila, yeast, A. thaliana
Number of sequences to compare : > 200 Myrs of evolution Mammals/birds: 310 Myrs Human + mouse + bovine : 240 Myrs
Prédiction de régions régulatrices
• Méthodes ab initio
– Prédiction de promoteurs
– Îlots CpG
• Approche comparative
Prédiction de promoteurs eucaryotes
• Combinaison de sites de fixation de facteur de transcription (ordre, orientation, distance)
• Motifs courts, dégénérés– Difficile de distinguer les vrais sites des faux positifs:– Motif à 4 bases: ≈1/256 pb (1/128 pb sur les deux brins)
• Boîtes TATA, CAAT , GC: absents dans beaucoup de promoteurs
• Banques de données de sites de fixation de facteurs de transcription (TRANSFAC), de promoteurs caractérisés expérimentalement (EPD)
• PromoterScan (Prestridge 1995): Mesure de la densité en sites potentiels de fixation de facteurs de transcription de long de la séquence (pondération en fonction de la fréquence des sites dans ou en dehors des vrais promoteurs)
Prédiction de promoteurs: sensibilité, spécificité
• Sensibilité: fraction des promoteurs qui sont trouvés par le logiciel
– PromoterScan: sensibilité = 70% (promoteurs à boîte TATA)
• Spécificité: fraction des vrais promoteurs parmi ceux qui ont été prédits
– PromoterScan: spécificité = 20%– Un faux positif / 10 kb
• Génome humain: ≈100 000 gènes, ≈1 promoteur/30 kb
sensibilité=vrais_ positifs
vrais_ positifs+faux_ négatifs
spécificité=vrais_ positifs
vrais_ positifs+faux_ positifs
SpécificitéSensibilité1000Seuil
Prédiction de promoteurs eucaryotes: recherches en cours
• Prise en compte de l'orientation relative et des distances entre sites de fixation de facteurs de transcription– COMPEL (Kolchanov 1998): banque de données d'éléments composites– FastM : recherche dans une séquence génomique d'une combinaison de deux sites
de fixation de facteurs de transcription à une distance définie l'un de l'autre
• Recherche de corrélations entre sites– PromoterInspector (Werner 2000)
• Sensibilité: 40%• Spécificité: 45%
http://www.gsf.de/biodv/index.html
• Combinaison recherche ab initio / approche comparative: recherche de sites potentiels parmi les régions conservées
Îlots CpG• Génome de vertébrés :
– méthylation des C dans les dinucléotides 5 ’-CG-3 ’(CpG)
• Me-C fortement mutable -> T5 ’-CG- 3 ’ 5 ’-TG-3 ’ 5 ’-CA-3 ’
3 ’-GC- 5 ’ 3 ’-AC-5 ’ 3 ’-GT-5 ’
• Génome des vertébrés: globalement dépourvu en CpG (excès de TG, CA)
• Certaines régions (200 nt à plusieurs kb) échappent à la méthylation– Pas de déplétion en CpG: CpGo/e proche de 1
– Riche en G+C
– Îlot CpG:Longueur > 500 nt
CpGo/e > 0.6
G+C > 50%
ouou
CpGo /e =Nombre_ de_ CpG_ observéNombre_ de_CpG_ attendu
=0.25
01CpGo/e
La déamination des cytosines
Cytosine
N
C
CH
C
NH
NH2
O
CH
réparation
Cytosine
N
C
CH
C
NH
NH2
O
CH
Cytosine méthylée
N
C
CH
C
NH
NH2
O
CH
CH3
Uracile
HN
C
CH
C
NH
O
CH
O
déamination
Thymine
HN
C
CH
C
NH
O
CH
O
CH3
déamination TpGou
CpA
Îlots CpG: associé aux régions promotrices ?
• Bird (1986), Gardiner-Garden (1987) Larsen (1992) ref– 40% des gènes tissu-spécifiques possèdent un îlot CpG en 5 ’
– 100% des gènes ‘ housekeeping ’ possèdent un îlot CpG en 5 ’
• Rechercher des îlots CpG pour prédire des régions promotrices ?– Sensibilité: 40-100%
– Spécificité ?? (Quelle fraction des îlots CpG correspond effectivement à des régions promotrices ?)
• Ponger (1999): comparaison des îlot CpG qui recouvre ou non le site d ’initiation de la transcription
Fréquence des gènes humains avec un îlot CpG recouvrant le site d ’initiation de la transcription
– 800 gènes humains avec promoteur décrit
– Mesure de la distribution tissulaire à l ’aide d ’EST (20 tissus)
0%20%40%60%80%0-23-67-20Nombre de tissus où le gène est exprimé
Comparaison des îlots CpG recouvrant ou non le site d ’initiation de la transcription
– 272 îlots start CpG recouvrant le site d ’initiation de la transcription (start)
– 1078 îlots CpG en dehors d ’un promoteur connu (other) (en excluant les séquences répétées)
0 2460.50.70.91.11.350%60%70%80%otherstartG+C%CpGo/eLongueur(kb)
Recherche de régions régulatrices par analyse comparative (empreintes phylogénétiques)
• Goodman et al. 1988: régulation de l’expression des gènes du cluster -globine au cours du développement
• Alignement de séquences orthologues de 6 mammifères (> 270 Ma d’évolution)
• 13 empreintes phylogénétiques: ≥ 6 nt, conservation 100%
• Analyse par retard de bande sur gel:
• 12/13 (92%) correspondent à des sites de fixation de protéines
• 1996: 35 empreintes phylogénétiques avec protéines fixatrices identifiées
• Enhancers de gènes HOX (Fugu/souris) (Aparicio et al. 1995)
• enhancer TCR α (homme/souris) (Luo, 1998)
• promoteur COX5B (11 primates) (Bachman, 1996)
• promoteur uPAR (homme/souris) (Soravia, 1995)
Large scale phylogenetic footprinting
Non-coding sequences : 325,247 sequences 145 Mb
everything except protein-coding regions and structural RNA genes (rRNA, tRNA, snRNA, scRNA)
Introns, 5' and 3' untranslated regions, intergenic sequences
Filtering of microsatellite repeats and cloning vectors: XBLAST
Similarity search: BLASTN + LFASTA
Vertebrates, insects, nematode
Metazoan Genome ProjectsMillion yearsPorifera (sponge)Nematodes (C. elegans)Arthropods (Drosophila)EchinodermsUrochordataCephalochordata (amphioxus)Jawless fisheschondrichthyes (ray, shark)actinopterygii (bony fishes)amphibians mammals birds reptiles600400200800VertebratesSequencing effort: 9 to 100 Mb 0.8 to 2.4 Mb less than 0.2 Mb
Sequence Similarities1- Identification of new genes
protein-genes, RNA-genes: intronic snoRNA genes
2- Retroviral elements, retrotransposons
3- Low complexity sequences:
GC-rich, AT-rich, cryptic microsatellites
4- Artefacts:
annotation errors, sample contamination (sponge insulin, ascidian RNA, chicken TGFB1)
5- 326 highly conserved regions (HCRs)
- do not code for proteins
- do not correspond to any known structural RNA
326 Highly Conserved Regions (HCRs)
• > 70% identity over 50 to 2000 nt after more than 300 Myrs
• Unique sequences
• Generally specific of only one gene
• Longest HCR:
84% identity over 1930 nt after 300 Myrs
3’UTR deltaEF1 transcription factor
• Oldest HCRs: 500 to 600 Myrs
• No HCR between vertebrates and insects or nematode
Oldest HCRsMillion yearsPorifera (sponge)Nematodes (C. elegans)Arthropods (Drosophila)EchinodermsUrochordataCephalochordata (amphioxus)Jawless fisheschondrichthyes (ray, shark)actinopterygii (bony fishes)amphibians mammals birds reptiles600400200800Sequencing effort: 9 to 100 Mb 0.8 to 2.4 Mb less than 0.2 MbHistone 3’UTRα- actin3’UTR
3 5’HOX UTRVertebrates
Conservation pattern in 3’UTRsposition relative to the stop codon (nt)10005000150020002400c-fosTransferrin receptorbirdmammalEndoplasmic-reticulum Ca2+ ATPase birdmammalbirdmammalsimilarity: <60% ≥60% ≥70% ≥80%
Distribution of HCRs within genes3'-non-coding5'-non-codingintrons0%10%20%30%40%mammals / birdsmammals / amphibiansmammals / bony fishes2841917296512563812 Frequency of orthologous
genes containing HCRs
HCRs and multigenic familiesHistone replacement variant H3.3A0400600100014001800AAAAAAAAUGStopAUGStopAAAAAAAHistone replacement variant H3.3BHistone replacement variant H3.3A and H3.3B, Calmodulinsnt• several genes coding for a same protein
• non-coding sequences are distinct, and conserved
Function of 3’HCRs: mRNA stability, translation
A+U-rich element: stability, translationposition relative to the stop codon (nt)10005000150020002400c-fosTransferrin receptorbirdmammalbirdmammalsimilarity: <60% ≥60% ≥70% ≥80%IRE : Iron Responsive Element
IRP : Iron Regulatory Protein
CCAGUGN5'3'
Function of 3’HCRs:mRNA subcellular localization
Myosin heavy chain, c-myc, vimentin, -actin
chickencarp (bony fish)site poly(A)site poly(A)0200400600800position relative to the stop codon (nt)localization signalssimilarity: <60% ≥60% ≥70% ≥80%- 3’actin UTR
Comparaison des régions non-codantes de 77 gènes orthologues homme/souris (Jareborg et al. 1999)
0.20.40.60.81Upstream(1 kb)5’UTRIntrons3’UTRs
Upstream015’UTRcoding exon
intron3’UTR
Fraction des régions non-codantes conservées entre homme et souris
Synonymous codon usage61 codons, 20 amino-acids: genetic code degeneracy
Synonymous codon usage is not random: some codons are preferentially used
Synonymous codon usage bias
0.0%25.0%50.0%75.0%Example: Proline codon usage in Escherichia coli (4300 genes)
Mycoplasmacapricolum
Micrococcusluteus
Saccharomyces cerevisiae
Caenorhabditis elegans
Mus musculus0%25%50%75%100%CCUCCCCCACCGExample: proline codon usage in different speciesMST serine/threonine kinaseInositol 5-phosphatase0%25%50%75%CCU
CCC
CCA
CCG
Example: proline codon usage in different human genesSYNONYMOUS CODON USAGE BIAS VARIES ...... ACCORDING TO THE SPECIES... ACCORDING TO GENES WITHIN A SAME GENOME
HOW TO EXPLAIN BIASES IN SYNONYMOUS CODON USAGE ?SELECTIONIST AND NEUTRALIST HYPOTHESES
Selection of codons that are optimal for translation
(speed or accuracy)
Mutational biasCodon usage bias is correlated with gene expression level
Prefered codons correspond to the most abundant tRNAs
Selective pressure on synonymous sites => synonymous substitution rate is negatively corelated with codon usage bias
Examples: E. coli, B. subtilis, yeast
No relationship with gene expression level
Mutational biases affect all positions within a genome (not only synonymous codon position) => correlation between codon usage and whole genome base composition
Examples: - Borrelia burgdorferi:
whole genome : 29% G+C3rd codon position : 21% G+Cother positions: 32% G+C
- Mycobacterium tuberculosis: whole genome : 66% G+C3rd codon position : 79% G+Cother positions: 60% G+C
Selection-mutation-drift model
Mise en évidence d’une pression de sélection sur le choix des codons synonymes chez des organismes
pluricellulaires
• Corrélation entre biais d’usage des codons synoymes et niveau d’expression des gènes
• C. elegans • D. melanogaster • A. thaliana
• Corrélation entre l’abondance des ARNt et l’usage des codons synoymes
• D. melanogaster (White et al.1973, Moriyama & Powell 1997) • C. elegans
Expressed Sequence Tags (ESTs)Inventory of mRNAs expressed by a given organism in different tissues (or cell cultures), development stages, sexes, in response to stimuli, pathologies, etc.
Procedure• cDNA libraries
- polyA+ selection- normalization: reduce the abundance of
over-expressed mRNA
• automatic sequencing from 3' or 5' end, single read
Result• partial mRNA sequences (300-500 nt)• high error rate (3-6%)• high redundancy
In Silico Northern Blot• probe = mRNA sequence (or coding-region, when only the genomic sequence is known)
• filtering of repeated elements (microsatellites, Alu, LINEs, etc.)
• similarity search (BLASTN2) against a species-specific EST collection
• threshold for positive matches : ≥ 95% identity over 100 nt or more
- low enough to tolerate sequencing errors- high enough to distinguish different members of a gene family
• we only retained cDNA libraries that had been sufficiently sampled (> 10.000 ESTs)• cell culture, pathologies, unidentified cDNA libraries were not considered
number of ESTs number of different tissues/stages
Human 802,246 22/3C. elegans 67,987 2/2A. thaliana 36,200 1Drosophila 27,491 2/2
Expression pattern
• tissue distribution: tissue-specific vs. housekeeping
• expression level: number of EST matches / total number of ESTs in that cDNA library
No EST detected low moderate high
0.350.400.450.500.550.600.350.400.550.600.650.70Caenorhabditis elegansDrosophila melanogasterArabidopsis thalianaprotein length< 333 aa333-570 aa> 570 aaFavError bars indicate the 95% confidence interval.mRNA abundance:Frequency of favored codons (Fav),
protein length and expression level
No ESTLowInterme-diateHighExpressionIntron G+C contentD. melanogasterN = 439A. thalianaN = 2386C. elegansN = 7891
Mutational bias or selection ?In Drosophila, C. elegans and A. thaliana most of favored codons end in C or G.
Does codon usage reflect a mutational bias towards G or C in highly expressed genes ?
0.300.400.500.60both in embryo and adultonly in adultonly in embryoNo ESTprotein length< 333 aa333-570 aa> 570 aaExpression during development :FavCaenorhabditis elegansFrequency of favored codons (Fav), protein length and expression during
development in C. elegans.
Favored codons are the same in embryo- and adult- specific genes
Mise en évidence d’une pression de sélection sur le choix des codons synonymes chez des organismes
pluricellulaires
• Corrélation entre biais d’usage des codons synoymes et niveau d’expression des gènes
• C. elegans • D. melanogaster • A. thaliana
• Corrélation entre l’abondance des ARNt et l’usage des codons synoymes
• D. melanogaster (White et al.1973, Moriyama & Powell 1997) • C. elegans
Isoaccepting tRNA gene copy numberFrequency of amino-acids (weighted according to protein
expression levels)
515253545R = 0.82p < 0.00010%2%4%6%8%ACDEFGHILMNPQRSTVWYK Relationship between the number of tRNA genes and amino-acid usage in
C. elegans genome
580 tRNA genes
Example: proline
tRNA Codon
CGG CCG
GGG CCC
AGG CCT
TGG CCA
Relationship between favored codons and tRNA abundance in C. elegans genome
Anticodon-codon recognition(Wooble):
Anticodon CodonA UC GG C, UU A, GI U, C, (A)
A’s at first anticodon position are predicted as I’s
In all cases, favored codons are decoded by the most frequent tRNA gene
Selection on codon usage =selection for higher translation efficiency
- elongation rate
- cost of proofreading
=> no relationship between codon usage and gene length
- translation accuracy
the energetic cost of translation and the probability of misincorporation increase with protein length:
=> stronger selective pressure on codon usage in long genes => positive correlation between codon usage and gene length
E. coli (Eyre-Walker, 1996; Moriyama & Powell, 1998)
Correlation between protein length and Fav (a) or Favmax (b) in C. elegans highly expressed genes
number of codons (log scale) 0.2 0.4 0.6 0.8 25 80 250 800 2500 8000 R = -0.60 p<0.0001 Fav number of codons (log scale) 0.2 0.4 0.6 0.8 25 80 250 800 2500 8000 R = -0.42 p<0.0001 Favmax (a) (b) Favmax = maximum Fav value along coding sequences, using a sliding window of 150 codons, moved by steps of 3 codons. Variation of codon usage along genes ?
Mutations and evolutionBase modificationReplication error
etc.
MutationDNA repairs < 0 : Negative selection|Ns| << 1 : Random genetic drift s > 0 :Positive selectionFixationselective advantage or disadvantage (s)Population (N)somatictransmission to the offspringIndividual|Ns| << 1 : rate of evolution = rate of mutationgermlineno transmission to the offspringallele lossPolymorphismRate of mutation:etc. Rate of molecular evolution
= f(mutation rate, probability of fixation)
environmentsensibility to mutagensfidelity of polymerasesefficiency of DNA repair
Mammals: small population size:(~104 i.e. 100 to 1000 times smaller than drosophila)
the small selective differences between alternative synonymous codons is not sufficient to overcome random genetic drift
C. elegans, A. thaliana , D. melanogaster: selection on synonymous codon usage (translation efficency)
Mammals: no evidence for selection on codon usage
Probability of fixation of a new mutation :
p = f(s, N)
s : selective advantage or disadvantage
N : population size
| Ns | << 1 : random genetic drift