design and data basing of genome-wide gene specific tags
DESCRIPTION
Complete Arabidopsis Transcriptome MicroArray. Design and Data Basing of Genome-wide Gene Specific Tags. France. Pierre Hilson* (coord.), Vincent Colot. Pierre Rouzé. Belgium. Paul Van Hummelen. Germany. Wilfried Nietfeld. United Kingdom. Jim Beynon.
1
1st Plant GEMS meeting Berlin 2002
Design and Data Basing of Genome-wide Gene Specific Tags
Complete Arabidopsis Transcriptome MicroArray (CATMA)
2
the partners in CATMA
Belgium
France
Germany
Switzerland
Sweden
Spain
Netherlands
United Kingdom
Pierre Hilson* (coord.), Vincent Colot, Pierre Rouzé
Philippe Reymond
Jim Beynon, Wilfried Nietfeld
Peter Weisbeek, Javier Paz-Ares, Rishikesh Bhalerao
Paul Van Hummelen
Mark Crowe / Martin Trick
CATMA is a consortium of 10 research groups from 8 European countries, formed in October 2000
3
functional studies
gene specific data
Gene Specific Tags for Micro Arrays
4
GOAL
Construction of a Gene Specific Tag (GST) collection representing most Arabidopsis genes
5
STEPS
1) Homogeneous structural re-annotation of the whole genome sequence using EUGENE
2) Search for the Gene Specific Tag location in each gene and design of primers for PCR amplification using SPADS
3) Build the CATMA database and enter the GST, primer and gene data into it
6
CATMA flow chart
Genome sequence -> Structural Annotation, EUGENE (T. Schiex) -> Gene models
Gene models -> Design of GST primers, SPADS (V. Thareau) -> GST & primers sequence
GST & primers sequence -> CATMA Database (M. Crowe)
PCR from BACs/genome -> Gene Sequence Tags -> Spotting -> Micro Arrays
7
STRUCTURAL ANNOTATION
8
WHY ?
At the beginning of the project (October 2000) the AGI annotation was nearly complete, but this annotation suffered from major drawbacks
annotation methodology differed from one AGI consortium to another
the annotations were produced over several years, so the first and the last differ in quality
gene models were often wrong
9
HOW ?
Based on the validation of existing gene prediction tools (Pavy et al., 1999), we had a view of the efficiency of each of them for each gene feature and for gene modeling as a whole
A “parasitic” software (EUGENE) was developed (Schiex et al., 2001), which integrates the various sources of information available to produce gene models for the whole Arabidopsis genome
10
VALIDATION
the data set : AraSet
566 kb of Arabidopsis genome sequence containing 74 gene contigs of documented genes, each manually checked for consistency
57 contigs of 2 genes -> 114
14 contigs of 3 genes -> 42
3 contigs of 4 genes -> 12
168 genes (1028 exons, 860 introns)
94 intergenic sequences
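The AraSet figures above are internally consistent, which can be checked with a few lines of arithmetic. This is only a sketch: it assumes that a contig of n adjacent genes contains n - 1 intergenic sequences and that a gene with k exons has k - 1 introns, neither of which is stated on the slide.

```python
# Contig composition of AraSet as listed on the slide:
# {genes per contig: number of contigs}
contigs_by_size = {2: 57, 3: 14, 4: 3}

n_contigs = sum(contigs_by_size.values())
n_genes = sum(size * count for size, count in contigs_by_size.items())

# Assumption: n adjacent genes in a contig are separated by n - 1 intergenic sequences.
n_intergenic = sum((size - 1) * count for size, count in contigs_by_size.items())

# Assumption: a gene with k exons has k - 1 introns.
n_exons = 1028
n_introns = n_exons - n_genes

print(n_contigs, n_genes, n_intergenic, n_introns)  # 74 168 94 860
```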
11
sensitivity and specificity
sensitivity (Se): true predictions / actual cases
- how many of the actual cases does the software find ?
specificity (Sp): true predictions / total predictions
- how many of the predictions are correct ?
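The two ratios can be written as small functions. The sketch below applies them to the GENSCAN exon counts from the 1999 evaluation (1028 real exons, 938 predicted exons, 649 correct exons); the function and variable names are illustrative. Note that specificity as defined here (true predictions over total predictions) is the quantity often called precision elsewhere.

```python
def sensitivity(true_predictions: int, actual_cases: int) -> float:
    """Se: fraction of the actual cases that were correctly predicted."""
    return true_predictions / actual_cases


def specificity(true_predictions: int, total_predictions: int) -> float:
    """Sp as defined on the slide: fraction of the predictions that are correct."""
    return true_predictions / total_predictions


# GENSCAN exon counts taken from the 1999 evaluation (Pavy et al., 1999)
real_exons = 1028
predicted_exons = 938
correct_exons = 649

se = sensitivity(correct_exons, real_exons)
sp = specificity(correct_exons, predicted_exons)
print(f"Se = {se:.2f}, Sp = {sp:.2f}")  # Se = 0.63, Sp = 0.69
```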
12
1999 evaluation of exon prediction
Pavy et al. (1999) Bioinformatics 15:887-899
Real exons : 1028
Program            Predicted  Sensitivity  Specificity  Wrong exon   Correct  Overlap.  Wrong
                   exons      Snef         Spef         ratio Wef    exons    exons     exons
                                                                     (cef)    (oef)     (wef)
GENSCAN              938      0.63         0.69         0.12         649      182       110
GeneMark.hmm        1104      0.82         0.76         0.10         844      144       110
MZEF prior p=0.01    641      0.37         0.60         0.21         382      126       134
MZEF prior p=0.04    846      0.43         0.52         0.27         438      178       231
MZEF prior p=0.10    998      0.45         0.47         0.32         467      210       322
FGENE               1061      0.55         0.53         0.28         562      197       299
GRAIL               1184      0.43         0.38         0.25         444      440       293
FEX                 1745      0.53         0.31         0.57         547      208       993
FGENESP              737      0.41         0.57         0.21         423      156       156
13
1999 evaluation of gene prediction
Pavy et al. (1999) Bioinformatics 15:887-899
Gene Modeler   Actual  Predicted  Correct     Missing  Partial      Wrong  Dispatched  Fused  Sensitivity  Specificity
               genes   genes      gene model  gene     gene models  genes  genes       genes
GENSCAN        168     150        28          1        139          13     1           60     0.17         0.19
GeneMark.hmm   168     208        67          1        100          27     18          12     0.40         0.32
FGENESP        168      92        10          47       111           3     0           60     0.06         0.11
Correct gene model = every exon exactly predicted
14
EUGENE, a Directed Acyclic Graph Algorithm
Schiex et al. (2001) LNCS, 2066:111-125
15
EUGENE features
Integrate different sources of information, e.g. in the current Arabidopsis v2 version
built in : IMM for exon/intron/UTR/intergenic
plug in : NetGene2, SplicePredictor, NetStart
filters : RepeatMasker
homology : BLAST search in protein & DNA DB, Sim4 search in EST/cDNA collections
borders : function to use 5’ & 3’ EST data
globally optimized to maximize gene prediction accuracy on a set of annotated sequences
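The general idea of predicting a gene structure as an optimal path through a directed acyclic graph can be illustrated with a toy dynamic program. This is not EUGENE's implementation: the three states, the per-position scores, the allowed transitions and the `switch_penalty` parameter are all simplifying assumptions made for illustration, and strand, reading frame and the real evidence sources listed above are ignored.

```python
# Toy model: each nucleotide position gets a score for each state, and the
# best labelling is the highest-scoring path through the (position, state) DAG.
STATES = ("intergenic", "exon", "intron")

# Hypothetical transition rules: which state may follow which.
ALLOWED = {
    "intergenic": ("intergenic", "exon"),
    "exon": ("exon", "intron", "intergenic"),
    "intron": ("intron", "exon"),
}


def best_path(scores, switch_penalty=1.0):
    """Return (total score, list of states) of the best path.

    scores: one dict per position, mapping each state to its evidence score.
    """
    # best[s] holds the best (score, path) among paths ending in state s.
    best = {s: (scores[0][s], [s]) for s in STATES}
    for position_scores in scores[1:]:
        new_best = {}
        for state in STATES:
            candidates = []
            for prev in STATES:
                if state not in ALLOWED[prev]:
                    continue
                total, path = best[prev]
                penalty = 0.0 if prev == state else switch_penalty
                candidates.append(
                    (total - penalty + position_scores[state], path + [state])
                )
            new_best[state] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


def evidence(favoured):
    """Hypothetical evidence: score 2 for the favoured state, 0 otherwise."""
    return {s: (2.0 if s == favoured else 0.0) for s in STATES}


toy_scores = [evidence(s) for s in
              ("intergenic", "exon", "exon", "intron", "exon", "intergenic")]
total, path = best_path(toy_scores, switch_penalty=0.5)
print(total, path)
```

In this toy setting the integration of several information sources would amount to summing their contributions into the per-position scores before the path search.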
16
a typical EUGENE output graph
17
2002 evaluation of gene prediction
Program        Sensitivity (%)  Specificity (%)  Author
GenScan        17               19               C. Burge
GeneMark.hmm   41               37               M. Borodovsky
GlimmerA       30               19               S. Salzberg
FgenesH-GC     57               55               V. Solovyev
Eugene         67               56               T. Schiex
Eugene+        76               68
18
THE Arabidopsis GENOME
AGI : 26514 genes
EUGENE : 29787 genes (+ 12.3 %)
19
Example : chromosome I
AGI = 6494 genes, EUGENE = 7489 genes
[Figure: number of genes (y axis) against position on chromosome I in Mb (x axis), AGI vs EuGène]
20
Aubourg, Samson, Brunaud & Lecharny URGV INRA/CNRS Evry, poster
21
EuGene vs AGI
AGI and EUGENE genes with exactly the same predicted structure : 10352
AGI and EUGENE genes with the same start and stop codon but internal differences : 3565
[Diagrams: AGI and EuGene gene models aligned between their start and stop codons]
using FLAGdb++ (V. Brunaud, F. Samson, A. Lecharny, S. Aubourg) -> poster
22
EuGene vs AGI
AGI genes without cognate EuGene gene : 2255
EuGene genes without cognate AGI gene : 3379
[Diagrams: start and stop positions of AGI and EuGene gene models for both cases]
23
EuGene vs AGI
AGI genes with at least 2 overlapping/inserted EUGENE genes (Split of EuGene) : 2191
EuGene genes with at least 2 overlapping/inserted AGI genes (Split of AGI) : 409
[Diagrams: start and stop positions of AGI and EuGene gene models for both cases]
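The comparison categories used on these EuGene vs AGI slides can be sketched as a small classifier. This is not how FLAGdb++ performs the comparison: the representation of a gene model as a sorted list of (start, end) exon coordinates on a single strand, the use of the first and last coordinate as start and stop codon positions, and all function names and example coordinates are assumptions made for illustration.

```python
def span(gene):
    """Return (start, stop) of a gene model given as a sorted list of exons."""
    return gene[0][0], gene[-1][1]


def overlaps(a, b):
    """True if the spans of two gene models share at least one position."""
    (a_start, a_stop), (b_start, b_stop) = span(a), span(b)
    return a_start <= b_stop and b_start <= a_stop


def compare(gene, other_annotation):
    """Classify one gene model against all gene models of the other annotation."""
    partners = [g for g in other_annotation if overlaps(gene, g)]
    if not partners:
        return "no cognate gene"
    if len(partners) >= 2:
        return "split (at least 2 overlapping genes)"
    partner = partners[0]
    if gene == partner:
        return "same structure"
    if span(gene) == span(partner):
        return "same start and stop, internal differences"
    return "other difference"


# Hypothetical coordinates, one strand only
agi = [
    [(100, 200), (300, 400)],
    [(1000, 1100), (1200, 1300)],
    [(2000, 2500)],
    [(3000, 4000)],
]
eugene = [
    [(100, 200), (300, 400)],
    [(1000, 1150), (1200, 1300)],
    [(3000, 3400)],
    [(3500, 4000)],
]

results = [compare(gene, eugene) for gene in agi]
for gene, result in zip(agi, results):
    print(span(gene), result)
```

Running the same function with the two annotations swapped gives the EuGene-centred categories.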
24
the CATMA gene set
29787 EUGENE predicted genes
2334 documented/manually checked
29555
1484 AGI genes not detected by EUGENE
32 non-coding RNAs (P. Green)
201 controls
31272 complete CATMA non-redundant gene set
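The gene counts quoted on the slides can be cross-checked with simple arithmetic. The sketch assumes that the unlabeled figure 29555 is the EUGENE-derived part of the gene set; the slide does not say so, but it is the value that makes the components sum to 31272.

```python
# Gene counts from the genome slide
agi_genes = 26514
eugene_genes = 29787
increase = (eugene_genes - agi_genes) / agi_genes
print(f"EUGENE vs AGI: +{increase:.1%}")  # EUGENE vs AGI: +12.3%

# Components of the CATMA gene set.
# Assumption: 29555 is the EUGENE-derived contribution (unlabeled on the slide).
components = {
    "EUGENE-derived genes (assumed)": 29555,
    "AGI genes not detected by EUGENE": 1484,
    "non-coding RNAs": 32,
    "controls": 201,
}
total = sum(components.values())
print(total)  # 31272
```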
25
GST design
26
RATIONALE
the GSTs are designed to be specific for a single gene, even if it is a member of a gene family
1) Find a specific sequence for each gene
[Diagram: gene A and gene B share a region of homology; the A probe has to be designed outside this region]
2) Amplify a region => 2 primers, 1 probe / gene
27
SPADS Specific Primers & Amplicon Design Software
GST location in the transcript :
* The GST is entirely inside an exon or overlaps an intron (then 50% of the GST sequence is in exons)
* GSTs are searched in the 3’->5’ direction, to take into account the bias towards partial mRNAs missing 5’ sequences
Specificity :
* GST specificity : checked with BLASTn against the whole genome
* Primer specificity : checked with BLASTn against the PCR template
GST size : 150 to 500 bp
Vincent THAREAU
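The 3’->5’ search for a specific region can be sketched as a sliding window. This is not SPADS: the real specificity check uses BLASTn against the whole genome, whereas the sketch substitutes a naive ungapped identity scan against a list of other sequences. The function names, the fixed window `size`, the `max_sim` threshold and the toy sequences are all assumptions made for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_similarity(candidate: str, others) -> float:
    """Highest ungapped identity of the candidate against any same-length
    window of the other sequences (a crude stand-in for a BLASTn search)."""
    best = 0.0
    n = len(candidate)
    for seq in others:
        for i in range(len(seq) - n + 1):
            best = max(best, identity(candidate, seq[i:i + n]))
    return best


def find_gst(transcript: str, others, size=150, max_sim=0.70):
    """Scan windows from the 3' end towards the 5' end and return the
    (start, end) of the first window that is dissimilar enough, else None."""
    for end in range(len(transcript), size - 1, -1):
        start = end - size
        if max_similarity(transcript[start:end], others) < max_sim:
            return start, end
    return None


# Toy example with a 4-base window so the result can be followed by hand:
# the 3'-most windows resemble the other sequence and are skipped.
hit = find_gst("AAAACCCCGGGG", ["TTTTGGGG"], size=4, max_sim=0.70)
print(hit)  # (6, 10)
```

Starting at the 3’ end means the first acceptable window is also the 3’-most one, which matches the stated preference for regions present in partial mRNAs.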
28
CATMA v.1
H class : 47%
M class : 18%
No GST : 35%
H class : similarity with the closest paralogue below 40%
M class : similarity with the closest paralogue below 70%
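The class definitions translate into a simple threshold rule. The sketch assumes that the similarity is expressed as a percentage and that a gene whose best candidate is not below 70% falls into the "No GST" group; the slide gives the thresholds but does not spell out that last rule.

```python
def gst_class(similarity_to_closest_paralogue: float) -> str:
    """Classify a GST by its percent similarity to the closest paralogue."""
    if similarity_to_closest_paralogue < 40:
        return "H"
    if similarity_to_closest_paralogue < 70:
        return "M"
    # Assumption: no sufficiently specific region means no GST is designed.
    return "no GST"


for value in (25.0, 55.0, 85.0):
    print(value, gst_class(value))
```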
29
CATMA v.1 features
21120 Gene Specific Tags
almost 2/3 of the GSTs are located in the 3’-most part of the transcript
97.4 % of the GSTs are entirely in exons
2.6 % of the GSTs are overlapping introns
30
the CATMA database
http://www.catma.org
Mark Crowe, John Innes Centre
31
QUERIES
Preset Queries
Advanced SQL Queries
BLAST
32
CATMA v.2
Since January 2001 new genome data have become available, especially full-length cDNAs and 5’/3’ border ESTs (CERES, RIKEN), and the current annotation has improved (TIGR, MIPS)
A second run of annotation is currently ongoing, using a new version of EUGENE that can exploit 5’/3’ ESTs
33
CATMA v.2
New GSTs will be designed with SPADS when the CATMA v.1 GSTs are no longer supported by the EUGENE re-annotation, as well as for new genes
For predicted genes for which no GST could be designed, SPADS will be re-run after adding a 150 bp tail after the STOP codon (virtual 3’UTR)
objective >= 25000 GSTs
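Adding a virtual 3’UTR amounts to extending the gene coordinates by 150 bp past the stop codon, in the direction of transcription. The sketch below assumes 1-based inclusive coordinates with start < end, a strand given as "+" or "-", and clipping at the chromosome ends; the function name and all parameters other than the 150 bp tail are illustrative.

```python
def add_virtual_utr(start: int, end: int, strand: str,
                    chrom_length: int, tail: int = 150):
    """Extend gene coordinates by `tail` bp downstream of the stop codon.

    Coordinates are 1-based and inclusive, with start < end on both strands.
    """
    if strand == "+":
        # Stop codon is at the high-coordinate end.
        return start, min(end + tail, chrom_length)
    if strand == "-":
        # Stop codon is at the low-coordinate end.
        return max(start - tail, 1), end
    raise ValueError(f"unknown strand: {strand!r}")


print(add_virtual_utr(1000, 2000, "+", chrom_length=1_000_000))  # (1000, 2150)
print(add_virtual_utr(1000, 2000, "-", chrom_length=1_000_000))  # (850, 2000)
```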
34
THE SONS OF CATMA
35
CAGE
micro-array analysis
Coordination : Martin Kuiper
Goal : to allow comparison of micro-array transcript profiling experiments
36
AGRIKOLA
Arabidopsis Genomic RNA Interference Knock-Out Line Analysis
Coordinator: Ian Small
GOAL : Lines silenced specifically for most Arabidopsis genes
GST cloning -> GST-based hpRNA vectors -> comprehensive silenced line collection
37
Acknowledgements
Carine Serizet, Ghent (GénoPlante/VIB)
Vincent Thareau, Ghent (GénoPlante)
Mark Crowe, JIC Norwich
Sébastien Aubourg, VIB Ghent / URGV Evry
Thomas Schiex, Sylvain Foissac, INRA Toulouse
Patrice Déhais / Eric Bonnet, VIB Ghent
Stephane Rombauts, VIB Ghent
Pierre Rouzé, INRA Ghent
Pierre Hilson, URGV Evry / VIB Ghent
Funding : GénoPlante, URGV, VIB