Design and Data Basing of Genome-wide Gene Specific Tags


Page 1: Design and Data Basing of  Genome-wide Gene Specific Tags

1st Plant GEMS meeting Berlin 2002

Design and Data Basing of Genome-wide Gene Specific Tags

Complete Arabidopsis Transcriptome MicroArray (CATMA)

Page 2: Design and Data Basing of  Genome-wide Gene Specific Tags

the partners in CATMA

Countries: Belgium, France, Germany, Switzerland, Sweden, Spain, Netherlands, United Kingdom

Partners: Pierre Hilson* (coord.), Vincent Colot, Pierre Rouzé, Philippe Reymond, Jim Beynon, Wilfried Nietfeld, Peter Weisbeek, Javier Paz-Ares, Rishikesh Bhalerao, Paul Van Hummelen, Mark Crowe / Martin Trick

CATMA is a consortium of 10 research groups from 8 European countries, formed in October 2000

Page 3: Design and Data Basing of  Genome-wide Gene Specific Tags

functional studies

gene specific data

Gene Specific Tags for Micro Arrays

Page 4: Design and Data Basing of  Genome-wide Gene Specific Tags

GOAL

Construction of a Gene Specific Tag (GST) collection representing most Arabidopsis genes

Page 5: Design and Data Basing of  Genome-wide Gene Specific Tags

STEPS

1) Homogeneous structural re-annotation of the whole genome sequence using EUGENE

2) Search for the Gene Specific Tag location in each gene and design of primers for PCR amplification using SPADS

3) Build the CATMA database and enter the GST, primer and gene data into it

Page 6: Design and Data Basing of  Genome-wide Gene Specific Tags

CATMA flow chart

Genome sequence -> Structural Annotation: EUGENE (T. Schiex) -> Gene models

Gene models -> Design of GST primers: SPADS (V. Thareau) -> GST & primers sequence

GST & primers sequence -> CATMA Database (M. Crowe)

PCR from BACs/genome -> Gene Sequence Tags -> Spotting -> Micro Arrays

Page 7: Design and Data Basing of  Genome-wide Gene Specific Tags

STRUCTURAL ANNOTATION

Page 8: Design and Data Basing of  Genome-wide Gene Specific Tags

WHY ?

At the beginning of the project (October 2000), the AGI annotation was nearly complete, but this annotation suffered from major drawbacks:

* annotation methodology differed from one AGI consortium to another
* the annotations were made over several years, so the first and the last differ in quality
* gene models were often wrong

Page 9: Design and Data Basing of  Genome-wide Gene Specific Tags

HOW ?

Based on the validation of existing gene prediction tools (Pavy et al., 1999), we had a view of the efficiency of each of them for each gene feature and for gene modeling as a whole

A “parasitic” software (EUGENE) was developed (Schiex et al., 2001), which integrates the various sources of information available to produce gene models for the whole Arabidopsis genome

Page 10: Design and Data Basing of  Genome-wide Gene Specific Tags

VALIDATION

the data set : AraSet

566 kb of Arabidopsis genome sequence containing 74 contigs of documented genes, each manually checked for consistency

57 contigs of 2 genes -> 114
14 contigs of 3 genes -> 42
3 contigs of 4 genes -> 12

168 genes (1028 exons, 860 introns)

94 intergenic sequences
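The AraSet counts can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the totals quoted on the slide. The reading that a contig of n genes contributes n - 1 intergenic sequences is an assumption, although it does reproduce the quoted 94.

```python
# Contigs in AraSet, keyed by number of genes per contig (figures from the slide).
contigs_by_gene_count = {2: 57, 3: 14, 4: 3}

n_contigs = sum(contigs_by_gene_count.values())
n_genes = sum(genes * count for genes, count in contigs_by_gene_count.items())

# Assumption: a contig holding n genes has n - 1 intergenic sequences between them.
n_intergenic = sum((genes - 1) * count for genes, count in contigs_by_gene_count.items())

# Each gene has one more exon than it has introns, so exons - introns = genes.
n_exons, n_introns = 1028, 860

print(n_contigs, n_genes, n_intergenic, n_exons - n_introns)
```

All four printed values agree with the slide: 74 contigs, 168 genes, 94 intergenic sequences, and an exon/intron difference equal to the gene count.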

Page 11: Design and Data Basing of  Genome-wide Gene Specific Tags

sensitivity and specificity

sensitivity (Se): true predictions / actual cases
- how often is the software correct ?

specificity (Sp): true predictions / total predictions
- how many false predictions are given ?
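As a minimal sketch of the two ratios defined here, the function below computes Se and Sp from a set of predicted features and a set of actual ones. The exon coordinates are hypothetical, and the exact-match criterion (both ends identical) is an assumption made for the example.

```python
def sensitivity_specificity(predicted, actual):
    """Return (Se, Sp) for two collections of features, e.g. exon coordinates."""
    predicted, actual = set(predicted), set(actual)
    true_predictions = len(predicted & actual)
    se = true_predictions / len(actual) if actual else 0.0
    sp = true_predictions / len(predicted) if predicted else 0.0
    return se, sp

# Hypothetical exons as (start, end) pairs; a prediction counts only if both ends match.
actual_exons = {(100, 250), (400, 520), (700, 910), (1200, 1350)}
predicted_exons = {(100, 250), (400, 520), (705, 910), (1500, 1600), (1700, 1800)}

se, sp = sensitivity_specificity(predicted_exons, actual_exons)
print(f"Se = {se:.2f}, Sp = {sp:.2f}")
```

Note that specificity in this gene-prediction sense (true predictions over all predictions) is the quantity called precision in other fields.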

Page 12: Design and Data Basing of  Genome-wide Gene Specific Tags

1999 evaluation of exon prediction

Pavy et al. (1999) Bioinformatics 15:887-899

Real exons : 1028

Program             Predicted   Snef   Spef    Wef    cef    oef    wef
GENSCAN                   938   0.63   0.69   0.12    649    182    110
GeneMark.hmm             1104   0.82   0.76   0.10    844    144    110
MZEF prior p=0.01         641   0.37   0.60   0.21    382    126    134
MZEF prior p=0.04         846   0.43   0.52   0.27    438    178    231
MZEF prior p=0.10         998   0.45   0.47   0.32    467    210    322
FGENE                    1061   0.55   0.53   0.28    562    197    299
GRAIL                    1184   0.43   0.38   0.25    444    440    293
FEX                      1745   0.53   0.31   0.57    547    208    993
FGENESP                   737   0.41   0.57   0.21    423    156    156

Predicted : predicted exons. Snef : sensitivity. Spef : specificity. Wef : ratio. cef : correct exons. oef : overlapping exons.
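The ratios in the exon table can be recomputed from the counts. The sketch below does so for three programs. That Snef is correct exons over real exons and Spef is correct exons over predicted exons follows the Se/Sp definitions given earlier in the talk. That the Wef ratio is the wef count over predicted exons is an inference from the numbers, not something the slide states.

```python
REAL_EXONS = 1028  # from the slide

# program: (predicted exons, correct exons, wef count), copied from the table
counts = {
    "GENSCAN": (938, 649, 110),
    "GeneMark.hmm": (1104, 844, 110),
    "FEX": (1745, 547, 993),
}

def exon_ratios(predicted, correct, wef):
    """Return (Snef, Spef, Wef) rounded to two decimals."""
    return (
        round(correct / REAL_EXONS, 2),  # sensitivity
        round(correct / predicted, 2),   # specificity
        round(wef / predicted, 2),       # inferred definition of the Wef ratio
    )

for program, row in counts.items():
    print(program, exon_ratios(*row))
```

For these three rows the recomputed values match the table to two decimals.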

Page 13: Design and Data Basing of  Genome-wide Gene Specific Tags

1999 evaluation of gene prediction

Pavy et al. (1999) Bioinformatics 15:887-899

Gene modeler   Actual  Predicted  Correct  Missing  Partial  Wrong  Dispatched  Fused    Se    Sp
GENSCAN           168        150       28        1      139     13           1     60  0.17  0.19
GeneMark.hmm      168        208       67        1      100     27          18     12  0.40  0.32
FGENESP           168         92       10       47      111      3           0     60  0.06  0.11

Actual : actual genes. Predicted : predicted genes. Correct : correct gene models. Missing : missing genes. Partial : partial gene models. Wrong : wrong genes. Dispatched : dispatched genes. Fused : fused genes. Se : sensitivity. Sp : specificity.

Correct gene model = every exon exactly predicted


Page 14: Design and Data Basing of  Genome-wide Gene Specific Tags

EUGENE, a Directed Acyclic Graph Algorithm

Schiex et al. (2001) LNCS, 2066:111-125
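To illustrate what a best-path search over a directed acyclic graph looks like, here is a toy sketch. It is not EUGENE's actual model: the three states, the per-position scores and the flat switch penalty are all invented for the example. Nodes are (position, state) pairs and edges only run from one position to the next, so a single left-to-right pass finds the highest-scoring path.

```python
STATES = ("intergenic", "exon", "intron")

def best_path(scores, switch_penalty=1.0):
    """scores: one dict per sequence position, mapping state -> score."""
    best = {s: scores[0][s] for s in STATES}
    back = []
    for pos_scores in scores[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Staying in the same state is free; changing state costs a penalty.
            def cost(p):
                return 0.0 if p == s else switch_penalty
            prev = max(STATES, key=lambda p: best[p] - cost(p))
            new_best[s] = best[prev] - cost(prev) + pos_scores[s]
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

toy_scores = [
    {"intergenic": 2, "exon": 0, "intron": 0},
    {"intergenic": 0, "exon": 3, "intron": 0},
    {"intergenic": 0, "exon": 3, "intron": 0},
    {"intergenic": 0, "exon": 0, "intron": 3},
    {"intergenic": 0, "exon": 3, "intron": 0},
]
print(best_path(toy_scores))
```

The design point this is meant to show is that every source of evidence can be folded into node or edge scores, after which one dynamic-programming pass yields a globally optimal labelling.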

Page 15: Design and Data Basing of  Genome-wide Gene Specific Tags

EUGENE features

Integrates different sources of information, e.g. in the current Arabidopsis v2 version :
* built in : IMM for exon/intron/UTR/intergenic
* plug in : NetGene2, SplicePredictor, NetStart
* filters : RepeatMasker
* homology : BLAST search in protein & DNA DB, Sim4 search in EST/cDNA collections
* borders : function to use 5’ & 3’ EST data

Globally optimized to maximize gene prediction accuracy on a set of annotated sequences

Page 16: Design and Data Basing of  Genome-wide Gene Specific Tags

a typical EUGENE output graph

Page 17: Design and Data Basing of  Genome-wide Gene Specific Tags

2002 evaluation of gene prediction

Program        Sensitivity %  Specificity %  Author
GenScan                   17             19  C. Burge
GeneMark.hmm              41             37  M. Borodovsky
GlimmerA                  30             19  S. Salzberg
FgenesH-GC                57             55  V. Solovyev
Eugene                    67             56  T. Schiex
Eugene+                   76             68

Page 18: Design and Data Basing of  Genome-wide Gene Specific Tags

THE Arabidopsis GENOME

AGI : 26514 genes

EUGENE : 29787 genes (+ 12.3 %)

Page 19: Design and Data Basing of  Genome-wide Gene Specific Tags

Example : chromosome I

AGI = 6494 genes
EUGENE = 7489 genes

[Figure: number of genes along chromosome I (position in Mb), AGI vs EuGène]

Page 20: Design and Data Basing of  Genome-wide Gene Specific Tags

Aubourg, Samson, Brunaud & Lecharny URGV INRA/CNRS Evry, poster

Page 21: Design and Data Basing of  Genome-wide Gene Specific Tags

EuGene vs AGI

AGI and EUGENE genes with exactly the same predicted structure : 10352

AGI and EUGENE genes with the same start and stop codon but internal differences : 3565

[Diagrams: AGI and EuGene gene models aligned from start to stop]

using FLAGdb++ (V. Brunaud, F. Samson, A. Lecharny, S. Aubourg) -> poster

Page 22: Design and Data Basing of  Genome-wide Gene Specific Tags

EuGene vs AGI

AGI genes without cognate EuGene gene : 2255

EuGene genes without cognate AGI gene : 3379

[Diagrams: AGI and EuGene gene models aligned from start to stop]

Page 23: Design and Data Basing of  Genome-wide Gene Specific Tags

EuGene vs AGI

AGI genes with at least 2 overlapping/inserted EUGENE genes (Split of EuGene) : 2191

EuGene genes with at least 2 overlapping/inserted AGI genes (Split of AGI) : 409

[Diagrams: AGI and EuGene gene models aligned from start to stop]
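The comparison categories used on these EuGene vs AGI slides can be sketched as a small classifier. This is a hypothetical illustration, not the FLAGdb++ procedure: the `Gene` record, the overlap test and the category names are assumptions, and strand is ignored for brevity.

```python
from typing import NamedTuple, Tuple

class Gene(NamedTuple):
    start: int                           # position of the start codon
    stop: int                            # position of the stop codon
    exons: Tuple[Tuple[int, int], ...]   # exon (start, end) pairs

def overlaps(a, b):
    return a.start <= b.stop and b.start <= a.stop

def classify(gene, other_annotation):
    """Place one gene in a category relative to the other annotation."""
    partners = [g for g in other_annotation if overlaps(gene, g)]
    if not partners:
        return "no cognate"
    if len(partners) >= 2:
        return "split in other annotation"
    partner = partners[0]
    if partner.exons == gene.exons:
        return "same structure"
    if (partner.start, partner.stop) == (gene.start, gene.stop):
        return "same start/stop, internal differences"
    return "other overlap"

# Hypothetical gene models for the two annotations.
agi = [
    Gene(100, 900, ((100, 300), (500, 900))),
    Gene(2000, 2600, ((2000, 2600),)),
    Gene(5000, 5400, ((5000, 5400),)),
]
eugene = [
    Gene(100, 900, ((100, 300), (520, 900))),
    Gene(2000, 2250, ((2000, 2250),)),
    Gene(2300, 2600, ((2300, 2600),)),
]
for gene in agi:
    print(gene.start, classify(gene, eugene))
```

Running the same function with the two annotations swapped gives the reverse counts (for example EuGene genes without a cognate AGI gene).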

Page 24: Design and Data Basing of  Genome-wide Gene Specific Tags

the CATMA gene set

29787 EUGENE predicted genes
29555
2334 documented/manually checked
1484 AGI genes, not detected by EUGENE
32 Non-coding RNAs (P. Green)
201 Controls

31272 Complete CATMA non redundant gene set

Page 25: Design and Data Basing of  Genome-wide Gene Specific Tags

GST design

Page 26: Design and Data Basing of  Genome-wide Gene Specific Tags

RATIONALE

the GSTs are designed in order to be specific for a single gene, even if it is a member of a gene family

1) Find a specific sequence for each gene
   gene A, gene B : where there is a region of homology between genes A & B, the A probe has to be designed outside this region

2) Amplify a region => 2 primers, 1 probe / gene

Page 27: Design and Data Basing of  Genome-wide Gene Specific Tags

SPADS : Specific Primers & Amplicon Design Software

GST location in the transcript :
* The GST is entirely inside an exon or overlaps an intron (then 50% of the GST sequence is in exons)
* GSTs are searched in the 3’->5’ direction, to take into account the bias towards partial mRNAs missing 5’ sequences

Specificity :
* GST specificity : checked with BLASTn against the whole genome
* Primer specificity : checked with BLASTn against the PCR template

GST size : 150 to 500 bp

Vincent THAREAU
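The 3’->5’ search and the 150 to 500 bp size limits can be illustrated with a short sketch. This is not the SPADS code: the function name, the representation of homologous regions as intervals and the choice to return the 3’-most acceptable window are assumptions made for illustration. In the real tool the homologous regions would come from the BLASTn specificity check.

```python
def find_gst_window(transcript_len, homologous, min_len=150, max_len=500):
    """Return (start, end), 0-based half-open, of the 3'-most candidate window.

    homologous: (start, end) intervals to avoid, e.g. regions shared with paralogues.
    Returns None when no homology-free stretch reaches min_len.
    """
    # Build the list of stretches not covered by any homologous interval.
    free, cursor = [], 0
    for h_start, h_end in sorted(homologous):
        if h_start > cursor:
            free.append((cursor, h_start))
        cursor = max(cursor, h_end)
    if cursor < transcript_len:
        free.append((cursor, transcript_len))
    # Walk the free stretches from the 3' end towards the 5' end.
    for start, end in reversed(free):
        if end - start >= min_len:
            # Keep the 3'-most part of the stretch, capped at max_len.
            return max(start, end - max_len), end
    return None

print(find_gst_window(1500, [(300, 700), (1400, 1500)]))
print(find_gst_window(400, [(100, 350)]))
```

In the first call the window sits just upstream of the homologous 3’ end. In the second no free stretch is long enough, which corresponds to a gene with no GST.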

Page 28: Design and Data Basing of  Genome-wide Gene Specific Tags

CATMA v.1

H class : 47%
M class : 18%
No GST : 35%

H class : similarity with the closest paralogue below 40%
M class : similarity with the closest paralogue below 70%
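The two class thresholds translate into a simple rule. The sketch below is an assumption-laden reading of the slide: it treats the bounds as strict, takes the M class to cover 40% up to (but not including) 70%, and returns `None` for anything above, although genes may also lack a GST for other reasons.

```python
def gst_class(similarity_pct):
    """Classify a candidate GST by % similarity with its closest paralogue."""
    if similarity_pct < 40:
        return "H"
    if similarity_pct < 70:
        return "M"
    return None  # no GST under these thresholds

# Shares reported for CATMA v.1; the three figures account for all genes.
shares = {"H": 47, "M": 18, "No GST": 35}

print(gst_class(25), gst_class(55), gst_class(85), sum(shares.values()))
```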

Page 29: Design and Data Basing of  Genome-wide Gene Specific Tags

CATMA v.1 features

21120 Gene Specific Tags

almost 2/3 of the GSTs are located in the 3’-most part of the transcript

97.4 % of the GSTs are entirely in exons
2.6 % of the GSTs overlap introns

Page 30: Design and Data Basing of  Genome-wide Gene Specific Tags

the CATMA database
http://www.catma.org

Mark Crowe, John Innes Centre

Page 31: Design and Data Basing of  Genome-wide Gene Specific Tags

QUERIES

Preset Queries

Advanced SQL Queries

BLAST

Page 32: Design and Data Basing of  Genome-wide Gene Specific Tags

CATMA v.2

Since January 2001, new genome data have become available, especially full-length cDNAs and 5’/3’ border ESTs (CERES, RIKEN), and the current annotation has improved (TIGR, MIPS)

A second run of annotation is ongoing, using a new version of EUGENE that can exploit 5’/3’ ESTs

Page 33: Design and Data Basing of  Genome-wide Gene Specific Tags

CATMA v.2

New GSTs will be designed with SPADS for new genes and wherever the CATMA v.1 GSTs are no longer supported by the EUGENE re-annotation

For predicted genes for which no GST could be designed, SPADS will be re-run after adding a 150 bp tail after the STOP codon (virtual 3’UTR)

objective >= 25000 GSTs
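Adding a virtual 3’UTR amounts to extending each gene model by 150 bp downstream of its STOP codon, which depends on the strand. The helper below is a hypothetical sketch of that step: the coordinate convention (1-based, inclusive, start <= end on the chromosome) and the clamping at chromosome ends are assumptions.

```python
def with_virtual_utr(cds_start, cds_end, strand, chrom_len, tail=150):
    """Extend a gene model by `tail` bp downstream of its STOP codon.

    On the "+" strand the STOP is at cds_end; on the "-" strand it is at cds_start.
    """
    if strand == "+":
        return cds_start, min(cds_end + tail, chrom_len)
    if strand == "-":
        return max(cds_start - tail, 1), cds_end
    raise ValueError(f"unknown strand: {strand!r}")

print(with_virtual_utr(1000, 2000, "+", 10000))
print(with_virtual_utr(1000, 2000, "-", 10000))
```

The extended interval is then what a GST search would be run against, giving genes with no specific region in their coding sequence a second chance.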

Page 34: Design and Data Basing of  Genome-wide Gene Specific Tags

THE SONS OF CATMA

Page 35: Design and Data Basing of  Genome-wide Gene Specific Tags

CAGE : micro-array analysis

Coordination : Martin Kuiper

Goal : to allow comparison of micro-array transcript profiling experiments

Page 36: Design and Data Basing of  Genome-wide Gene Specific Tags

AGRIKOLA : Arabidopsis Genomic RNA Interference Knock-Out Line Analysis

Coordinator : Ian Small

GOAL : lines silenced specifically for most Arabidopsis genes

GST cloning -> GST-based hpRNA vectors -> comprehensive silenced line collection

Page 37: Design and Data Basing of  Genome-wide Gene Specific Tags

Acknowledgements

Carine Serizet, Ghent (GénoPlante/VIB)
Vincent Thareau, Ghent (GénoPlante)
Mark Crowe, JIC Norwich
Sébastien Aubourg, VIB Ghent / URGV Evry
Thomas Schiex, Sylvain Foissac, INRA Toulouse
Patrice Déhais / Eric Bonnet, VIB Gent
Stephane Rombauts, VIB Gent
Pierre Rouzé, INRA Ghent
Pierre Hilson, URGV Evry / VIB Ghent

Funding : GénoPlante, URGV, VIB