TAIR: Bringing together data for the global plant biology community
Philippe Lamesch, Kate Dreher
The Arabidopsis Information Resource
www.arabidopsis.org
contact us: [email protected]
Outline
o Philippe Lamesch
• Introducing TAIR and PMN
• TAIR10 genome annotation
• TAIR gene confidence ranking
• TAIR tools
o Kate Dreher
TAIR: The Arabidopsis Information Resource
• collect, curate and distribute information on Arabidopsis
• information freely available from arabidopsis.org
Slides available from TAIR www.arabidopsis.org
TAIR is used worldwide
Visits per month (source: Google Analytics)
TAIR usage worldwide: July 2009-July 2010
What TAIR does: (1) Arabidopsis genome annotation
What TAIR does: (2) manual literature curation
• Controlled vocabulary annotations
Gene Ontology (GO) http://www.geneontology.org/
Plant Ontology (PO) http://www.plantontology.org/
• Gene name, symbol
• Allele, phenotype
• Summary statement composition
Who we partner with:
PMN (Plant Metabolic Network) and PlantCyc
A comprehensive plant biochemical pathway database, containing curated information from the literature and computational analyses about the genes, enzymes, compounds, reactions, and pathways involved in primary and secondary metabolism
Who we partner with: ABRC
Distribution of biological research materials
• A new approach for improving the Arabidopsis genome annotation for TAIR10
• The Arabidopsis gene structure confidence ranking
Arabidopsis genome annotation
• Arabidopsis genome sequenced almost 10 years ago
• High quality sequence with few gaps
• TIGR did initial genome annotation
• TAIR took over responsibility in 2005
• Current TAIR9 stats:
  27,379 protein coding genes
  4,827 pseudogenes or transposable elements
  1,312 ncRNAs
Genome annotation at TAIR
• Add novel genes
• Update exon/intron structures of existing genes
• Delete mispredicted genes
• Merge and split genes
• Change gene types
• Add splice-variants
Genome annotation at TAIR
Annotate ‘atypical’ gene classes
Trans. element
Short protein-coding genes
Transposable element genes
Pseudogenes
uORFs (genes within UTR of other genes)
Arabidopsis gene structure annotation: a new approach
TAIR6-TAIR9: Use ESTs and cDNAs and an assembly tool called PASA to improve gene structures
TAIR10
TAIR10: Use new experimental data and new prediction tools to further improve gene structure predictions
Genome annotation TAIR6-TAIR9: using PASA and ESTs/cDNAs
[Figure: transcripts from NCBI are clustered; the clustered transcripts yield a resulting gene model, which is compared with the previous gene model]
Outcomes of the comparison:
• Novel genes
• New splice-variants
• Gene structure updates
Manual annotation at TAIR: Apollo
[Figure: Apollo evidence tracks: ESTs, cDNAs, radish sequence alignments, Eugene prediction, dicot sequence alignments, monocot sequence alignments, Aceview gene predictions, short MS peptide; 2 gene isoforms shown]
TAIR10: using proteomics and RNA-seq data to improve genome annotation
4-step process:
1. Mapping RNA-seq & peptides
2. Assembly/gene build
3. Manual review
4. Integration (genome release/GBrowse)
Mapping and Assembly
1. Mapping
• RNA-seq sequences (TopHat (C. Trapnell), SuperSplat (T.C. Mockler))
• Peptides (6-frame translation, spliced exon graph)
2. Assembly approaches
• Augustus (M. Stanke)
  o Uses spliced RNA-seq reads, peptides
  o Aim: Identify additional splice-variants, update existing genes
• TAU (T.C. Mockler)
  o Uses spliced RNA-seq reads
  o Aim: Identify additional splice-variants
• Cufflinks (C. Trapnell)
  o Uses spliced and unspliced RNA-seq data
  o Aim: Identify novel genes
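The peptide-mapping step above mentions 6-frame translation. As a minimal sketch of that general idea (not the pipeline TAIR actually ran), the Python below translates a DNA sequence in all six reading frames and reports which frames contain a given peptide. The function names and the toy sequence are illustrative assumptions; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_peptide(dna, peptide):
    """Hypothetical helper: list the frames whose translation contains the peptide."""
    return [f for f, protein in six_frame_translation(dna).items() if peptide in protein]


# Toy example (made-up sequence, not Arabidopsis data).
print(six_frame_translation("ATGGCCAAATAA"))
print(find_peptide("ATGGCCAAATAA", "MAK"))
```

A peptide that spans an exon-exon boundary will not appear in any single genomic frame, which is presumably why the slide also lists a spliced exon graph as a second mapping target.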
Augustus
• RNA-seq datasets (Mockler Lab, Ecker Lab)
• Mapping tools: TopHat, SuperSplat
• 200 million aligned RNA-seq reads
• 203,000 clustered spliced RNA-seq junctions
• 145,000 RNA-seq junctions based on >1 read
• 260,000 peptides (Baerenfaller et al., Castellana et al.)
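The ">1 read" criterion can be pictured as a simple support filter: count how many spliced reads agree on each junction and keep only junctions seen more than once. The sketch below is an illustration under stated assumptions, not the code used for TAIR10; the tuple layout (chromosome, donor, acceptor, strand), the function names, and the example coordinates are all hypothetical.

```python
from collections import Counter


def count_junction_support(spliced_reads):
    """Count reads per junction.

    spliced_reads: iterable of (chrom, donor, acceptor, strand) tuples,
    one tuple per spliced read alignment (assumed input format).
    """
    return Counter(spliced_reads)


def filter_junctions(support, min_reads=2):
    """Keep junctions supported by at least min_reads reads (default: more than one)."""
    return {junction: n for junction, n in support.items() if n >= min_reads}


# Made-up alignments: one junction seen twice, one seen once.
reads = [
    ("Chr1", 100, 200, "+"),
    ("Chr1", 100, 200, "+"),
    ("Chr1", 300, 400, "-"),
]
support = count_junction_support(reads)
print(filter_junctions(support))
```

Requiring more than one read trades sensitivity for specificity: junctions produced by a single misaligned read are dropped, at the cost of losing genuine junctions in weakly expressed transcripts.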
Augustus gene prediction
+ ESTs & cDNAs
+ AGI models
• 11% of RNA-seq junctions incorporated into Augustus models
• 64% of peptide sequences incorporated into Augustus models
Predicted Augustus models:
• 5461 distinct models
• 1596 novel models
Categorisation/Review
[Figure: tracks shown: TAU models, RNA-seq junctions, Augustus model, TAIR confidence rank, TAIR model, peptides. Labels: (splice variants, NMD targets), (correction), (colour reflects matching model). Callouts: incorrect junction in TAIR model; unsupported exon.]
Example Augustus update
Example Augustus splice variant
Example 2 Augustus splice variant
Augustus/TAU/Cufflinks
Augustus
• Incorporates 64% of peptides not contained in TAIR, 11% for RNA-seq junctions
• 5461 potential updated genes
• 1596 potential novel genes
TAU
• 30,083 junctions distinct to Augustus or TAIR models
• 10,902 junctions incorporated into 10,491 TAU models
Cufflinks
• 367 novel assemblies which fall above the 100 bp
#TE-filter applied to Augustus and Cufflinks models
Preliminary TAIR10 Results
Novel genes: 126
Updated genes: 1182
Splice-variants: 5885 (18% of all loci)
B-list: 1586
Rejects: 2318
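As a rough consistency check on the "18% of all loci" figure, the arithmetic below divides the splice-variant count by a locus total. This rests on two assumptions not stated on the slide: that 5885 counts loci with splice variants, and that the denominator is close to the sum of the TAIR9 counts given earlier in the talk (the TAIR10 total may differ slightly).

```python
# TAIR9 counts as quoted earlier in the talk; used here as an assumed denominator.
tair9_loci = {
    "protein_coding": 27379,
    "pseudogenes_or_transposable_elements": 4827,
    "ncRNAs": 1312,
}
total_loci = sum(tair9_loci.values())

# Preliminary TAIR10 figure from the slide.
loci_with_splice_variants = 5885

percent = 100 * loci_with_splice_variants / total_loci
print(f"{total_loci} loci, {percent:.1f}% with splice variants")
# prints "33518 loci, 17.6% with splice variants"
```

Under these assumptions the result rounds to the 18% quoted on the slide.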
Gene Confidence Rank
• Attributes confidence scores to all exons and gene models based on different types of experimental and computational evidence
Assigning A Confidence Rank
E1
E4
Full support
No support
New and updated tools at TAIR
• N-Browse
• GBrowse
• Synteny viewer
• N-Browse (in collaboration with the Kris Gunsalus Lab, NYU)
• > 7,000 experimental interactions
• Interactions curated by TAIR, IntAct & BioGrid
• Tutorial at http://www.arabidopsis.org/tools/nbrowse.jsp#nb-tut
N-Browse
N-Browse: Finding information about edges (interactions)
N-Browse: How to select and move nodes
N-Browse: How to visualize GO terms from a selected set of nodes
N-Browse: How to load your own file and overlay it with the curated interaction data
N-Browse: How to save your session and export your data
New Tools at TAIR
• N-Browse
• GBrowse
• Synteny viewer
GBrowse
Header
Main Browser Window
Track Menu
Alternative gene annotations
• Eugene (transcript, proteins +) Sebastien Aubourg
• Gnomon (transcript, proteins) Souvorov (NCBI)
• Aceview (transcript) Thierry-Mieg (NCBI)
• Hanada et al 2007 (3633 predicted genes)
Proteomic Data
• High-density Arabidopsis proteome map (Baerenfaller et al., 2008)
Incorrect start codon
VISTA plot GBrowse track
Transcriptome data
Orthologs and Gene Families
Variation
Promoter Elements
Methylation
Decorated Fasta file
New Tools at TAIR
• N-Browse
• GBrowse
• Synteny viewer
Data provided by Pedro Pattyn at the University of Ghent
AT5G48000
AT5G48010
AT5G47990
Example 2 Augustus update