Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)
Prashant S Hosmani, Surya Saha, Mirella Flores, Stephane Rombauts, Florian Maumus, Henri van de Geest, Gabino Sanchez-Perez and Lukas Mueller
Boyce Thompson Institute, Ithaca, NY
VIB Department of Plant Systems Biology, Ghent University, Gent, Belgium
URGI, INRA, Université Paris-Saclay, Versailles, France
Wageningen Plant Research, Wageningen University, Netherlands
Acknowledgements
Gabino Sanchez
Henri van de Geest
SGN Community (You!)
RNAseq data contributors
Stephane Rombauts
Florian Maumus
SL3.0
Solanum lycopersicum
Heinz 1706
BAC Integration Workflow
Automatic integration of BACs
Manual validation
NCBI validation
https://github.com/solgenomics/Bio-GenomeUpdate
BAC assemblies
Align to SL2.50
• 500 bp BAC ends
• 100% identity
Place BACs
1,069 full-length HTGS phase 3 BACs integrated and ~11 Mb of contig gaps removed
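The placement rule on this slide (500 bp BAC ends, 100% identity against SL2.50) can be sketched roughly as follows. This is only an illustration and not the Bio-GenomeUpdate code: the `EndHit` record, its fields and the `place_bac` helper are hypothetical, and strand and gap handling are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

END_LEN = 500  # bp taken from each BAC end, per the slide


@dataclass
class EndHit:
    """One alignment of a BAC end to the reference (hypothetical record)."""
    chrom: str
    start: int    # 1-based reference start
    end: int      # 1-based reference end, inclusive
    matches: int  # identical bases in the alignment
    aln_len: int  # alignment length in bp


def is_perfect(hit: EndHit) -> bool:
    """True when the full 500 bp end aligns at 100% identity."""
    return hit.aln_len == END_LEN and hit.matches == hit.aln_len


def place_bac(left: EndHit, right: EndHit) -> Optional[Tuple[str, int, int]]:
    """Return the (chrom, start, end) span the BAC would cover, or None."""
    if not (is_perfect(left) and is_perfect(right)):
        return None
    if left.chrom != right.chrom:
        return None
    return (left.chrom, min(left.start, right.start), max(left.end, right.end))
```

A BAC whose two ends both pass the check yields one reference span; anything else is left for the manual validation step.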
BioNano Workflow
Assemble molecules into CMaps
Hybrid assembly with NGS scaffolds
Manual validation
Hybrid assembly statistics
Scaffolds: 57
Total Genome Map Length: 779.789 Mb
Avg. Genome Map Length: 13.681 Mb
Genome Map N50: 25.384 Mb
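For readers less familiar with these statistics, here is a minimal sketch of how an N50 is usually computed, plus a consistency check of the average against the total and scaffold count given above. The `n50` helper is illustrative; the individual genome map lengths are not in the slides, so only the mean can be checked.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one length")


# Values copied from the hybrid assembly statistics on the slide
total_mb = 779.789
n_maps = 57
mean_mb = total_mb / n_maps  # agrees with the reported 13.681 Mb after rounding
```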
Chr00 Integration
[Figure labels: Chr00, Chr02, Cmap 84]
• Chr00 contig NW_004194391.1 (203,142bp) inserted in chr09 150kb scaffold gap
• Two Inversions on chromosome 12
• 19 gaps resized
Chr00 contig NW_004194387.1 (561,203bp) integrated in 1.4Mb scaffold gap
ITAG3.0
Annotation
Structural annotation pipeline
Repeat masking the genome
Evidence – RNA and protein
ITAG 2.40 gene models
Post-processing
• Genes with functional domain support
• Assign Solyc-ID to novel genes
Repeat identification and masking of the genome
• Generated custom repeat library (RepeatModeler)
• Exclusion of repeats with similarity to known proteins from SwissProt (ProtExcluder)
• Masked 56.39% of the genome (RepeatMasker)
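A masked percentage like the one above can be recomputed from a soft-masked sequence, where masked bases are written in lowercase. The sketch below assumes soft-masking was used, which the slides do not state, and `masked_fraction` is a hypothetical helper rather than part of any of the tools named.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases written in lowercase (soft-masked)."""
    if not seq:
        raise ValueError("empty sequence")
    masked = sum(1 for base in seq if base.islower())
    return masked / len(seq)


# Toy sequence (made up): 4 of 10 bases are soft-masked
toy = "ACGTacgtAC"
toy_fraction = masked_fraction(toy)
```

For a whole genome the same count would be accumulated over every sequence in the FASTA file before dividing.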
Repeat identification and classification
Extensive identification and classification of repeats using REPET, which masks 61% of the SL3.0 reference genome.
Florian Maumus
ITAG 2.40 processing
• ITAG2.40 protein-coding genes: 34,725
• Webapollo curated genes
• Removed contamination (56)
• Removed transposons (2,244): 32,425
• ITAG2.40 mapped with GMAP to the SL3.0 repeat-masked genome: 31,309
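The gene counts on this slide follow from a simple subtraction, shown here as a quick check. Treating 31,309 as the subset of the 32,425 filtered genes that GMAP placed on SL3.0 is an assumption; the slide does not spell out that relationship.

```python
# Counts copied from the slide
itag240_genes = 34_725
contamination = 56
transposons = 2_244

# Genes left after removing contamination and transposons
after_filtering = itag240_genes - contamination - transposons

# Assumed: the mapped count is a subset of the filtered set
mapped = 31_309
mapped_fraction = mapped / after_filtering
```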
Expression evidence for annotation
Expression data evidence
• 8 billion RNAseq reads
• Tissue and treatment specific RNAseq
• 5’ and 3’ UTR enriched RNAseq
• RENseq for NBS-LRR genes
• Pacbio Iso-seq data
• SwissProt plant proteins
Reads were mapped to SL3.0 and the transcriptome was assembled
Mapping rate ~85%
RNAseq data sources
• Jim Giovannoni (BTI/USDA)
• Jocelyn Rose (Cornell)
• Greg Martin (BTI)
• Zhangjun Fei (BTI/USDA)
• Jonathan Jones (The Sainsbury Laboratory)
• Asaph Aharoni (Weizmann Institute of Science)
• Neelima Sinha (University of California, Davis)
MAKER pipeline
Ab-initio gene prediction methods
• Augustus (Training using BRAKER1)
• SNAP (MAKER based training)
• GeneMark (with high quality genes)
• Eugene (Stephane Rombauts)
Updating legacy annotation (ITAG2.40)
Post-processing
Added only genes with functional domain support (Pfam): ~800 genes.
Removed genes with 70% overlap with repeats (674 genes).
Assigned Solyc IDs to novel genes following the ITAG convention.
Novel genes are assigned Solyc IDs that fall between existing Solyc IDs.
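The repeat-overlap filter in the post-processing step could look something like the sketch below. Several details are assumptions because the slides do not give them: overlap is measured over the whole gene span rather than exons only, the cutoff is applied as "at least 70%", and the function names are hypothetical.

```python
def overlap_fraction(gene, repeats):
    """Fraction of a gene interval covered by repeats.

    gene is (start, end), 1-based inclusive; repeats is an iterable of
    (start, end) intervals on the same sequence.
    """
    g_start, g_end = gene
    # Clip each repeat to the gene, then merge so overlaps are counted once
    clipped = sorted(
        (max(s, g_start), min(e, g_end))
        for s, e in repeats
        if s <= g_end and e >= g_start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end + 1:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / (g_end - g_start + 1)


def keep_gene(gene, repeats, threshold=0.70):
    """Keep the gene unless repeats cover at least `threshold` of it."""
    return overlap_fraction(gene, repeats) < threshold
```

With a gene at (1, 100) and repeats at (1, 50) and (40, 80), the merged coverage is 80 bp, so the gene would be dropped under these assumptions.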
Improvements in ITAG 3.0 compared with ITAG 2.40
                   ITAG 2.40   ITAG 3.0
# of genes         34,725      34,769
Avg. gene length   1,209 bp    1,529 bp
Exons per gene     4.61        5.10
5’ UTR per gene    0.39        0.63
3’ UTR per gene    0.44        0.62
Novel genes in ITAG3.0 – 5,822
Gene structure improvement example
[Figure: ITAG3.0 and ITAG2.40 gene models compared in two panels, "Correct fusion example" and "UTR example", each with an RNAseq XY plot track]
Quality check - Annotation Edit Distance (AED)
AED = 0: complete support
AED = 1: lack of support
[Figure: plot with AED axis]
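The slides give only the two endpoints of the AED scale. As a rough illustration, AED is commonly defined at the nucleotide level as 1 minus the average of sensitivity and specificity between an annotation and its supporting evidence. The sketch below follows that definition; it is not taken from the slides, and representing each feature as a set of genomic positions is a simplification chosen for clarity.

```python
def aed(annotation: set, evidence: set) -> float:
    """Annotation Edit Distance between two sets of nucleotide positions.

    0.0 means the annotation and the evidence agree exactly;
    1.0 means they share no positions at all.
    """
    if not annotation or not evidence:
        return 1.0
    shared = len(annotation & evidence)
    sensitivity = shared / len(evidence)     # evidence covered by the annotation
    specificity = shared / len(annotation)   # annotation covered by evidence
    return 1.0 - (sensitivity + specificity) / 2.0
```

A gene model that overlaps its evidence over half of each, for example positions 0-99 against 50-149, scores 0.5 on this scale.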
Functional annotation
Automated Assignment of Human Readable Descriptions (AHRD)
Swissprot plant protein database
TrEMBL plant protein database
Araport 11 (Arabidopsis latest annotation)
User curated locus information from solgenomics.net (2000+)
Unknown proteins
In ITAG 3.0, 409 genes have the functional description “Unknown protein”, compared with 7,689 in ITAG 2.40
Functional annotation
Automated Assignment of Human Readable Descriptions (AHRD)
AHRD-Version 3.3.2
Quality score (***)
Solyc08g081780.1.1 Dirigent protein (***)
Solyc01g008960.2.1 Argonaute family protein (***)
Solyc01g013880.1.1 Leucine-rich repeat receptor-like protein kinase family protein (*-*)
Position   Criteria
1          Bit score of the blast result is >50 and e-value is <1e-10
2          Alignment of the blast result is >60%
3          Human Readable Description score is >0.5
“AHRD’s quality-code consists of a three character string, where each
character is either ‘*’ if the respective criteria is met or ‘-’ otherwise.”
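The three criteria above translate directly into a three-character code. The function below only mirrors the table on this slide; it is not AHRD's own implementation, and the parameter names are made up for illustration.

```python
def ahrd_quality_code(bit_score, e_value, alignment_pct, hrd_score):
    """Build the three-character quality string described on the slide."""
    checks = (
        bit_score > 50 and e_value < 1e-10,  # position 1
        alignment_pct > 60,                  # position 2
        hrd_score > 0.5,                     # position 3
    )
    return "".join("*" if passed else "-" for passed in checks)
```

A hit that passes the blast score and description checks but aligns over only 40% gets `*-*`, the same pattern shown for Solyc01g013880.1.1.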
Novel genes in ITAG3.0
5,822 novel genes in ITAG 3.0
Future work
Genome
Improving genome assembly by sequencing with PacBio technology
Annotation
tRNA, non-coding RNA annotation
Multiple isoforms
Co-expression network based functional annotation
Workshop: SGN and RTB Databases, Tuesday, Jan 17, 10:30 AM
Posters
Surya Saha: Improved Tomato Genome Reference (SL3.0) using Full-Length BACs, BioNano Optical Maps and SGN Community Resources (P0798)
Prashant Hosmani: ITAG3.0 Annotation for the New Tomato Reference Genome SL3.0 (P0797)
Thank you!!
Questions??
Data available to download from
FTP
• ITAG 3.0
• GFF, proteins, transcripts, CDS
• List of fused genes
SGN Workshop, SOL 2016
Gap Reduction
[Figure: chart by chromosome (1-12) showing BACs integrated (axis 0-500) and reduction in contig gaps (axis 0-70%)]
Repeat classification
Class                      Type               Total
LTR retrotransposon        Copia              64840935
                           Gypsy              260719161
                           TRIM/LARD          671571
Non-LTR retrotransposon    LINE               9871924
Putative_retrotransposon   Putative_RT        528982
DNA                        DNA                20712725
Helitron                   Helitron           1210271
TIR                        TIR                12144035
Confused                   Confused           48373586
Unclassified               Unclassified       70850157
Endogenous virus           Endogenous virus   5839457
Hostgene                   Hostgene           5044454
Tandem repeats             Tandem repeats     8901715
Ns
SUM repeats                                   509708973
Mapping rates for different RNAseq data
RNAseq data   # of reads (millions)   REPET light   RepeatModeler light
AC_Jim        637                     86.87%        88.03%
epigenome     82                      60.77%        64.35%
UTR seq       87                      85.88%        86.57%
TEA part A    4,295                   84.41%        84.39%
TEA part B    2,449                   84.40%        84.71%
RENseq        15                      32.91%        39.83%
Yang          331                     79.94%        80.28%
Total reads   7,930
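One way to summarize the table above is a read-weighted mean of the per-dataset mapping rates, sketched below with the numbers copied from the slide. Note that the rounded per-dataset read counts add up to 7,896 million rather than the 7,930 shown, so the weights here are approximate. The `weighted_rate` helper is illustrative only.

```python
# (dataset, reads in millions, REPET light %, RepeatModeler light %)
rows = [
    ("AC_Jim", 637, 86.87, 88.03),
    ("epigenome", 82, 60.77, 64.35),
    ("UTR seq", 87, 85.88, 86.57),
    ("TEA part A", 4295, 84.41, 84.39),
    ("TEA part B", 2449, 84.40, 84.71),
    ("RENseq", 15, 32.91, 39.83),
    ("Yang", 331, 79.94, 80.28),
]


def weighted_rate(table, column):
    """Mean mapping rate for a column, weighted by each dataset's read count."""
    total_reads = sum(row[1] for row in table)
    return sum(row[1] * row[column] for row in table) / total_reads


repet_rate = weighted_rate(rows, 2)
repeatmodeler_rate = weighted_rate(rows, 3)
```

Both weighted means land between 84% and 85%, close to the ~85% mapping rate quoted earlier in the talk, with the RepeatModeler-masked genome slightly higher.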