bits of the green junk
DESCRIPTION
By Florian Maumus and Hadi Quesneville. We present our opinions, recent developments and perspectives regarding whole-genome repeatome annotation. This talk was presented by Florian Maumus at the Barbados Workshop on the Computational Identification and Analysis of Transposable Elements, Holetown, Barbados, April 18-24, 2014.
TRANSCRIPT
Barbados Workshop on the Computational Identification and Analysis of Transposable Elements
April 18th - 25th, 2014
Florian Maumus with Hadi Quesneville (URGI-INRA, Versailles, France)
REPET package
TEdenovo, TEannot
Genome, Repeat annotation
Hadi Quesneville
De novo repeatome detection
Deep repeatome annotation
Repeat annotation in large genomes
Repeat complement = Repeatome
The Repeatome includes:
• Transposable elements
• Endogenous viruses
• Tandem repeats
• Ribozymes
• Genes
• …
= What you get with repeat-finders!
Burst and Decay
« Repeats » Old repeats Dark matter
Dark matter, the genomic humus
Detected Detectable? Background Noise
Burst Decay Melt
Turnover ++, recent activity +++
Turnover -, recent activity -
Complexity of the repeatome
old
young
Maize: 2.3 Gb genome, about 85% repeats
Human: 3.2 Gb genome, about 50% repeats
Different history, different challenges
LECA: core eukaryotic genes + Copia, Gypsy, LINEs, DNA transposons…
TEs have been jumping around genes over evolutionary times
[Image: contents of a professional archaeology tool roll (trowels, scale rulers, brush, hand shovel, mason line, line pegs, line level, plumb bob, knife)]
Archeology toolbox
Repeatome toolbox
K-mer strict: Tallymer, DSK
K-mer based: RepeatScout, P-clouds
Similarity: e.g. Recon
Combined: RepeatModeler (RepeatScout + Recon)
Combined: TEdenovo (Recon + Piler + Grouper; + RepeatScout in v2.2)
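To make the "k-mer strict" family concrete, here is a minimal sketch of exact k-mer counting. It is not how Tallymer or DSK are implemented (both use far more memory-efficient index structures); the function name `repeated_kmers` and the toy sequence are assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter


def repeated_kmers(sequence, k, min_count=2):
    """Return exact k-mers that occur at least min_count times.

    Illustrative only: forward strand, k-mers containing N are skipped.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {
        kmer: n
        for kmer, n in counts.items()
        if n >= min_count and "N" not in kmer
    }


# Toy sequence (hypothetical) in which ACGTACG occurs twice.
toy = "ACGTACGTTTGACGTACGA"
print(repeated_kmers(toy, k=5))
```

Positions covered by frequent k-mers give a quick first estimate of the repeated fraction, but exact matching misses diverged copies, which is where the similarity-based tools listed above come in.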
REPET: TEdenovo
TEdenovo pipeline, Consensus library
+ RepeatScout (v2.2)
REPET Classification utility
REPET Classification tool
Consensus library
TR search: Tandem Repeat Finder
BLASTx, tBLASTx: Repbase
Pfam hmm, GyDB hmm
rDNA, tRNA
Host genes
Summary of evidence => Proposed classification
Consensus 1: termLTRs; 0.12% TR; Bx: AtGypsy; Btx: none; profiles: IN, RT => LTR retro
Consensus 2: none; 0.32% TR; Bx: none; Btx: none; profiles: LRR => Host gene
Consensus 3: none; 0.23% TR; Bx: none; Btx: none; profiles: none => Unclassified
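The step from a summary of evidence to a proposed classification can be sketched as a small rule set. This is not the REPET classifier: the `Evidence` class, the `propose_classification` function and the three rules are hypothetical, chosen only so that they reproduce the three example consensus sequences on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Evidence gathered for one consensus (field names are assumptions)."""
    structure: str = "none"                         # e.g. "termLTRs"
    blast_hits: list = field(default_factory=list)  # Repbase hits (BLASTx/tBLASTx)
    profiles: list = field(default_factory=list)    # HMM profile hits


TE_PROFILES = {"IN", "RT"}   # assumed TE-related domains (integrase, reverse transcriptase)
HOST_PROFILES = {"LRR"}      # assumed host-gene domain, taken from the slide example


def propose_classification(ev):
    """Toy decision rules; a real classifier weighs many more features."""
    te_domains = TE_PROFILES & set(ev.profiles)
    if ev.structure == "termLTRs" and (te_domains or ev.blast_hits):
        return "LTR retro"
    if HOST_PROFILES & set(ev.profiles) and not te_domains:
        return "Host gene"
    return "Unclassified"


consensus_1 = Evidence("termLTRs", ["AtGypsy"], ["IN", "RT"])
consensus_2 = Evidence(profiles=["LRR"])
consensus_3 = Evidence()
for ev in (consensus_1, consensus_2, consensus_3):
    print(propose_classification(ev))
```

Keeping the evidence and the decision separate, as on the slide, lets the rules be tightened or relaxed without re-running the searches.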
TEannot pipeline genome annotation
REPET: TEannot
TEdenovo
RepeatScout
RepeatModeler
Performance, complementarity?
Experimental model
Arabidopsis thaliana: 120 Mb
[Figure: consensus sequences and genome coverage, axis 0 to 50 Mb]
Sensitivity & Specificity
[Figure: percent reference coverage (0 to 100) for Tallymer, TRF, TEdenovo, RepeatModeler, RepeatScout, TEdenovo+RS+RM and All]
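The slides do not spell out how sensitivity and specificity were computed. A common base-level definition, assumed here, treats every genome position as a true or false positive or negative with respect to a reference annotation. The function names and the toy intervals below are hypothetical.

```python
def covered(intervals):
    """Set of 0-based positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def sensitivity_specificity(predicted, reference, genome_length):
    """Base-level sensitivity and specificity of predicted vs reference.

    Assumes the reference covers at least one base and less than the
    whole genome, so neither denominator is zero.
    """
    pred, ref = covered(predicted), covered(reference)
    tp = len(pred & ref)                 # annotated in both
    fn = len(ref - pred)                 # reference bases that were missed
    fp = len(pred - ref)                 # predicted bases absent from the reference
    tn = genome_length - tp - fn - fp    # annotated in neither
    return tp / (tp + fn), tn / (tn + fp)


# Toy 100 bp genome with made-up annotations.
sens, spec = sensitivity_specificity(
    predicted=[(15, 35), (50, 55)],
    reference=[(10, 30), (50, 60)],
    genome_length=100,
)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Position sets are fine for a toy example; for whole genomes an interval-based or bit-array implementation would be needed.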
Sensitivity
[Figure: percent 24-nt sRNA coverage (Lister et al., 2008), 0 to 100, for TEdenovo, RepeatModeler, RepeatScout, TEdenovo+RS+RM and All]
Biological Sensitivity
[Figure: genome coverage increase (%), 0 to 35, for TEdenovo, RepeatModeler and RepeatScout]
REPET, RepeatScout, and RepeatModeler employ complementary computational methods that together better represent repeatome complexity.
Conclusions I
TEdenovo outcompetes RepeatModeler and RepeatScout
Greater coverage with:
• Fewer consensus sequences
• Larger consensus sequences
• Larger copies
Complementarity of TEdenovo, RepeatModeler and RepeatScout
Comprehensive annotation of complex repeatomes
De novo repeatome detection
Deep repeatome annotation
Repeat annotation in large genomes
Arabidopsis: 120 Mb
[Figure: genome composition from 0% to 100%: CDS, repeatome, dark matter]
Experimental model
Three strategies with REPET:
• Annotate genome with genomic copies
• Use relaxed parameters for HSP detection
• Use P-clouds to detect short repeat fragments
Iterative annotation: annotate genome with genomic copies
(Expand the knowledge)
[Diagram: Genome => TEdenovo => Consensus => TEannot => Genomic copies => TEannot => Genomic copies, repeated over four rounds]
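The iterative strategy, in which the copies found in one round become the queries of the next, can be sketched as a loop. This is a toy stand-in and not TEannot: `annotate` uses ungapped, fixed-length matching with a simple identity threshold, and all names, sequences and thresholds are assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def annotate(genome, library, min_identity):
    """Toy annotation: {start: copy} for ungapped windows matching any query."""
    hits = {}
    for query in library:
        n = len(query)
        for start in range(len(genome) - n + 1):
            window = genome[start:start + n]
            if identity(query, window) >= min_identity:
                hits[start] = window
    return hits


def iterative_annotation(genome, consensus_library, min_identity=0.8, max_rounds=4):
    """Re-annotate with the genomic copies found in the previous round."""
    library = set(consensus_library)
    all_hits = {}
    for _ in range(max_rounds):
        hits = annotate(genome, library, min_identity)
        new = {s: c for s, c in hits.items() if s not in all_hits}
        if not new:
            break                      # nothing new was detected: stop
        all_hits.update(new)
        library = set(new.values())    # genomic copies become the next queries
    return all_hits


consensus = "ACGTTGCAAT"
copy_1 = "ACGTTGCAGG"   # 80% identical to the consensus
copy_2 = "TTGTTGCAGG"   # 60% identical to the consensus, 80% to copy_1
genome = "NNNNN" + copy_1 + "NNNNN" + copy_2 + "NNNNN"

print(sorted(annotate(genome, [consensus], 0.8)))        # single round
print(sorted(iterative_annotation(genome, [consensus])))  # iterated
```

In the toy genome the more diverged copy is invisible to the consensus but is recovered through the intermediate copy, which is the intuition behind "expand the knowledge". The same mechanism can also propagate false positives, so each round needs to be checked for relevance.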
Iterative annotation: annotate genome with genomic copies
[Figure: percent coverage (0 to 100) of RepeatModeler, RepeatScout, Tallymer, 24-nt sRNA, Reference and Genome by TEdenovo_1, TEdenovo_2, TEdenovo_3 and TEdenovo_4]
Iterative annotation: annotate genome with genomic copies
[Figure: dinucleotide composition (AA to TT, axis -0.05 to 0.15) for CDS, TEdenovo, delta_2vs1, delta_3vs2 and delta_4vs3]
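Comparing dinucleotide composition between sequence sets is one way to check whether newly annotated regions resemble known repeats rather than coding sequence. The slides do not state how the plotted values were normalized, so the sketch below simply computes relative dinucleotide frequencies and their difference between two sets; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]
VALID = set(DINUCLEOTIDES)


def dinucleotide_frequencies(sequences):
    """Relative frequency of each of the 16 dinucleotides (overlapping windows)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if pair in VALID:          # skip pairs containing N or other symbols
                counts[pair] += 1
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in DINUCLEOTIDES}


def frequency_difference(set_a, set_b):
    """Per-dinucleotide difference in frequency between two sequence sets."""
    freq_a = dinucleotide_frequencies(set_a)
    freq_b = dinucleotide_frequencies(set_b)
    return {d: freq_a[d] - freq_b[d] for d in DINUCLEOTIDES}


# Made-up sequences standing in for two annotation sets.
print(frequency_difference(["ACGTACGT"], ["AATTAATT"]))
```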
Relevance
Genome annotation using the delta_2vs1 copies masks as much as 23 Mb (19.5%) of the genome
Covers 66% of the reference annotation and 56% of the TEdenovo annotation
The supplementary annotations from TEdenovo_2 are highly representative of the A. thaliana repeatome.
Relaxed (parameters) annotation
Default: identity > 90%, E-value < 1e-300
Cool: identity > 85%, E-value < 1e-50
Soft: identity > 80%, E-value < 1e-20
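The three parameter sets amount to progressively looser filters on the high scoring pairs (HSPs). The sketch below only illustrates that filtering logic; it does not show how REPET is configured, and the `PRESETS` dictionary, the HSP field names and the example values are assumptions.

```python
# Thresholds taken from the slide; the structure and names are hypothetical.
PRESETS = {
    "default": {"min_identity": 90.0, "max_evalue": 1e-300},
    "cool": {"min_identity": 85.0, "max_evalue": 1e-50},
    "soft": {"min_identity": 80.0, "max_evalue": 1e-20},
}


def filter_hsps(hsps, preset):
    """Keep HSPs whose identity is above and E-value below the preset thresholds."""
    limits = PRESETS[preset]
    return [
        hsp for hsp in hsps
        if hsp["identity"] > limits["min_identity"]
        and hsp["evalue"] < limits["max_evalue"]
    ]


# Made-up HSPs for illustration.
hsps = [
    {"id": "a", "identity": 95.0, "evalue": 0.0},
    {"id": "b", "identity": 87.0, "evalue": 1e-60},
    {"id": "c", "identity": 82.0, "evalue": 1e-25},
    {"id": "d", "identity": 78.0, "evalue": 1e-100},
]
for name in PRESETS:
    print(name, [hsp["id"] for hsp in filter_hsps(hsps, name)])
```

Each looser preset keeps everything the stricter one keeps, so the gain in sensitivity comes entirely from more diverged and shorter matches.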
[Figure: consensus size; percent coverage (0 to 100) of RepeatModeler, RepeatScout, Reference, Tallymer and 24-nt sRNA by TEdenovo_1, TEdenovo_cool, TEdenovo_soft and TEdenovo_soft_2]
Relaxed (parameters) annotation
[Figure: copy/consensus identity along chr1 for TEdenovo, Cool and Soft]
Deep annotation of the A. thaliana repeatome
RepeatScout
RepeatModeler
TEdenovo
Repbase (+ Buisine et al.)
Remove redundancy
Bundle library
TEannot
[Figure: consensus size, selected vs. not selected]
Deep annotation of the A. thaliana repeatome
TEannot
P-clouds
Complete bundle annotation
P-clouds: copies, consensus, in-cloud k-mers (De Koning et al.)
• TEdenovo
• Bundle
=> Repeated and repeat-derived sequences contribute at least 30% to the A. thaliana genome
Enhanced repeat detection in gene-rich regions
• Bundle + P-clouds
Arabidopsis repeats browser
Deep annotations
REPET
RepeatModeler
RepeatScout
Buisine et al.
24-nt sRNA
Genes
Conclusions II
Innovative approaches for deep repeatome annotation
• About one third of the A. thaliana genome of repetitive origin (vs 24%)
• Increased sensitivity and detection of old repeat remnants
• Improved genome evolution and epigenetic analyses
Continuum between repeatome and genomic dark matter
Time
De novo repeatome detection
Deep repeatome annotation
Repeat annotation in large genomes
All genomes should benefit from the greater quality of TEdenovo
Adapted from Nina V. Fedoroff (2012) and Steven M. Carr
Limitations with REPET
All-by-all genome comparison => LOTS (Gb) of high scoring pairs (HSPs)
HSP files > 1 Gb are not handled by Piler
Grouper can run for weeks
Impossible to run TEdenovo on whole large and/or highly repeated genomes until recently
Solutions
Use a sample of the whole genome as input for TEdenovo (e.g. 300 Mb)
(As recommended for RepeatModeler)
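The slides do not say how the sample was drawn. One simple possibility, assumed here, is to cut the genome into fixed-size chunks and pick chunks at random until the target size is reached. The function name, the chunk size and the toy sequences are hypothetical.

```python
import random


def sample_genome(sequences, target_bp, chunk_size, seed=0):
    """Pick random non-overlapping chunks until about target_bp is collected.

    sequences maps a sequence name to its string. Returns a list of
    (name, start, chunk) tuples.
    """
    chunks = []
    for name, seq in sequences.items():
        for start in range(0, len(seq), chunk_size):
            chunks.append((name, start, seq[start:start + chunk_size]))
    random.Random(seed).shuffle(chunks)   # fixed seed keeps the sample reproducible

    sample, total = [], 0
    for chunk in chunks:
        if total >= target_bp:
            break
        sample.append(chunk)
        total += len(chunk[2])
    return sample


# Toy genome; a real run would target something like 300 Mb.
toy_genome = {"chr1": "A" * 1000, "chr2": "C" * 500}
picked = sample_genome(toy_genome, target_bp=600, chunk_size=100)
print(len(picked), sum(len(chunk) for _, _, chunk in picked))
```

Chunks should be much longer than the longest expected element so that full-length copies stay intact, and repeat families with few copies may be missed in a sample. For a real genome, storing chunk coordinates instead of chunk strings would avoid holding a second copy of the genome in memory.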
Tomato genomes
S. pennellii : 942 Mb
S. lycopersicum : 782 Mb
320 Mb input
TEdenovo (n HSP >= 5)
Consensus library
TEannot
[Figure: axis 0, 0.5, 1 Gb]
82% of the Solanum pennellii ATGC space masked
Conclusions III
Efficient annotation of large plant genomes with REPET
Still quite a long process!
De novo repeat annotation in large genomes
Future developments
Parallelize Grouper
Parallelize the “Long join” procedure
Establish phyla-specific approaches
Develop strategies to annotate genomes with different composition: old, complex repeatomes as compared to large plant genomes
De novo repeat annotation in large genomes
Future challenges & perspectives
Propose TEdenovo and TEannot pipelines on GALAXY
Deliver REPET compilation for use on a cloud
Véronique Jamilloux
Tina Alaeitabar
Timothée Chaumier
Mark Moissette
Olivier Inizan
Hadi Quesneville
THANK YOU!