Diversity and survival strategies of LTR
retrotransposons in the Arabidopsis genome
Brooke Peterson-Burch
Voytas Laboratory
Iowa State University
Beyond genes
Most DNA in eukaryotes doesn’t code for anything necessary for the survival and replication of the organism.
How did that sequence get there? Why isn’t it eliminated?
Genome sequences can teach us about genome evolution and the part that retroelements play
What’s a retroelement?
Type of transposable element
An mRNA copy of the parental element ‘genome’ is reverse transcribed into DNA and inserted into a new location in the host
Transposition is replicative
Retroelement genomes
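[Figure: genome organization of the retroelement groups. Retroviridae (HIV-1): LTR, gag (MA, CA, NC, p6), pol (PR, RT, RH, IN), env (SU, TM), accessory genes (vif, vpr, vpu, tat, rev, nef), LTR. Retroposons: gag, EN, RT, RH, poly(A) tail (AAAn). Pseudoviridae and Metaviridae: LTR-flanked elements with MA, CA, NC, PR, RT, RH and IN. Dirs: gag, RT, RH, λ recombinase. BEL: gag, PR, RT, RH, IN.]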
Retro living…
Retroelement life cycle
[Figure: life cycle steps shown for HIV-1 (Retroviridae) and a Pseudoviridae element]
Element → Transcription → mRNA
mRNA → Translation
Packaging → Particle (only viruses escape host cell)
Reverse Transcription → cDNA
Integration (IN) → New Copy
Retroelements play a major role in the structure and evolution of many genomes
Genome sequences provide a great resource for diversity, distribution, and element identification studies
Retroelements and Genomes
Genome data-mining can help answer questions about:
• Number of elements
• Types of elements
• Diversity
• Physical distribution
• Impact on host
• Odd or interesting elements
• Evolutionary history
• Element sequence and domain characteristics
Diversity of the Pseudoviridae
A retroelement family tree
[Figure: retroelement family tree with branches for Retroposons, Pseudoviridae, BEL, Dirs, Retroviridae and Metaviridae]
[Figure: phylogenetic tree of Pseudoviridae elements with bootstrap values (scale bar 0.1). Labeled taxa include Melmoth, Tgmr, Tst1, AtRE1, Evelknievel, Hopscotch, Retrofit, Lueckenbuesser (G), Osser (G), Endovir1-1, SIRE-1, ToRTL1, Opie-2, PREM-2, Art1, Tpv2-6, copia (I), RIRE1, BARE-1, Sto-4, Tnt1-94, Tto1, Panzee, Ta1-3, Tca5 (F), 1731, Ty4 (F), Ty1 (F), Tca2 (F), Ty5-6p (F) and Mosqcopia (I), plus several sequences labeled only by number.]
A. thaliana captures all plant Pseudoviridae diversity
Mapping proteases to HIV-1 structure helps explain patterns of conservation
[Figure: element map with LTR, MA, CA, NC, PR, IN, RT, RH and LTR; PR is the region under discussion]
Integrase: what’s happening in the back?
[Figure: integrase (IN) domain comparison. Features marked: HHCC and DDE motifs, GPF/Y motif ((Meta/Retro)viridae), GKGY motif (Pseudoviridae), a common region, a proline-rich region, and the length of sequence following the motifs. Elements compared: Del, Athila5-1, MMLV, SnRV, Tf1, Ty3-2, gypsy, HIV1, osvaldo, RSV, WDSV, BARE-1, copia, Endovir1-1, Retrofit, Ty1, Ty5, Melmoth, 1731, Osser, Tnt1-94, Opie-2, Mosqcopia. Each element is scored (+/-) for chromodomain present and ILGD motif present.]
[Figure: element map with LTR, MA, CA, NC, PR, IN, RT, RH and LTR]
[Figure: maps of Calypso, Endovir, SIRE-1, Athila4-6 and Cyclops-2 on a scale from 2000 nt to 12000 nt, showing gag, pol and a possible env region (env?); labels of 24% and 29% are marked]
Putative env gene is conserved across species
[Figure: phylogenetic tree (scale bar 0.1 changes) grouping elements into Retroviridae, Metaviridae and Pseudoviridae, with the putative retroviruses marked. Labeled taxa include HIV-1, Rousv, MoMLV, Ty3, Gypsy, Del1, Reina, Cyclops, Calypso, Fababean, Athila4-6, Grande, Tat4-1, Cinful-1, MAG, SURL, Ty1, copia, Tto1, Tnt1-94, Ta1-3, Art1, ToRTL1, Opie-2, Endovir1-1, SIRE-1, Tst1, Retrofit, Hopscotch, Evelknievel, Osser and Ty5-6p.]
Retroviruses independently evolved at least twice in
plants
Retrovirus env-like coding regions show a bipartite structural organization
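[Figure: comparison of the env-like proteins of Endovir1-1, ToRTL1 and SIRE-1 (lengths shown: 668 aa, 648 aa and 476 aa; identities shown: 31% ID and 24% ID), with the HIV-1 genome map (LTR, gag, pol, env with SU and TM, accessory genes, LTR) for reference]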
Gag surprises…
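[Figure: gag comparison between the putative retrovirus group and the (Hemi/Pseudo)virus elements, with regions labeled A, B and C, shown against the element map (LTR, MA, CA, NC, PR, IN, RT, RH, LTR)]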
Gag is much larger in the retroviral lineage
Sequence and structural conservation is evident
Diversity of the Pseudoviridae family summary
Enzymatic regions appear to be highly constrained other than the IN C-terminus.
Arabidopsis LTR retrotransposons are representative of plant elements in the family.
The putative retroviruses represent a uniquely evolving Pseudoviridae lineage bearing numerous changes in the retrotransposon genome. Sub-lineage differences suggest areas to focus experimental efforts for functional studies.
Gag shows greater sequence conservation than previously thought.
Summary continued…
Env-like coding regions have been evolutionarily conserved, indicating a functional role for the ORF
Features suggestive of viral env proteins have been identified in all LTR retrotransposon env-like ORFs
Putative env proteins have evolved in at least two independent plant LTR retrotransposon lineages, giving credence to the hypothesis that retroviruses evolved from retrotransposons
Organization of the retroelement populations of the Arabidopsis genome
Do retroelements of higher eukaryotes choose where they integrate?
Is yeast a good model?
Genome projects for multicellular organisms have noted that transposable element numbers are markedly increased near centromeres. This project quantitatively documents these anecdotal observations for the Arabidopsis genome.
Completed genome?
[Figure: chromosome map on a scale from 10 to 90 Mb; labels: 2, 3, 4, X, 28.0]
RetroMap: a graphical tool for simplifying whole-genome analysis of retroelements
RetroMap Features
RetroMap provides the following tools to work with genome data:
• Parse blast results
• Assign lineages or arbitrary groupings to retroelements
• View chromosomal locations
• Identify and extract LTRs
• Identify and extract full-length elements
• Assign ages to complete LTR retroelements
• Extract sequence(s) for hits
• Visualize hit open reading frames
• Generate information about neighboring annotated features (Arabidopsis thaliana only)
• Generate tab-delimited data files of retroelement information for direct import into statistical software packages
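As an illustration of the "parse blast results" step, here is a minimal Python sketch. It is not RetroMap code. It assumes the BLAST results are in the standard 12-column tabular format (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The `BlastHit` class, the `parse_tabular_blast` function and the example rows are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass
class BlastHit:
    query_id: str
    subject_id: str
    percent_identity: float
    subject_start: int  # always the smaller coordinate
    subject_end: int    # always the larger coordinate
    strand: str         # "+" or "-"


def parse_tabular_blast(handle: TextIO) -> Iterator[BlastHit]:
    """Yield one BlastHit per row of 12-column tabular BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        s_start, s_end = int(row[8]), int(row[9])
        # In nucleotide searches a minus-strand hit has start > end.
        strand = "+" if s_start <= s_end else "-"
        yield BlastHit(row[0], row[1], float(row[2]),
                       min(s_start, s_end), max(s_start, s_end), strand)


# Made-up rows for illustration only; not data from the talk.
example = (
    "RT_probe\tchr1\t87.50\t240\t30\t0\t1\t240\t5000\t5239\t1e-60\t250\n"
    "RT_probe\tchr1\t91.00\t200\t18\t0\t1\t200\t9200\t9001\t1e-70\t280\n"
)
hits = list(parse_tabular_blast(io.StringIO(example)))
for hit in hits:
    print(hit.subject_id, hit.subject_start, hit.subject_end, hit.strand)
```

Normalizing coordinates so that start is never greater than end keeps later steps (merging hits, extracting flanking sequence) independent of strand.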
Overview of how RetroMap generates retroelement data for a genome
Starting eprobe sequences
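[Figure: phylogenetic tree of the starting probe sequences (scale bar 0.1; bootstrap values shown: 996, 1000, 1000, 861, 946, 988). Labeled taxa: TAtRL ta11, L1 Hs, R2 Dm, R1 Dm, Jockey Dm, Tca2 Ca, Ty5 Sp, copia Dm, Art1 At, Endovir1-1 At, SIRE1 Gm, Pao Bm, BEL Dm, Mazi Dm, Roo Dm, Prt1 Pbla, Dirs1 Dd, PAT Pred, HIV1, RSV, SnRV, MMLV, WDSV, Cer1 Ce, Osvaldo Db, Athila At con, Ty3 Sc, sushi Fr, Tf1 Spom.]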
A. thaliana LTR retrotransposon genome overview
[Figure: rooted trees of A. thaliana elements. Metaviridae (scale bar 0.2) with the Tat, Athila and Metavirus sublineages; Pseudoviridae (scale bar 0.1).]
                 Full-length   Solo LTRs   RT only   % of A. thaliana DNA
Retroposon            --           --        311          0.22%
Pseudoviridae        220          483         83          1.25%
Metaviridae          217         2803        143          3.16%
  Athila              47           --         --          0.60%
  Tat                 48           --         --          0.50%
  Metavirus           88           --         --          0.64%
Totals               437         3286        537          4.63%
A. thaliana retroelements consist of retroposons and only two LTR families
Pseudoviridae elements are significantly shorter (p=.0001)
Dating LTR retrotransposons
[Figure: an element with gag and pol flanked by its two LTRs]
The two LTRs of an element are identical at the time of insertion
Relative ages can be estimated from the sequence divergence (genetic distance) of the LTRs
e.g. T = d / (2k)
d: genetic distance between the two LTRs, 1 – (% identity ÷ 100)
k: nucleotide substitution rate for the genome
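The relation T = d / (2k) can be worked through in a few lines of Python. This is a sketch only: the function name `ltr_insertion_age` is hypothetical, and the identity and rate in the example are placeholders chosen for illustration, not values from this study.

```python
def ltr_insertion_age(percent_identity: float, substitution_rate: float) -> float:
    """Estimate time since insertion as T = d / (2k).

    d is the genetic distance between the two LTRs, 1 - (% identity / 100).
    k is the nucleotide substitution rate (substitutions per site per year).
    The factor of 2 reflects that both LTRs accumulate changes independently.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    if substitution_rate <= 0.0:
        raise ValueError("substitution_rate must be positive")
    d = 1.0 - percent_identity / 100.0
    return d / (2.0 * substitution_rate)


# Placeholder inputs: LTRs 98% identical, assumed rate of 1.5e-8 per site per year.
age = ltr_insertion_age(98.0, 1.5e-8)
print(f"Estimated age: {age:,.0f} years")
```

Because d is a raw proportion of differing sites, the estimate is most useful as a relative age for comparing elements, which is how the slide frames it.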
Pseudoviridae elements are younger than Metaviridae elements; the Athila sublineage is the oldest tested
A. thaliana RT distributions
Going solo
homologous recombination loops out and deletes retroelement internal sequences
[Figure: a full-length element flanked by host DNA collapses to a solo LTR flanked by the same host DNA]
Where have they been?
No family distribution is random
Metaviridae Athila and Tat are found preferentially inside heterochromatic regions; other groups are not
Pseudoviridae and retroposon distributions are not significantly different
Solo LTRs show the same distributions as full-length family members
Hypotheses
Retroelement lineages show ‘universal’ organizational characteristics on the family level
General retroelement abundance at centromeres is due to reduced elimination…the ‘graveyard scenario’
Metaviridae in Arabidopsis are targeted to heterochromatin
Conclusions
Heterochromatic regions DO appear to act as graveyards, at least in the case of the Pseudoviridae (and presumably the retroposons)
Younger Pseudoviridae elements tend to be found outside of heterochromatin
Solo LTR distributions indicate that homologous recombination between LTRs is not greatly inhibited in heterochromatin
The Metaviridae lineages appear to use targeting in their interactions with the host genome
Acknowledgements
So many people helped make this research happen; I couldn’t have done it without their support and input.
Special thanks go to the many members of the Voytas lab, past and present, undergrads too!
I’ve been lucky to have good collaborators who are interesting and fun to work with. These have included Dr. Nettleton, Dr. Wright, Dr. Laten from Loyola University, and always Dr. Voytas.
To the head honcho: no one can say it hasn’t been a crazy, crazy ride. Thanks. :o)
Basic Hit Redundancy Elimination Scheme
[Figure: hits drawn against the query sequence, illustrating case 1 through case 4]
1) Simple match: no overlap with nearest hit, no compression
2) Overlap case(s): both hits are merged into one representing their combined maximum extent on the database sequence
3) Two non-overlapping hits which should be combined:
a) Left checks its boundary position on its query sequence and determines if the other hit falls within that range. If so, merge.
b) Right repeats the procedure if Left failed to indicate a merge.
4) An example of a merge case which may lead to false positives
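The merge rules can be sketched in Python as follows. This is one possible reading of the scheme, not the RetroMap implementation. It assumes all hits come from the same query, lie on the same strand of the same database sequence, and use 1-based inclusive coordinates. The `Hit` fields and the way the unmatched part of the query is projected onto the database sequence are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    s_start: int  # start on the database sequence
    s_end: int    # end on the database sequence
    q_start: int  # start on the query sequence
    q_end: int    # end on the query sequence
    q_len: int    # full length of the query sequence


def should_merge(left: Hit, right: Hit) -> bool:
    """Decide whether two hits (left starts first) describe one element."""
    # Rule 2: the hits overlap on the database sequence.
    if right.s_start <= left.s_end:
        return True
    # Rule 3a (assumed reading): project the part of the query that Left
    # has not matched yet onto the database; merge if Right starts there.
    if right.s_start <= left.s_end + (left.q_len - left.q_end):
        return True
    # Rule 3b: Right projects its unmatched query head backwards.
    return left.s_end >= right.s_start - (right.q_start - 1)


def merge_hits(hits: List[Hit]) -> List[Hit]:
    """Collapse redundant hits into their combined maximum extent."""
    merged: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.s_start):
        if merged and should_merge(merged[-1], hit):
            last = merged[-1]
            merged[-1] = Hit(last.s_start, max(last.s_end, hit.s_end),
                             min(last.q_start, hit.q_start),
                             max(last.q_end, hit.q_end), last.q_len)
        else:
            merged.append(hit)  # Rule 1: simple match, kept as is
    return merged


overlapping = merge_hits([Hit(100, 200, 1, 100, 300), Hit(150, 260, 50, 160, 300)])
split_match = merge_hits([Hit(100, 200, 1, 100, 300), Hit(250, 400, 150, 300, 300)])
far_apart = merge_hits([Hit(100, 200, 1, 100, 300), Hit(5000, 5300, 1, 300, 300)])
```

As rule 4 warns, the projection in rule 3 can join hits that belong to two different elements sitting close together, so merged hits spanning much more than the query length deserve a second look.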
BLAST false-positive amplification problem
[Figure: Blast Round 1 with an RT query returns hits covering RT, some extending into an LTR. Blast Round 2 returns hits covering RT, RT plus LTR, and LTR alone.]
LTR prediction
• Works only for hits of a sequence interior to LTRs
• Blast2Sequences is used to detect repeats
• 10 kb of sequence upstream and downstream are compared
• Innermost matching repeats are taken to be the LTRs
[Figure: a hit on the genome sequence with a 10 kb window on each side; the two windows are compared with Blast2Sequences]
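The window and selection logic of the LTR prediction can be sketched in Python. The repeat detection itself (Blast2Sequences in the talk) is not reproduced here; the sketch assumes it has already produced a list of matching repeat pairs in genome coordinates. The names `flank_windows`, `RepeatPair` and `innermost_pair` are hypothetical, and "innermost" is interpreted as the pair giving the shortest predicted element.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

FLANK = 10_000  # 10 kb compared on each side of the hit


class RepeatPair(NamedTuple):
    up_start: int    # repeat copy in the upstream window
    up_end: int
    down_start: int  # matching copy in the downstream window
    down_end: int


def flank_windows(hit_start: int, hit_end: int, seq_len: int,
                  flank: int = FLANK) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return 1-based inclusive (upstream, downstream) windows around a hit."""
    upstream = (max(1, hit_start - flank), hit_start - 1)
    downstream = (hit_end + 1, min(seq_len, hit_end + flank))
    return upstream, downstream


def innermost_pair(pairs: Sequence[RepeatPair]) -> Optional[RepeatPair]:
    """Pick the matching repeats closest to the hit as the candidate LTRs."""
    if not pairs:
        return None  # no repeats found: no LTRs can be predicted
    return min(pairs, key=lambda p: p.down_end - p.up_start)


windows = flank_windows(15_000, 16_000, 100_000)
candidates = [
    RepeatPair(6_000, 6_400, 24_000, 24_400),    # outer repeats
    RepeatPair(12_000, 12_400, 18_000, 18_400),  # inner repeats
]
ltrs = innermost_pair(candidates)
print(windows, ltrs)
```

Choosing the innermost pair is a simple rule, and it fails in predictable ways when elements are tandem, nested, or carry internal repeats.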
LTR Identification Errors
[Figure: for each error case, the hit(s), the 10 kb flanking windows and the resulting predicted element]
Tandem elements
Nested elements
Degenerate or simple internal repeat elements
Sample distribution data
Sample hit neighbors annotation data