1 ancient exapted transposable elements drive nuclear ...64 recruits anril to target genes (holdt et...

32
1 Ancient exapted transposable elements drive nuclear localisation of lncRNAs 1 2 Joana Carlevaro-Fita 1,2,4 3 Taisia Polidori 1,2,4 4 Monalisa Das 1,2,4 5 Carmen Navarro 3 6 Rory Johnson 1,2 * 7 8 9 10 1. Department of Biomedical Research (DBMR), University of Bern, 3008 Bern, Switzerland 11 2. Department of Medical Oncology, Inselspital, University Hospital and University of Bern, 3010 Bern, 12 Switzerland 13 3. Department of Computer Science and Artificial Intelligence, University of Granada, Spain 14 4. Equal contribution 15 16 *Correspondence: [email protected] 17 Keywords: Transposable element; long noncoding RNA; lncRNA; evolution; exaptation. 18 19 20 not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was this version posted September 16, 2017. ; https://doi.org/10.1101/189753 doi: bioRxiv preprint

Upload: others

Post on 30-Nov-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

1

Ancient exapted transposable elements drive nuclear localisation of lncRNAs 1

2

Joana Carlevaro-Fita1,2,4 3

Taisia Polidori1,2,4 4

Monalisa Das1,2,4 5

Carmen Navarro3 6

Rory Johnson1,2* 7

8

9

10

1. Department of Biomedical Research (DBMR), University of Bern, 3008 Bern, Switzerland 11

2. Department of Medical Oncology, Inselspital, University Hospital and University of Bern, 3010 Bern, 12

Switzerland 13

3. Department of Computer Science and Artificial Intelligence, University of Granada, Spain 14

4. Equal contribution 15

16

*Correspondence: [email protected] 17

Keywords: Transposable element; long noncoding RNA; lncRNA; evolution; exaptation. 18

19

20

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 2: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

2

Abstract 21

22

The sequence domains underlying long noncoding RNA (lncRNA) activities, including their 23

characteristic nuclear localisation, remain largely unknown. One hypothetical source of such 24

domains are neofunctionalised fragments of transposable elements (TEs), otherwise known as RIDLs 25

(Repeat Insertion Domains of Long Noncoding RNA). We here present a transcriptome-wide map of 26

putative RIDLs in human, using evidence from insertion frequency, strand bias and evolutionary 27

conservation of sequence and structure. In the exons of GENCODE v21 lncRNAs, we identify a total 28

of 5374 RIDLs in 3566 gene loci. RIDL-containing lncRNAs are over-represented in functionally-29

validated and disease-associated genes. By integrating this RIDL map with global lncRNA 30

subcellular localisation data, we uncover a cell-type-invariant and dose-dependent relationship 31

between LINE2 and MIR elements and nuclear-cytoplasmic distribution. Experimental validations 32

establish causality, showing that these RIDLs drive the nuclear localisation of their host lncRNA. 33

Together these data represent the first map of TE-derived functional domains, and suggest a global 34

role for TEs as nuclear localisation signals of lncRNAs. 35

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 3: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

3

Introduction 36

The human genome contains many thousands of long noncoding RNAs, of which at least a fraction 37

are likely to have evolutionarily-selected biological functions (Ulitsky and Bartel 2013). Our current 38

thinking is that, similar to proteins, lncRNAs functions are encoded in primary sequence through 39

“domains”, or discrete elements that mediate specific aspects of lncRNA activity, such as molecular 40

interactions or subcellular localisation (Guttman and Rinn 2012; Rory Johnson and Guigó 2014; Mercer 41

and Mattick 2013). Mapping these domains is thus a key step towards the understanding and prediction of 42

lncRNA function in healthy and diseased cells. 43

One possible source of lncRNA domains are transposable elements (TEs) (Rory Johnson and Guigó 44

2014). TEs have been a major contributor to genomic evolution through the insertion and 45

neofunctionalisation of sequence fragments – a process known as exaptation (Bourque 2009)(Feschotte 46

2008). This process has contributed to the evolution of diverse features in genomic DNA, including 47

transcriptional regulatory motifs (Bourque et al. 2008; R. Johnson et al. 2006), microRNAs (Roberts, 48

Cardin, and Borchert 2014), gene promoters (Faulkner et al. n.d.; Huda et al. 2011), and splice sites (Lev-49

Maor et al. 2003; Sela et al. 2007). 50

We recently proposed that TEs may contribute pre-formed functional domains to lncRNAs. We 51

termed these “RIDLs” – Repeat Insertion Domains of Long noncoding RNAs (Rory Johnson and Guigó 52

2014). As RNA, TEs interact with a rich variety of proteins, meaning that in the context of lncRNA they 53

could plausibly act as protein-docking sites (Blackwell et al. n.d.). Diverse evidence also points to repetitive 54

sequences forming intermolecular Watson-Crick RNA:RNA and RNA:DNA hybrids (Gong and Maquat 55

2011; Holdt et al. 2013; Rory Johnson and Guigó 2014). However, it is likely that bona fide RIDLs represent 56

a small minority of the many exonic TEs, with the remainder being phenotypically-neutral “passengers”. 57

A small but growing number of RIDLs have been described, reviewed in (Rory Johnson and Guigó 58

2014). These are found in lncRNAs with clearly demonstrated functions, including the X-chromosome 59

silencing transcript XIST (Elisaphenko et al. 2008), the oncogene ANRIL (Holdt et al. 2013) and the 60

regulatory antisense UchlAS (Carrieri et al. 2012). In each case, domains of repetitive origin are necessary 61

for a defined function: the structured A-repeat of XIST, of retroviral origin, recruits the PRC2 silencing 62

complex (Elisaphenko et al. 2008); Watson-Crick hybridisation between RNA and DNA Alu elements 63

recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 64

of its sense mRNA (Carrieri et al. 2012). In parallel, a number of transcriptome-wide maps of lncRNA-65

linked TEs have been presented, showing how TEs have contributed extensively to lncRNA gene evolution 66

(Kapusta et al. 2013; D. Kelley and Rinn 2012)(Schmitt et al. 2016)(Hezroni et al. 2015). However, there 67

has been no attempt so far to create passenger-free maps of RIDLs with evidence for functionality in the 68

context of mature lncRNA molecules. 69

Subcellular localisation, and the domains controlling it, exert a powerful influence over lncRNA 70

functions (reviewed in (L.-L. Chen 2016)). For example, transcriptional regulatory lncRNAs must be 71

located in the nucleus and chromatin, while those regulating microRNAs or translation must be present in 72

the cytoplasm (K. Zhang et al. 2014). This is reflected in diverse patterns of lncRNA localization (Cabili et 73

al. 2015; Mas-Ponte et al. 2017b). If lessons learned from mRNA are also valid for lncRNAs, then short 74

sequence motifs binding to RNA binding proteins (RBPs) will be an important mechanism (Martin and 75

Ephrussi 2009). This was recently demonstrated for the BORG lncRNA, where a pentameric motif was 76

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 4: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

4

shown to mediate nuclear retention (B. Zhang et al. 2014). Most interestingly, another study implicated an 77

inverted pair of Alu elements in nuclear retention of lincRNA-P21 (Chillón and Pyle 2016). This raises the 78

possibility that RIDLs may function as regulators of lncRNA localisation. 79

In the present study, we create a human transcriptome-wide catalogue of putative RIDLs. We show 80

that lncRNAs carrying these RIDLs are enriched for functional genes. Finally, we provide in silico and 81

experimental evidence that certain RIDL types, derived from ancient transposable elements, confer a potent 82

nuclear localisation on their host transcripts. 83

84

85

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 5: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

5

Materials and Methods 86

All operations were carried out on human genome version GRCh38/hg38, unless stated otherwise. 87

88

Exonic TE curation 89

RepeatMasker annotations were downloaded from the UCSC Genome Browser (version hg38) on 90

December 31st 2014, and GENCODE version 21 lncRNA GTF annotations were downloaded from 91

www.gencodegenes.org (Harrow et al. 2012). We developed a pipeline, called Transposon.Profile, to 92

annotate exonic and intronic TEs of a given gene annotation. Transposon.Profile is largely based on 93

BEDTools’ intersect and merge functionalities(Quinlan and Hall 2010). Transposon.Profile merges the 94

exons of all transcripts belonging to the given gene annotation, henceforth referred to as “exons”. The set 95

of introns was curated by subtracting the merged exonic sequences from the full gene spans, and only 96

retaining those introns that belonged to a single gene. These intronic regions were assigned the strand of 97

the host gene. 98

The RepeatMasker annotation file was then compared to these exons, and intersections are recorded 99

and then classified into one of 6 categories: TSS (transcription start site), overlapping the first exonic 100

nucleotide of the first exon; splice acceptor, overlapping exon 5’ end; splice donor, overlapping exon 3’ 101

end; internal, residing within an exon and not overlapping any intronic sequence; encompassing, where an 102

entire exon lies within the TE; TTS (transcription termination site), overlapping the last nucleotide of the 103

last exon. In every case, the TEs are separated by strand relative to the host gene: + where both gene and 104

TE are annotated on the same strand, otherwise -. The result is the “Exonic TE Annotation” (Supplementary 105

File S1). 106

107

RIDL identification 108

Using this Exonic TE Annnotation, we identified the subset of those individual TEs with evidence 109

for functionality. For certain analysis, an equivalent Intronic TE Annotation was also employed, being the 110

Transposon.Profile output for the equivalent intron annotation described above. Three different types of 111

evidence were used: enrichment, strand bias and evolutionary conservation. 112

In enrichment analysis, the exon/intron ratio of the fraction of nucleotide coverage by each repeat 113

type was calculated. Any repeat type that has >2-fold exon/intron ratio was considered as a candidate. All 114

exonic TE instances belonging to such TE types are defined as RIDLs. 115

In strand bias analysis, a subset of Exonic TE Annotation was used, being the set of non-splice 116

junction crossing TE instances (“noSJ”). This additional filter was employed to guard against false positive 117

enrichments for TEs known to provide splice sites (Lev-Maor et al. 2003; Sela et al. 2007). For all TE 118

instances, the “relative strand” was calculated: positive, if the annotated TE strand matches that of the host 119

transcript; negative, if not. Then for every TE type, the ratio of relative strand sense/antisense was 120

calculated. Statistical significance was calculated empirically: entire gene structures were randomly re-121

positioned in the genome using bedtools shuffle, and the intersection with the entire RepeatMasker 122

annotation was re-calculated. For each iteration, sense/antisense ratios were calculated for all TE types. A 123

TE type was considered to have significant strand bias, if its true ratio exceeded (positively or negatively) 124

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 6: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

6

all of 1000 simulations. All exonic instances of these TE types that also have the appropriate sense or 125

antisense strand orientation are defined as RIDLs. 126

In evolutionary analysis, four different annotations of evolutionarily-conserved regions were 127

treated similarly, using unfiltered Exonic TE Annotations. Primate, Placental Mammal and Vertebrate 128

PhastCons elements based on 46-way alignments were downloaded as BED files from UCSC Genome 129

Browser(Siepel et al. 2005)s, while the ECS conserved regions from obtained from Supplementary Data of 130

Smith et al (Smith et al. 2013). For each TE type we calculated the exonic/intronic conservation ratio. To 131

do this we used IntersectBED (Quinlan and Hall 2010) to overlap exonic locations with TEs, and calculate 132

the total number of nucleotides overlapping. We performed a similar operation for intronic regions. Finally, 133

we calculated the following ratio: 134

rei=(nt of overlap between conserved elements and exons)/(nt of overlap between conserved elements and 135

introns) 136

To estimate the background, the exonic and intronic BED files were positionally randomized 1000 times 137

using bedtools shuffle, each time recalculating rei. We considered to be significantly conserved those TE 138

types where the true rei was greater or less than every one of 1000 randomised rei values. All exonic instances 139

of these TE types that also intersect the appropriate evolutionary conserved element are defined as RIDLs. 140

All RIDL predictions were then merged using mergeBED and any instances with length <10 nt 141

were discarded. The outcome, a BED format file with coordinates for hg38, is found in Supplementary File 142

S2. 143

144

RIDL orthology analysis 145

In order to assess evolutionary history of RIDLs, we used chained alignments of human to chimp 146

(hg19ToPanTro4), macaque (hg19ToRheMac3), mouse (hg19ToMm10), rat (hg19ToRn5), cow 147

(hg19ToBosTau7). Due to availability of chain files, RIDL coordinates were first converted from hg38 to 148

hg19. Orthology was defined by LiftOver utility used at default settings. 149

150

Comparing RIDL-carrying lncRNAs versus other lncRNAs 151

In order to test for functional enrichment amongst lncRNAs hosting RIDLs, we tested for statistical 152

enrichment of the following traits in RIDL- carrying lncRNAs compared to other lncRNAs (see below) by 153

Fisher's exact test: 154

A) Functionally-characterised lncRNAs: lncRNAs from GENCODE v21 that are present in lncRNAdb 155

(Quek et al. 2015). 156

B) Disease-associated genes: lncRNAs from GENCODE v21 that are present in at least in one of the 157

following databases or public sets: LncRNADisease (G. Chen et al. 2013), Lnc2Cancer (Ning et al. 158

2016), Cancer LncRNA Census (CLC) (Carlevaro-fita et al. bioRxiv doi: 10.1101/152769) 159

C) GWAS SNPs: We collected SNPs from the NHGRI-EBI Catalog of published genome-wide 160

association studies (Hindorff et al. 2009; Welter et al. 2014) (https://www.ebi.ac.uk/gwas/home). We 161

intersected its coordinates with lncRNA exons and promoters (promoters defined as 1000nt upstream 162

and 500nt downstream of GENCODE-annotated TSS). 163

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 7: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

7

For defining a comparable set of “other lncRNAs” we sampled from the rest of GENCODE v21 164

lncRNAs a set of lncRNAs matching RIDL-carrying lncRNAs exonic length distribution. We performed 165

the subsampling using matchDistribution script: https://github.com/julienlag/matchDistribution. 166

167

Subcellular localisation analysis 168

Processed RNAseq data from human cell fractions were obtained from ENCODE in the form of RPKM 169

(reads per kilobase per million mapped reads) quantified against the GENCODE v19 annotation(Djebali et 170

al. 2012b; Mas-Ponte et al. 2017a). Only transcripts common to both the v21 and v19 annotations were 171

considered. For the following analysis only one transcript per gene was considered, defined as the one with 172

largest number of exons. Cytoplasmic/nuclear ratio expression for each transcript was defined as 173

(cytoplasm polyA+ RPKM)/(nuclear polyA+ RPKM), and only transcripts having non-zero values in both 174

were considered. For each RIDL type and cell type in turn, the cytoplasmic/nuclear ratio distribution of 175

RIDL-containing to non-RIDL-containing lncRNAs was compared using Wilcox test. Only RIDLs having 176

at least 3 host expressed transcripts in at least one cell type were tested. Resulting p-values were globally 177

adjusted to False Discovery Rate using the Benjamini-Hochberg method (Benjamini and Hochberg 1995). 178

179

Cell lines and reagents 180

Human cervical cancer cell line HeLa and human lung cancer cell line A549 were cultured in Dulbecco’s 181

Modified Eagle’s medium (Sigma-Aldrich, # D5671) supplemented with 10% FBS and 1% penicillin 182

streptomycin at 37ºC and 5 % CO2. Anti-GAPDH antibody (Sigma-Aldrich, # G9545) and anti-histone H3 183

antibody (Abcam, # ab24834) were used for Western blot analysis. 184

185

Gene synthesis and cloning of lncRNAs 186

The three lncRNA sequences (RP11-5407, Linc00173, RP4-806M20.4) containing wild-type RIDLs, 187

or corresponding mutated versions where RIDL sequence has been randomised (“Mutant”), were 188

synthesized commercially (BioCat GmbH). The sequences were cloned into pcDNA 3.1 (+) vector within 189

the NheI and XhoI restriction enzyme sites. The clones were checked by restriction digestion and Sanger 190

sequencing. The sequence of the wild type and mutant clones are provided in Supplementary File S3. 191

192

Sub-cellular fractionation 193

Each pair of the wild type and the corresponding mutant lncRNA clones were transfected 194

independently in separate wells of a 6 well plate. Transfections were carried out with 2 g of total plasmid 195

DNA in each well using Lipofectamine 2000. 48 h post-transfection, cells from each well were harvested, 196

pooled and re-seeded into a 10 cm dish and allowed to grow till 100% confluence. The nuclear and 197

cytoplasmic fractionation was carried out as described previously (Suzuki et al. 2010) with minor 198

modifications. In brief, cells from 10 cm dishes were harvested by scraping and washed with 1X ice cold 199

PBS. For fractionation, cell pellet was re-suspended in 900 μl of ice-cold 0.1% NP40 in PBS and triturated 200

7 times using a p1000 micropipette. 300 μl of the cell lysate was saved as the whole cell lysate. The 201

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 8: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

8

remaining 600 μl of the cell lysate was centrifuged for 30 sec on a table top centrifuge and the supernatant 202

was collected as “cytoplasmic fraction”. 300 μl from the cytoplasmic supernatant was kept for RNA 203

isolation and the remaining 300 μl was saved for protein analysis by western blot. The pellet containing the 204

intact nuclei was washed with 1 ml of 0.1% NP40 in PBS. The nuclear pellet was re-suspended in 200 μl 205

1X PBS and subjected to a quick sonication of 3 pulses with 2 sec ON-2 sec OFF to lyse the nuclei and 206

prepare the “nuclear fraction”. 100 μl of nuclear fraction was saved for RNA isolation and the remaining 207

100 μl was kept for western blot. 208

209

RNA isolation and real time PCR 210

The RNA from each nuclear and cytoplasmic fraction was isolated using Quick-RNA MiniPrep kit 211

(ZYMO Research, # R1055). The RNAs were subjected to on-column DNAse I treatment and clean up 212

using the manufacturer’s protocol. For A549 samples, additional units of DNase were employed, due to 213

residual signal in –RT samples. The RNA from each fraction was converted to cDNA using GoScript 214

reverse transcriptase (Promega, # A5001) and random hexamer primers. The expression of each of the 215

individual transcripts was quantified by qRT-PCR (Applied Biosystems® 7500 Real-Time) using indicated 216

primers (Supplementary File S4) and GoTaq qPCR master mix (Promega, # A6001). Human GAPDH 217

mRNA and MALAT1 lncRNA were used as cytoplasmic and nuclear markers, respectively. 218

219

Western Blotting 220

The protein concentration of each of the fractions was determined, and equal amounts of protein (50 221

μg) from whole cell lysate, cytoplasmic fraction, and nuclear fraction were resolved on 12 % Tris-glycine 222

SDS-polyacrylamide gels and transferred onto polyvinylidene fluoride (PVDF) membranes (VWR, # 223

1060029). Membranes were blocked with 5% skimmed milk and incubated overnight at 4ºC with anti-224

GAPDH antibody as a cytoplasmic marker and anti p-histone H3 antibody as nuclear marker. Membranes 225

were washed with PBS-T (1X PBS with 0.1 % Tween 20) followed by incubation with HRP-conjugated 226

anti-rabbit or anti-mouse secondary antibodies respectively. The bands were detected using SuperSignal™ 227

West Pico chemiluminescent substrate (Thermo Scientific, # 34077). 228

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 9: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

9

Results 229

The objective of this study was to create a map of repeat insertion domains of long noncoding 230

RNAs (RIDLs) and link them to lncRNA functions. We previously hypothesised that RIDLs could function 231

by mediating interactions between host lncRNA and DNA, RNA or protein molecules (Rory Johnson and 232

Guigó 2014) (Figure 1A). However, any attempt to map RIDLs is confronted with the problem that they 233

will likely represent a small minority amongst many non-functional “passenger” transposable elements 234

(TEs) residing in lncRNA exons (Figure 1B). Therefore, it is necessary to identify RIDLs by some signature 235

of selection in lncRNA exons. In this study we use three such signatures: exonic enrichment, strand bias, 236

and evolutionary conservation (Figure 1B). 237

238

A map of exonic TEs in Gencode v21 lncRNAs 239

Our first aim was to create a comprehensive map of TEs within the exons of GENCODE v21 human 240

lncRNAs (Figure 2A). Altogether 5,520,018 distinct TE insertions were intersected against 48684 exons 241

from 26414 transcripts of 15877 GENCODE v21 lncRNA genes, resulting in 46474 exonic TE insertions 242

in lncRNA (Figure 1B). Altogether 13121 lncRNA genes (82.6%) carry at least one exonic TE fragment in 243

their mature transcript. 244

We defined TEs residing in lncRNA introns to represent neutral background, since they do not 245

contribute to functionality of the mature processed transcript, but nevertheless should model any large-scale 246

genomic biases in TE abundance. Thus we also created a control TE intersection dataset with 31,004 247

Gencode lncRNA introns, resulting in 562,640 intron-overlapping TE fragments (Figure 2A). This analysis 248

is likely to be conservative, since some intronic sequences could contribute to unidentified lncRNA exons. 249

Comparing intronic and exonic TE data, we see that lncRNA exons are slightly depleted for TE 250

insertions: 29.2% of exonic nucleotides are of TE origin, compared to 43.4% of intronic nucleotides (Figure 251

2B), similar to previous studies (Kapusta et al. 2013). This may reflect selection against disruption of 252

lncRNA transcripts by TEs, and hence is evidence for general lncRNA functionality. 253

254

Contribution of TEs to lncRNA gene structures 255

TEs have contributed widely to both coding and noncoding gene structures by the insertion of 256

elements such as promoters, splice sites and termination sites (Sela et al. 2007). We next classified inserted 257

TEs by their contribution to lncRNA gene structure (Figure 2C,D). It should be borne in mind that this 258

analysis is dependent on the accuracy of underlying GENCODE annotations, which are often incomplete 259

at 5’ and 3’ ends (J. Lagarde, B. Uszczyńska et al, Manuscript In Press). Altogether 4993 (18.9%) transcripts 260

are promoted from within a TE, most often those of Alu and ERVL-MaLR classes (Figure 2E). 7497 261

(28.4%) lncRNA transcripts are terminated by a TE, most commonly by Alu, ERVL-MaLR and LINE1 262

classes. 8494 lncRNA splice sites (32.2%) are of TE origin, and 2681 entire exons are fully contributed by 263

TEs (10.1%) (Figure 2E). These observations support several known contributions of TEs to gene structural 264

features (Sela et al. 2007). Nevertheless, the most frequent case is represented by 22031 TEs that lie 265

completely within an exon and do not overlap any splice junction (“inside”). 266

267

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 10: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

10

Evidence for selection on certain exonic TE types 268

This exonic TE map represents the starting point for the identification of RIDLs, defined as the 269

subset of TEs with evidence for functionality in the context of mature lncRNAs. In this and subsequent 270

analyses, TEs are grouped by type as defined by RepeatMasker. We utilise three distinct sources of evidence 271

for selection on TEs: exonic enrichment, strand bias and evolutionary conservation (Figure 1B). 272

We first asked whether particular TE types are more frequently found within lncRNA exons, 273

compared to the background represented by intronic sequence (D. Kelley and Rinn 2012). Thus we 274

calculated the ratio of exonic / intronic sequence coverage by TEs (Figure 3A). We found signals of 275

enrichment, >2-fold, for numerous repeat types, including endogenous retrovirus classes (HERVE-int, 276

HERVK9-int, HERV3-int, LTR12D) in addition to others such as ALR/Alpha, BSR/Beta and REP522. 277

Surprisingly, quite a number of simple repeats are also enriched in lncRNA, including GC-rich repeats. A 278

weaker but more generalized trend of enrichment is also observed for various MLT repeat classes. These 279

findings are consistent with previous analyses by Kelley and Rinn (D. Kelley and Rinn 2012). 280

Interestingly, presently-active LINE1 elements are depleted in lncRNA exons (Figure 3A). It is 281

possible that this reflects selection against disruption to normal gene expression, where numerous weak 282

polyadenylation signals lead to premature transcription termination when the LINE1 element lies on the 283

same strand as the overlapping gene (Perepelitsa-Belancio and Deininger 2003). Another explanation may 284

be low transcriptional processivity exhibited by the LINE1 ORF2 in the sense strand, leading to reduced 285

gene expression and consequent negative selection (Perepelitsa-Belancio and Deininger 2003). 286

In a similar way, we investigated whether exonic TEs display a strand preference relative to host 287

lncRNA (Rory Johnson and Guigó 2014). We took care to avoid a major source of bias: as shown above, 288

many TSS and splice sites of lncRNA are contributed by TE, and such cases would lead to clear strand bias 289

in the resulting exonic TE sequences. To avoid this, we removed any TE insertions that overlap an exon-290

intron boundary. We calculated the relative strand overlap of all remaining TEs in lncRNA exons. Statistical 291

significance was assessed by randomisation (see Materials and Methods) (Figure 3B). We carried out the 292

same operation for lncRNA introns (Supplementary Figure S1). In lncRNA exons, a number of TE types 293

are enriched in either sense or antisense. They are dominated by LINE1 family members, possibly for the 294

reasons mentioned below. Other significantly enriched TE types include LTR78, MLT1B, and MIRc 295

(Figure 3B). 296

In introns, we observed a widespread depletion of same-strand TE insertions (Supplementary 297

Figure S1). This effect is modest yet statistically significant, and is especially true for LINE1 elements, 298

possibly for reasons mentioned above. In contrast, almost no TE types were significantly enriched on the 299

sense strand in introns. 300

To test for TE type-specific conservation, we turned to two sets of predictions of evolutionarily-301

conserved elements. First, the widely-used PhastCons conserved elements, based on phylogenetic hidden 302

Markov model (Siepel et al. 2005) calculated on primate, placental mammal and vertebrate alignments; 303

second, the more recently published “Evoluationarily Conserved Structures” (ECS) set of conserved RNA 304

structures (Smith et al. 2013). Importantly, the PhastCons regions are defined based on sequence 305

conservation alone, while the ECS are defined by phylogenetic analysis of RNA structure evolution. 306

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 11: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

11

To look for evidence of evolutionary conservation on exonic TEs, we calculated the frequency of 307

their overlap with evolutionarily conserved genomic elements compared to intronic TEs of the same type. 308

To assess statistical significance, we again used positional randomisation (see inset in Figure 3C). This 309

pipeline was applied independently to the PhastCons (Figure C3, Supplementary Figure S1B,C,D) and ECS 310

(Supplementary Figure S1) data. The majority of TEs do not exhibit significantly different conservation 311

when found in exons or introns (Grey points). Depending on conservation data used, the method detects 312

significant conservation for 17 to 40 TE types (Figure 3C). We also found a small number of TEs with high 313

evolutionary rates specifically in lncRNA exons, namely the young AluSz, AluSx and AluJb (PhastCons) 314

and L1M4c and AluSx1 (ECS) (coloured red in Figure 3C and Supplementary Figure S1). 315

316

An annotation of RIDLs 317

We next combined all TE classes with evidence of functionality into a draft annotation of RIDLs, 318

enriched for functional TE elements. This annotation combined altogether 99 TE types with at least one 319

types of selection evidence. For each TE / evidence pair, only those TE instances satisfying that evidence 320

were included. In other words, if MIRb elements were found to be associated with vertebrate PhastCons 321

elements, then only those instances of exonic MIRb elements overlapping such a PhastCons element would 322

be included in the RIDL annotation, and all other exonic MIRs would be excluded. An example is CCAT1 323

lncRNA oncogene: it carries three exonic MIR elements, of which one is defined as a RIDL based on its 324

overlapping a PhastCons element (Figure 4A). 325

After removing redundancy, the final RIDL annotation consists of 5374 elements, located within 326

3566 distinct lncRNA genes (Figure 3D). The most predominant TE families are MIR and L2 repeats, 327

representing 2329 and 1143 RIDLs (Figure 4B). The majority of both are defined based on evolutionary 328

evidence (Figure 4B). In contrast, RIDLs composed by ERV1, low complexity, satellite and simple repeats 329

families are more frequently identified due to exonic enrichment (Figure 4B). The entire RIDL annotation 330

is available in Supplementary File S2. 331

We also examined the evolutionary history of RIDLs. Using 6-mammal alignments, their depth of 332

evolutionary conservation could be inferred (Supplementary Figure S2). 12% of instances appear to be 333

Great Ape-specific, with no orthologous sequence beyond chimpanzee. 47% are primate-specific, while the 334

remaining 40% are identified in at least one non-primate mammal. The wide timeframe for appearance of 335

RIDLs is consistent with the wide diversity of TE types, from ancient MIR elements to presently-active 336

LINE1 (Jurka, Zietkiewicz, and Labuda 1995; Konkel, Walker, and Batzer 2010; Smith et al. 2013). 337

Instances of genomic TE insertions typically represent a fragment of the full consensus sequence. 338

We hypothesised that particular regions of the TE consensus will be important for RIDL activity, 339

introducing selection for these regions that would distinguish them from unselected, intronic copies. To test 340

this, we compared insertion profiles of RIDLs to intronic instances, for each TE type, and used the 341

correlation coefficient (CC) as a quantitative measure of similarity (Figure 4C and Supplementary File S5). 342

For 17 cases, a CC<0.9 points to possible selective forces acting on RIDL insertions. An example is the 343

macrosatellite SST1 repeat where RIDL copies in 41 lncRNAs show a strong preference inclusion of the 344

3’ end, in contrast to the general 5’ preference observed in introns (Figure 4C). This suggests a possible 345

functional relevance for the 1000-1500 nt region of the SST1 consensus. 346

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 12: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

12

347

RIDL-carrying lncRNAs are enriched for functions and disease roles 348

We next looked for evidence to support the RIDL annotation by investigating the properties of 349

RIDL-carrying lncRNAs. We first asked whether RIDLs are randomly distributed amongst lncRNAs, or 350

else clustered in a smaller number of genes. Figure 4D shows that the latter is the case, with a significant 351

deviation of RIDLs from a random distribution. These lncRNAs carry a mean of 1.15 RIDLs / kb of exonic 352

sequence (median: 0.84 RIDLs/kb) (Supplementary Figure S3). 353

Are RIDL-lncRNAs more likely to be functional? To address this, we compared lncRNA genes 354

carrying one or more RIDLs, to a length-matched set of non-RIDL lncRNAs (Supplementary Figure S4). 355

RIDL-lncRNAs are (1) over-represented in the reference database for functional lncRNAs, lncRNAdb 356

(Quek et al. 2015) (Figure 4E) and (2) are significantly enriched for lncRNAs associated with cancer and 357

other diseases (Figure 4F). Finally, we found that RIDL-lncRNAs are enriched for trait/disease-associated 358

SNPs in both their promoter and exonic regions (Figure 4G). 359

In addition to CCAT1 (Figure 4A) (Nissan et al. 2012) are a number of deeply-studied RIDL-360

containing genes. XIST, the X-chromosome silencing RNA contains seven internal RIDL elements. As we 361

pointed out previously (Rory Johnson and Guigó 2014), these include an intriguing array of four similar 362

pairs of MIRc / L2b repeats. The prostate cancer-associated UCA1 gene has a transcript isoform promoted 363

from an LTR7c, as well as an additional internal RIDL, thereby making a potential link between cancer 364

gene regulation and RIDLs. The TUG1 gene, involved in neuronal differentiation, contains highly 365

evolutionarily-conserved RIDLs including Charlie15k and MLT1K elements (Rory Johnson and Guigó 366

2014). Other RIDL-containing lncRNAs include MEG3, MEG9, SNHG5, ANRIL, NEAT1, CARMEN1 and 367

SOX2OT. LINC01206, located adjacent to SOX2OT, also contains numerous RIDLs. A full list of RIDL-368

carrying lncRNAs can be found in Supplementary File S6. 369

370

Correlation between RIDLs and subcellular localisation of host transcript 371

The location of a lncRNA within the cell is of key importance to its molecular function (Cabili et 372

al. 2015; Derrien et al. 2012)(Mas-Ponte et al. 2017a). We next asked whether RIDLs might define lncRNA 373

localisation (Hacisuleyman et al. 2016; B. Zhang et al. 2014)(Chillón and Pyle 2016) (Figure 5A). 374

Beginning with normalised subcellular RNAseq data from the LncATLAS database (Mas-Ponte et al. 375

2017a), based on 10 ENCODE cell lines (Djebali et al. 2012a), we comprehensively tested each of the 99 376

RIDL types for association with localisation of their host transcript. 377

After correcting for multiple hypothesis testing, this approach identified four distinct TEs that 378

significantly correlate with nuclear localisation: L1PA16, L2b, MIRb and MIRc (Figure 5B). For example, 379

44 lncRNAs carrying L2b RIDLs have 6.9-fold higher relative nuclear/cytoplasmic ratio in IMR90 cells, 380

and this tendency is observed in six difference cell types (Figure 5B,C). Furthermore, the degree of nuclear 381

localisation increases in lncRNAs as a function of the number of RIDLs they carry (Figure 5D). We also 382

found a significant relationship between GC-rich elements and cytoplasmic enrichment across three 383

independent cell samples. The GC-RIDL containing lncRNAs have between 2 and 2.3-fold higher relative 384

expression in cytoplasm in these cells (Supplementary Figure S5). These data suggest that certain types of 385

RIDL may regulate the subcellular localisation of lncRNAs. 386

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 13: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

13

387

RIDLs play a causative role in lncRNA nuclear localisation 388

To more directly test whether RIDLs play a causative role in nuclear localisation, we selected three 389

lncRNAs, based on: (i) presence of L2b, MIRb and MIRc RIDLs; (ii) moderate expression; (iii) nuclear 390

localisation, as inferred from RNAseq (Figure 6A,B and Supplementary Figure S6). Nuclear localisation 391

of these candidates was validated using qRTPCR (Figure 6C and Supplementary Figure S7). 392

We designed an assay to compare the localisation of transfected lncRNAs carrying wild-type 393

RIDLs, and mutated versions where RIDL sequence has been randomised (“Mutant”) (Figure 6D, full 394

sequences available in Supplementary File 3). Wild-type and Mutant lncRNAs were overexpressed in 395

cultured cells and their localisation evaluated by fractionation. Overexpressed copies were many fold 396

greater than endogenous lncRNAs (Supplementary Figure S8). Fractionation quality was verified by 397

Western blotting (Figure 6E) and qRT-PCR (Figure 6F), and we found negligible levels of DNA 398

contamination in cDNA samples (Supplementary Figure S9). 399

Comparison of Wild-Type and Mutant lncRNAs demonstrated a potent and consistent impact of 400

RIDLs on nuclear-cytoplasmic localisation: for all three candidates, loss of RIDL sequence resulted in 401

relocalisation of the host transcript from nucleus to cytoplasm (Figure 6F). Overexpressed Wild-Type 402

lncRNA copies had slightly lower levels of nuclear enrichment compared to endogenous transcripts, and 403

the two types were distinguished by primers matching a plasmid-encoded region (Supplementary File S4). 404

We repeated these experiments in another cell line, A549, and observed similar effects (Supplementary 405

Figure S7). These results demonstrate a role for L2b, MIRb and MIRc elements in nuclear localisation of 406

lncRNAs. 407

408

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 14: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

14

Discussion 409

Recent years have seen a rapid increase in the number of annotated lncRNAs. However our 410

understanding of their molecular functions, and how such functions are encoded in primary RNA 411

sequences, lag far behind. Two recent conceptual developments offer hope for resolving the sequence-412

function code of lncRNAs: First, the idea that the subcellular localisation of lncRNAs is a readily 413

quantifiable characteristic that holds important clues to function; Second, that the abundant transposable 414

element (TE) content of lncRNAs may contribute to functionality. 415

In this study we have linked these two ideas, by showing that TEs can drive the nuclear localisation 416

of lncRNAs. A global correlation analysis of TEs and RNA localisation data revealed a handful of TEs, 417

most notably LINE2b, MIRb and MIRc, strongly correlate with the degree of nuclear retention of their host 418

transcripts. This correlation is observed in multiple cell types, and scales with the number of TEs present. 419

A causative link was established experimentally, confirming that the indicated TEs have a two- to four-fold 420

effect on transcript localisation. 421

These data are consistent with our previously-published hypothesis that exonic TE elements may 422

function as lncRNA domains. In this “RIDL hypothesis”, transposable elements are co-opted by natural 423

selection to form “Repeat Insertion Domains of LncRNA”, that is, fragments of sequence that confer 424

adaptive advantage through some change in the activity of their host lncRNA. We proposed that RIDLs 425

may serve as binding sites for proteins or other nucleic acids, and indeed a growing body of evidence 426

supports this (Rory Johnson and Guigó 2014). In the context of localisation, RIDLs could mediate nuclear 427

retention due to their described interactions with nuclear proteins (D. R. Kelley et al. 2014), although it is 428

also conceivable that the effect occurs through hybridisation to complementary repeats in genomic DNA. 429

The localisation RIDLs discovered – MIR and LINE2 - are both ancient and contemporaneous, 430

being active prior to the mammalian radiation (Cordaux and Batzer n.d.). Both have previously been 431

associated with acquired roles in the context of genomic DNA (Jjingo et al. 2014)(R. Johnson et al. 2006). 432

Although the evolutionary history of lncRNA remains an active area of research and accurate dating of 433

lncRNA gene birth is challenging, it appears that the majority of human lncRNAs were born after the 434

mammalian radiation (Hezroni et al. 2015)(Hezroni et al. 2017)(Necsulea et al. 2014)(Washietl, Kellis, and 435

Garber 2014). This would mean that MIR and LINE2 RIDLs were pre-existing sequences that were exapted 436

by newly-born lncRNAs. This corresponds to the “latent” exaptation model proposed by Feschotte and 437

colleagues(Chuong, Elde, and Feschotte 2017). Given that nuclear retention is at odds with the primary 438

needs of TE RNAs to be exported to the cytoplasm, we propose that the observed nuclear localisation 439

activity is a more modern feature of L2b/MIR RIDLs, which is unrelated to their original roles. However it 440

is also possible that for other cases the reverse could be true – a pre-existing lncRNA exapts a newly-441

inserted TE. Our use of evolutionary conservation to identify RIDLs likely biases our analysis against the 442

discovery of modern TE families, such as Alus. 443

This study marks a step in the ongoing efforts to map the domains of lncRNAs. Previous studies 444

have utilised a variety of approaches, from integrating experimental protein-binding data (Van Nostrand et 445

al. 2016)(Hu et al. 2017)(Li et al. 2014), to evolutionarily conserved segments (Smith et al. 2013)(Seemann 446

et al. 2017). Previous maps of TEs have highlighted their profound roles in lncRNA gene evolution 447

(Kapusta et al. 2013)(D. Kelley and Rinn n.d.)(Hezroni et al. 2015). However the present RIDL annotation 448

stands apart in attempting to identify the subset of TEs with evidence for selection. We hope that the present 449

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 15: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

15

RIDL map will prove a resource for future studies to better understand functional domains of lncRNAs. 450

Although various evidence suggests that the RIDL annotation is a useful enrichment group of functional 451

TE elements, it most likely contains substantial false positive and false negative rates that will be improved 452

in future. 453

This study, and the concept of RIDLs in general, may help to address two longstanding questions 454

in the lncRNA field. 455

How does lncRNA primary sequence evolve new functions so rapidly? The insertion of TE-derived “pre-456

fabricated” sequence domains, with protein-, RNA- or DNA-binding capacity, confers rapid acquisition of 457

new molecular activities. There is a variety of evidence that TE sequences contain abundant, latent activity 458

of this type (reviewed in (Rory Johnson and Guigó 2014)). 459

Why do lncRNAs have nuclear-restricted localisation? Although they are readily detected in the cytoplasm, 460

it is nevertheless true that lncRNAs tend to have higher nuclear/cytoplasmic ratios compared to mRNAs 461

(Clark et al. 2012)(Ulitsky and Bartel 2013)(Derrien et al. 2012)(Mas-Ponte et al. 2017a). This is true across 462

various human and mouse cell types. Although this may partially be explained by decreased stability 463

(Mukherjee et al. 2017), it is likely that RNA sequence motifs also contribute to nuclear localisation 464

(Chillón and Pyle 2016)(B. Zhang et al. 2014). Here we show that this is the case, and that the enrichment 465

of certain RIDL types in lncRNA mature sequences is likely to be a major contributor to lncRNA nuclear 466

retention. 467

In summary therefore, we have made available a first annotation of selected RIDLs in lncRNAs, 468

and described a new paradigm for TE-derived fragments as drivers of nuclear localisation in lncRNAs. 469

470

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 16: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

16

Figure Legends 471

472

Figure 1: Repeat insertion domains of lncRNAs (RIDLs). 473

(A) In the “Repeat Insertion Domain of LncRNAs” (RIDL) model, exonically-inserted fragments of 474

transposable elements contain pre-formed protein-binding (red), RNA-binding (green) or DNA-475

binding (blue) activities, that contribute to functionality of the host lncRNA (black). RIDLs are 476

likely to be a small minority of exonic TEs, coexisting with large numbers of non-functional 477

“passengers” (grey). 478

(B) RIDLs (dark orange arrows) will be distinguished from passenger TEs by signals of selection, 479

including: (1) simple enrichment in exons; (2) a preference for residing on a particular strand 480

relative to the host transcript; (3) evolutionary conservation. Selection might be identified by 481

comparing exonic TEs to a neutral population, for example those residing in lncRNA introns (light 482

coloured arrows). 483

484

Figure 2: An exonic transposable element annotation with the GENCODE v21 lncRNA catalogue. 485

(A) Statistics for the exonic TE annotation process using GENCODE v21 lncRNAs. 486

(B) The fraction of nucleotides overlapped by TEs for lncRNA exons and introns, and the whole 487

genome. 488

(C) Overview of the annotation process. The exons of all transcripts within a lncRNA gene annotation 489

are merged. Merged exons are intersected with the RepeatMasker TE annotation. Intersecting TEs 490

are classified into one of six categories (bottom panel) according to the gene structure with which 491

they intersect, and the relative strand of the TE with respect to the gene. 492

(D) Summary of classification breakdown for exonic TE annotation. 493

(E) Classification of TE classes in exonic TE annotation. Numbers indicate instances of each type. 494

495

Figure 3: Evidence for selection on transposable elements in lncRNA exons. 496

(A) Figure shows, for every TE type, the enrichment of per nucleotide coverage in exons compared to 497

introns (y axis) and overall exonic nucleotide coverage (x axis). Enriched TE types (at a 2-fold 498

cutoff) are shown in green. 499

(B) As for (A), but this time the y axis records the ratio of nucleotide coverage in sense vs antisense 500

configuration. “Sense” here is defined as sense of TE annotation relative to the overlapping exon. 501

Similar results for lncRNA introns may be found in Supplementary Figure 1. Significantly-enriched 502

TE types are shown in green. Statistical significance was estimated by a randomisation procedure, 503

and significance is defined at an uncorrected empirical p-value < 0.001 (See Material and Methods). 504

(C) As for (A), but here the y axis records the ratio of per-nucleotide overlap by PhastCons mammalian-505

conserved elements for exons vs introns. Similar results for three other measures of evolutionary 506

conservation may be found in Supplementary Figure 1. Significantly-enriched TE types are shown 507

in green. Statistical significance was estimated by a randomisation procedure, and significance is 508

defined at an uncorrected empirical p-value < 0.001 (See Material and Methods). An example of 509

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 17: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

17

significance estimation is shown in the inset: the distribution shows the exonic/intronic 510

conservation ratio for 1000 simulations. The green arrow shows the true value, in this case for 511

MLT1A0 type. 512

(D) Summary of TE types with evidence of exonic selection. Six distinct evidence types are shown in 513

rows, and TE types in columns. On the right are summary statistics for (i) the number of unique TE 514

types identified by each method, and (ii) the number of instances of exonic TEs from each type 515

with appropriate selection evidence. The latter are henceforth defined as “RIDLs”. 516

517

Figure 4: Annotated RIDLs and RIDL-lncRNAs 518

(A) Example of a RIDL-lncRNA gene: CCAT1. Of note is that although several exonic TE instances 519

are identified (grey), including three separate MIR elements, only one is defined a RIDL (orange) 520

due to overlap of a conserved element. 521

(B) Breakdown of RIDL instances by TE family and evidence sources. 522

(C) Insertion profile of SST1 RIDLs (blue) and intronic insertions (red). x axis shows the entire 523

consensus sequence of SST1. y axis indicates the frequency with which each nucleotide position 524

is present in the aggregate of all insertions. “CC”: Spearman correlation coefficient of the two 525

profiles. “RIDLs” / “Intronic TEs” indicate the numbers of individual insertions considered for 526

RIDLs / intronic insertions, respectively. 527

(D) Number of lncRNAs (y axis) carrying the indicated number of RIDL (x axis) given the true 528

distribution (black) and randomized distribution (red). The 95% confidence interval was computed 529

empirically, by randomly shuffling RIDLs across the entire lncRNA annotation. 530

(E) The number of RIDL-lncRNAs, and a length-matched set of non-RIDL lncRNAs, which are 531

present in the lncRNAdb database of functional lncRNAs (Quek et al. 2015). Numbers denote 532

gene counts. Significance was calculated using Fisher’s exact test. 533

(F) As for (E) for lncRNAs present in disease- and cancer-associated lncRNA databases (see Materials 534

and Methods). 535

(G) As for (E) for lncRNAs containing at least one trait/disease-associated SNP in their promoter 536

regions (darker colors) or exonic regions (lighter colors) (see Materials and Methods). 537

Figure 5: Correlation between RIDLs and host lncRNA nuclear-cytoplasmic localisation. 538

(A) We hypothesized that RIDLs may govern the subcellular localization of host lncRNAs. 539

(B) Results of an in silico screen of RIDL types (rows) against lncRNA nuclear/cytoplasmic 540

localisation in human cell lines (columns). For every RIDL type (rows) in every cell type 541

(columns), the localisation of RIDL-containing lncRNAs was compared to all other detected 542

lncRNAs. Significantly-different pairs are coloured (Benjamini-Hochberg corrected p-value < 543

0.01; Wilcoxon test). Colour scale indicates the nuclear-cytoplasmic ratio mean of RIDL-lncRNAs. 544

Numbers in cells indicate the number of considered RIDL-lncRNAs. 545

(C) The nuclear-cytoplasmic localization of lncRNAs carrying L2b RIDLs in IMR90 cells. Blue 546

indicates lncRNAs carrying ≥1 L2b RIDLs, grey indicates all other detected lncRNAs (“not”). 547

Dashed lines represent medians. Significance was calculated using Wilcoxon test. 548

(D) The nuclear-cytoplasmic ratio of lncRNAs broken down by the number of significant RIDLs that 549

they carry (L1PA16, L2b, MIRb, MIRc). Correlation coefficient (Rho) between number of RIDLs 550

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 18: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

18

and nuclear-cytoplasmic ratio of lncRNAs and the corresponding p-value (P) are indicated in the 551

plot. In each box, upper value indicates the number of lncRNAs, and lower value the median. 552

553

Figure 6: Disruption of RIDLs results in lncRNA relocalisation from nucleus to cytoplasm. 554

(A) Structures of candidate RIDL-lncRNAs. Orange indicates RIDL positions. For each RIDL, 555

numbers indicate the position within the TE consensus, and its orientation with respect to the 556

lncRNA is indicated by arrows (“>” for same strand, “<” for opposite strand). 557

(B) Experimental design 558

(C) Expression of the three lncRNA candidates as inferred from HeLa RNAseq (Djebali et al. 559

2012c). 560

(D) Nuclear/cytoplasmic localisation of endogenous candidate lncRNA copies in wild-type HeLa 561

cells, as measured by qRT-PCR. 562

(E) The purity of HeLa subcellular fractions was assessed by Western blotting against specific 563

markers. GAPDH / Histone H3 proteins are used as cytoplasmic / nuclear markers, respectively. 564

(F) Nuclear/cytoplasmic localisation of transfected candidate lncRNAs. GAPDH/MALAT1 are 565

used as cytoplasmic/nuclear controls, respectively. N indicates the number of biological 566

replicates, and error bars represent standard error of the mean. P-values for paired t-test (1 tail) 567

are shown. 568

569

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 19: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

19

Supplementary Figure Legends 570

571

Supplementary Figure S1: (A) Figure shows, for every TE type, the ratio of intronic nucleotide coverage 572

in sense vs antisense configuration. “Sense” here is defined as sense of TE annotation relative to the sense 573

of the overlapping intron. Significantly-enriched TE types are shown in green. Statistical significance was 574

estimated by a randomisation procedure, and significance is defined at an uncorrected empirical p-value < 575

0.001 (See Material and Methods). (B-D) As for (A), but y axis records the ratio of nucleotide overlap in 576

exons vs introns by PhastCons Primate-conserved elements, PhastCons placental-conserved elements, 577

PhastCons vertebrate-conserved elements and Evolutionarily conserved structure, respectively. 578

Supplementary Figure S2: Evolutionary analysis using 6-mammal alignments. Rows represent human 579

RIDLs. In each column (6 different species) they are coloured in red when they are conserved in the given 580

species and otherwise in light yellow. 581

Supplementary Figure S3: Density distribution of RIDL-carrying lncRNAs based on the number of 582

RIDLs per kb of exons that they contain. Dotted line indicates median. 583

Supplementary Figure S4: Creating an exon-length-matched sample of non-RIDL lncRNAs. (A) Exonic 584

length distribution of RIDL-carrying and non-RIDL GENCODE v21 lncRNAs (“others”). (B) Exonic 585

length distribution of RIDL-carrying lncRNAs and a length-matched sample of non-RIDL lncRNAs. The 586

latter were used for all comparisons at gene level. 587

Supplementary Figure S5: Cytoplasmic-nuclear localization in IMR90. Yellow represents lncRNAs 588

carrying one or more copies of GC-rich RIDL elements, grey represents all other detected lncRNAs. Dashed 589

lines indicate the median of each group. 590

Supplementary Figure S6: Barplot generated by LncATLAS webserver (Mas-Ponte et al. 2017b). Bars 591

indicate log2-transformed cytoplasmic/nuclear expression ratios (CN RCI) in 15 cell lines for three 592

candidate genes. Colours represent the total nuclear expression of each gene in each cell line. 593

Supplementary Figure S7: Experimental data for A549 cells. (A) Expression data for three candidate 594

lncRNAs calculated using RNAseq data (Djebali et al. 2012a). (B) Validation of nuclear/cytoplasmic 595

localisation of candidate genes in wild-type cells by qRT-PCR. (C) Validation of subcellular fractionation 596

by Western blot, using GAPDH/Histone H3 proteins as cytoplasmic/nuclear markers, respectively. (D) 597

Nuclear/cytoplasmic localisation of transfected candidate lncRNAs. 598

Supplementary Figure S8: Estimating overexpression of transfected lncRNA candidates. Data are shown 599

individually for each of four biological replicates. y axis represents the fold change of transfected gene level 600

compared to the endogeneous level estimated from wild-type cells. 601

Supplementary Figure 9: DNA contamination check in RNA Isolated from different subcellular fractions. 602

Total RNA was isolated from different subcellular fractions using RNA isolation kit as described in 603

Materials and Methods. Equal amount of RNAs with (+RT) and without (-RT) reverse transcription, were 604

subjected to PCR (40 cycles) with an exonic primer. The reaction products were electrophoresed on a 1.5 605

% agarose gel stained with SYBR safe. The T, C, N represent the RNA isolated from total, cytoplasmic and 606

nuclear fractions respectively. 607

608

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 20: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

20

Supplementary Files 609

610

Supplementary File S1: File is in BED format with coordinates for human genome assembly GRCh38 / 611

hg38. The file contains the coordinates of genomic nucleotides that overlap both TEs and merged exons of 612

lncRNA genes. 613

The 4th “name” column contains information on the TE (from RepeatMasker annotation) in comma-614

separated format: 615

Chromosome of TE annotation, start of TE annotation, end of TE annotation, strand of TE annotation, TE 616

name, TE class, TE family, left in repeat, end in repeat, begin in repeat, hypothetical start if TE were full 617

length, hypothetical end if TE were full length. 618

In the 5th “score” column is indicated the TE classification: 619

+1 : tss.plus ; +2 : splice_acc.plus ; +3 : encompass.plus ; +4 : splice_don.plus; +5 : tts.plus ; +6 : 620

inside.plus. The sign (+-) denotes whether the annotated strand of the TE lies on the same/opposite strand 621

with respect to that of the overlapping lncRNA. 622

Supplementary File S2: File is in BED format with coordinates for human genome assembly GRCh38 / 623

hg38. It was created by merging a subset of the exonic TEs from Supplementary File S1, and thus some 624

lines may be composed of merged features. 625

The 4th “name” column contains information on the TE (from RepeatMasker annotation) in comma-626

separated format: 627

Chromosome of TE annotation, start of TE annotation, end of TE annotation, strand of TE annotation, TE 628

name, TE class, TE family, left in repeat, end in repeat, begin in repeat, hypothetical start if TE were full 629

length, hypothetical end if TE were full length. 630

Supplementary File S3: File contains wild type and mutant synthesized plasmid sequences. Coloured 631

sequences indicate wild-type / scrambled RIDL sequences. 632

Supplementary File S4: Primer sequences and calculations of their efficiency. 633

Supplementary File S5: PDF images of insertion profiles of all RIDLs (blue) and intronic insertions (red). 634

x axis shows the entire consensus sequence of a given TE type. y axis indicates the frequency with which 635

each nucleotide position is present in the aggregate of all insertions. “CC”: Spearman correlation coefficient 636

of the two profiles. “Data” / “Reference” indicate the numbers of individual insertions considered for RIDLs 637

/ intronic insertions, respectively. 638

Supplementary File S6: List of GENCODE v21 lncRNAs carrying one or more RIDLs. First column 639

contains gene ID and second column the number of RIDLs present in the lncRNA. 640

641

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 21: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

21

Acknowledgements 642

We acknowledge Deborah Re (DBMR), Silvia Roesselet (DBMR) and Marianne Zahn (Inselspital) 643

for administrative support. We wish to thank Roderic Guigo (CRG), Marc Friedlaender (SciLife Lab) and 644

Marta Mele (Harvard) for many helpful discussions and Roberta Esposito (DBMR) for help and advice 645

regarding experiments and their analysis. Samir Ounzain (CHUV) also contributed suggestions regarding 646

experimental design. Julien Lagarde (CRG) kindly provided help in gene sampling analysis. CN is 647

supported by grant TIN-2013-41990-R from the Spanish Ministry of Economy, Industry and 648

Competitiveness (MINECO). This research was supported by the NCCR RNA & Disease funded by the 649

Swiss National Science Foundation. 650

651

652

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 22: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

22

References 653

654

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and 655

Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B 656

(Methodological) 57: 289–300. https://www.jstor.org/stable/2346101 (August 25, 2017). 657

Blackwell, Benjamin J et al. “Protein Interactions with piALU RNA Indicates Putative Participation of 658 retroRNA in the Cell Cycle, DNA Repair and Chromatin Assembly.” 659 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3383447/pdf/mge-2-26.pdf (August 25, 2017). 660

Bourque, Guillaume et al. 2008. “Evolution of the Mammalian Transcription Factor Binding Repertoire 661 via Transposable Elements.” Genome research 18(11): 1752–62. 662 http://www.ncbi.nlm.nih.gov/pubmed/18682548 (August 25, 2017). 663

———. 2009. “Transposable Elements in Gene Regulation and in the Evolution of Vertebrate Genomes.” 664 Current Opinion in Genetics & Development 19(6): 607–12. 665 http://www.sciencedirect.com/science/article/pii/S0959437X09001725 (August 25, 2017). 666

Cabili, Moran N et al. 2015. “Localization and Abundance Analysis of Human lncRNAs at Single-Cell 667 and Single-Molecule Resolution.” Genome biology 16(1): 20. 668 http://genomebiology.com/2015/16/1/20 (January 30, 2015). 669

Carrieri, Claudia et al. 2012. “Long Non-Coding Antisense RNA Controls Uchl1 Translation through an 670 Embedded SINEB2 Repeat.” Nature 491(7424): 454–57. 671 http://www.ncbi.nlm.nih.gov/pubmed/23064229 (February 10, 2015). 672

Chen, G. et al. 2013. “LncRNADisease: A Database for Long-Non-Coding RNA-Associated Diseases.” 673 Nucleic Acids Research 41(D1): D983–86. http://www.ncbi.nlm.nih.gov/pubmed/23175614 674 (December 24, 2016). 675

Chen, Ling-Ling. 2016. “Linking Long Noncoding RNA Localization and Function.” Trends in 676 biochemical sciences. http://www.ncbi.nlm.nih.gov/pubmed/27499234 (August 25, 2016). 677

Chillón, Isabel, and Anna M. Pyle. 2016. “Inverted Repeat Alu Elements in the Human lincRNA-p21 678 Adopt a Conserved Secondary Structure That Regulates RNA Function.” Nucleic Acids Research: 679 gkw599. http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkw599 (September 5, 2016). 680

Chuong, Edward B, Nels C Elde, and Cédric Feschotte. 2017. “Regulatory Activities of Transposable 681 Elements: From Conflicts to Benefits.” 682 https://www.nature.com/nrg/journal/v18/n2/pdf/nrg.2016.139.pdf (August 25, 2017). 683

Clark, Michael B et al. 2012. “Genome-Wide Analysis of Long Noncoding RNA Stability.” Genome 684 research 22(5): 885–98. http://genome.cshlp.org/content/22/5/885.long (June 3, 2014). 685

Cordaux, Richard, and Mark A Batzer. “The Impact of Retrotransposons on Human Genome Evolution.” 686 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2884099/pdf/nihms-201920.pdf (August 29, 2017). 687

Derrien, Thomas et al. 2012. “The GENCODE v7 Catalog of Human Long Noncoding RNAs: Analysis 688 of Their Gene Structure, Evolution, and Expression.” Genome research 22(9): 1775–89. 689 http://genome.cshlp.org/content/22/9/1775.long (May 23, 2014). 690

Djebali, Sarah et al. 2012a. “Landscape of Transcription in Human Cells.” Nature 489(7414): 101–8. 691 http://www.nature.com/doifinder/10.1038/nature11233 (April 20, 2017). 692

———. 2012b. “Landscape of Transcription in Human Cells.” Nature 489(7414): 101–8. 693 http://www.ncbi.nlm.nih.gov/pubmed/22955620 (May 11, 2017). 694

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 23: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

23

———. 2012c. “Landscape of Transcription in Human Cells.” Nature 489(7414): 101–8. 695 http://www.ncbi.nlm.nih.gov/pubmed/22955620 (July 17, 2017). 696

Elisaphenko, Eugeny A et al. 2008. “A Dual Origin of the Xist Gene from a Protein-Coding Gene and a 697 Set of Transposable Elements.” 698 http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0002521&type=printable 699 (August 25, 2017). 700

Faulkner, Geoffrey J et al. “The Regulated Retrotransposon Transcriptome of Mammalian Cells.” 701 https://www.nature.com/ng/journal/v41/n5/pdf/ng.368.pdf (August 25, 2017). 702

Feschotte, Cédric. 2008. “Transposable Elements and the Evolution of Regulatory Networks.” Nature 703 reviews. Genetics 9(5): 397–405. http://www.ncbi.nlm.nih.gov/pubmed/18368054 (September 1, 704 2017). 705

Gong, Chenguang, and Lynne E Maquat. 2011. “lncRNAs Transactivate STAU1-Mediated mRNA Decay 706 by Duplexing with 3’ UTRs via Alu Elements.” Nature 470(7333): 284–88. 707 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3073508&tool=pmcentrez&rendertype=708 abstract (March 6, 2015). 709

Guttman, Mitchell, and John L. Rinn. 2012. “Modular Regulatory Principles of Large Non-Coding 710 RNAs.” Nature 482(7385): 339–46. http://www.nature.com/doifinder/10.1038/nature10887 711 (January 6, 2017). 712

Hacisuleyman, Ezgi et al. 2016. “Function and Evolution of Local Repeats in the Firre Locus.” Nature 713 Communications 7: 11021. http://www.nature.com/doifinder/10.1038/ncomms11021 (November 20, 714 2016). 715

Harrow, J. et al. 2012. “GENCODE: The Reference Human Genome Annotation for The ENCODE 716 Project.” Genome Research 22(9): 1760–74. http://genome.cshlp.org/cgi/doi/10.1101/gr.135350.111 717 (November 20, 2016). 718

Hezroni, Hadas et al. 2015. “Principles of Long Noncoding RNA Evolution Derived from Direct 719 Comparison of Transcriptomes in 17 Species.” Cell Reports 11(7): 1110–22. 720 http://linkinghub.elsevier.com/retrieve/pii/S2211124715004106 (September 1, 2017). 721

———. 2017. “A Subset of Conserved Mammalian Long Non-Coding RNAs Are Fossils of Ancestral 722 Protein-Coding Genes.” Genome Biology 18(1): 162. 723 http://www.ncbi.nlm.nih.gov/pubmed/28854954 (September 2, 2017). 724

Hindorff, Lucia A et al. 2009. “Potential Etiologic and Functional Implications of Genome-Wide 725 Association Loci for Human Diseases and Traits.” Proceedings of the National Academy of Sciences 726 of the United States of America 106(23): 9362–67. http://www.ncbi.nlm.nih.gov/pubmed/19474294 727 (August 30, 2017). 728

Holdt, Lesca M et al. 2013. “Alu Elements in ANRIL Non-Coding RNA at Chromosome 9p21 Modulate 729 Atherogenic Cell Functions through Trans-Regulation of Gene Networks.” PLoS Genet 9(7). 730 http://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1003588&type=printable 731 (August 25, 2017). 732

Hu, Boqin et al. 2017. “POSTAR: A Platform for Exploring Post-Transcriptional Regulation Coordinated 733 by RNA-Binding Proteins.” Nucleic Acids Research 45(D1): D104–14. 734 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw888 (September 2, 2017). 735

Huda, Ahsan, Nathan J. Bowen, Andrew B. Conley, and I. King Jordan. 2011. “Epigenetic Regulation of 736 Transposable Element Derived Human Gene Promoters.” Gene 475(1): 39–48. 737

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 24: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

24

http://www.sciencedirect.com/science/article/pii/S0378111910004762 (August 27, 2017). 738

Jjingo, Daudi et al. 2014. “Mammalian-Wide Interspersed Repeat (MIR)-Derived Enhancers and the 739 Regulation of Human Gene Expression.” Mobile DNA 5(1): 14. 740 http://mobilednajournal.biomedcentral.com/articles/10.1186/1759-8753-5-14 (September 2, 2017). 741

Johnson, R. et al. 2006. “Identification of the REST Regulon Reveals Extensive Transposable Element-742 Mediated Binding Site Duplication.” Nucleic Acids Research 34(14): 3862–77. 743 http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkl525 (August 25, 2017). 744

Johnson, Rory, and Roderic Guigó. 2014. “The RIDL Hypothesis: Transposable Elements as Functional 745 Domains of Long Noncoding RNAs.” RNA (New York, N.Y.) 20(7): 959–76. 746 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4114693&tool=pmcentrez&rendertype=747 abstract (April 29, 2015). 748

Jurka, J, E Zietkiewicz, and D Labuda. 1995. “Ubiquitous Mammalian-Wide Interspersed Repeats (MIRs) 749 Are Molecular Fossils from the Mesozoic Era.” Nucleic acids research 23(1): 170–75. 750 http://www.ncbi.nlm.nih.gov/pubmed/7870583 (August 29, 2017). 751

Kapusta, Aurélie et al. 2013. “Transposable Elements Are Major Contributors to the Origin, 752 Diversification, and Regulation of Vertebrate Long Noncoding RNAs.” ed. Hopi E. Hoekstra. PLoS 753 genetics 9(4): e1003470. http://dx.plos.org/10.1371/journal.pgen.1003470 (August 25, 2017). 754

Kelley, David R, David G Hendrickson, Danielle Tenen, and John L Rinn. 2014. “Transposable Elements 755 Modulate Human RNA Abundance and Splicing via Specific RNA-Protein Interactions.” Genome 756 Biology 15(12): 537. http://www.ncbi.nlm.nih.gov/pubmed/25572935 (September 2, 2017). 757

Kelley, David, and John Rinn. 2012. “Transposable Elements Reveal a Stem Cell-Specific Class of Long 758 Noncoding RNAs.” Genome Biology 13(11): R107. 759 http://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-11-r107 (August 25, 2017). 760

———. “Transposable Elements Reveal a Stem Cell-Specific Class of Long Noncoding RNAs.” 761 https://genomebiology.biomedcentral.com/track/pdf/10.1186/gb-2012-13-11-762 r107?site=genomebiology.biomedcentral.com (August 25, 2017). 763

Konkel, Miriam K, Jerilyn A Walker, and Mark A Batzer. 2010. “LINEs and SINEs of Primate 764 Evolution.” Evolutionary anthropology 19(6): 236–49. 765 http://www.ncbi.nlm.nih.gov/pubmed/25147443 (August 29, 2017). 766

Lev-Maor, Galit, Rotem Sorek, Noam Shomron, and Gil Ast. 2003. “The Birth of an Alternatively 767 Spliced Exon: 3’ Splice-Site Selection in Alu Exons.” Science 300(5623). 768 http://science.sciencemag.org/content/300/5623/1288/tab-pdf (August 25, 2017). 769

Li, Jun-Hao et al. 2014. “starBase v2.0: Decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA 770 Interaction Networks from Large-Scale CLIP-Seq Data.” Nucleic Acids Research 42(D1): D92–97. 771 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkt1248 (September 2, 2017). 772

Martin, Kelsey C, and Anne Ephrussi. 2009. “mRNA Localization: Gene Expression in the Spatial 773 Dimension.” Cell 136(4): 719–30. http://www.ncbi.nlm.nih.gov/pubmed/19239891 (August 29, 774 2017). 775

Mas-Ponte, David et al. 2017a. “LncATLAS Database for Subcellular Localisation of Long Noncoding 776 RNAs.” RNA (New York, N.Y.): rna.060814.117. 777 http://rnajournal.cshlp.org/lookup/doi/10.1261/rna.060814.117 (April 10, 2017). 778

———. 2017b. “LncATLAS Database for Subcellular Localization of Long Noncoding RNAs.” RNA 779

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 25: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

25

23(7): 1080–87. http://rnajournal.cshlp.org/lookup/doi/10.1261/rna.060814.117 (July 19, 2017). 780

Mercer, Tim R, and John S Mattick. 2013. “Structure and Function of Long Noncoding RNAs in 781 Epigenetic Regulation.” Nature Publishing Group 20. 782 https://www.nature.com/nsmb/journal/v20/n3/pdf/nsmb.2480.pdf (August 25, 2017). 783

Mukherjee, Neelanjan et al. 2017. “Integrative Classification of Human Coding and Noncoding Genes 784 through RNA Metabolism Profiles.” Nature structural & molecular biology 24(1): 86–96. 785 http://www.nature.com/doifinder/10.1038/nsmb.3325 (September 2, 2017). 786

Necsulea, Anamaria et al. 2014. “The Evolution of lncRNA Repertoires and Expression Patterns in 787 Tetrapods.” Nature 505(7485): 635–40. http://www.ncbi.nlm.nih.gov/pubmed/24463510 (May 11, 788 2017). 789

Ning, Shangwei et al. 2016. “Lnc2Cancer: A Manually Curated Database of Experimentally Supported 790 lncRNAs Associated with Various Human Cancers.” Nucleic Acids Research 44(D1): D980–85. 791 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkv1094 (August 30, 2017). 792

Nissan, Aviram et al. 2012. “Colon Cancer Associated Transcript-1: A Novel RNA Expressed in 793 Malignant and Pre-Malignant Human Tissues.” International Journal of Cancer 130(7): 1598–1606. 794 http://www.ncbi.nlm.nih.gov/pubmed/21547902 (August 29, 2017). 795

Van Nostrand, Eric L et al. 2016. “Robust Transcriptome-Wide Discovery of RNA-Binding Protein 796 Binding Sites with Enhanced CLIP (eCLIP).” Nature Methods 13(6): 508–14. 797 http://www.ncbi.nlm.nih.gov/pubmed/27018577 (September 2, 2017). 798

Perepelitsa-Belancio, Victoria, and Prescott Deininger. 2003. “RNA Truncation by Premature 799 Polyadenylation Attenuates Human Mobile Element Activity.” Nature Genetics 35(4): 363–66. 800 http://www.nature.com/doifinder/10.1038/ng1269 (August 25, 2017). 801

Quek, Xiu Cheng et al. 2015. “lncRNAdb v2.0: Expanding the Reference Database for Functional Long 802 Noncoding RNAs.” Nucleic acids research 43(Database issue): D168-73. 803 https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku988 (December 24, 2016). 804

Quinlan, Aaron R, and Ira M Hall. 2010. “BEDTools: A Flexible Suite of Utilities for Comparing 805 Genomic Features.” Bioinformatics (Oxford, England) 26(6): 841–42. 806 http://www.ncbi.nlm.nih.gov/pubmed/20110278 (August 25, 2017). 807

Roberts, Justin T, Sara E Cardin, and Glen M Borchert. 2014. “Burgeoning Evidence Indicates That 808 microRNAs Were Initially Formed from Transposable Element Sequences.” Mobile Genetic 809 Elements 4(3): e29255. http://www.tandfonline.com/doi/abs/10.4161/mge.29255 (August 25, 2017). 810

Schmitt, Adam M. et al. 2016. “Long Noncoding RNAs in Cancer Pathways.” Cancer Cell 29(4): 452–811 63. http://linkinghub.elsevier.com/retrieve/pii/S1535610816300927 (June 28, 2016). 812

Seemann, Stefan E. et al. 2017. “The Identification and Functional Annotation of RNA Structures 813 Conserved in Vertebrates.” Genome Research 27(8): 1371–83. 814 http://www.ncbi.nlm.nih.gov/pubmed/28487280 (September 2, 2017). 815

Sela, Noa et al. 2007. “Comparative Analysis of Transposed Element Insertion within Human and Mouse 816 Genomes Reveals Alu’s Unique Role in Shaping the Human Transcriptome.” 8(6). 817 https://genomebiology.biomedcentral.com/track/pdf/10.1186/gb-2007-8-6-818 r127?site=genomebiology.biomedcentral.com (August 25, 2017). 819

Siepel, Adam et al. 2005. “Evolutionarily Conserved Elements in Vertebrate, Insect, Worm, and Yeast 820 Genomes.” Genome research 15(8): 1034–50. http://www.ncbi.nlm.nih.gov/pubmed/16024819 821

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 26: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

26

(August 25, 2017). 822

Smith, M. A., T. Gesell, P. F. Stadler, and J. S. Mattick. 2013. “Widespread Purifying Selection on RNA 823 Structure in Mammals.” Nucleic Acids Research 41(17): 8220–36. 824 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3783177&tool=pmcentrez&rendertype=825 abstract (April 20, 2015). 826

Suzuki, Keiko et al. 2010. “REAP: A Two Minute Cell Fractionation Method.” BMC Research Notes 827 3(1): 294. http://www.ncbi.nlm.nih.gov/pubmed/21067583 (August 28, 2017). 828

Ulitsky, Igor, and David P. Bartel. 2013. “lincRNAs: Genomics, Evolution, and Mechanisms.” Cell 829 154(1): 26–46. http://linkinghub.elsevier.com/retrieve/pii/S0092867413007599 (November 20, 830 2016). 831

Washietl, Stefan, Manolis Kellis, and Manuel Garber. 2014. “Evolutionary Dynamics and Tissue 832 Specificity of Human Long Noncoding RNAs in Six Mammals.” Genome research 24(4): 616–28. 833 http://www.ncbi.nlm.nih.gov/pubmed/24429298 (May 25, 2014). 834

Welter, Danielle et al. 2014. “The NHGRI GWAS Catalog, a Curated Resource of SNP-Trait 835 Associations.” Nucleic Acids Research 42(D1): D1001–6. https://academic.oup.com/nar/article-836 lookup/doi/10.1093/nar/gkt1229 (August 30, 2017). 837

Zhang, B. et al. 2014. “A Novel RNA Motif Mediates the Strict Nuclear Localization of a Long 838 Noncoding RNA.” Molecular and Cellular Biology 34(12): 2318–29. 839 http://mcb.asm.org/cgi/doi/10.1128/MCB.01673-13 (November 20, 2016). 840

Zhang, Kun et al. 2014. “The Ways of Action of Long Non-Coding RNAs in Cytoplasm and Nucleus.” 841 Gene 547(1): 1–9. http://www.ncbi.nlm.nih.gov/pubmed/24967943 (March 17, 2015). 842

843

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 27: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

RIDLs

Passenger TEsLncRNA

RNAProteinDNA

Exonic Enrichment

Strand Bias

EvolutionaryConservation

A

B

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 28: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

262

27

402

37

219

634

315

185

47

209

126

68

117

25

235

28

196

20

175

415

167

333

10

153

10

52

119

28

15

10

116

5

122

358

135

77

1

53

0

58

52

11

367

44

116

25

98

214

243

138

13

372

21

59

136

18

913

36

507

56

295

790

710

288

43

378

118

117

249

37

1795

172

336

21

307

790

1292

1161

347

1905

1846

203

628

125

691

32

250

11

137

227

525

327

31

309

131

43

89

24

1017

58

67

10

95

217

324

174

5

629

21

69

124

36

625

10

75

4

62

150

335

111

0

151

0

15

15

14

304

15

112

4

70

83

176

234

0

332

44

28

66

15

454

50

133

9

112

194

763

218

7

278

116

83

173

41

1903

177

272

13

316

551

1276

1127

293

1749

1620

190

630

112hAT−Tip100

hAT−Charlie

TcMar−Tigger

Simple_repeat

MIR

Low_complexity

L2

L1

ERVL−MaLR

ERVL

ERVK

ERV1

CR1

Alu

TSS+

TSS−

Splic

eAcc

+

Splic

eAcc

Splic

eDon

+

Splic

eDon

Enco

mpa

ss+

Enco

mpa

ss−

TTS+

TTS−

Insi

de+

Insi

de−

Rep

eat F

amily

0

500

1000

1500

0

5000

10000

15000

20000

TSS

Accep

tor

Encom

pass

Donor

TTSIns

ide

Num

ber

0.0

0.1

0.2

0.3

0.4

lncrna

.exon

s

lncrna

.intro

ns

geno

me

Frac

tion

Ove

rlapp

ed

DNALINELow complexityLTRSatelliteSimple repeatSINE

A B

C D

E

TSSDonorAcceptorInsideEncompassingTTS

Transcripts

MergedGene

RepeatMasker

Exon

ic T

E An

nota

tion

26414 lncRNA transcripts Gencode 21

48684 merged exons

31004 introns

5520018RepeatMasker

46474 exonic overlap

562640 intronic overlap

23975 samesense22499 antisense

273264 samesense289376 antisense

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 29: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

Strand Bias

� �

��

��

��

� �

��

��

��

��

��

� �

��

��

� �

� �

��

��

��

��

���

��

��

��

��

��

��

��

��

� �

��

��

L1MA9L1MC2L1PA15L1PA16

L1PA3

L1PA5

L1PB1

LTR78

MIRc

MLT1B

AluSx1

AluYc

Charlie2b

L1M4b

L1M5

L1MA2L1MA6

L1MB4L1MC1L1MCa

L1ME2

L1ME3D

L1PA4L1PA7

Tigger2

−4

−2

0

2

4

3.5

4.0

4.5

5.0

5.5

Total exonic overlap (Log10 nt)

Log2

ratio

sen

se /

antis

ense

ove

rlapExonic Enrichment

A B

C

D

��

��

��

��

��

� �

��

��

��

��

��

��

� �

� �

� �

��

� �

��

��

��

� �

� �

��

��

��

� �

(A)n

(AC)n

(AT)n

(CA)n(CT)n

(CTC)n

(GCC)n(GGC)n

(GT)n(T)n

(TA)n

(TC)n(TG)n

(TTTC)n

A−rich

ALR/Alpha

AluAluJr4

AluSc

AluSc5

AluSc8AluSg

AluSg4

AluSg7AluSpAluSq

AluSq2AluSx3

AluSx4 AluSz6

AluYcAluYh3 AluYm1

AmnSINE1

Arthur1

Arthur1B

BSR/Beta

Charlie1

Charlie15a

Charlie16a

Charlie18aCharlie1aCharlie1b

Charlie2b

Charlie4aCharlie4z

Charlie5

Charlie7

Charlie8

ERV3−16A3_I−int

ERVL−B4−intERVL−E−int

FAMFLAM_A

FLAM_C

FRAM

G−rich

GA−rich

HAL1HAL1ME

HAL1b

HERV16−int

HERV3−int

HERVE−int

HERVH−int

HERVK9−int

L1M1

L1M2

L1M3L1M4

L1M4a1

L1M4b

L1M4c

L1M5

L1M6

L1M7

L1MA1

L1MA10L1MA2

L1MA3L1MA4

L1MA5

L1MA5A

L1MA6L1MA7

L1MA8

L1MA9

L1MB2

L1MB3

L1MB4L1MB5

L1MB7

L1MB8L1MC

L1MC1

L1MC2

L1MC3

L1MC4

L1MC4a

L1MC5

L1MC5a

L1MCa

L1MCcL1MD

L1MD1

L1MD2

L1MD3

L1MDa

L1ME1

L1ME2

L1ME2z L1ME3L1ME3A

L1ME3BL1ME3Cz

L1ME3D

L1ME3E

L1ME3G

L1ME4aL1ME4b

L1ME4c

L1ME5

L1MEc

L1MEd

L1MEf

L1MEgL1MEhL1MEi

L1P2

L1P3

L1PA10

L1PA11

L1PA13

L1PA14 L1PA15

L1PA16

L1PA17

L1PA2 L1PA3

L1PA4

L1PA5

L1PA7

L1PA8

L1PREC2

L2

L3

L3b

L4_A_MamL4_B_Mam

L4_C_Mam

LFSINE_Vert LOR1−int

LTR104_Mam

LTR12

LTR12C

LTR12D

LTR12F

LTR16

LTR16ALTR16A1

LTR16A2

LTR16C

LTR16E1

LTR16E2

LTR1A2

LTR33

LTR33B

LTR49−int

LTR50

LTR67B

LTR7 LTR78

LTR78BLTR79

LTR7C

LTR8

LTR85c

LTR88bMADE2

MARNA

MER102a

MER102b

MER102c

MER103C

MER104

MER112

MER113

MER115

MER117 MER1A

MER1BMER2

MER20MER20B

MER21A

MER21B

MER21CMER3

MER30

MER33

MER39MER4−int

MER41−int

MER41A

MER41BMER44A

MER45A

MER46C

MER49

MER4A1

MER4D1 MER50

MER51A

MER51E

MER52A

MER53MER57−int

MER57A−int

MER58A

MER58B

MER58C

MER5A

MER5A1

MER5BMER5C

MER61−int

MER61A

MER63A

MER63B

MER77

MER8

MER81

MER91A

MER92−int

MER94

MER96B

MIR1_Amn

MLT1A

MLT1A0MLT1A1

MLT1B

MLT1C

MLT1D

MLT1E1A

MLT1E2

MLT1E3

MLT1F

MLT1F1MLT1F2

MLT1G

MLT1G1MLT1G3

MLT1H

MLT1H1MLT1H2

MLT1I

MLT1JMLT1J1

MLT1J2

MLT1K

MLT1L

MLT1M

MLT1N2

MLT1O

MLT2A1

MLT2A2

MLT2B1

MLT2B2

MLT2B3

MLT2B4

MLT2D

MLT2F

MSTA

MSTBMSTB1

MSTC

MSTD

MamGypsy2−I

MamRTE1MamRep137

MamRep4096

MamRep434

MamRep605

MamSINE1MamTip2

OldhAT1

Plat_L3

REP522

SST1

SVA_D

THE1A

THE1B

THE1C

THE1D

Tigger1

Tigger13a

Tigger15aTigger19aTigger19b

Tigger2

Tigger2a

Tigger3a

Tigger3b

Tigger4b Tigger7

−2

0

2

3.0 3.5 4.0 4.5 5.0

Total Exonic Coverage (Log10 nt)

Log2

Rat

io E

xoni

c vs

Intro

nic

Cov

erag

e

�a�a�a

EnrichedNot significantDepleted

Placental Mammal PhastCons

��

��

��

��

��

��

��

��

��

��

��

��

��

��

� �

��

��

(AT)n

(T)n

(TA)n

A−rich AluJrAluSc

AluSc8

AluSpGA−rich

L1MA4L1MA5A

L1MC1L1MC3

L1MD2

L1ME1L1ME3AL1MEg

L2b

LTR12C

MER63A

MIR

MLT1A0

MLT1H1

MLT1K

MLT2B4

THE1D

AluJb

AluSg

AluSq2AluSx1

AluSx3AluSz

L1MEc

MLT1CMLT1H

MER50

−4

−2

0

2

4

3.5

4.0

4.5

5.0

5.5

Total exonic overlap (Log10 nt)

Log2

ratio

exo

n / i

ntro

n ov

erla

p

Log2 ratio exon/intron overlap

Freq

uenc

y

−4 −2 0 2 4

010

2030

4050

60 MLT1A0

1000

Ran

dom

isat

ions

9415552926109054416

5374

201432324017

99

Individ

ual

Type

s

Total

(AAC

)n(A

ATA)

nAl

uJb

AluJ

oAl

uJr

AluS

cAl

uSc8

AluS

g4Al

uSp

AluS

qAl

uSq2

AluS

x1Al

uSz

(A)n

A−ric

h(A

T)n

BSR/

Beta

(CCG

)n(C

GC)

n(C

GG

)nCh

arlie

15a

(CT)

nER

VL−B

4−in

tFR

AMG

A−ric

h(G

CC)n

(GG

C)n

G−r

ichHE

RV3−

int

HERV

E−in

tHE

RVK9

−int

L1M

A4L1

MA5

L1M

A5A

L1M

A7L1

MA9

L1M

C1L1

MC2

L1M

C3L1

MC4

L1M

DL1

MD1

L1M

D2L1

MD3

L1M

E1L1

ME3

AL1

ME4

bL1

MEf

L1M

EgL1

PA15

L1PA

16L1

PA2

L1PA

3L1

PA5

L1PB

1L2

aL2

bL2

cLT

R12C

LTR1

2DLT

R12F

LTR1

6E1

LTR3

3LT

R78

LTR7

CM

ER1A

MER

41A

MER

49M

ER50

MER

51E

MER

57A−

int

MER

57−i

ntM

ER61

AM

ER61

−int

MER

63A

MER

92−i

ntM

IRM

IR3

MIR

bM

IRc

MLT

1A0

MLT

1A1

MLT

1BM

LT1C

MLT

1DM

LT1F

MLT

1H1

MLT

1J2

MLT

1KM

LT2B

4RE

P522

SST1

(TA)

nTH

E1D

Tigg

er3a (T)n

(TTT

C)n

(TTT

A)n

ALR/

Alph

a

2.55.07.510.0

Log2 Enrichment

Structure Conserved

Vertebrate Conserved

Placental Conserved

Primate Conserved

Strand Bias

Exonic Enrichment

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 30: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

●●

0

20

40

60

4 5 6 7 8 9 10Number of RIDLs per gene

Cou

nt

shuffled valuestrue value95% confidence

RIDLs per gene

XISTTPT1-AS1LINC00963

TSIXmir-29cLINC00861

0 500 1000 1500

0.0

0.4

SST1 CC: −0.18 RIDLs: 41

Intronic TEs: 93

Position in Repeat Consensus (nt)Base

Incl

usio

n Fr

eque

ncy

candidate.RIDL.merged.oldIDs.hg38

A

B C D

E F G

0

500

1000

1500

2000

Aluce

ntrERV1

ERVKERVL

ERVL−MaL

R

hAT−B

lackja

ck

hAT−C

harlie L1 L2

Low_c

omple

xityMIR

Satellite

Simple

_repe

at

TcMar−

Tiggertel

o

#RID

L

EvidenceConservationConservation + EnrichmentConservation + Same StrandEnrichmentEnrichment + Same StrandSame Strand

48

13

0.0

0.5

1.0

1.5

2.0

2.5

% o

f gen

es

RIDL-lncRNAs (n=3566)Other lncRNAs (n=2068)

Functionally-characterised lncRNAs

P = 0.01172

75

0

2

4

6

% o

f gen

es

RIDL-lncRNAs (n=3566)Other lncRNAs (n=2068)

Disease-associated lncRNAs

P = 0.03

200

69131

55

0.0

2.5

5.0

7.5

10.0

% o

f gen

es

RIDL-lncRNAs (n=3566)Other lncRNAs (n=2068)

aa

PromotersExons

GWAS SNP overlap

P = 0.0001

P = 0.04

Scalechr8

Merged Exons

RIDL

RIDL

7 Vert. El

20-way El

5 kbhg38127,209,000127,210,000127,211,000127,212,000127,213,000127,214,000127,215,000127,216,000127,217,000127,218,000127,219,000

Basic Gene Annotation Set from GENCODE Version 22 (Ensembl 79)

Exonic TE Annotation

RIDLs Annotation

Multiz Alignment & Conservation (7 Species)

Primates Multiz Alignment & Conservation (20 Species)

CCAT1

SINE-MIRDNA-hAT-CharlieSINE-MIRSimple repeatSINE-AluSINE-MIR

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 31: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

TE T

ype

Nuc

lC

yto

Log2

Nuc

l/Cyt

o R

atio

Number of RIDLs

NuclCyto

Cell Type

Nuc

lC

yto

Nucl/CytoRatio

7

21

7

11

79

3

8

6

67

55

53

4

6

8

86

52

112

91

10

16

7

20

8

6

72

3

10

6

42

46

46

5

5

8

61

55

79

73

12

18

7

25

10

9

97

3

11

54

48

52

3

6

6

81

62

101

94

13

13

6

18

11

9

91

8

57

45

55

5

9

6

89

53

95

90

10

14

8

22

7

10

80

3

7

3

57

48

52

4

6

7

76

62

99

86

8

16

6

18

10

12

76

3

8

53

52

38

3

5

7

67

59

95

91

12

10

5

17

10

12

78

11

45

44

42

3

4

6

63

53

77

84

12

13

7

18

7

6

71

9

4

50

46

42

3

6

8

71

48

85

75

6

19

8

29

11

12

103

3

13

69

60

66

4

6

7

87

63

118

106

12

21

6

28

12

10

99

3

13

5

59

60

62

5

5

6

80

73

114

108

11

28MLT1J2MLT1A0

MIRcMIRbMIR3MIR

MER49LTR7C

LTR12DL2cL2bL2a

L1PA16L1ME3A

L1MD3GC_richGA.richCT.richAT_rich

AluSx

A549

GM1287

8

HeLaS

3

HepG2

HUVECIM

R90K56

2MCF7

NHEK

SKNSH

−1012 Log2 Nucl/Cyto Ratio

Cum

ulat

ive

Freq

uenc

y

L2b in IMR90

A

B C

D IMR90

0.00

0.25

0.50

0.75

1.00

−6 − 0 6

44 L2b2128 not

3 3

P=6.7e-6

695266

353

1

0.751.89

3.095.12

6.57

−5

0

5

10

0 1 2 3 4

Rho=0.2P = 0

RIDL x lncRNAs N C

All other lncRNAs

vs

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint

Page 32: 1 Ancient exapted transposable elements drive nuclear ...64 recruits ANRIL to target genes (Holdt et al. 2013); a SINEB2 repeat in Uchl1AS increases translational rate 65 of its sense

Randomised

Wild Type

Mutant

RIDL N C

Transfect Fractionate

1 kb hg38116,533,500 116,534,000 116,534,500 116,535,000 116,535,500116,533,000

LINC00173MIRb > MIRc >

200 baseshg3858,817,10058,817,20058,817,30058,817,40058,817,50058,817,60058,817,70058,817,800

RP4-806M20.4MIR296 L2b >

2 kb hg38911,000 911,500 912,000 912,500 913,000 913,500 914,000 914,500 915,000 915,500

MIRb >L2b <

RP11-54O7.1L2b >

Chr20

Chr1

Chr12

3147-3372

137-2193096-3275

3294-3371

31-184 57-206

Position in consensus

A

B C

D

FE

Expr

essio

n (F

PKM

)

HeLa R

NA

seq

0

2

4

6 Whole cellCytoplasmNucleus

0.0

2.5

5.0

7.5

MALAT1GAPDH

Log2

Nuc

l/Cyt

o Ra

tio HeLa qR

T-PCR

RP11-54O7.1

LINC00173

RP4-806M20.4

WT endogenous

RP11-54O7.1

LINC00173

RP4-806M20.4

Nuclear controlCytoplasmic contol

Log2

Nuc

l/Cyt

o Ra

tio

MALAT1 / GAPDH

RP11-54O7.1

LINC00173

RP4-806M20.4

−2

0

2

4

WT transfected

P = 0.002 P = 0.01 P = 0.006

HeLa

GAPDH

H3

Nuclear controlMutant (randomised)

Cytoplasmic contol

N=4

not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which wasthis version posted September 16, 2017. ; https://doi.org/10.1101/189753doi: bioRxiv preprint