abstract - arxiv · allen institute for artiﬁcial intelligence, seattle, wa, usa...

Longformer: The Long-Document Transformer

Iz Beltagy∗ Matthew E. Peters∗ Arman Cohan∗

Allen Institute for Artificial Intelligence, Seattle, WA, USA{beltagy,matthewp,armanc}@allenai.org

Abstract

Transformer-based models are unable to pro-cess long sequences due to their self-attentionoperation, which scales quadratically with thesequence length. To address this limitation,we introduce the Longformer with an attentionmechanism that scales linearly with sequencelength, making it easy to process documents ofthousands of tokens or longer. Longformer’sattention mechanism is a drop-in replacementfor the standard self-attention and combinesa local windowed attention with a task moti-vated global attention. Following prior workon long-sequence transformers, we evaluateLongformer on character-level language mod-eling and achieve state-of-the-art results ontext8 and enwik8. In contrast to mostprior work, we also pretrain Longformer andfinetune it on a variety of downstream tasks.Our pretrained Longformer consistently out-performs RoBERTa on long document tasksand sets new state-of-the-art results on Wiki-Hop and TriviaQA.1

1 Introduction

Transformers (Vaswani et al., 2017) have achievedstate-of-the-art results in a wide range of natu-ral language tasks including generative languagemodeling (Dai et al., 2019; Radford et al., 2019)and discriminative language understanding (De-vlin et al., 2019). This success is partly due tothe self-attention component which enables the net-work to capture contextual information from theentire sequence. While powerful, the memory andcomputational requirements of self-attention growquadratically with sequence length, making it infea-sible (or very expensive) to process long sequenceson current hardware.

To address this limitation, we present Long-former, a modified Transformer architecture with

∗ Equal contribution.1https://github.com/allenai/longformer

Figure 1: Longformer’s memory usage scales linearlywith the sequence length, unlike the full self-attentionmechanism that runs out of memory for long sequenceson current GPUs. Longformer’s GPU-kernel is nearlyas fast as the highly optimized full self-attention opera-tion, and nearly 6X faster than naive Pytorch.

a self-attention operation that scales linearly withthe sequence length, making it versatile for pro-cessing long documents (Fig. 1). This is an advan-tage for natural language tasks such as long docu-ment classification, question answering (QA), andcoreference resolution, where existing approachespartition or shorten the long context into smallersequences that fall within the typical 512 tokenlimit of BERT-style pretrained models. Such parti-tioning could potentially result in loss of importantcross-partition information, and to mitigate thisproblem, existing methods often rely on complexarchitectures to address such interactions. On theother hand, our proposed Longformer is able tobuild contextual representations of the entire con-text using multiple layers of attention, reducing theneed for task-specific architectures.

Recent work has addressed the computational in-efficiency of Transformers on long sequences (seeTab. 1). However, they primarily focus on autore-gressive language modeling, while the applicationof long document transformers to document-levelNLP tasks in the transfer learning setting (Dai andLe, 2015; Peters et al., 2018; Howard and Ruder,2018; Devlin et al., 2019) has remained largelyunexplored. We address this gap and show thatLongformer’s attention mechanism can act as a

arX

iv:2

004.

0515

0v1

[cs

.CL

] 1

0 A

pr 2

020

https://github.com/allenai/longformer

Model attention char-lm other pretrainmatrix tasks

Transformer-XL (2019) ltr yes no noAdaptive Span (2019) ltr yes no noCompressive (2020) ltr yes no noReformer (2020) sparse yes no noSparse (2019) sparse yes no noBP-Transformer (2019) sparse yes MT noBlockwise (2019) sparse no QA yesOur Longformer sparse yes multiple yes

Table 1: Summary of prior work on adapting Trans-formers for long documents. ltr: left-to-right.

drop-in replacement for the self-attention mecha-nism in pretrained Transformers, and leads to gainsacross a suite of document NLP tasks.

Longformer’s attention mechanism is a combina-tion of a windowed local-context self-attention andan end task motivated global attention that encodesinductive bias about the task. Through ablationsand controlled trials we show both attention typesare essential – the local attention is primarily usedto build contextual representations, while the globalattention allows Longformer to build full sequencerepresentations for prediction.

We first evaluate Longformer on autoregressivecharacter-level language modeling using a com-bination of windowed and a new dilated attentionpattern, allowing the model to process sequences ofup to 32K characters on modern GPUs. We achievestate-of-the-art results on text8 and enwik8benchmark datasets, demonstrating the effective-ness of Longformer in long document modeling.

Then, to evaluate Longformer’s ability to re-place the full self-attention operation of existingpretrained models, we pretrain it with the maskedlanguage modeling (MLM) objective, continuingfrom the RoBERTa (Liu et al., 2019) releasedcheckpoint. After pretraining, we apply it todownstream language tasks through finetuning anddemonstrate that Longformer consistently outper-forms RoBERTa on a wide range of document-levelnatural language tasks including text classification,QA, and coreference resolution, achieving state-of-the-art results on two of these datasets.

2 Related Work

Long-Document Transformers Tab. 1 summa-rizes recent prior work on long documents. Twotypes of self-attention approaches have been ex-plored. The first is a left-to-right (ltr) approach thatprocesses the document in chunks moving from

left-to-right. While such models have been success-ful in autoregressive language modeling, they areunsuitable for transfer learning approaches withtasks that benefit from bidirectional context.

Our work falls within the other general approachthat defines some form of sparse attention patternand avoids computing the full quadratic attentionmatrix multiplication. The model with the mostsimilar attention pattern to ours is Sparse Trans-former (Child et al., 2019), which uses a form ofdilated sliding window of blocks of size 8x8 pro-vided by BlockSparse (Gray et al., 2017). Whileboth models use custom CUDA kernels, ours is im-plemented in TVM (Chen et al., 2018) (§3.2) mak-ing it more customizable and maintainable thanBlockSparse which is implemented in C++, anddesigned for a specific versions of TensorFlow. Wealso introduce additional task motivated global at-tention patterns suitable for common NLP tasks(§3) and show they are essential for good perfor-mance in the transfer learning setting.

A few models tried tasks other than autoregres-sive language modeling, which is a step forwardbecause arguably focusing on language modelingas the primary evaluation has led to the develop-ment of models with limited applicability. BP-Transformer (Ye et al., 2019) evaluated on machinetranslation (MT), but didn’t explore the pretrain-finetune setting. Blockwise attention (Qiu et al.,2019) pretrained their models and evaluated onquestion answering (QA). However, the evaluationis limited as it doesn’t include language modeling,and the QA datasets are of relatively short docu-ments,2 therefore the effectiveness of this modelon long document tasks remains unexplored.

Task-specific Models for Long DocumentsMany task-specific approaches have been devel-oped to workaround the 512 limit of pretrainedtransformer models like BERT. The simplest ap-proach just truncates the document, commonlyused for classification (Xie et al., 2019). An-other approach chunks the document into chunksof length 512 (could be overlapping), processeseach chunk separately, then combines the activa-tions with a task specific model (Joshi et al., 2019).A third approach popular for multihop and opendomain QA tasks uses a two-stage model wherethe first stage retrieves relevant documents that arepassed onto the second stage for answer extrac-

2 SQuAD contexts typically fit within the 512 limit, andMRQA is constructed by dropping long-document examples.

2

(a) Full n2 attention (b) Sliding window attention (c) Dilated sliding window (d) Global+sliding window

Figure 2: Comparing the full self-attention pattern and the configuration of attention patterns in our Longformer.

tion (Clark and Gardner, 2017; Chen et al., 2017).All of these approaches suffer from informationloss due to truncation or cascading errors fromthe two stage approach. In contrast, Longformercan process long sequences without truncating orchunking, allowing us to adopt a much simpler ap-proach that concatenates the available context andprocesses it in a single pass.

3 Model

The original Transformer model has a self-attentioncomponent with O(n2) time and memory complex-ity where n is the input sequence length and thus, isnot efficient to scale to long inputs. To address thischallenge, we sparsify the full self-attention matrixaccording to an “attention pattern” specifying pairsof input locations attending to one another. Unlikethe full self-attention, our proposed attention pat-tern scales linearly with the input sequence, makingit efficient for longer sequences. In the following,we discuss the design and implementation of thisattention pattern.

3.1 Attention Pattern

Fig. 2 summarizes the configurations of our pro-posed attention pattern.

Sliding Window Given the importance of localcontext (Kovaleva et al., 2019), our attention pat-tern employs a fixed-size window attention sur-rounding each token. Using multiple stacked lay-ers of such windowed attention results in a largereceptive field, where top layers have access toall input locations and have the capacity to buildrepresentations that incorporate information acrossthe entire input. More formally, in this attentionpattern, given a fixed window size w, each tokenattends to 1

2w tokens on each side (Fig. 2b). Thecomputation complexity of this pattern is O(n×w),which scales linearly with input sequence length n.To make this attention pattern efficient, w shouldbe small compared with n. However, as mentioned

above, a model with typical multiple stacked trans-former will have a large receptive field. This isanalogues to CNNs where stacking layers of smallkernels leads to high level features that are builtfrom a large portion of the input (receptive field)(Wu et al., 2019). In our case, with a transformer of` layers, the receptive field size is `×w (assumingw is fixed for all layers). Depending on the appli-cation, it might be helpful to use different values ofw for each layer to balance between efficiency andmodel representation capacity (§4.1).

Dilated Sliding Window To further increase thereceptive field without increasing computation, thesliding window can be “dilated”. This is analoguesto dilated CNNs (van den Oord et al., 2016) wherethe window has gaps of size dilation d (Fig. 2c).Assuming a fixed d and w for all layers, the recep-tive field is ` × d × w, which can reach tens ofthousands of tokens even for small values of d.

In multi-headed attention, each attention headcomputes a different attention score. We found set-tings with different dilation configurations per headimproves performance by allowing some headswithout dilation to focus on local context, whileothers with dilation focus on longer context.

Global Attention In state-of-the-art BERT-stylemodels for natural language tasks, the optimal in-put representation differs from language modelingand varies by task. For masked language modeling(MLM), the model uses local context to predict themasked word, while for classification, the model ag-gregates the representation of the whole sequenceinto a special token ([CLS] in case of BERT). ForQA, the question and document are concatenated,allowing the model to compare the question withthe document through self-attention.

In our case, the windowed and dilated attentionare not flexible enough to learn task-specific repre-sentations. Accordingly, we add “global attention”on few pre-selected input locations. Importantly,we make this attention operation symmetric: that is,

3

a token with a global attention attends to all tokensacross the sequence, and all tokens in the sequenceattend to it. Fig. 2d shows an example of a slid-ing window attention with global attention at a fewtokens at custom locations. For example for clas-sification, global attention is used for the [CLS]token while in QA global attention is provided onall question tokens. Note that, since the number ofsuch tokens is small relative and independent of n(the total number of tokens in the input sequence)the complexity of the combined local and globalattention is still O(n).

While specifying global attention is task specific,it is much simpler than existing task specific ap-proaches that chunk/shorten the input into smallersequences and often use complex architecture tocombine information across these chunks. Fur-ther, it increases the representational power of themodel as it allows building contextual representa-tions across the entire sequence.

Linear Projections for Global Attention Re-call that given the linear projections Q, K, V , theTransformer model (Vaswani et al., 2017) computesattention scores as follows:

Attention(Q,K, V ) = softmax

(QKT

√dk

)V (1)

We use two sets of projections, Qs, Ks, Vs to com-pute attention scores of sliding window attention,and Qg, Kg, Vg to compute attention scores for theglobal attention. The additional projections provideflexibility to model the different types of attention,which we show is critical for best performance ondownstream tasks. Qg, Kg, Vg are all initializedwith values that match Qs, Ks, Vs.

3.2 CUDA Kernels for our Attention PatternIn regular transformers, attention scores are com-puted as in Eqn. 1. The expensive operation is thematrix multiplication QKT because both Q and Khave n (sequence length) projections.

Our dilated sliding window attention pattern isnot straightforward to implement using moderndeep learning libraries. Implementing it requires aform of banded matrix multiplication (matrix mul-tiplication where the output is all zero except cer-tain diagonals) that is not supported in existingdeep learning libraries like PyTorch/Tensorflow.In addition, naive implementations with loops isunusably slow. To address this challenge, we pro-vide a highly efficient custom CUDA kernel that

implements these attention operations and allowsparallelizing the operation on GPU threads.

Tensor Virtual Machine (TVM) We build ourcustom CUDA kernel using TVM (Chen et al.,2018), a deep learning compiler stack that compileshigh level description of a function into optimizeddevice-specific code. Using TVM, we describe ourform of banded matrix multiplication in high-levelpython constructs, then TVM generates the corre-sponding CUDA code and compiles it for GPUs.

Fig. 1 compares the runtime and memory of thefull self-attention, our TVM implementation of thedilated sliding window, and a naive implementationof this attention. It is clear the full attention im-plementation is fast but it takes significant amountof memory because it needs to store all n2 values.The naive implementation with loops is not mem-ory consuming because it only stores the non-zerovalues, however it is significantly slow and imprac-tical to use. Finally, our implementation is bothfast and memory efficient because it only computesand stores the non-zero values.3

4 Autoregressive Language Modeling

Autoregressive or left-to-right language modelingis loosely defined as estimating the probability dis-tribution of an existing token/character given itsprevious tokens/characters in an input sequence.This task is considered one of the fundamental tasksin natural language and recent prior work on mod-eling long sequences using transformers has reliedon this task as their primary evaluation (Dai et al.,2019; Rae et al., 2020; Sukhbaatar et al., 2019).Similarly, we develop and evaluate our model onautoregressive language modeling.

4.1 Attention Pattern

For autoregressive language modeling we useour dilated sliding window attention. Follow-ing Sukhbaatar et al. (2019) we use differing win-dow sizes across the layers. In particular, we usesmall window sizes for the lower layers and in-crease window sizes as we move to higher layers.This allows the top layers to learn higher-level rep-resentation of the entire sequence while having the

3It is worth noting that theoretically, a perfectly optimizedsliding window attention operation should be faster than then2 computation. However, achieving this level of performancerequires special knowledge of low-level GPU programming,similar to implementing a highly optimized matrix multipli-cation. Our current implementation is sufficiently fast andpractical to use.

4

Model #Param Dev Test

Dataset text8T12 (Al-Rfou et al., 2018) 44M - 1.18Adaptive (Sukhbaatar et al., 2019) 38M 1.05 1.11BP-Transformer (Ye et al., 2019) 39M - 1.11Our Longformer 41M 1.04 1.10

Dataset enwik8T12 (Al-Rfou et al., 2018) 44M - 1.11Transformer-XL (Dai et al., 2019) 41M - 1.06Reformer (Kitaev et al., 2020) - - 1.05Adaptive (Sukhbaatar et al., 2019) 39M 1.04 1.02BP-Transformer (Ye et al., 2019) 38M - 1.02Our Longformer 41M 1.02 1.00

Table 2: Small model BPC on text8 & enwik8

Model #Param Test BPC

Transformer-XL (18 layers) 88M 1.03Sparse (Child et al., 2019) ≈100M 0.99Transformer-XL (24 layers) 277M 0.99Adaptive (Sukhbaatar et al., 2019) 209M 0.98Compressive (Rae et al., 2020) 277M 0.97Our Longformer 102M 0.99

Table 3: Performance of large models on enwik8

lower layers capture local information. In addition,it provides balance between efficiency (smaller win-dow sizes are less computationally expensive dueto fewer nonzero values) and performance (largerwindow sizes have richer representation power andoften result in performance improvements).

We do not use dilated sliding windows for lowerlayers to maximize their capacity to learn and uti-lize the immediate local context. For the higherlayers, we use a small amount of increasing dila-tion only on 2 heads. This gives the model theability to directly attend to distant tokens withoutsacrificing local context.

4.2 Experiment SetupTask and Datasets We focus on character-levelautoregressive language modeling because the se-quences are naturally longer than those of word-level language modeling, making them suitablefor evaluating our model. We used text8 andenwik8 (Mahoney, 2009) for evaluation, two stan-dard and widely used character-level language mod-eling datasets.

Training Ideally, we would like to train ourmodel on the largest window size and sequencelength we can fit in a modern GPU memory. How-ever, we found that the model needs a large numberof gradient updates to learn the local context first;before learning to utilize longer context. To accom-

modate this, we adopt a staged training procedurewhere we increase the attention window size andsequence length across multiple training phrases.In particular, in the first phase we start with a shortsequence length and window size, then on each sub-sequent phase, we double the window size and thesequence length, and halve the learning rate. Thismakes training fast, while keeping the slow part(longest sequences and window sizes) to the end.We train the model over 5 total phases with start-ing sequence length of 2,048 and ending sequencelength of 23,040 on the last phase (see Appendix Afor detailed configurations of each phase).

Evaluation At evaluation time, we are able torun our model on sequence of length 32,256. Fol-lowing prior work (Dai et al., 2019; Sukhbaataret al., 2019), during evaluation we split the datasetinto overlapping sequences of size 32,256 with astep of size 512, and report the performance on thelast 512 tokens on the sequence.

Hyperparameters Our model only specifieshow the self-attention component works, and itis agnostic to the other design choices for the trans-former model. We use relative position embeddingswith sinusoidal weights as in Dai et al. (2019). Weuse two different model sizes, a small (12 layers,512 hidden size) model as in Dai et al. (2019),and a large (30 layers, 512 hidden size) model asin Child et al. (2019). We employed mixed preci-sion training (floating points 16 and 32) using apex4

to reduce memory consumption and speed-up train-ing. However, we kept the attention computationin fp32 to avoid numerical instability issues.5 Weused gradient checkpointing (Chen et al., 2016) toreduce memory usage, and ran our experiments on48GB RTX8000 GPUs. Refer to Appendix A for amore detailed list of hyperparameters.

4.2.1 ResultsTab. 2 and 3 summarize evaluation results ontext8 and enwik8 datasets. We achieve a newstate-of-the-art on both text8 and enwik8 usingthe small models with BPC of 1.10 and 1.00 ontext8 and enwik8 respectively, demonstratingthe effectiveness of our model.

For large models, given how expensive theseexperiments are, and following recent work (Ki-taev et al., 2020; Rae et al., 2020), we are only

4https://github.com/NVIDIA/apex5We found that using fp16 in attention operation results in

floating point overflow and NaNs in later stages of training.

5

https://github.com/NVIDIA/apex

Model Dev BPC

Decreasing w (from 512 to 32) 1.24Fixed w (= 230) 1.23Increasing w (from 32 to 512) 1.21

No Dilation 1.21Dilation on 2 heads 1.20

Table 4: Top: changing window size across layers. Bot-tom: with/without dilation (@ 150K steps on phase1)

evaluating on enwik8. Tab. 3 shows that Long-former outperforms the comparable Transformer-XL model, matches the performance of the compa-rable Sparse Transformer (Child et al., 2019), andmatches or slightly underperforms recent modelsthat have more than twice the number of parameters.It is worth noting that Adaptive Span (Sukhbaataret al., 2019) and Compressive Transformer (Raeet al., 2020) are not good fit for the pretraining-finetuning paradigm as discussed in §2.

4.2.2 Ablation StudyTo show the importance of the design choices ofour attention patterns, we tried different variantsand report their controlled experiment results. Tomake the ablation study more manageable, we traineach configuration for 150K steps6 with phase 1configuration on a small model on text8, thenreport the BPC performance on the dev set.

The top of Tab. 4 demonstrates the impact ofdifferent ways of configuring the window sizesper layer. We observe that increasing the windowsize from the bottom to the top layer leads to thebest performance, arranging them in the reverseway leads to worse performance, and using a fixedwindow size (the average of window sizes of theother configuration) leads to a performance thatit is in between. The bottom of Tab. 4 shows theimpact of adding dilation. Adding some dilation totwo heads leads to some improvement comparedwith no dilation at all.

5 Pretraining and Finetuning

Current state-of-the-art systems for many NLPtasks finetune a pretrained model with task super-vision (e.g. BERT). One of our main motivationsis to develop such a model suitable for long docu-ment tasks. To do so, we pretrained Longformer

6An obvious caveat is that there is a chance the end per-formance will not agree with the performance at step 150K.However, this is a reasonable approximation that saves thehuge cost of running all these configurations to completion.

Source Tokens Avg doc len

Books (Zhu et al., 2015) 0.5B 95.9KEnglish Wikipedia 2.1B 506Realnews (Zellers et al., 2019) 1.8B 1.7KStories (Trinh and Le, 2018) 2.1B 7.8K

Table 5: Pretraining data

on a document corpus and finetune it for six tasks,including classification, QA and coreference resolu-tion. The resulting model can process sequences upto 4,096 tokens long (8 times longer than BERT)7.

We pretrain Longformer with masked languagemodeling (MLM), where the goal is to recoverrandomly masked tokens in a sequence. SinceMLM pretraining is expensive, we continue pre-training from the RoBERTa (Liu et al., 2019) re-leased checkpoint, while only making the minimalchanges necessary to support Longformer’s atten-tion mechanism. Note that our attention pattern canbe plugged into any pretrained transformer modelwithout the need to change the model architecture.

Attention Pattern We use the sliding windowattention with window size of 512 on all layers.This matches RoBERTa’s sequence length, andtherefore uses the same amount of computationas RoBERTa.8

Position Embeddings RoBERTa uses learnedabsolute position embeddings with the maximumposition being 512. To support longer documents,we add extra position embeddings to support up toposition 4,096. To leverage RoBERTa’s pretrainedweights, instead of randomly initializing the newposition embeddings, we initialize them by copyingthe 512 position embeddings from RoBERTa mul-tiple times as analysis of BERT’s attention headsshows a strong learned bias to attending to localcontext, including the previous or next token (Clarket al., 2019). Using the copy initialization preservesthis local structure everywhere except at the parti-tion boundaries. Despite its simplicity, we foundthis to be a very effective (see Tab. 6), allowingLongformer pretraining to rapidly converge with asmall number of gradient updates.

7Sequences up to 16K are possible on current GPUs.8We tried adding the additional dilation pattern on a few

heads as in §4.1 but found it to hurt performance, likelybecause it is not compatible with the pretrained RoBERTaweights. We suspect that retraining such model from scratchmight be needed to get improved performance

6

Pretraining Data In order to allow the model tolearn long dependencies in pretraining, we com-piled a corpus of long documents. Some of thesedata sources were also included in the originalRoBERTa pretraining including the Books corpus(Zhu et al., 2015) plus English Wikipedia. Weadditionally included one third of a subset of theRealnews dataset (Zellers et al., 2019) with doc-uments longer than 1,200 tokens as well as onethird of the Stories (Trinh and Le, 2018) corpus.Our goal was to include a mix of long and shortdocuments to both allow the model to learn longerdependencies while not to forget information fromthe original RoBERTa pretraining. The statistics ofthe pretraining data is shown in Tab. 5.

Continued MLM Pretraining We train9 twosizes of Longformer, a base model and a largemodel. Both models are trained for 65K gradi-ent updates with sequences length 4,096, batch size64 (218 tokens), maximum learning rate of 3e-5,linear warmup of 500 steps, followed by a power 3polynomial decay. The rest of the hyperparametersare the same as RoBERTa.

Tab. 6 shows the BPC on the development set ofour training corpus. The first row shows a 1.846BPC using RoBERTa-base, which is comparableto the 1.880 BPC reported on the RoBERTa paperon their corpus. This indicates our training corpusis from a distribution close to that used to trainRoBERTa. The following two rows show the per-formance of Longformer before pretraining withrandomly initialized position embeddings and withcopied position embeddings. The significant differ-ence indicates the importance of the copy initial-ization, and the relative small difference betweenthe RoBERTa BPC and the initialized BPC indi-cates that our sliding window attention is workingwell with the RoBERTa weights. The followingtwo rows show the impact of continuing pretrain-ing. Traininig for 2K steps improves BPC from1.957 to 1.753, which further decreases to 1.705 af-ter 65K steps, demonstrating the model is learningto better utilize the sliding window attention andlonger context. Similar patterns are observed withRoBERTa-large and Longformer-large.

5.1 TasksWe apply Longformer to multiple long documenttasks, including QA, coreference resolution andclassification. Tab. 7 shows the evaluation datasets

9using fairseq (Ott et al., 2019)

Model base large

RoBERTa (seqlen: 512) 1.846 1.496Longformer (seqlen: 4,096) 10.299 8.738

+ copy position embeddings 1.957 1.597+ 2K gradient updates 1.753 1.414+ 65K gradient updates 1.705 1.358

Table 6: MLM BPC for RoBERTa and various pre-trained model configurations.

Wordpieces WH TQA HQA ON IMDB HY

avg. 1,535 6,589 1,316 506 300 70595th pctl. 3,627 17,126 1,889 1,147 705 1,975

Table 7: Average and 95th percentile of context lengthof datasets in wordpieces. WH: WikiHop, TQA: Triv-iaQA, HQA: HotpotQA, ON: OntoNotes, HY: Hyper-partisan news

have contexts significantly longer than 512 word-pieces. Our primary goal is to evaluate whetherour attention mechanism can act as a replace-ment for the standard self-attention mechanism inBERT style models, and to perform controlled tri-als against a strong baseline. We are also interestedin evaluating whether we can replace complicatedtask specific models necessitated by BERT’s lim-ited context with simpler models that just concate-nate all available context into a single sequence.

Our baseline is a RoBERTa based model thatbreaks the context into the longest possible seg-ment, passes each individually through RoBERTa,and concatenates the activations for further process-ing. For QA tasks, we also concatenate the questionto each segment so that RoBERTa can conditionit’s contextual representations of the context onthe question. The Longformer variant replaces theRoBERTa self-attention mechanism with our win-dowed attention used during pretraining, plus a taskmotivated global attention. The global attentionuses additional linear projections (§3.1).

Question answering We consider three datasetswith long contexts: WikiHop (Welbl et al., 2018),Wikipedia setting of TriviaQA (Joshi et al., 2017),and HotpotQA, distractor dev setting (Yang et al.,2018).10

For WikiHop and TriviaQA we follow the sim-ple QA model of BERT (Devlin et al., 2019), andconcatenate question and documents into one longsequence, run it through Longformer, then have a

10We use the full version of TriviaQA and HotpotQA, notthe simplified versions in MRQA (Fisch et al., 2019).

7

QA Coref. Classification

Model WikiHop TriviaQA HotpotQA OntoNotes IMDB Hyperpartisan

RoBERTa-base 72.4 74.3 63.5 78.4 95.3 87.4Longformer-base 75.0 75.2 64.4 78.6 95.7 94.8

Table 8: Summary of finetuning results on QA, coreference resolution, and document classification. Results are onthe development sets comparing our Longformer-base with RoBERTa-base. TriviaQA, Hyperpartisan metrics areF1, WikiHop and IMDB use accuracy, HotpotQa is joint F1, OntoNotes is average F1.

dataset-specific prediction layer. WikiHop uses aclassification layer for the candidate while Trivi-aQA uses the loss function of Clark and Gardner(2017) to predict answer span. We include globalattention to question tokens and answer candidatesfor WikiHop and to question tokens for TriviaQA.

For HotpotQA, we experiment with two differentmodels, a single stage multitask model that jointlypredicts evidence and answer spans, and a two-stage model that first extracts evidence paragraphsand passes them to a second stage for answer extrac-tion. The single stage model combines a span ex-traction loss with a question type (yes/no/span) clas-sification head over the first token and an evidenceextraction loss predicted from special tokens at theend of sentences and paragraphs. We also includeglobal attention to sentence and paragraph tokens.Note that both approaches are significantly sim-pler than recent SOTA models that include pipelineapproaches of multiple stages and customized ar-chitectures (e.g., (Tu et al., 2019; Chen et al., 2019;Tu et al., 2020; Groeneveld et al., 2020)). See Ap-pendix B for further details about the models andhyperparameters.

Coreference Resolution We use OntoNotes(Pradhan et al., 2012), and the model from Joshiet al. (2019), a modification of the system fromLee et al. (2018) to replace ELMo with BERT. TheLongformer system is a straightforward adaptionof the baseline model by replacing RoBERTa withLongformer and extending the sequence length.We didn’t use global attention for this task.

Document Classification We evaluate on IMDB(Maas et al., 2011) and Hyperpartisan news detec-tion (Kiesel et al., 2019) datasets.11 IMDB is astandard sentiment classification datasets consist-ing of movie reviews. While most documents inthis dataset are short, about 13.6% of them are

11For Hyperpartisan we split the training data intotrain/dev/test sets using standard 90/10/10 splits and per-formed each experiment five times with different seeds tocontrol variability associated with the small dataset.

larger than 512 wordpieces (Tab. 7). Documentsin Hyperpartisan are relatively long, and it is smallwith only 645 documents making it a good test forLongformer’s ability to adapt to limited data. Weuse global attention on the [CLS] token.

5.2 Results

Main Result Tab. 8 summarizes the results of allour finetuning experiments. We observe that Long-former consistently outperforms the RoBERTabaseline. Its performance gain is especially ob-vious for tasks that require long context such asWikiHop and Hyperpartisan. For TriviaQA, theimprovement is more modest as the local contextis often sufficient to answer the question. In thecase of HotpotQA, the supporting fact auxiliarysupervision allows models to easily find relevantcontexts and then focus on local context, leading tosmaller gains. This is contrasted with WikiHop thatonly includes distant supervision of intermediatereasoning chains, where our approach excels byreasoning over the entire context. On the IMDBand OntoNotes datasets the performance gains aresmaller. For IMDB, the majority of the datasetconsists of short documents and thus it is expectedto see smaller improvements. For OntoNotes, wefound that the distance between any two mentionsis typically quite small so that a baseline that pro-cesses smaller chunks separately is able to stitchtogether mentions into coreference chains withoutconsidering cross chunk interactions.

Longformer-large for QA We also evaluate theperformance of Longformer-large on long contextQA tasks. Tab. 9 shows that our Longformer-largeachieves new state-of-the-art results on WikiHopand TriviaQA by large margins (3.6 and 4 pointsrespectively). Tab. 10 summarizes results of Hot-potQA, and, as expected, using Longformer-largeimproves the result compared to Longformer-base.The two-stage model improves the results evenfurther, likely because of the increased capacitythat allows each stage to specialize on one task.

8

Model WikiHop TriviaQA

Current SOTA 78.3 73.3Longformer-large 81.9 77.3

Table 9: Leaderboard results of Longformer-large

Model ans. supp. joint

RoBERTa-base 73.5 83.4 63.5Longformer-base 74.3 84.4 64.4Longformer-large 78.8 86.0 69.5Longformer-large (2 stage) 81.0 85.8 71.4TAP2 (Glaß et al., 2019) (not GNN) 79.4 86.2 -HGN (Fang et al., 2019) 81.0 87.9 73.0C2F Reader (Shao et al., 2020) - - 73.9

Table 10: HotpotQA results, distractor setting on thedev set. All numbers are F1 scores.13

This model matches performance of TAP2 (Glaßet al., 2019), which is the the best performing sin-gle model that doesn’t use a form of graph neuralnetworks (GNN; Kipf and Welling, 2017). How-ever, our model is simpler than TAP2 where theyuse a three-stage approach consisting of extractingparagraphs, evidence sentences and finally answerspans with an additional specialized span selectionpretraining. All the better performing models forHotpotQA (Shao et al., 2020; Fang et al., 2019)use multi-stage approaches plus GNNs or graphnetwork of entities, which seem to encode an im-portant inductive bias for the task.12 Furthermore,the 10 paragraphs come from 10 different docu-ments and the concatenated context is therefore notcoherent. In this case we suspect that Longformerhas less chance to use information from pretrain-ing due to difference between the pretraining andfinetuning input structure. Additional pretrainingtasks could help further improve results.

5.3 Ablations on WikiHopTab. 11 presents an ablation study for WikiHop. Allresults use Longformer-base, trained for 5 epochswith identical hyperparameters except where noted.Longformer benefits from longer sequences, globalattention, separate projection matrices for globalattention, MLM pretraining, and longer training.In addition, when configured as in RoBERTa-base(seqlen: 512, and n2 attention) Longformer per-forms slightly worse then RoBERTa-base, confirm-

12We can encode this inductive bias using global attentionover entities, but we leave this for future work.

13 We report numbers from dev set scores of publishedpapers. Missing values are not publicly available. Leaderboardtest results will be included in the future.

Model Accuracy / ∆

Longformer (seqlen: 4,096) 73.8

RoBERTa-base (seqlen: 512) 72.4 / -1.4Longformer (seqlen: 4,096, 15 epochs) 75.0 / +1.2Longformer (seqlen: 512, attention: n2) 71.7 / -2.1Longformer (seqlen: 512, attention: window) 68.8 / -5.0Longformer (seqlen: 2,048) 73.1 / -0.7Longformer (no MLM pretraining) 73.2 / -0.6Longformer (no linear proj.) 72.2 / -1.6Longformer (no linear proj. no global atten.) 65.5 / -8.3

Table 11: WikiHop development set ablations

ing that performance gains are not due to additionalpretraining.

6 Conclusion and Future Work

We present Longformer, a transformer-based modelthat is scalable for processing long documentsand that makes it easy to perform a wide rangeof document-level NLP tasks without chunk-ing/shortening the long input and without com-plex architecture to combine information acrossthese chunks. Longformer employs an attentionpattern that combines local and global informationwhile also scaling linearly with the sequence length.Longformer achieves state-of-the-art results on thecharacter-level language modeling tasks of text8and enwik8. When pretrained, Longformer con-sistently outperforms RoBERTa on long documenttasks and sets new state-of-the-art results on Wiki-Hop and TriviaQA. For future work, we would liketo explore other attention patterns that are moreefficient by dynamically adapting to the input. Wealso would like to apply our model to other relevantlong document tasks such as summarization.

Acknowledgment

We would like to thank Noah Smith, Dan Weld,Dirk Groeneveld, Kyle Lo, Daniel King and DougDowney for helpful discussions and feedback, andthe AI2 infrastructure team for technical support.

References

Rami Al-Rfou, Dokook Choe, Noah Constant, MandyGuo, and Llion Jones. 2018. Character-level lan-guage modeling with deeper self-attention. In AAAI.

Danqi Chen, Adam Fisch, Jason Weston, and AntoineBordes. 2017. Reading wikipedia to answer open-domain questions. In ACL.

9

Jifan Chen, Shih-Ting Lin, and Greg Durrett. 2019.Multi-hop question answering via reasoning chains.ArXiv, abs/1910.02610.

Tianqi Chen, Thierry Moreau, Ziheng Jiang, LianminZheng, Eddie Yan, Haichen Shen, Meghan Cowan,Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018.TVM: An automated end-to-end optimizing com-piler for deep learning. In 13th USENIX Symposiumon Operating Systems Design and Implementation(OSDI 18), pages 578–594.

Tianqi Chen, Bing Xu, Chiyuan Zhang, and CarlosGuestrin. 2016. Training deep nets with sublinearmemory cost. ArXiv, abs/1604.06174.

Rewon Child, Scott Gray, Alec Radford, and IlyaSutskever. 2019. Generating long sequences withsparse transformers. ArXiv, abs/1904.10509.

Christopher Clark and Matt Gardner. 2017. Simpleand effective multi-paragraph reading comprehen-sion. In ACL.

Kevin Clark, Urvashi Khandelwal, Omer Levy, andChristopher D. Manning. 2019. What does bertlook at? an analysis of bert’s attention. ArXiv,abs/1906.04341.

Andrew M Dai and Quoc V Le. 2015. Semi-supervisedsequence learning. In Advances in Neural Informa-tion Processing Systems 28.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Car-bonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019.Transformer-xl: Attentive language models beyonda fixed-length context. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. Bert: Pre-training of deepbidirectional transformers for language understand-ing. In NAACL-HLT.

Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, ShuohangWang, and Jing jing Liu. 2019. Hierarchical graphnetwork for multi-hop question answering. ArXiv,abs/1911.03631.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu-nsol Choi, and Danqi Chen. 2019. MRQA 2019shared task: Evaluating generalization in readingcomprehension. In Proceedings of 2nd MachineReading for Reading Comprehension (MRQA) Work-shop at EMNLP.

Michael Glaß, Alfio Massimiliano Gliozzo, RishavChakravarti, Anthony Ferritto, Lin Pan, GaudaniBhargav, Dinesh Garg, and Avirup Sil. 2019. Spanselection pre-training for question answering. ArXiv,abs/1909.04120.

Scott Gray, Alec Radford, and Diederik P. Kingma.2017. Gpu kernels for block-sparse weights.

Dirk Groeneveld, Tushar Khot, Mausam, and AshishSabhwaral. 2020. A simple yet strong pipeline forHotpotQA. arXiv preprint.

Jeremy Howard and Sebastian Ruder. 2018. Universallanguage model fine-tuning for text classification. InACL.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and LukeZettlemoyer. 2017. Triviaqa: A large scale distantlysupervised challenge dataset for reading comprehen-sion. In ACL.

Mandar Joshi, Omer Levy, Luke Zettlemoyer, andDaniel Weld. 2019. BERT for coreference resolu-tion: Baselines and analysis. In EMNLP-IJCNLP.

Johannes Kiesel, Maria Mestre, Rishabh Shukla, Em-manuel Vincent, Payam Adineh, David Corney,Benno Stein, and Martin Potthast. 2019. SemEval-2019 task 4: Hyperpartisan news detection. InProceedings of the 13th International Workshop onSemantic Evaluation, pages 829–839, Minneapo-lis, Minnesota, USA. Association for ComputationalLinguistics.

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutionalnetworks. ICLR.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya.2020. Reformer: The efficient transformer. InICLR.

Olga V. Kovaleva, Alexey Romanov, Anna Rogers, andAnna Rumshisky. 2019. Revealing the dark secretsof bert. In EMNLP/IJCNLP.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018.Higher-order coreference resolution with coarse-to-fine inference. In NAACL.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.Roberta: A robustly optimized bert pretraining ap-proach. ArXiv, abs/1907.11692.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham,Dan Huang, Andrew Y. Ng, and Christopher Potts.2011. Learning word vectors for sentiment analy-sis. In Proceedings of the 49th Annual Meeting ofthe Association for Computational Linguistics: Hu-man Language Technologies, pages 142–150, Port-land, Oregon, USA. Association for ComputationalLinguistics.

Matt Mahoney. 2009. Large text compression bench-mark.

Aaron van den Oord, Sander Dieleman, Heiga Zen,Karen Simonyan, Oriol Vinyals, Alex Graves,Nal Kalchbrenner, Andrew W. Senior, and KorayKavukcuoglu. 2016. Wavenet: A generative modelfor raw audio. In SSW.

Myle Ott, Sergey Edunov, Alexei Baevski, AngelaFan, Sam Gross, Nathan Ng, David Grangier, andMichael Auli. 2019. fairseq: A fast, extensibletoolkit for sequence modeling. In Proceedings ofNAACL-HLT 2019: Demonstrations.

10

https://doi.org/10.18653/v1/S19-2145

https://doi.org/10.18653/v1/S19-2145

http://www.aclweb.org/anthology/P11-1015

http://www.aclweb.org/anthology/P11-1015

Matthew E. Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and LukeZettlemoyer. 2018. Deep contextualized word repre-sentations.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue,Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unre-stricted coreference in OntoNotes. In Joint Confer-ence on EMNLP and CoNLL - Shared Task, pages1–40, Jeju Island, Korea. Association for Computa-tional Linguistics.

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Yih,Sinong Wang, and Jie Tang. 2019. Blockwise self-attention for long document understanding. ArXiv,abs/1911.02972.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. 2019. Languagemodels are unsupervised multitask learners.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayaku-mar, and Timothy P. Lillicrap. 2020. Compressivetransformers for long-range sequence modelling. InICLR.

Nan Shao, Yiming Cui, Ting Liu, Shijin Wang, andGuoping Hu. 2020. Is graph structure necessary formulti-hop reasoning? ArXiv, abs/2004.03096.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bo-janowski, and Armand Joulin. 2019. Adaptive at-tention span in transformers. In ACL.

Trieu H. Trinh and Quoc V. Le. 2018. A sim-ple method for commonsense reasoning. ArXiv,abs/1806.02847.

Ming Tu, Jinke Huang, Xiaodong He, and Bowen Zhou.2020. Graph sequential network for reasoning oversequences.

Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang,Xiaodong He, and Bufang Zhou. 2019. Select, an-swer and explain: Interpretable multi-hop readingcomprehension over multiple documents. ArXiv,abs/1911.00484.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N. Gomez, LukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In NIPS.

Johannes Welbl, Pontus Stenetorp, and SebastianRiedel. 2018. Constructing datasets for multi-hopreading comprehension across documents. Transac-tions of the Association for Computational Linguis-tics, 6:287–302.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin,and Michael Auli. 2019. Pay less attention withlightweight and dynamic convolutions. ArXiv,abs/1901.10430.

Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-ThangLuong, and Quoc V. Le. 2019. Unsupervised dataaugmentation for consistency training.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, YanyanLan, Li-Wei Wang, and Tie-Yan Liu. 2020. On layernormalization in the transformer architecture. ArXiv,abs/2002.04745.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben-gio, William W. Cohen, Ruslan Salakhutdinov, andChristopher D. Manning. 2018. Hotpotqa: A datasetfor diverse, explainable multi-hop question answer-ing. In EMNLP.

Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, andZheng Zhang. 2019. Bp-transformer: Modellinglong-range context via binary partitioning. ArXiv,abs/1911.04070.

Rowan Zellers, Ari Holtzman, Hannah Rashkin,Yonatan Bisk, Ali Farhadi, Franziska Roesner, andYejin Choi. 2019. Defending against neural fakenews. In NeurIPS.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, RuslanSalakhutdinov, Raquel Urtasun, Antonio Torralba,and Sanja Fidler. 2015. Aligning books and movies:Towards story-like visual explanations by watchingmovies and reading books. 2015 IEEE InternationalConference on Computer Vision (ICCV), pages 19–27.

11

https://www.aclweb.org/anthology/W12-4501



A Character LM Hyperparameters

We evaluate on text8 and enwik8, both contain100M characters from Wikipedia split into 90M,5M, 5M for train, dev, test. Our implementationis based on the Transformer-XL (Dai et al., 2019)code14 with the memory mechanism disabled. Ourhyperparameters and stage configurations are listedin Tab. 12. Our CUDA kernel supports the autore-gressive mode where each token attends to a win-dow of previous tokens only. Our implementationalso includes a version of the relative position em-bedding that is compatible with our dilated slidingwindow attention.

We ran the small model experiments on 4RTX8000 GPUs for 16 days. For the large model,we ran experiments on 8 RTX8000 GPUs for 13days. Most of our hyperparameter search is similarto the ablation in Tab. 4 where we run the configu-ration for 150K steps on text8. We experimentedwith absolute position embeddings and learned po-sition embeddings, dropout values of [0.1, 0.2](small model) and [0.1, 0.4] (large model), pre-layernorm and post-layernorm (Xiong et al., 2020),learning rate (LR) of phase1 of values [2.5e-5, 5e-4, 1e-4] constant and cosine LR schedules, anddifferent configurations for dilation (on all heads,on 2 heads, no dilation). Number of gradient up-dates/phase reported in Tab. 12 is determined byrunning each phase until the validation BPC stopsgetting better.

B Task specific model details

All the QA and classification models are imple-mented using PyTorch-Lightning15.

WikiHop Instances in WikiHop consist of: aquestion, answer candidates (ranging from twocandidates to 79 candidates), supporting contexts(ranging from three paragraphs to 63 paragraphs),and the correct answer. The dataset does not pro-vide any intermediate annotation for the multihopreasoning chains, requiring models to instead inferthem from the indirect answer supervision.

To prepare the data for input to Longformerand RoBERTa, we first tokenize the question,answer candidates, and support contexts usingRoBERTa’s wordpiece tokenizer. Then we

14https://github.com/kimiyoung/transformer-xl

15https://github.com/PyTorchLightning/pytorch-lightning

concatenate the question and answer candi-dates with special tokens as [q] question[/q] [ent] candidate1 [/ent] ...[ent] candidateN [/ent]. The contextsare also concatenated using RoBERTa’s doc-ument delimiter tokens as separators: </s>context1 </s> ... </s> contextM</s>. The special tokens [q], [/q],[ent], [/ent] were added to the RoBERTavocabulary and randomly initialized before taskfinetuning.

After preparing the input data, we compute acti-vations from the top layer of each model as follows.We take the question and answer candidates andconcatenate them to as much context as possible upto the model sequence length (512 for RoBERTa,4,096 for Longformer), run the sequence throughthe model, collect the output activations, and repeatuntil all of the context is exhausted (for all modelsexcept Longformer-large, where we just includethe first 4,096 length sequence due to memory re-quirements). Then all activations for all chunks areconcatenated into one long sequence. In the case ofLongformer, we use global attention to the entirequestion and answer candidate sequence.

For prediction, we attach a linear layer to each[ent] that outputs a single logit, average over alllogits for each candidate across the chunks, applya softmax and use the cross entropy loss with thecorrect answer candidate.

Training used the Adam optimizer with linearwarmup over 200 gradient updates to a maximumLR, and linear decay over the remainder of training.We used gradient accumulation to effective batchsize of 32 instances, checking the development ac-curacy every 250 gradient updates and reported themaximum development accuracy. Other hyperpa-rameters (dropout, weight decay) were identical toRoBERTa pretraining.

In general, we ran minimal hyperparameter trials,but for fair comparison between Longformer andRoBERTa ran an identical hyperparameter searchwith Longformer-base and RoBERTa-base. Thisconsisted of a grid search of LR in [2e-5, 3e-5,5e-5] and number epochs in [5, 10, 15]. Thebest Longformer-base configuration used lr=3e-5,15 epochs. We ran two hyperparameter trials forLongformer-large, lr=3e-5 and number epochs in[5, 15] (the 5 epoch model had higher dev accuracyof 77.6, and was the single model submitted to thepublic leaderboard for test set evaluation). All mod-

12

https://github.com/kimiyoung/transformer-xl

https://github.com/kimiyoung/transformer-xl

https://github.com/PyTorchLightning/pytorch-lightning

https://github.com/PyTorchLightning/pytorch-lightning

Param Value

Position Embeddings Relative and Sinusoidal as in Dai et al. (2019)Small model config 12 layers, 8 heads, 512 hidden size as in Dai et al. (2019)Large model config 30 layers, 8 heads, 512 hidden size as in Child et al. (2019)Optimizer AdamWDropout 0.2 (small model), 0.4 (large model)Gradient clipping 0.25Weight Decay 0.01Layernorm Location pre-layernorm (Xiong et al., 2020)Activation GeLUNumber of phases 5Phase 1 window sizes 32 (bottom layer) - 8,192 (top layer)Phase 5 window sizes 512 (bottom layer) - (top layer)Phase 1 sequence length 2,048Phase 5 sequence length 23,040 (gpu memory limit)Phase 1 LR 0.00025Phase 5 LR 000015625Batch size per phase 32, 32, 16, 16, 16#Steps per phase (small) 430K, 50k, 50k, 35k, 5k#Steps per phase (large) 350K, 25k, 10k, 5k, 5kWarmup 10% of the phase steps with maximum 10K stepsLR scheduler constant throughout each phaseDilation (small model) 0 (layers 0-5), 1 (layers 6-7), 2 (layers 8-9), 3 (layers 10-11)Dilation (large model) 0 (layers 0-14), 1 (layers 15-19), 2 (layers 20-24), 3 (layers 25-29)Dilation heads 2 heads only

Table 12: Hyperparameters for the best performing model for character-level language modeling

els were trained on a single RTX8000 GPU, withLongformer-base taking about a day for 5 epochs.

TriviaQA TriviaQA has more than 100K ques-tion, answer, document triplets for training. Doc-uments are Wikipedia articles, and answers arenamed entities mentioned in the article. The spanthat answers the question is not annotated, but it isfound using simple text matching.

Similar to WikiHop, we tokenize the questionand the document using RoBERTa’s tokenizer,then form the input as [s] question [/s]document [/s]. We truncate the document at4,096 wordpiece to avoid it being very slow. After-wards, we get the activations from RoBERTa andLongformer similar to WikiHop (discussed above).We use global attention on all question tokens.

For prediction, we add one layer that predicts thebeginning and end of the answer span. Because ofthe distant supervision nature of the training data(no gold answer spans), we use the loss functionof Clark and Gardner (2017) which works like anOR that the model only needs to get one answerspan right, not all of them.

Hyperparameters of the best configuration arelisted in Tab. 13. All other hyperparameters aresimilar to RoBERTa’s. For hyperparameter search,we only tuned LR for the RoBERTa baseline andtried rates [3e-5, 5e-5, 1e-4], then used the best,which is 3e-5, for all subsequent experiments with

no further tuning. We trained the Longformer-largewith the best configuration once and submitted itsoutput to the leaderboard. We ran our experimentson 32GB V100 GPUs. Small model takes 1 day totrain on 4 GPUs, while large model takes 1 day on8 GPUs.

HotpotQA HotpotQA dataset involves answer-ing questions from a set of 10 paragraphs from10 different Wikipedia articles where 2 paragraphsare relevant to the question and the rest are dis-tractors. It includes 2 tasks of answer span ex-traction and evidence sentence identification. Ourmodel for HotpotQA combines both answer spanextraction and evidence extraction in one jointmodel. We also experimented with a two-stagemodel with similar setup that uses longformer firstfor evidence extraction and then extract answersusing a BERT-based QA model. The second stageof our two-stage model is based on Groeneveldet al. (2020). Similar to Wikihop and TriviaQA,to prepare the data for input to Longformer andRoBERTa, we concatenate question and then allthe 10 paragraphs in one long context. We partic-ularly use the following input format with specialtokens: “[CLS] [q] question [/q] [p]sent1,1 [s] sent1,2 [s] ... [p] sent2,1[s] sent2,2 [s] ...” where [s] and [p]are special tokens representing sentences and para-graphs. The special tokens were added to the

13

RoBERTa vocabulary and randomly initialized be-fore task finetuning. For Longformer, we use globalattention to input tokens as well as sentence andparagraph tokens. For answer span extraction weuse BERT’s QA model (Devlin et al., 2019) withaddition of a question type (yes/no/span) classifi-cation head over the first special token ([CLS]).For evidence extraction we apply 2 layer feedfor-ward networks on top of the representations corre-sponding to sentence and paragraph tokens to getthe corresponding evidence prediction scores anduse binary cross entropy loss to train the model.We combine span, question classification, sentence,and paragraphs losses and train the model in a mul-titask way using linear combination of losses. Ourexperiments are done on RTX8000 GPUs and train-ing each epoch takes approximately half a day on4 GPUs. We trained the model using Adam opti-mizer with linear warmup (1000 steps) and lineardecay. We used minimal hyperpareter tuning usingLRs of 3e-5 and 5e-5 and epochs of 3 to 7 andfound the model with LR of 3e-5 and 6 epochs towork best. We conduct the same hyperparametersearch for the RoBERTa baseline as well. The restof hyperparameters are reported in Tab 13.

Param WikiHop TriviaQA HotpotQA

Epochs 15 5 6LR 3e-5 3e-5 5e-5Warmup steps 200 1000 1000Batch size 32 32 32Optimizer Adam Adam Adam

Table 13: Hyperparameters of the QA models. All mod-els use a similar scheduler with linear warmup and de-cay.

Coreference model details The coreferencemodel is a straightforward adaptation of the coarse-to-fine BERT based model from Joshi et al.(2019). After preprocessing each document withthe RoBERTa wordpiece tokenizer, it splits eachdocument into non-overlapping segments up to themaximum sequence length, then concatenates theactivations for the coarse-to-fine clustering stagethat forms coreference clusters. The maximum se-quence length was 384 for RoBERTa-base, chosenafter three trials from [256, 384, 512] using thedefault hyperparameters in the original implemen-tation.16 For Longformer-base the sequence lengthwas 4,096. Similar to the original implementation,

16https://github.com/mandarjoshi90/coref

different learning rates were used for the pretrainedRoBERTa parameters and the randomly initializedtask parameters. Using a larger learning rate in thetask parameters allows the optimizer to adjust themfarther from their randomly initialized values with-out destroying the information in the pretrainedRoBERTa parameters.

Hyperparameter searches were minimal and con-sisted of grid searches of RoBERTa LR in [1e-5,2e-5, 3e-5] and task LR in [1e-4, 2e-4, 3e-4] forboth RoBERTa and Longformer for a fair compari-son. The best configuration for Longformer-basewas RoBERTa lr=1e-5, task lr=1e-4. All other hy-perparameters were the same as in the original im-plementation. Training takes about 10 hours on asingle GPU.

Our implementation is a superhack that involvesPyTorch and Tensorflow sharing a single processand GPU. To avoid re-implementing the com-plicated coarse-to-fine logic from Tensorflow inPyTorch (that involves a highly optimized cus-tom GPU kernel originally released by Lee et al.(2018)), we devised a system where the lower trans-former portion of the model passes activations andgradients back and forth between PyTorch and Ten-sorflow. The input tensors are first run throughthe transformer in PyTorch, the activations are col-lected from the top layer, transferred from GPUto CPU then from CPU to Tensorflow and back toGPU to run the coarse-to-fine clustering and com-pute the loss. Then gradients are back propogatedin Tensorflow to the top of the transformer andthe process reversed to transfer them to PyTorchfor back propogation through the remainder of themodel. Separate optimizers are maintained withidentical LR schedules for parameter updates. Theoverhead in this approach is minimal compared tothe overall cost of running the model.

Text classification For classification, followingBERT, we used a simple binary cross entropy losson top of a first [CLS] token with addition ofglobal attention to [CLS]. We used Adam opti-mizer with batch sizes of 32 and linear warmupand decay with warmup steps equal to 0.1 of thetotal training steps. For both IMDB and Hyperpar-tisan news we did grid search of LRs [3e-5, 5e-5]and epochs [10, 15, 20] and found the model with[3e-5] and epochs 15 to work best. Experimentswere done on a single RTX8000 GPU.

14