
Zero-shot, One Kill: BERT for Neural Information Retrieval
Using Wikipedia-based Weak Supervision for Passage-(Re)ranking and Question Answering

Stergios Efes

Uppsala University

Department of Linguistics and Philology

Master Programme in Language Technology

Master's Thesis in Language Technology, 30 ECTS credits

June 9, 2021

Supervisor:

Joakim Nivre, Uppsala University


Abstract

[Background]: The advent of bidirectional encoder representations from transformers (BERT) language models (Devlin et al., 2018) and of MS Marco, a large-scale human-annotated dataset for machine reading comprehension (Bajaj et al., 2016) that was made publicly available, led the field of information retrieval (IR) to experience a revolution (Lin et al., 2020). The BERT-based retrieval model of Nogueira and Cho (2019) became, at the time their paper was published, the top entry in the MS Marco passage-reranking leaderboard, surpassing the previous state of the art by 27% in MRR@10. However, training such neural IR models for domains other than MS Marco remains hard because neural approaches often require a vast amount of training data to perform effectively, which is not always available. To address the shortage of labelled data, a new line of research emerged: training neural models with weak supervision. In weak supervision, labels for an unlabelled dataset are generated automatically using an existing model, and a machine learning model is then trained on the artificial "weak" data. In the case of weak supervision for IR, the training dataset comes in the form of (query, passage) tuples. Dehghani et al. (2017) used the AOL query logs (Pass et al., 2006), a set of millions of real web queries, and BM25 to retrieve the relevant passages for each user query. A drawback of this approach is that it is hard to obtain query logs for every single domain. [Objective]: This thesis proposes an intuitive approach for addressing the shortage of data in domains with limited or no data at all through transfer learning in the context of IR. We leverage Wikipedia's structure to create a Wikipedia-based generic IR training dataset for zero-shot neural models. [Method]: We create "pseudo-queries" by concatenating the title of each Wikipedia article with each of its section titles, and we consider the associated section's passage as the relevant passage for the pseudo-query. All of our experiments are evaluated on a standard collection, MS Marco, which is a large-scale web collection. For our zero-shot experiments, our proposed model, called "Wiki", is a BERT model trained on the artificial Wikipedia-based dataset, and the baseline is a default BERT model without any additional training. In our second line of experiments, we explore the benefits gained by pre-fine-tuning on the Wikipedia-based IR dataset and further fine-tuning on in-domain data. Our proposed model, "Wiki+Ma", is a BERT model pre-fine-tuned on the Wikipedia-based dataset and further fine-tuned on MS Marco, while the baseline is a BERT model fine-tuned only on MS Marco. [Results]: Results of our first experiments show that our BERT model trained on the Wikipedia-based IR dataset, "Wiki", achieves 0.197 in MRR@10, about +10 points more than a BERT model with default weights; in addition, results on the development set indicate that the "Wiki" model performs better than a BERT model trained on in-domain data when that data amounts to 10k-50k instances. Results of our second line of experiments show that pre-fine-tuning on the Wikipedia-based IR dataset benefits later fine-tuning steps on in-domain data in terms of stability.
[Conclusion]: Our findings suggest that transfer learning for IR tasks by leveraging the generic knowledge incorporated in Wikipedia is possible, though more experimentation is needed to understand its limitations in comparison with traditional approaches such as BM25.


Contents

1 Introduction
  1.1 Purpose and Research Questions
  1.2 Outline

2 Background
  2.1 Information Retrieval Tasks
    2.1.1 Ad hoc Retrieval
    2.1.2 Question Answering
    2.1.3 Other Information Retrieval Tasks
  2.2 Evaluation in Information Retrieval
    2.2.1 Mean Average Precision
    2.2.2 Mean Reciprocal Rank
  2.3 Traditional Information Retrieval
    2.3.1 Vector Space Model
    2.3.2 BM25
    2.3.3 Other Traditional Information Retrieval Models
  2.4 Learning-To-Rank Information Retrieval
  2.5 Neural Information Retrieval
    2.5.1 A Unified Model Formulation of the Neural Ranking Models
    2.5.2 Model Architectures
  2.6 Bidirectional Encoder Representations from Transformers
    2.6.1 Traditional Language Modelling
    2.6.2 Neural Language Modelling: Attention Mechanisms and Transformers
    2.6.3 BERT Language Modelling
  2.7 Weak Supervision for Ranking
    2.7.1 Wikipedia-Based Weak Supervision Signals
    2.7.2 Previous Work

3 Methodology
  3.1 Task
  3.2 BERT for Passage-Reranking
  3.3 Datasets
    3.3.1 MS Marco Training Dataset
    3.3.2 Wikipedia-based Training Dataset
    3.3.3 Development and Test set
  3.4 Creating the Wikipedia-based dataset
  3.5 Data pre-processing
  3.6 Training on TPUs
  3.7 Experimental Systems
  3.8 Experimental Settings
  3.9 Evaluation Methods

4 Results
  4.1 Accuracy
  4.2 Convergence
  4.3 Stability
  4.4 Analysis

5 Conclusion


1 Introduction

Information retrieval (IR) technologies play an important role in people's daily lives. Millions of users search the web every day, and billions of queries are processed daily by major search engines such as Google and Bing.

The effectiveness of such search engines rests upon two factors: the use of neural networks and the large amount of click-log data they are trained on (Zamani, 2019). Such text retrieval models need to acquire an understanding of raw text documents and to learn a ranking function f(q, d) which, given a query q and a document d, outputs a probability of the document being relevant (Guo et al., 2020).

Neural IR started to flourish only after the publication of large datasets for passage-reranking (Nogueira and Cho, 2019) such as MS Marco from Microsoft Bing (Bajaj et al., 2016). It has been noted that the absence of such big datasets made it impossible for neural IR to compete with classical IR (Lin, 2019). Despite the recent advances in neural IR (Nogueira and Cho, 2019; Hofstätter et al., 2020; S. Han et al., 2020), it has to be emphasized that they took place in a scenario where there is an abundance of training signals, i.e. MS Marco. Without such large datasets (which are either click logs or manually labelled with human judgments), the effectiveness of neural networks is highly questionable (Lin, 2019; W. Yang et al., 2019).

For such a setting, when there is no labelled data available for training, recent research has focused on training neural ranking models using weak supervision, in which labels are acquired automatically by other means. For example, Dehghani et al. (2017) used an unsupervised IR model, BM25, while K. Zhang et al. (2020) used anchor text. Addressing the issue of the lack of labelled data, Frej et al. (2019) take it one step further by utilizing Wikipedia to build large-scale IR test collections automatically.

This thesis follows the aforementioned line of research: it aims to investigate to what extent Wikipedia-based weak supervision signals can be used to train a generic, efficient neural ranking model (Radford et al., 2019).

1.1 Purpose and Research Questions

The purpose of this work is to investigate the effectiveness of transfer learning in deep neural networks in the context of IR, in order to address the data bottleneck one faces when training a neural retrieval model for a domain with limited data or no data at all. BERT models (Devlin et al., 2018) have been shown to be an effective approach to transfer learning by being pre-trained on large amounts of raw text data and then fine-tuned for a specific task (Devlin et al., 2018). In our case, we explore the possibility of transfer learning for IR by fine-tuning a BERT model with Wikipedia-based weak supervision for the task of passage-reranking. We assume that, since the Wikipedia corpus is characterised by generic knowledge, training a BERT model on a Wikipedia-based IR dataset would potentially help it to incorporate generic information retrieval knowledge that would prove beneficial for later retrieval tasks.

We try to quantify how beneficial such an approach can be by answering the following research questions:


• Does a BERT model, fine-tuned with Wikipedia-based weak supervision, perform better in terms of accuracy when tested on out-of-domain data (zero-shot setting) in comparison to a default BERT model that has not been fine-tuned at all?

• If we further fine-tune this BERT model (that has already been trained with Wikipedia-based weak supervision) on in-domain data, will there be any improvement in terms of accuracy, convergence or stability in comparison to a model that has only been trained on in-domain data?

1.2 Outline

Chapter 2 describes the different tasks in IR, paying particular attention to the question answering task and one of its components that is central to this thesis, passage-reranking. In addition, it describes the different approaches to IR, such as traditional IR, learning-to-rank and neural IR, and how evaluation is performed in IR. After the necessary IR concepts are laid out, a presentation of language modelling (BERT) and of weak supervision follows. Chapter 3 presents the methodology used to perform passage-reranking with BERT and presents the datasets used, that is, Wikipedia, which is pre-processed and used to create an artificial query-passage dataset for training the BERT model, and MS Marco, which is used both for training and for testing; it also describes the experimental settings along with the baseline(s). Chapter 4 presents, analyzes and discusses the results from our experiments. Chapter 5 presents the essential contributions of this work and concludes the thesis.


2 Background

The following sections aim to present a coherent yet short overview of the concepts and theory needed to understand neural information retrieval. The literature does not always agree on terminology, and different names for the same concepts are used interchangeably. For this reason, it felt necessary to clarify the terms used in this thesis.

Traditional IR or classical IR refers to the basic retrieval systems used in IR (such as, for instance, the vector space model and the BM25 retrieval algorithm). Such methods focus on word occurrences for measuring relevance. For this thesis, the term traditional IR was chosen as an attempt to highlight the absence of machine learning in it.

Learning to rank (LETOR) or machine-learned IR refers to a newer paradigm in IR that seeks to employ machine learning approaches to solve IR problems. By LETOR the literature seems to refer to (1) the framework, meaning the formalisation of the IR ranking problem from the perspective of machine learning, and (2) the use of traditional machine learning approaches to solve it, such as support vector machines, or simple neural networks such as the perceptron. These kinds of methods focus on training ML models on human-labelled datasets using hand-crafted features, as will be explained more thoroughly later. For this work the term learning to rank (LETOR) is adopted because it is more broadly used in the literature than machine-learned IR.

Neural IR, neural retrieval, Neu-IR, or neural LETOR refers to the use of deep neural network (DNN) architectures to address IR problems from the perspective of the LETOR framework. The DNNs are trained on human-labelled datasets, but the feature learning is done automatically by the network, in contrast to the hand-crafted features mentioned before. The term neural IR was chosen for this thesis since it is broadly used in the literature.

2.1 Information Retrieval Tasks

In this section, we describe the task of question answering in IR and how it relates to this thesis; we also give a brief description of the ad hoc retrieval task, since it is the most prominent task in the IR field, and we briefly mention the remaining IR tasks.

2.1.1 Ad hoc Retrieval

The most notable of the retrieval tasks is ad hoc retrieval (Guo et al., 2020), in which a user has an information need for which they issue a query to a retrieval system, which in turn measures the relevance between the query and the documents in the collection and retrieves the top N scoring documents. A major difficulty in ad hoc retrieval is that incoming queries usually have an unclear intent and range from a few words to a few sentences (Mitra and Craswell, 2017).

2.1.2 Question Answering

The task of question answering (QA) is to automatically answer a user's questions, issued in natural language, using some information resources (Guo et al., 2020). The information resources can be either structured data (such as a knowledge base) or unstructured data (for example web pages or documents), which is what we are concerned with in this thesis. Furthermore, there are several task formats for QA, such as passage-reranking (Nogueira and Cho, 2019), passage-retrieval, answer span locating (Rajpurkar et al., 2016) and answer synthesizing from multiple sources (Mitra et al., 2016).

As far as passage-reranking is concerned, it has to be noted that the literature does not seem to agree on the terminology. Some papers consider it not an independent task per se but rather a post-retrieval step of the passage-retrieval task (Aktolga et al., 2011). Others, such as Nogueira and Cho (2019), treat it as an independent task. In addition, the terms passage-ranking and passage-reranking seem to be used interchangeably in the literature. In any case, passage-reranking has become a crucial component of any QA system (Cui et al., 2005).

Most QA systems employ a pipeline structure that consists of several modules to get answers (Nogueira and Cho, 2019):

• passage-retrieval: in this phase, n relevant passages are retrieved using an inexpensive method, such as BM25 or TF-IDF.

• passage-reranking: the n retrieved passages are reranked using a more computationally expensive method.

• answer span locating: the top 5-10 passages are used as candidates by an answer extraction algorithm for marking up the answer location in the passage(s).

2.1.3 Other Information Retrieval Tasks

There are many other IR tasks as well, such as product search (Brenner et al., 2018), sponsored search (Fain and Pedersen, 2006), community question answering (L. Yang et al., 2013), and automatic conversation (Ji et al., 2014), but they are outside the scope of this work.

2.2 Evaluation in Information Retrieval

Information retrieval systems are evaluated using test collections (Schütze et al., 2008), which consist of:

• A document collection which is the set of documents that the IR system indexes.

• A set of queries that express the information needs.

• A set of ground truth labels, which are relevance judgments, a binary assessment of a document being relevant or irrelevant with respect to a query.

IR evaluation revolves around the concept of relevant and non-relevant documents: given a query, a document is classified as being relevant or not. The gold standard or ground truth refers to the relevance judgments of this binary classification.

The typical metrics used in IR are adjusted versions of precision, recall and the F measure; they are "adjusted" because those are set-based measures computed over unordered sets, while in the IR context the aim is to evaluate ranked results. Below we present the most common metric used in IR, mean average precision (MAP), and the one used in this thesis, mean reciprocal rank (MRR).


2.2.1 Mean Average Precision

One of the most commonly used metrics in the IR community is MAP. Given a query, average precision is defined as the average of the precision values obtained at the ranks of the relevant documents, and this value is then averaged over all queries. Let \{d_1, \ldots, d_{m_j}\} be the list of relevant documents for query q_j \in Q and R_{j,k} be the set of retrieved results from the top of the ranking down to document d_k:

\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{j,k})    (2.1)

2.2.2 Mean Reciprocal Rank

The reciprocal rank (RR) metric calculates the reciprocal of the rank at which the first relevant document was retrieved: RR is 1 if a relevant document was retrieved at rank 1, 0.5 if the first relevant document was retrieved at rank 2, and so on. Averaged across all queries, this metric is called the mean reciprocal rank (MRR) (Craswell, 2009):

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}    (2.2)

MRR is associated with a user model in which the user only wishes to see one relevant document (Craswell, 2009). The metric is very sensitive to changes at the top of the ranking: moving from rank 1 to rank 2 halves the score (a drop of 0.5), while moving from rank 100 to rank 1,000 changes it by only 0.009.
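To make the two metrics concrete, the following is a minimal Python sketch (an illustration, not code from the thesis) that computes average precision and reciprocal rank for a single ranked list of binary relevance labels; averaging the per-query values over a query set yields MAP and MRR as in Equations 2.1 and 2.2.

def average_precision(relevances):
    """relevances: list of 0/1 labels in ranked order."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # Precision@k at each relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(relevances, cutoff=None):
    """Reciprocal of the rank of the first relevant document (0 if none is found)."""
    for k, rel in enumerate(relevances, start=1):
        if cutoff and k > cutoff:
            break
        if rel:
            return 1.0 / k
    return 0.0

# Example: the first relevant document appears at rank 3, so RR = 1/3.
print(reciprocal_rank([0, 0, 1, 0, 1]))     # 0.333...
print(average_precision([0, 0, 1, 0, 1]))   # (1/3 + 2/5) / 2 = 0.3666...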

2.3 Traditional Information Retrieval

The main tendency in traditional IR is that a sparse term-document matrix is built out of term frequencies. Such retrieval systems are also called bag-of-words models, since they ignore the ordering of the words in the documents.

2.3.1 Vector Space Model

The intuition behind the vector space model (VSM) is that documents and queries (in this setting a query is seen as a document) can be represented as vectors in a multi-dimensional space and compared there using the cosine similarity. More specifically, \vec{d} denotes the vector derived from a document d and \vec{q} denotes the vector derived from a query q, where each component of the vector corresponds to a dictionary term and is calculated using the tf-idf scheme described below (Schütze et al., 2008).

Tf-idf weighting

Using the tf-idf weighting scheme, a weight is assigned to a term t in a document d using the following formula:

\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t    (2.3)

where

\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{j} f_{j,d}}

in which f_{t,d}, the number of times the term t appears in the document d, is divided by the total number of terms in the document, and

\mathrm{idf}_t = \log \frac{|D|}{\mathrm{df}_t}

where |D| denotes the number of documents in the collection and df_t (the document frequency) denotes the number of documents in which the term t appears.

Cosine similarity

Having derived the document vector \vec{d} and the query vector \vec{q}, the cosine similarity between the two vector representations is computed to quantify their similarity:

\mathrm{cosine}(q, d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \times |\vec{d}|}    (2.4)
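As an illustration of the vector space model, here is a small self-contained Python sketch (a toy example, not code from the thesis) that builds tf-idf vectors for a tiny collection and ranks documents by cosine similarity against a query.

import math
from collections import Counter

docs = ["dogs eat bones", "cats drink milk", "dogs love humans"]
query = "dogs bones"

def tf_idf_vector(text, vocab, df, n_docs):
    """One tf-idf weight per vocabulary term: tf is relative frequency, idf = log(|D| / df_t)."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return [(counts[t] / total) * math.log(n_docs / df[t]) if counts[t] else 0.0
            for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = sorted({t for d in docs for t in d.split()})
df = {t: sum(t in d.split() for d in docs) for t in vocab}

d_vecs = [tf_idf_vector(d, vocab, df, len(docs)) for d in docs]
q_vec = tf_idf_vector(query, vocab, df, len(docs))
ranking = sorted(zip(docs, (cosine(q_vec, v) for v in d_vecs)),
                 key=lambda x: x[1], reverse=True)
print(ranking)   # "dogs eat bones" ranks first for this query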

2.3.2 BM25

The BM25 scoring formula stems from the probabilistic relevance framework (PRF), and it is considered by some researchers to be the state of the art of traditional IR (Robertson and Zaragoza, 2009). Even though several BM25 versions have been developed over the years (Kamphuis et al., 2020), the basic formula is (Aklouche et al., 2019):

\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \times \frac{\mathrm{tf}_{t,d} \times (k_1 + 1)}{\mathrm{tf}_{t,d} + k_1 \times \left(1 - b + b \times \frac{dl}{avgdl}\right)}    (2.5)

where k_1 and b are constants tuned on a labelled dataset, dl is the document length, and avgdl is the average document length over all documents in the collection.
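A minimal Python sketch of Equation 2.5 follows (illustrative only; k1 = 1.2 and b = 0.75 are commonly used default values, not values tuned in this thesis).

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one document for a query according to the basic BM25 formula."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log(n_docs / doc_freq[t])            # idf(t) = log(|D| / df_t)
        denom = tf[t] + k1 * (1 - b + b * dl / avgdl)   # length-normalised term frequency
        score += idf * tf[t] * (k1 + 1) / denom
    return score

In a full retrieval system this score would be computed against every candidate document (usually via an inverted index) and the top-N documents returned.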

2.3.3 Other Traditional Information Retrieval Models

Since the aim of this thesis is not to give a comprehensive guide to traditional IR, we briefly mention only two other approaches that are prominent in the field of IR.

• In the language modelling approach, a document is assumed to have its own language model and the task is to calculate the probability of a document d emitting a query q (Banerjee and H. Han, 2009).

• In Bayesian networks, documents, terms, and queries are represented as nodes, and arcs link them together. Using the prior document probabilities and the conditional probabilities from the interior nodes, a posterior probability can be computed (Turtle and Croft, 1989).

2.4 Learning-To-Rank Information Retrieval

Learning-to-rank (LETOR) refers to the application of traditional machine learning in the field of information retrieval for (re)ranking a list of documents given a query. Many ML models have been employed over the years for the LETOR task, for example support vector machines (Yue et al., 2007) and boosted decision trees (Burges et al., 2005).

LETOR makes use of training data annotated with human relevance labels to train for a ranking task. The main thing that distinguishes LETOR models from the neural approaches is that LETOR models employ hand-crafted features for representing the query-document pairs (Mitra, Craswell, et al., 2018), something we will address in more detail in the next chapter where the neural approaches are described. Typically such hand-crafted features fall under one of the following three categories: query-independent or static features (e.g. document length or web-link length), query-dependent or dynamic features (e.g. BM25), and query-level features (e.g. query length).

The different LETOR approaches can be categorised based on their training objectives (T.-Y. Liu, 2011); a minimal sketch of the point-wise and pair-wise losses follows the list:

• Point-wise method: It is the earliest method used. In the point-wise approach, the loss function looks at one document at a time and scores the document independently of the other documents. A regression model is typically trained on labelled data to predict a numerical relevance label for a document given a query (Mitra, Craswell, et al., 2018).

• Pair-wise method: In the pair-wise approach, the loss function looks at two documents at a time and tries to derive the optimal ordering for them. The ranking problem is reduced to a binary classification problem of predicting the most relevant document (Mitra, Craswell, et al., 2018).

• List-wise method: In list-wise approaches, the entire set of documents is taken as input and the model predicts the ground truth labels (T.-Y. Liu, 2011).
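The sketch below (illustrative Python; the scores would come from whatever ranking model is being trained, and the squared-error and logistic losses are common choices rather than the only possible ones) contrasts a point-wise loss with a pair-wise loss over a preference pair.

import math

def pointwise_loss(score, label):
    """Regression-style loss: one (query, document) pair scored against its relevance label."""
    return (score - label) ** 2

def pairwise_loss(score_pos, score_neg):
    """Logistic loss on the margin: penalises ranking the less relevant document higher."""
    return math.log(1 + math.exp(-(score_pos - score_neg)))

# Point-wise: each document is scored independently of the others.
print(pointwise_loss(0.8, 1.0), pointwise_loss(0.3, 0.0))
# Pair-wise: only the relative order of the two documents matters.
print(pairwise_loss(0.8, 0.3))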

2.5 Neural Information Retrieval

Neural IR refers to the use of deep neural architectures to address IR tasks. A key difference between deep architectures and the LETOR approaches is that, in contrast to the manual feature engineering demanded by the LETOR methods, deep neural networks learn the features needed for training in an unsupervised way, at the cost of training more complex models (Mitra, Craswell, et al., 2018).

2.5.1 A Unified Model Formulation of the Neural Ranking Models

As mentioned earlier in the thesis, neural IR is studied under the LETOR framework (Guo et al., 2020). For this reason the literature usually gives a unified formulation of neural ranking models from a generalized view of LETOR problems, which we also describe below.

Following Guo et al. (2020), suppose there is a set of queries Q, which could be any type of text queries, natural language questions, or input utterances, and a set of documents D, which could be any type of text documents, answer passages, or snippets from web pages or real documents. Let also Y be a set of labels representing relevance degrees, Y = \{1, 2, \ldots, l\}, between which there exists an order l > l-1 > \ldots > 1. Now let q_i be the query in the i-th position of Q, let D_i be the set of documents associated with q_i, such that D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}\}, and let Y_i be the set of labels associated with q_i, such that Y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}\}, with y_{i,j} being the relevance degree of d_{i,j} for q_i. Let f(q_i, d_{i,j}) be a ranking function that assigns a relevance score to a query-document pair, and let L(f; q_i, d_{i,j}, y_{i,j}) be the loss function that calculates the loss between the prediction of f and the label. The objective is then to find the optimal ranking function f^* by minimizing the loss function over a labelled dataset:

f^* = \arg\min_{f} \sum_{i} \sum_{j} L(f; q_i, d_{i,j}, y_{i,j})    (2.6)

Without loss of generality, we can further abstract the ranking function f into the following unified formulation:

f(q, d) = g(\psi(q), \phi(d), \eta(q, d))    (2.7)

where q and d are the inputs, ψ and φ are representation functions which extract features from q and d respectively, η is the interaction function which extracts features from the pair (q, d), and g is the evaluation function which computes the relevance score based on the feature representations.

Using this generalised scheme, we can now describe the differences between the LETOR approaches and neural IR. In LETOR approaches the inputs to the function f are usually raw texts, while in neural IR these inputs can be either raw texts or word embeddings (the mapping function is not included in the unified formula since it is considered a basic input layer).

Turning to the ψ, φ and η functions, in the LETOR approaches they are usually set to be fixed functions, while the function g is a machine learning model (for instance a gradient boosting tree) that can be learned from the training data. On the other hand, neural ranking models encode all four functions ψ, φ, η and g in the network so that they can be learned automatically from the data (Guo et al., 2020).
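To make Equation 2.7 concrete, here is a small illustrative Python sketch (hypothetical toy functions, not an actual ranking model) instantiating f with bag-of-words representation functions, a term-overlap interaction, and a simple evaluation function.

import math
from collections import Counter

def psi(q):                     # representation function for the query
    return Counter(q.split())

def phi(d):                     # representation function for the document
    return Counter(d.split())

def eta(q, d):                  # interaction function over the pair (q, d)
    return len(set(q.split()) & set(d.split()))   # e.g. raw term overlap

def g(q_repr, d_repr, interaction):   # evaluation function producing the relevance score
    dot = sum(q_repr[t] * d_repr[t] for t in q_repr)
    norm = (math.sqrt(sum(v * v for v in q_repr.values())) *
            math.sqrt(sum(v * v for v in d_repr.values())))
    return (dot / norm if norm else 0.0) + 0.1 * interaction

def f(q, d):                    # f(q, d) = g(psi(q), phi(d), eta(q, d))
    return g(psi(q), phi(d), eta(q, d))

print(f("dogs eat bones", "dogs love bones"))

In a representation-focused model ψ and φ would be deep networks and g a simple similarity; in an interaction-focused model η would carry most of the modelling capacity, as described in the next section.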

2.5.2 Model Architectures

Depending on the nature of the interaction function η, or on the different assumptions over the features extracted by the representation functions φ and ψ described in Section 2.5.1, deep learning models can be divided into the following two architectures (Guo et al., 2020): representation-focused and interaction-focused.

Figure 2.1: Model architectures: a) Representation-focused, b) Interaction-focused (Guo et al., 2016).

Representation-focused architecture

The underlying assumption in representation-focused models is that relevance depends on the compositional meaning of the input texts. Thus, models of this category usually employ complex representation functions φ and ψ (e.g. deep neural networks) to derive high-level representations of the text inputs q and d, but not the interaction function η, and define a simple evaluation function g, for instance cosine similarity, for calculating the relevance score (Guo et al., 2020).


Interaction-focused architecture

On the other side of the spectrum, the underlying assumption in interaction-focused models is that relevance depends upon the interaction between the input texts. Thus, models of this category employ an interaction function η along with simple representation functions φ and ψ, while they define a complex evaluation function g (e.g. a deep neural network). Depending on the kind of interaction function used, the interaction-focused architecture can be further categorised (Guo et al., 2020) into:

• Non-parametric interaction functions that measure the closeness between inputs without learnable parameters, some of which are defined over each pair of input vectors.

• Parametric interaction functions that learn the similarity function from the data.

2.6 Bidirectional Encoder Representations from Transformers

Having described the main IR concepts needed for the thesis, we proceed to the area of language modelling (LM), which constitutes the basis of the Bidirectional Encoder Representations from Transformers (BERT) models used later in our experiments.

2.6.1 Traditional Language Modelling

Language modelling (LM) has been the basis for various NLP tasks. In machine translation, for example, an LM is used to improve the fluency of the output translations by choosing the most probable output (Jing and Xu, 2019). The first language models were rule-based, until the advent of statistical language models (1980s), which assign a probability distribution over sequences of words (Jing and Xu, 2019):

P(s) = P(w_1, w_2, \ldots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times \ldots \times P(w_n \mid w_1, w_2, \ldots, w_{n-1})    (2.8)

A major drawback of this n-gram LM approach was the curse of dimensionality. In particular, for modelling an LM of this kind with a vocabulary of size 10,000, there are potentially 10,000^{n-1} free parameters.

2.6.2 Neural Language Modelling: Attention Mechanisms and Transformers

To address the aforementioned "curse" of data sparsity, neural networks were introduced for language modelling in continuous space (Bengio et al., 2000). Such LMs create a dense representation of the language, in contrast to the sparse representation of the statistical approach mentioned before. Since then many techniques have been developed in neural language modelling, but it is beyond the scope of this thesis to present them. We only mention recurrent neural networks (RNNs) (Mikolov et al., 2010) and the one that mostly concerns this thesis, namely attention mechanisms.

Attention mechanisms are a set of coefficients used by the neural network to select the target area it needs to focus on. LMs equipped with attention mechanisms use the long history more efficiently (Bahdanau et al., 2014; Mei et al., 2016). Formally, the attention vector Δ_t is calculated from the representations \{r_1, r_2, \ldots, r_{t-1}\}:

\Delta_t = \sum_{i=0}^{t-1} \alpha_{ti} r_i    (2.9)

Due to the success of attention mechanisms, Vaswani et al. (2017) proposed an architecture based solely on attention mechanisms: the Transformer, which consists of an encoder and a decoder. The Transformer encoder is the basis of the BERT language models.

2.6.3 BERT Language Modelling

BERT is a language model based on the aforementioned Transformer architecture. As the authors mention in their paper, it is designed to "pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers" (Devlin et al., 2018).

Input and output representation

BERT is capable of coping with either a single sentence or a pair of sentences and is thus able to handle a variety of downstream tasks. An example is given in Figure 2.2. The first element of every sequence is the [CLS] token. In classification tasks the final hidden state of that token is used as the aggregate sequence representation. A pair of sentences is concatenated into a single sequence using the [SEP] token to separate them. In addition, a learned embedding is added to every token indicating whether it belongs to sentence A or B. Thus, for a single token, its input representation is obtained by summing the corresponding token, segment, and position embeddings.

Figure 2.2: BERT input representation (Devlin et al., 2018).
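As a concrete illustration of this input format (a schematic sketch only; real BERT tokenization uses WordPiece sub-words rather than whitespace splitting), a query and a passage of a re-ranking example would be packed as follows:

def pack_pair(sentence_a, sentence_b, max_len=512):
    """Build the token and segment-id sequences for a BERT sentence pair."""
    tokens_a = sentence_a.split()        # stand-in for WordPiece tokenization
    tokens_b = sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens[:max_len], segment_ids[:max_len]

tokens, segments = pack_pair("what do dogs eat", "dogs eat bones and meat")
print(tokens)    # ['[CLS]', 'what', ..., '[SEP]', 'dogs', ..., '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]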

BERT framework

The BERT framework consists of two phases: pre-training and fine-tuning. During the first phase, the BERT model is pre-trained on large amounts of raw text data in an unsupervised manner to acquire deep language representations; during the second phase it is fine-tuned on a supervised task. Most of the time the wording "training BERT" refers only to the second phase of fine-tuning, since BERT models pre-trained on general language data are freely available.

More specifically, the pre-training step of BERT involves training it on the following two unsupervised tasks: masked LM and next sentence prediction (NSP). In masked LM, a percentage of the input tokens are masked and the model tries to predict those masked tokens. In that way the BERT model acquires knowledge of how words relate to each other. The aim of NSP, on the other hand, is to acquire knowledge about the relationship between two sentences, since many downstream tasks such as question answering require such knowledge. Thus, the BERT model is trained on a binary classification task of predicting whether the second sentence follows the first one.


Figure 2.3: Pre-training and fine-tuning (Devlin et al., 2018).

The fine-tuning step of BERT is straightforward, since the Transformer architecture with its self-attention mechanism permits BERT to model other downstream tasks by simply swapping the appropriate inputs and outputs into sentences A and B respectively and fine-tuning all of the parameters end-to-end. For instance, in the question answering task, sentence A is the question and sentence B the relevant answer. A BERT classifier is then trained on such inputs.

2.7 Weak Supervision for Ranking

Weak supervision is a sub-field of machine learning in which the basic assumption is that we can cheaply acquire imperfect, noisy labels in an unsupervised fashion and use them as a weak supervision signal for training a classifier (Hernández-González et al., 2016). Weak supervision has been applied to many NLP tasks such as relation extraction (Bing et al., 2015; X. Han and Sun, 2016), knowledge base completion (Hoffmann et al., 2011), and sentiment analysis (Severyn and Moschitti, 2015).

In the context of neural IR weak supervision is used to address the problem of theabsence of large labeled datasets needed to e�ciently train neural networks. Such largedatasets are expensive to obtain, and thus unsupervised learning is considered as a longstanding goal for several applications (Dehghani et al., 2017). More speci�cally, weaksupervision in neural IR means that we take advantage of an existing unsupervisedIR model, such as BM25, which we use as “pseudo-labeler”. Given a target collectionof documents and a set of training queries the pseudo-labeler is used to rank thedocuments for every query in the training set. The objective is to train a classi�erusing these scores as weak supervision signals obtained by the “pseudo-labeler”. As anexample, for a query “dogs”, our pseudo-labeler retrieves the following three documents:“dogs are good”, “dogs eat bones”, “dogs love humans”. Then the created dataset will be aset of tuples (q, relevant passage): (dogs, dogs are good), (dogs, dogs eat bones), (dogs, dogslove humans). These tuples will be used to train a binary classi�er for distinguishingbetween relevant and irrelevant documents.
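The following sketch shows how such weak (query, passage) tuples could be generated with a BM25 pseudo-labeler. It is illustrative only (it mirrors the idea of Dehghani et al. (2017), not their implementation) and reuses the bm25_score function from the sketch in Section 2.3.2.

from collections import Counter

def weak_tuples(queries, collection, top_k=2):
    """Pair each training query with its top-k BM25-ranked passages as weak positives."""
    tokenised = [doc.split() for doc in collection]
    avgdl = sum(len(d) for d in tokenised) / len(tokenised)
    doc_freq = Counter(t for d in tokenised for t in set(d))
    pairs = []
    for q in queries:
        scored = [(doc, bm25_score(q.split(), toks, doc_freq, len(tokenised), avgdl))
                  for doc, toks in zip(collection, tokenised)]
        scored.sort(key=lambda x: x[1], reverse=True)
        pairs.extend((q, doc) for doc, score in scored[:top_k] if score > 0)
    return pairs

print(weak_tuples(["dogs"], ["dogs are good", "dogs eat bones", "cats drink milk"]))
# [('dogs', 'dogs are good'), ('dogs', 'dogs eat bones')]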

2.7.1 Wikipedia-Based Weak Supervision Signals

Frej et al. (2019) used Wikipedia to create test collections for IR by utilizing Wikipedia's internal linkage to create query topics. This thesis is inspired by that work in using Wikipedia as a source for automatically creating an artificial training dataset for a neural IR classifier, in which the weak labels are constructed not with an unsupervised algorithm, as Dehghani et al. (2017) did with BM25, but rather by exploiting Wikipedia's internal structure. More details regarding the methodology are presented in Chapter 3.

2.7.2 Previous Work

Weak supervision in IR is an active area of research, and several weakly supervised alternatives have been explored so far.

In Dehghani et al. (2017), the authors utilized BM25 to retrieve documents and construct their weak training dataset. K. Zhang et al. (2020) used anchor texts and their linked web pages to construct their weak supervision signals. Ma et al. (2020) introduced a zero-shot retrieval approach using synthetic query generation, training a generative model on community QA data from a different domain. Frej et al. (2019) exploited Wikipedia's internal linkage to create query topics. Nogueira and Cho (2019) trained a BERT model for passage re-ranking and achieved state-of-the-art results on the MS Marco dataset.

This thesis is mostly inspired by the work of Nogueira and Cho (2019) and Frej et al. (2019), though it differs from both substantially. Nogueira and Cho (2019) train a BERT model with supervision on MS Marco (a human-labelled dataset that we describe in detail in the next chapter) for the passage-reranking task and evaluate on MS Marco, while in our work we first fine-tune the BERT model with Wikipedia-based weak supervision and then evaluate in a zero-shot fashion on the MS Marco dataset; in addition, after training on Wikipedia we further fine-tune our model on the MS Marco dataset and evaluate on MS Marco again. On the other hand, our main differences with Frej et al. (2019) lie in the methodology for creating the weak supervision signals from Wikipedia, and in the fact that they also use Wikipedia for evaluating their models. Specifically, Frej et al. (2019) implement a more involved method that utilizes the internal linkage of Wikipedia to build an IR collection, consisting of a training and a test set, and then perform various experiments on it, while in our work we utilise the internal structure of Wikipedia's articles to build the weak supervision signals. The next chapter explains our procedure in detail.


3 Methodology

In this section, we explain in detail our methodology for creating the Wikipedia-based IR training dataset and for training the BERT retrieval models. As the basis for all of our systems, a BERT-Small model is used (downloaded from the official GitHub repository of Google AI [1]), and all the pre-processing and training is done using the TensorFlow library [2].

1. https://github.com/google-research/bert
2. https://www.tensorflow.org/

The architectural overview of our pipeline is as follows: (1) Wikipedia-based dataset creation, (2) MS Marco preparation, (3) data pre-processing, (4) TPU training, (5) experimental systems.

• Wikipedia-based dataset creation: We download the latest Wikipedia dump, which we pre-process following the methodology described in Section 3.4 to create the Wikipedia training dataset.

• MS Marco preparation: After downloading the training and development sets of MS Marco, we divide the development set into two sets: 100 queries for the development set and 6,880 queries for the test set, as explained in Section 3.3.

• Data pre-processing: Before training our different BERT models, all the training data is first converted to the input format that BERT needs, and then to the TFRecord format to boost the throughput of the TPUs.

• TPU training: Training a BERT model on a large amount of data is a tedious and computationally expensive procedure. For this reason, we train all models on Google Cloud using its TPUs.

• Experimental systems: Having both Wikipedia and MS Marco in TFRecord format, we train three different BERT models: (1) a model trained on Wikipedia data, (2) a model trained on both Wikipedia and MS Marco data, and (3) a model trained only on MS Marco data, as explained in Section 3.7.

3.1 Task

As mentioned earlier, the passage-reranking task is the second phase of a question answering pipeline. A question answering pipeline consists of three phases (Nogueira and Cho, 2019): (1) passage-retrieval, in which a large number of relevant passages are pooled using a cheap computational method (e.g. BM25 or TF-IDF), (2) passage-reranking, in which these documents are re-ranked by more sophisticated methods (such as neural networks), and (3) answer extraction, in which the top-n documents are fed to an answer extraction algorithm for marking up the answer.

3.2 BERT for Passage-Reranking

Given a list of retrieved passages, the aim of the passage-reranking phase is to calculate a relevance score s_i for each candidate passage d_i with respect to a query q. Using the theory presented in Section 2.5, BERT-Small is used as a binary classifier in a point-wise fashion; that is, the [CLS] vector is used as input to a single-layer neural network to obtain the probability of the passage being relevant, and the loss function looks at one document at a time, independently of the other documents.

More specifically, the query is fed to the classifier as sentence A and the passage text as sentence B. In addition, the query is truncated to a maximum of 64 tokens. The passage is also truncated such that the concatenation of the query and the passage amounts to at most 512 tokens. The publicly available pre-trained BERT model is used as the basis, and a re-ranker is fine-tuned using the cross-entropy loss:

L = -\sum_{j \in J_{pos}} \log(s_j) - \sum_{j \in J_{neg}} \log(1 - s_j)    (3.1)

in which J_{pos} is the set of indexes of the relevant passages and J_{neg} is the set of indexes of the non-relevant passages in the top-1,000 documents retrieved with BM25 (Nogueira and Cho, 2019).
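A minimal TensorFlow sketch of this point-wise objective follows (an illustration of Equation 3.1, not the actual training code of the thesis; the logits would come from the single-layer head on top of the [CLS] vector):

import tensorflow as tf

def reranker_loss(logits, labels):
    """Cross-entropy of Equation 3.1: labels are 1 for relevant, 0 for non-relevant passages."""
    scores = tf.sigmoid(logits)                        # s_j = P(passage is relevant)
    loss = -(labels * tf.math.log(scores) +
             (1.0 - labels) * tf.math.log(1.0 - scores))
    return tf.reduce_sum(loss)

logits = tf.constant([2.3, -1.7, 0.4])    # one logit per (query, passage) pair
labels = tf.constant([1.0, 0.0, 0.0])
print(reranker_loss(logits, labels).numpy())

In practice tf.nn.sigmoid_cross_entropy_with_logits computes the same quantity in a numerically stable way.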

3.3 Datasets

In this section, we describe the training datasets used in our experiments: (1) the MS Marco dataset, on which we train our in-domain BERT model, and (2) the Wikipedia-based dataset, on which we train our BERT model for the zero-shot experiments. In addition, we describe both the development and the test set.

3.3.1 MS Marco Training Dataset

The MS Marco dataset (Bajaj et al., 2016) is composed of ∼400 million query-passage pairs, in which the passages are marked as being relevant or irrelevant.

Nonetheless, in our experiments we use a smaller release of MS Marco, which is ∼10% of the full dataset (the original MS Marco dataset is more than 270 GB), to make our experiments easier (the dataset is called 'triples.train.small.tsv' and was downloaded from the official GitHub repository of Microsoft [3]). In addition, from this smaller MS Marco version we use only the first 20,000,000 query-passage pairs in order to be equal in size to the Wikipedia-based dataset that we describe next.

3. https://microsoft.github.io/MSMARCO-Passage-Ranking/

3.3.2 Wikipedia-based Training Dataset

The Wikipedia-based training dataset consists of 20,000,000 query-passage pairs, where each query is the concatenation of a Wikipedia article's title and one of its section titles, while the associated 'relevant' passage (in quotation marks since this is our assumption) is the section's passage. We describe our methodology for building this dataset in detail in Section 3.4.

3.3.3 Development and Test set

All experiments performed in this thesis are evaluated using the development and the test set from the MS Marco dataset. In the official MS Marco dataset, the development set contains 6,980 queries that are associated with the top 1,000 passages retrieved using BM25 from the MS Marco collection. Each query has on average one relevant passage, while some have none, since the corpus was initially constructed by retrieving the top-10 passages from the Bing search engine and then annotating them; for this reason some of the relevant passages might not be retrieved by BM25. The official MS Marco also contains a test set that consists of ∼6,800 queries and their top 1,000 retrieved passages, though its relevance judgements are not publicly available.

For the above reason, we create a test set out of the official development set by dividing it into two sets of 100 and 6,880 queries, which are used in the experiments as development and test set respectively. Using a development set of 100 queries instead of the original 6,980 does not pose a problem as far as size is concerned, since research has demonstrated that 50 queries is a sufficient minimum (Buckley and Voorhees, 2017).

3.4 Creating the Wikipedia-based dataset

The procedure for creating the artificial Wikipedia-based training corpus for IR used in our experiments is as follows:

Let W be a set of Wikipedia articles, W = \{w_1, w_2, \ldots, w_n\}, with each article w_i containing a main title t_i and j section titles sectionTitle_{i,j}, where each sectionTitle_{i,j} is associated with a section passage sectionPassage_{i,j}. Let Q be the set of artificial Wikipedia-based user queries, Q = \{q_{1,1}, q_{1,2}, \ldots, q_{i,j}\}, such that q_{i,j} is the concatenation of a Wikipedia title t_i with one of its sectionTitle_{i,j}. Let P be the set of associated relevant passages, P = \{p_{1,1}, p_{1,2}, \ldots, p_{i,j}\}, with p_{i,j} being the Wikipedia sectionPassage_{i,j} corresponding to sectionTitle_{i,j}; we call p_{i,j} the "relevant passage" for the query q_{i,j}. Let now Ψ be a sequence of associated irrelevant passages, Ψ = (ψ_{1,1}, \ldots, ψ_{i',j'}), with ψ_{i',j'} being a Wikipedia sectionPassage_{i',j'} such that sectionPassage_{i',j'} ≠ p_{i,j}; we call ψ_{i',j'} the "irrelevant passage" for the query q_{i,j}. Finally, let A be our artificial IR training dataset, that is, a set of triplets (q_{i,j}, p_{i,j}, ψ_{i',j'}) with q_{i,j} being the artificial user query, p_{i,j} its associated relevant passage, and ψ_{i',j'} the associated irrelevant passage:

A = \{(q_{1,1}, p_{1,1}, \psi_{1,1}), \ldots, (q_{i,j}, p_{i,j}, \psi_{i',j'})\}    (3.2)

We build the training dataset A in two steps: (1) we parse the Wikipedia dump and create a temporary file α with the (q_{i,j}, p_{i,j}) pairs; (2) we create another temporary file β with the irrelevant passages ψ_{i',j'} and concatenate the two files α and β, resulting in the training dataset A.

Creating the (q_{i,j}, p_{i,j}) pairs

In the first step, we start by downloading the latest Wikipedia dump from the official Wikipedia repository [4]. Since the Wikipedia dump comes in XML format, it needs to be parsed in order to obtain clean text. For this reason, we use an open-source Wikipedia parser [5] to clean the Wikipedia dump and obtain the plain text. We run the parser with the configuration --sections --filter_disambig_pages to preserve the sections of every article and to filter out Wikipedia's disambiguation pages, since they do not have any sections and therefore cannot be used to create queries. The procedure of extracting clean text from Wikipedia takes around 6-8 hours depending on the hardware.

4. https://dumps.wikimedia.org/enwiki/
5. https://github.com/attardi/wikiextractor

Having obtained the clean text, we use simple regex rules to parse it and extract the artificial queries (the article title and the section title concatenated, as mentioned before) together with their relevant passages (the section passages). For example, to identify a section we match the pattern 'Section::::' that the parser leaves in the clean text, and from there we extract the section passage.

Figure 3.1: Extracting the (q_{i,j}, p_{i,j}) pairs from a Wikipedia article. The extracted pairs from the example article would be ("MissingNo. History", "Developed [...] games"), ("MissingNo. Characteristics", "A player [...]"), etc.
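A simplified sketch of this extraction step follows (assuming wikiextractor output produced with --sections, where section headings appear on lines of the form "Section::::Title."; the regex and grouping are illustrative, not the exact thesis code):

import re

def extract_pairs(article_title, clean_text):
    """Yield (pseudo_query, relevant_passage) pairs from one parsed article."""
    # wikiextractor (run with --sections) marks headings as lines like "Section::::History."
    chunks = re.split(r"^Section::::(.+)$", clean_text, flags=re.MULTILINE)
    # chunks = [lead_text, title_1, passage_1, title_2, passage_2, ...]
    for i in range(1, len(chunks) - 1, 2):
        section_title = chunks[i].strip().rstrip(".")
        passage = chunks[i + 1].strip()
        if passage:                                   # skip empty sections
            yield (f"{article_title} {section_title}", passage)

text = "MissingNo. is a glitch...\nSection::::History.\nDeveloped by Game Freak...\n"
print(list(extract_pairs("MissingNo.", text)))
# [('MissingNo. History', 'Developed by Game Freak...')]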

After we finish this procedure for the entire Wikipedia dump, we end up with ∼12,000,000 query-passage pairs (q_{i,j}, p_{i,j}) saved in the file we named α.

Creating the irrelevant passages ψ_{i',j'}

Since we train a binary classifier (relevant vs. irrelevant), our training corpus A needs to have an associated irrelevant passage for each artificial query. We do this in the following intuitive way: we loop over the file α created in step (1), starting from the 800,000th index (a randomly picked offset), and we keep saving the associated passages until we reach 12,000,000 instances (the same number of instances as in file α). We save these instances in a new file called β. In that way we make sure that the passage in the i-th position of β is irrelevant to the p_{i,j} in the i-th position of α. As a final step, we concatenate the files α and β and end up with our artificial training dataset of 12,000,000 lines of triplets (q_{i,j}, p_{i,j}, ψ_{i',j'}), resulting in ∼20 gigabytes of data.
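A sketch of this offset-based negative sampling follows (illustrative only; file handling is simplified to in-memory lists, and the wrap-around at the end of the list is an assumption made for the sketch, with 800,000 being the offset mentioned above):

def build_triplets(alpha, offset=800_000):
    """alpha: list of (query, relevant_passage) pairs.
    Pair each query with a passage taken `offset` positions later (wrapping around),
    so the negative comes from a different article/section with high probability."""
    n = len(alpha)
    triplets = []
    for i, (query, positive) in enumerate(alpha):
        negative = alpha[(i + offset) % n][1]   # the beta passage paired with position i
        triplets.append((query, positive, negative))
    return triplets

# Toy example with a small offset:
pairs = [("t1 s1", "p1"), ("t1 s2", "p2"), ("t2 s1", "p3")]
print(build_triplets(pairs, offset=1))
# [('t1 s1', 'p1', 'p2'), ('t1 s2', 'p2', 'p3'), ('t2 s1', 'p3', 'p1')]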

3.5 Data pre-processing

Before training our models, we first need to convert the data to the appropriate BERT input format, and then to the TFRecord format to be consumed by the TPUs. The BERT format is required by the BERT model, while the TFRecord format is recommended for optimising the training of TensorFlow models [6].

6. https://cloud.google.com/architecture/best-practices-for-ml-performance-cost


Preprocessing happens in one pass: we wrote a Python file, which we called preprocessor.py, that accepts a list of strings as input and outputs a TFRecord file. Google AI provides the necessary code for these conversions in their official GitHub repository [7], which one can modify and customize according to their specific needs (e.g. there is no straightforward implementation for converting a string to the BERT format and then to the TFRecord format; rather, one must first get accustomed to the logic of Google AI's code before being able to re-use and adjust it).
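The following is a simplified sketch of such a conversion step (assuming token ids have already been produced by a BERT tokenizer; the feature names and the helper are illustrative and not the exact schema used by preprocessor.py):

import tensorflow as tf

def to_tf_example(input_ids, segment_ids, label):
    """Wrap one (query, passage) training instance as a tf.train.Example."""
    def int64_list(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
    features = {
        "input_ids":   int64_list(input_ids),
        "segment_ids": int64_list(segment_ids),
        "label":       int64_list([label]),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    example = to_tf_example(input_ids=[101, 2054, 2079, 102],  # toy token ids
                            segment_ids=[0, 0, 0, 0],
                            label=1)
    writer.write(example.SerializeToString())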

3.6 Training on TPUs

Despite their impressive performance, BERT models are quite slow at training and inference time (W. Liu et al., 2020). For this reason, to address the slow inference and training time of BERT in our experiments, we make use of Google's TPUs [8]. As an example, fine-tuning just one of our BERT models on a TPU takes around 20 hours for 20 gigabytes of data, while, based on Google's own metrics [9], training on a GPU would have taken 300 hours (∼13 days!). As far as inference time is concerned, the development set consisting of 100,000 instances (100 queries × 1,000 passages per query) takes around 5 minutes to be annotated by the BERT relevance classifier we fine-tuned on the TPU, while the test set consisting of 6,880,000 instances takes around 50 minutes. Using GPUs, the equivalent times would be ∼1 and ∼10 hours respectively.

Although Google provides free TPU usage through Google Colab [10], it was unfeasible to use it because of the limitations it imposes on continuous usage [11]. For this reason we used TPUs through Google Cloud. In total we spent $500 to perform all of our experiments (Google Cloud provides a $300 free trial).

3.7 Experimental Systems

To investigate our first research question, regarding the possibility of transfer learning with BERT in the context of IR by using Wikipedia as a pre-fine-tuning step, we use the following two systems:

• Default: This model is used "as is", with its "out-of-the-box" weights and without any additional training.

• Wiki: A model that we train with weak supervision on 20,000,000 query-passage pairs from our Wikipedia-based query-passage dataset.

For our second research question, regarding to what extent pre-fine-tuning for the passage-reranking task on Wikipedia data could improve a model in terms of (1) accuracy, (2) convergence or (3) stability, we train the following two models:

• Wi+Ma: A BERT-Small model that is first pre-fine-tuned with weak supervision on 20,000,000 query-passage pairs from our Wikipedia-based dataset and then further fine-tuned on an additional 20,000,000 instances from the MS Marco dataset.

• Marco: A BERT-Small model that is fine-tuned only on 20,000,000 instances from the MS Marco dataset.

7. https://github.com/tensorflow/models/tree/master/official/nlp/data
8. https://cloud.google.com/tpu/
9. https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip
10. https://colab.research.google.com/
11. https://research.google.com/colaboratory/faq.html#resource-limits

3.8 Experimental Settings

Following the settings of Nogueira and Cho (2019), a BERT-Small model is fine-tuned using a TPU v2-8 on Google Cloud, with a batch size of 128 (128 sequences × 512 tokens = 65,536 tokens/batch) when we fine-tune the model on MS Marco; a batch size of 32 is used when we fine-tune on Wikipedia.

Adam (Kingma and Ba, 2014) is used with the initial learning rate set to 3 × 10^{-6}, β_1 = 0.9, β_2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. A dropout probability of 0.1 is applied on all layers. For the fine-tuning on Wikipedia we change the learning rate to 10^{-6}.
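A sketch of this optimiser configuration in TensorFlow/Keras follows (illustrative only; the warmup-then-linear-decay schedule is written out by hand, the total step count is a placeholder that depends on dataset and batch size, and the weight decay term is omitted for brevity):

import tensorflow as tf

class WarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0 at total_steps."""
    def __init__(self, peak_lr, warmup_steps, total_steps):
        self.peak_lr, self.warmup_steps, self.total_steps = peak_lr, warmup_steps, total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * step / self.warmup_steps
        decay = self.peak_lr * (self.total_steps - step) / (self.total_steps - self.warmup_steps)
        return tf.maximum(0.0, tf.minimum(warmup, decay))

schedule = WarmupLinearDecay(peak_lr=3e-6, warmup_steps=10_000,
                             total_steps=100_000)   # total_steps: placeholder value
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule, beta_1=0.9, beta_2=0.999)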

3.9 Evaluation Methods

In the passage-ranking task we want to measure at which position the trained retrieval models rank the relevant passage that answers the user's query. As mentioned in Section 3.3.3, every query is associated with 1,000 passages containing at most one relevant passage, which is the answer to the user's question. In addition, the relevance judgment associated with the relevant passage is binary (relevant, non-relevant). Since in our experiments we are interested in finding the first relevant passage in the list, we picked MRR as our evaluation method, as it puts a high focus on the first relevant element of the list. Also, using MRR makes our results comparable with the results of other papers on the MS Marco leaderboard, since the official MS Marco web page for the passage-reranking task states that the evaluation metric should be MRR [12].

12. https://microsoft.github.io/MSMARCO-Passage-Ranking/


4 Results

In this section, we report the results of our experiments. Table 4.1 shows the results on the test and development sets, while Figure 4.1 shows the performance of the models on the development set when trained on different amounts of data. Figure 4.2 quantifies the instability of the BERT models by reporting the results of 10 runs for each model. Note that Figures 4.1 and 4.2 do not report any results for the Default model since, as mentioned, it is not trained on any data.

4.1 Accuracy

Regarding the results for our first research question, we observe in Table 4.1 that our Wiki model achieves a score of 0.196 on the test set, which is almost +10 points higher than the performance of the Default model, which achieves 0.104 on the test set and 0 on the development set. Turning to the results for the first part of our second research question, namely whether more accuracy can be gained by pre-fine-tuning on Wikipedia data, we can see that our Wi+Ma and Marco models have practically the same performance, both achieving a score of ∼0.36 in MRR@10. Therefore, we can argue that no accuracy is gained by pre-fine-tuning on Wikipedia.

System    Dev (100 queries)    Test (6,880 queries)
Default   0.000                0.104
Wiki      0.197                0.196
Wi+Ma     0.370                0.3610
Marco     0.350                0.3598

Table 4.1: Accuracy: Results (MRR@10) on the development and test sets.

4.2 Convergence

In Figure 4.1 we observe the results regarding convergence, i.e. whether the model's loss function moves towards the minimum with a decreasing trend. This concerns the second part of our research questions: whether any convergence gain would be obtained by pre-fine-tuning on Wikipedia. We observe that our Wi+Ma model does not converge faster; it steadily increases its performance with more data in the same fashion as the Marco and Wiki models do.

Figure 4.1: Convergence: Results on the development set when training on different amounts of data (between 1k and 20 million). For the results between 1k and 1 million we report the best score over 10 runs.

4.3 Stability

In Figure 4.2 we observe the results regarding stability, which concerns the last part of our research questions: whether any stability would be gained by pre-fine-tuning on Wikipedia. Stability is defined as whether the model leads to a smaller standard deviation of the fine-tuning accuracy (Devlin et al., 2018; Dodge et al., 2020). The plot shows an impressive gain in stability for our Wi+Ma model, since all the lines of the 10 runs overlap, creating a solid blue line. Turning to the other two models, Marco and Wiki, we see that their performance is unstable, hence their lines are scattered in the plot, but they start to gradually converge as they see more data.
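To make this definition operational, the short Python sketch below computes stability as the standard deviation of the development accuracy over repeated fine-tuning runs of the same configuration (lower means more stable). The scores in the example are placeholder values for illustration, not numbers from our runs.

from statistics import mean, pstdev

def stability(scores_per_run):
    """Standard deviation of dev accuracy (e.g. MRR@10) across repeated fine-tuning runs."""
    return pstdev(scores_per_run)

scores_per_run = [0.35, 0.36, 0.36, 0.37, 0.35]   # placeholder values, not actual results
print(f"mean={mean(scores_per_run):.3f}, std={stability(scores_per_run):.3f}")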

4.4 Analysis

Zero-shot

Our hypothesis was confirmed: Wikipedia, with its generic knowledge, can be utilised to create a generic IR training dataset that imitates users' queries and can be leveraged, through BERT and its transfer-learning capabilities, to train a generic retrieval model. Our model achieved a +10-point gain in MRR@10 over the Default model when tested on out-of-domain data, MS Marco, a dataset characterised by real-world queries that are quite diverse. In addition, we can observe from Figure 4.1 that our Wiki model has an even better score (+6) than the Marco model (which uses in-domain data) when the latter is trained on 10k-50k training instances.

Accuracy and Convergence

Turning to our second hypothesis, regarding the potential benefits of pre-fine-tuning a BERT model for the passage-reranking task on our Wikipedia dataset, the results showed that, as far as accuracy and convergence are concerned, both models are equal. That was something we did not necessarily expect, and it leaves room for improvement. One possible explanation for our model not performing better is that the queries constructed by concatenating the Wikipedia article titles with their section titles better resemble the task of ad-hoc retrieval, which is characterised by short queries with potentially unclear intent (Mitra and Craswell, 2017), while the MS Marco dataset was created to address the need for evaluating systems against natural language questions (Bajaj et al., 2016). Also, there is a lot of hyperparameter optimization that we did not address in this thesis, such as the length of the Wikipedia passage, which was set to 512 tokens, or the batch size, which was set to 32. Last but not least, observing Figure 4.1 we see a strange behaviour of Wiki when trained on 100k data: its accuracy decreases. This might be an indication that the way the query-passage pairs are constructed introduces some noise into the data that makes the model decrease its performance. Another potential reason might be the nature of the MS Marco dataset per se. MS Marco is a generic web dataset with all sorts of questions and answers, thus the knowledge incorporated in Wikipedia's articles might be just a subset of the knowledge incorporated in the MS Marco dataset. This would mean that when the BERT model is exposed to MS Marco for training, there are many events that the model has never seen before and thus it cannot converge faster or achieve higher accuracy.

Figure 4.2: BERT instability. Development accuracy in 10 runs.

Instability

Turning to the instability of the BERT models, it has been mentioned in the literature that, despite the significant success of the BERT models, the process of fine-tuning them (especially on small datasets) remains unstable (Devlin et al., 2018; Dodge et al., 2020).

Potentially, the reason behind this lies in the nature of the fine-tuning phase. A new task-specific layer replaces the original one and the complete model is fine-tuned. In that way, new sources of randomness are introduced: the weight initialization of the new output layer and the data order in the stochastic fine-tuning optimization (T. Zhang et al., 2020). Such factors have been claimed to influence the results significantly (Dodge et al., 2020; Phang et al., 2018), in particular on small datasets (e.g., < 10,000 samples). Hence, for such small datasets, practitioners resort to conducting several random trials of fine-tuning and selecting a model based on validation performance (Devlin et al., 2018). This increases model deployment costs and time, while making scientific comparison challenging (Dodge et al., 2020).

The results shown in Figure 4.2 for our Wi+Ma model indicate that a BERT model trained on large amounts of Wikipedia data increases its stability, whereas the behaviour of the other two models is quite unstable, especially when the size of the data is small. The reason behind the stability of the Wi+Ma model probably lies in the following: as we see from the other two models, Marco and Wiki, they tend to be quite unstable, though as they are fed more data (e.g. 1 million instances) they start to stabilize and their lines to overlap. By extrapolating the lines we can hypothesize that with more data these models reach convergence as far as stability is concerned. So, by using the Wiki model as the basis for fine-tuning on in-domain data, the model starts from weights that have already been stabilized. Unfortunately, due to the expense of the experiments we could not run each configuration 10 times for more than 1 million instances. Our results are aligned with and confirm the findings of Phang et al. (2018), who showed that fine-tuning the model on a large intermediate task stabilizes the later fine-tuning on small datasets. In addition, another speculation regarding the stability gained by the Wi+Ma model is that, as we mentioned in "Accuracy and Convergence", the knowledge inside Wikipedia is a subset of the knowledge in the MS Marco dataset. Therefore, when we train on MS Marco there are enough events that the BERT model has already seen before in Wikipedia, and this leads the model to develop denser weights and to have a smaller standard deviation in accuracy.


5 Conclusion

In this thesis, our hypothesis was that transfer learning in the context of IR is possible by leveraging a generic knowledge database such as Wikipedia to create an IR training dataset, so that training a deep neural model on such a dataset would prove beneficial for later retrieval tasks, since the model would incorporate generic retrieval knowledge. Our aim was to address the shortage of data when training data-hungry neural models for domains where there is limited or no data at all. To that end, we proposed a simple and intuitive technique for transfer learning in the context of IR: we leveraged Wikipedia's structure to create a generic, weakly supervised IR training dataset.

Our hypothesis suggested two research questions that examine the impact of the artificial Wikipedia-based dataset on deep neural retrieval models from two different angles. The first question concerned the zero-shot scenario, i.e. whether a BERT model trained only on the Wikipedia-based IR dataset would have higher accuracy than a default BERT model. The second concerned the pre-fine-tuning on Wikipedia scenario, i.e. whether pre-fine-tuning a BERT model on the Wikipedia-based IR dataset would yield any improvements in the later fine-tuning stages with in-domain data in terms of accuracy, convergence, or stability.

First of all, our hypothesis was confirmed: transfer learning in the context of IR is possible by utilising Wikipedia to cheaply create an IR dataset capable of training a data-hungry deep neural model. Our zero-shot experiments on MS Marco showed that the BERT model trained only on Wikipedia surpassed the default BERT model by a large margin (+10 points difference in MRR@10). In addition, experiments on the development set indicated that our "Wiki" model performs as well as or better than a BERT model trained only on in-domain data when the size of the data ranges between 10k and 50k samples.

At the same time, even though we showed that transfer learning in the context of IR using Wikipedia is possible, we could not show that our zero-shot approach is more efficient in a real-world scenario. To put it more simply: "should someone employ a BERT model trained on Wikipedia for their domain-specific application?" We answer with confidence "not yet, since more experimentation is needed with the traditional approaches". One limitation of our experiments is that we did not compare our "Wiki" BERT model with any of the unsupervised traditional retrieval approaches such as the BM25 algorithm or TF-IDF. These traditional approaches are not only less computationally expensive than the neural approaches, as they do not require training data (even though BM25 offers some tuning parameters), but at the same time whether such traditional approaches are less effective than neural approaches is still an open issue. As a matter of fact, a lot of criticism has emerged in the IR community about whether neural ranking models are improving retrieval effectiveness in limited-data scenarios (Lin, 2019; W. Yang et al., 2019); the authors argue that the emerging "neural hype" has been compared against weak baselines and its demonstrated "wins" are open to question. For this reason, in our future work we will focus not only on comparing our neural models against traditional approaches, such as BM25, but also on making sure that the compared baselines are not "weak". A step towards creating strong baselines for IR is (1) fine-tuning the parameters of the BM25 algorithm using a development set, and (2) employing query expansion using graph traversals (Grainger et al., 2016); a sketch of step (1) follows below.
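As an illustration of step (1), the Python sketch below grid-searches BM25's k1 and b parameters on a development set and keeps the pair with the best MRR@10. It uses the third-party rank_bm25 package and invented toy data; it is a sketch of the intended procedure, not an experiment we actually ran.

from rank_bm25 import BM25Okapi

# Toy corpus and development queries; each dev query is paired with the index of its relevant passage.
passages = ["the cat sat on the mat", "bm25 is a ranking function", "bert is a language model"]
dev_queries = [("what is bm25", 1), ("what is bert", 2)]
tokenized_passages = [p.split() for p in passages]

def dev_mrr(bm25, queries):
    """MRR@10 of a BM25 model on the toy development queries."""
    total = 0.0
    for query, relevant_idx in queries:
        scores = bm25.get_scores(query.split())
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
        if relevant_idx in ranked:
            total += 1.0 / (ranked.index(relevant_idx) + 1)
    return total / len(queries)

# Grid search over k1 and b, keeping the setting with the highest dev MRR@10.
best = max(
    ((k1, b, dev_mrr(BM25Okapi(tokenized_passages, k1=k1, b=b), dev_queries))
     for k1 in (0.6, 0.9, 1.2, 1.5) for b in (0.3, 0.5, 0.75, 0.9)),
    key=lambda t: t[2],
)
print("best k1=%.1f, b=%.2f, dev MRR@10=%.3f" % best)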


Turning to the second line of experiments, regarding the potential gains of pre-fine-tuning on Wikipedia, we found that it stabilizes the model, while no improvement is gained in terms of accuracy or convergence. Regarding the failure to gain more accuracy or convergence, one limitation of our research is that we only used one dataset, MS Marco, for our experiments. A potential issue posed by this dataset is that it is generic in nature, as it is a web dataset characterised by all sorts of generic questions and answers. This might be a probable reason why we did not see any improvement in convergence or accuracy: the knowledge incorporated in the Wikipedia-based dataset is just a subset of the knowledge incorporated in the MS Marco dataset. In the future, we propose the following two ways to test this hypothesis: (1) test on different datasets from different domains, and (2) pre-fine-tune a BERT model on MS Marco and evaluate zero-shot on a Wikipedia test dataset.

To summarise, the contributions of this thesis are the following. We found strong evidence that transfer learning is possible in the context of IR by leveraging Wikipedia to build an artificial IR training dataset. We also found evidence that pre-fine-tuning on such datasets reduces BERT's instability. However, as far as addressing the lack of data in domains with little or no training data is concerned, we could not prove that our technique is more effective than a computationally inexpensive unsupervised approach, such as BM25, in the absence of any comparisons with such traditional approaches. Our future work will be (1) investigating the issue of "weak baselines" in the IR literature by performing systematic literature reviews, and (2) contributing to building "strong" baselines for the IR community for the proper comparison of neural ranking models.


Bibliography

Aklouche, Billel, Ibrahim Bounhas, and Yahya Slimani (2019). "BM25 Beyond Query-Document Similarity". In: International Symposium on String Processing and Information Retrieval. Springer, pp. 65–79.

Aktolga, Elif, James Allan, and David A Smith (2011). "Passage reranking for question answering using syntactic structures and answer types". In: European Conference on Information Retrieval. Springer, pp. 617–628.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). "Neural machine translation by jointly learning to align and translate". arXiv preprint arXiv:1409.0473.

Bajaj, Payal, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. (2016). "MS MARCO: A human generated machine reading comprehension dataset". arXiv preprint arXiv:1611.09268.

Banerjee, Protima and Hyoil Han (2009). "Language Modeling Approaches to Information Retrieval". Journal of Computing Science and Engineering 3.3, pp. 143–164.

Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent (2000). "A neural probabilistic language model". In: Proceedings of the 13th International Conference on Neural Information Processing Systems, pp. 893–899.

Bing, Lidong, Sneha Chaudhari, Richard C Wang, and William Cohen (2015). "Improving distant supervision for information extraction using label propagation through lists". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 524–529.

Brenner, Eliot, Jun Zhao, Aliasgar Kutiyanawala, and Zheng Yan (2018). "End-to-End Neural Ranking for eCommerce Product Search: an application of task models and textual embeddings". arXiv preprint arXiv:1806.07296.

Buckley, Chris and Ellen M Voorhees (2017). "Evaluating evaluation measure stability". In: ACM SIGIR Forum. Vol. 51. 2. ACM New York, NY, USA, pp. 235–242.

Burges, Chris, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender (2005). "Learning to rank using gradient descent". In: Proceedings of the 22nd international conference on Machine learning, pp. 89–96.

Craswell, Nick (2009). "Mean Reciprocal Rank". In: Encyclopedia of database systems. Vol. 1703, pp. 1703–1703.

Cui, Hang, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua (2005). "Question answering passage retrieval using dependency relations". In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 400–407.

Kingma, Diederik P and Jimmy Ba (2014). "Adam: A method for stochastic optimization". arXiv preprint arXiv:1412.6980.

Dehghani, Mostafa, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft (2017). "Neural ranking models with weak supervision". In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 65–74.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018). "BERT: Pre-training of deep bidirectional transformers for language understanding". arXiv preprint arXiv:1810.04805.


Dodge, Jesse, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith (2020). "Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping". arXiv preprint arXiv:2002.06305.

Fain, Daniel C and Jan O Pedersen (2006). "Sponsored search: A brief history". Bulletin of the American Society for Information Science and Technology 32.2, pp. 12–13.

Frej, Jibril, Didier Schwab, and Jean-Pierre Chevallet (2019). "WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset". arXiv preprint arXiv:1912.01901.

Grainger, Trey, Khalifeh AlJadda, Mohammed Korayem, and Andries Smith (2016). "The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain". In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, pp. 420–429.

Guo, Jiafeng, Yixing Fan, Qingyao Ai, and W Bruce Croft (2016). "A deep relevance matching model for ad-hoc retrieval". In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 55–64.

Guo, Jiafeng, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W Bruce Croft, and Xueqi Cheng (2020). "A deep look into neural ranking models for information retrieval". arXiv preprint arXiv:1903.06902v1.

Han, Shuguang, Xuanhui Wang, Mike Bendersky, and Marc Najork (2020). "Learning-to-Rank with BERT in TF-Ranking". arXiv preprint arXiv:2004.08476.

Han, Xianpei and Le Sun (2016). "Global distant supervision for relation extraction". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. 1.

Hernández-González, Jerónimo, Iñaki Inza, and Jose A Lozano (2016). "Weak supervision and other non-standard classification problems: a taxonomy". Pattern Recognition Letters 69, pp. 49–55.

Hoffmann, Raphael, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld (2011). "Knowledge-based weak supervision for information extraction of overlapping relations". In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 541–550.

Hofstätter, Sebastian, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury (2020). "Improving efficient neural ranking models with cross-architecture knowledge distillation". arXiv preprint arXiv:2010.02666.

Ji, Zongcheng, Zhengdong Lu, and Hang Li (2014). "An information retrieval approach to short text conversation". arXiv preprint arXiv:1408.6988.

Jing, Kun and Jungang Xu (2019). "A survey on neural network language models". arXiv preprint arXiv:1906.03591.

Kamphuis, Chris, Arjen P de Vries, Leonid Boytsov, and Jimmy Lin (2020). "Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants". In: European Conference on Information Retrieval. Springer, pp. 28–34.

Lin, Jimmy (2019). "The neural hype and comparisons against weak baselines". In: ACM SIGIR Forum. Vol. 52. 2. ACM New York, NY, USA, pp. 40–51.

Lin, Jimmy, Rodrigo Nogueira, and Andrew Yates (2020). "Pretrained transformers for text ranking: BERT and beyond". arXiv preprint arXiv:2010.06467.

Liu, Tie-Yan (2011). Learning to rank for information retrieval. Springer Science & Business Media.

Liu, Weijie, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju (2020). "FastBERT: a self-distilling BERT with adaptive inference time". arXiv preprint arXiv:2004.02178.

Ma, Ji, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald (2020). "Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation". arXiv preprint arXiv:2004.14503.


Mei, Hongyuan, Mohit Bansal, and Matthew R Walter (2016). "Coherent dialogue with attention-based language models". arXiv preprint arXiv:1611.06997.

Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur (2010). "Recurrent neural network based language model". In: Eleventh annual conference of the international speech communication association.

Mitra, Bhaskar and Nick Craswell (2017). "Neural models for information retrieval". arXiv preprint arXiv:1705.01509.

Mitra, Bhaskar, Nick Craswell, et al. (2018). An introduction to neural information retrieval. Now Foundations and Trends.

Mitra, Bhaskar, Grady Simon, Jianfeng Gao, Nick Craswell, and Li Deng (2016). "A proposal for evaluating answer distillation from web data". In: Proceedings of the SIGIR 2016 WebQA Workshop.

Nogueira, Rodrigo and Kyunghyun Cho (2019). "Passage Re-ranking with BERT". arXiv preprint arXiv:1901.04085.

Pass, Greg, Abdur Chowdhury, and Cayley Torgeson (2006). "A picture of search". In: Proceedings of the 1st international conference on Scalable information systems, 1–es.

Phang, Jason, Thibault Févry, and Samuel R Bowman (2018). "Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks". arXiv preprint arXiv:1811.01088.

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever (2019). "Language models are unsupervised multitask learners". OpenAI blog 1.8, p. 9.

Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang (2016). "SQuAD: 100,000+ questions for machine comprehension of text". arXiv preprint arXiv:1606.05250.

Robertson, Stephen and Hugo Zaragoza (2009). The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.

Schütze, Hinrich, Christopher D Manning, and Prabhakar Raghavan (2008). Introduction to information retrieval. Vol. 39. Cambridge University Press Cambridge.

Severyn, Aliaksei and Alessandro Moschitti (2015). "Twitter sentiment analysis with deep convolutional neural networks". In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 959–962.

Turtle, Howard and W Bruce Croft (1989). "Inference networks for document retrieval". In: Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 1–24.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin (2017). "Attention is all you need". Advances in neural information processing systems 30, pp. 5998–6008.

Yang, Liu, Minghui Qiu, Swapna Gottipati, Feida Zhu, Jing Jiang, Huiping Sun, and Zhong Chen (2013). "CQArank: jointly model topics and expertise in community question answering". In: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pp. 99–108.

Yang, Wei, Kuang Lu, Peilin Yang, and Jimmy Lin (2019). "Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models". In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp. 1129–1132.

Yue, Yisong, Thomas Finley, Filip Radlinski, and Thorsten Joachims (2007). "A support vector method for optimizing average precision". In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 271–278.

Zamani, Hamed (2019). "Neural Models for Information Retrieval without Labeled Data". PhD thesis. University of Massachusetts Amherst.


Zhang, Kaitao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu (2020). "Selective Weak Supervision for Neural Information Retrieval". In: Proceedings of The Web Conference 2020, pp. 474–485.

Zhang, Tianyi, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, and Yoav Artzi (2020).“Revisiting Few-sample BERT Fine-tuning”. arXiv preprint arXiv:2006.05987.
