BJUT at TREC 2016 OpenSearch Track: Search Ranking Based on Clickthrough Data

Cheng Li1, Zhen Yang1*, David Lillis2,3

1. College of Computer Science, Faculty of Information, Beijing University of Technology, China

2. Beijing-Dublin International College, Beijing University of Technology, China
3. School of Computer Science, University College Dublin, Ireland

[email protected]

Abstract

In this paper we describe our efforts for the TREC OpenSearch task. Our goal for this year is to evaluate the effectiveness of: (1) a ranking method using information crawled from an authoritative search engine; (2) search ranking based on clickthrough data taken from user feedback; and (3) a unified modeling method that combines knowledge from the web search engine and the users' clickthrough data. Finally, we conduct extensive experiments to evaluate the proposed framework on the TREC 2016 OpenSearch data set, with promising results.

Introduction

In this year's OpenSearch Track, our main aims are to: (1) build an efficient ranking function that uses information from a web search engine and the documents; (2) explore a novel method that uses the clickthrough data from user feedback to improve the performance of the ranking function; and (3) create a unified model that combines both of these. As search engines play an ever more crucial role in information consumption in our daily lives, ranking relevance is always a challenge. The advancing state of the art for search engines presents new relevance challenges, which drives us towards using user feedback (e.g. clicks) to optimize the retrieval performance of search engines.

In addressing the OpenSearch task, we first crawled the document information for each query from Google Scholar, mainly document titles. We used this information to reflect the content of each document. Next, word2vector was used to build matrices to represent the documents after a series of data processing operations, and training was performed on the dataset to generate a classifier for each query. Then, we used the classifiers to make judgments about the scores of documents and uploaded the rankings to the API. In addition, we trained a classifier on the dataset including the clickthrough data received as feedback from the API. Clickthrough data has been shown to be highly useful for generating search engine rankings (Schuth, Balog, and Kelly 2015; Joachims 2002; Radlinski, Kurup, and Joachims 2008; Agichtein, Brill, and Dumais 2006). Finally, we used both of the above classifiers to create a unified model based on structural risk minimization, and ranked the candidate documents.

The remainder of the paper is organized as follows: In Section 2, we present our approach for ranking using web information and clickthrough data. In Section 3, we report our experimental results. In Section 4, we conclude the paper.

Ranking System Framework

Figure 1 shows our system framework. It consists of four principal parts: (1) information gathering, (2) document annotation, (3) ranking model and unified model, and (4) results generation.

Information Gathering

The first step of the ranking process is to gather useful information. Candidate data was received from the Living Labs API, including the training set and test set, each query and its corresponding doclist (a list of candidate documents), and user feedback for each query. With regard to information from the web, we crawled a doclist for each query from Google Scholar. For each query, we crawled the top documents up to a maximum of 100 documents. Document titles are the primary information received during this process. For some queries, 100 documents were not available, but more than 70 were retrieved in each case. Moreover, we gathered a text corpus from Wikipedia to train the word2vector model. This corpus was approximately 11 GiB in size.
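The paper does not report how the word2vector model was trained, so the following is only a minimal sketch under assumed settings: it uses gensim 4's Word2Vec, a hypothetical corpus file path, and illustrative hyperparameters that are not taken from the paper.

# Minimal sketch: training a word2vec model on the Wikipedia text corpus.
# The corpus path and all hyperparameters are assumptions for illustration.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence("wikipedia_corpus.txt")   # one tokenized sentence per line (hypothetical file)
model = Word2Vec(sentences=corpus,
                 vector_size=200,               # assumed embedding dimensionality
                 window=5, min_count=5, workers=4)
model.save("word2vec_wiki.model")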

Document Annotation

The different search engines used in the OpenSearch task make different information available about each document. For instance, CiteSeerX provides only a single field with the full document text. In contrast, SSOAR returns many fields including abstract, author, description, identifier, language, etc. The Microsoft Academic Search results include abstract and URL. Document title is common to each, so this was used for the generation of the document matrices.

Before generating the document matrices, we preprocessed the data. Then, we trained the word2vector model with the corpus and used this model to construct document matrices for both the site documents and the Google Scholar documents.
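The text does not detail how the document matrices are assembled from the title words. One common choice, shown here purely as an illustrative sketch rather than the paper's stated construction, is to average the word vectors of the title tokens so that each document becomes one row of the matrix.

import numpy as np

def build_document_matrix(titles, w2v_model):
    # One row per document: the mean of the word2vec vectors of the title
    # tokens found in the model vocabulary (averaging is our assumption).
    dim = w2v_model.vector_size
    rows = []
    for title in titles:
        vectors = [w2v_model.wv[tok] for tok in title.lower().split()
                   if tok in w2v_model.wv]
        rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.vstack(rows)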

Two labels are applied to each document. One is based on its position in the Google Scholar results, and the other is

Figure 1: System Framework.

based on the clickthrough data. For the first label, we use five scores to represent different levels of the documents: higher-ranked documents should have higher scores than those ranked later by Google Scholar. Each document is given a score as follows (rank ranges are inclusive; a small illustrative mapping is sketched after the list):

• 5: ranked between position 1 and position 5

• 4: ranked between position 6 and position 10

• 3: ranked between position 11 and position 20

• 2: ranked between position 21 and position 50

• 1: ranked at position 51 or higher
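A direct rendering of this mapping, given only as an illustrative sketch (the function name is ours):

def google_rank_score(position):
    # Map a 1-based Google Scholar rank to the five-level score defined above.
    if position <= 5:
        return 5
    if position <= 10:
        return 4
    if position <= 20:
        return 3
    if position <= 50:
        return 2
    return 1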

The other label is based on clickthrough data. It can be understood as the probability that the document is clicked, and is defined by the following formula:

P(d) = 0.5 + \frac{d_t - d_f}{d_t + d_f + 0.5} \cdot \frac{0.5}{1 + e^{-k(d_t + d_f - s - 0.25)}} \qquad (1)

where P(d) is the probability for document d, d_t is the number of times the document was clicked, and d_f is the number of times the document was not clicked. The latter part of this formula is a logistic function that dampens the effect of clicks for small sample sizes (so that a document that has appeared once but has not been clicked will not get a probability of zero). The parameter k affects how quickly the function begins to rise and s is the shift factor; values of k = 0.33 and s = 10 were used. A document that has never been clicked begins with a probability of 0.5, and this probability rises or falls depending on the user clickthrough data.
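As a sketch, Equation (1) with the reported values k = 0.33 and s = 10 can be evaluated as follows (the function name is ours, not the paper's):

import math

def click_probability(dt, df, k=0.33, s=10):
    # dt: times the document was clicked, df: times it was not clicked.
    # The logistic damping term keeps P(d) close to 0.5 when dt + df is small.
    damping = 0.5 / (1.0 + math.exp(-k * (dt + df - s - 0.25)))
    return 0.5 + (dt - df) / (dt + df + 0.5) * damping

# Example: a document shown once without a click stays close to 0.5.
print(click_probability(0, 1))   # roughly 0.49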

Ranking Model and Unified Model

The ranking model consisted of the web information ranking model and the clickthrough data ranking model.

Existing search engines exhibit excellent performance in retrieving information; therefore the starting point for the model is to train the classifier using Google Scholar. We adopted the least squares method in our framework, which can learn a linear model to fit the training data (Hu et al. 2013). The square loss is:

L(H, I) = \|GH - I\|^2 \qquad (2)

where G is the content matrix of Google document data, H is the linear classifier, and I is the label matrix we defined. Square loss is a widely used method for text analytics.

However, simple square loss alone is insufficient for training a stable and robust classifier. For the documents, we observed that not all the words in titles are relevant to the query; instead, we can use some key words to represent the documents. A mature method for this is sparse learning, which has been used in various fields to obtain effective models. To ensure sparsity of the model, we used an L1-norm penalization on top of the square loss. To overcome overfitting, the most popular method is regularization; one of the most widely used approaches is ridge regression, which introduces an L2-norm penalization. The resulting problem can be solved by the elastic net, which performs automatic variable selection and continuous shrinkage, and selects groups of correlated variables.
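The text does not name a solver or library. As a minimal sketch, under the assumption that scikit-learn's ElasticNet (a combined L1/L2 penalty on a squared loss) is an acceptable stand-in, and with illustrative data sizes and regularization settings that are not the paper's:

import numpy as np
from sklearn.linear_model import ElasticNet

# Illustrative stand-ins: G plays the role of the Google content matrix and
# y the five-level score labels; sizes, alpha and l1_ratio are assumptions.
rng = np.random.default_rng(0)
G = rng.normal(size=(100, 200))
y = rng.integers(1, 6, size=100).astype(float)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio mixes the L1 and L2 penalties
enet.fit(G, y)                               # learns a sparse linear classifier H
scores = enet.predict(G)                     # predicted scores used for ranking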

This model is suitable for the web information ranking model, and can be used as the clickthrough data ranking model by replacing the Google document matrix G with the site document matrix C, and replacing the score labels with the second label outlined above.

A unified model was then created. As with the ranking model, we used square loss to train the model, which can be written as:

\min L(H_1, H_2) = \|GH_1 - GH_2\|^2 + \|CH_1 - CH_2\|^2 \qquad (3)

where H_1 and H_2 are the classifiers obtained from the ranking model. L2-norm penalization was then used to control the robustness of the learned model (Beck and Teboulle 2009).
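How the L2 penalty enters the optimization is not spelled out in the text, so the following only evaluates the objective of Equation (3) plus an assumed L2 term with weight lam (our notation):

import numpy as np

def unified_loss(H1, H2, G, C, lam=1.0):
    # Consistency terms of Equation (3): the two classifiers should score
    # documents similarly on both the Google matrix G and the site matrix C.
    consistency = np.linalg.norm(G @ H1 - G @ H2) ** 2 \
                + np.linalg.norm(C @ H1 - C @ H2) ** 2
    # Assumed L2-norm penalization for robustness (weight lam is illustrative).
    penalty = lam * (np.linalg.norm(H1) ** 2 + np.linalg.norm(H2) ** 2)
    return consistency + penalty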

Results Generation

For the training dataset, we used the classifier obtained from ridge regression to rank. For the test dataset, we first calculated the cosine similarity between the test queries and the training queries, which is defined by the following formula:

\mathrm{sim}(q_i, q_j) = \frac{q_i \cdot q_j}{\|q_i\| \cdot \|q_j\|} \qquad (4)

where q_i and q_j are two vectors representing a test query and a training query, and ||q|| is the norm (length) of the vector q. Then, after normalizing the cosine similarity vector, we obtain the classifiers for the test queries, as follows:

H_i = \sum_{j=1}^{n} \alpha_{ij} H_j \qquad (5)

where α_ij is the normalized cosine similarity weight, H_i is the test query classifier, and H_j is the j-th training query classifier.

If two queries are semantically similar to each other, they should have close representations in the word2vector model trained on the large-scale corpus. Considering this, we used Equation (5) to construct the classifiers and ranked the test dataset with them.
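As a sketch of Equations (4) and (5), assuming the similarity weights are normalized to sum to one (the exact normalization is not stated in the text):

import numpy as np

def test_query_classifier(q_test, train_queries, train_classifiers):
    # Cosine similarity between the test query vector and each training
    # query vector (Equation 4), normalized to give the weights alpha_ij.
    sims = np.array([np.dot(q_test, q) /
                     (np.linalg.norm(q_test) * np.linalg.norm(q))
                     for q in train_queries])
    alphas = sims / sims.sum()
    # Weighted combination of the training classifiers H_j (Equation 5).
    return sum(a * H for a, H in zip(alphas, train_classifiers))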

Experimental Results

In this section, we introduce the evaluation measures and our results.

Evaluation Measures

During the OpenSearch runs, CiteSeerX and SSOAR both use interleaved comparisons on their live websites. The specific type of interleaving used is Team Draft Interleaving (TDI), whereby the rankings produced by participating teams are interleaved with the current production ranking of the site. Users are shown this interleaved ranking, but are unaware of the origin of the results. The ranker (participant or production) that contributes more documents that users click is preferred.
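For reference, a simplified sketch of Team Draft Interleaving is shown below; the production implementation used by the Living Labs infrastructure differs in its details, so this is only illustrative:

import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    # Simplified Team Draft Interleaving: each round a coin flip decides which
    # ranking drafts first, then each side contributes its highest-ranked
    # document not yet in the interleaved list and is credited with it.
    rng = rng or random.Random(0)
    interleaved, credit = [], {}
    while True:
        order = ["a", "b"] if rng.random() < 0.5 else ["b", "a"]
        added = 0
        for team in order:
            source = ranking_a if team == "a" else ranking_b
            doc = next((d for d in source if d not in credit), None)
            if doc is not None:
                interleaved.append(doc)
                credit[doc] = team
                added += 1
        if added == 0:            # both rankings exhausted
            break
    return interleaved, credit

def interleaving_result(credit, clicked_docs):
    # Win/loss/tie from the team credited with each clicked document.
    clicks_a = sum(1 for d in clicked_docs if credit.get(d) == "a")
    clicks_b = sum(1 for d in clicked_docs if credit.get(d) == "b")
    return "win" if clicks_a > clicks_b else "loss" if clicks_a < clicks_b else "tie"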

OpenSearch used five metrics for evaluation, as follows (Schuth, Balog, and Kelly 2015):

• Impressions: the total number of times that rankings (for any of the test queries) from the given team have been displayed to users.

• Wins: a win occurs when the ranking of the participant has more clicks on results assigned to it by Team Draft Interleaving than clicks on results assigned to the ranking of the site.

• Losses: a loss is the opposite of a win.

• Ties: a tie occurs when the ranking of the participant obtains the same number of clicks as the ranking of the site.

• Outcome: Outcome is defined as:

\mathrm{Outcome} = \frac{\mathrm{wins}}{\mathrm{wins} + \mathrm{losses}} \qquad (6)

An outcome value below 0.5 means that the ranking of the participant performed worse than the ranking of the site (i.e., overall, it has more losses than wins).

Table 1: Training Data Performance.

type    Outcome   Wins   Losses   Ties   Impressions
train   0.4468    21     26       5      52

Table 2: Test Data Performance.

Round   Outcome   Wins   Losses   Ties   Impressions
1       0.3333    3      6        1      10
2       0.6       6      4        1      11
3       0.5432    44     37       15     96

Results Analysis

Two sets of results were obtained: one based on the training data and the other based on the test data.

Table 1 shows our results from CiteSeerX for the training data. The evaluation method is Team Draft Interleaving (TDI); it statistically tests whether the number of wins for the better retrieval function is indeed significantly larger, using a test against outcome ≤ 0.5. The outcome on the training data is 0.4468, which means the ranking function of the site performs better than ours, although not to a substantial degree.

Table 2 shows our results from CiteSeerX on the test data. There were three rounds in the test period. In the first round, the site performed better than our system; in the second round, our system performed better; and in the third round, with an outcome of 0.5432, our system again performed better. Across all rounds, our ranking function proved reasonably effective.

Conclusion

In this paper, we study the problem of optimizing the ranking function with clickthrough data. We used the word2vector model to represent the queries and the documents effectively, instead of the traditional vector space model or other models. A unified framework is proposed, incorporating web information and clickthrough data. In the optimization process, the popular regularization methods of elastic net regression and ridge regression are adopted. Experiments over long periods were conducted to evaluate the proposed framework on a real-world academic search engine, and the experimental results demonstrate the effectiveness of our proposed framework.

In future work, we will consider more features to better represent documents and conduct further extensive experiments to improve the robustness and stability of our method.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 61671030), the Excellent Talents Foundation of Beijing, the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD201404052), and the Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems (No. 15205).

References

Agichtein, E.; Brill, E.; and Dumais, S. 2006. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 19-26. ACM.

Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183-202.

Hu, X.; Tang, J.; Zhang, Y.; and Liu, H. 2013. Social spammer detection in microblogging. In International Joint Conference on Artificial Intelligence, volume 13, 2633-2639. Citeseer.

Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 133-142. ACM.

Radlinski, F.; Kurup, M.; and Joachims, T. 2008. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 43-52. ACM.

Schuth, A.; Balog, K.; and Kelly, L. 2015. Overview of the Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab 2015. In International Conference of the Cross-Language Evaluation Forum for European Languages, 484-496. Springer.