Recommender Systems for Scientific Publications: The Concept Drift and the Implicit Ratings Problem

Dissertation submitted in fulfillment of the requirements for the doctoral degree of the Faculty of Engineering (Technische Fakultät), Albert-Ludwigs-Universität Freiburg

Submitted by Anas Alzogbi

First reviewer and supervisor: Prof. Dr. Georg Lausen, Albert-Ludwigs-Universität Freiburg
Second reviewer: Prof. Dr. Dr. Lars Schmidt-Thieme, Universität Hildesheim

Date of the doctoral examination: 22.03.2019
Abstract
The rapidly increasing number of newly published scientific publications confronts scholars and researchers with the challenge of staying up to date and well informed about new findings in their domains. Recent studies have shown that more than one hundred thousand new papers in computer science are published each year, and that three times as many papers were published in 2010 as in 2000. Recommender systems (RS) have lately gained considerable attention as a powerful tool for providing personalized scientific paper recommendations. Owing to their capability of modeling the user's interests and exploring online archives for relevant papers, RS are a natural fit for this scenario. Various existing RS methods have been explored and successfully applied to scientific paper recommendation, ranging from Content-Based Filtering (CBF) to Collaborative Filtering (CF) and hybrid approaches.

In this thesis, we investigate suitable approaches for recommending scientific publications. Among the challenges facing recommender systems for scientific publications is the high sparsity of the users-items relation, driven by the large number of papers relative to the much lower number of users; this is known as the high sparsity of the rating matrix. Because this high sparsity is a major obstacle for CF methods, the focus of existing work has been on CBF approaches. Within these approaches, constructing the user's profile is mainly achieved with memory-based or heuristic-based methods. Memory-based methods construct the user profile by applying an aggregation function over the feature vectors of relevant papers. Such methods depend strongly on the assumptions underlying the employed aggregation method. In contrast, model-based approaches, since they rely on a learning algorithm, have the potential to build more representative user models. However, model-based approaches have not gained much attention as a solution for scientific publication recommendation, due to the following problems: (a) the low number of ratings available per user, which corresponds to a low number of training instances; (b) the available ratings are positive-only ratings, which gives rise to the one-class problem; and (c) the high cost of training and maintaining a separate model for each user in the system.

Moreover, the temporal aspect of the system adds extra challenges, considering the drifting interests of the users. Users' interests change over time; as a result, old ratings do not hold the same importance for the recommender system as recent ones. Therefore, not all available ratings (which are already few) are beneficial for learning the recommendation model. This aspect adds further complications to the previously mentioned challenges.
In this thesis, we focus on these issues, which we summarize in the following two challenges: the one-class problem and the concept drift in users' interests. Based on our survey of the related literature, we found that these issues have not gained enough attention among the works that developed recommender systems for scientific publications. We therefore investigate in this thesis the adaptation of ideas from several domains, such as machine learning and information retrieval, to design useful recommender systems for scientific publications. The contributions of this work can be grouped into four parts:
• First, we introduce a literature survey exploring the latest related works. The goal of this survey is twofold: first, to identify successful and promising existing recommendation approaches for scientific publications, and second, to investigate which works addressed the challenges targeted in this thesis.
• Second, we address the one-class problem and present two model-based, content-based recommendation approaches. In the first solution, we model the problem as a linear regression model and train a supervised model for each user. In the second solution, we investigate the application of pairwise preference learning to content-based filtering and present an approach based on pairwise learning-to-rank.
• Third, we address the efficiency problem of our model-based recommender and present a system design that builds on the widely known Apache Spark cluster computing framework. Our system allows the efficient computation of multiple models (one model per user) on a cluster of machines.
• In the last part of this work, we focus on the concept drift in users' interests. We first study the presence of concept drift in a real-world dataset and then present a time-aware recommender system that accounts for this drift.
Zusammenfassung
The rapidly growing number of newly published scientific publications confronts researchers with the challenge of continuously staying informed about new findings in their field of research. Recent studies have shown that more than one hundred thousand new papers in computer science are published each year, and that the number of published papers tripled between 2000 and 2010. Recommender systems have lately gained much attention as a powerful tool for generating personalized recommendations for scientific papers. Owing to their ability to model users' interests and to search online archives for relevant articles, recommender systems are the ideal solution for this scenario. Various approaches from existing recommender systems have already been explored and successfully applied to the recommendation of scientific papers, ranging from Content-Based Filtering (CBF) over Collaborative Filtering (CF) to hybrid approaches. In this thesis, we investigate suitable approaches for recommending scientific publications. Recommender systems for scientific publications face several challenges. Among them is the high sparsity between users and items, caused by the large number of articles compared to the much smaller number of users, a problem known as the high sparsity of the rating matrix. Since this high sparsity of the rating matrix is a problem for CF methods, the focus of existing work has been on CBF approaches. These approaches construct user profiles mainly on the basis of memory-based or heuristic methods. Memory-based methods construct the user profile by applying an aggregation function over the feature vectors of relevant articles. Such methods depend strongly on the underlying assumption regarding the employed aggregation method. Unlike memory-based methods, model-based approaches have the potential to build more representative user models, since they are based on a learning algorithm. However, they have received little attention as a solution for recommending scientific publications, due to the following problems: (a) the low number of available ratings per user, which corresponds to a low number of training instances; (b) the available ratings are positive-only ratings, which leads to the one-class problem; and (c) the high cost of learning and updating an individual model for each user in the system.

Furthermore, the temporal aspect of the system is clearly reflected in the constantly changing interests of the users. Users' interests change over time, so that old ratings do not have the same importance for the recommender system as the most recent ones. Therefore, not all available ratings (which are already few) can be used for learning the recommendation model. This aspect adds further complications to the previously mentioned problems. In this thesis, we address these issues, which we summarize in the following two challenges: the one-class problem and the concept drift in users' interests. Based on our structured literature survey, we found that these questions have not received enough attention among the works that developed recommender systems for scientific publications. In this thesis, we therefore investigate the application and extension of methods from various fields, such as machine learning and information retrieval, in order to develop useful recommender systems for scientific publications. The contribution of this work can be divided into four parts:
• First, we present a literature survey examining the related work. The goal of this survey is twofold: on the one hand, to identify successful and promising existing recommendation approaches for scientific publications, and on the other hand, to investigate which works address the challenges targeted here.

• Second, we address the one-class problem and present two model-based, content-based approaches. In the first solution, we model the problem as a linear regression model and train a supervised model for each user. In the second solution, we investigate the application of pairwise preference learning to content-based filtering and present an approach based on pairwise learning-to-rank.

• Third, we address the efficiency problem of our model-based recommender system and present a system design that builds on the widely used Apache Spark cluster computing framework. Our system enables the efficient computation of multiple models on a cluster of machines.

• In the last part of this work, we focus on the problem of concept drift in users' interests. We first examine the occurrence of concept drift in a real-world dataset and then present a time-aware recommender system that accounts for the concept drift in users' interests.
Contents
I. Preface

1. Introduction
   1.1. Motivation and Problem Description
   1.2. Problem Statement
   1.3. Thesis Contributions & Published Work
        1.3.1. Literature survey
        1.3.2. Addressing the one-class problem
        1.3.3. Addressing the concept drift in users' interests
   1.4. Thesis Outline

II. State-of-the-art: Recommender Systems for Scientific Papers

2. Recommender Systems
   2.1. Introduction
   2.2. Recommender Systems and the Users-items Interactions
        2.2.1. Explicit feedback
        2.2.2. Implicit feedback
   2.3. Overview of Recommendation Techniques
        2.3.1. Content-based filtering (CBF)
        2.3.2. Collaborative filtering and latent factor models
        2.3.3. Hybrid approaches
   2.4. Matrix Factorization in CF
        2.4.1. Least squares modeling
        2.4.2. Probabilistic matrix factorization
        2.4.3. Matrix factorization algorithms
   2.5. Implicit Feedback and One-class Problem

3. Literature Survey
   3.1. Introduction
   3.2. Relevant Papers Identification
   3.3. Paper Recommendation and Citation Recommendation
   3.4. Dimensions for Approaches Comparison
        3.4.1. Recommendation approaches
        3.4.2. Recommendation scenarios
        3.4.3. User modeling
        3.4.4. Publications data usage
        3.4.5. Publication representation
        3.4.6. Matching method
        3.4.7. Evaluation strategy
        3.4.8. Implicit feedback and time-aware
   3.5. Approaches for Recommending Scientific Publications
        3.5.1. Content-based filtering approaches
        3.5.2. Graph-based approaches
        3.5.3. Latent factor model approaches
        3.5.4. Hybrid approaches
        3.5.5. Preference learning approaches
        3.5.6. Cross-domain approaches
        3.5.7. Co-occurrence based approaches
   3.6. Discussion
   3.7. Conclusion

III. One-class Problem in Scientific Paper Recommender System

4. Content-based Filtering using Multi-variate Linear Regression
   4.1. Introduction
   4.2. PubRec Overview
   4.3. Papers Modeling
        4.3.1. Publication representation
        4.3.2. Keywords extraction
   4.4. Learning Algorithm and Recommendation Process of PubRec
        4.4.1. Importance score
        4.4.2. Learning user profile
        4.4.3. Recommendation generation
   4.5. Evaluation
        4.5.1. Dataset
        4.5.2. Experimental setup and results
   4.6. Conclusion

5. Pairwise Preference Learning for CBF Recommenders
   5.1. Introduction
   5.2. Learning-to-Rank Overview
        5.2.1. Pointwise learning-to-rank
        5.2.2. Pairwise learning-to-rank
        5.2.3. Listwise learning-to-rank
   5.3. Ranking in Recommender Systems
        5.3.1. LTR model for recommender systems
        5.3.2. Pairwise preferences to learn from positive-only feedback
        5.3.3. General model versus an individual model
   5.4. From Preference Pairs to Recommendations
        5.4.1. Notation and definitions
        5.4.2. The recommendation approach
        5.4.3. Model learning
        5.4.4. Recommendation generation
   5.5. Preference Pairs Validation
        5.5.1. Pruning based validation (PBV)
        5.5.2. Weighting based validation (WBV)
   5.6. Evaluation
        5.6.1. Dataset & experimental setup
        5.6.2. Evaluation metrics
        5.6.3. Results and discussion
   5.7. Conclusion

6. Simultaneous Model Learning for Multiple LTR Models
   6.1. Introduction
   6.2. Problem Definition
   6.3. Distributed Computation Framework
        6.3.1. Spark architecture
        6.3.2. SVM on Spark
   6.4. RankingSVM Recommender on Spark
        6.4.1. Sequential models learning
        6.4.2. Parallel models learning
   6.5. Efficiency Analysis for PML and SML
        6.5.1. Map computations
        6.5.2. Shuffle
        6.5.3. Reduce computations
        6.5.4. Conclusion of complexity analysis
   6.6. Evaluation
        6.6.1. Dataset
        6.6.2. Experiments and results discussion
   6.7. Conclusion

IV. Time-aware Recommendations

7. Concept Drift Detection in Users Behavior
   7.1. Introduction
   7.2. Concept Drift Definition
   7.3. Concept Drift Detection
   7.4. Detecting Concept Drift for Publication Recommendation
        7.4.1. Representation model of papers
        7.4.2. Drift points identification
   7.5. Citeulike Dataset
   7.6. Concept Drift in Citeulike Dataset
        7.6.1. Parameters selection
        7.6.2. Analyzing users behavior
        7.6.3. Users with similar behavioral patterns
        7.6.4. Drift points detection in citeulike dataset
   7.7. Conclusion

8. Time-aware Collaborative Topic Regression
   8.1. Introduction
   8.2. Related Work
   8.3. Problem Statement and Preliminaries
        8.3.1. Notation and problem statement
        8.3.2. Collaborative topic regression (CTR)
   8.4. Time-aware Collaborative Topic Regression (T-CTR)
        8.4.1. Concept drift score
        8.4.2. Confidence weights
        8.4.3. Model learning and prediction
   8.5. Evaluation and Discussion
        8.5.1. Dataset
        8.5.2. Experimental setup
        8.5.3. Time-aware vs time-ignorant evaluations
        8.5.4. Baselines comparison
        8.5.5. User-specific vs common concept drift scores
   8.6. Conclusion

V. Discussion

Bibliography
List of Tables
3.1. The mapping of reviewed papers. Part 1
3.1. The mapping of reviewed papers. Part 2
3.2. Reviewed papers distribution over the recommendation approaches
3.3. The mapping of reviewed papers. Part 3
3.3. The mapping of reviewed papers. Part 4
4.1. Performance comparison
5.1. Performance comparison between WBV and baselines
5.2. Performance comparison between WBV, LR and SVM
6.1. Cost comparison between PML and SML for reduce operations
6.2. Fold statistics
6.3. Training time of SML and PML
7.1. Setting the topic significance threshold
7.2. User groups in citeulike dataset
8.1. Citeulike dataset statistics for each fold
8.2. Number of intervals in the dataset for each fold
List of Figures

2.1. Schematic illustration of content-based filtering recommender system
2.2. Schematic illustration of collaborative filtering recommender system
2.3. Matrix factorization
3.1. Scholarly data usage in the reviewed recommendation papers
3.2. Publication representation in the reviewed works
3.3. The matching methods in the reviewed works
3.4. The evaluation methods in the reviewed works
4.1. Overview of the recommendation approach (PubRec)
4.2. Scientific publications data structure
4.3. Decay function
4.4. Linear regression modeling
4.5. Parameters tuning
4.6. MRR results
5.1. Example of LTR modeling for recommender systems
5.2. Peer papers and preference pairs formulation
5.3. Overview of the proposed approach steps
5.4. Performance comparison between WBV, PBV, LR and SVM
6.1. Spark architecture
6.2. Sequential model learning
6.3. Parallel model learning
6.4. Shuffle operation example
6.5. Training time for SML and PML
7.1. The rating series and the rating graph
7.2. Terms-papers distributions in citeulike dataset
7.3. The distribution of representative topics per paper
7.4. The distribution of representative topics per paper
7.5. Analysis of users behavior in the citeulike dataset
7.6. Number of drift points in each user group
7.7. Number of drift points per duration for each user group
7.8. Number of drift points per ratings for each user group
8.1. Computing the pairwise similarities in the rating series
8.2. Concept drift score influence on the decay function
8.3. Time-aware and time-ignorant splits
8.4. Performance comparison of CTR between time-aware and time-ignorant splits
8.5. Performance comparison for T-CTR and the baseline methods
8.6. User-specific compared to common concept drift score
Part I. Preface

1. Introduction
Contents
1.1. Motivation and Problem Description
1.2. Problem Statement
1.3. Thesis Contributions & Published Work
1.4. Thesis Outline
1.1. Motivation and Problem Description
Modern research is remarkably boosted by the research-supporting tools available to researchers today. Thanks to digital libraries, researchers throughout the world have the opportunity to boost their work by accessing a large body of human knowledge with little effort. However, the sheer amount of rapidly published scientific publications in all science disciplines overwhelms researchers and scholars with a large number of potentially relevant and important scientific publications. Digital libraries and online archives for scientific publications typically offer the possibility to explore their archives through keyword-based and bibliography-based search. Such methods are important and provide a way to explore the libraries and locate potentially relevant content. However, the effectiveness of such traditional information retrieval methods lies in the hands of the users. In the absence of user modeling, users are expected to guess the correct search terms in order to find potentially relevant papers, which opens the door to missing relevant papers that are indexed under different terms.

This sheds light on the need for tools that can understand and model users' interests, and then explore the online archives to discover relevant scientific publications. Recommender systems are the natural fit for this problem, and a considerable amount of research has been done in the last decade with the goal of designing suitable recommender systems for this task [BGLB16]. The primary goal of employing recommender systems here is twofold: first, discovering relevant research publications that would otherwise not be found by the user; second, allowing researchers and scholars to concentrate more on the actual research work by
lifting the searching and exploring tasks off their shoulders.

The fundamental problem that recommender systems in general try to solve is to estimate the utility of a particular item (the candidate item) for a given user (the active user). To achieve this task, recommender systems leverage any combination of the following information: information related to the user, to the item, to the previous interactions between users and items, or information related to the context in which the recommendation is requested, such as time or purpose. Thus, recommender systems deal with two types of entities: the user and the item.

In scientific publication recommendation, the items are strictly scientific publications. This type of item has particular characteristics which push towards specific design decisions, owing to the challenges and opportunities associated with these characteristics. Furthermore, the challenges do not stem from the papers alone; there are also challenges related to the users and their interactions with the scientific publications. We can summarize the challenges that a recommender system faces when recommending scientific publications as follows:
1. Challenges related to scientific publications:
Scientific publications are mainly represented by their textual content. This content holds all the ideas and contributions brought by the publication. It also plays a central role in attracting users towards the paper; after all, users are interested in that content. The unstructured nature of the textual content poses a challenge for systems that try to analyze and match it against the content of other publications or against the user's needs. An important aspect here is the polysemy and synonymy problems encountered in textual content. Therefore, recommender designers need to find a suitable representation model for the publications' textual content that allows expressive modeling of the publications.
2. Challenges related to users' behavior:
• Unlike other recommendation applications, such as movie or song recommendation, in scientific publication recommendation the number of papers surpasses the number of users by far, especially users who are willing to provide ratings. This leads to an extremely sparse setup. For example, a study conducted in [Vel13] found that the ratings' sparsity in Mendeley [JHGdZG12], an academic social network, is almost three orders of magnitude higher than that of Netflix, the famous movie provider. Also, in an analysis of a real-world dataset, we found that users have on average around 40 papers in their personal libraries. On top of that, these libraries are built over months of the user's activity in the system, whereas users of Spotify, for example, listen on average to 50 songs in a single day.
• Not only are the available ratings limited, but they are also all positive ratings. This issue is known as the one-class problem, where only positive ratings are available. Users do not usually provide information about
irrelevant papers. This forces the recommender system to learn a model from an extremely biased set of ratings that contains information about the positive class only.
• The last challenge is related to the temporal aspect of the system. Several conditions might change over time in such a system, most importantly the user's interests. For example, a user who was interested in papers related to the semantic web three years ago might have shifted his/her main interest towards machine learning-related papers. Consequently, not all available ratings (which are already few) have the same level of importance for understanding the user's actual needs. Therefore, it is essential for the recommendation method to be aware of this temporal aspect and to adapt to the changes in users' interests over time.
These challenges form the basic problems that we tackle in this thesis. In the following sections, we present the problem statement and discuss these challenges in the light of the main contributions of this thesis.
1.2. Problem Statement
Concretely, the problem statement of this thesis can be described as follows. We design a recommender system that recommends "useful" scientific publications to users, assuming the following setup. The input to the recommender system is:
1. A set of scientific publications P, where each publication is associated with two kinds of attributes: textual content, including the title, the abstract and the author-defined keyword list; and structural attributes, namely the publication year and the publishing venue.
2. A set of users U, where for each user we have the list of relevant publications from P. Each relevant publication is associated with a timestamp that refers to the time when the user showed his/her interest in the publication.
The expected output is an ordered list of recommended publications from P for each user in U. We base our work in this thesis on this setup, and study and present different recommendation techniques that address the previously mentioned challenges.
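To make this setup concrete, the following minimal Python sketch models the input; the class and field names are illustrative choices, not definitions taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Publication:
    # Textual content
    title: str
    abstract: str
    keywords: List[str]        # author-defined keywords
    # Structural attributes
    year: int
    venue: str

@dataclass
class Rating:
    paper_id: int              # index into the publication set P
    timestamp: float           # when the user showed interest

@dataclass
class User:
    library: List[Rating] = field(default_factory=list)  # positive-only feedback

# A recommender then maps each user in U to an ordered list of paper ids
# from P, e.g.: recommend(user: User, P: List[Publication]) -> List[int]
```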
1.3. Thesis Contributions & Published Work
Our contributions in this thesis mainly tackle the challenges presented earlier. In the following subsections, we summarize these contributions in three points.
1.3.1. Literature survey
Scientific publication recommendation is developing as a separate research direction emerging from the much bigger recommender systems community. Therefore, surveying the contributions that address scientific publication recommendation provides a proper understanding of the approaches and challenges specific to this branch. In our first contribution, we present a survey of recent works in the domain of scientific publication recommendation. Our goal is to form an extensive understanding of the following points: identifying successful and promising recommendation approaches that have been employed to solve our problem, and investigating the challenges and opportunities each of these approaches brings. Additionally, this survey allows us to situate the contributions of this thesis among existing works. We present the survey in Chapter 3.
1.3.2. Addressing the one-class problem
Most existing approaches for recommending scientific publications depend on implicit rather than explicit feedback as the source of user ratings, because the former is far more readily available than the latter. Such methods analyze users' online activities such as paper searching and browsing [XGLC14]; bookmarking or tagging [WB11, WCL13, BBM16]; or paper authoring and citing [APSF16, SK13]. However, using such interactions as a source of user ratings leads to the one-class problem, which is basically the absence of negative feedback [PZC+08].

Existing works that addressed the one-class problem for recommender systems in general have typically considered the Collaborative Filtering (CF) recommendation scenario. However, CF approaches are not the best fit for scientific paper recommendation, since they cannot recommend unseen papers. Another problem that contributed to abandoning CF approaches is the high sparsity of the users-items relation, driven by the large number of papers relative to the much lower number of users. As a result, the focus of previous works on paper recommendation has been on Content-based Filtering (CBF) approaches.

In CBF, constructing the user's profile has been almost exclusively achieved with memory-based or heuristic-based methods. These methods construct the user profile by applying an aggregation function over the feature vectors of relevant papers. Such methods depend strongly on the assumptions underlying the employed aggregation method. In contrast, model-based approaches, since they rely on a learning algorithm, have the potential to build more representative user models. Yet model-based CBF approaches have not gained much attention as a solution for scientific publication recommendation. The reasons can be summarized in the following three points: (a) the absence of negative ratings (the one-class problem) significantly limits the ability of machine learning algorithms to learn a representative model; (b) the scarcity of available ratings for each user means a
low number of training instances; and (c) model-based CBF approaches always come with a high cost, since a separate model must be trained for each user. In Part III of this thesis, we focus on these challenges and show how model-based CBF recommenders can be both applicable and efficient for scientific publication recommendation. Our contributions in this respect are:
• In our first contribution [AAFL15], we studied how to model the recommendation task in the absence of the negative class. We presented a supervised learning formulation that models recommendation prediction as a regression problem and suggested employing the rating's age to achieve a multi-level labeling scheme as a solution to the one-class problem.
• In our second work [AAFL16], we adopted a different technique for addressing the one-class problem, namely pairwise learning-to-rank. In this case, we modeled the recommendation task as a ranking prediction problem and defined a rank-based loss that can learn from the preference between relevant and unobserved papers (a minimal sketch of such pair construction follows this list). We provided two verification methods to account for the potential errors resulting from utilizing the unobserved papers, and conducted offline evaluations on a real-world dataset to evaluate our approach.
• As an extension of our learning-to-rank approach, we addressed the efficiency of the model-based content-based recommender in [AKL19]. We provided a system design that leverages the computational power of multiple computation units (a cluster of machines) to enable efficient training of supervised models for a large number of users. This system was implemented in Apache Spark.
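To illustrate the pairwise idea referenced above (detailed in Chapter 5), here is a minimal sketch of how preference pairs can be formed from positive-only feedback. Uniformly sampling unobserved papers as the weaker side of each pair is a simplifying assumption for illustration, not the exact procedure of [AAFL16].

```python
import random

def preference_pairs(relevant, all_papers, per_positive=3, seed=0):
    """Form training pairs (p, q) encoding 'p should rank above q'.

    relevant: papers the user rated (positive-only feedback).
    all_papers: the whole collection; the remainder is unobserved.
    """
    rng = random.Random(seed)
    unobserved = sorted(set(all_papers) - set(relevant))
    pairs = []
    for p in relevant:
        # Pair each relevant paper with a few sampled unobserved papers.
        for q in rng.sample(unobserved, min(per_positive, len(unobserved))):
            pairs.append((p, q))
    return pairs
```

A pairwise learning-to-rank loss can then be minimized over such pairs, side-stepping the need for explicit negative labels.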
1.3.3. Addressing the concept drift in users' interests
In our final contribution, we addressed the temporal aspect of the problem. We first studied the presence of concept drift in users' interests on a real-world dataset collected from an online system that allows users to save and share academic papers, the citeulike social bookmarking website. Afterwards, we presented in [Alz18] a time-aware recommendation method, in which we adapted and extended an existing promising recommendation method, namely Collaborative Topic Regression (CTR) [WB11], enabling it to account for the concept drift in users' interests. We additionally conducted systematic experiments on the citeulike dataset and found that time-ignorant offline evaluation methods promise unrealistic results. We also showed that our time-aware approach leads to better recommendations, especially under the realistic time-aware evaluation framework.
1.4. Thesis Outline
This thesis is divided into five parts. The following overview summarizes the topics and contributions of these parts.
Part I: Chapter 1 provides an introduction to the thesis, where we state the problem and summarize the contributions.
Part II: Provides the foundations and background knowledge important for understanding the presented concepts; we also present our literature survey in this part. Chapter 2 presents an overview of recommender systems, highlighting concepts that are relevant to our work. Chapter 3 presents a survey of recent research relevant to our problem.
Part III: In this part, we address the one-class problem. Chapter 4 presents a model-based, content-based system for recommending scientific publications. Chapter 5 presents another model-based approach, which tackles the one-class problem by employing a pairwise preference comparison technique. Chapter 6 presents an efficient implementation of the learning-to-rank recommender that can efficiently train supervised models for a large number of users.
Part IV: In this part, we address the temporal aspect of the scientific paper recommender system. Chapter 7 presents an exploratory study of concept drift detection and provides a study on a real-world dataset to examine the presence of concept drift in our scenario. Chapter 8 presents our approach to accounting for the concept drift in users' interests in the underlying problem.
Part V: Chapter 9 concludes the thesis with a summary.
Part II. State-of-the-art: Recommender Systems for Scientific Papers

2. Recommender Systems
Contents
2.1. Introduction
2.2. Recommender Systems and the Users-items Interactions
2.3. Overview of Recommendation Techniques
2.4. Matrix Factorization in CF
2.5. Implicit Feedback and One-class Problem
2.1. Introduction
This chapter serves as a review of background knowledge related to recommender systems. We focus on the topics relevant to this thesis that are important for understanding the presented contributions. Readers who are familiar with the following concepts can skip this chapter: content-based filtering, collaborative filtering, model-based recommender systems, matrix factorization and implicit feedback. We start with an introduction to the notation used in this thesis.
Mathematical notation. We denote a matrix by a capital letter (e.g., $R$, $D$). The elements of a matrix are represented by lower-case letters with a double subscript (e.g., $r_{ij}$, $d_{jk}$). A row or a column of a matrix is represented by a subscripted capital letter (e.g., $R_j$ is the $j$th row or column of the matrix $R$). Vectors are denoted by lower-case letters printed in boldface, for example $\mathbf{u}$, $\mathbf{v}$. The elements of a vector are represented by lower-case letters with a single subscript (e.g., $u_i$). We consider all vectors to be column vectors, which also holds when indexing a row or a column of a matrix. For example, the dimensionality of a vector $\mathbf{u}$ of length $d$ is $\mathbb{R}^{d \times 1}$. Similarly, for a matrix $R \in \mathbb{R}^{n \times m}$, the $i$th row is the column vector $R_i \in \mathbb{R}^{m \times 1}$ and the $j$th column is the column vector $R_j \in \mathbb{R}^{n \times 1}$.
Chapter structure. The rest of this chapter is organized as follows. First, we introduce recommender systems and the different interaction models in Section 2.2. In Section 2.3, we present an overview of the main techniques for recommender systems. We then explain in Section 2.4 an important method for CF recommenders, namely matrix factorization. Finally, we explain in Section 2.5 an important problem that we address in this thesis, the one-class problem.
2.2. Recommender Systems and the Users-items Interactions
At the basic level, recommender systems deal with two main entities, namely users and items. Users interact with items in several ways depending on the underlying domain, and this interaction is the essential source of information that helps recommender systems generate future recommendations. Therefore, the users-items interactions can be seen as the third main entity of the recommender system.

We first introduce some terminology that is widely adopted in the recommender systems community. The user for whom we generate recommendations is the active user. The items that appear in the previous interactions of the user are the observed items, and the rest of the items are the unobserved items. Unobserved items are candidates for recommendation and are therefore called the candidate items; more precisely, the candidate set of items is a subset of the unobserved items. The goal of a recommender system is to estimate the utility of each candidate item i for a user u (the active user), and to recommend the set of candidate items with the highest estimated utility to u. To achieve this goal, a recommender system leverages a wide range of available information that can be categorized into:
• Item-related information;
• User-related information;
• Contextual information describing the situation in which the recommendation is requested;
• Information related to previous interactions between the user(s) and the item(s).
Different recommendation techniques use different data from this list. They also vary in how the utilized data is processed. For example, we can characterize the three main classes of recommendation approaches, namely Content-based Filtering (CBF), Collaborative Filtering (CF) and hybrid approaches, by categorizing the used information as follows:
• Using previous interactions of the active user only, in addition to information related to the items, gives rise to the content-based filtering approaches;
• Using previous interactions of all users gives rise to the collaborative filtering approaches;
• Using the interactions of all users, in addition to the items' information, leads to hybrid approaches.
Later, in Section 2.3, we give more details about these approaches, but first we explain how users' interactions and activities are modeled. Users' interactions with items can be seen as feedback provided by the users to the recommender system. Different types of feedback can be observed, ranging from less direct feedback, such as searching for a product, clicking on an advertisement or reading a news article, to more direct feedback, such as purchasing a product or adding a paper to the user's library. These actions can accordingly be grouped into two categories, namely implicit feedback and explicit feedback. In both cases, users' interactions with items are translated into ratings. The following subsections explain these two feedback categories.
2.2.1. Explicit feedback
This is the most convenient input for the recommender system: users explicitly give a kind of "rating" for an item, expressing how relevant the item is to them. An example of such feedback is the star rating system adopted by Netflix, where users can rate movies on a scale from one star (not interested) to five stars (very interested). For recommender systems, this kind of feedback is very useful since it provides a convenient level of detail about users' tastes. However, requesting explicit feedback does not coincide with the users' goals. Users spend time on a system to benefit from the services it provides; they are not usually willing to spend time giving feedback, especially when they see no direct reward. For example, Netflix users would rather start watching another movie than spend time rating the movie they have just watched. Therefore, explicit feedback is not always available, and even when it is available, it is sparse. As a result, an alternative source of information about users' interactions is needed, which brings us to implicit feedback.
2.2.2. Implicit feedback
Obtaining explicit feedback imposes extra work on the users, and the ratings needed for learning a representative model are rarely supplied by them [SKK01]. Therefore, instead of expecting users to explicitly provide feedback, recommender systems can observe users' actions and activities and infer their tastes accordingly. Here, the users are relieved from providing explicit feedback; instead, the system "implicitly" derives the feedback. For example, if a user stopped watching a movie after a short time and never continued, then we can infer that the user is most likely not
interested in that movie. Conversely, if a user watches a movie multiple times, we assume that he/she is interested in that movie.
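As a minimal sketch of this kind of inference, the rule below maps watching behavior to a binary rating; the signals and thresholds are illustrative assumptions, not taken from any particular system.

```python
def infer_implicit_rating(watch_fraction, rewatched=False):
    """Derive an implicit rating for one user-movie pair.

    watch_fraction: share of the movie the user watched (0.0-1.0).
    rewatched: whether the user watched the movie more than once.
    """
    if rewatched or watch_fraction >= 0.9:
        return 1        # inferred: interested
    if watch_fraction < 0.1:
        return 0        # inferred: likely not interested
    return None         # ambiguous: leave the rating unknown

assert infer_implicit_rating(0.05) == 0
assert infer_implicit_rating(0.5, rewatched=True) == 1
```

Note that, in the scientific paper setting considered in this thesis, such inference typically yields only the positive class (papers added to a library), which is exactly the one-class problem discussed in Section 2.5.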
2.3. Overview of Recommendation Techniques
Approaches to generating recommendations can be grouped into several categories. In this section, we provide a general overview of the approaches that are widely adopted for scientific paper recommendation and detail important background information needed to understand the rest of this thesis. A more in-depth categorization of recommendation approaches relevant to scientific paper recommendation will be introduced later, in the literature survey presented in Chapter 3. As mentioned earlier, we can identify different recommendation approaches by looking at two aspects: (a) which information is utilized; and (b) how this information is processed. Based on this, we can differentiate between the following main classes of recommendation approaches.
2.3.1. Content-based filtering (CBF)
Content-based systems recommend items that are similar to what the active user has liked in the past. In CBF recommenders, similarity is based on the items' attributes or features; therefore, both users and items are modeled in a shared feature space. This makes content-based recommender systems especially useful in scenarios where items can be represented by a descriptive set of features. It is particularly useful when only a few ratings are available for the items, or when new items are added to the system, because other items with similar attributes or features might already have been rated by the active user. Hence, content-based systems are based on the following two sources of information:
1. The set of attributes that provide a comprehensive description of the items. Examples of such attributes for a scientific publication are the title, the author list, the list of author-defined keywords, the year of publication, the number of pages, the venue, etc.
2. The affinity of the active user towards the items' features, i.e., the importance of each feature/attribute to the user.
Figure 2.1 depicts the general system design of a CBF recommender. Designing or choosing the features of the items is done by applying feature engineering to define a representative set of features. Defining the mapping between the active user and the items' features, on the other hand, is a central task of CBF systems, known as "user modeling" [RRS15]. In the simplest form, the mapping is provided explicitly by the user through a user profile. According to Vivacqua et al. [VOdS09], this is called the declaration method for constructing user profiles.
Figure 2.1. Schematic illustration of a content-based filtering recommender system.
In this case, the CBF recommendation algorithm predicts the relevance of a candidate item to the active user by measuring the similarity between the user profile and the candidate item's attributes; no "user modeling" is involved. However, we cannot count on the availability of such profiles, since users do not tend to put effort into defining their interests in detail. Therefore, more advanced approaches try to build or learn the user profile by analyzing the properties of the observed items, the items that appear in the user's previous activities. In other words, the system tries to find the correct mapping between the user and the items' features. The important decision to be made here is how to model the problem of learning the user profile. Based on the classification suggested by Adomavicius et al. in [AHT08], approaches for building the user profile fall into the following two categories:
2.3.1.1. Memory-based methods for user modeling
Such methods categorize the observed items into relevant and irrelevant items, denoted by $I_u^+$ and $I_u^-$ respectively. Then, the user profile is constructed as an aggregation of the features of the relevant items. An example of this approach is the relevance feedback method known as the Rocchio algorithm [Roc71]. This algorithm builds the user profile $w(u)$ for a user $u$ using the following formula:

$$w(u) = \frac{\beta}{|I_u^+|} \sum_{i \in I_u^+} x(i) \; - \; \frac{\gamma}{|I_u^-|} \sum_{i \in I_u^-} x(i) \qquad (2.1)$$

where $x(i)$ is the feature vector of item $i$, and the parameters $\beta$ and $\gamma$ control the influence of the relevant and irrelevant items respectively. Memory-based methods are also seen as heuristic-based methods: they are practical
and easy to implement, but they are usually criticized for oversimplifying the model and for depending on a presumed heuristic encoded in the adopted aggregation function for building the user profile.
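The following is a minimal numpy sketch of Equation (2.1), assuming items are given as rows of a feature matrix (e.g., TF-IDF vectors); scoring candidates by cosine similarity to the resulting profile is one common choice, not prescribed by the thesis.

```python
import numpy as np

def rocchio_profile(X_pos, X_neg=None, beta=1.0, gamma=0.25):
    """Equation (2.1): aggregate relevant (X_pos) and irrelevant (X_neg)
    item feature vectors into a user profile w(u)."""
    w = beta * X_pos.mean(axis=0)            # beta/|I+| * sum over I+
    if X_neg is not None and len(X_neg) > 0:
        w -= gamma * X_neg.mean(axis=0)      # gamma/|I-| * sum over I-
    return w

def cosine_score(w, x):
    # Relevance of candidate item x with respect to the profile w.
    return float(w @ x / (np.linalg.norm(w) * np.linalg.norm(x) + 1e-12))
```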
2.3.1.2. Model-based methods for user modeling
Instead of using an aggregation function, model-based approaches depend on a machine learning algorithm to learn the importance of the features for the active user. A supervised learning problem is formulated in which the user profile is trained with a machine learning algorithm, using the observed user's ratings as training data. Compared to memory-based methods, this approach relies on fewer predefined assumptions, which allows producing more representative models. However, model-based methods suffer from the shortage of training data, which explains why most existing CBF systems opt for memory-based over model-based methods for building the user profile. In CBF, we learn one model per user using that user's ratings only; therefore, the training data is limited to the ratings provided by that user. Usually in recommender systems, we do not have enough ratings per user to train a representative model. This is also an important reason why recommender systems seek to collect ratings from implicit feedback (cf. Section 2.2.2). However, utilizing implicit feedback comes with its own challenges, such as the one-class problem, where ratings are limited to positive feedback only. We provide more details about the one-class problem in Section 2.5.
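The sketch below illustrates the model-based alternative, using scikit-learn's logistic regression as one possible learner; note that treating sampled unobserved papers as negatives is itself a naive workaround for the one-class problem discussed in Section 2.5, shown here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_user_profile(X_pos, X_neg):
    """Train a per-user model from that user's ratings only.

    X_pos: feature vectors of the user's relevant papers.
    X_neg: feature vectors of (sampled) unobserved papers, treated
    here as the negative class for illustration.
    """
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model  # model.predict_proba(X_cand)[:, 1] then ranks candidates
```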
2.3.2. Collaborative filtering and latent factor models
Collaborative filtering (CF) is another main class of recommendation approaches. In CF recommenders, all ratings available in the system are utilized, including ratings from other users in addition to ratings from the active user. There is therefore a sense of collaboration between the users, because every rating added by any user helps the system get better at generating relevant recommendations. Because CBF systems base their recommendations only on the active user's ratings, the resulting recommendations are always very similar to what the user has already consumed; hence, a CBF recommender cannot bring anything new to the user, which is known as the over-specification problem of CBF systems. CF recommenders overcome this problem by employing ratings from other users, which lets the generated recommendations be influenced by the new kinds of items explored by other users. Figure 2.2 illustrates the system design of collaborative filtering recommenders.

In CF methods, the available ratings are modeled as an incomplete matrix, known as the rating matrix. This is a two-dimensional matrix with one row for each user and one column for each item. Having $n$ users and $m$ items, we denote the rating matrix by $R \in \mathbb{R}^{n \times m}$, where an entry $r_{ui}$ represents the rating of user $u$ on item $i$. Only known ratings are stored in the rating matrix, whereas unknown ratings are left empty.
Figure 2.2. Schematic illustration of a collaborative filtering recommender system.
Based on this modeling, the task of a CF recommender system is to predict the missing values (ratings) in the rating matrix. Afterwards, in order to generate the recommendations for a user $u$, we sort the predicted values of the row $R_u$ and recommend the set of items that correspond to the highest predicted scores. CF approaches are mainly grouped into two categories: neighborhood models and Latent Factor Models (LFM) [HKV08].
Neighborhood models. These methods, also referred to as memory-based CF, try to find users with "similar" behavior to that of the active user; these users form the neighborhood. The similarity between users is based on their rating behavior, i.e., two users are similar if they rate items alike. After identifying the neighborhood, a process similar to majority voting computes a vote or a prediction score for each candidate item.
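A compact sketch of such a user-based neighborhood prediction follows, assuming a dense rating matrix with np.nan for unknown entries and cosine similarity over zero-filled vectors; production systems use sparse structures and more careful similarity estimates.

```python
import numpy as np

def predict_user_based(R, u, i, k=5):
    """Predict user u's rating on item i from the k most similar raters.

    R: n x m rating matrix with np.nan marking unknown ratings.
    """
    filled = np.nan_to_num(R)                       # zero-fill unknowns
    norms = np.linalg.norm(filled, axis=1) + 1e-12
    sims = filled @ filled[u] / (norms * norms[u])  # cosine similarities
    sims[u] = -np.inf                               # exclude the active user
    raters = np.where(~np.isnan(R[:, i]))[0]        # users who rated item i
    if len(raters) == 0:
        return np.nan
    top = raters[np.argsort(sims[raters])[-k:]]     # the neighborhood
    w = np.clip(sims[top], 0.0, None)               # ignore negative similarity
    if w.sum() == 0.0:
        return float(np.nanmean(R[top, i]))
    return float(np.sum(w * R[top, i]) / w.sum())   # similarity-weighted vote
```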
Latent Factor Models. Latent factor models, also known as model-based CF, are based on dimensionality reduction. The basic idea is to factorize the rating matrix $R$ into low-dimensional but complete matrices. Multiplying the resulting low-dimensional complete matrices together yields a complete matrix $R'$ that approximates $R$; the unknown ratings of $R$ can then be estimated by the corresponding values of $R'$. This scheme is a direct application of matrix factorization to the recommendation domain and was first presented by Simon Funk in his blog post [Fun06] as a successful solution to the Netflix challenge [Net06]. Latent factor models have recently gained a lot of attention in the recommender systems community, especially after matrix factorization was one of the winning methods in the Netflix Prize in 2009. LFM are also considered the state-of-the-art method for
recommender systems [A+16], and a wide range of variations of matrix factorization has been presented for solving the recommendation problem. We will explain matrix factorization in more detail in the next section, but first we explain the third recommendation approach in the following subsection.
2.3.3. Hybrid approaches
As we mentioned earlier, content-based filtering excels in cases where the items have expressive features and where items might not have enough ratings to be discoverable. But CBF approaches are sensitive to the feature extraction method, which can be erroneous, and they suffer from the over-specialization problem, where users get recommendations very similar to what they have already consumed. Collaborative filtering recommenders, on the other hand, do not require feature representations of the items. But they depend on other users' ratings to generate recommendations, which can be ineffective when not enough ratings are available or when new users or items are added to the system. To overcome the problems of these two approaches, hybrid recommenders were introduced to seek the best of both worlds. Many recommendation methods presented in the literature can be categorized as hybrid approaches; we refer the reader to the work presented by Robin Burke in [Bur07] for a comprehensive categorization of hybrid methods.
2.4. Matrix Factorization in CF
The basic method for matrix factorization in CF is the UV-decomposition, which is usually referred to as SVD within the recommender systems community. It is important to note that this is not the Singular Value Decomposition, which is also a dimensionality-reduction method based on matrix factorization. The difference is that UV-decomposition factorizes the matrix into two matrices, whereas Singular Value Decomposition factorizes it into a product of three matrices. To avoid confusion between the two methods, we will use the term UV-decomposition. Figure 2.3 illustrates the process of UV-decomposition, where we factorize the rating matrix R into two matrices U and V with dimensions n × k and m × k respectively. k is the number of latent factors, which determines the dimensionality of the shared space. U gives the latent representation of all users in the low-dimensional space, where each row of U represents a user as a k-dimensional vector. Similarly, V provides the latent representation of all items in the k-dimensional space, with each item represented as a k-dimensional vector as well. The factorization is achieved by the following basic method: find U and V such that, for all known ratings r_{ui}, we get U_u^T V_i = r_{ui}. That is, find two low-dimensional matrices whose multiplication reconstructs the known ratings. Existing algorithms for solving the UV-decomposition are all based on this basic method, but they vary in the problem modeling and in the algorithmic steps.
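As a toy illustration of this setup (our own example with made-up numbers, not from the original text), the factors have the following shapes, and a single rating estimate is the dot product of a user row and an item row:

```python
import numpy as np

n, m, k = 4, 6, 2                  # users, items, latent factors
rng = np.random.default_rng(0)
U = rng.normal(size=(n, k))        # one k-dimensional vector per user
V = rng.normal(size=(m, k))        # one k-dimensional vector per item
R_approx = U @ V.T                 # complete n x m approximation R'
r_ui = U[1] @ V[3]                 # estimated rating of user 1 on item 3
assert np.isclose(R_approx[1, 3], r_ui)
```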
There exist different modelings of matrix factorization, such as least squares, probabilistic matrix factorization, maximum-margin, and ranking-error-based models. In the following subsections, we explain the first two modelings, which are widely adopted. Additionally, we explain two different algorithms for solving the UV-decomposition. For a comprehensive explanation of the remaining models, we refer the reader to [VBCG17].
Figure 2.3. Illustrating UV-decomposition as a matrix factorization method on the rating matrix R using k latent factors.
2.4.1. Least squares modeling
Let the set of all known ratings be denoted by \mathcal{O},

\mathcal{O} = \{ r_{ij} \in R \mid r_{ij} \text{ is a known rating} \}

The basic modeling for UV-decomposition maps directly to the basic method presented above. In least squares modeling, we want to find two matrices U \in \mathbb{R}^{n \times k} and V \in \mathbb{R}^{m \times k} such that the following loss is minimized:

L = \frac{1}{|\mathcal{O}|} \sum_{r_{ij} \in \mathcal{O}} \left( r_{ij} - U_i^T V_j \right)^2    (2.2)
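Transcribed directly into code, this loss could be computed as follows (a sketch under our own conventions: unknown ratings are encoded as NaN, and the function name is hypothetical):

```python
import numpy as np

def least_squares_loss(R, U, V):
    """Mean squared reconstruction error over the known ratings only.
    R is n x m with np.nan for unknown entries; U is n x k, V is m x k."""
    known = ~np.isnan(R)                    # mask of the observed set O
    E = np.where(known, R - U @ V.T, 0.0)   # errors on known entries only
    return (E ** 2).sum() / known.sum()     # (1/|O|) * sum of squared errors
```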
2.4.2. Probabilistic matrix factorization
Mnih and Salakhutdinov presented in [MS08] a probabilistic model for matrix factorization, which came to be widely known as the Probabilistic Matrix Factorization
(PMF). It applies maximum a posteriori (MAP) estimation, which is derived as follows:

(\hat{U}, \hat{V}) = \arg\max_{U,V} P(U, V \mid R)
                 = \arg\max_{U,V} \left[ \frac{P(R \mid U, V)\, P(U)\, P(V)}{P(R)} \right]
                 = \arg\max_{U,V} \left[ \log P(R \mid U, V) + \log P(U) + \log P(V) \right]
The conditional distribution over the known ratings is defined as a Gaussian distribution with mean U_i^T V_j and variance \sigma^2,

P(R \mid U, V) = \prod_{r_{ij} \in \mathcal{O}} \mathcal{N}(r_{ij} \mid U_i^T V_j, \sigma^2)    (2.3)
Assuming the priors P(U), P(V) follow Gaussian distributions with zero mean and variances \sigma_u^2, \sigma_v^2 respectively, and I is a k \times k identity matrix:

P(U) = \prod_{i=1}^{n} \mathcal{N}(U_i \mid 0, \sigma_u^2 I)    (2.4)

P(V) = \prod_{j=1}^{m} \mathcal{N}(V_j \mid 0, \sigma_v^2 I)    (2.5)
Given the ratings' conditional distribution and the priors, together with the variances \sigma, \sigma_u, \sigma_v, we obtain the following log posterior of U and V given R:

\log P(U, V \mid R, \sigma, \sigma_u, \sigma_v) = -\frac{1}{2\sigma^2} \sum_{r_{ij} \in \mathcal{O}} (r_{ij} - U_i^T V_j)^2 - \frac{1}{2\sigma_u^2} \sum_{i=1}^{n} U_i^T U_i - \frac{1}{2\sigma_v^2} \sum_{j=1}^{m} V_j^T V_j + C    (2.6)

where C is a constant that does not depend on U and V.
Maximizing the log posterior is equivalent to minimizing the following sum-of-squared-errors loss function with quadratic regularization terms:

L = \frac{1}{2} \sum_{r_{ij} \in \mathcal{O}} (r_{ij} - U_i^T V_j)^2 + \frac{\lambda_u}{2} \sum_{i=1}^{n} U_i^T U_i + \frac{\lambda_v}{2} \sum_{j=1}^{m} V_j^T V_j    (2.7)

where \lambda_u = \sigma^2/\sigma_u^2 and \lambda_v = \sigma^2/\sigma_v^2 are the regularization parameters for U and V respectively.
2.4.3. Matrix factorization algorithms
Both of the modelings presented in the previous subsections reduce matrix factorization to an optimization problem in which the sum of squared errors is to be
minimized. As we saw, the maximum a posteriori (MAP) solution leads to a regularized loss (Equation 2.7). We can rewrite this loss in the following form, which makes the derivatives in the coming subsections easier to compute:

L = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} I_{ij} (r_{ij} - U_i^T V_j)^2 + \frac{\lambda_u}{2} \sum_{i=1}^{n} U_i^T U_i + \frac{\lambda_v}{2} \sum_{j=1}^{m} V_j^T V_j    (2.8)

where I_{ij} is an indicator function that marks the known ratings:

I_{ij} = \begin{cases} 1 & \text{if } r_{ij} \in \mathcal{O} \\ 0 & \text{otherwise} \end{cases}
We will explain two widely adopted algorithms for solving the matrix factorization. Both solve the following optimization problem:

(\hat{U}, \hat{V}) = \arg\min_{U,V} L    (2.9)

where L is the loss defined in Equation 2.8. Once the optimal U and V are found, we can estimate the utility of an item j to a user i by the dot product of their latent factors:

\hat{R}_{ij} = U_i^T V_j    (2.10)
2.4.3.1. Stochastic gradient descent (SGD)
SGD was first presented as a solution for matrix factorization by Simon Funk in his blog post for the Netflix Prize [Fun06]. In order to solve the optimization defined in Equation 2.9, the algorithm loops over all known ratings and computes for each rating the prediction error

E_{ij} := r_{ij} - U_i^T V_j    (2.11)
Then, U_i and V_j are updated by a magnitude proportional to a given learning rate \gamma in the direction opposite to the gradient of the loss (Equation 2.8). To compute the gradient of the loss with respect to U_i, we treat V_j as constant; analogously, when computing the gradient with respect to V_j, we fix U_i:

\frac{\partial L}{\partial U_i} = -\sum_{j=1}^{m} I_{ij} E_{ij} V_j + \lambda_u U_i    (2.12)

\frac{\partial L}{\partial V_j} = -\sum_{i=1}^{n} I_{ij} E_{ij} U_i + \lambda_v V_j    (2.13)
Based on these partial derivatives of the loss with respect to U_i and V_j, we obtain the following update rules:

U_i := U_i + \gamma \left( \sum_{j=1}^{m} I_{ij} E_{ij} V_j - \lambda_u U_i \right)

V_j := V_j + \gamma \left( \sum_{i=1}^{n} I_{ij} E_{ij} U_i - \lambda_v V_j \right)
The initial values of U and V are assigned randomly. After iterating over all known ratings, the algorithm repeats the same steps until convergence or until reaching a predefined number of epochs. Finally, it outputs the factor matrices U and V.
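A compact sketch of this procedure follows (our own illustrative implementation using the per-rating stochastic variant of the update rules; all parameter defaults are assumptions):

```python
import numpy as np

def sgd_mf(ratings, n, m, k=20, gamma=0.01, lam=0.1, epochs=30, seed=0):
    """Factorize the known ratings, given as (i, j, r_ij) triples, into
    U (n x k) and V (m x k) by SGD on the regularized loss (Equation 2.8)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    for _ in range(epochs):
        rng.shuffle(ratings)              # visit known ratings in random order
        for i, j, r in ratings:
            e = r - U[i] @ V[j]           # prediction error E_ij
            Ui = U[i].copy()              # keep old U_i for V_j's update
            U[i] += gamma * (e * V[j] - lam * Ui)
            V[j] += gamma * (e * Ui - lam * V[j])
    return U, V
```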
2.4.3.2. Alternating least squares (ALS)
The loss function (Equation 2.8) is non-convex in the two unknown matrices U and V jointly. Therefore, the optimization problem (Equation 2.9) cannot be solved analytically. If we fix one of the variables, however, the loss function becomes quadratic in the other variable, and the optimization can be solved in closed analytical form. This is the basis of ALS, presented by Zhou et al. in [ZWSP08]. ALS alternates between fixing one of the variables U and V and solving the loss function for the other. This way, the non-convex problem is turned into a sequence of quadratic problems that can be solved optimally as follows. Differentiating Equation 2.8 with respect to the vector U_i, and analogously with respect to the vector V_j, yields the partial derivatives formulated in Equation 2.12 and Equation 2.13 respectively. In order to find the optimal values for these variables, we set the respective derivative to zero and solve for the corresponding variable:
\lambda_u U_i - \sum_{j=1}^{m} I_{ij} r_{ij} V_j + \sum_{j=1}^{m} I_{ij} V_j (U_i^T V_j) = 0

The dot product of the real vectors U_i and V_j is commutative, U_i^T V_j = V_j^T U_i:

\Rightarrow \lambda_u U_i - \sum_{j=1}^{m} I_{ij} r_{ij} V_j + \sum_{j=1}^{m} I_{ij} V_j (V_j^T U_i) = 0

Matrix multiplication is associative:

\Rightarrow \lambda_u U_i - \sum_{j=1}^{m} I_{ij} r_{ij} V_j + \sum_{j=1}^{m} I_{ij} (V_j V_j^T) U_i = 0
The sum \sum_{j=1}^{m} I_{ij} r_{ij} V_j can be rewritten as V^T I^{(i)} R_i, and \sum_{j=1}^{m} I_{ij} V_j V_j^T can be rewritten as V^T I^{(i)} V, where I^{(i)} is a diagonal matrix with dimensions m \times m
and the values I_{ij} on its diagonal for all j, and R_i denotes the i-th row of R:

\Rightarrow \lambda_u U_i - V^T I^{(i)} R_i + V^T I^{(i)} V U_i = 0
\Rightarrow (\lambda_u I + V^T I^{(i)} V) U_i = V^T I^{(i)} R_i
This leads to the following solution for U_i:

U_i = (\lambda_u I + V^T I^{(i)} V)^{-1} V^T I^{(i)} R_i    (2.14)
Analogously, fixing U, setting the derivative to zero, and solving for V_j leads to

V_j = (\lambda_v I + U^T I^{(j)} U)^{-1} U^T I^{(j)} R_j    (2.15)

where I^{(j)} is a diagonal matrix with dimensions n \times n and the values I_{ij} on its diagonal for all i, and R_j denotes the j-th column of R. The ALS steps are as follows (a sketch in code follows the list):
1. Initialize the matrices U and V randomly.
2. For each user i, set the new values of U_i using Equation 2.14.
3. For each item j, set the new values of V_j using Equation 2.15.
4. Repeat steps 2 and 3 until convergence or until reaching a predefined number of epochs.
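A minimal sketch of these steps (our own; unknown ratings are marked with NaN and the defaults are illustrative):

```python
import numpy as np

def als_mf(R, k=20, lam=0.1, epochs=15, seed=0):
    """Alternating least squares for the UV-decomposition of R
    (an n x m matrix with np.nan marking unknown ratings),
    following the closed-form updates in Equations 2.14 and 2.15."""
    n, m = R.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    Ik = np.eye(k)
    for _ in range(epochs):
        for i in range(n):                       # update every user vector
            rated = ~np.isnan(R[i])              # items rated by user i
            Vi = V[rated]                        # plays the role of I^(i)-selection
            U[i] = np.linalg.solve(lam * Ik + Vi.T @ Vi, Vi.T @ R[i, rated])
        for j in range(m):                       # update every item vector
            raters = ~np.isnan(R[:, j])          # users who rated item j
            Uj = U[raters]
            V[j] = np.linalg.solve(lam * Ik + Uj.T @ Uj, Uj.T @ R[raters, j])
    return U, V
```

Note that np.linalg.solve solves the same linear systems as Equations 2.14 and 2.15 without forming an explicit matrix inverse, which is numerically preferable.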
2.4.3.3. Comparing SGD and ALS
Both algorithms solve the UV-decomposition for CF recommendations, but each has its own characteristics that make it preferable in certain scenarios. We review important aspects that allow comparing the algorithms and understanding when and why to use which one. Some of these aspects are based on the discussion provided in [KBV09]. In terms of efficiency, a single ALS iteration is typically slower than an SGD iteration, as it involves solving the least squares problems in Equations 2.14 and 2.15. In total, however, ALS needs fewer iterations to reach the same level of accuracy. An important remark about ALS is that the users' latent vectors U_i are computed independently of each other, and the same applies to the items' latent vectors V_j. This allows ALS to compute the latent vectors in parallel. On the other hand, since SGD iterates over the known ratings only and updates the model accordingly, it becomes more efficient when the known ratings are very few, i.e., when the rating matrix is sparse. The opposite occurs in the case of implicit feedback, where the unknown ratings are usually treated as negative ratings with low confidence (as we will see in the next subsection). In that case, SGD becomes very expensive, as it needs to iterate over all matrix entries, whereas ALS remains practical and benefits from parallelism. Regarding parameter tuning, both algorithms need to tune the following parameters:
1. Regularization parameters;
2. The number of epochs (iterations); and
3. The number of latent factors k.
However, SGD additionally requires tuning the learning rate (step size) γ.
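In practice, these parameters are commonly chosen by evaluating candidate values on held-out ratings. A minimal grid-search sketch (our own, reusing the hypothetical als_mf helper sketched above):

```python
import itertools
import numpy as np

def rmse(R_test, U, V):
    """Root mean squared error over the known entries of R_test."""
    known = ~np.isnan(R_test)
    err = np.where(known, R_test - U @ V.T, 0.0)
    return float(np.sqrt((err ** 2).sum() / known.sum()))

def grid_search(R_train, R_test, ks=(10, 20, 40), lams=(0.01, 0.1, 1.0)):
    """Pick (k, lambda) with the lowest validation RMSE."""
    best_score, best_params = np.inf, None
    for k, lam in itertools.product(ks, lams):
        U, V = als_mf(R_train, k=k, lam=lam)
        score = rmse(R_test, U, V)
        if score < best_score:
            best_score, best_params = score, (k, lam)
    return best_params, best_score
```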
2.5. Implicit Feedback and One-class problem
As we mentioned earlier in section 2.2.2, users hardly give explicit feedback, and recommender systems therefore depend on other sources of ratings, namely implicit feedback inferred from users' activities. Examples of such activities in the domain of scientific publication recommendation are: authoring, adding a paper to the personal collection, adding social tags, downloading, reading, or browsing papers. Dealing with implicit feedback usually comes at a cost: implicit feedback refers to positive ratings only, and negative ratings can hardly be identified. This leads to a situation known as the one-class problem. The one-class problem occurs when a recommender system uses implicit feedback in scenarios where negative ratings are not available. As a result, the rating matrix contains only positive ratings, whereas the rest of the ratings are unknown. The naive solution is to treat all unknown ratings as negative, i.e., to assume that all unobserved items are irrelevant. Obviously, this is not a correct solution, especially because the potentially relevant items belong to the set of unobserved items; assigning negative ratings to these items instructs the recommendation algorithm to consider them irrelevant and, consequently, never recommend them.
WALS algorithm The widely adopted method for UV-decomposition in the case of positive-only ratings, or one-class collaborative filtering, is to treat unknown ratings as negatives while associating them with low weights on the error term. This way, unknown ratings contribute less during the learning process than known ratings. The method is based on the ALS algorithm and was presented by Pan et al. in [PZC+08] as the wALS algorithm. Compared to ALS, wALS changes the loss function: instead of multiplying the prediction error of the rating r_{ij} by the indicator function I_{ij} as in Equation 2.8, the error is multiplied by a weighting score C_{ij}. Hence, the loss function used in wALS is

L_{wALS} = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} C_{ij} (r_{ij} - U_i^T V_j)^2 + \frac{\lambda_u}{2} \sum_{i=1}^{n} U_i^T U_i + \frac{\lambda_v}{2} \sum_{j=1}^{m} V_j^T V_j    (2.16)
Different suggestions for the weights C_{ij} were presented in the literature. The general idea is that unknown ratings should be associated with (confidence) weights that are lower than the weights associated with known ratings.
For example, Pan et al. presented in [PZC+08] a user-oriented weighting scheme, where the weights correlate with the number of positive ratings of the user. Other works set the weights based on the number of times the user consumed an item, allowing for several confidence levels even among positive ratings [HKV08], or use a static predefined value b that is much smaller than the weight a of observed ratings: a > b > 0.
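To illustrate, the following sketch (our own) builds the static confidence weights and plugs them into the weighted closed-form update for a user vector; the function names and the default values of a and b are assumptions:

```python
import numpy as np

def static_weights(R_pos, a=1.0, b=0.01):
    """C_ij = a for observed positive entries, b for unknown ones (a > b > 0).
    R_pos is an n x m 0/1 matrix of positive-only implicit feedback."""
    return np.where(R_pos > 0, a, b)

def wals_user_update(R_pos, C, V, i, lam=0.1):
    """Weighted closed-form update for user vector U_i: Equation 2.14
    with the indicator matrix I^(i) replaced by the weights C^(i)."""
    k = V.shape[1]
    Ci = C[i]                                       # weights of user i's row
    A = lam * np.eye(k) + V.T @ (Ci[:, None] * V)   # lam*I + V^T C^(i) V
    rhs = V.T @ (Ci * R_pos[i])                     # V^T C^(i) R_i
    return np.linalg.solve(A, rhs)
```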
3. Literature Survey

3.1. Introduction
3.2. Relevant Papers Identification
3.3. Paper Recommendation and Citation Recommendation
3.4. Dimensions for Approaches Comparison
3.5. Approaches for Recommending Scientific Publications
3.6. Discussion
3.7. Conclusion
3.1. Introduction
Building a recommender system for scientific publications has gained considerable attention in the research community. Since scientific publications are feature-rich, a wide range of representation models exists for them. For example, scientific publications can be modeled based on the citation relationships that connect papers with each other, leading to a graph structure or graph model. Another possible representation is the vector space model, which represents papers as feature vectors taking into consideration their rich attributes or textual content. This made the task of scientific paper recommendation attractive for researchers from different disciplines, including information retrieval, web science, machine learning, databases, etc. Each discipline approached the problem from its own perspective. As a result, the literature offers a wide range of approaches, from graph algorithms that work on the citation network and formulations of the problem as a random walk, to model-based machine learning methods that treat the problem as a supervised learning task. Given this wide range of contributing communities, and in order to draw a detailed picture of the state-of-the-art methods addressing scientific publication recommendation, we provide in this chapter a literature survey.
The main goal of this survey is to provide an overview of the status quo of research on our topic and to understand the challenges that existing works face and solve. Additionally, we investigate if and how the issues addressed in this thesis have been approached in previous works. This allows situating our work among the existing literature. The main contributions of this chapter can be summarized as follows:
1. We present a categorization of the recommendation approaches to the problem of recommending scientific publications.
2. We identify popular approaches in the research community.
3. We identify promising emerging approaches that have lately been gaining more attention.
Chapter structure In section 3.2, we introduce the set of publications on which we base our survey. Afterward, in section 3.3, we distinguish between scientific paper recommendation and the closely related scenario of citation recommendation. Then, in section 3.4, we introduce our comparison dimensions. The details of the identified recommendation approaches, along with a description of each reviewed work, are presented in section 3.5. Afterwards, we provide a brief discussion in section 3.6 and conclude the chapter in section 3.7.
3.2. Relevant Papers Identification
There have not been many works that survey the domain of scientific publication recommendation. To the best of our knowledge, the latest and most cited works are two surveys by Beel et al. [BLG+13, BGLB16]. The first was published in 2013 and focused on the evaluation aspect of scientific paper recommender systems. Later, in 2015, it was extended into a more comprehensive survey that covered the recommendation approaches in addition to the evaluation methods [BGLB16]. The latter surveyed over 200 relevant publications published until June 2013 and provided an analysis of the active research communities, the employed recommendation approaches, and the evaluation methods. As the work for this thesis was done during a later period, i.e., after 2013, we consider Beel's survey [BGLB16] the starting point and aim to draw a more recent picture of the related literature. Therefore, we investigated the papers addressing the problem of recommending scientific publications which either have not been considered in Beel's survey or were published after June 2013. For this purpose, we collected our set of relevant publications from two sources. The first is a set of 87 publications collected by Altina Spahija [Spa18]. This set was collected in December 2017 using Google Scholar and contains all papers that cite Beel's survey and are relevant to the problem of scientific publication recommendation. We studied all papers in this set and filtered out all works that (a) do not describe a
novel recommendation approach; (b) are not published in a peer-reviewed venue; or (c) are not implemented and evaluated. We ended up with 28 papers that present novel approaches for recommending scientific publications. The second source is the set of 14 relevant works which we discovered during our research for this thesis, after filtering out all papers covered in [BGLB16] and the papers that appear in the first set. The union of these two sets forms the paper corpus on which the survey in this chapter focuses. This corpus contains 42 scientific papers published in peer-reviewed venues; the majority of them (35 papers) are very recent, published between 2015 and 2018, and each presents a novel approach for scientific paper recommendation.
3.3. Paper Recommendation and Citation Recommendation
A problem very close to scientific publication recommendation is citation recommendation. Generally speaking, the two problems seem very similar, since in both cases the task is to recommend a set of scientific publications to the user. However, the main difference lies in the recommendation purpose. Unlike in our problem, the user in citation recommendation seeks a set of papers to be used as references for a given publication. Therefore, the key factor is not the user's interest in general but the publication that will contain the recommended references. Based on this main difference, we draw a line between citation recommendation and scientific paper recommendation, where users are usually interested in exploring unknown publications that might be relevant, important, or trending. However, in our paper corpus, we found one citation recommendation scenario that we consider relevant to our study. In this scenario, the system recommends papers for citation considering the whole input publication instead of only the citation context. The citation context refers to the set of words that surround the expected reference position; this context is usually utilized to decide which paper(s) could be cited at that position. We call this scenario citation recommendation without context. We consider it similar to recommendation approaches that recommend relevant papers given a single document, which will be explained in more detail in the next section (section 3.4.2). In our paper corpus, the works [RFP16, GCH+17, JS17, JS18] present citation recommendation approaches that fall into this scenario.
3.4. Dimensions for Approaches Comparison
We start our study by defining a set of dimensions for comparing the recommender systems presented in the paper corpus. After reviewing all papers in the corpus, we
defined nine comparison dimensions and recorded the corresponding values for each paper. Five of these dimensions have a predefined set of values. The detailed categorization of the reviewed works along these dimensions is presented in Table 3.1 and in Table 3.3. In these tables, the sign 'X' means that the corresponding value is observed in the paper, the sign '-' means the value is not observed, and the absence of any sign means the dimension is not applicable to the corresponding paper. In the following subsections, we introduce each of these dimensions in detail.
3.4.1. Recommendation approaches
The first and most significant dimension in our comparison is the recommendation approach. Beel et al. in [BGLB16] distinguished between recommendation classes, approaches, algorithms, and implementations, where each concept in this list is more specific than the preceding one. It is a hierarchy of four abstraction levels. For example, Collaborative Filtering (CF) is a recommendation class. The recommendation approach was defined as "a model of how to bring a recommendation class into practice"; user-based CF is, for example, a recommendation approach. We started from this hierarchical categorization, but we soon realized that it is too rigid and that not all works can be smoothly mapped onto it. Some works might fall into two different approaches, some recommendation approaches might map to multiple classes, and, finally, not all recommendation classes listed in [BGLB16] are represented in our paper corpus. Therefore, we decided to relax the hierarchical structure of Beel's categorization and concentrate on the recommendation approaches only. For our study, the recommendation approaches give the needed level of detail to understand the main idea behind the presented recommenders. After reviewing all papers from our corpus, we identified the following seven recommendation approaches: Latent Factor Models (LFM), Preference Learning, Cross-domain, Content-based Filtering (CBF), Graph-based, Hybrid, and Co-occurrence. The first three are emerging approaches that did not appear in Beel's survey [BGLB16]. Table 3.2 shows the list of recommendation approaches with the number of papers assigned to each. For some of these approaches, we could identify different directions in the presented works, which allowed us to introduce sub-approaches. We will provide a definition for each approach and introduce the sub-approaches later in section 3.5, where we will also provide a short summary of each reviewed work. It is important to mention that we do not restrict each work to a single approach; on the contrary, a paper might present a recommender that applies multiple recommendation approaches. For example, [ZWL16] and [DNT14] present approaches that fall into both the CBF and graph-based categories. Another example is [BBM16], whose approach is both hybrid and LFM.
Table 3.1. Detailed mapping of the reviewed papers (by year) along the following dimensions: recommendation approach, recommendation scenario, user modeling, and handling of implicit feedback or the one-class problem.
Table 3.2. The recommendation approaches with the number of papers from the studied corpus in each approach.
Approach                                  Number of papers
Content-Based Filtering (CBF)             22
Hybrid                                     9
Graph-based                                8
Latent Factor Model (LFM)                  6
Preference Learning                        2
Cross-domain                               2
Co-occurrence based recommendation         2
3.4.2. Recommendation scenarios
The second dimension is the recommendation scenario. We defined the following three different scenarios based on the type of input passed to the recommender:
1. Input Query, where the user specifies a query that defines the user's needs or requests explicitly. The recommender system, in this case, acts more like a search engine that searches for relevant papers matching the query. Consequently, such systems do not model the user entity and do not apply user modeling or personalization: if two users enter the same search query, the recommender does not distinguish between them and delivers the same recommendations to both. In our paper corpus, six approaches follow this scenario.
2. Single Document, where the user specifies a single document (a research paper) and expects the recommender to deliver a set of papers that are similar or relevant to the input paper. This scenario is also not a typical recommendation scenario, since it removes the user entity from the model just as the first scenario does. However, systems that serve this scenario do not apply query-matching techniques; instead, they can employ the meta-data associated with the input paper, such as the list of references, the list of authors, the venue, etc., to discover related papers. As in the previous scenario, six of the reviewed papers follow this scenario.
3. Multiple Documents, where the recommender is given, as input, a set of publications that reflect the user's interest. Systems that follow this scenario analyze the input publications and discover commonalities among them in order to build a user model reflecting the user's interest. This scenario is the predominant one in our paper corpus, with 25 papers.
Systems that apply the first two scenarios can be seen as "case-based" recommender systems, as defined by Aggarwal in [A+16], where the user provides a single example
of interest that is treated as a user requirement rather than a historical rating. In addition to these scenarios, we shed light on some interesting special cases we came across in the reviewed papers. One case is the shortlisting task presented in [RFP16]: given a list of relevant papers (the reading list), the system should identify the important papers in that list. It is thus about finding important papers within a limited list rather than discovering potentially relevant papers from a wide candidate set. Another case is the work presented in [ZWL16], where, given a research target and the researcher's background knowledge, the system should recommend a set of papers that help the researcher achieve the research target; in other words, find the papers that bridge the user's knowledge gap. The final case is presented in [XCJ+17], which tackles an interesting setting that is not specific to scientific papers but concerns reading recommendation in general. In addition to the user's interest list, the authors point to the importance of other aspects, such as the stress levels associated with the reading material, and account for such aspects in their recommendations.
3.4.3. User modeling
Another dimension we consider in our comparison is user modeling. For this dimension, we check whether the work provides a method for user modeling, i.e., for building a user profile; we record this information in Table 3.1. None of the works that followed the input query or single document scenarios provide a method for user modeling. On the contrary, most works following the multiple documents scenario provide a method for modeling the user entity in the system. However, some works followed the multiple documents scenario without supporting user modeling, such as [DNT14].
3.4.4. Publications data usage
Scientific publications are rich in available descriptive and content data. We can categorize the data available for scientific publications into three categories:
1. Textual content, including the title, the abstract, the list of author-defined keywords, and the publication's full-text. The first three are usually made publicly available online by the publication's publisher or by digital libraries, unlike the full-text, which is usually protected behind paywalls; accessing the publications' full-text is therefore a challenging task for the recommender system. In our paper corpus, 17 papers relied solely on textual content, 11 of which utilized the publications' full-text.
2. Structured attributes, including other publication meta-data, which are likewise usually publicly available. Such attributes include the list of authors, the publication's venue, the year of publication, the number of pages, etc.
Only two works from the reviewed paper corpus relied solely on structured attributes: [XLL+16, SDF17]. Other works that benefited from structured attributes utilized them in addition to other types of publication data, such as textual content as in [ZZWS14, TMOZ16, AA17], the references list as in [ICG17, SJC+17], or both of them as in [XGLC