
Instituto Tecnologico y de Estudios Superiores de Monterrey

Campus Monterrey

School of Engineering and Sciences

A Data Analytics Approach for University Competitiveness: The QS Rankings

A thesis presented by

Ana Carmen Estrada Real

Submitted to the School of Engineering and Sciences

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science

Monterrey, Nuevo Leon, June, 2020

Dedication

To my loving parents.


Acknowledgements

To God, for filling my heart with strength and motivation. “I used to be afraid of failing at something that really mattered to me, but now I’m more afraid of succeeding at things that don’t matter.” (Goff, B., 2012)

To Professor Francisco Cantu, for the knowledge, the time, and the beautiful guidance. “I would maintain that thanks are the highest form of thought; and that gratitude is happiness doubled by wonder.” (Chesterton, G. K., 1908)

To my family, because no matter where I go, I can always go back home to find rest and love. “... may come home with a smooth round stone as small as the world and as large as alone. For whatever we lose (like a you or a me) it’s always ourselves we find in the sea.” (Cummings, E.E., 1942)

To my classmates, especially Polo and Nora, for being my brother and sister through this process. “Each of us stands under the same bright lights, shining down on the stage of life. Though we may be singing different harmony parts, we’re all still part of the same song.” (Gardiner, H. & M., 2019)

To all my friends, especially Larissa and her parents Silvia and Jaime, for giving me a safe space in their home and hearts. “I used to think being loved was the greatest thing to think about, but now I know love is never satisfied just thinking about it.” (Goff, B., 2012)

To the beautiful people that helped me accomplish my dreams: Irene and Markus, for giving me a home in London, and Vale and Lalo, for receiving me in Monterrey with so much love. “I used to think God guided us by opening and closing doors, but now I know sometimes God wants us to kick some doors down.” (Goff, B., 2012)

To Tecnologico de Monterrey and Conacyt, for the grant and the funding received. “There are places where we are going to compete to be best in class, and there are places where we can work together to add value for each other.” (Nadella, S., 2017)


A Data Analytics Approach for University Competitiveness: The QS Rankings

by Ana Carmen Estrada Real

Abstract

In recent years, globalization has pushed higher education into the international market. This has created a highly competitive environment in which many institutions use university rankings as a tool to attract the best academic and student talent from all over the world. In this work we take the QS World University Rankings and the QS Best Student Cities ranking as a base and apply data science techniques to extract information on the performance of the most attractive institutions and cities for students worldwide, and to develop a methodology that allows the stakeholders of those institutions and cities to improve their services for the benefit of students interested in receiving an education of global quality. We accumulated ten years of university rankings (2011-2020) and six years of city rankings (2014-2019) and carried out an exploratory analysis of the indicators and their influence on the final score. We then trained multiple regression and panel data models to predict the score. Finally, in order to predict the position, we carried out groupings and trained various machine learning algorithms. With this work we present a methodology that allows administrators to plan long-term institutional improvements to offer a better education and improve their performance in world rankings.


List of Figures

2.1 Sigmoid function plot from [-5, 5].
2.2 Support vector machine hyperplanes for three classes using data iris.
2.3 Decision tree with target variable Yes/No.
2.4 Bayesian network of three random variables [33].

3.1 CRISP methodology diagram.
3.2 Histogram and probabilistic distributions of the six score indicators and overall score. (Academic Reputation, Employer Reputation, Faculty Student, Citations per Faculty, International Faculty, International Students and Score.)
3.3 Cumulative empirical probability distribution for each indicator and the overall score. (Academic Reputation, Employer Reputation, Faculty Student, Citations per Faculty, International Faculty, International Students and Score.)
3.4 Clustering plots evaluating number of clusters with four algorithms.
3.5 Clustering Top 100 universities using Academic Reputation and Citations per Faculty.
3.6 RMSE achieved by the Feature Selection algorithm with different number of variables.
3.7 Spearman correlation between the ranking of the cities and their indicators.
3.8 Rank vs Affordability with cities ranked in 2018.
3.9 Rank vs Student View with cities ranked in 2018.
3.10 RMSE achieved by using different number of variables with the QS BSC dataset.
3.11 Scatter plot of correlations between each indicator against the overall Score.
3.12 Bayesian network learned from 2011 data.
3.13 Bayesian network learned from 2012 data.
3.14 Bayesian network learned from 2013 data.
3.15 Bayesian network learned from 2014 data.
3.16 Bayesian network learned from 2015 data.
3.17 Bayesian network learned from 2016 data.
3.18 Bayesian network learned from 2017 data.
3.19 Bayesian network learned from 2018 data.
3.20 Bayesian network learned from 2019 data.
3.21 Bayesian network learned from 2020 data.

4.1 ROC curves with AUC values for: (a) logistic regression, (b) SVM linear, (c) SVM radial, and (d) random forest.
4.2 ROC curves with AUC values for: (a) decision trees, (b) SVM linear, (c) SVM radial, and (d) random forest.
4.3 ROC curves with AUC values for: (a) decision trees, (b) SVM linear, (c) SVM radial, and (d) random forest.
4.4 Trained bayesian network.

5.1 Scatter plots of the Tecnologico de Monterrey indicators for years 2011-2019.
5.2 Scatter plots of Carnegie Mellon University indicators for years 2011-2019.
5.3 Scatter plots of University Of Texas At Austin indicators for years 2011-2019.
5.4 Scatter plots of Universidad De Buenos Aires indicators for years 2011-2019.
5.5 Scatter plots of Pontificia Universidad Catolica De Chile indicators for years 2011-2019.
5.6 Scatter plots of Universidad Nacional Autonoma De Mexico indicators for years 2011-2019.
5.7 Scatter plots of Universidade De Sao Paulo indicators for years 2011-2019.

List of Tables

3.1 Quacquarelli Symonds World University Rankings Methodology
3.2 Grouping universities by frequency.
3.3 Table with statistics from the six indicators and overall score.
3.4 Clustering universities by Rank.
3.5 Table with the maximum and minimum scores achieved by universities in each group by indicator in the 2020 ranking.
3.6 Table with the Spearman correlation coefficients for the six indicators related to the final score.
3.7 Collinearity measures from the six indicators related to the Score.
3.8 Table showing the countries with more cities ranked by QS BSC.
3.9 Table presenting the Top 10 countries of the cities with the highest average fees.
3.10 Top 10 cities with more universities ranked by QS.
3.11 Accuracy in training set for the four models.
3.12 Accuracy in training set for the four models.
3.13 Table with the utilities of the actions taken by two universities competing.

4.1 Metrics with performance of multiple regression and panel data on the test set.
4.2 Differences between the predicted and real score values applying holdout to every year.
4.3 Table with statistics of the difference between the predicted and the real value of the Overall for the six years.
4.4 Logistic regression confusion matrix (Accuracy 0.9948)
4.5 Support Vector Machine Linear confusion matrix (Accuracy 0.9812)
4.6 Support Vector Machine Radial confusion matrix (Accuracy 0.9608)
4.7 Random forest confusion matrix (Accuracy 0.9948)
4.8 Accuracy and AUC for the four models.
4.9 Decision trees confusion matrix (Accuracy: 0.3622)
4.10 Support vector machine linear confusion matrix (Accuracy: 0.3928)
4.11 Support vector machine radial confusion matrix (Accuracy: 0.6734)
4.12 Random forest confusion matrix (Accuracy: 0.8979)
4.13 Accuracy and AUC for the four models.
4.14 Decision Trees confusion matrix. (Accuracy: 0.2517)
4.15 SVM Linear confusion matrix. (Accuracy: 0.6666)
4.16 SVM Radial confusion matrix. (Accuracy: 0.4966)
4.17 Random forest confusion matrix. (Accuracy: 0.5374)
4.18 Accuracy and AUC for the four models.
4.19 Fitted nodes of the Bayesian Network.
4.20 Conditional probability table for Academic Reputation.
4.21 Conditional probability table for Faculty Student Ratio.
4.22 Conditional probability table for International Faculty.
4.23 Conditional probability table for Employer Reputation (Dependent on Academic Reputation).
4.24 Conditional probability table for Citations per Faculty, dependent on Academic Reputation and International Faculty.
4.25 Conditional probability table for International Students, dependent on Employer Reputation and International Faculty.
4.26 For the Score, dependent on the six indicators. Some missing values are due to not possible combinations of T/F between variables.

5.1 Prediction of indicators and overall score for year 2020 for Tecnologico de Monterrey.
5.2 Prediction of indicators and overall score for year 2020 for Carnegie Mellon University.
5.3 Prediction of indicators and overall score for year 2020 for University Of Texas At Austin.
5.4 Prediction of indicators and overall score for year 2020 for Universidad De Buenos Aires.
5.5 Prediction of indicators and overall score for year 2020 for Pontificia Universidad Catolica De Chile.
5.6 Prediction of indicators and overall score for year 2020 for Universidad Nacional Autonoma De Mexico.
5.7 Prediction of indicators and overall score for year 2020 for Universidade De Sao Paulo.
5.8 Summary of results from the seven universities.

Contents

Abstract

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Problem Statement and Context
    1.2.1 QS WUR
    1.2.2 QS BSC
  1.3 Hypothesis and Research Questions
  1.4 Summary

2 Background and Theoretical Framework
  2.1 University Rankings
  2.2 Statistical measures
    2.2.1 Correlation
    2.2.2 Clustering
  2.3 Modeling
    2.3.1 Multiple Linear Regression
    2.3.2 Panel Data
    2.3.3 Non-linear multiple regression
    2.3.4 Machine Learning
  2.4 Probabilistic methods
    2.4.1 Bayesian Networks
  2.5 Game Theory
  2.6 Summary

3 Methodology
  3.1 Business Understanding: Rankings and Competitiveness
  3.2 Data Understanding and Preparation
    3.2.1 QS World University Ranking
    3.2.2 QS Best Student Cities Ranking
  3.3 Exploratory Data Analysis
    3.3.1 QS WUR
    3.3.2 QS BSC
  3.4 Modeling
    3.4.1 Multiple Regression and Panel Data
    3.4.2 Machine Learning
    3.4.3 Probabilistic: Bayesian Networks
    3.4.4 Game Theory
  3.5 Summary
    3.5.1 Main findings in QS WUR
    3.5.2 Main findings in QS BSC

4 Results and Evaluation
  4.1 Multiple Regression and Panel Data
    4.1.1 QS WUR (World University Ranking)
    4.1.2 QS BSC (Best Student Cities)
  4.2 Machine Learning
    4.2.1 QS WUR
    4.2.2 QS BSC
  4.3 Probabilistic: Bayesian Networks
    4.3.1 Conditional Probability Tables
  4.4 Summary

5 Deployment
  5.1 Tecnologico de Monterrey
  5.2 Carnegie Mellon University
  5.3 University Of Texas At Austin
    5.3.1 Final Score probability calculation
  5.4 Universidad De Buenos Aires
  5.5 Pontificia Universidad Catolica De Chile
  5.6 Universidad Nacional Autonoma De Mexico (UNAM)
  5.7 Universidade De Sao Paulo (USP)
  5.8 Summary

6 Discussion
  6.1 QS WUR
  6.2 QS BSC
  6.3 Summary

7 Conclusions
  7.1 Future Work
  7.2 Summary

A Recommendations for university administrators in order to enhance ranking outcomes.

B Publications
  B.1 Articles
  B.2 Book chapter
  B.3 Presentations

Bibliography

Chapter 1

Introduction

In this work we discuss the story behind the rise of university rankings and how they have gained media importance, focusing later on the QS World University Rankings (QS WUR) specifically. We also discuss the growing international mobility of students as a reference for the introduction of the QS Best Student Cities (QS BSC) ranking, which is also studied.

There are two other major world university rankings, ARWU and THE, but we chose the QS ranking because of data availability. Analyzing either of the other two would require contacting the company to acquire the data, whereas the QS data are publicly available on its website.

At the beginning of the 21st century, the concept of world-class universities was introduced. World-class universities have particular characteristics that differentiate and distinguish them, among which are recognized graduates, cutting-edge research, and international information and technology transfer. Institutions of higher education compete and collaborate with each other in order to attract the most talented students, the best academics and international funding.

To analyze this new concept, display tables were created: numerical values were assigned to the attributes of these institutions, including Nobel Prizes, citations and publications in Nature; the data were ordered and weighted; and the World University Rankings were produced. Currently these rankings are used not only by universities, but also by students seeking access to the best education, academics and professors wanting to start their careers, private institutions that want to collaborate with research centers, and governments allocating the education budget [11].


On the other hand, by November 15, 2015 about 40% of the world population had access to the internet. As a result, the creation and traffic of information from individuals, institutions, governments, universities and events happening around the world every day is increasing rapidly. Scientific information has been part of this growth: the number of publications, books, patents, reviews, article accesses and citations, and the large bases of authors and documents that need to be found in bibliometric databases. As an example of the size of these databases, in 2014 Clarivate Analytics WoS contained 50,000 scholarly books, 12,000 research journals, 160,000 conference proceedings, 90 million records and 1 billion citations, with 65 million added every year.

To help us understand bibliometric databases, scientometrics applies Data Analytics methods and tools and calculates scientific indicators in science, technology and innovation. Scientometrics should answer several questions, including how to understand scientific citations; how to measure scientific impact, including that of researchers, journals and institutions; how to compare disciplines; and what kind of indicators should be used [11].

Also, the mobility of students is increasing every year: the UNESCO Institute for Statistics reports around 4.8 million international students in 2016, compared to 2 million in 2000 [68]. The flow of students is international; some countries tend to send students and others to receive them. The factors that influence this flow have to do with culture, economics and politics. Some regions have traditionally been recipients of students because of their economic stability, such as the United States, the United Kingdom, Germany and France. However, different internationalization factors have created emerging regions that receive students from neighboring countries, such as Japan, Mexico, Russia and South Korea [29]. This has drawn higher education institutions to enter the international market.

1.1 Motivation

Within this context, QS launched the World University Rankings and Best Student Cities to meet this contemporary demand, where, through a structured and transparent process, information is graspable and immediate and where the realities defining academic institutions around the world are captured by numerical measures. Ranking is defined as the practice of listing universities in an ordered list based on performance indicators. QS has identified four main pillars that contribute to a world-class university: (1) research, (2) teaching, (3) employability, and (4) internationalization [61].
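As an illustration of this definition of ranking, the following minimal sketch orders a few universities by a weighted sum of six indicator scores. The weights are an assumption taken from the QS WUR methodology published for the editions studied here (40% Academic Reputation, 10% Employer Reputation, 20% Faculty Student, 20% Citations per Faculty, 5% International Faculty, 5% International Students). The university names, indicator values and function names are invented for the example, and the normalization that QS applies to published scores is omitted.

```python
# Indicator weights: assumed from the published QS WUR methodology
# for the 2011-2020 editions (not taken from the dataset itself).
WEIGHTS = {
    "academic_reputation": 0.40,
    "employer_reputation": 0.10,
    "faculty_student": 0.20,
    "citations_per_faculty": 0.20,
    "international_faculty": 0.05,
    "international_students": 0.05,
}


def overall_score(indicators):
    """Weighted sum of indicator scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)


def rank_universities(table):
    """Return (position, name, score) tuples ordered by descending score."""
    scored = [(name, overall_score(ind)) for name, ind in table.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        (position, name, round(score, 2))
        for position, (name, score) in enumerate(scored, start=1)
    ]


# Hypothetical universities and indicator values, for illustration only.
EXAMPLE = {
    "University A": {
        "academic_reputation": 90, "employer_reputation": 80,
        "faculty_student": 70, "citations_per_faculty": 60,
        "international_faculty": 50, "international_students": 40,
    },
    "University B": {
        "academic_reputation": 60, "employer_reputation": 70,
        "faculty_student": 90, "citations_per_faculty": 95,
        "international_faculty": 100, "international_students": 100,
    },
    "University C": {
        "academic_reputation": 50, "employer_reputation": 50,
        "faculty_student": 50, "citations_per_faculty": 50,
        "international_faculty": 50, "international_students": 50,
    },
}

ranking = rank_universities(EXAMPLE)
for position, name, score in ranking:
    print(position, name, score)
```

In this toy example University B outranks University A despite a much lower Academic Reputation, because its other indicators compensate. This is the kind of trade-off between indicators that the rest of this work examines.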

Behind the number of ranked institutions there is a highly dynamic and evolving process. The QS team remains on a continuous quest to identify gaps and seek further data and methodological refinement in order to improve the accuracy of the rankings and other regional and specialized tables. QS is committed to the world's academic institutions and to transparency, continued accuracy and relevance, so that the rankings continue to be a powerful tool for students and stakeholders. Data acquisition teams are responsible for validating the data used in the rankings, from domestic performance, survey performance and geographical balancing to universities directly requesting to be added to ranking lists. Many aspects are weighted to represent the development of education institutions around the world as faithfully as possible.

Universities have found themselves in need of knowing their indicators and creating strategies that strengthen those in which they have obtained low scores, in order to remain competitive and keep rising in the world list of institutions.

1.2 Problem Statement and Context

The higher education system is inevitably being marketized. The participants of the education system are in general more aware of their role as part of a business model: they obtain benefits from the institution while receiving support and gaining skills. Universities are building their strategies as in a market, trying to be more effective, studying their competitors, and knowing their strengths and their lower indicators. Administrations have faced falling government support and rising costs in education, which has increased competition to attract potential donors, talented students and qualified academics [22].

1.2.1 QS WUR

Maria Yudkevich, vice rector of the National Research University in Moscow, discusses the analogy between university rankings and the Olympic Games [74]. Just like Olympic medals, rankings are a zero-sum game where there is only one #1 university, and only 100 institutions can be in the top 100. Some systems are inherently better than others, producing many strong winners, elite students and professors, and attracting top talent to improve their competitiveness even more. However, the evaluation of achievement cannot focus just on the podium of winners; it has to be able to understand the complex environment in which universities and education develop to ensure a healthy and dynamic competition.

Scientific data can be properly analyzed to evaluate and design indicators that illustrate diverse behaviours in academic structures. At the International Society for Scientometrics and Informetrics Conference, topics under discussion include historical evaluations, the effects of funding on the quality of knowledge produced, and the role of gender in graduate programs [53]. A deeper analysis of the information contained in the ranking indicators can give advantages to universities that currently do not belong to the elite, providing tools that allow them to develop, to know their competitors better, and to identify which characteristics can improve beyond the academic.

Universities want to predict their score and position in next year’s ranking based on the improvements they implement and that have a direct impact on the indicators used to grade them. Therefore, structuring the behaviour of the top 800 universities over the last ten years using QS databases can give stakeholders practical tools for the decision-making process in the development of goal plans for each university.

Universities are changing their interinstitutional and intrainstitutional behavior thanks to Research Analytics. Rankings have positive and negative aspects: if a university follows the right ranking it may develop useful strategies, but following the wrong ranking can make it wander far from its institutional mission.

Higher education institutions also maintain relationships with each other and with intermediary institutions such as media and ranking agencies. Because these are relationships between two institutions, they affect two large populations of individuals who participate in the institutions. The decisions that impact all of the participants are concentrated in the hands of interinstitutional brokers, who have great responsibility to craft strategies for positive change.

Media and rankings are institutions that determine much of a university’s brand and are critical to a university’s strategy. Many universities are valuable in ways that rankings do not capture; history and context make elite universities different in each country. Stakeholders want to be sure that universities keep their cultural and contextual character while also being able to game the global research university rankings. If a university is sure of its objectives, using Research Analytics can be helpful to build new strengths.


James Dearden of the Economics Department of Lehigh University analyzes the indicator manipulation strategies carried out by different universities in the United States to obtain benefits [15]. For example, universities admit students ahead of time, because it has been shown that the most interested students tend to be the most stable in their enrollment. The discussion focuses on the impact of rankings on universities, the idea being that universities should be encouraged to grow integrally and to help students have options rather than limiting them to the same institutions every year. Variety in indicator weights is important to grant opportunities to different competitors.

In general, universities and institutions are affected because there is no methodology for analyzing the ranking indicators that extracts useful information, such as which indicators competitors are using to improve their position. Also, there is no known model that analyzes the dynamics of the universities participating in the rankings: how many universities enter and leave each year, and how many places a university can rise or fall depending on its performance in the previous year.
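The dynamics mentioned in the paragraph above (universities entering and leaving the ranking, and places gained or lost from one year to the next) can be tabulated directly from two consecutive editions. The sketch below is only an illustration under stated assumptions: the function name, the dictionary layout and all names and positions are invented and do not reflect the format of the QS data.

```python
def ranking_dynamics(previous, current):
    """Compare two ranking editions given as {university: position} dicts.

    Returns the universities that entered, those that left, and the change
    in position for those present in both editions. A positive change means
    the university climbed (its position number became smaller).
    """
    entered = sorted(set(current) - set(previous))
    left = sorted(set(previous) - set(current))
    changes = {
        name: previous[name] - current[name]
        for name in previous
        if name in current
    }
    return entered, left, changes


# Hypothetical positions for two consecutive editions.
ranks_2019 = {"University A": 1, "University B": 2,
              "University C": 3, "University D": 4}
ranks_2020 = {"University B": 1, "University A": 2,
              "University D": 3, "University E": 4}

entered, left, changes = ranking_dynamics(ranks_2019, ranks_2020)
print("entered:", entered)
print("left:", left)
print("changes:", changes)
```

Because the ranking is zero-sum, every place gained by one university is a place lost by another or a university pushed out of the list, which the three outputs make explicit.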

This study focuses on the relationship between the indicators, the overall scores of universities and the competitiveness among them, in order to understand how places are defined each year. The results are relevant for the academic community and for society because they make it possible to understand what makes a university improve each year, and to direct efforts and talent toward comprehensive development among students, institutions and resources in higher education.

The Tecnologico de Monterrey has compiled the database of QS rankings of world universities from 2011 to 2020, with around 800 universities each year. This database contains the ranking position, the overall score, the name of the university, and the score and rank for each indicator.

We expect to successfully predict the scores and positions universities will occupy in the following year based on the performance shown during their presence in these rankings. These results will serve to build a tool that allows universities and government authorities to know their performance and the specific areas they can improve to raise their position in the rankings, and therefore to project the expected results and the resources that should be used.


1.2.2 QS BSC

UNESCO published in a General Conference in 1946 that international education promotes friendly relations between people and States with different political systems, as well as respect for human rights and freedom. The intellectual development of human beings allows them to develop capacities of responsibility, freedom and cooperation towards peace, thanks to the spirit of tolerance and objectivity [40].

In addition to the social importance of international education, the economies of cities that receive international students have shown positive growth. In the United Kingdom in the 2014-2015 period, international students reached 19% of the total number of students enrolled, and their activity contributed 13.8 billion pounds in gross value added (GVA), with an impact on 206,000 full-time jobs and on transportation and real estate companies [28].

That is why we believe that the QS approach is of great interest to governments that want to know the conditions under which higher education students decide to participate in international experiences. QS Best Student Cities has established a methodology that takes into account the most important aspects that help students decide where to study: institutional fees, the cost of housing, safety, tolerance and cultural inclusion, and the experience of previous students. These indicators are analyzed in this work to give statistical information on the existing indicators, build a predictive model and propose different metrics that could be useful to improve the international perspective of cities.

The objective of the research is to obtain the behavior patterns of the universities from their indicators and their performance over the nine years analyzed, taking into account their evolution over time and the behavior of their competitors. Data science makes it possible to study, by means of regressions, correlations and other operations, the quality of the database and the information it contains.

1.3 Hypothesis and Research Questions

This work aims to show that it is possible to build predictive models of university performance. The means for achieving this goal is the application of statistical and machine learning algorithms to ranking datasets. This will provide insight to stakeholders for making data-oriented business decisions.


The research questions that we want to answer are:

• Are rankings datasets amenable to data science analysis?

• Which are the best models that can be applied to the data to get an accurate prediction of QS WUR and QS BSC?

• What are the effects of grouping universities or cities for the analysis of their indicators, and how can their differences be characterized?

• How can we contribute to the decision-making process of a university or an international city that wishes to improve its performance and increase its position in the ranking?

1.4 Summary

Having presented the problem in a general way, we proceed as follows. In Chapter 2 we explain past work on the understanding and prediction of rankings, together with the theory behind the statistics and models applied to the data. In Chapter 3 we talk about the methodology: CRISP-DM is presented in a general way, followed by the steps we take to understand, study and model the data. In Chapter 4 we present results and evaluation, both of panel data and of the different machine learning algorithms used. In Chapter 5 we present the deployment for seven universities, their performance over the years 2011-2019 and our prediction for 2020. In Chapter 6 we discuss the results of the two databases. Finally, in Chapter 7 we present our conclusions and future work.

Chapter 2

Background and Theoretical Framework

In this chapter we talk about the importance of university rankings and about some studies that have analyzed and predicted rankings; these works were taken as a starting point for our research. We then present the statistical measurements that were carried out on both databases and the predictive algorithms.

2.1 University Rankings

Universities are changing their interinstitutional and intrainstitutional behavior thanks to Research Analytics. Rankings have positive and negative aspects: if a university follows the right ranking it may develop useful strategies, but following the wrong ranking can make it wander far from its institutional mission.

Media and rankings are institutions that determine much of a university’s brand and are critical to a university’s strategy. Many universities are valuable in ways that rankings do not capture; history and context make elite universities different in each country. Stakeholders want to be sure that universities keep their cultural and contextual character and are also able to game the global research university rankings. If a university is sure of its objectives, using Research Analytics can help it build new strengths [23].

University rankings began in China in 2003 with the launch of the ARWU, whose methodology has been consistent to date. It consists of six indicators, and for a university to be considered it must have a Nobel Prize, a medal or a highly cited researcher [11].

The indicators are quality of alumni (10%), which takes into account the number of students who win Nobel prizes; number of teachers with Nobel prizes (20%); teachers with a high number of citations (20%); articles published in Nature and Science (20%); articles indexed by the Science Citation Index (20%); and performance per person (10%), adding up to 100% [14].

A year later the THE-QS ranking was launched, in which Times Higher Education and Quacquarelli Symonds collaborated to create a new ranking. In 2010 they separated. THE decided to adopt a methodology oriented to employability and not only to the education of an institution. Its indicators are teaching (30%), which takes into account reputation, the number of students per teacher, the number of teachers with a PhD and awards received; research (30%), which also takes into account a survey, research income and productivity; citations (30%); international image (7.5%); and industry income (2.5%) [7].

We focus on QS due to the availability of data, but in this section we present related works that cover the three rankings, or even local rankings, and that give us an idea of the analyses carried out in other studies.

A 2016 study [16] covered the years 2008-2013 and analyzed the I-distance for each indicator, seeking to give better justified weights based on the correlations of previous years. It concludes that the weights provide little opportunity for new universities to enter the ranking; however, the influence of the rankings is recognized, which is why it is important that universities learn to interpret them.

In another study [43], correlations were computed among the different world rankings. It concludes that each university should be able to plan its growth strategies with respect to the rankings but oriented to its mission. It also recommends rankings segmented by subject, which QS has already implemented in recent years.

In the particular case of the Nordic universities [18], it is recommended to keep an objective attitude and to base decisions not on the rankings but on improving the overall quality of the institution. Still, it is important to pay attention to the worldwide position as a competitive advantage.


Among other works, predictions have been made for universities in Japan [37], showing the influence of university size, internationalization and whether the institution is public or private on the world ranking of a university. In contrast to this result, in Spain the university rankings were compared with the results of medical students in health institutions, showing that better ranked universities do not guarantee that their students obtain better test results than those of lower ranked universities [31].

Among studies related to the prediction of rankings, a study of 154 institutions in the United Kingdom found that scientific metrics such as publications and citations are a good estimate of academic reputation when the count is carried out by institution and department and not at the individual level [47]. A prediction of the Times Higher Education (THE) World University Ranking was also carried out, using five years as a database (2011-2015) and 2016 as the test set. The authors use a prediction algorithm based on empirical statistics and achieve small deviations between their prediction and the test year [64].

In addition to articles on university rankings, books have also been written about them. Some focus on collecting ranking information and providing a summary that guides students more efficiently, complemented with information on specific applications and careers [44]. The latest results of the rankings by subject have also been published, together with the government support available depending on the university and the country, what student life is like and how much it costs to move to another city [45]. These books complement the information in the rankings so that students can make a better decision.

Other books instead make a statistical and phenomenological analysis of the results of the rankings in institutions and countries, which complements the critical view with which the results of the rankings should be read each year [60]. They put into perspective the reaction that institutions have to these results and recommend a more objective and less competitive view, since weights and indicators often give an advantage to the same institutions and are redundant, preventing the entry of new institutions of great quality.

Despite the advantages and disadvantages that the presence of university rankings in the media can represent, it is a fact that they will continue to be an influential source of information and a reference for decision makers such as administrators, educational policy makers, parents and students [17]. Institutions must be open to understanding them and learn to use them in their favor to promote their individual growth.


2.2 Statistical measures

This section shows calculations such as the minimum and maximum per indicator, the correlations of the indicators with the overall score and a grouping exercise.

2.2.1 Correlation

There are many correlation coefficients in the literature. Pearson is widely used; however, it assumes that the data are continuous, normally distributed, linearly related, homoscedastic and free of outliers. We tested for the presence of heteroskedasticity and found that the residuals of the indicators have about the same variance, but outliers are present. To deal with this, the Spearman correlation was proposed.

The correlation coefficient reveals the strength and the direction of the relationship between two variables. Its values range over [-1, 1], where values close to 1 or -1 indicate a very strong relationship and 0 indicates a negligible relationship.

In this case, the Spearman correlation coefficient, also known as Spearman rank, was used. This coefficient requires only that the data be ordinal and monotonically related, which keeps the coefficient from being affected by outliers. The equation that defines this correlation is

$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \tag{2.1} $$

where $d_i$ is the difference between paired ranks and $n$ is the number of cases [54].

For every indicator there is a tendency for a great number of universities to concentrate on the highest scores; for the Citations per Faculty indicator there is a practically normal distribution. In contrast, in International Faculty we see a tendency towards very low and very high scores.
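As a minimal sketch of Equation 2.1, the following Python code ranks two lists and applies the formula directly. The function names and the sample scores are hypothetical, and the formula in this form assumes there are no tied values. The last value of `overall` is an outlier, yet the coefficient is unaffected because only the ranks matter.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman coefficient from Equation 2.1 (valid when there are no ties)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical indicator scores; the last overall value is an outlier.
indicator = [10.0, 20.0, 30.0, 40.0, 50.0]
overall = [1.0, 2.0, 3.0, 4.0, 500.0]
rho = spearman_rho(indicator, overall)
```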

2.2.2 Clustering

A cluster is a set of items with similar features. Normally the features belong to a high-dimensional space, and the similarity between points is defined using a distance measure [19].


We evaluated universities with four different measures to decide the number of clusters into which to divide every group.

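First, the distance between two points a and b can be defined as Euclidean, Manhattan, Canberra, Maximum, etc. [75]. In this case the Euclidean distance was used:

$$ d(a, b) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2} \tag{2.2} $$

where $x_j$ and $y_j$ are the $j$-th coordinates of $a$ and $b$, and $d$ is the number of dimensions.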

The distance between them can be Euclidean or Manhattan. The clustering methods considered were k-means; PAM, which stands for Partitioning Around Medoids and is similar to k-means but, instead of updating the mean of the instances, calculates and minimizes the dissimilarity $D_p$ between objects; and CLARA, which stands for CLustering LARge Applications and applies PAM clustering to samples of the data to generate an optimal set [55].

The points are assigned to their closest cluster according to their Euclidean distance $d_i$; then the cluster center $C_j$ is updated to be the mean of its constituent instances [71].
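The assignment and update steps just described can be sketched in a few lines of Python. This is an illustrative k-means loop under simple assumptions (fixed number of iterations, initial centers supplied by the caller); the function names and the two-indicator scores are hypothetical and do not come from the thesis data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points (Equation 2.2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, centers, iterations=10):
    """Plain k-means: assign each point to its closest center, then
    move every center to the mean of the points assigned to it."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Hypothetical two-indicator scores for six universities.
scores = [(90, 85), (88, 92), (95, 90), (20, 25), (25, 18), (15, 22)]
centers, labels = k_means(scores, centers=[scores[0], scores[3]])
```

PAM and CLARA follow the same assign-and-update idea but use medoids (actual observations) instead of means as cluster centers.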

2.3 Modeling

Modeling refers to the algorithms that we train to learn the output of interest. In this case we use multiple regression and panel data: we introduce the variables known to the institutions, which here are the indicators, and predict the final score. We also predict the positions of the universities and cities using machine learning algorithms.

2.3.1 Multiple Linear Regression

We know that the QS methodology uses six indicators (variables) to calculate the final score, which is then used to rank the universities. So the first approach to modeling the predictive algorithm is to train a multiple linear regression.

In this algorithm a target variable $y$ is defined, with $k$ independent variables or predictors $(x_0, x_1, \ldots, x_k)$. Usually $x_0 = 1$, and $k$ is six in this case.

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon \tag{2.3} $$

In Equation 2.3 the parameters $\beta_0, \beta_1, \ldots, \beta_k$ are the regression coefficients and $\epsilon$ is the statistical error [51].

When we perform multiple linear regression on the training set, we obtain the coefficients. Since the coefficients are very close to the weights given by the methodology, we are confident that the regression closely approximates the model.
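To illustrate why the fitted coefficients approach the methodology weights, the sketch below fits Equation 2.3 by ordinary least squares on synthetic data. Everything here is an assumption made for illustration: the data are randomly generated rather than taken from QS, the overall score is built as an exact weighted sum, and the solver is a small hand-written routine rather than the tool used in the thesis.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    x = [[1.0] + list(r) for r in rows]  # x_0 = 1 gives the intercept beta_0
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data: 30 hypothetical universities, six indicator scores each,
# and an overall score built as an exact weighted sum of the indicators.
rng = random.Random(0)
weights = [0.40, 0.10, 0.20, 0.20, 0.05, 0.05]
rows = [[rng.uniform(1, 100) for _ in weights] for _ in range(30)]
targets = [sum(w * v for w, v in zip(weights, r)) for r in rows]
beta = fit_linear_regression(rows, targets)  # [beta_0, beta_1, ..., beta_6]
```

With noise-free synthetic scores the intercept is close to zero and the remaining coefficients are close to the weights; with real data, rounding and normalization in the published scores would make the match approximate.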

2.3.2 Panel Data

The panel data model is a specific case in which time series are taken into account and each individual is held fixed throughout the time series [3].

$$ y_{it} = \alpha_i + X'_{it} \beta + u_{it} \tag{2.4} $$

where $i = 1, \ldots, N$ indexes the individuals, in this case the universities, and $t = 1, \ldots, T$ indexes the time periods, in this case 2011-2020. $y_{it}$ is the final score of an individual university in a specific year, $X_{it}$ are the independent variables, $\beta$ is the matrix of coefficients and $\alpha_i$ are the individual effects. The error $u_{it}$ has a one-way component:

$$ u_{it} = \mu_i + \nu_{it} \tag{2.5} $$

where $\mu_i$ is the university’s unobservable effect and $\nu_{it}$ is the disturbance.

In this case fixed effects were used: each university gets a dummy variable and there is a global average $\nu$. Fixed effects are used under the assumption that each individual has an impact on the predictor [2].
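A fixed-effects fit of Equation 2.4 can be sketched with the within transformation, which is numerically equivalent to giving each university its own dummy variable. The code below is a minimal illustration with a single regressor; the function name and the tiny panel are hypothetical and are not the thesis data.

```python
from collections import defaultdict


def within_estimator(panel):
    """Fixed-effects (within) estimate for one regressor.

    `panel` is a list of (unit, x, y) observations. Demeaning x and y inside
    each unit removes the unit effect alpha_i, leaving a regression through
    the origin on the demeaned data.
    """
    groups = defaultdict(list)
    for unit, x, y in panel:
        groups[unit].append((x, y))

    means = {
        unit: (sum(x for x, _ in obs) / len(obs), sum(y for _, y in obs) / len(obs))
        for unit, obs in groups.items()
    }
    sxy = sum((x - means[u][0]) * (y - means[u][1]) for u, x, y in panel)
    sxx = sum((x - means[u][0]) ** 2 for u, x, y in panel)
    beta = sxy / sxx
    # Each alpha_i is recovered as mean(y_i) - beta * mean(x_i).
    alphas = {u: my - beta * mx for u, (mx, my) in means.items()}
    return beta, alphas


# Hypothetical panel: y = alpha_i + 0.5 * x with alpha_A = 10 and alpha_B = 30.
panel = [
    ("A", 20.0, 20.0), ("A", 40.0, 30.0), ("A", 60.0, 40.0),
    ("B", 20.0, 40.0), ("B", 40.0, 50.0), ("B", 60.0, 60.0),
]
beta, alphas = within_estimator(panel)
```

A pooled regression on the same data would ignore the gap between the two units; the within estimator recovers the common slope and a separate intercept per unit.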

2.3.3 Non-linear multiple regression

Loess comes from Lowess, which means LOcally WEighted regression. This regression provides a local fit that tends to be very effective when the function underlying the data is unknown. It relies on Taylor’s theorem, by which any sufficiently smooth function can be approximated locally by a polynomial.


Assume we have a target point $x = (x_0, x_1)$ around which we want to build the function. The distance between this point and its nearest neighbor can be Euclidean, a span $h$ is defined, and the weight is given by Equation 2.6 [12].

$$ w_i(x) = W\left(\frac{\lVert x_i - x \rVert}{h}\right) \tag{2.6} $$

The Loess logistic regression was used to work on the case study. In our experience the linear regression gave very optimistic results, so this local fitting approach was proposed to obtain a more conservative solution and prediction; the two were then compared.
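The weighting in Equation 2.6 can be illustrated with a small local linear fit in one dimension. The text does not state which weight function $W$ was used, so the tricube kernel below is an assumption (it is a common default for loess), and the function names and data are hypothetical.

```python
def tricube(u):
    """Tricube kernel, a common choice for W (an assumption here)."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_point(xs, ys, x0, h):
    """Locally weighted linear fit evaluated at x0 with span h (Equation 2.6)."""
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx
    return my + slope * (x0 - mx)


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [float(v) for v in range(11)]
ys = [2 * x + 1 for x in xs]
fitted = loess_point(xs, ys, x0=4.5, h=3.0)
```

Points farther than `h` from the target receive zero weight, so each prediction depends only on its neighborhood; repeating the fit at every target point traces the smoothed curve.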

2.3.4 Machine Learning

The question machine learning wants to answer is how to make computers improve with experience. The formal definition of machine learning is: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E [39].

Machine learning is useful for complex problems: elaborate tasks that require many instructions to handle; problems that go beyond human capabilities, such as genetic data analysis; and adaptability problems, such as voice recognition [57].

All the techniques we use are based on supervised learning, which means the algorithm learns from examples that are already labeled with their class. In this case, if we talk about the two classes of universities that are in the top 100 or top 200, the algorithm learns these labels from the beginning, and later, when a new unlabeled example is shown, it tries to classify it into one of these two groups.

Logistic Regression

Logistic regression is a statistical model, usually applied to binary classification problems, that uses the sigmoid function to determine the probability that an observation belongs to a class or not.


$$ g(z) = \frac{1}{1 + e^{-z}} \tag{2.7} $$

This function tends to zero for large negative values of z and tends to one for large positive values of z. In Figure 2.1 we can observe this behavior.

Figure 2.1: Sigmoid function plot over [-5, 5] (horizontal axis: z; vertical axis: g(z)).

The result of evaluating the function is interpreted as the probability of belonging to a discrete class, normally a binary one, as in our case, although more complex variations exist. The model is trained on data already labeled with their respective class; the sigmoid function is evaluated for both classes and the class with the highest probability is the one assigned [67].

$$ p(y \mid x, \theta) = \frac{1}{1 + \exp(\langle -y\,\phi(x), \theta \rangle)} \tag{2.8} $$

Normally the class labels are $y \in \{\pm 1\}$, and in practice the logistic function takes the form of Equation 2.8. This function predicts $+1$ when $p(y = +1 \mid x, \theta) \geq p(y = -1 \mid x, \theta)$, and $-1$ otherwise.
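Equations 2.7 and 2.8 and the prediction rule can be sketched as follows. The feature map $\phi(x)$ is taken to be the raw feature vector, which is an assumption for illustration, and the parameter values are hypothetical rather than fitted.

```python
import math


def sigmoid(z):
    """Equation 2.7: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def prob(y, features, theta):
    """Equation 2.8 with labels y in {+1, -1} and phi(x) taken as the raw features."""
    score = sum(f * t for f, t in zip(features, theta))
    return sigmoid(y * score)


def predict(features, theta):
    """Return +1 when p(+1 | x) >= p(-1 | x), otherwise -1."""
    return 1 if prob(+1, features, theta) >= prob(-1, features, theta) else -1


# Hypothetical parameters and two hypothetical observations.
theta = [1.0, -0.5]
label_a = predict([2.0, 1.0], theta)   # inner product is 1.5
label_b = predict([0.0, 4.0], theta)   # inner product is -2.0
```

Because the two class probabilities always add up to one, the rule is equivalent to checking the sign of the inner product between the features and the parameters.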

Support Vector Machine

This is a supervised learning method usually applied in classification or regression analysis. A support vector machine creates a hyperplane that partitions the observation space, trying to separate the classes while maximizing the distance between them. New observations are classified depending on the side of the plane they fall on [1].

This method has the advantage of avoiding back propagation during the computation thanks to its mathematical and statistical approach. It solves the optimization problem analytically, so it always returns the same hyperplane, in contrast to generation techniques. SVM needs less computation but can also be more vulnerable to outliers.

Figure 2.2: Support vector machine hyperplanes for three classes using the iris data (SVM classification plot; axes: Petal.Length and Petal.Width; classes: setosa, versicolor and virginica).

In Figure 2.2 we can observe three classes. The support vector machine has to create two hyperplanes, represented by the background color; the observations are represented by the points. If a new observation arrives, it is classified by the hyperplanes as setosa, virginica or versicolor.

This algorithm has been used to solve problems related to weather prediction, speaker recognition, handwriting, display advertising, image and video processing and other diagnosis tasks.

The main feature of support vector machines is that, instead of trying to minimize the error on the training data, they try to increase the distance between the classes that the hyperplane separates; this technique is called the maximum margin separator.


In mathematical terms, suppose we have two linearly separable classes $w_1$ and $w_2$. With $g(x) = w^T x + b$, the distance from any observation to the hyperplane is given by $|g(x)| / \lVert w \rVert$, and we have to find $w$ and $b$ such that $|g(x)| = 1$ at the data points nearest to the hyperplane.

$$ J(w) = \frac{1}{2} \lVert w \rVert^2 \tag{2.9} $$

Equation 2.9 is the objective function to minimize, subject to $y_i (w^T x_i + b) \geq 1$, $i = 1, 2, \ldots, N$.
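The quantities in this formulation can be checked numerically on a toy example. The sketch below does not train an SVM; it only evaluates the decision function, the distance to the hyperplane, the margin constraint and the objective of Equation 2.9 for a hand-picked hyperplane. The data, `w` and `b` are hypothetical.

```python
import math


def g(x, w, b):
    """Linear decision function g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(x, w, b):
    """Distance |g(x)| / ||w|| from an observation to the hyperplane."""
    return abs(g(x, w, b)) / math.sqrt(sum(wi ** 2 for wi in w))


def satisfies_margin(points, labels, w, b):
    """Check the constraint y_i (w^T x_i + b) >= 1 for every observation."""
    return all(y * g(x, w, b) >= 1 for x, y in zip(points, labels))


def objective(w):
    """Equation 2.9: J(w) = 0.5 * ||w||^2."""
    return 0.5 * sum(wi ** 2 for wi in w)


# Hypothetical separable data: class -1 at x_1 = 0 and class +1 at x_1 = 2.
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 0.0), -1.0          # hyperplane x_1 = 1, halfway between the classes
feasible = satisfies_margin(points, labels, w, b)
shrunk = satisfies_margin(points, labels, (0.5, 0.0), -0.5)  # smaller ||w||, same hyperplane
```

Scaling `w` and `b` down lowers the objective but breaks the constraint, which is why the minimization of Equation 2.9 has to be carried out subject to the margin condition.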

In practice, classes are rarely perfectly separable by linear planes, so support vector machines offer different approaches to deal with possibly mixed classes. Some examples are soft-margin SVM, kernel SVM, polynomial SVM and radial SVM. In our case we apply multiclass SVM with a linear and a radial approach to our data.

Random Forest

A random forest is an ensemble of a large number of unpruned decision trees, where a decision tree is defined as a predictive model that can perform classification or regression. It can be considered an expert system that automates the process of making decisions by accumulating knowledge from data [52].

Decision trees use a divide-and-conquer approach. The process starts by selecting an attribute (the root) that partitions the data into subgroups that are as pure as possible with respect to the target variable. The partitioning is recursive and continues until all the variables have been used for splitting or all the nodes (leaves) are pure.

Figure 2.3 shows an example of an ideal decision tree, where the question may be: from a group of eight individuals, how do we know whether they will respond yes or no? In this case we focus on physical characteristics; the three variables to evaluate are body shape, head shape and body color. The first variable chosen is body shape, which creates a binary partition. On the left, the next variable is body color, and that partition creates two pure leaves of the target variable. On the right, the next variable is head shape, which also creates pure leaves of the target variable [48].

The idea is that, when a new individual appears, we can evaluate it with our decision tree and know its answer. So if the individual has an oval body shape and a square head, we know that the answer will be no.

Figure 2.3: Decision tree with target variable Yes/No.

However, it is not always possible to get pure leaves: there may be too many features to test or too many possible partitions, so stopping criteria are used to decide whether a tree is good enough at classifying. In the case of random forest, we take advantage of the creation of several trees: instead of trying to get exact splits for each node, the combination of “good enough” trees improves the accuracy. The advantages of random forest are that it can handle a lot of variables and it is fast.

The algorithm used to build the trees is Algorithm 1. The IDT is any induction algorithm in which nodes are not pruned; instead of choosing the best split among all attributes, IDT randomly selects a subset of N attributes and chooses the best split among them.

Algorithm 1 Individual trees algorithm
Require: IDT (a decision tree inducer), T (the number of iterations), S (the training set), µ (the subsample size), N (number of attributes in each node)
Ensure: M_t; t = 1, ..., T
  t ← 1
  repeat
    S_t ← sample µ instances from S with replacement
    build classifier M_t using IDT(N) on S_t
    t ← t + 1
  until t > T
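Algorithm 1 can be sketched in Python as follows. To keep the example short, the inducer IDT is replaced by a one-split tree (a decision stump) that looks only at a random subset of attributes; this simplification, the function names and the toy training set are all assumptions made for illustration, and a real random forest would grow full unpruned trees.

```python
import random
from collections import Counter


def build_stump(sample, n_attributes, rng):
    """Hypothetical inducer IDT: a one-split tree that only considers
    a random subset of `n_attributes` attributes."""
    n_features = len(sample[0][0])
    candidates = rng.sample(range(n_features), n_attributes)
    best = None
    for attr in candidates:
        for row, _ in sample:
            threshold = row[attr]
            left = [label for x, label in sample if x[attr] <= threshold]
            right = [label for x, label in sample if x[attr] > threshold]
            if not right:
                continue  # this threshold does not split the sample
            left_label, left_hits = Counter(left).most_common(1)[0]
            right_label, right_hits = Counter(right).most_common(1)[0]
            errors = len(left) - left_hits + len(right) - right_hits
            if best is None or errors < best[0]:
                best = (errors, attr, threshold, left_label, right_label)
    if best is None:  # no valid split: fall back to the majority class
        majority = Counter(label for _, label in sample).most_common(1)[0][0]
        return lambda x: majority
    _, attr, threshold, left_label, right_label = best
    return lambda x: left_label if x[attr] <= threshold else right_label


def random_forest(training_set, n_trees, subsample_size, n_attributes, seed=0):
    """Algorithm 1: build T classifiers, each on a bootstrap sample of S."""
    rng = random.Random(seed)
    models = []
    t = 1
    while t <= n_trees:
        sample = [rng.choice(training_set) for _ in range(subsample_size)]
        models.append(build_stump(sample, n_attributes, rng))
        t += 1
    return models


def predict(models, x):
    """Majority vote over the ensemble."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Hypothetical training set: two indicator scores and a class label.
training_set = [
    ((85.0, 90.0), "top100"), ((92.0, 80.0), "top100"),
    ((88.0, 95.0), "top100"), ((95.0, 85.0), "top100"),
    ((30.0, 25.0), "other"), ((22.0, 35.0), "other"),
    ((40.0, 20.0), "other"), ((35.0, 38.0), "other"),
]
models = random_forest(training_set, n_trees=25, subsample_size=8, n_attributes=1)
```

Each individual stump is weak, but the vote across 25 of them, each trained on a different bootstrap sample and attribute subset, is what gives the ensemble its accuracy.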


2.4 Probabilistic methods

Probability allows us to know the conditions that influence the occurrence of an event. In this case we present Bayesian networks as a strategic instrument to discover causal relationships between variables.

2.4.1 Bayesian Networks

Bayesian Networks (BNs) have become extremely popular models in the last decade. They have been used for applications such as machine learning, text mining, natural language processing, speech recognition, signal processing and bioinformatics. The general graph form of BNs can represent hypotheses, beliefs and latent variables. The edges represent direct dependence among the variables; in particular, an edge from node $X_i$ to node $X_j$ represents a statistical dependence between the corresponding variables [4].

Bayesian networks were born at the intersection of artificial intelligence, statistics and probability. They belong to a more general class called probabilistic graphical models, which are designed to handle complex probabilistic models by decomposing them into smaller components. Bayesian networks are used to investigate distant relations between variables, to make predictions and to explain conditions in data; they compute the conditional probability distribution of one unknown variable based on the data of the others [33].

There are two key factors in a Bayesian network. The first is the decomposition, which provides an understandable description of the system and lets the data be efficiently distributed. The second is the use of conditional independence to break down the distribution between variables. In Figure 2.4 there are three random variables $Y_1$, $Y_2$ and $Y_3$; $Y_1$ and $Y_2$ are independent given $Y_3$ if the conditional distribution of $Y_1$ given $Y_2$ and $Y_3$ is a function of $y_1$ and $y_3$ only, as shown in Figure 2.4. Formally:

$$ p(y_1 \mid y_2, y_3) = p(y_1 \mid y_3) \tag{2.10} $$

where $p(y \mid x)$ is the probability density of $Y$ given $X$.

Figure 2.4: Bayesian network of three random variables [33].

In this work we use the R bnlearn package with a tabu learning algorithm. This is a score-based search algorithm whose objective is to find an acyclic digraph $G$ that maximizes a score $s : G \to \mathbb{R}$. If $(G, p)$ is a Bayesian network and $D$ is a collection of $N$ iid data generated from $p$, then

$$ N \to \infty \implies G = \arg\max s(G) $$

So, if the likelihoods of $G$ and $G'$ are equal but $G$ has more parameters than $G'$, then consistency and succinctness ensure that true Bayes nets can be recovered asymptotically up to Markov equivalence. The score $s(G)$ is expected to be computable in polynomial time [36].
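The conditional independence statement of Equation 2.10 can be verified numerically on a small discrete example. The sketch below assumes a hypothetical structure in which $Y_3$ is the common parent of $Y_1$ and $Y_2$ (not necessarily the structure drawn in Figure 2.4), with made-up binary probability tables, and checks that conditioning on $Y_2$ adds nothing once $Y_3$ is known.

```python
from itertools import product

# Hypothetical network Y1 <- Y3 -> Y2 with binary variables.
p_y3 = {0: 0.6, 1: 0.4}
p_y1_given_y3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # indexed [y3][y1]
p_y2_given_y3 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # indexed [y3][y2]

# Joint distribution from the factorization p(y3) p(y1 | y3) p(y2 | y3).
joint = {
    (y1, y2, y3): p_y3[y3] * p_y1_given_y3[y3][y1] * p_y2_given_y3[y3][y2]
    for y1, y2, y3 in product((0, 1), repeat=3)
}


def conditional_y1(y1, y2, y3):
    """p(y1 | y2, y3) computed from the joint table."""
    denominator = sum(joint[(v, y2, y3)] for v in (0, 1))
    return joint[(y1, y2, y3)] / denominator


# Largest gap between p(y1 | y2, y3) and p(y1 | y3) over all configurations.
max_gap = max(
    abs(conditional_y1(y1, y2, y3) - p_y1_given_y3[y3][y1])
    for y1, y2, y3 in product((0, 1), repeat=3)
)
```

The gap is zero up to rounding, which is exactly what the factorization of the joint distribution into smaller conditional tables buys: eight joint probabilities are described by five free parameters.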


2.5 Game Theory

Game theory has been commonly used to model strategic situations of different types, including competitions, races, economic business scenarios and social choice settings. A ranking can be modeled as a game in which the ranking indicates how efficient a player has been in relation to the other players. Ranking games are defined as normal-form games and as a subclass of zero-sum games for n players, where agents seek to maximize their payoff relative to the other agents.

A game form is a quadruple $(N, (A_i)_{i \in N}, \Omega, g)$, where $N$ is a finite non-empty set of players, $A_i$ is a finite non-empty set of actions available to player $i$, $\Omega$ is a set of outcomes, and $g : \times_{i \in N} A_i \to \Omega$ is an outcome function mapping each action profile to an outcome in $\Omega$. The only difference from a normal-form game is the addition of real-valued payoff functions to the tuple, defined as $p_i : \Omega \to \mathbb{R}$ [9].

A ranking indicates how well each player has done relative to the other players in a game. The ordering of the players is given by $r = [r_1, \ldots, r_n]$: the first player is $r_1$, the second player is $r_2$, and the last player is $r_n$. The game is in normal form $(N, (A_i)_{i \in N}, \Omega, g, (p_i)_{i \in N})$, where the set of outcomes is given by $\Omega = R_N$. It is assumed that players prefer to be first and that all of them prefer higher ranks. The payoff function satisfies the following conditions:

$$ \begin{aligned} p_i(r) &\geq p_i(r') && \text{if } r_k = r'_m = i \text{ and } k \leq m, \\ p_i(r) &= 1 && \text{if } i = r_1, \\ p_i(r) &= 0 && \text{if } i = r_n \end{aligned} \tag{2.11} $$

A ranking game is a normal-form game in which $\Omega$ is the set $R_N$ of rankings over $N$ and each $p_i : R_N \to \mathbb{R}$ is a rank payoff function over $R_N$.
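As a concrete instance of the conditions in Equation 2.11, the sketch below defines one possible rank payoff function and checks the three conditions over all rankings of three players. The linear payoff and the player names are hypothetical choices; any function that is monotone in the rank and normalized to 1 for first place and 0 for last place would qualify.

```python
from itertools import permutations


def rank_payoff(player, ranking):
    """Hypothetical linear rank payoff: 1 for first place, 0 for last place."""
    n = len(ranking)
    position = ranking.index(player) + 1  # k such that r_k = player
    return (n - position) / (n - 1)


players = ["A", "B", "C"]
rankings = list(permutations(players))

# Condition 1 of Equation 2.11: a better (smaller) position never pays less.
monotone = all(
    rank_payoff(i, r) >= rank_payoff(i, r_prime)
    for i in players
    for r in rankings
    for r_prime in rankings
    if r.index(i) <= r_prime.index(i)
)
```

Usage: `rank_payoff("B", ("A", "B", "C"))` evaluates the payoff of the player ranked second out of three.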

2.6 Summary

In this chapter we discussed some works that we used as a starting point for our study, and then presented the metrics and algorithms used for analysis and modeling. In the next chapter we describe the methodology we follow: we explain CRISP-DM and how it structures our research.

Chapter 3

Methodology

Since our work aims to create a reference framework that analyzes the rankings in a way that is useful for university administrators, it is important to follow a methodology that answers the questions raised by decision makers. In this section we present the CRISP-DM methodology, with which we structure the flow of the data throughout the study of this business problem.

The CRISP-DM methodology was introduced in 2000 by Colin Shearer as a standard followed by data industry leaders to implement best practices and ensure the best results. The name CRISP-DM comes from CRoss Industry Standard Process for Data Mining, and it is composed of six steps or phases [58].

1. Business understanding. In this phase we identify the objectives of the problem, the desired result and the available resources. In this case the goal is to build a prediction model with the QS data of the last 10 years.

2. Data understanding & data preparation. These two phases are carried out in parallel. They correspond to collecting the data, doing exploratory and descriptive analyses, and cleaning, integrating and formatting the data.

3. Modeling. We select the modeling technique, separate the training and test data, and build the model.

4. Evaluation. We evaluate the results obtained, analyze them, comment on possible failures and propose possible improvements.

5. Deployment. Finally, we apply the proposed model to a particular case (Tecnologico de Monterrey) and compare it with real values.

Figure 3.1: CRISP methodology diagram (phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment, arranged around the data).

Figure 3.1 shows the flow of the six steps of the CRISP methodology: the arrows between the phases are the dependencies, and the external arrows are the general flow of data mining. Some processes are connected by double arrows, such as business and data understanding: as the data is better understood, the understanding of the business problem deepens. In the same way, when evaluating a model it is often necessary to return to the methodology and improve it or change the first proposal.

The CRISP-DM methodology is followed throughout this chapter as follows. In Section 3.1 we present the Business Understanding, where we highlight the influence of rankings on university competitiveness and on the possibility of obtaining financing and international popularity. In Section 3.2 we present the phases of Data Understanding and Data Preparation, where we explain the data, how they were obtained and validated, and the exploratory analysis. In Section 3.4 we present the Modeling phase, in which we develop the regression models used for prediction of the Score, machine learning algorithms for classification, a probabilistic analysis with Bayesian networks and, finally, a game theory model. The last two phases of the CRISP-DM methodology are presented in Chapter 4 (Evaluation) and Chapter 5 (Deployment).


3.1 Business Understanding: Rankings and Competitiveness

The phenomenon of university competitiveness, promoted since the appearance of the rankings, has been recognized from the sociological point of view, in which all the institutions of the world can be grouped and evaluated for the benefit of students. Universities are evaluated by international journals, and these results are used by institutions to show their academic ability [10].

The result of the comparison between universities is usually given in tables, which display the best universities in the world hierarchically with quantitative results. These results are published annually by the ranking organizations, and media coverage intensifies their credibility.

In fact, in 2007 the OECD published a document stating how rankings are influencing Higher Education Institutions [42]:

• 50% of respondents use rankings for publicity.

• 70% want to be in the top 10% nationally.

• 71% want to be in the top 25% internationally.

• Over 50% have a formal process of revision of results.

• 68% use them as a strategic tool for management and academic change.

Universities and students are not the only ones that pay attention to the rankings; so do the governments and organizations that continuously evaluate the resources granted to education. One study analyzed the top 300 universities in the QS World University Rankings in 2018, as well as the kind of university, the funding they get, the number of students and their position. It found that 84% of the universities in the top 300 are public, which means that public funding is key to achieving excellence in most cases. In addition, the average budget is doubled by North American top 300 universities and quadrupled for European universities, and top 100 universities have double the funds of universities ranked 101-200 [5].

Another way in which universities attract funds is through the entry of international students. The internationalization of higher education is a reality, and university rankings play an important role in the choice of students. In fact, it has been shown that a university with a high place in the rankings has a 24% higher chance of being chosen by a high-performing student [34].

Although the university’s position plays an important role in the decision of a student of high academic level, geographic location is also very important when a student decides to study abroad. That is why we also decided to explore the QS ranking called Best Student Cities, in which the best cities in the world are rated on the educational quality they offer to students from all over the world and the standard of living required to attract them. In the next section we present the data from QS World University Rankings and QS Best Student Cities and show how these data help us understand the evolution of higher education over time and also its internationalization.

3.2 Data Understanding and Preparation

3.2.1 QS World University Ranking

Ranking is defined as the practice of listing universities in an ordered list based on performance indicators. QS has identified four main pillars that contribute to a world-class university: (1) research, (2) teaching, (3) employability and (4) internationalization [61].

As mentioned in the introduction, the QS ranking was first published in 2004, and in 2010 it separated from THE to create its own methodology. This methodology is based on the four pillars listed above, as stated by Ben Sowter. The pillars are represented by six indicators, each with a numerical value in [1, 100]. Each indicator has a weight, given in Table 3.1; once the weights are applied, the six indicators are added to form the overall score, which also lies in [1, 100]. The institutions are then sorted in descending order of overall score, so that the one with the highest score obtains the first position in the ranking, and so on [11].

QS WUR publishes a database with the results of the indicators, as Score and Rank, for each university, as well as the overall Score and global Rank. These data were collected by Tecnologico de Monterrey for the 2011-2020 editions.

The data are downloaded from the QS Intelligence Unit site [49], where an Excel sheet can be downloaded for each year. For our analysis it is convenient to have the data in longitudinal form, so the 10 years are stacked in the same table, with 2020 at the top and 2011 at the bottom. There are 1000 universities ranked, but beyond rank 500 the indicators are not reported, only the overall score. Since this work focuses on the methodology, these universities were removed.

Table 3.1: Quacquarelli Symonds World University Rankings Methodology

| Indicator | Weight | Justification |
| --- | --- | --- |
| Academic Reputation (AcRepS) | 40% | The metric with the highest weight. It is based on a survey currently answered by around 100,000 experts from all around the world. |
| Employer Reputation (EmRepS) | 10% | Based on a survey answered by 45,000 employers, who are asked to identify the institutions with the most competent graduates. |
| Faculty/Student Ratio (FacStuS) | 20% | The teacher/student ratio aims to represent the teaching quality of an institution. |
| Citations per Faculty (CitpFacS) | 20% | Measures research quality using the Scopus database. The total number of citations of the institution in the last 5 years is divided by the number of faculty members. |
| International Faculty Ratio (IntFacS) | 5% | Represents the strength of the internationalization of each institution in attracting staff from around the world, providing a good environment for knowledge exchange. |
| International Student Ratio (IntStuS) | 5% | Attracting students from around the world shows a strong international brand and also provides students with important soft skills. |

Then a name validation process is carried out: all the universities are listed and their names are checked. It is important that the same university has exactly the same name throughout the database so that it can be recognized by the algorithm. This cleaning process was carried out using the R software.

After the name cleaning process there were still missing values for some universities; these were imputed using k-nearest-neighbor (KNN) imputation. We used an R implementation with k = 3 (3NN). This algorithm is very effective in our case because the universities are sorted by ranking and the scores of nearby universities are normally similar. In this implementation, the imputed value is the weighted average of the values of the three nearest neighbors, with weights exp(-dist(k, x)), where dist(k, x) is the Euclidean distance between the neighbor k and the case x.
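The imputation rule can be illustrated with a small Python sketch. It is not the R routine used in this work, only a minimal re-implementation of the same idea under stated assumptions: rows are lists of scores with `None` marking a missing value, distances are computed over the columns observed in both rows, and the function name `knn_impute` is hypothetical. In practice the columns are often scaled before computing distances, which is not done here.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with an exp(-distance)-weighted average of the k nearest rows."""
    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a neighbor must have the target column observed
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # nothing to learn from; leave the gap
            weights = [math.exp(-dist) for dist, _ in nearest]
            total = sum(weights)
            if total == 0.0:
                # All weights underflowed to zero; fall back to a plain mean.
                completed[i][j] = sum(v for _, v in nearest) / len(nearest)
            else:
                completed[i][j] = sum(w * v for w, (_, v) in zip(weights, nearest)) / total
    return completed


# Hypothetical (AcRepS, EmRepS) pairs; the second row has a missing EmRepS.
scores = [[50.0, 60.0], [52.0, None], [54.0, 64.0], [90.0, 95.0]]
print(knn_impute(scores, k=2))
```

With k = 2 the two nearest rows are equally distant from the incomplete one, so their values receive equal weight and the distant fourth row is ignored.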

3.2.2 QS Best Student Cities Ranking

The QS Best Student Cities Ranking was first published in 2014 and the last year recorded is 2019, so we have a total of six years of data. The ranking is currently composed of six indicators, each with a value between 1 and 100, and the final score is obtained as the direct sum of the six, so the highest possible overall score is 600.

The QS Best Student Cities ranking currently rates 128 international cities from 54 different countries. The 2014-2019 data were integrated into a single panel in order to analyze the indicators. The cities are rated according to six indicators that aim to represent the standard of living of an international student in each city. Finally, the indicators are added up and the cities are ranked from highest to lowest [62].

1. Rankings. This indicator refers to the university rankings. It considers the total number of ranked universities in the city as well as their performance in the ranking, giving a higher score to cities with the highest-ranked institutions.

2. Student Mix. It considers the number of students enrolled in ranked universities and the proportion of international students among all students attending universities in the city, as well as the Social Progress Index, which tracks inclusion and opportunity in the city.

3. Desirability. An indicator that sums up safety, pollution, corruption, globalization, economy, and a student survey in which students named their dream student city.

4. Employer Activity. Based on the QS employer survey, it takes into account the cities whose universities produce excellent graduates (highly sought after by employers).

5. Affordability. Several measures are taken into account: tuition fees, the Big Mac Index, the iPad Index, and the Mercer Cost of Living rankings.

6. Student View. This indicator is based on surveys of students and graduates that consider tolerance, diversity, friendliness, affordability, employment opportunities, culture, and the probability that students stay in the city after graduation to seek opportunities.

In addition to the six indicators, further information was added: the country, the best-ranked university in the city, the total number of ranked universities in the city, and the average fee. The idea is to have a panel for analyzing the different capacities a city has to attract an international student and provide development opportunities.

As with the universities, the names were reviewed to ensure they were consistent across the six years. The Student View indicator did not yet exist in the 2014 ranking, so we averaged the subsequent years to obtain that value. Once the names had been reviewed and all the missing values had been analyzed and filled in, we proceeded to an exploratory analysis of both rankings.
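As a small illustration of how the missing 2014 Student View value can be filled, the following Python sketch averages the later years of a single city. It assumes the average is taken city by city, which the text does not state explicitly; the function name and the example scores are hypothetical.

```python
def fill_missing_first_year(scores_by_year, missing_year=2014):
    """Return a copy in which the missing year holds the mean of the later years."""
    later = [value for year, value in scores_by_year.items()
             if year > missing_year and value is not None]
    filled = dict(scores_by_year)
    if filled.get(missing_year) is None and later:
        filled[missing_year] = sum(later) / len(later)
    return filled


# Hypothetical Student View scores for one city, for illustration only.
city = {2014: None, 2015: 80.0, 2016: 82.0, 2017: 84.0, 2018: 86.0, 2019: 88.0}
print(fill_missing_first_year(city)[2014])
```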

3.3 Exploratory Data Analysis

Before developing a predictive methodology there is an exploratory phase in which we study the stability of the indicators, their distribution, and the correlation between them. In this section we also explain the theory behind the algorithms we chose.


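Table 3.2: Grouping universities by frequency.

| Frequency (years) | Label | Number of institutions |
| --- | --- | --- |
| 10 | Group A | 317 |
| 9 | Group B | 28 |
| 8 | Group C | 23 |
| 7 | Group D | 26 |
| 6 | Group E | 26 |
| 5 | Group F | 29 |
| 4 | Group G | 42 |
| 3 | Group H | 26 |
| 2 | Group I | 30 |
| 1 | Group J | 29 |
| Total | | 616 |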

3.3.1 QS WUR

Universities were grouped by frequency of appearance in the ranking. That is, of the 616 universities that make up the panel over the 10 years, we created Group A for those that appear in all 10 years, Group B for those that appear in 9 years, and so on.
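The grouping by frequency of appearance can be sketched as follows. This is a minimal Python illustration, not the code used in this work; it assumes the panel is available as (institution, year) pairs, and the function names and example institutions are hypothetical.

```python
import string


def appearance_counts(observations):
    """Count in how many distinct years each institution appears."""
    seen = {}
    for institution, year in observations:
        seen.setdefault(institution, set()).add(year)
    return {institution: len(years) for institution, years in seen.items()}


def frequency_label(count, n_years=10):
    """Group A for n_years appearances, Group B for n_years - 1, and so on."""
    return "Group " + string.ascii_uppercase[n_years - count]


def balanced_panel(observations, n_years=10):
    """Institutions present in every year, i.e. the members of Group A."""
    counts = appearance_counts(observations)
    return sorted(name for name, count in counts.items() if count == n_years)


# Hypothetical observations: U1 appears in all ten editions, U2 misses 2011.
observations = [("U1", year) for year in range(2011, 2021)]
observations += [("U2", year) for year in range(2012, 2021)]

for name, count in appearance_counts(observations).items():
    print(name, count, frequency_label(count))
print(balanced_panel(observations))
```

Keeping only the institutions returned by `balanced_panel` gives a panel with the same number of observations per institution.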

The information in Table 3.2 is very important because it helps us obtain a balanced panel for the linear regression. Group A, made up of 317 universities, is the biggest group, which means that the majority of the institutions maintain their presence in the ranking through the years. For new universities that want to enter the ranking, this indicates that stability through the years is very important, and also that the higher the position, the tighter the competition.

To examine the distribution of the indicators, a histogram with an overlaid density line was plotted for each of them (Figure 3.2). The x axis represents the score and each bin is 10 units wide, so it is easy to see which score ranges are most frequent. For example, most universities have an Academic Reputation score in [40, 50]; for Faculty Student the most frequent range is [30, 40]; and for the overall score there are no universities below 23, with most in [40, 50]. Figure 3.3 complements this with the cumulative probability distribution, which reaches one in all cases.


[Figure 3.2 plot: "Distribution from the indicators 317 universities in 10 years". One panel per indicator (AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS, Score); x axis: Score; y axis: Density.]

Figure 3.2: Histograms and probability densities of the six indicator scores and the overall score (Academic Reputation, Employer Reputation, Faculty Student, Citations per Faculty, International Faculty, International Students, and Score).

[Figure 3.3 plot: "Cumulative empirical distribution for each indicator". One curve per indicator (AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS, Score); x axis: Score; y axis: Probability.]

Figure 3.3: Cumulative empirical probability distribution for each indicator and the overall score (Academic Reputation, Employer Reputation, Faculty Student, Citations per Faculty, International Faculty, International Students, and Score).


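Table 3.3 shows the median, mean, and standard deviation of each indicator over the 10-year period. The means and medians all lie between 50 and 60, which at first sight suggests roughly symmetric distributions; normality is tested formally with the Shapiro-Wilk test later in this section.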

Table 3.3: Statistics of the six indicators and the overall score.

| Statistic | AcRepS | EmRepS | FacStuS | CitpFacS | IntFacS | IntStuS | Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Median | 56.60 | 55.10 | 51.85 | 54.05 | 59.25 | 50.65 | 53.80 |
| Mean | 59.48 | 55.06 | 54.84 | 54.42 | 57.17 | 53.04 | 57.01 |
| Standard Deviation | 25.17 | 27.34 | 28.02 | 26.38 | 33.46 | 30.19 | 17.63 |

Table 3.4 shows a grouping exercise in which we chose the groups in which most stakeholders have a special interest: the Top 10, Top 50, Top 100, and Top 200 universities. Table 3.5 shows the first exercise with these groups, which extracts the maximum and minimum scores achieved by the universities of each group in 2020.

Table 3.4: Clustering universities by Rank.

| Group | Rank Range |
| --- | --- |
| A10 | [1, 10] |
| A50 | [1, 50] |
| A100 | [1, 100] |
| A200 | [1, 200] |
| A101 | [101, 200] |
| A201 | [201, 317] |
| data | [1, 317] |

A university ranked below the Top 100 that wants to enter that group can look at the minimum scores obtained by the universities in the group in the latest edition of the ranking. In the 2020 edition, the lowest Academic Reputation score in the Top 100 was 40.10 and the lowest overall score was 59.90.

In this way, universities can evaluate their own indicators and know how far they are from the minimum of the group they wish to enter. Although these values change from year to year, the exercise helps us understand the performance a university should have, and it can support long-term strategies that allow the university to approach its target group.

Table 3.5: Maximum and minimum scores achieved by universities in each group, by indicator, in the 2020 ranking.

| Group | Academic.Rep | Employer.Rep | FacultyStudent | CitperFac | Int.Faculty | Int.Students | Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Max A10 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Min A10 | 97.80 | 81.20 | 85.00 | 72.10 | 70.20 | 62.20 | 92.00 |
| Max A50 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Min A50 | 71.20 | 51.80 | 19.80 | 24.00 | 11.10 | 10.10 | 74.20 |
| Max A100 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Min A100 | 40.10 | 22.70 | 11.80 | 2.40 | 6.90 | 3.60 | 59.90 |
| Max A200 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Min A200 | 19.90 | 7.40 | 5.40 | 2.40 | 3.30 | 1.50 | 44.00 |
| Max A101 | 90.90 | 97.20 | 100.00 | 97.10 | 100.00 | 99.10 | 59.50 |
| Min A101 | 19.90 | 7.40 | 5.40 | 3.80 | 3.30 | 1.50 | 44.00 |
| Max A201 | 72.90 | 71.60 | 100.00 | 95.40 | 100.00 | 100.00 | 43.50 |
| Min A201 | 7.80 | 5.00 | 3.40 | 1.90 | 1.90 | 1.00 | 24.20 |
| Max data | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Min data | 7.80 | 5.00 | 3.40 | 1.90 | 1.90 | 1.00 | 24.20 |
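The comparison against a target group can be illustrated with a short Python sketch that uses the "Min A100" row of Table 3.5. The candidate university and the function name are hypothetical and serve only to show the calculation.

```python
# Minimum scores of the Top 100 group (A100) in the 2020 ranking, from Table 3.5.
MIN_A100_2020 = {
    "AcRepS": 40.10, "EmRepS": 22.70, "FacStuS": 11.80, "CitpFacS": 2.40,
    "IntFacS": 6.90, "IntStuS": 3.60, "Score": 59.90,
}


def gaps_to_group(own_scores, group_minimum):
    """Points still needed per indicator to reach the group minimum (0 if already there)."""
    return {name: max(0.0, group_minimum[name] - own_scores[name])
            for name in group_minimum}


# Hypothetical university, for illustration only.
candidate = {"AcRepS": 35.0, "EmRepS": 30.0, "FacStuS": 50.0, "CitpFacS": 10.0,
             "IntFacS": 5.0, "IntStuS": 20.0, "Score": 48.0}

for name, gap in gaps_to_group(candidate, MIN_A100_2020).items():
    print(name, round(gap, 2))
```

A gap of zero only means that the candidate is above the lowest value observed in the group for that indicator; entering the group still depends on the overall score.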

We carried out a normality test to check whether our data follow a normal distribution. The Shapiro-Wilk test was applied to the six indicators and the overall score. None of them can be considered to come from a normally distributed population, with p-values below 2.2e-16 for our six continuous scores. This result justifies the use of Spearman's correlation. The Pearson correlation is more popular, but its usual significance tests assume normally distributed data. Spearman's correlation only requires data that are at least ordinal and measures the strength of a monotonic relationship. The same holds for the Kendall correlation, which is more robust to outliers, but its direct computation has complexity O(n^2), whereas Spearman's is O(n log n).

After calculating these statistical measures, we computed the Spearman correlation between each indicator and the overall score for each group, as shown in Table 3.6. In this way, the most influential indicators for each group can be observed.
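For reference, Spearman's coefficient is the Pearson correlation computed on ranks, with tied values sharing their average rank. The following Python sketch shows that calculation; it is an illustration written for this text rather than the routine used in the analysis, and the example values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical indicator and overall scores, for illustration only.
indicator = [95.0, 80.0, 82.0, 60.0, 40.0]
overall = [98.0, 85.0, 83.0, 70.0, 55.0]
print(round(spearman(indicator, overall), 2))
```

Because only the ranks matter, any monotonic transformation of either variable leaves the coefficient unchanged.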

Table 3.6 makes it possible to compare the different levels of correlation between the indicators and the total score across groups.

Table 3.6: Spearman correlation coefficients of the six indicators with respect to the final score.

| Group | AcRepS | EmRepS | FacStuS | CitpFacS | IntFacS | IntStuS | Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A10 | 0.56 | 0.53 | 0.32 | 0.46 | 0.07 | -0.09 | 1.00 |
| A50 | 0.59 | 0.48 | 0.62 | 0.57 | 0.20 | 0.22 | 1.00 |
| A100 | 0.82 | 0.65 | 0.51 | 0.40 | 0.24 | 0.31 | 1.00 |
| A200 | 0.84 | 0.68 | 0.41 | 0.48 | 0.25 | 0.37 | 1.00 |
| A101 | 0.45 | 0.25 | 0.20 | 0.27 | 0.02 | 0.12 | 1.00 |
| A201 | 0.59 | 0.42 | 0.10 | 0.11 | 0.01 | 0.08 | 1.00 |
| All | 0.88 | 0.69 | 0.45 | 0.56 | 0.24 | 0.34 | 1.00 |

For the A10 group, which contains the first 10 universities, the correlations for Academic Reputation, Employer Reputation, and Citations per Faculty are very similar, at approximately 0.5, and they are the three highest in the group. Normally the most recognized universities also have high-quality research and graduates who perform well in working life.

On the other hand, the A50 group has its strongest correlation with Faculty Student Ratio (0.62), the highest value for this indicator in any group, which suggests that these universities are strongly focused on teaching. Its correlation with International Faculty (0.20) is much weaker.

Another interesting observation is that the correlation with Citations per Faculty decreases as the place in the ranking increases, except in the A150 group, where it increases compared with the previous one.

We also tested the six indicators for collinearity with respect to the Score. The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated by correlation among the regressors. Table 3.7 shows that all the values are below 4, which means that there is no suspicion of multicollinearity in the dataset.

Table 3.7: Collinearity measures of the six indicators related to the Score.

| Variables | Tolerance | VIF |
| --- | --- | --- |
| AcRepS | 0.43 | 2.30 |
| EmRepS | 0.46 | 2.15 |
| FacStuS | 0.94 | 1.07 |
| CitpFacS | 0.84 | 1.19 |
| IntFacS | 0.53 | 1.90 |
| IntStuS | 0.50 | 1.99 |
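For reference, the two columns of Table 3.7 are related through the standard definitions below, where $R_j^2$ denotes the coefficient of determination obtained when indicator $j$ is regressed on the other five indicators. The notation is introduced here only for illustration.

```latex
\[
\mathrm{Tolerance}_j = 1 - R_j^2,
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\mathrm{Tolerance}_j}
\]
```

For example, the tolerance of 0.50 reported for IntStuS corresponds to a VIF of about 2, which agrees with the tabulated 1.99 up to rounding of the tolerance.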


Also as part of the exploratory analysis, we carried out a clustering exercise to perform an unsupervised classification.

We carried it out with the first 100 universities of 2020. These algorithms require the number of groups to be specified in advance. We therefore evaluated the possibility of creating between 2 and 10 groups and calculated two internal validation measures: the connectivity, which should be as low as possible because lower values indicate better separated clusters, and the silhouette, which should be as high as possible because higher values indicate better defined groups. The results are plotted in Figure 3.4, and we decided to form two clusters.

[Figure 3.4 plots: three "Internal validation" panels showing Connectivity, Dunn, and Silhouette against Number of Clusters for four algorithms (hierarchical, kmeans, pam, clara).]

Figure 3.4: Clustering plots evaluating number of clusters with four algorithms.

Figure 3.4 contains three plots. In the one on the left, the connectivity is plotted against the number of clusters. The connectivity between clusters should be minimal, and for the four algorithms the lowest connectivity occurs with two clusters. Similarly, in the plot on the right the silhouette should be maximal, which is why two clusters were chosen.

The variables used to form the clusters were Academic Reputation and Citations per Faculty. The numbers in Figure 3.5 correspond to the position of each university in the ranking.


[Figure 3.5 plot: "Cluster plot" of two clusters (1 and 2); x axis: AcRepS; y axis: CitpFacS; each point is labeled with the rank of the university.]

Figure 3.5: Clustering Top 100 universities using Academic Reputation and Citations per Faculty.

We can see that the two groups do not overlap. In addition, the universities in the top 10 accumulate approximately in the upper right corner; we know that, in principle, they have excellent academic reputation and high productivity in terms of article citations.

To evaluate the quality of the clusters created with the unsupervised algorithm pam, we use the silhouette metric. For each point, it compares the average distance to the points of the nearest other cluster with the average distance to the points of its own cluster. The value ranges from -1 to 1, where -1 corresponds to a point that is very close to a cluster other than its own and 1 to a point that is well placed within its own cluster. In this case we obtain a silhouette of 0.4208134. This is a moderate value, and it might be improved by testing more clusters.
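The silhouette of a single point is usually written as s = (b - a) / max(a, b), where a is the average distance to the other points of its own cluster and b is the average distance to the points of the nearest other cluster. The Python sketch below computes it directly from that definition; it is an illustration with hypothetical points and labels, not the R routine used for the reported value.

```python
import math


def silhouette_values(points, labels):
    """Silhouette s = (b - a) / max(a, b) for every point."""
    values = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            values.append(0.0)  # convention for singleton clusters or a single cluster
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return values


def average_silhouette(points, labels):
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Hypothetical standardized (AcRepS, CitpFacS) pairs forming two separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(round(average_silhouette(points, [1, 1, 2, 2]), 3))
```

Assigning the same points to mixed groups, for example with labels [1, 2, 1, 2], yields negative values, which is the situation the metric is meant to flag.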

Finally, we carried out a feature selection exercise to evaluate the validity of the six indicators, using the Recursive Feature Elimination algorithm of the R caret package.

This algorithm returns a list of the features ordered from most to least relevant.


[Figure 3.6 plot: x axis: Variables (1 to 6); y axis: RMSE (Cross-Validation).]

Figure 3.6: RMSE achieved by the Feature Selection algorithm with different numbers of variables.

We got them listed as follows,

1. Faculty Student.

2. Academic Reputation.

3. Citations per Faculty.

4. International Faculty.

5. International Students.

6. Employer Reputation.

However, the algorithm recommends keeping the six indicators, since this achieves the lowest RMSE, as shown in Figure 3.6.

3.3.2 QS BSC

To explore the information available for each city, we used the panel containing the six yearly rankings, 496 observations in total, including positions, scores, indicators, universities, country, and fees.

To identify the most important characteristics of each city, we compute a correlation matrix.

Figure 3.7 shows how strongly the positions of the cities are related to the indicators assigned to them in the study. If we focus on the second column, which refers to the overall score, we can see that Student View is the indicator that most influences the final score, followed by Employer Activity, Rankings, Desirability, and Student Mix. On the other hand, Affordability practically does not influence the final score and, in addition, has a negative sign.

[Figure 3.7 plot: Spearman correlation matrix of Rank, Overall, Rankings, Student Mix, Desirability, Employer Activity, Affordability, and Student View, on a color scale from -1 to 1. Correlations with Overall: Rankings 0.76, Student Mix 0.62, Desirability 0.74, Employer Activity 0.77, Affordability -0.06, Student View 0.79.]

Figure 3.7: Spearman correlation between the ranking of the cities and their indicators.

Two clusterings were built using the PAM method. They were trained on all the cities ranked in 2018 (100 in total), the first comparing Rank with Affordability and the second comparing Rank with Student View. The first was built with two groups, as recommended by the H-clust heuristic, and the second with three groups. In the first figure (a), the pink group contains the cities at the top of the ranking, which in general have a low Affordability rating. In the second figure (b), the orange group contains the cities that are at the top of the ranking and also have a good Student View score.

The first cluster plot (Figure 3.8) shows how the ranking is related to affordability: the cities with the best places in the ranking tend to have very low affordability scores, which we had already observed in the correlation. The second plot (Figure 3.9) shows that the cities with the best positions are very well rated in Student View. The clusters thus complement what was observed in the correlation matrix, and they also let us observe the separation between the groups and how the cities are classified by their cost of living and by the experience of graduate students.


Figure 3.8: Rank vs Affordability with cities ranked in 2018.

Figure 3.9: Rank vs Student View with cities ranked in 2018.


For the cities, the two clusterings were made with the unsupervised algorithm pam. The silhouettes of all points were averaged for each clustering, giving a mean of 0.4607127 for the two-group clustering and 0.4555437 for the three-group clustering.

In addition to the indicators, we want to understand what trends exist among the most popular cities for international students, so further analyses were carried out on the additional information included in the database.

Table 3.8: Countries with the most cities ranked by QS BSC.

| | Country | Total cities ranked |
| --- | --- | --- |
| 1 | United Kingdom | 14 |
| 2 | United States | 14 |
| 3 | Australia | 7 |
| 4 | Canada | 5 |
| 5 | France | 5 |
| 6 | India | 5 |
| 7 | Japan | 5 |
| 8 | China | 4 |
| 9 | Russia | 4 |
| 10 | Germany | 3 |
| 11 | Poland | 3 |
| 12 | Spain | 3 |
| 13 | United Arab Emirates | 3 |

Table 3.8 shows the countries with three or more ranked cities. The United Kingdom and the United States are tied as the most popular countries with 14 cities each, followed by Australia and Canada. In general, these four countries offer an inclusive environment that is very attractive to international students, and they have English as a first language in common.

For the countries with the most cities chosen by students, we are interested in the relationship with the cost of living, since the correlation analysis showed that Affordability practically does not influence the position of the cities considered.

The cities were sorted by average fees from highest to lowest, and then we extracted the most expensive city of each country, to get an idea of the relationship between the most popular countries for students and their cost of living.

Table 3.9: Top 10 countries by the average fees of their most expensive city.

| | Rank | City | Country | Average fees |
| --- | --- | --- | --- | --- |
| 1 | 35 | San Francisco | United States | 39200 |
| 2 | 97 | Bristol | United Kingdom | 27000 |
| 3 | 41 | Perth | Australia | 25700 |
| 4 | 27 | Auckland | New Zealand | 22400 |
| 5 | 37 | Stockholm | Sweden | 21500 |
| 6 | 20 | Singapore | Singapore | 18700 |
| 7 | 60 | Dubai-Sharjah-Ajman | United Arab Emirates | 16000 |
| 8 | 11 | Toronto | Canada | 14800 |
| 9 | 37 | Dublin | Ireland | 14700 |
| 10 | 89 | Daejeon | South Korea | 14500 |

Table 3.9 shows the most expensive city, by average fees, of each of the 10 most expensive countries. The list starts with the United States and San Francisco, which is known to be one of the cities with the highest cost of living in its country. We must also highlight, however, that it has one of the best universities in the country (Stanford University) and offers graduates the possibility of being recruited by the world's largest technology companies, which provides great growth opportunities.

Next comes the United Kingdom, which is the most popular country and has the best-ranked city of recent years (London). The United Kingdom offers opportunities for international students and has several of the best universities in the world, although the cost of living is high. It is followed in cost by Australia, which is also the country with the third-most ranked cities.

One of the indicators that we consider most important concerns the academic reputation of the institutions in the city where students wish to go. In this case, QS awards a score depending on the position of the university in the world or local ranking.

Table 3.10 lists the 10 cities with the most universities ranked by QS. London has the best score in the Rankings indicator and 18 universities recognized by QS; in recent years it has also been the best city for students.

Table 3.10: Top 10 cities with the most universities ranked by QS.

| Rank | City | Rankings | Total Ranked |
| --- | --- | --- | --- |
| 1 | London | 100 | 18 |
| 10 | Seoul | 95 | 18 |
| 2 | Tokyo | 83 | 12 |
| 7 | Paris | 81 | 12 |
| 32 | Beijing | 82 | 11 |
| 19 | New York | 78 | 10 |
| 31 | Buenos Aires | 68 | 9 |
| 34 | Moscow | 69 | 9 |
| 12 | Boston | 78 | 8 |
| 53 | Mexico City | 52 | 8 |

However, the number of ranked universities is not necessarily proportional to a good position in the ranking of cities. For example, Mexico City is in 53rd place in spite of having 8 recognized universities.

As with QS WUR, we applied a feature selection algorithm to the QS BSC dataset to assess how much each indicator may be influencing the final model. Using the same Recursive Feature Elimination algorithm, we obtained the following order of importance,

1. Student View.

2. Employer Activity.

3. Desirability.

4. Affordability.

5. Rankings.

6. Student Mix.

The lowest RMSE is achieved by keeping the six indicators, as shown in Figure 3.10.


[Figure 3.10 plot: x axis: Variables (1 to 6); y axis: RMSE (Cross-Validation).]

Figure 3.10: RMSE achieved by using different numbers of variables with the QS BSC dataset.

3.4 Modeling

This section describes the techniques applied to the data to obtain the required predictions of the overall score and of the different groups within the data. These techniques were applied to both the QS WUR and QS BSC datasets. They can be divided into regression models, used in principle to predict continuous data, and machine learning models, used to predict discrete data or groups. Finally, a Bayesian probabilistic learning technique was used and a game theory problem was posed.

3.4.1 Multiple Regression and Panel Data

For the regression models, multiple regression and panel data were used. The difference between the two, as mentioned in the theoretical framework, is that the panel can identify individuals and their evolution over time.

These models were built for QS WUR and QS BSC with the aim of predicting the final score that a university or city could obtain in a specific year.

QSWUR

For the modeling of the QS WUR dataset, Group A was used, that is, the 317 universities that appear in all 10 years of the study, for a total of 3170 rows. For each observation we kept the year, rank, institution name, overall score, academic reputation score, employer reputation score, citations per faculty score, faculty student score, international faculty score, and international students score.

[Figure 3.11 plot: x axis: Indicators; y axis: Score; one series per indicator (AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS).]

Figure 3.11: Scatter plot of each indicator against the overall Score.

A scatter plot was made in which the linear regression line for each indicator can be observed (Figure 3.11). It is useful for training the linear regression, and it lets us analyze the direct correlation between the indicators and the overall score. From the methodology we expect a stronger correlation for indicators with a heavier weight. However, Employer Reputation presents a greater slope even though its weight in the overall score is only 10%.

First we carried out a multiple regression in which the target variable is the overall score and the independent variables, or predictors, are the six indicators. The regression was trained using 9 years of data; the last year (2020) was left out for testing/validation.

This gave us the following equation.


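\[
\begin{aligned}
y ={} & -0.0190 + 0.4056 \times \text{Academic Reputation} + 0.0957 \times \text{Employer Reputation} \\
& + 0.2019 \times \text{Faculty Student} + 0.2004 \times \text{Citations per Faculty} \\
& + 0.0493 \times \text{International Faculty} + 0.0535 \times \text{International Students}
\end{aligned}
\]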

This exercise is useful when there is a specific need to know the final score of an institution that is not necessarily in the panel, for example, if an institution wants to estimate its final score after calculating its indicators independently, either to compare itself with similar universities or to approximate the QS methodology.

This exercise is also important for assessing how closely our model reproduces the QS methodology, whose weights are given in Table 3.1. All the coefficients are very close to the weights given by QS, which supports the quality of our data and of the model that was built, and indicates that the coefficients would be useful for predicting the score of an institution that is not in the panel.

The R-squared obtained for this model was 0.9972, which is very close to 1, the value of a perfect model. However, positions in rankings can be very tight, so the error has to be very small to predict accurate results.

In the case of the panel we obtain a coefficient per university and per year. We understand that each university has a different effect on the final score, and it is important for us to know what it is. The equation for this regression is shown below.


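\[
\begin{aligned}
y ={} & 3.0013 + 0.3644 \times \text{Academic Reputation} + 0.0969 \times \text{Employer Reputation} \\
& + 0.1970 \times \text{Faculty Student} + 0.2039 \times \text{Citations per Faculty} \\
& + 0.0487 \times \text{International Faculty} + 0.0442 \times \text{International Students}
\end{aligned}
\]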

The coefficients of the variables have changed, moving slightly further from the methodology than those of the previous multiple regression. However, we can now show some of the individual coefficients we obtained by university. For example, for MIT we got a coefficient of 2.1909, which means that this institution has a very positive impact on the final score; it has been in first place in the ranking for over five years. Stanford University has a very similar coefficient, 2.0887, as does Harvard with 2.1372; all three are in the top 3. As universities get further from the first places in the ranking, their impact on the final score tends to be smaller. For example, the University of St Andrews, number 100 in the ranking, has a coefficient of 0.7074; Keio University, in place 200, has a coefficient of 0.4416; and Ecole des Ponts ParisTech, in position 250, has a coefficient of -0.5546, which has turned negative.

These coefficients are individual, so when we predict the score of a specific university, the score of each indicator is introduced into the equation and then the coefficient of that university is added, giving a more accurate approximation of the final value.
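The following Python sketch illustrates that prediction step with the panel coefficients and some of the individual coefficients reported above. It assumes that the individual coefficient is added on top of the common intercept, which is how the description reads but is not spelled out as a formula in the text; the function name and the example indicator scores are hypothetical.

```python
# Common part of the panel model, as reported above.
PANEL_INTERCEPT = 3.0013
PANEL_COEFFICIENTS = {
    "AcRepS": 0.3644,
    "EmRepS": 0.0969,
    "FacStuS": 0.1970,
    "CitpFacS": 0.2039,
    "IntFacS": 0.0487,
    "IntStuS": 0.0442,
}

# Individual coefficients quoted in the text.
INDIVIDUAL_EFFECTS = {
    "MIT": 2.1909,
    "Stanford University": 2.0887,
    "Harvard": 2.1372,
    "University of St Andrews": 0.7074,
    "Keio University": 0.4416,
    "Ecole des Ponts ParisTech": -0.5546,
}


def predict_panel_score(indicators, institution):
    """Common linear part plus the individual coefficient (assumed to be additive)."""
    common = PANEL_INTERCEPT + sum(
        PANEL_COEFFICIENTS[name] * indicators[name] for name in PANEL_COEFFICIENTS
    )
    return common + INDIVIDUAL_EFFECTS[institution]


# Hypothetical indicator scores, used for two institutions to isolate the individual effect.
same_indicators = {"AcRepS": 70.0, "EmRepS": 65.0, "FacStuS": 60.0,
                   "CitpFacS": 55.0, "IntFacS": 50.0, "IntStuS": 45.0}
print(round(predict_panel_score(same_indicators, "MIT"), 2))
print(round(predict_panel_score(same_indicators, "Keio University"), 2))
```

With identical indicator scores, the two predictions differ only by the difference between the individual coefficients.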

For this model we got an R-squared of 0.9981, which is better than that of the multiple regression model. This means that, even if the coefficients of the indicators are not as close to the QS methodology, the individual effects improve the quality of the model.

Finally, for QS WUR we used a non-linear regression, because we found that linear regression was very optimistic in terms of prediction. Loess (local regression) was used. We trained with nine years of data and left the last one for testing/validation, using span = 0.75 and degree = 2. Since this regression is fitted individually we do not get an equation. It was used for the three universities in our Deployment: Tecnologico de Monterrey, University of Texas at Austin, and Carnegie Mellon University.


QS BSC

In terms of the cities, we have a total of 496 city records across the 6 years, of which 46 cities are present consistently. These 46 cities were used to build the panel; we kept the first five years (2014 to 2018) for training and 2019 for testing/validation.

We did not carry out a multiple regression because we know that the indicators are added linearly to obtain the final score, so the weight of each indicator is 1. In the panel, however, each city gets its own coefficient, so the weights of the indicators vary slightly.

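$$
\begin{aligned}
y = {} & 132.0324 + 0.9956 \times \text{Student View} + 0.9962 \times \text{Employer Activity} \\
& + 1.0007 \times \text{Desirability} + 0.9989 \times \text{Rankings} \\
& + 1.0089 \times \text{Student Mix} + 1.0039 \times \text{Affordability}
\end{aligned}
$$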

The coefficients are very close to 1, and all of them are highly significant, with p-values < 2e-16. Some of the individual coefficients obtained by the cities are: 0.3549 for London, 0.5227 for Paris, 0.6695 for New York, 0.1990 for Sydney, and 1.0189 for Mexico City.

The R-squared is 0.9999, which is very close to 1 and indicates a very good fit for any city in the panel.

3.4.2 Machine Learning

In addition to predicting the overall score of an institution or a city, we also want to know whether it is possible to predict its position. The position variable is discrete, and for that reason we used machine learning techniques.


QS WUR

We developed groups to categorize the positions of the institutions. For the top 200 of the ranking we created two groupings. The first one divides the top 200 universities into ten smaller groups of twenty universities; the second one divides them into two groups of a hundred universities. These groupings are relevant to stakeholders because they usually focus on belonging to a specific group in the ranking. For example, most universities want to belong to the Top 100, so this exercise makes it easier for them to predict whether their current indicators will get them into their desired group.
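The two groupings can be sketched as simple functions of the ranking position. The label names below are hypothetical and only follow the convention of naming each group by its upper bound.

```python
def group_of_20(position):
    """Label for the ten groups of twenty: 'Top20', 'Top40', ..., 'Top200'."""
    if not 1 <= position <= 200:
        raise ValueError("position must be between 1 and 200")
    upper = ((position - 1) // 20 + 1) * 20
    return f"Top{upper}"


def group_of_100(position):
    """Label for the two groups of one hundred: 'Top100' or 'Top200'."""
    if not 1 <= position <= 200:
        raise ValueError("position must be between 1 and 200")
    return "Top100" if position <= 100 else "Top200"


# Position 158 is used here only as an example input.
print(group_of_20(158), group_of_100(158))
```

These labels are the target variable of the classifiers described next.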

We decided to try different classification algorithms to find the one with the best performance in identifying the groups. Next we explain the algorithms used. The algorithms were trained with the years 2011 to 2019, and the 2020 dataset was used for testing/validation.

1. Logistic Regression. This algorithm was used only for the two groups of 100 universities, since it only performs binary classification. Below we present the equation obtained for the classification of the groups.

$$
\begin{aligned}
y = {} & 3951.5478 + 4.2114 \times \text{Student View} + 1.0263 \times \text{Employer Activity} \\
& + 2.0991 \times \text{Desirability} + 2.0936 \times \text{Rankings} + 0.5191 \times \text{Student Mix} \\
& + 0.5256 \times \text{Affordability} - 11.4616 \times \text{Overall} - 1.9268 \times \text{Year}
\end{aligned}
$$

With this model we obtained an accuracy of 0.9628 on the training set.


2. Support Vector Machines. The SVM algorithm is usually used when there are linear or polynomial relationships between the variables, since it builds hyperplanes in a higher-dimensional space to separate the classes; the separating surfaces can be linear or radial. We trained both the SVMLinear and SVMRadial algorithms. The ten groups of 20 were classified with both kernels.

Support Vector Machine with Linear Kernel: for the 10 classes we used the same eight predictors as in the logistic regression and obtained an accuracy on the training set of 0.5676 and a kappa value of 0.5179.

For the exercise with two categories we used eight predictors with the tuning value C = 1, and obtained an accuracy on the training set of 0.9846 and a kappa of 0.9692.

Support Vector Machine with Radial Basis Function Kernel: for the 10 classes we used the same eight predictors as in the logistic regression, with the tuning parameters (set automatically by the algorithm) C = 1 and sigma = 0.1023, and obtained an accuracy on the training set of 0.6529 and a kappa value of 0.6120.

For the exercise with two categories we used eight predictors and the tuning parameters C = 1 and sigma = 0.0418. We obtained an accuracy on the training set of 0.9649 and a kappa value of 0.9298.

3. Random Forest. This was the control algorithm in both cases, for the classification of the two groups of 100 universities and for the ten groups of 20. We expected it to give the best result, and it is the baseline against which the previous algorithms are compared.

For the 10 categories exercise we trained with eight predictors and 25 bootstrap repetitions. The selected tuning parameter was mtry = 8, with an accuracy of 0.8897 and a kappa value of 0.8771.

For the two categories exercise we used eight predictors, 25 bootstrap repetitions and the tuning parameter mtry = 16, and obtained an accuracy on the training set of 0.9890 and a kappa of 0.9780.

As in the regression, the predictive or independent variables were the indicators, and the target variable was the group to which the university would belong with these indicators.

Table 3.11: Accuracy in training set for the four models.

Model                Categories  Accuracy
Logistic regression  2           0.9628
SVM Linear           10          0.5676
SVM Linear           2           0.9846
SVM Radial           10          0.6529
SVM Radial           2           0.9649
Random forest        10          0.8897
Random forest        2           0.9890

QS BSC

Regarding the database of cities, we also carried out a categorization exercise. The cities were divided into ten groups of 10. We trained with the first five years, and 2019 was left as the test/validation set. In parallel with the exercise for universities, the support vector machine (SVMLinear and SVMRadial), Random Forest, and Decision Trees algorithms were trained.

Decision Trees were trained with seven predictors, the six indicators and the year, with a tuning parameter of cp = 0.0586. We obtained an accuracy of 0.1140 on the training set and a kappa of 0.0361.

The SVM Linear kernel model was built with seven predictors and the tuning parameter C = 1, and we obtained an accuracy on the training set of 0.6033 and a kappa of 0.5567.

The SVM Radial Basis Function kernel model was again trained with seven predictors, with the tuning parameters C = 1 and sigma = 0.0659. We achieved an accuracy of 0.3924 on the training set and a kappa of 0.3225.

For Random Forest we used seven predictors, 25 bootstrap repetitions and the tuning parameter mtry = 11, and obtained an accuracy of 0.4377 with a kappa value of 0.3729.

To close the modeling with machine learning algorithms, we can point out that the university dataset had much better results than the city dataset. This may be because the university methodology has been more consistent and the dataset is broader. The city dataset follows a methodology that continues to adapt, and new cities are added every year, which may have affected the accuracy in this second machine learning experiment.

Table 3.12: Accuracy in training set for the four models.

Model           Accuracy
Decision trees  0.1140
SVM Linear      0.6033
SVM Radial      0.3924
Random forest   0.4377

3.4.3 Probabilistic: Bayesian Networks

In addition to the prediction exercises, we also wanted to find conditional structures in the data. These structures were modeled using Bayesian networks, which were learned directly from the data. We took the top 100 universities and trained a static network for each year, for a total of 10 static networks.

For Bayesian network training, R offers different types of learning algorithms; we chose score-based algorithms, and specifically tabu. Tabu is a modified version of hill-climbing that explores the space of directed acyclic graphs by adding one arc at a time until the highest-rated structure is found [56].

In Figures 3.12 to 3.21 we can observe the ten structures learned from the data. The structure changes from year to year; we believe these changes reflect the evolution of the methodology with which QS calculates the final score. We kept the last one (Figure 3.21) to make inferences.

The use of dynamic Bayesian networks was also considered. Some experiments were carried out, and we hope to expand this part of the work to take advantage of the temporal structure of the networks, since we have historical data; this remains as future work.


Figures 3.12 to 3.21: Bayesian networks learned from each year's data, one network per year (Figure 3.12: Bayesian network learned from 2011 data). Every network has the nodes AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score.


3.4.4 Game Theory

To model the problem as a zero-sum game, we start from the actions that a university can take to raise its place in the ranking. This approach is theoretical and is tied to the indicators. Each action is assigned a utility depending on how much it influences the advantage of one university over another.

For this purpose the Nashpy Python package will be used, which allows us to enter the actions available to two players and to calculate the utility of each action taken.

The actions defined are the following:

• Action 1: Hire a faculty member (utility = 2).

• Action 2: Receive a citation in an article indexed by Scopus (utility = 2).

• Action 3: Hire an international faculty member (utility = 1).

• Action 4: Admit an international student (utility = 1).

The game is represented computationally as a matrix in which one player chooses among the columns and the other among the rows. In this case the players are two universities that are competing in the ranking.

Table 3.13: Utilities of the actions taken by two competing universities.

                          University 1
              Action 1  Action 2  Action 3  Action 4
University 2
  Action 1    (2, 2)    (2, 2)    (2, 1)    (2, 1)
  Action 2    (2, 2)    (2, 2)    (2, 1)    (2, 1)
  Action 3    (1, 2)    (1, 2)    (1, 1)    (1, 1)
  Action 4    (1, 2)    (1, 2)    (1, 1)    (1, 1)

The problem is modeled by concentrating on the score, so if two competing universities take the same actions they will both raise their score, but they will probably not change their position in the ranking. Actions can also be defined by assuming that one university acts while the other remains static; in that case the one that does not act would gain nothing and remain at zero, while its competitor would gain in the final score and, in turn, in the ranking.
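As a minimal sketch of how the utility matrix can be analyzed, the code below builds the bimatrix of Table 3.13 from the action utilities and enumerates its pure-strategy Nash equilibria by checking best responses. It uses only the standard library rather than Nashpy so that it has no dependencies; the function name and the action labels are hypothetical. Note that, as tabulated, each cell records the gain of each university separately, so the payoffs do not sum to zero.

```python
# Utilities of the four actions defined above.
ACTIONS = [
    "hire faculty",
    "receive Scopus citation",
    "hire international faculty",
    "admit international student",
]
UTILITY = [2, 2, 1, 1]

# payoff[i][j] = (utility of the row university, utility of the column university)
payoff = [[(UTILITY[i], UTILITY[j]) for j in range(4)] for i in range(4)]


def pure_nash_equilibria(payoff):
    """Return the (row, column) action pairs where neither player gains by deviating."""
    rows, cols = len(payoff), len(payoff[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            best_for_row = max(payoff[r][j][0] for r in range(rows))
            best_for_col = max(payoff[i][c][1] for c in range(cols))
            if payoff[i][j][0] == best_for_row and payoff[i][j][1] == best_for_col:
                equilibria.append((i, j))
    return equilibria


for i, j in pure_nash_equilibria(payoff):
    print(ACTIONS[i], "/", ACTIONS[j])
```

In this matrix each university's utility depends only on its own action, so the equilibria are the combinations in which both choose one of the two actions with utility 2.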

3.5 Summary

Data exploration and modeling were carried out in this section. To better understand the data and to ensure that we chose appropriate models to represent this knowledge, we explored the data (correlations, distributions and clustering), seeking to learn how the variables of each dataset are related and whether there are grouping trends among individuals.

3.5.1 Main findings in QS WUR

• Based on manual and inclusive clusters, we believe we have found a target measurement tool through the min/max analysis.

• Also as part of the grouping, computing correlations by group helps us understand the strengths of each group.

• The panel regression has been useful to preserve the individuality of the participants in the panel and their performance over time.

• The use of machine learning can help us estimate the group in which a university will be located, although it usually works much better for the groups 1-100 and 101-200.

In the case of cities, we enriched the database with costs, countries and ranked universities to help decision-makers understand why a city is popular among students and how they can improve its international image. We also performed feature selection to validate the importance of the variables; in both cases all variables were included for modeling.

3.5.2 Main findings in QS BSC

• One of the most important tools was the enrichment of the dataset, which made it possible to analyze cities by country, cost of living and outstanding universities.

• Regarding the prediction models, we hope to enrich the dataset further in order to improve them, since this ranking is recent and both its methodology and its individuals continue to evolve.

We fitted the multiple regression before the panel data model in order to have a reference against which to validate the panel. In a similar way, different machine learning algorithms were tested, and we decided to keep several options; in any case, the algorithms are evaluated on the test sets.

Finally, Bayesian networks were trained in order to extract probabilistic dependencies between the variables. We also modeled a game theory exercise, with the idea of showing the actions that can benefit a university that is competing with its neighbors.

Chapter 4

Results and Evaluation

This chapter explains the fifth phase of the CRISP-DM methodology, in which we evaluate the results of the modeling phase. Once the modeling was carried out, we developed a series of evaluations to validate the prediction model chosen for our data. After this phase, and once our model is validated, we will move on to the Deployment (Chapter 5), in which we will show the use of our model on individuals of our interest.

4.1 Multiple Regression and Panel Data

Multiple regression and panel data models were used to predict the final scores. Below we show which ones were used and how we measured their performance, as well as the evaluation of possible overfitting.

4.1.1 QS WUR (World University Ranking)

Since we fitted both a multiple regression and a panel data model, we evaluated the performance of the two and compared them to justify the one chosen for later experiments.

The metrics used are the following.

Root Mean Square Error (RMSE), where a value of 0 represents a perfect fit:

$$
\mathrm{RMSE}(\hat{x}) = \sqrt{E\left((\hat{x} - x)^2\right)}
$$

R-squared, which represents the proportion of variance explained by the regression; the closer R-squared is to 1, the better the model fits the data:

$$
R^2 = \frac{\sum_{t=1}^{T} (\hat{Y}_t - \bar{Y})^2}{\sum_{t=1}^{T} (Y_t - \bar{Y})^2}
$$

Mean Absolute Error (MAE), which is a measure of the difference between paired observations [72]:

$$
\mathrm{MAE} = \frac{\sum_{i=1}^{n} |e_i|}{n}
$$
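The three metrics can be computed directly from paired lists of actual and predicted scores. The sketch below follows the definitions given above, with R-squared taken as explained variation over total variation; the function names and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: 0 means a perfect fit."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between paired observations."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Explained variation over total variation, both around the mean of the actual values."""
    mean = sum(actual) / len(actual)
    explained = sum((p - mean) ** 2 for p in predicted)
    total = sum((a - mean) ** 2 for a in actual)
    return explained / total


# Hypothetical scores, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(rmse(actual, predicted), r_squared(actual, predicted), mae(actual, predicted))
```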

Table 4.1: Metrics with performance of multiple regression and panel data on the test set.

                     RMSE     R-squared  MAE
Multiple Regression  0.94254  0.99906    0.83472
Panel Data           0.54893  0.99910    0.38425

Table 4.1 shows the metrics obtained on the test set. For R-squared, the closer to 1 the better the model behaves, so Panel Data is slightly better than multiple regression. For RMSE and MAE, the smaller the error the better, and here too Panel Data performs better. We know that the difference is small, but in a ranking as competitive as QS, decimals can decide several places in the ranking, which is why we seek the best performance. We therefore decided to use the Panel Data model for the rest of the experiments.

To verify that there is no overfitting, a cross-validation was carried out. It consisted of training the model leaving out one year at a time and using that holdout year for validation. Ten tests were performed, and for each year we calculated the difference between the real overall score and the prediction, as shown in Table 4.2.
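The leave-one-year-out scheme can be sketched as a generator of training and holdout splits, together with the per-year summary of the differences. The record layout, the function names and the sign convention (prediction minus real score) are assumptions made for this illustration.

```python
def leave_one_year_out(rows, year_key="year"):
    """Yield (holdout_year, training_rows, holdout_rows), one split per year."""
    years = sorted({row[year_key] for row in rows})
    for holdout in years:
        train = [row for row in rows if row[year_key] != holdout]
        test = [row for row in rows if row[year_key] == holdout]
        yield holdout, train, test


def error_summary(actual, predicted):
    """Mean, maximum and minimum of the differences (prediction minus real score)."""
    errors = [p - a for a, p in zip(actual, predicted)]
    return {"mean": sum(errors) / len(errors), "max": max(errors), "min": min(errors)}


# Hypothetical panel with two universities observed in each of the ten years.
rows = [{"year": y, "university": u} for y in range(2011, 2021) for u in ("A", "B")]
for holdout, train, test in leave_one_year_out(rows):
    print(holdout, len(train), len(test))
```

In each split the model would be fitted on the training rows and evaluated on the holdout year with `error_summary`.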

The largest average difference occurs in 2011 and the smallest in the 2014 prediction. We believe our model is very good at predicting overall scores, even though the methodology has evolved during those ten years.

4.1.2 QS BSC (Best Student Cities)

In parallel, we evaluated the models built for the city dataset. Below we present the cross-validation of the panel; the confusion matrices of the machine learning exercises are presented later.

Table 4.3 shows some statistics of the difference between the predicted and the actual score for each year that was left as the holdout in each training run.

Of the average errors, the highest was obtained in 2019 with 0.4 and the lowest in 2016 with 0.09. In general we find very small differences for all years. However, we know that decimals in rankings can mean differences between positions, which are also very important for cities. That is why we also carried out the categorization exercise with machine learning that is evaluated later.

Table 4.2: Differences between the predicted and real score values applying holdout to every year.

Year  Mean.error  Max.error  Min.error
2011  -0.75       8.90       -6.80
2012  -0.56       0.40       -1.40
2013  -0.17       0.66       -1.09
2014  0.14        0.90       -0.80
2015  0.23        1.20       -0.70
2016  0.20        1.10       -0.70
2017  0.26        1.10       -0.80
2018  0.21        1.00       -0.90
2019  0.19        0.80       -0.80
2020  0.17        0.80       -1.00

Table 4.3: Statistics of the difference between the predicted and the real value of the Overall score for the six years.

Year  Max Error  Min Error  Mean Error
2014  0.14       -0.13      0.14
2015  0.12       -0.08      0.12
2016  0.09       -0.09      0.09
2017  0.11       -0.15      0.11
2018  0.12       -0.07      0.12
2019  0.40       -0.50      0.40

4.2 Machine Learning

4.2.1 QS WUR

To find out whether it is possible to identify each group of universities based on their ranking, we decided to build two types of groupings and then train machine learning algorithms to compare their performance and the quality of our groupings. The groupings used for training were:


• Ten groups of twenty universities each.

• Two groups of one hundred universities each.

The idea is to make predictions for subsequent years from the indicators that a university expects to have: we can train the algorithms and ask them to categorize the university of interest, to find out whether, with the scores it expects to obtain, it will be able to rise to a group above its present one.

The algorithms that we trained were:

• Random forest.

• Support vector machine.

• Logistic regression. (Only used in the two class exercise).

For these experiments we used the top 200 universities over the ten years (1960 university records). We split the data into a training set with the data from 2011-2019 (1764 records) and a test set with 2020 (196 records). Below we present the confusion matrix of the test set with the accuracy obtained for each algorithm. We also present the ROC curve with the AUC (Area Under the Curve) value; the AUC ranges from 0 to 1, and the closer it is to 1, the better the classifier.
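One way to read the AUC is as the probability that the classifier gives a higher score to a randomly chosen positive case than to a randomly chosen negative one. The sketch below computes it with that pairwise definition; the function name and the sample labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = Top100, 0 = Top200) and classifier scores.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

A classifier that separates the two groups perfectly reaches 1, while one whose scores carry no information stays around 0.5.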

Categories (2 of 100)

This categorization is very important, since many universities aim to enter the Top 100 universities worldwide. This training makes it possible for universities to propose a combination of scores that they plan to achieve in subsequent years by improving their resource management, and to see whether it is enough to enter the Top 100 group.

Table 4.4: Logistic regression confusion matrix (Accuracy 0.9948)

         Top100  Top200
Top100   296     2
Top200   1       288

For this categorization we trained a logistic regression, which achieved an accuracy of 0.9948; only 3 universities were misclassified between the Top 100 and the Top 200 groups.

Table 4.5: Support Vector Machine Linear confusion matrix (Accuracy 0.9812)

         Top100  Top200
Top100   291     5
Top200   6       285

Table 4.6: Support Vector Machine Radial confusion matrix (Accuracy 0.9608)

         Top100  Top200
Top100   284     10
Top200   13      280

In this case SVM Linear performed better than SVM Radial, and both performed below logistic regression.

Table 4.7: Random forest confusion matrix (Accuracy 0.9948)

         Top100  Top200
Top100   296     2
Top200   1       288

Random forest had the same accuracy as logistic regression. All four algorithms did a very good job classifying the two groups. This is positive for us, because it means these groups are easy to identify with a machine learning approach, and if a university in the top 200 wants to know whether it will be able to get into the top 100, we can answer that question.
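The accuracy values reported with the confusion matrices, and the kappa statistic used during training, can be recomputed from the matrices themselves. The sketch below does so for the two-class matrices; the function names are hypothetical, and the matrices are copied from the tables of this section.

```python
def accuracy(matrix):
    """Share of cases on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Agreement corrected for the agreement expected by chance."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_sums = [sum(matrix[i]) for i in range(size)]
    col_sums = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    expected = sum(row_sums[i] * col_sums[i] for i in range(size)) / total ** 2
    return (observed - expected) / (1 - expected)


# Two-class confusion matrices from this section.
logistic_regression = [[296, 2], [1, 288]]
svm_linear = [[291, 5], [6, 285]]
svm_radial = [[284, 10], [13, 280]]

for name, matrix in [
    ("Logistic regression", logistic_regression),
    ("SVM Linear", svm_linear),
    ("SVM Radial", svm_radial),
]:
    print(name, accuracy(matrix), cohen_kappa(matrix))
```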

Table 4.8: Accuracy and AUC for the four models.

Model                Accuracy  AUC
Logistic regression  0.9948    0.995
SVM Linear           0.9812    0.959
SVM Radial           0.9608    0.903
Random forest        0.9948    0.995

In Table 4.8 we can see the summary of the results for the four algorithms.


Figure 4.1: ROC curves (sensitivity against specificity) with AUC values for: (a) logistic regression (AUC 0.995), (b) SVM linear (AUC 0.959), (c) SVM radial (AUC 0.903), and (d) random forest (AUC 0.995).


Categories (10 of 20)

For the training of the ten categories of twenty universities we used the decision trees, random forest and support vector machine algorithms, since these algorithms handle more than two categories. In total, 196 universities were categorized, because we are working with the balanced panel, in which we ensure that no university enters or leaves the ranking; as a result, some universities that possibly rose in recent years are not taken into account.

Table 4.9: Decision trees confusion matrix (Accuracy: 0.3622)

        Top20  Top40  Top60  Top80  Top100  Top120  Top140  Top160  Top180  200
Top20   15     0      0      0      0       0       0       0       0       0
Top40   5      16     0      0      0       0       0       0       0       0
Top60   0      0      0      0      0       0       0       0       0       0
Top80   0      0      0      0      0       0       0       0       0       0
Top100  0      4      20     18     20      14      0       0       0       0
Top120  0      0      0      0      0       0       0       0       0       0
Top140  0      0      0      0      0       0       0       0       0       0
Top160  0      0      0      0      0       0       0       0       0       0
Top180  0      0      0      0      0       0       0       0       0       0
200     0      0      8      0      0       8       20      18      18      20

The first confusion matrix corresponds to the classification of the testing set usingdecision trees. The accuracy is 0.3622 which is very low. Most universities were classifiedin the 200 or Top100.

For SVM Linear only 79 universities were classified correctly; the 117 errors lead to an accuracy of almost 0.4, which is very low.

In the case of SVM Radial, 132 universities were correctly classified in their group, for an accuracy of 0.67.

Random forest classified the 196 universities with an accuracy of 0.89, leaving 19 universities outside their true group.

In the case of support vector machines, we see that the classification result does not improve on random forest. We know that a support vector machine separates the classes with hyperplanes in a higher-dimensional space, and it is probably much less tolerant of the movement of universities between groups, since its errors generally fall in groups close to the true one.

Table 4.10: Support vector machine linear confusion matrix (Accuracy: 0.3928)

200  0  0  0  0  0  0  2  10  16  20

Table 4.11: Support vector machine radial confusion matrix (Accuracy: 0.6734)

200  0  0  0  0  0  0  0  2  8  11

Table 4.13 summarizes the results for the four algorithms.


Table 4.12: Random forest confusion matrix (Accuracy: 0.8979)

200  0  0  0  0  0  0  0  0  0  17

Table 4.13: Accuracy and AUC for the four models.

Model           Accuracy  AUC
Decision trees  0.3622    0.833
SVM Linear      0.3928    0.881
SVM Radial      0.6734    0.950
Random forest   0.8979    0.964

4.2.2 QS BSC

For the machine learning algorithms we chose Decision trees, Random forest, Support Vector Machine Linear and Support Vector Machine Radial. The data was divided into a training set with 70% of the samples and a test set with the remaining 30%.

The confusion matrix of decision trees shows an accuracy of 0.25, which is extremely low, so this algorithm would not help us categorize the cities by position.

The best categorization algorithm for cities was SVM Linear, with an accuracy of 0.666.

SVM Radial had an accuracy of 0.49, so it correctly categorizes barely half of the cities in their respective groups.

Random forest had an accuracy of 0.54, which is only slightly more than half.


Figure 4.2: ROC curves (sensitivity against specificity) with AUC values for: (a) decision trees (AUC 0.833), (b) SVM linear (AUC 0.881), (c) SVM radial (AUC 0.950), and (d) random forest (AUC 0.964).

Figure 4.3 shows the ROC curve for each algorithm; the corresponding AUC values are 0.5070 for Decision Trees, 0.9170 for Random Forest, 0.9580 for SVM Linear, and 0.8330 for SVM Radial. The AUC for Decision trees is 0.507; an AUC of 0.5 means that an algorithm that categorizes at random would do the same job without training.

Overall, SVM Linear did the best job in the categorization of cities. However, the algorithms in general found it very hard to categorize the cities, which to us means that the groups do not present relevant trends. As the methodology for this ranking has changed dramatically over the years, and the positions have been very dynamic, we find it hard to predict groups for this ranking.

Table 4.14: Decision Trees confusion matrix. (Accuracy: 0.2517)

       100+  Top10  Top20  Top30  Top40  Top50  Top60  Top70  Top80  Top90
100+   15    0      2      5      6      11     6      7      7      9
Top10  0     15     14     3      5      3      1      0      0      0
Top20  0     0      0      0      0      0      0      0      0      0
Top30  0     0      0      0      0      0      0      0      0      0
Top40  0     3      2      10     7      4      5      4      3      0
Top50  0     0      0      0      0      0      0      0      0      0
Top60  0     0      0      0      0      0      0      0      0      0
Top70  0     0      0      0      0      0      0      0      0      0
Top80  0     0      0      0      0      0      0      0      0      0
Top90  0     0      0      0      0      0      0      0      0      0

Table 4.15: SVM Linear confusion matrix. (Accuracy: 0.6666)

4.3 Probabilistic: Bayesian Networks

To learn the probabilistic structure of the data we trained a Bayesian network using the bnlearn package in R. We used the first 100 universities of 2020 as training data, with the scores as variables, for a total of 7 nodes. The relationships between the nodes (directed arcs) are built with a greedy search algorithm.


Table 4.16: SVM Radial confusion matrix. (Accuracy: 0.4966)

Table 4.17: Random forest confusion matrix. (Accuracy: 0.5374)

Table 4.18: Accuracy and AUC for the four models.

Model           Accuracy  AUC
Decision trees  0.2517    0.507
SVM Linear      0.6666    0.958
SVM Radial      0.4966    0.833
Random forest   0.5374    0.917


Figure 4.3: ROC curves (sensitivity against specificity) with AUC values for: (a) decision trees (AUC 0.507), (b) SVM linear (AUC 0.958), (c) SVM radial (AUC 0.833), and (d) random forest (AUC 0.917).

Subsequently, the parameters of the distributions of the indicators were fitted. We found that the network recovers the QS methodology once all the indicator nodes point to the total score and the weight of each arc is estimated with the maximum likelihood method.

We can also see the weights of the other relationships between the indicators. For example, Academic Reputation has a contribution of 0.838 to Employer Reputation; throughout this work we have seen that, although Employer Reputation has a weight of only 10%, its correlation with the final Score is very high.


Figure 4.4: Trained Bayesian network. Nodes: AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score; the arc weights are those listed in Table 4.19.

Table 4.19: Fitted nodes of the Bayesian Network (rows: node receiving the arc; columns: node the arc comes from).

To / From  AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
AcRepS     0       0       0        0         0        0        0
EmRepS     0.838   0       0        0         0        0        0
FacStuS    0       0       0        0         0        0        0
CitpFacS   0.224   0       0        0         0        0.248    0
IntStuS    0       0.209   0        0         0        0.654    0
IntFacS    0       0       0        0         0        0        0
Score      0.401   0.100   0.200    0.200     0.050    0.050    0

4.3.1 Conditional Probability Tables

As part of the Bayesian probability work we decided to calculate the conditional probability tables. These tables require categorical variables, so we took the indicators of the universities in the Top 100 of 2020 and split each of them into two groups: ranked 1-50 and ranked 51-100. We take 1-50 as the true value and 51-100 as false. Next we show the tables with the obtained probabilities.

The joint probability distribution factorizes as:

$$
\begin{aligned}
P(S, AR, ER, CF, FS, IS, IF) = {} & P(S \mid AR, ER, FS, CF, IS, IF) \, P(AR) \, P(ER \mid AR) \, P(FS) \\
& \times P(CF \mid AR, IF) \, P(IS \mid ER, IF) \, P(IF)
\end{aligned}
$$
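Each conditional probability table can be estimated by counting, for every combination of parent values, how often the child variable is true or false. The sketch below illustrates that counting on boolean records; the function names, the record layout and the sample data are hypothetical, and parent combinations that never occur simply produce no entry.

```python
from collections import Counter


def to_boolean(rank, cutoff=50):
    """True for an indicator ranked 1-50, False for 51-100."""
    return rank <= cutoff


def conditional_table(records, child, parents):
    """Estimate P(child | parents) from boolean records by relative frequency."""
    counts = Counter()
    for record in records:
        key = tuple(record[p] for p in parents)
        counts[(key, record[child])] += 1
    table = {}
    for key in {k for k, _ in counts}:
        total = counts[(key, True)] + counts[(key, False)]
        table[key] = {
            True: counts[(key, True)] / total,
            False: counts[(key, False)] / total,
        }
    return table


# Hypothetical indicator ranks of four universities, for illustration only.
ranks = [
    {"AR": 10, "ER": 20},
    {"AR": 30, "ER": 70},
    {"AR": 80, "ER": 90},
    {"AR": 5, "ER": 45},
]
records = [{name: to_boolean(rank) for name, rank in row.items()} for row in ranks]
print(conditional_table(records, child="ER", parents=["AR"]))
```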

Table 4.20: Conditional probability table for Academic Reputation.

True    False
0.2808  0.7194

Table 4.21: Conditional probability table for Faculty Student Ratio.

True    False
0.3928  0.6072

Table 4.22: Conditional probability table for International Faculty.

True    False
0.2959  0.7041

Table 4.23: Conditional probability table for Employer Reputation (dependent on Academic Reputation).

AR     ER True  ER False
True   0.6363   0.3637
False  0.7588   0.2412

Table 4.24: Conditional probability table for Citations per Faculty, dependent on Academic Reputation and International Faculty.

AR  IF  CF T    CF F
T   T   0.6153  0.3847
T   F   0.3572  0.6428
F   T   0.5333  0.4667
F   F   0.2812  0.7188

Table 4.25: Conditional probability table for International Students, dependent on Employer Reputation and International Faculty.

ER  IF  IS T    IS F
T   T   0.9444  0.0556
T   F   0.4705  0.6428
F   T   0.7250  0.2750
F   F   0.1954  0.8046
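As a worked example of the factorization, the sketch below multiplies the entries of the conditional probability tables to obtain the joint probability of one full assignment. The probabilities of the true value are copied from Tables 4.20 to 4.25; the probability of the false value is taken as the complement, which is an assumption of this sketch, and the conditional probability of the Score is passed in by the caller from Table 4.26. The function and variable names are hypothetical.

```python
# Probability of the true value, copied from Tables 4.20 to 4.25.
P_AR_TRUE = 0.2808
P_FS_TRUE = 0.3928
P_IF_TRUE = 0.2959
P_ER_TRUE_GIVEN_AR = {True: 0.6363, False: 0.7588}
P_CF_TRUE_GIVEN_AR_IF = {
    (True, True): 0.6153,
    (True, False): 0.3572,
    (False, True): 0.5333,
    (False, False): 0.2812,
}
P_IS_TRUE_GIVEN_ER_IF = {
    (True, True): 0.9444,
    (True, False): 0.4705,
    (False, True): 0.7250,
    (False, False): 0.1954,
}


def pick(p_true, value):
    """Probability of `value` given the probability of True (complement assumed)."""
    return p_true if value else 1.0 - p_true


def joint_probability(score, ar, er, fs, cf, intl_students, intl_faculty, p_score_true):
    """P(S, AR, ER, CF, FS, IS, IF) as the product of the network's factors."""
    return (
        pick(p_score_true, score)
        * pick(P_AR_TRUE, ar)
        * pick(P_ER_TRUE_GIVEN_AR[ar], er)
        * pick(P_FS_TRUE, fs)
        * pick(P_CF_TRUE_GIVEN_AR_IF[(ar, intl_faculty)], cf)
        * pick(P_IS_TRUE_GIVEN_ER_IF[(er, intl_faculty)], intl_students)
        * pick(P_IF_TRUE, intl_faculty)
    )


# Assignment AR=T, ER=T, FS=F, CF=T, IF=T, IS=T, for which Table 4.26 gives
# P(Score = TRUE | parents) = 0.8.
p = joint_probability(True, True, True, False, True, True, True, p_score_true=0.8)
print(p)
```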

4.4 Summary

In this chapter we showed the results of evaluating the models that were trained to make predictions on our datasets. Panel Data performed better than multiple regression, and we found that overfitting is not occurring. Regarding the categorization results, random forest had the best results for the QS WUR dataset, while for QS BSC the SVM Linear algorithm had the best accuracy on the test set. Finally, we obtained the coefficients of the Bayesian network, which can be used to make approximations, since we observed that the network learned the QS methodology regarding the contribution of each indicator to the final Score.


Table 4.26: Conditional probability table for the Score, dependent on the six indicators. Missing values (-) correspond to combinations of T/F that do not occur among the variables.

AR  ER  FS  CF  IF  IS  TRUE    FALSE
F   F   F   F   F   F   0.0000  1.0000
T   F   F   F   F   F   0.0000  1.0000
F   T   F   F   F   F   0.0000  1.0000
T   T   F   F   F   F   0.3750  0.6250
F   F   T   F   F   F   0.0000  1.0000
T   F   T   F   F   F   0.6667  0.3333
F   T   T   F   F   F   0.0000  1.0000
T   T   T   F   F   F   1.0000  0.0000
F   F   F   T   F   F   0.0000  1.0000
T   F   F   T   F   F   -       -
F   T   F   T   F   F   0.0000  1.0000
T   T   F   T   F   F   1.0000  0.0000
F   F   T   T   F   F   0.0000  1.0000
T   F   T   T   F   F   1.0000  0.0000
F   T   T   T   F   F   1.0000  0.0000
T   T   T   T   F   F   1.0000  0.0000
F   F   F   F   T   F   -       -
T   F   F   F   T   F   0.0000  1.0000
F   T   F   F   T   F   -       -
T   T   F   F   T   F   0.0000  1.0000
F   F   T   F   T   F   -       -
T   F   T   F   T   F   0.2500  0.7500
F   T   T   F   T   F   -       -
T   T   T   F   T   F   -       -
F   F   F   T   T   F   0.0000  1.0000
T   F   F   T   T   F   -       -
F   T   F   T   T   F   -       -
T   T   F   T   T   F   -       -
F   F   T   T   T   F   1.0000  0.0000
T   F   T   T   T   F   -       -
F   T   T   T   T   F   -       -
T   T   T   T   T   F   -       -
F   F   F   F   F   T   0.0000  1.0000
T   F   F   F   F   T   0.0000  1.0000
F   T   F   F   F   T   -       -
T   T   F   F   F   T   0.3333  0.6667
F   F   T   F   F   T   0.0000  1.0000
T   F   T   F   F   T   -       -
F   T   T   F   F   T   0.1428  0.8572
T   T   T   F   F   T   -       -
F   F   F   T   F   T   0.0000  1.0000
T   F   F   T   F   T   1.0000  0.0000
F   T   F   T   F   T   0.0000  1.0000
T   T   F   T   F   T   0.8000  0.2000
F   F   T   T   F   T   0.0000  1.0000
T   F   T   T   F   T   -       -
F   T   T   T   F   T   0.6667  0.3333
T   T   T   T   F   T   1.0000  0.0000
F   F   F   F   T   T   0.0000  1.0000
T   F   F   F   T   T   -       -
F   T   F   F   T   T   0.0000  1.0000
T   T   F   F   T   T   0.5000  0.5000
F   F   T   F   T   T   0.2000  0.8000
T   F   T   F   T   T   1.0000  0.0000
F   T   T   F   T   T   0.6667  0.3333
T   T   T   F   T   T   -       -
F   F   F   T   T   T   0.1250  0.8750
T   F   F   T   T   T   1.0000  0.0000
F   T   F   T   T   T   0.2500  0.7500
T   T   F   T   T   T   0.8000  0.2000
F   F   T   T   T   T   0.6667  0.3333
T   F   T   T   T   T   -       -
F   T   T   T   T   T   1.0000  0.0000
T   T   T   T   T   T   -       -

Chapter 5

Deployment

This study was conducted before the 2020 ranking was published, so we consider that this work is not affected by the final result of the ranking. We use the published ranking as a point of comparison to evaluate the models.

To predict the 2020 indicators, a linear regression of each indicator on time was performed.

5.1 Tecnologico de Monterrey

The equation in section 4.1 was used. The coefficient for Tecnologico de Monterrey, which was in position 173 last year and is currently in position 158, is 0.2123.

In the linear regression, the International Faculty indicator has been improving systematically (Figure 5.1 (a)), and for this reason its prediction exceeds 100, which is the highest possible score, so a maximum limit of 100 was set.
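The linear prediction with the upper limit can be sketched as an ordinary least squares trend over the years, evaluated at the target year and clipped to the valid range. The function names and the sample series are hypothetical, and the lower limit of 0 is an assumption added for symmetry.

```python
def linear_trend_prediction(years, values, target_year):
    """Least squares line of values against years, evaluated at target_year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    slope = sxy / sxx
    return mean_y + slope * (target_year - mean_x)


def predict_capped(years, values, target_year, lower=0.0, upper=100.0):
    """Linear prediction limited to the range of valid indicator scores."""
    raw = linear_trend_prediction(years, values, target_year)
    return min(upper, max(lower, raw))


# Hypothetical indicator that grows 5 points per year and reaches 100 in 2019.
years = list(range(2011, 2020))
values = [60 + 5 * (y - 2011) for y in years]
print(predict_capped(years, values, 2020))
```

Without the cap, the trend in this example would predict a score above 100 for 2020.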

In the Loess regression (Figure 5.1 (b)) the Faculty Student and Citations per Faculty indicators tend to grow, while the others tend to decrease.

This contrast between the two models is important: in the linear regression all the indicators have a positive slope, which gives an optimistic scenario, while the Loess regression provides a more conservative scenario that helps to contrast both results.

[Figure 5.1 panels, each plotting Indicators against Year for the series AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score: (a) Linear regression on the six indicators. (b) Loess regression on the six indicators.]

Figure 5.1: Scatter plots of the Tecnologico de Monterrey indicators for years 2011-2019.
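To illustrate why a locally weighted fit can be more conservative than a global line, the following sketch evaluates a local linear fit with tricube weights at a single target year. It is a simplified variant under stated assumptions: the function name `loess_point`, the span of 0.75, the local-linear degree and the example series are all hypothetical, and the Loess configuration actually used for the figures may differ.

```python
import math


def loess_point(xs, ys, x0, span=0.75):
    """Evaluate a locally weighted linear fit at x0 using tricube weights.

    Needs at least three points; the k-th nearest point sets the bandwidth.
    """
    n = len(xs)
    k = max(3, math.ceil(span * n))             # neighbours inside the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # bandwidth: k-th nearest distance
    weights = []
    for x in xs:
        u = abs(x - x0) / h
        weights.append((1.0 - u ** 3) ** 3 if u < 1.0 else 0.0)
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return mean_y + slope * (x0 - mean_x)


# Hypothetical series (NOT the thesis data): growth until 2016, then decline.
years = list(range(2011, 2020))
scores = [50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 70.0, 65.0, 60.0]
loess_2020 = loess_point(years, scores, 2020)
print(round(loess_2020, 2))
```

Because the most recent years carry the largest weights, the recent decline dominates the local fit, whereas a single global line through all nine points still has a positive slope.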

Table 5.1: Prediction of indicators and overall score for year 2020 for Tecnologico de Monterrey.

            AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
Loess        28.94   75.08    98.11      4.52    16.96    95.34  45.90
Linear       42.11   95.61    90.52      4.45    33.51   100.00  52.00
Mean         35.52   85.34    94.31      4.48    25.23    97.67  48.95
Real         36.90   88.90    89.50      4.60    18.40    98.20  48.50
Difference   -1.38   -3.56     4.81     -0.12     6.23    -0.53   0.45

In table 5.1 it can be seen that the two predictions for 2020 give a total score of 45.9 with Loess and 52 with the linear regression.

The actual overall score in the 2020 ranking was 48.5. Averaging the two predictions gives an overall score of 48.95; this average was calculated as a means of comparison, to see how far the predicted values are from the real one.
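The averaging step can be reproduced with a few lines of arithmetic. The sketch below uses the overall Score values from Table 5.1; the function name `mean_and_difference` is hypothetical, and the difference is taken here as mean minus real, which is the sign convention of Table 5.1.

```python
def mean_and_difference(loess, linear, real):
    """Average two predictions and compare the average with the real value."""
    mean = (loess + linear) / 2
    return mean, mean - real


# Overall Score values for Tecnologico de Monterrey, from Table 5.1.
mean_score, diff_score = mean_and_difference(loess=45.90, linear=52.00, real=48.50)
print(f"mean={mean_score:.2f} difference={diff_score:+.2f}")
```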

In general, in the linear regression the slopes of all six indicators are positive, which is why an optimistic prediction is obtained, except for Citations per Faculty, which actually achieved a better score than predicted. In the Loess regression, which weights the points locally for all indicators, the only indicator that was not underestimated was Faculty Student, owing to its good performance in the previous year. We believe that these results can be very useful for institutions to understand their performance in the ranking and to explore the different models proposed to calculate their scores using historical information.

5.2 Carnegie Mellon University

The second case we studied was Carnegie Mellon University, which currently has a rank of 48. As in the previous case, we first analyzed the 2011-2019 indicators by performing a linear and a non-linear regression.

[Figure: two panels plotting Indicators against Year for the series AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score.]

Figure 5.2: Scatter plots of Carnegie Mellon University indicators for years 2011-2019.

In the linear regression we can see that international students has had consistently high scores, and citations per faculty has a positive slope despite having decreased in recent years. Employer reputation also has a positive slope, as does international faculty, which has improved significantly in recent years. Two indicators have a negative slope, academic reputation and faculty student; the latter means that the university has been accepting many more students relative to professors. This is reflected in the final Score, which also has a negative slope.

The advantage of Loess is that it gives a local weight to each point, so the changes in each indicator can be seen more clearly. In 2016 there was a change in several indicators: some begin to improve and others to decline. This may be due to a change in the methodology that was published in 2017. Faculty student and international faculty have both improved. Employer reputation and academic reputation seem to be down in recent years. The final score dropped in the last year as well.

Table 5.2: Prediction of indicators and overall score for year 2020 for Carnegie Mellon University.

            AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
Linear       81.21   85.51    28.88    100.00   100.00    58.34  75.00
Loess        73.02   75.92    53.76     97.16    99.79   100.00  77.36
Mean        77.115  80.715    41.32     98.58   99.895    79.17  76.18
Real          75.2    77.5     43.5      94.3     99.9     83.6   74.8
Difference  -1.915   -3.22     2.18     -4.28     0.01    -4.43  -1.38

In Table 5.2 we can see the indicator predictions. The linear regression favors citations per faculty and international students but penalizes international faculty, giving a final score of 75. In the non-linear regression, international faculty benefits from its improvement in recent years, and in general this model gives the higher prediction, unlike the Tecnologico de Monterrey case, where the linear regression is more optimistic.

After calculating the average of the two regressions, we computed the difference with the real 2020 value and obtained a difference of -1.38 for the final score.

5.3 University Of Texas At Austin

This university is currently ranked 65. We are interested in analyzing a range of institutions in order to validate our proposal; this university is in the top 100 but below the top 50, to which Carnegie Mellon belongs.

We first analyze its performance in previous years. In the linear regression, only two indicators have a positive slope, citations per faculty and employer reputation; the rest have negative slopes. In fact, international faculty has dropped a lot, as has faculty student.

[Figure: two panels plotting Indicators against Year for the series AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score.]

Figure 5.3: Scatter plots of University Of Texas At Austin indicators for years 2011-2019.

As was observed in the case of Carnegie Mellon, it seems that the methodological changes in 2016 affected the performance of this university. The non-linear regression shows the evolution of the indicators: faculty student and international faculty improved in the last year, while all the others lowered their score, including the final score.

Table 5.3: Prediction of indicators and overall score for year 2020 for University Of Texas At Austin.

            AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
Difference   0.905    3.53   -3.145    -1.865    -0.26    -3.54  -1.67

In the table we can see that the linear prediction favored citations per faculty and predicted a very low (actually negative) score for international faculty, to which we assigned a lower limit of zero. The non-linear regression gave a lower prediction because academic reputation, which is the indicator with the highest weight, had decreased in recent years. Finally, comparing the average with the actual 2020 result, we have a difference of -1.67 in the final score.


5.3.1 Final Score probability calculation

As a side note, when the indicators are not known accurately, it is possible to use the conditional probability tables to predict the range of the final score.

Taking the particular example of University of Texas at Austin, for 2019 we have three indicators in the Top50 and three indicators in the range 51-100. The probability that the Score is also in the Top50 is:

\[
P(S = T \mid AR = T, ER = T, CF = T, FS = F, IS = F, IF = F) = 1.0 \cdot 0.2808 \cdot 0.6363 \cdot 0.6072 \cdot 0.3571 \cdot 0.4705 \cdot 0.7041 = 0.0128
\]
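The arithmetic of this product can be checked directly. In the sketch below the seven factors are copied from the expression above in the same order; the variable names are illustrative, and no assumption is made about which conditional probability table each factor comes from.

```python
import math

# Factors copied from the expression above, in the same order.
factors = [1.0, 0.2808, 0.6363, 0.6072, 0.3571, 0.4705, 0.7041]

p_score_top50 = math.prod(factors)
print(round(p_score_top50, 4))  # 0.0128
```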

5.4 Universidad De Buenos Aires

This university is currently ranked 74. We are interested in analyzing Latin American universities; below we can see the evolution of its indicators over the nine years before 2020.

[Figure: two panels plotting Indicators against Year for the series AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score.]

Figure 5.4: Scatter plots of Universidad De Buenos Aires indicators for years 2011-2019.


In the linear regression we can see that its strongest indicators are academic reputation and employer reputation. Faculty student has improved in recent years, and international students and international faculty also have a positive slope. The only indicator with a negative slope is citations per faculty.

Regarding the non-linear regression, we can see that academic reputation and employer reputation, despite having the best scores, have been decreasing. International students and citations per faculty have also decreased, more subtly. International faculty and faculty student improved in 2019, which can help in predicting 2020.

Table 5.4: Prediction of indicators and overall score for year 2020 for Universidad De Buenos Aires.

            AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
Linear      100.00  100.00    85.82      2.47    74.85    58.17  73.77
Loess        78.34   84.64    75.17      3.26    60.36    52.49  61.30
Mean         89.17   92.32    80.49      2.86    67.61    55.33  67.53
Real          87.2    91.3     77.4       2.4     64.7     50.7     66
Difference   -1.97   -1.02    -3.09     -0.46    -2.90    -4.63  -1.54

We can see that in this case there is a large difference between the linear prediction and that of the Loess model, of more than 12 units. However, when calculating the average, the difference with the real 2020 value is -1.54 in the final score, which is within the range of differences that we have observed between our predictions and the real 2020 values.

5.5 Pontificia Universidad Catolica De Chile

This university is currently ranked 127. It is part of our analysis of Latin American universities, and in the last ten years it has risen from 331, an improvement of more than 200 places.

In the linear regression we see very good scores in the academic reputation and employer reputation indicators. However, all the other indicators have negative slopes.


[Figure: two panels plotting Indicators against Year for the series AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score.]

Figure 5.5: Scatter plots of Pontificia Universidad Catolica De Chile indicators for years 2011-2019.

In the Loess regression we can see a behavior similar to that of the Universidad De Buenos Aires: the indicators of academic reputation and employer reputation have declined in recent years, but the other indicators show an improvement from 2018 to 2019, except for international students.

Table 5.5: Prediction of indicators and overall score for year 2020 for Pontificia Universidad Catolica De Chile.

            AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
Difference   -1.53   1.035     -7.9      0.58     0.73     -1.8  -1.75

Regarding our predictions shown in the table, we see an 8-point difference between the linear and the non-linear regression. The faculty student indicator worsened in the last year, down 4 points compared to 2019, and shows a difference of -7.9 from our average. Finally, the difference between the average and the actual final score was -1.75.


5.6 Universidad Nacional Autonoma De Mexico (UNAM)

The next Latin American university is UNAM. It is currently in 103rd place, about to enter the Top 100, and has been rising systematically since 2011, when it was in 222nd place.

[Figure: two panels plotting Indicators against Year for the series AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score.]

Figure 5.6: Scatter plots of Universidad Nacional Autonoma De Mexico indicators for years 2011-2019.

In the linear regression we observe two indicators with a positive slope and four with a negative slope. Its citations per faculty and international students indicators are the lowest by far, while academic reputation and employer reputation have scores in the 90s.

In the non-linear regression there are different behaviors: academic reputation and employer reputation have fallen in recent years, a phenomenon that we have seen in several universities and attribute to a change in methodology. International faculty improved in the last year.

We can see in Table 5.6 that the largest differences are found in the faculty student indicator, which improved a lot between 2019 and 2020 (around 7 points), and in international faculty, which dropped more than 6 points in the last year. However, our prediction of the final score has a difference of 1.09, within the expected range.


Table 5.6: Prediction of indicators and overall score for year 2020 for Universidad Nacional Autonoma De Mexico.

            AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
Linear      100.00  100.00    48.35      4.38     2.82    14.31  61.17
Loess        82.94   81.86    48.89      4.84     5.38    31.57  54.24
Mean         91.47   90.93    48.62      4.61      4.1    22.94  57.71
Real          90.9      91     57.6       3.8      4.3     13.8   58.8
Difference   -0.57    0.07     8.98     -0.81      0.2    -9.14   1.09

5.7 Universidade De Sao Paulo (USP)

This university is currently in 116th place and has generally been improving every year.

[Figure: two panels plotting Indicators against Year for the series AcRepS, EmRepS, FacStuS, CitpFacS, IntFacS, IntStuS and Score.]

Figure 5.7: Scatter plots of Universidade De Sao Paulo indicators for years 2011-2019.

In the linear regression we see consistent behavior in which academic reputation and employer reputation improve a lot compared to the other four indicators. Citations per faculty does not show much improvement but does not have a negative slope either, and the other three indicators have subtly worsened.

In the Loess regression we note again that after 2016 academic reputation and employer reputation are penalized and have been getting worse over the years, as has faculty student. There is an improvement in citations per faculty in the last year and a smaller one in international faculty.


Table 5.7: Prediction of indicators and overall score for year 2020 for Universidade De Sao Paulo.

            AcRepS  EmRepS  FacStuS  CitpFacS  IntStuS  IntFacS  Score
Difference   -1.69   -9.68    -2.43     -1.28    -0.35    -0.04  -2.46

In the table we can see that in general the difference between the average prediction and the real value is low, except for employer reputation, with a value of -9.68. This is quite high for an indicator that represents 10% of the weight of the final score. Here our methodology presents the greatest difference between the actual final score and the predicted average, -2.46, more than two units. The university's employer reputation indicator fell 7 points compared to 2019.

5.8 Summary

Table 5.8: Summary of results from the seven universities.

            Real 2020  Δ Linear  Δ Loess  Δ Mean  Current rank
Tec             48.50       3.5     -2.6   -0.45           158
CMU             74.80       0.2     2.56   -1.38            48
UT Austin       68.60      3.69    -0.35   -1.67            65
UBA             66.00      7.77     -4.4   -1.54            74
PUC Chile       53.40      5.62    -2.13   -1.75           127
UNAM             58.8      2.37    -4.56    1.09           103
USP             55.50      6.82     -1.9   -2.46           116

In this chapter we applied our methodology to seven universities, five Latin American and two from the United States. For the Tecnologico de Monterrey we obtained a difference of about five tenths with the actual final score obtained in 2020; for Carnegie Mellon University the difference was -1.38, and for University Of Texas At Austin it was -1.67. The highest difference between our prediction of the final score and the real value occurred for the Universidade De Sao Paulo, with more than two units. What we can observe is that having both the linear and the non-linear regression helps to balance the different trends in each university: for a given indicator, one is usually optimistic and the other pessimistic, and when we average both we obtain a result that is close to the real one. Also, most universities were penalized by the methodology changes between 2016 and 2017; we could see from the Loess regressions that these indicators were improving before those years.
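The balancing effect of averaging can be quantified from Table 5.8 with a mean absolute difference per column. The sketch below copies the three difference columns from the table; the dictionary layout and the helper name `mean_abs` are illustrative choices and not part of the original methodology.

```python
# (delta_linear, delta_loess, delta_mean) per university, from Table 5.8.
summary = {
    "Tec": (3.5, -2.6, -0.45),
    "CMU": (0.2, 2.56, -1.38),
    "UT Austin": (3.69, -0.35, -1.67),
    "UBA": (7.77, -4.4, -1.54),
    "PUC Chile": (5.62, -2.13, -1.75),
    "UNAM": (2.37, -4.56, 1.09),
    "USP": (6.82, -1.9, -2.46),
}


def mean_abs(column):
    """Mean absolute difference over all universities for one column."""
    values = [row[column] for row in summary.values()]
    return sum(abs(v) for v in values) / len(values)


mae_linear, mae_loess, mae_mean = (mean_abs(i) for i in range(3))
worst_case = max(summary, key=lambda name: abs(summary[name][2]))
print(f"linear={mae_linear:.2f} loess={mae_loess:.2f} mean={mae_mean:.2f}")
print(f"largest mean difference: {worst_case}")
```

On these seven cases the averaged prediction has a smaller mean absolute difference than either regression taken alone, which is consistent with the observation that the two models tend to err in opposite directions.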

Chapter 6

Discussion

In this chapter we discuss the results and the implemented model, and we show the scope of these results and their limitations, which may become future work. We also revisit the research questions and the hypothesis, and show how the results obtained answer the questions posed at the beginning of the work.

6.1 QS WUR

One of our most relevant results was the validation of the Panel Data model. This model was evaluated with cross-validation to verify that there was no overfitting. In the two trained datasets, QS WUR and QS BSC, the predictive error was found to be very low across the different years, from which we see that the model learns in relation to the methodology and the individuals.

We believe that the Panel Data model really fulfills the function of recognizing individual influences and compensating the weight of each indicator with the historical performance of each university and each city.

We then trained machine learning models to obtain categorical predictions of the position of the universities and cities in the ranking.

For the QS WUR exercise where the dataset was divided into two groups of 100 universities, the four algorithms used achieved an accuracy above 0.95. With this result we believe that categorizing the universities into Top 100 and Top 200 really helps to extract information about those that are in the best places in the ranking. Looking at the ten groups of 20, we see that only Random Forest achieved an accuracy above 0.8; if we wanted to categorize a university in a more specific way, we could be wrong with a probability of 20% using the best trained algorithm.
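The two labeling schemes discussed here, two groups of 100 and ten groups of 20, can be sketched as a simple mapping from rank to group, together with the accuracy measure. The function names are hypothetical, the ranks are the current ranks listed in Table 5.8, and the predicted labels are invented purely to show how accuracy is computed; they are not outputs of the trained classifiers.

```python
def rank_group(rank, group_size):
    """Return the 1-based group index for a 1-based rank."""
    return (rank - 1) // group_size + 1


def accuracy(true_groups, predicted_groups):
    """Fraction of items whose predicted group matches the true group."""
    hits = sum(t == p for t, p in zip(true_groups, predicted_groups))
    return hits / len(true_groups)


ranks = [48, 65, 74, 103, 116, 127, 158]  # current ranks from Table 5.8

coarse = [rank_group(r, 100) for r in ranks]  # Top 100 vs. 101-200
fine = [rank_group(r, 20) for r in ranks]     # ten groups of 20

# Hypothetical predictions (NOT classifier output): one fine label is wrong.
predicted_fine = [3, 4, 5, 6, 6, 7, 8]
print(coarse, fine, round(accuracy(fine, predicted_fine), 2))
```

The finer the grouping, the more boundaries a university can fall on the wrong side of, which is one reason the task with ten groups is harder than the task with two.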

Regarding the training of Bayesian networks, we decided to keep the most recent structure (trained with data from 2020) and treat it as static. With this we built the relationship table between nodes. We believe that this result is very important for those in charge of creating business strategies for universities, since it shows the internal structure of the indicators: if a university invests in increasing its citations, possibly by emulating the best research teams, and in attracting international students, it will indirectly have a positive impact on its six indicators.

In the deployment chapter we carried out seven experiments, each one calculating the prediction of the 2020 scores for a different university. We performed a linear regression and a Loess regression, and then calculated the average of the two; this value was compared with the actual value published in QS WUR, and we evaluated our methodology based on the difference.

Of the seven cases, the smallest difference in final score was observed for the Tecnologico de Monterrey, and the largest for the Universidade De Sao Paulo. In five of the cases our prediction was above the real value, so we can say that our model tends to be optimistic regarding the final score of a university.

However, large differences can be found between the prediction and the real value of some indicators for the universities we tested. We believe that a methodological change announced by QS in 2017 could have strongly affected all universities: mainly, the indicators of Academic Reputation and Employer Reputation fell for all universities in 2017 and the following years.

Among the seven universities that were studied, five are Latin American, and we can observe various trends. Their citations per faculty indicator is usually the lowest, unlike the two American universities studied. The only exception is the Universidade de Sao Paulo, which presents scores between 25 and 50 compared to 2 to 5 for its Latin American competitors; even so, it is below the University of Buenos Aires and UNAM in the ranking.

Another important observation concerns the international students indicator. In the case of Carnegie Mellon this is its highest indicator, while for the other six universities it is among the three lowest. Despite being an indicator with a weight of only 5%, we believe that it reflects the good performance of a university that is in the top 50 of the world ranking.

Finally, we believe that in addition to making the predictions of the final score and of the indicators of the universities, the greatest value of this work is to find the trends that allow a university to plan its long-term improvement as an institution. We know that the mission of each university varies according to its nature and its community, but ultimately it will always try to provide a better service to its students.

We should clarify that a formal deployment has not been carried out; that is, our methodology has not yet been applied independently by an educational institution. However, we have presented cases of universities that are growing as a point of comparison.

Some work related to the analysis of university rankings was developed by Masao Mori [41] from the Tokyo Institute of Technology, who presents a study and analysis of the way scores in World University Rankings are distributed, defining criteria and weights for the THE and QS university rankings. Collecting relative scores of 800 universities for the THE WUR and 400 universities for the QS ranking, he presents histograms showing how fairly the weights given by the methodologies of the rankings represent universities, depending on how crowded each score is and varying the weights. QS turned out to be single-mode, which means that one score value was the most common among the universities ranked.

Another work analyzing the QS World University Rankings in a more mathematical and computational way is the cluster analysis of universities done by Kathiresan Gopal [21] from the University of Putra Malaysia. Using multivariate statistical techniques, they show that distance between universities is another effective way to rank universities, compared against the 200 best ranked universities from QS. Clustering universities by Euclidean distance let them learn what scoring in rankings actually does, how universities' positions can be interpreted, and how that influences the best universities in the world to keep their positions.

Muzakir Hussain [25] created an algorithm to aggregate rankings, finding correlations between different rankings and helping students get better results by providing a recommendation system in which many popular global rankings are taken into consideration to build a complete and more objective ranking model. At a time when universities are focused only on improving the most important scores in rankings, his team is trying to extract from rankings valuable information for students depending on their specific needs, proposing the Shimura Preference Order Rank Aggregation (SPORA) algorithm to efficiently aggregate many rankings and develop a useful recommendation tool.

Other important work on rankings concerns not only the analysis of the results but also the way the data is obtained from the source. Chengkai Shi [59] introduces the Computer Science Academic Rankings System (CSAR), which aims to extract information from rankings successfully. Information extraction is very important when doing research. Their system collects data from Google Scholar, DBLP, the ACM Digital Library and Microsoft Academic, gathers information on papers and authors, relates topics to authors and papers, measures contribution, and finally ranks authors and organizations. Tools of this kind are an alternative to WUR (World University Rankings).

Szentirmai did a very complete analysis of university rankings [63] and their cultural and geographical circumstances, studying the results from the Times Higher Education WUR, the Academic Ranking of World Universities and the QS WUR, which are the most popular rankings. The top 200 universities from these three methodologies are analyzed, and it is found that the top 10 universities coincide. The United States dominates in all rankings; the reason for this phenomenon is that most university rankings use mono-dimensional systems with indicators that distinguish only research-intensive institutions. The conclusion of this work is that Europe will develop a system for international comparison of universities with wider ranges of criteria to be more competitive.

Anika Tabassum of the Bangladesh University of Engineering and Technology has been working on predictive models for university rankings, presenting different methodologies that implement learning algorithms with a newly proposed list-wise approach [65]. Depending on the data set, the indicators are broken down to show the behavior by region, area of research, gender, year and number of students, but in the end they decided that splitting by country gave the best results. They then separated the last year of the data set from the rest to be used as a test set, to verify that the algorithm correctly predicted the following year. In conclusion, they rated the prediction algorithm based on outlier detection as acceptable.


6.2 QS BSC

This study of the QS Best Student Cities ranking shed light on the benefits and challenges that the relationship between universities and cities brings to society. We found that having leading institutions is not the most important factor that attracts students to a city: a welcoming language such as English plays a very important role in students' choice of university, as do the ability to attend cultural and entertainment events and the availability of services that resemble the amenities of home.

However, what is lacking is an urban strategy that works together with universities to create policies that support international and local talent in remaining in the city and providing a highly qualified workforce, as well as leading-edge research that improves technology and quality of living. The marketisation of education is causing students from all around the world to live in poor conditions. Policy makers and institutions have to work hard on providing students with solutions to achieve their professional goals and become the citizens of the future that the new technological wave needs.

We believe that the QS BSC ranking reflects the strengths and weaknesses that each city is facing, and we want to encourage policy makers and administrators to strengthen the pull factors that encourage the attraction and stay of international students by improving services like housing, city signaling, administrative support, admissions, cultural resources and economic stability [73].

Regarding the same machine learning exercise carried out for the QS BSC dataset, the best accuracy achieved was 0.66, with the SVM Linear algorithm. This ranking is more recent, and the movement between cities is much more dynamic due to large changes in methodology every year, as well as the increasing entry of new cities. We therefore believe that this exercise would not work if we wanted to categorize a city, since the probability of assigning the wrong group is about 34% with our best result.

Stakeholders can make use of the panel data model to keep track of their indicators and project their current performance into future years. This can help them identify their areas of opportunity and prioritise which actions to take next, as we found that this ranking summarizes the main characteristics of the quality of education and quality of living that a city provides.

Most Latin American universities have achieved better positions in the ranking in recent years. Normally, their weakest indicators are citations per faculty and international students. We can conclude that this is because most of these universities are public and have to admit large numbers of local students, which is positive for their communities even if the ranking does not reflect it as a good score in the indicator.

As discussed in [69], students are the citizens and workforce of the future, and the ability of a community to host them will lead to greater regional development, which will provide diversification and can stimulate a working environment built around fulfilling the needs of the new population.

Rebecca Hughes, the director of education of the British Council, talks about community engagement as being able to combine local benefits with global thinking. Some of the activities she describes as examples are hosting visiting academics, even over Skype, to link foreign students with the university, and offering volunteering activities in local organizations [24].

International students want to have relationships with local students, even if co-national relationships help them feel connected, so universities should encourage cross-cultural activities. A study carried out in Melbourne, which is the third city in the ranking, analyses the phenomenon of local and international students socialising in different areas, as some foreign students have reported living in racialized spaces. This means that international students tend to find accommodation in specific areas, while local students live with their families or in the suburbs, with more networking and activities in the city [20].

The strategies to accommodate international students effectively vary from improving local library services [76] to changing the economy of the area in order to provide services and entertainment that align with the new culture [13, 38].

It is also a reality that many jobs will be taken by automation; however, highly creative and skilled jobs will remain in demand. This is why cities should focus on improving their collaboration with universities, in order to have continuous learning available for the citizens of the future. James Ransom, from Universities UK, looks ahead to the cities of 2065; citing the World Economic Forum, he describes education as the ultimate "soft connectivity", for cities to focus on knowledge as the tool to make investments and new technology more productive [50].

We must keep in mind, as mentioned in Lynch's work [32], that numbers are not able to represent what education provides for a society; education is a right of the population. But the marketisation of education is inevitable, and education has become one of the most valuable exports. Rankings and marketing have increased the flow of students around the globe, and it is estimated that by 2025 8 million students will be studying outside their home countries.

In the paper written by Jenkins about international students in Taiwan the problemsof international students were compared to what the faculty perceived as problems [26].They discovered that having a common language (English) that faculty could actuallyspeak fluently to communicate with international students, this may be the reason why thetop three countries with more cities ranked (Table 3.8) have English as a first language.Another two major problems students deal with when getting in a new country is the finan-cial aid and placement services. Helping students get a better experience is very importantfor higher education institutions if they want to attract more students in the future.

There is a great cultural background to the best ranked cities, in the way their culturecommunicates and teaches, from the study made by Pattison in British Universities [46],she noticed that students from Pacific Asia were not used to participating in class or givingan opinion, they do not feel comfortable with interactive learning. This is a great insightin the reason why some countries and cities are more capable of giving a successful inter-national experience, the confidence and adaptability from the western world help studentssolve problems more e�ciently.

Another linear regression was carried out by Marconi [35], to know the e↵ect ofrankings and accreditation on international students. He found that the ranking of the uni-versity to choose was more important for exchange students with high academic perfor-mance, but not so much for students with low academic achievement. Another importantfeature he found relevant for students was the location of the university, if it was close tothe beach or the equator. He also found that the price level was not significant and withnegative sign with respect to the students choice, which directly relates to our finding ofA↵ordabilty being the least relevant indicator in the QS BSC ranking.

There was a strong positive correlation between the rank of the city and the quality and number of its higher education institutions. This relates to the previous idea: most international students tend to be high performers, so when they get the opportunity to study abroad they try to get into the best institutions, and they can apply for financial aid. This benefits developed countries that may be facing a declining population, since they keep receiving qualified professionals.


However, big cities are facing a worldwide crisis in providing affordable accommodation to students, mainly local ones. In 2016 students across London went on rent strike, and students are not showing up for lectures because they work to pay rent [66]. The worst part is that the accommodation is usually of very poor quality and is getting incredibly expensive: in 2019 housing in Liverpool cost 110% of the average maintenance loan [30].

Homelessness is a growing issue in university cities in the United States, for example in California [6, 27], and in Germany [8]. Students are living in their cars and depending on the free food that universities can provide. Richard Vedder discusses how the government is failing to provide higher education to everyone, as accommodation targets elite students [70].

6.3 Summary

We were able to show that the panel model provided valuable information for stakeholders. We used the years 2011-2019 as input to predict 2020, applied the model to seven universities, and analyzed the differences obtained. The Bayesian network also provided insight into the internal structure of the indicators and can help to focus university resources on the most influential indicators.

Regarding the QS BSC dataset, we also obtained good predictive results for the first 25 universities in the 2019 ranking; that is, the same positions were obtained, with scores very similar to the real ones. Furthermore, we believe that the correlation exercise provides a hierarchy of importance for the strategies that a city can implement to improve its performance in attracting international students.

Regarding the research questions:

• The best prediction models we found were Panel Data for the final score and Random Forest for position prediction.

• In the different groups we find trends that allow universities to locate their strongest indicators and calculate the distance, in points, of their indicators from the target group in order to create long-term strategies.

• We have covered everything from exploratory analysis tools to predictive, machine learning and probabilistic tools. We believe that decision-making should rely on a comprehensive strategy to measure the current position of an institution and its ability to project future improvements.


Comparing our work on the world university rankings with that of others, we believe that it stands out in that the data we use is available, so all universities and administrators will be able to replicate our analysis. We created a methodology that makes it easier for them to understand how far they are from the position they seek and to track their future performance with the proposed modeling. As a disadvantage, we did not provide tools for students or parents; some of the other works try to help students weigh all the rankings to understand the different benefits of the university they are interested in. In addition, we depend on the availability of the data, whereas other approaches to rankings get data from web resources, such as Google Scholar and the ACM Digital Library, which help build independent rankings.

With the Cities data, we believe that this work is targeted at a wider group of people: we wanted to contribute insights into the quality of international education for administrators, students and parents. Compared with other works, ours focuses on providing statistical tools and data-oriented information. Most of the work we found focuses on giving international students a better experience by improving internal services in schools and the surrounding neighbourhoods. We believe our work can help students and parents make more data-oriented decisions, and that, in combination with other related works, it can help administrators increase the success of the international students in their schools.

In this chapter we discussed the contributions demonstrated by the results of this work, as well as the comparison with other works, in order to answer the research questions. We believe that we have improved predictability with a model that is understandable. Similarly, we believe that only in the case of universities is it possible to make predictions regarding the position. Finally, the different works we reviewed present analyses that help to understand the importance of rankings in the decision-making of students and parents and in the development of institutions.

Chapter 7

Conclusions

Higher education has evolved thanks to globalization, and institutions are now in a position to be compared internationally. With this work we believe that we can help simplify this comparative process by providing efficient metrics that show precisely the differences between an institution’s current place and the place it wishes to reach. In addition to statistical metrics, we used clustering to separate the universities of the Top 100 and to measure the distance between them in terms of citations and academic reputation. Finally, panel data was useful for determining the influence of each institution on the final score, improving the prediction with respect to the QS methodology.

The panel regression with fixed effects does in fact improve the accuracy of the model, and it helps to examine in depth the effects that the different changes in the methodology have on the final performance of each university. In the same way, the correlation coefficients clarify which are the most important areas of work for universities that want to enter a higher group: they show how the indicators have different degrees of correlation and in which areas a university needs to work.
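
To illustrate what the fixed-effects panel regression does, the following is a minimal sketch, not the code used in this thesis. It estimates a single slope with the within transformation: each university's own mean is subtracted from its scores and indicator values, so time-invariant institutional effects drop out, and the fixed effect of each university is recovered afterwards. The university names, the single indicator and all numbers are hypothetical.

```python
from collections import defaultdict


def within_estimator(rows):
    """Fixed-effects (within) slope for a single regressor.

    rows: iterable of (entity, x, y) observations, e.g.
          (university, indicator score, final score).
    Returns (beta, effects), where effects maps each entity
    to its estimated fixed effect (intercept).
    """
    groups = defaultdict(list)
    for entity, x, y in rows:
        groups[entity].append((x, y))

    # Per-entity means of the regressor and the response.
    means = {
        entity: (
            sum(x for x, _ in obs) / len(obs),
            sum(y for _, y in obs) / len(obs),
        )
        for entity, obs in groups.items()
    }

    # Pooled OLS slope on the demeaned data.
    num = 0.0
    den = 0.0
    for entity, obs in groups.items():
        mean_x, mean_y = means[entity]
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    beta = num / den

    # Entity effects: what is left of the mean response after the slope.
    effects = {e: mean_y - beta * mean_x for e, (mean_x, mean_y) in means.items()}
    return beta, effects


# Hypothetical toy panel: three yearly observations for two universities.
panel = [
    ("Univ A", 60.0, 50.0), ("Univ A", 70.0, 55.0), ("Univ A", 80.0, 60.0),
    ("Univ B", 20.0, 40.0), ("Univ B", 30.0, 45.0), ("Univ B", 40.0, 50.0),
]
beta, effects = within_estimator(panel)
```

In this toy panel both universities share the same slope but have different intercepts, which is the situation in which a pooled regression without fixed effects would give a misleading slope.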

We believe that panel data outperforms other prediction methods because it can identify individuals in the data and provides an interpretable model that university administrators can justify and use. We also consider that grouping universities by position in the ranking makes it possible to identify the characteristics of the institutions positioned in the highest part of the ranking, and gives universities that want to move up a clear explanation of what happens in the groups they want to belong to. Finally, we believe that we have helped to facilitate the decision-making of university authorities in planning the improvements they wish to apply in their institutions to remain competitive and improve their position in the long term.

Regarding the possibility of classifying universities into groups based on their position in the ranking, we believe that random forest is the best classification algorithm, since it is flexible with respect to the historical movement of universities. We can propose the scores that a university expects to have in the future and evaluate them to see whether they are enough to reach its goal.

Bayesian networks, as a probabilistic method, allow us to discover the relationships between the indicators. In this case, Citations per Faculty influences Faculty Student Ratio and International Faculty, and we believe that it also influences Academic Reputation. International Students influences International Faculty, which makes sense, since a student pursuing a postgraduate degree at a university may later be considered for hiring by the same institution. Thus, indirect actions can be taken to help improve the performance of a university in the long term.
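
The dependencies described above can be written down as a small directed graph. The sketch below is only an illustration of the structure named in this paragraph: it contains no probabilities, the edge into Academic Reputation is the one stated above as a belief rather than a confirmed result, and the helper names `downstream` and `parents` are hypothetical.

```python
# Directed edges "influences", taken from the paragraph above.
# The edge to Academic Reputation is the tentative one.
EDGES = {
    "Citations per Faculty": [
        "Faculty Student Ratio",
        "International Faculty",
        "Academic Reputation",
    ],
    "International Students": ["International Faculty"],
}


def downstream(graph, start):
    """Return every indicator reachable from `start` along the edges."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def parents(graph, target):
    """Return the indicators with a direct edge into `target`, sorted."""
    return sorted(src for src, dsts in graph.items() if target in dsts)


# Indicators that an action on Citations per Faculty may affect indirectly.
affected = downstream(EDGES, "Citations per Faculty")

# Indicators that act directly on International Faculty.
levers = parents(EDGES, "International Faculty")
```

Reading the graph in both directions is what supports the idea of indirect actions: `downstream` lists what an investment may affect, and `parents` lists the levers available for a given indicator.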

As for zero-sum game theory, it allows us to assign a utility to each of these actions and to know the impact that it has among competitors. In any case, we believe that a university should not deviate from its mission when deciding where to place its resources, since each university responds to different needs in its community.
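
As a hedged illustration of how utilities over actions can be compared in a zero-sum setting, the sketch below computes pure-strategy security levels for a two-player payoff matrix. It is not the game model developed in this thesis: the two actions per player and the payoff values (points gained by the row university and lost by its competitor) are entirely hypothetical.

```python
def maximin(payoffs):
    """Row player's pure maximin strategy.

    payoffs[i][j] is the row player's gain when it plays action i
    and the competitor plays action j.
    Returns (row index, guaranteed gain).
    """
    worst = [min(row) for row in payoffs]
    best_row = max(range(len(payoffs)), key=lambda i: worst[i])
    return best_row, worst[best_row]


def minimax(payoffs):
    """Column player's pure minimax strategy.

    Returns (column index, largest loss it can be forced to accept).
    """
    n_cols = len(payoffs[0])
    worst = [max(row[j] for row in payoffs) for j in range(n_cols)]
    best_col = min(range(n_cols), key=lambda j: worst[j])
    return best_col, worst[best_col]


# Hypothetical payoffs for the row university.
# Rows: its actions; columns: the competitor's actions.
PAYOFFS = [
    [2.0, -1.0],
    [1.0, 0.5],
]

row_choice, row_value = maximin(PAYOFFS)
col_choice, col_value = minimax(PAYOFFS)

# When both values coincide, the pair of actions is a saddle point.
has_saddle_point = row_value == col_value
```

With many institutions and actions the matrix grows quickly, which is consistent with the remark in Future Work that game theory becomes very complex for the full dataset.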

The QS Best Student Cities ranking shed light on the benefits and challenges that the relationship between universities and cities brings to society. We found that having leading institutions is not the most important factor in attracting students to a city: a welcoming language such as English plays a very important role in students’ choice of university, as do the ability to attend cultural and entertainment events and the ability to find services that resemble the amenities of home.

However, what is lacking is an urban strategy that works together with universities to create policies that support international and local talent in remaining in the city and providing a highly qualified workforce, as well as leading-edge research that improves technology and quality of living. The marketization of education is causing students from all around the world to live in poor conditions. Policy makers and institutions have to work hard to provide students with solutions so that they can achieve their professional goals and become the citizens of the future that the new technological wave needs.

We believe that the QS BSC ranking reflects the strengths and weaknesses that each city is facing, and we want to encourage policy makers and administrators to strengthen the pull factors that encourage the attraction and retention of international students by improving services like housing, city signaling, administrative support, admissions, cultural resources and economic stability [73].

Stakeholders can use the panel data model to keep track of their indicators and project their current performance into future years. This can help them identify their areas of opportunity and prioritise which actions to take next, as we found that this ranking summarizes the main characteristics of the quality of education and quality of living that a city provides.

7.1 Future Work

As future work we have different proposals. One of them is to transfer these models to other well-known university ranking datasets such as THE and ARWU, each of which has a different methodology and number of indicators. We believe that our work is transferable to these rankings.

It is also possible to enrich the study by gathering complementary information from other available sources that may be influencing the rankings and that would help universities monitor their performance. This database can be internal, that is, based on a university’s own teacher recruitment, international student enrollment and publications, and can help it supplement the predictions with these decisions to refine future improvement projections.

Another concept that began to be explored was the learning of dynamic Bayesian networks. Due to time constraints these results were not concluded, but we think that this approach may be useful since our database is time dependent.

It is also possible to extend the game theory work, and in this way help to provide a broader vision of university competitiveness by modeling the different actions that institutions carry out to improve their performance worldwide. Because game theory becomes very complex with the large number of individuals in our dataset, it is also possible to propose a decision theory model.

Regarding the city dataset, a comparison was started with the Global Power City Index (GPCI) and its indicators. We believe that other works that take into account economic, environmental, cultural and standard-of-living development share very similar characteristics with the indicators proposed by QS, which means that the level of development is one of the most important factors. Even so, cities with different characteristics are making room for themselves by offering more accessible proposals.

7.2 Summary

We believe that the research questions were answered positively. First, we showed that ranking databases are a useful tool for performing data analysis. The proposed multiple regression and panel data models were able to make predictions that take into account the evolution over time of each individual in the datasets. We also carried out groupings, which allow us to study the performance of different institutions depending on their position. This sheds light on the improvement planning of an institution that seeks to enter the top group, since it reveals the general strengths of the group it wishes to join. We believe that we have designed a methodology for data exploration and for score and position prediction that can help an institution or city design strategies to improve its performance in the ranking.

We have also seen that the needs of universities continue to evolve, and that what it takes to succeed in delivering the best education and being ranked best will continue to evolve with them. That is why we believe that this work will continue to develop, depending on where higher education advances.

Finally, we believe that this work made a relevant analysis of the characteristics of higher education institutions and of the most influential international cities among students. Its aim is to promote efficient improvements in the different areas where universities and cities need to work together, so that highly talented students can perform well and these cities gain highly qualified professionals; in this way, development at the student level leads to an improvement in the quality of life in the long term.

Appendix A

Recommendations for university administrators in order to enhance ranking outcomes.

Part of the vision of this thesis is to help decision-makers through a structured process for analyzing results. Based on the current results, we recommend the following steps:

1. Locate the target group and perform a distance analysis by indicator using the Min/Max table.

2. With the correlation table, understand which indicators are most influential in the group of interest and identify the internal actions that can influence these indicators.

3. Complement decision-making with the Bayesian network and analyze dependencies between indicators that can help make more strategic decisions.

4. Use linear and loess regression of the indicators to project the following year. With this new set of indicators, the final score can be predicted using the panel data model and the position range using random forest.

5. These results can be tuned using internal knowledge of the university’s performance and of the influence of the decisions that have been made to improve its indicators.
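
Steps 1 and 4 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the distance analysis is taken to mean the points missing to reach the target group's minimum for each indicator, only the linear part of step 4 is shown (loess is omitted), and the function names and all scores are hypothetical.

```python
def indicator_gaps(own, target_min):
    """Step 1: points missing per indicator to reach the target group's minimum.

    own and target_min map indicator names to scores.
    A gap of 0.0 means the university already meets that minimum.
    """
    return {name: max(0.0, target_min[name] - own[name]) for name in target_min}


def linear_projection(years, values, next_year):
    """Step 4 (linear part): least-squares line through (year, value),
    evaluated at next_year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return mean_y + slope * (next_year - mean_x)


# Hypothetical scores for one university and for the minimum of its target group.
own_scores = {"Academic Reputation": 62.0, "Citations per Faculty": 48.5}
group_minimum = {"Academic Reputation": 70.0, "Citations per Faculty": 45.0}
gaps = indicator_gaps(own_scores, group_minimum)

# Hypothetical three-year history of one indicator, projected one year ahead.
projected = linear_projection([2017, 2018, 2019], [40.0, 42.0, 44.0], 2020)
```

The projected indicators would then be the inputs to the panel data model for the final score and to the random forest for the position range, as described in step 4.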

Appendix B

Publications

B.1 Articles

A Data Analytics study on the influence of top universities over world-class cities based on QS Best-Student-Cities Ranking. (Accepted for CONF-CDS 2020)

Predicting World University Performance with Data Analytics and Machine Learning Models using QS Datasets. Sent for evaluation to 5th Workshop on Educational Innovation.

B.2 Book chapter

Razonamiento estadístico. (Conocimiento y Razonamiento Computacional http://amexcomp.mx/files/Libro-CyR.pdf)

B.3 Presentations

Panel participant. (University Rankings Summit)

Poster presentation. (CID)

Bibliography

[1] Awad, M., and Khanna, R. Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System Designers. Apress, 2015.

[2] Bai, J. Panel data models with interactive fixed effects. Econometrica 77, 4 (Jul 2009), 1229–1279.

[3] Baltagi, B. H. Econometric analysis of panel data. John Wiley & Sons, Inc., 2016.

[4] Ben-Gal, I. Bayesian networks. Encyclopedia of Statistics in Quality and Reliability (2008).

[5] Benito, M., Gil, P., and Romera, R. Funding, is it key for standing out in the university rankings? Scientometrics 121, 2 (Sep 2019), 771–792.

[6] Bizjak, T., and Morrar, S. The new face of Sacramento’s affordable housing crisis: College students forced to drop out, Oct 2019.

[7] Bothwell, E., McKie, A., Macfarlane, B., Soteriou, H., Ross, J., Ross, D., and Hsueh, C.-M. The world university rankings 2020: methodology, Sep 2019.

[8] Brady, K. German dorms are so pricey, students are building their own, Apr 2019.

[9] Brandt, F., Fischer, F., Harrenstein, P., and Shoham, Y. Ranking games. Artificial Intelligence 173, 2 (2009), 221–239.

[10] Brankovic, J., Ringel, L., and Werron, T. How rankings produce competition: The case of global university rankings. Zeitschrift für Soziologie 47 (10 2018), 270–288.

[11] Cantu-Ortiz, F. J. Research Analytics: Boosting University Productivity and Competitiveness through Scientometrics. CRC Press, Taylor & Francis Group, 2018.

[12] Cleveland, W. S., and Devlin, S. J. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association 83, 403 (Sep 1988), 596–610.

[13] Collins, F. L. International students as urban agents: International education and urban transformation in Auckland, New Zealand. Geoforum 41, 6 (2010), 940–950.

[14] Consultancy, S. ARWU methodology.

[15] Dearden, J., Grewal, R., and Lilien, G. Framing the university ranking game: actors, motivations, and actions. Ethics in Science and Environmental Politics 13, 2 (2014), 131–139.

[16] Dobrota, M., Bulajic, M., Bornmann, L., and Jeremic, V. A new approach to the QS university ranking using the composite I-distance indicator: Uncertainty and sensitivity analyses. Journal of the Association for Information Science and Technology 67, 1 (2016), 200–211.

[17] Downing, K., and Ganotice, F. World University Rankings and the Future of Higher Education. Advances in educational marketing, administration, and leadership (AEMAL) book series. IGI Global, 2016.

[18] Elken, M., Hovdhaugen, E., and Stensaker, B. Global rankings in the Nordic region: challenging the identity of research-intensive universities? Higher Education 72, 6 (Dec 2016), 781–795.

[19] Eynard, D., Javarone, M. A., and Matteucci, M. Clustering Algorithms. Springer New York, New York, NY, 2017, pp. 1–14.

[20] Fincher, R., and Shaw, K. Enacting separate social worlds: ‘international’ and ‘local’ students in public space in central Melbourne. Geoforum 42, 5 (2011), 539–549.

[21] Gopal, K., and Shitan, M. Cluster analysis of top 200 universities in mathematics. 2015 International Symposium on Mathematical Sciences and Computing Research (iSMSC) (2015).

[22] Grewal, R., Dearden, J. A., and Lilien, G. L. The university rankings game: Modeling the competition among universities for ranking. The American Statistician 62, 3 (2008), 232–237.

[23] Grewal, R., Dearden, J. A., and Lilien, G. L. The university rankings game. The American Statistician 62, 3 (2008), 232–237.

[24] Hughes, R. How do international students shape UK towns and cities?, May 2017.

[25] Hussain, M. M., Rahman, S. A., Beg, M. S., and Ali, R. Cognitive fuzzy rank aggregation for non-transitive rankings: An institute recommendation system case study. 2018 IEEE 17th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC) (2018).

[26] Jenkins, J. R., and Galloway, F. The adjustment problems faced by international and overseas Chinese students studying in Taiwan universities: a comparison of student and faculty/staff perceptions. Asia Pacific Education Review 10, 2 (Jun 2009), 159–168.

[27] Jones, C. Homeless in college: Students sleep in cars, on couches when they have nowhere else to go, Dec 2019.

[28] Kaur, K. The economic impact of international students. Universities UK (March 2017), 1–7.

[29] Kondakci, Y., Bedenlier, S., and Zawacki-Richter, O. Social network analysis of international student mobility: uncovering the rise of regional hubs. Higher Education 75, 3 (Mar 2018), 517–535.

[30] Lee, M., and Baucher, C. Students: don’t let rising rents drive you out of university, Mar 2019.

[31] López-Martín, E., Moreno-Pulido, A., and Expósito-Casas, E. Validez predictiva del U-Ranking en las titulaciones universitarias de ciencias de la salud. Bordón. Revista de Pedagogía 68, 2 (2017).

[32] Lynch, K. Control by numbers: new managerialism and ranking in higher education. Critical Studies in Education 56, 2 (2015), 190–207.

[33] Maimon, O. Z., and Rokach, L. Data mining and knowledge discovery handbook. Springer, 2010.

[34] Marconi, G. Rankings, accreditations, and international exchange students. IZA Journal of European Labor Studies 2, 1 (May 2013).

[35] Marconi, G. Rankings, accreditations, and international exchange students. IZA Journal of European Labor Studies 2, 1 (Jul 2013), 5.

[36] Maua, D. D. Score-based structure learning, Nov 2018.

[37] McAleer, M., Nakamura, T., and Watkins, C. Size, internationalization, and university rankings: Evaluating and predicting Times Higher Education (THE) data for Japan. Sustainability 11, 5 (2019), 1366.

[38] Mistlin, A. Want black students to feel at home? don’t ignore the little things, May 2019.

[39] Mitchell, T. M. Machine learning. McGraw-Hill, 1997.

[40] Morentin, J. I. M. D. Developing the concept of international education: Sixty years of UNESCO history. Prospects 41, 4 (2011), 597–611.

[41] Mori, M. How do the scores of world university rankings distribute? 2016 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI) (2016).

[42] OECD. How do rankings impact on higher education? IMHE info (Dec 2007), 1–4.

[43] Olcay, G. A., and Bulu, M. Is measuring the knowledge creation of universities possible?: A review of university rankings. Technological Forecasting and Social Change 123 (2017), 153–160.

[44] O’Leary, J. Times good university guide 2020. Times Books, 2019.

[45] O’Leary, J., Quacquarelli, N., and Ince, M. Guide to the world’s top universities. QS Quacquarelli Symonds, 2006.

[46] Pattison, S., and Robson, S. Internationalization of British universities: Learning from the experiences of international counselling students. International Journal for the Advancement of Counselling 35, 3 (Sep 2013), 188–202.

[47] Pride, D., and Knoth, P. Peer review and citation data in predicting university rankings, a large-scale analysis. Digital Libraries for Open Knowledge Lecture Notes in Computer Science (Sep 2018), 195–207.

[48] Provost, F., and Fawcett, T. Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly, 2013.

[49] QSIU. QS world university rankings, 2019.

[50] Ransom, J. Future of cities: Universities and cities, November 2015.

[51] Rencher, A. C., and Schaalje, G. B. Linear Models in Statistics. John Wiley & Sons, 2008.

[52] Rokach, L., and Maimon, O. Data mining with decision trees: theory and applications. World Scientific Pub. Co, 2015.

[53] Schlögl, C. European doctoral forum at the 14th International Society of Scientometrics and Informetrics Conference. Bulletin of the American Society for Information Science and Technology 40, 1 (2013), 17–18.

[54] Schober, P., Boer, C., and Schwarte, L. A. Correlation coefficients. Anesthesia & Analgesia 126, 5 (May 2018), 1763–1768.

[55] Schubert, E., and Rousseeuw, P. J. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. CoRR abs/1810.05691 (2018).

[56] Scutari, M. Bayesian network structure learning, 2019.

[57] Shalev-Shwartz, S., and Ben-David, S. Understanding machine learning: from theory to algorithms. Cambridge University Press, 2017.

[58] Shearer, C. The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing 5, 5 (2000).

[59] Shi, C., Quan, J., and Li, M. Information extraction for computer science academic rankings system. 2013 International Conference on Cloud and Service Computing (2013).

[60] Soh, K. World university rankings: statistical issues and possible remedies. World Scientific, 2017.

[61] Sowter, B., Hijazi, S., and Reggio, D. Ranking world universities. Advances in Educational Marketing, Administration, and Leadership World University Rankings and the Future of Higher Education (2017), 1–24.

[62] Symonds, Q. Q. QS best student cities - methodology, Aug 2018.

[63] Szentirmai, L., and Radacs, L. World university rankings qualify teaching and primarily research. 2013 IEEE 11th International Conference on Emerging eLearning Technologies and Applications (ICETA) (2013).

[64] Tabassum, A., Hasan, M., Ahmed, S., Tasmin, R., Abdullah, D. M., and Musharrat, T. University ranking prediction system by analyzing influential global performance indicators. 2017 9th International Conference on Knowledge and Smart Technology (KST) (2017).

[65] Tabassum, A., Hasan, M., Ahmed, S., Tasmin, R., Abdullah, D. M., and Musharrat, T. University ranking prediction system by analyzing influential global performance indicators. 2017 9th International Conference on Knowledge and Smart Technology (KST) (2017).

[66] Taylor, D. University students across London take part in rent strike, May 2016.

[67] Tolles, J., and Meurer, W. J. Logistic regression. JAMA 316, 5 (Feb 2016), 533.

[68] UNESCO. UNESCO Institute of Statistics, 2019.

[69] van den Berg, L., and Russo, A. The student city. Strategic planning for student communities in EU cities. European Congress of the Regional Science Association (09 2003).

[70] Vedder, R. The new campus housing bubble, Sep 2019.

[71] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S. Constrained k-means clustering with background knowledge. Proceedings of the Eighteenth International Conference on Machine Learning (2001), 577–584.

[72] Willmott, C., and Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research 30 (2005), 79–82.

[73] Wu, C., and Wilkes, R. International students’ post-graduation migration plans and the search for home. Geoforum 80 (2017), 123–132.

[74] Yudkevich, M. Global university rankings as the Olympic games of higher education. The Global Academic Rankings Game (2016), 1–11.

[75] Zaki, M. J., and Meira, W. Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, 2017.

[76] Zhou, L., Han, Y., and Li, P. Home away from home: Extending library services for international students in China’s universities. The Journal of Academic Librarianship 44, 1 (2018), 52–59.

Curriculum Vitae

Born in Mexico City in 1992. Earned the Physics Engineering degree from Universidad Autonoma Metropolitana Campus Azcapotzalco, with a thesis that studied the theoretical characteristics of Gallium Arsenide quantum bits using the Von Neumann master equation and the Wangsness-Bloch dissipator. Also worked on an AMEXCID project for diabetes in collaboration with Universidad de la Republica de Uruguay, running microsecond simulations of insulin and the insulin receptor using Molecular Dynamics and Markov Chains. Worked as a Research Assistant teaching physics laboratories to up to 100 undergraduate engineering students at UAM-A. She was then accepted into the Master of Science in Computer Science program at Tecnologico de Monterrey Campus Estado de Mexico, where she expects to graduate in June 2020.

Outside research she loves to play piano, bake cookies for her family and classmates, dance classical ballet, read, meditate and exercise outdoors.

This document was typed in using LaTeX 2ε by Ana Carmen Estrada Real. The style file phdThesisFormat.sty used to set up this thesis was prepared by the Center of Intelligent Systems of the Instituto Tecnologico y de Estudios Superiores de Monterrey, Monterrey Campus.
