Gryffindor or Slytherin? The effect of an Oxford College
David Lawrence∗
Supervisor: Dr Johannes Abeler
Submitted in partial fulfilment of the requirements for the degree of
Master of Philosophy in Economics
Department of Economics
University of Oxford
Trinity Term 2016
∗ I would like to thank my supervisor, Johannes Abeler, for the patient guidance, encouragement and advice he has provided throughout my time as his student. I have been extremely lucky to have a supervisor who cared so much about my work, and who responded to my questions and queries so enthusiastically and promptly. I am very grateful to Dr Gosia Turner in Student Data Management and Analysis at Oxford University for providing the data and answering my many questions about it. Valuable comments were received from Theres Lessing, Jonas Mueller-Gastell, Leon Musolff and Matthew Ridley. This work was supported by the Economic and Social Research Council. Word count: 29,904 (356 words on page 2, including footnotes, multiplied by 84 pages, including the title page).
Abstract
Students at Oxford University attend different colleges. Does the college a student
attends matter for their examination results? To answer this question, I use data on all
Oxford applicants and entrants between 2009 and 2013, focusing primarily on Preliminary
Examination (Prelims) results for 3 courses: Philosophy, Politics and Economics (PPE),
Economics and Management (E&M) and Law. I use two methods to account for the
possibility student ability differs systematically between colleges. First, I control for
“selection on observables” by running an OLS regression on college dummy variables
and variables capturing almost all information available to admissions tutors. Results
show that colleges matter statistically and practically. Colleges have a modest impact on
average Prelims scores, similar to the impact secondary schools have on GCSE results. A
one standard deviation increase in college effectiveness leads to a 0.11 standard deviation
increase in PPE average Prelims score. The equivalent figures are 0.15 for E&M, 0.14 for
Law and 0.09 for all courses combined. Second, I take advantage of a special feature of the
Oxford admissions process – that “open applicants” are randomly assigned to colleges –
to control for “selection on observables and unobservables”. Results suggest differences in
college effectiveness are large and accounting for unobservable ability can change college
effectiveness estimates considerably. However, the results are very imprecise so it is
difficult to draw strong conclusions. I also test whether my college effectiveness estimates
can be explained by college characteristics and find college endowment and peer effects,
operating through the number of students per course within a college, are related to college
effectiveness.
Keywords: Oxford, college effectiveness, selection bias, selection on observables and
unobservables, examination results
Contents
1 Introduction
  1.1 Prior Literature
2 Institutional Background
3 Theoretical Model
  3.1 Defining College Effects
  3.2 College Admissions
    3.2.1 Applications and Applicant Ability
    3.2.2 Application Profiles
    3.2.3 Enrolment Probabilities and Expected Exam Results
    3.2.4 The College Admissions Problem
4 Econometric Models
  4.1 Model 1 – Norrington Table
  4.2 Model 2 – Selection on Observables
  4.3 Model 3 – Selection on Observables and Unobservables
5 Data
  5.1 Why use Four Datasets?
  5.2 Choice of Outcome Variable
  5.3 Choice of Control Variables
  5.4 Sample Selection
  5.5 Descriptive Statistics
    5.5.1 Testing Assumptions for Selection on Observables and Unobservables
6 Results
  6.1 Results for Norrington Table Plus and Selection on Observables
  6.2 Robustness Checks for Norrington Table and Selection on Observables
    6.2.1 Alternative Outcome Variables
    6.2.2 Interval Scale Metric Assumption
    6.2.3 Heterogeneity in College Effectiveness across Students of Different Types
  6.3 Results for Selection on Observables and Unobservables
7 Characteristics of Effective Colleges
8 Discussion and Limitations
9 Conclusion and Future Work
A Proof of Proposition 1
List of Tables
1 Information Available in each Dataset
2 Description of Control Variables
3 Sample Selection: PPE
4 Sample Selection: All Subjects
5 Sample Selection: E&M
6 Sample Selection: Law
7 Application, Offer and Enrolment Statistics: PPE and E&M
8 Application, Offer and Enrolment Statistics: Law and All Subjects
9 Mean Applicant and Exam Taker Characteristics: PPE
10 Mean Applicant and Exam Taker Characteristics: E&M
11 Mean Applicant and Exam Taker Characteristics: Law
12 Mean Applicant and Exam Taker Characteristics: All Subjects
13 Tests for Differences in Mean and Variance of Applicant Ability across Colleges
14 P-values from Balance Tests
15 Regressions: PPE
16 Regressions: E&M
17 Regressions: Law
18 Regressions: All Subjects
19 Correlation in College Effects across Courses
20 Alternative Dependent Variable Regressions: PPE
21 Alternative Dependent Variable Regressions: E&M
22 Alternative Dependent Variable Regressions: Law
23 P-values from Tests for Heterogeneity in College Effects across Students
24 Selection on Observables and Unobservables Results for various λ1: PPE, E&M and Law
25 Selection on Observables and Unobservables Results: All Subjects, English, Maths and History
26 Second Stage Regression Results: Impact of Endowment
27 Second Stage Regression Results: Evidence of Peer Effects
List of Figures
1 Applicant Ability and College Admissions Decisions
2 College Ranking by Course: Norrington Table Plus vs Selection on Observables
3 Comparison of Selection on Observables College Ranking across Courses
4 Comparison of College Rankings across Models: All Subjects
1 Introduction
The popular Harry Potter novels of J.K. Rowling are set in the fictional Hogwarts School of Witchcraft
and Wizardry where all the students are magically assigned by a “sorting hat” to one of four houses:
Gryffindor, Slytherin, Hufflepuff, and Ravenclaw. Oxford University is organised in a similar way to
Hogwarts. Oxford divides students into colleges, just as Hogwarts divides students into houses. The
college a student attends can influence not only the facilities available to them (like catering services
and libraries), their accommodation and their peers but also the teaching they receive.
In this paper I address two basic questions that arise in the context of Oxford colleges. First,
to what extent do colleges “make a difference” to student outcomes? Second, are any differences in
college effectiveness¹ captured by college characteristics such as endowment, age and size? To answer
these questions I use admissions and examination (exam) data on all Oxford applicants and entrants
between 2009 and 2013, focusing on how exam results (specifically first year “Prelims” results) vary
across colleges in three particular courses: Philosophy, Politics and Economics (PPE), Economics
and Management (E&M) and Law as well as across all courses (“All Subjects”).
The key complication in answering these questions is selection bias. Selection into colleges is
non-random and thus student ability may differ systematically between colleges. Selection occurs:
(i) at the application stage (students choose to apply to one college and not to others); (ii) at the
admissions stage (admissions tutors take decisions to make offers to some students and not others);
and (iii) at the enrolment stage (students with offers decide whether they want to accept the offer).
Non-random selection into colleges can be based on observable characteristics (e.g. prior attainment)
and unobservable characteristics (e.g. motivation) which may themselves be correlated with exam
results. Failure to adequately control for such selection would lead to biased estimates of college
effectiveness, favouring colleges with higher ability students.
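The direction of this bias can be illustrated with a minimal numerical sketch. The data, the college labels and the one-for-one mapping from prior ability to exam score are all hypothetical assumptions made for illustration; they are not taken from the Oxford data or from the models estimated later.

```python
from statistics import mean

# Hypothetical data: (college, prior ability, exam score).
# By construction both colleges have the same true effect (zero marks),
# so each exam score simply equals the student's prior ability.
students = [
    ("A", 70.0, 70.0), ("A", 68.0, 68.0), ("A", 66.0, 66.0),
    ("B", 62.0, 62.0), ("B", 60.0, 60.0), ("B", 58.0, 58.0),
]

def college_mean(rows, college, column):
    """Average of one column (1 = ability, 2 = exam score) within a college."""
    return mean(r[column] for r in rows if r[0] == college)

# Naive comparison of average exam results between colleges.
raw_gap = college_mean(students, "A", 2) - college_mean(students, "B", 2)

# Difference in average prior ability between the two intakes.
ability_gap = college_mean(students, "A", 1) - college_mean(students, "B", 1)

# Assumed adjustment: ability feeds into scores one-for-one, so subtracting
# the ability gap recovers the (zero) difference in college effects.
adjusted_gap = raw_gap - ability_gap

print(raw_gap, ability_gap, adjusted_gap)
```

In this constructed example the raw comparison credits college A with an advantage that is due entirely to its stronger intake, which is the sense in which unadjusted comparisons favour colleges with higher ability students.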
To overcome the problem of selection bias I employ two empirical methods. First, I estimate
an OLS regression which identifies college effects only under a “selection on observables” assumption.
Detailed data on almost all variables used by admissions tutors provides some support for
this assumption. Nevertheless, concern remains that “selection on unobservables” may bias college
effectiveness estimates.
¹ I use the term “college effectiveness” to mean the contribution of colleges to student examination results. I use “college effectiveness”, “college effect” and “college quality” interchangeably.
Second, I take advantage of a special feature of the Oxford admissions process: some applicants
choose to make an “open application”. These applicants do not apply directly to a college; instead
their application profiles are randomly allocated between those colleges that receive relatively few
direct applicants. Intuitively, random assignment implies all colleges receive open applicants with
equal ability on average. Hence, in relative terms, colleges accepting a large proportion of open
applicants allocated to them must have received weak direct applications and have low admissions
standards, while colleges that accept a low proportion of open applicants must have received strong
direct applications and have high admissions standards. I formalise this intuition in a theoretical
model. Given additional assumptions concerning the distribution of applicant ability, this method
can account for “selection on observables and unobservables”. Exam results differences across colleges
remaining after controlling for both observables and unobservables can be considered a measure of
college effectiveness or alternatively college “value-added”.²
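The intuition behind the open-applicant argument can be sketched in a few lines of code. This is a simplified illustration and not the theoretical model developed in section 3: it assumes, hypothetically, that open-applicant ability is standard normal, that each college admits every open applicant above a single ability threshold, and that the acceptance shares shown are made up.

```python
from statistics import NormalDist

# Assumption for illustration: ability of randomly assigned open applicants
# follows a standard normal distribution at every college.
ability = NormalDist(mu=0.0, sigma=1.0)

def implied_threshold(accept_share):
    """Ability cutoff implied by the share of open applicants a college accepts,
    assuming the college admits everyone above a single threshold."""
    return ability.inv_cdf(1.0 - accept_share)

# Hypothetical acceptance shares of allocated open applicants.
shares = {"College A": 0.30, "College B": 0.10}
thresholds = {name: implied_threshold(s) for name, s in shares.items()}

for name, cutoff in thresholds.items():
    print(f"{name}: implied admissions threshold = {cutoff:.2f} SD")
```

Under these assumptions the college that accepts the smaller share of its open applicants has the higher implied admissions standard, which is the ordering described in the text.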
My results reveal colleges matter. A simple comparison of average exam results suggests large
differences between colleges. When I account for observable student characteristics, exam result
differences shrink because high ability students tend to attend more effective colleges. The vast
majority of variation in exam results is due to between-student differences. However, even after
controlling for observables there remains strong evidence that colleges differ in their effectiveness
in boosting student exam results – college effectiveness differences are statistically and practically
significant in all courses I consider. A one standard deviation increase in college effectiveness leads
to a 0.11 standard deviation increase in Prelims average score in PPE (a 0.65 mark increase). This
would be enough to move a 50th percentile student up to the 55th percentile. The estimated standard
deviation of college effectiveness is 0.15 for E&M, 0.14 for Law and 0.09 across All Subjects. College
effectiveness differences are comparable to school effectiveness differences and slightly lower than
teacher effectiveness differences.
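As a rough check on the PPE figures above, the sketch below converts a 0.11 standard deviation gain into a percentile move and backs out the implied standard deviation of marks. It assumes, purely for illustration, that Prelims average scores are normally distributed; the thesis does not state that this is how the percentile figure was derived.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

effect_sd = 0.11   # effect of a one SD better college, in student-score SDs (from the text)
mark_gain = 0.65   # the same effect expressed in marks (from the text)

# Percentile reached by a median student after the gain, assuming normal scores.
percentile_after = 100.0 * normal_cdf(effect_sd)

# Standard deviation of PPE Prelims average scores implied by the two figures.
implied_sd_marks = mark_gain / effect_sd

print(f"percentile after gain: {percentile_after:.1f}")
print(f"implied SD of marks:   {implied_sd_marks:.2f}")
```

Under the normality assumption the median student moves to roughly the 54th to 55th percentile, broadly in line with the figure quoted in the text, and the two reported numbers together imply a standard deviation of about six marks.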
² Although widely used, the “value-added” term is questionable because inputs and outputs are measured in different units (Goldstein and Spiegelhalter, 1996).
I also produce course-specific college rankings that improve on the Norrington table³ as they
account for observable student characteristics. College rankings at an aggregate level are of limited
use because college effectiveness differs across courses – hence my focus on courses within
colleges. Course-specific college rankings are subject to large confidence intervals because of the low
number of students per course at each college.
Accounting for selection on unobservable student characteristics would likely further change the
results. Unfortunately, for PPE, E&M and Law, estimation error prevents me from obtaining point
estimates for the effectiveness of each college (as only a small number of open applicants enrol
at Oxford). Instead I present college effectiveness estimates for different parameterisations of the
relationship between prior ability and exam results. I do obtain college effectiveness estimates for
some other courses (English, Maths and History) and for All Subjects combined.⁴ The results suggest
variation in college effectiveness remains large and that unobservable ability can dramatically change
college effectiveness estimates. However, the estimates are imprecise so it is difficult to reach strong
conclusions.
Having established that college effects exist, I use a second stage regression to examine whether
they can be explained by college characteristics. The most interesting finding is evidence that peer
effects, operating through the number of students per college studying the same course, contribute
to college effectiveness. Reverse causality is also possible – if a college happens to be strong in one
subject for whatever reason, it is likely to hire more fellows and thus increase the size of
the cohort at that college. If there are benefits to clustering together students studying the same
subject then a potential policy implication would be to close small, under-performing courses within
a college. There is also evidence that richer colleges are more effective than poorer colleges. However,
given that college effectiveness is imperfectly correlated across courses, it seems likely that college
effectiveness is primarily determined by course-specific variables related to teaching and peer effects.
Overall, much of the variation in college effectiveness remains unexplained.
³ The Norrington table, published each year, documents the degree outcomes of students at each Oxford college. It ranks colleges using the Norrington score, devised in the 1960s by Sir Arthur Norrington, which attaches a score to degree classifications and expresses the overall calculation for each college as a percentage.
⁴ Though aggregating across courses makes the random assignment of open applicants far less credible.
The results of this study may be of interest to a number of different audiences. First, it may
interest economists studying the educational production function. At a school level, economists have
struggled to identify a systematic relationship between school resources and academic performance.
This study informs us about the relationship between college resources and academic performance.
Second, this study can help prospective students deciding which college to apply to. An Oxford
college education is an experience good, with quality difficult to observe in advance and only really
ascertained upon consumption. Thus the application decisions of prospective students are likely to
be based on imperfect information. This paper shows attending a high quality college can boost
students’ exam results which is important given the substantial economic return to better university
exam performance. Better exam performance at UK universities is closely related to entering further
study (Smith et al., 2000), employment (Smith et al., 2000), industry choice (Feng and Graetz,
2015), short-run earnings (Feng and Graetz, 2015; Naylor et al., 2015) and lifecycle earnings (Walker
and Zhu, 2013). For example, Feng and Graetz (2015) study students from the London School of
Economics and find the causal wage payoff 12 months after graduating with a First compared with an
Upper Second is a 3% higher expected wage. The difference between an Upper Second and a Lower
Second is 7% higher wages. Thus there should be demand by applicants for third-party evaluations
of college quality just as there is demand for league tables of university quality (Chevalier and Jia,
2015). My college effectiveness estimates help to fill this gap in the market – they improve on the
unadjusted college rankings currently available to prospective students in the Norrington table.⁵ ⁶
Third, my analysis may be of interest to Oxford colleges themselves. Colleges need to measure past
effectiveness relative to other colleges for a number of reasons. It allows them to learn best practices
from, and share problems with, other colleges, evaluate their own practices, allocate resources more
efficiently and plan and set targets for the future. Yet currently colleges receive scant feedback on their
past performance in raising exam results and the information they do receive from the Norrington
table can be misleading or demoralising due to selection bias – Norrington table rank may be more
informative about who their students are than how they were taught. My estimates provide a better
picture of a college’s performance. Furthermore, my analysis suggests college effectiveness may be
increased by admitting a larger number of students per course; perhaps colleges should concentrate on
a narrower range of courses. Even small improvements in college effectiveness are important, because
they might be cumulative and because they refer to a large number of students.⁷
⁵ Of course, exam based rankings are only a starting point for application decisions and should complement other information about colleges’ quality (such as cost, location, accommodation and facilities) from publications, older siblings, friends at Oxford and personal visits to colleges.
⁶ More informed students may create dynamic effects as they would then be able to “vote with their feet” like consumers in a Tiebout model. On the one hand, this may drive up college quality by increasing competition between colleges. On the other hand, as pointed out by Lucas (1980), when criticising the Norrington table, it may increase inequality in raw exam results between colleges because lower ranked colleges would find it difficult to recruit high ability students. Increased competition may also discourage colleges from cooperating with each other.
1.1 Prior Literature
This is the first study of differences between Oxford colleges. However, my paper is related to various
literatures interested in measuring differences in effectiveness across teachers, schools and universities.
First, there is a large and active literature (much done by economists) on the value-added of
teachers in schools (Hanushek, 1971; Chetty et al., 2013a,b; Koedel et al., 2015) and universities
(Carrell and West, 2008; Waldinger, 2010; Illanes et al., 2012; Braga et al., 2014). Empirical evidence
shows students are not randomly assigned to teachers, even within schools or universities (e.g.
Rothstein (2009)). To account for non-random assignment, teacher value-added models use similar
methods to those in this paper – either “selection on observables” where observables include student
and family input measures and a lagged standardised test score or random assignment of students
to teachers (Nye et al., 2004; Carrell and West, 2008). The main conclusions of teacher value-added
studies also mirror my findings. Teachers, like colleges, vary in their effectiveness (Nye et al., 2004;
Ladd, 2008; Hanushek and Rivkin, 2010; Braga et al., 2014). Within schools, Nye et al. (2004)
review 18 early studies of teacher value-added. Using the same method I use (though I correct for
measurement error), they find a median standard deviation of teacher effectiveness of 0.34. Hanushek
and Rivkin (2010) review more recent studies and report estimates, adjusted for measurement error,
that range from 0.08 to 0.26 (average 0.11) using reading tests and 0.11 to 0.36 (average 0.15)
in maths. They conclude the literature leaves “little doubt that there are significant differences in
teacher effectiveness” (p. 269). Within universities, Braga et al. (2014) find a one standard deviation
increase in teacher quality leads to a 0.14 standard deviation increase in Economics test scores and a
0.22 standard deviation increase in Law and Management test scores. Overall, teacher effects appear
slightly larger than the college effects I find (0.09 - 0.15). However, there is no consistent relationship
between teacher effectiveness and observable teacher characteristics such as education, experience or
salary (Burgess, 2015).
⁷ Estimates of effectiveness similar to mine are often used for teacher and school accountability purposes. However, for reasons detailed in section 8, I do not believe my college effect estimates should be used to hold colleges to account.
Second, there is a literature on the value-added of schools (though only some by economists)
(Aitkin and Longford, 1986; Goldhaber and Brewer, 1997; Ladd and Walsh, 2002; Rubin et al., 2004;
Reardon and Raudenbush, 2009). Again similar empirical strategies are used, though non-economists
tend to use random effect models whereas economists favour fixed effect models. Although school
effectiveness is found to impact test scores, there is a consistent finding that schools, like colleges,
have less impact on test scores than teachers with most estimates in the range 0.05-0.20 (Nye et al.,
2004; Konstantopoulos, 2005; Deutsch, 2012; Deming, 2014).⁸ In one of the most credible studies,
Deutsch (2012) takes advantage of a school choice lottery to estimate a school effect size, adjusted
for measurement error, of 0.12. School effect sizes seem similar to college effect sizes. Thomas et al.
(1997), for example, find the standard deviation in total GCSE performance between schools is 0.10
when pooled across all subjects and is higher in individual subjects ranging from 0.13 in English to
0.28 in History. This closely mirrors my results in terms of the size of school (college) effects,
the variation across subjects (courses) and the fact there is less variation in effectiveness once subjects
(courses) are pooled together. Therefore the impact of colleges on exam results appears similar to
the impact of schools on GCSE results. This literature also finds school resources have only a weak
relationship with test scores, leaving much variation in school effectiveness unexplained (Hanushek,
2006; Burgess, 2015).
Third, a small number of studies have attempted to measure university effects on degree outcomes
(Bratti, 2002), student satisfaction (Cheng and Marsh, 2010), standardised test scores (Klein
et al., 2005) and earnings (Miller III, 2009; Cunha and Miller, 2014). In the attempt to account
for selection bias, “selection on observables” methods have been used exclusively. Results suggest
large unconditional differences in outcomes across universities with observable student covariates
accounting for a substantial portion, but not all, of these differences (Miller III, 2009; Cunha and
Miller, 2014). Observable university characteristics explained only a small proportion of variation in
university value-added (Bratti, 2002).
⁸ School effect sizes differ depending on the age of the students – they are highest in kindergarten, fall as students become older until bottoming out around GCSE age and rising again in the 6th form (e.g. Goldstein and Sammons (1997) and Fitz-Gibbon (1991)).
Beyond “value-added”, this paper is related to the research done by economists on the effect on
earnings from attending a higher “quality” university, where “quality” is usually defined in terms of
mean entry grade, expenditure per student, student/staff ratio and/or ranking in popular league
tables (Dale and Krueger, 1999; Black and Smith, 2004, 2006). Conceptually measuring the return
to institution quality is quite different to my analysis focusing on institution effectiveness. Whereas I
attempt to estimate quality directly, this literature takes quality as given and attempts to estimate the
labour market return to a higher quality. Nevertheless, the university quality literature is worth
considering because it has found interesting ways to tackle the non-random selection of students into
universities (better students sort into higher quality colleges). Studies tend to aggregate universities
into a small number of quality groups, thereby reducing the dimensionality of the selection problem.
This facilitates the use of selection on observables based on OLS (James et al., 1989; Black et al.,
2005), selection on observables based on matching (Black and Smith, 2004; Chevalier, 2014) and
methods to account for selection on unobservables including regression discontinuity (Saavedra, 2009;
Hoekstra, 2009), instrumental variables (Long, 2008) and applicant group fixed effects (Dale and
Krueger, 1999, 2014; Broecke, 2012).⁹ However, no study in this literature has had the opportunity
to exploit random assignment, as I am able to do.
The rest of the paper is organised as follows: Section 2 briefly explains the institutional background.
Section 3 lays out a theoretical model of Oxford admissions that defines college effects.
Section 4 explains the problem of selection bias and outlines econometric models that account for
“selection on observables” and “selection on observables and unobservables” respectively. Section 5
describes the data. Section 6 presents the results. Section 7 considers whether college characteristics
can explain differences in college effectiveness. Section 8 discusses limitations and section 9 concludes.
Proofs are collected in the appendix.
⁹ I considered, but ultimately rejected, using these methods to account for selection on unobservables. For instance, matching could be applied to Oxford colleges with only minimal complications, such as in Davison (2012), but would do nothing to help account for unobservables. Instrumental variables requires finding over 30 valid instruments, one for each college, which is a formidable challenge. Applicant group fixed effects work better in a university context than a college context because they face a multicollinearity problem when students apply to only one college (see discussion in Miller III (2009)). In addition, applicant group fixed effects make the strong assumption that students apply to colleges in a rational way. I did estimate regressions with applicant group fixed effects but the results were unconvincing and are not reported.
2 Institutional Background
The college model is one of the oldest forms of academic organisation in existence. It originated 700
years ago in the UK and was long confined to the universities of Oxford, Cambridge, and Durham.
Today however, college systems have spread worldwide. College systems now operate at several other
British universities including Bristol, Kent and Lancaster. In the US, Harvard, Yale and others have
established similar college systems. College systems are also common in Canada, Australia, and New
Zealand and are present in numerous other countries from Mexico to China (O’Hara, 2016).
Oxford University can be thought of as consisting of two parts – (1) a Central Administration
and (2) the 32 colleges.¹⁰ The Central Administration is composed of academic departments, research
centres, administrative departments, libraries and museums. The Central Administration (i)
determines the content of the courses within which college teaching takes place, (ii) organises lectures,
seminars and lab work, (iii) provides resources for teaching and learning such as libraries, laboratories,
museums and computing facilities, (iv) provides administrative services and centrally managed
student services such as counselling and careers and (v) sets and marks exams, and awards degrees.
The colleges are self-governing, financially independent and are related to the Central Administration
in a federal system not unlike the federal relationship between the 50 states of America and
the US Federal Government. The colleges (i) select and admit undergraduate students, (ii) provide
accommodation, meals, common rooms, libraries, sports and social facilities, and pastoral care for
their students and (iii) are responsible for tutorial teaching for undergraduates. Thus Oxford colleges
play a significant role in university life, making Oxford an ideal place to study college effects.
¹⁰ There are also five Permanent Private Halls at Oxford admitting undergraduates. They tend to be smaller than colleges, and offer fewer subjects but are otherwise similar. From now on I include them when I refer to “colleges”.
3 Theoretical Model
In this section I develop a theoretical model of college admissions. The model serves two main
purposes. First, it allows me to formally define the “effect” of attending an Oxford college. A failure
to clearly define the causal effect of interest has been a criticism of much of the school effect literature
(Rubin et al., 2004; Reardon and Raudenbush, 2009). Second, the model motivates the empirical
strategies I employ to identify college effects in section 4.
3.1 Defining College Effects
There are a total of N applicants to Oxford indexed i = 1, 2, ..., N and J colleges indexed j =
1, 2, . . . , J . For each student i there exist J potential exam results Y 1i , Y
2i , . . . .Y
Ji , where Y ji denotes
the exam result at some specified time (such as end of year 1) that would be realised by individual i
if he or she attended college j. Let each potential exam result depend on pre-admission ability Ai, a
1 x K row vector. Ai permits multiple sources of ability which may be observable or unobservable.
It should be interpreted broadly to include not only cognitive ability but also motivation. Potential
exam results also depend on college effects cij , which are allowed to vary across students, and a
possibly heteroskedastic random shock eij , uncorrelated with ability and representing measurement
error in exam results such as illness on the day of the exam and subjective marking of exams. The
potential exam result obtained by an individual i who attends college j is:
$$Y_i^j = Y_i^j(A_i, c_{ij}, e_{ij}). \qquad (1)$$
For student i the causal effect of attending college j as opposed to college k is the difference in
potential outcomes $Y_i^j - Y_i^k$. The main focus of this paper is on estimating the average causal effect
of college j relative to a reference college k for the subpopulation of n ≤ N students who actually enrol
at Oxford (denoted by the set E). This average causal effect of college j relative to college k is:
$$\beta_j = \bar{c}_j - \bar{c}_k = \frac{1}{n}\sum_{i \in E} c_{ij} - \frac{1}{n}\sum_{i \in E} c_{ik}. \qquad (2)$$
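As an illustration of equation (2), the following minimal sketch computes the average effects from a matrix of student-specific college effects. The numbers, the function name and the choice of reference college are hypothetical and serve only to show the arithmetic; in practice the cij are unobserved.

```python
import numpy as np

def average_college_effects(c, enrolled, reference):
    """Return beta_j = mean_i(c_ij) - mean_i(c_ik), averaging over enrolled students only."""
    c_enrolled = c[enrolled]            # keep the rows of students in the set E
    c_bar = c_enrolled.mean(axis=0)     # average effect of each college over E
    return c_bar - c_bar[reference]     # effects relative to the reference college k

# Hypothetical effects for N = 4 applicants (rows) and J = 3 colleges (columns).
c = np.array([
    [0.2, 0.0, -0.1],
    [0.4, 0.1,  0.0],
    [0.0, 0.3,  0.1],
    [0.9, 0.9,  0.9],   # applicant who does not enrol, so excluded from E
])
enrolled = np.array([True, True, True, False])
beta = average_college_effects(c, enrolled, reference=2)
```

The non-enrolled applicant does not affect the result, which reflects the restriction of the estimand to the set E.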
Focusing on the subpopulation of students who attend Oxford, rather than the full population
of applicants, makes sense because many applicants (perhaps due to weak prior achievement at
school) may have only a low chance of attending Oxford. The definition of college effects relies on two
assumptions.
Assumption 1. “Manipulability”: $Y_i^j$ exists for all i and j
Assumption 1 is the assumption of manipulable college assignment (Rosenbaum and Rubin, 1983;
Reardon and Raudenbush, 2009). It says each student has at least one potential outcome per college.
Intuitively to talk about the effect of college j one needs to be able to imagine student i attending
college j, without changing the student’s prior characteristics Ai. “Manipulability” would be violated,
for instance, if a college only accepted women implying the potential outcome of a male student at that
college may not exist. This assumption is relatively unproblematic at Oxford (certainly compared
to schools or universities). Oxford colleges are not generally segregated by student characteristics11
so it is not difficult to imagine Oxford applicants attending different colleges. Randomness in the
admissions process also makes it possible that all applicants have at least some chance, however
small, of being offered a place at an Oxford college.
Assumption 2. “No interference between units”: $Y_i^j$ is unique for all i and j
Assumption 2 says each student possesses a maximum of one potential exam result in each college,
regardless of the colleges attended by other students (Reardon and Raudenbush, 2009). The “no
interference between units” assumption of Cox (1958) is one part of the “Stable Unit Treatment Value
Assumption” (or SUTVA; Rubin, 1978). Strictly speaking, this means that a given student’s exam
result in a particular college does not depend on who his college peers are (or even how many of them
there are). Evidence of peer effects in education makes this assumption questionable (e.g. Feld and
Zölitz, 2015). Without it, however, we must treat each student as having $J^N$ potential outcomes,
one for each possible permutation of students across colleges. Thus adopting the no interference
assumption makes the problem of causal inference tractable (at the cost of some plausibility). The
consequences of violations of this assumption on the estimates of college effects are unclear, since
without it the causal effects of interest are not well-defined.
11 St Hilda’s, the last all-women’s college, started accepting men in 2008. An exception is colleges that accept only mature students such as Harris Manchester.
3.2 College Admissions
3.2.1 Applications and Applicant Ability
Responsibility for admissions is devolved at the college level, then again at the course level. To save
notation, let all applicants apply for the same course. College j is allocated (receives the application
profiles of) Dj direct applicants and Oj open applicants to consider for admission.
The direct applicants received by college j are the students who expressed a preference for college j
on their application forms - they applied directly to college j. In total there are $D_1 + D_2 + \ldots + D_J = D$
direct applicants to Oxford. Let the ability of direct applicants to each college be normally distributed
with the mean ability of direct applicants allowed to differ between colleges but with the variance
constrained to be the same for all colleges. In particular, let the ability of direct applicants to college
j be distributed $A_j^D \sim N(\mu_j^D, 1)$ where $A_j^D$ is the ability of a direct applicant to college j and $\mu_j^D$ is
the mean ability of direct applicants to college j.
Colleges also receive open applicants. In total there are $O_1 + O_2 + \ldots + O_J = O$ open applicants
to Oxford and their ability follows a standard normal distribution: $A^O \sim N(0, 1)$. Oxford admissions
procedures require that all open applicants are pooled together by the Undergraduate Admissions
Office. Open applicants are then randomly drawn out, one at a time, and are allocated to the college
with the lowest direct applicant to place ratio. This random assignment to colleges is the key to my
selection on unobservables identification procedure. I present evidence in section 5.5.1 that supports
random assignment. Since each college receives a random sample (of size $O_j$) of open applicants, the
ability of open applicants sent to college j, denoted $A_j^O$, is also distributed N(0, 1).
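The allocation of open applicants described above can be sketched as follows. This is a stylised illustration rather than the procedure actually used by the Undergraduate Admissions Office: the function name, the college labels and the counts are hypothetical, and the sketch assumes that the applicant to place ratio is updated after each open applicant is allocated.

```python
import random
from collections import Counter

def allocate_open_applicants(direct_counts, places, open_ids, seed=0):
    """Assign pooled open applicants, in random order, to the college with the
    lowest applicants-per-place ratio at the time of each draw."""
    rng = random.Random(seed)
    pool = list(open_ids)
    rng.shuffle(pool)                   # random draw order from the pool
    allocated = dict(direct_counts)     # running applicant count per college
    assignment = {}
    for applicant in pool:
        # Assumption: the ratio is recomputed after every allocation.
        college = min(allocated, key=lambda j: allocated[j] / places[j])
        assignment[applicant] = college
        allocated[college] += 1
    return assignment

# Hypothetical colleges with 30, 12 and 20 direct applicants for 6, 6 and 5 places.
direct_counts = {"A": 30, "B": 12, "C": 20}
places = {"A": 6, "B": 6, "C": 5}
assignment = allocate_open_applicants(direct_counts, places, open_ids=range(15))
open_per_college = Counter(assignment.values())
```

Which college receives a given open applicant depends only on the random draw order, not on the applicant's characteristics, which is the property the identification strategy relies on.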
3.2.2 Application Profiles
Admissions at Oxford colleges are conducted by faculty, who are also researchers and teachers, in the
subject a student applies for (referred to as “admissions tutors”). Applicant ability Ai and college
effects cij are not perfectly observable to admissions tutors. Instead colleges observe an applicant’s
application profile (“UCAS form”) which includes both “hard characteristics” such as GCSE results, A-
level results and the results of Oxford-specific admission tests and “soft” characteristics such as school
reference letters and evidence of enthusiasm in the personal statement.12 The application profile does
not include whether an applicant was a direct applicant or an open applicant. Application profiles
can be thought of as a noisy signal of the ability of each applicant. Denote the characteristics of
applicant i seen by admission tutors as a 1 x K row vector xi = Ai−ri where ri is a 1 x K row vector.
Each of the K elements in xi provides a signal about a component of ability Ai. For example, maths
GCSE result provides a signal of maths ability. Assume that each element of xi is an unbiased signal
for its equivalent element in Ai such that E(Ai|xi) = Ai. Also assume xi and cij are independent,
that is, application profile xi provides admissions tutors with no information about college effects cij.
(This assumption is relaxed in some of the empirical work.) Let X denote the support of x and let
Xj denote the support of the application profiles for students allocated to college j. Let ηj(x) be the
number of students allocated to college j with application profile x.
3.2.3 Enrolment Probabilities and Expected Exam Results
Let αj(x) denote the probability that a student with application profile x, upon being offered admission
at college j, eventually enrols. Let Yj(x) denote the expected exam result of an applicant with
application profile x who enrols at college j. This allows acceptance or rejection of an offer from
college j to provide extra information about the ability (and expected exam result) of an applicant.
Colleges need to condition on acceptance when making admissions decisions in order to make a
correct inference about the student’s ability because of an “acceptance curse”: the student might
accept college j’s offer of admission because she is of low ability and is rejected by other universities (either
UK or foreign).
3.2.4 The College Admissions Problem
Define an admission protocol for college j as a probability pj : Xj → [0, 1] such that an applicant
allocated to college j with application profile x is offered admission at college j with probability
pj(x). Each college has a capacity constraint, Kj (the maximum number of students college j can
12 Information on ethnicity and parental social class is also collected on the UCAS form but this information is not available to admissions tutors when they decide on admissions.
admit). College j thus chooses the set of pj(x) ∈ [0, 1] to maximise their objective function:
$$\max_{p_j(x)} \left\{ \sum_{x \in X_j} p_j(x)\,\alpha_j(x)\,\eta_j(x)\,Y_j(x) \right\} \qquad (3)$$
subject to their capacity constraint:
$$\sum_{x \in X_j} p_j(x)\,\alpha_j(x)\,\eta_j(x) \le K_j. \qquad (4)$$
This is almost identical to the university admissions decision problem studied by Bhattacharya
et al. (2014) (see also Fu (2014)). The college objective is to maximise total expected exam results
among the admitted applicants. It implicitly assumes “Fair Admissions” (Bhattacharya et al., 2014),
in the sense that it gives equal weight to the exam results of all applicants, regardless of pre-admission
characteristics. This assumption is plausible at Oxford because Oxford emphasises that applicants
are admitted strictly based on academic potential. Extra-curricular activities, such as sport and
charity work are given no weight unless they are related to academic potential. “Fair Admissions”
is consistent with the “Common Framework” which guides undergraduate admissions at Oxford:
“Admissions procedures in all subjects and in all colleges should [. . . ] ensure applicants are selected
for admission on the basis that they are well qualified and have the most potential to excel in their
chosen course of study” (Lankester et al., 2005).
The solution to college j’s admissions problem takes the form described below in Proposition 1,
which holds under Condition 1: admitting everyone with an expected exam result Yj(x) ≥ 0 will
exceed capacity in expectation (Bhattacharya et al., 2014).
Condition 1. αj(x) > 0 for any x ∈ Xj and for some δ > 0 we have
$$\sum_{x \in X_j} \alpha_j(x)\,\eta_j(x)\,1\{Y_j(x) \ge 0\} \ge K_j + \delta.$$
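Proposition 1. Under Condition 1 the solution to college j’s admissions problem is:
$$p_j^{OPT}(x) = \begin{cases} 1 & \text{if } Y_j(x) \ge z_j \\ 0 & \text{if } Y_j(x) < z_j \end{cases}$$
where
$$z_j = \min\left\{ r : \sum_{x \in X_j} \alpha_j(x)\,\eta_j(x)\,1\{Y_j(x) \ge r\} \le K_j \right\}.$$
Proof in Appendix.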
The model shows that college j uses a cut-off rule (admission threshold). The result is intuitive.
Colleges first rank applicants by their expected exam results (conditional on acceptance). Colleges
then admit applicants whose expected exam results are the largest, followed by those for whom it is
the next largest and so on till all places are filled. An admissions policy for the ranked groups {pj(x)}
takes the form {1, . . . , 1, 0, . . . , 0}. Since ability is continuously distributed and x is an unbiased signal,
x is also continuously distributed. Hence there are no point masses in the distribution of Yj(x) and
there is no need to account for ties.
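The cut-off rule can be sketched numerically. The example below is a minimal illustration under stated assumptions: the profiles, enrolment probabilities and capacity are hypothetical, expected exam results are assumed distinct (no ties), and the search for the threshold is restricted to the observed values of Yj(x).

```python
import numpy as np

def admission_cutoff(y, alpha, eta, capacity):
    """Smallest observed Y_j(x) that can serve as the threshold z_j without
    expected enrolment exceeding the capacity K_j."""
    y, alpha, eta = (np.asarray(v, dtype=float) for v in (y, alpha, eta))
    order = np.argsort(-y)                       # rank profiles, best first
    expected = np.cumsum((alpha * eta)[order])   # expected enrolment when admitting down to each rank
    feasible = np.nonzero(expected <= capacity)[0]
    if feasible.size == 0:
        return np.inf                            # even the top profile would exceed capacity
    return y[order][feasible.max()]

# Hypothetical profiles: expected result, enrolment probability, number of applicants.
y = np.array([3.0, 1.0, 2.0, 0.5])
alpha = np.array([0.5, 1.0, 0.8, 1.0])
eta = np.array([2, 1, 5, 3])
z = admission_cutoff(y, alpha, eta, capacity=5)
offers = y >= z                                  # p_j(x) = 1 above the cut-off, 0 below
```

Profiles with a low enrolment probability take up little expected capacity, so they move the cut-off without otherwise changing the form of the rule.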
As noted by Bhattacharya et al. (2014), the probability of a student enrolling having received an
offer from college j affects the admission rule only through its impact on the cut-off; the intuition is
that individuals who do not accept an offer of admission do not take up any capacity and this is taken
into account in the admission process. Also note that the assumptions imply, perhaps unrealistically,
no role for risk in admissions decisions.
The Fair Admissions assumption implies that student characteristics influence the admission process
only through their effect on expected exam results. The same cut-off zj is used for open and direct
applicants - there is no discrimination against open/direct applicants (or any demographic group).
Discrimination would occur if colleges had a higher cut-off for open applicants than direct applicants
as this would imply that a direct applicant with the same expected exam result as an open applicant
is more likely to be admitted. Equal cut-offs for open and direct applicants are plausible because,
as noted above, colleges are not provided with any information about whether an applicant applied
directly or was an open applicant.
The solution is illustrated in Figure 1 for the case where applicant ability is fully observed by
admissions tutors: xi = Ai (ri = 0 for all i).13
13 This model is a highly stylised model of admissions. For simplicity, it ignores a number of features of the admissions process. Oxford admissions actually involve multiple stages. In the first stage colleges choose which applicants to “short-list” and “deselect” and which applicants to “reserve”. Deselected applicants are rejected. Short-listed and reserved applicants are given interviews at the college they were allocated. Shortlisted but unreserved applicants may be reallocated to another college for interview. After first interviews colleges make some admissions decisions about which applicants to accept. However, a small number of applicants are given second interviews. Second interviews provide applicants not selected by their first college the chance to be accepted by another college (known as “pooling”). It should also be noted that application procedures vary slightly between courses. Capturing all these points would involve a more complex dynamic game played between colleges. Nevertheless, my empirical work relies only on the
Figure 1: Applicant Ability and College Admissions Decisions
[Figure: densities of direct applicant ability $A_j^D \sim N(\mu_j^D, 1)$ and open applicant ability $A_j^O \sim N(0, 1)$ plotted against Ability (horizontal axis from −4 to 4), with the cut-off $z_j$ and the mean $\mu_j^D$ marked and the admitted shares $p_j^D$ and $p_j^O$ shaded.]
Figure 1 shows how colleges would make admissions decisions if ability was fully observable (i.e. $A_i = x_i$). Direct applicant ability to college j is distributed $A_j^D \sim N(\mu_j^D, 1)$. The graph is drawn such that $\mu_j^D = 0.5$. Open applicant ability to college j is distributed $A_j^O \sim N(0, 1)$. $z_j$ is the cut-off (admissions threshold). All students with ability above the cut-off (the shaded area) are admitted. The distribution of ability for successful open applicants to college j follows a truncated normal distribution and similarly for successful direct applicants. A proportion $p_j^D$ of direct applicants and a proportion $p_j^O$ of open applicants are accepted.
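Under the full-observability case drawn in Figure 1, the accepted proportions follow directly from the normal distribution. The sketch below assumes a hypothetical cut-off of 1 on the ability scale (the figure does not report a numerical value) together with the mean of 0.5 for direct applicants used in the figure.

```python
from math import erfc, sqrt

def share_above_cutoff(z, mean=0.0):
    """P(A > z) for A ~ N(mean, 1): the share of applicants above the cut-off."""
    return 0.5 * erfc((z - mean) / sqrt(2.0))

z_j = 1.0                                      # hypothetical cut-off, not taken from the thesis
p_direct = share_above_cutoff(z_j, mean=0.5)   # direct applicants, mean ability 0.5
p_open = share_above_cutoff(z_j)               # open applicants, mean ability 0
```

With a common cut-off, the group with the higher mean ability has the larger accepted share, even though no applicant is treated differently on the basis of application route.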
With this admissions model in mind, the goal is to estimate the college effects cij . I consider three
different empirical models. First, as a simple baseline, I consider differences in mean exam results
between colleges in the spirit of the Norrington table. Second, I use a “selection on observables”
strategy that attempts to estimate college effects by conditioning on almost all the information
available to admissions tutors in the student’s application profile. Third, I take advantage of the
random assignment of open applicants and estimate the thresholds zj for each college. I then use
these threshold estimates together with the assumptions of the theoretical model to obtain estimates
of college effects. The next section explains these strategies in detail.
result that colleges use a cut-off rule and that the cut-off is equal across all applicants. This result would continue to hold if, for example, (i) no new information about applicant ability was revealed at interview, (ii) colleges could correctly predict the admissions decisions of other colleges and (iii) the reallocation of rejected applicants was known in advance by the colleges.
4 Econometric Models
The econometric models in this section must acknowledge some objects in the theory model are un-
observable. First, exam results for applicants who do not attend Oxford are not observed. Second,
even for the applicants who enrol at Oxford, at most one potential exam result per student is ob-
servable (the potential exam result from the college they actually attend). This is the “fundamental
problem of causal inference” (Holland, 1986). With a slight abuse of notation I denote observed exam
results of student i at college j as Yij for i = 1, ..., n. Third, not all the information in an applicant’s
application profile is observable. Decompose the information in application profiles into two parts:
x = (x1, x2) where x1 and x2 are 1 x G and 1 x (K - G) row vectors (with K > G and remembering
x is 1 x K). “Hard” information x1 is assumed observable to admissions tutors and researchers. “Soft”
information x2 is assumed observable to admissions tutors but not researchers.
The aim is to identify college effects given the available data. All three empirical strategies take
the potential exam results function (1) specified in section 3 and assume observed exam results take
the linear form:
Yij = λ0 + λ1Ai + cij + eij (5)
where λ0 and λ1 are K x 1 column vectors that map ability onto potential exam results and all
elements of λ1 are strictly positive. I can now decompose Ai into x1i, x2i and ri and rewrite (5) as:
Yij = λ0 + λ11x1i + λ12x2i + cij + λ1ri + eij (6)
where λ11 is a G x 1 column vector of the first G elements of λ1 and λ12 is a (K - G) x 1 column vector
of the last K - G elements of λ1. Student ability unobserved even by admissions tutors is captured
by ri.
4.1 Model 1 – Norrington Table
The first empirical strategy is to estimate college effects using a student-level fixed effects regression
with no control variables for observable or unobservable ability. That is, Model 1 estimates for
enrolled students:
$$Y_{ij} = \lambda_0 + \sum_{j=1}^{J-1} \beta_j C_j + v_{ij} \quad \forall\, i = 1, \ldots, n \qquad (7)$$
where $v_{ij} = \sum_{j=1}^{J-1}(\beta_{ij} - \beta_j)C_j + \lambda_1 A_i + e_{ij}$, $C_j$ is a dummy variable denoting enrolment at college
j, $\beta_{ij}$ is a college fixed effect coefficient which may differ across i and $\beta_j = \frac{1}{n}\sum_{i=1}^{n}\beta_{ij}$ is the average
over students of the college fixed effects. College J is the reference college. Model 1 can be estimated
by regressing exam results on a set of college dummy variables. The fixed effect coefficients βj are
the objects of interest; they give mean differences in exam results relative to the baseline college.
Model 1 is thus similar in spirit to the Norrington table.14
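The equivalence between the dummy-variable regression in (7) and differences in college means can be checked on a toy example. The scores and college labels below are hypothetical, and the helper function is an illustrative sketch rather than the estimation code used for the results in this thesis.

```python
import numpy as np

def college_fixed_effects(y, college, n_colleges):
    """OLS of exam results on an intercept and J-1 college dummies.
    The last college (index J-1) is the omitted reference category."""
    y = np.asarray(y, dtype=float)
    college = np.asarray(college)
    dummies = (college[:, None] == np.arange(n_colleges - 1)).astype(float)
    X = np.column_stack([np.ones(len(y)), dummies])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1:]            # (lambda_0, beta_1, ..., beta_{J-1})

# Hypothetical exam results for six students in three colleges.
y = [62.0, 66.0, 70.0, 60.0, 60.0, 65.0]
college = [0, 0, 1, 1, 2, 2]
intercept, beta = college_fixed_effects(y, college, n_colleges=3)
```

The intercept equals the mean result in the reference college, and each coefficient equals the gap between a college's mean and the reference college's mean.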
The most important problem with Model 1 (and the Norrington table) is selection bias. Selection
bias prevents us from interpreting the fixed effect coefficient estimates as causal effects. Randomised
experiments are the gold standard for estimating causal effects and imagining a hypothetical random-
ised experiment helps to conceptualise the selection bias problem. Consider a two stage admissions
process. In stage 1 it is decided which students will attend Oxford. In stage 2 admitted students are
randomly assigned to colleges. In this ideal scenario, college assignment is independent of student
ability among the population of enrolled students, so the simple mean difference in observed exam
results gives an unbiased estimate of differences in college effects for students attending Oxford.
Unfortunately for researchers selection into colleges is non-random in ways that are correlated
with exam results. Students and admission tutors deliberately and systematically select who enrols.
At the application stage, students choose where to apply to. At the admissions stage, admission
tutors take decisions to accept some students and not others. There could also be selection at
the enrolment stage (in practice, very few students reject offers from Oxford colleges). The selection
bias problem makes it difficult to attribute student exam results to the effect of the college attended
separately from the effect of preexisting student ability.
Formally, since we have assumed λ11 ≠ 0 and λ12 ≠ 0, selection bias occurs if:
$$E\left[\sum_{j=1}^{J-1}(\beta_{ij} - \beta_j)C_j + \lambda_1 A_i + e_{ij} \,\middle|\, c_{ij}\right] \neq 0.$$
14 Model 1 does differ from the Norrington table in some ways. For instance, the Norrington table does not take into account differences across courses (getting a First in E&M may be easier or more difficult than getting a First in Law). As I explain in section 5 below, I standardise exam results by course and year which mitigates this problem.
Model 1 embodies two types of non-random selection into colleges. First, selection on the het-
erogeneous college effect βij . This occurs if individuals differ in their potential exam results, holding
ability Ai constant, and if they choose a college (or colleges choose them) in part on that basis.15
Selection on heterogeneous college effects captures the intuition that students and colleges are looking
for a good “match”. The economics of the problem suggest students will tend to apply to colleges that
are relatively good at boosting their exam results - a form of selection bias that bears similarities to
Roy’s model of occupational choice (Roy, 1951). Similarly colleges will tend to make offers to students
who tend to benefit more than average from the college’s teaching. Students enrolled at college j
may thus have higher expected exam results from attending college j than the average student. This
biases college fixed effect coefficients and it would not be appropriate to interpret such estimates
as causal effects for the average student enrolled at Oxford (though college effect estimates biased in
this way may still be of interest).
Second, selection on ability Ai. Determinants of exam results may be correlated with college
enrolment even if college effects are constant across students (βij = βj for all i). This occurs if
individuals choose colleges or colleges choose students in ways correlated with prior ability. Rational
applicants will choose to apply to the college that maximises their expected utility. Expected utility
is likely to depend on a number of factors including the perceived probability of receiving an offer
from each college, risk aversion, the value of their outside option if they did not attend Oxford
and preferences over college characteristics (including college effectiveness and other characteristics
contributing towards consumption benefits). Observable and unobservable ability are likely to impact
the college a student applies to. Furthermore college admissions decisions are based on student ability.
Positive selection seems likely, though not inevitable, with students of higher ability tending to go
to more effective colleges. In the presence of such selection, estimates of the college fixed effect
coefficients will be biased in favour of colleges with higher ability students.
15This assumes students and tutors have an idea of their own student/college-specific coefficient.
Selection bias causes three problems. First, as discussed, college effectiveness estimates are biased.
Second, the importance of variation in college effectiveness in determining exam results could be
exaggerated. The total effect of colleges on student exam results could be overstated because some of
the omitted ability will be included in the portion of the variance in student exam results explained by
college effects.16 Third, bias would lead to errors in supplementary analyses that aim to identify the
characteristics of effective colleges. Selection bias implies Model 1 is best used as a basis for comparison
with other models that control for observables and unobservables.
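A small simulation illustrates how selection on ability biases Model 1. Everything in the sketch is assumed for illustration: there are two colleges, a true college effect of 0.1, a single ability measure, and positive selection in which abler students are more likely to enrol at college 1. The second regression treats ability as fully observed, which is exactly what Model 1 lacks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
ability = rng.normal(size=n)
# Assumed positive selection: abler students are more likely to enrol at college 1.
college = (ability + rng.normal(size=n) > 0).astype(float)
true_effect = 0.1
y = ability + true_effect * college + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(n)
naive = ols(np.column_stack([ones, college]), y)[1]                # college dummy only
controlled = ols(np.column_stack([ones, college, ability]), y)[1]  # ability held fixed
```

In this setup the coefficient from the dummy-only regression absorbs the ability gap between the two colleges and is far larger than the true effect, while conditioning on ability brings the estimate back close to 0.1.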
4.2 Model 2 – Selection on Observables
The second empirical strategy is to estimate college effects using a conditional OLS regression. Model
2 estimates for enrolled students:
$$Y_{ij} = \lambda_0 + \lambda_{11}x_{1i} + \sum_{j=1}^{J-1} \beta_j C_j + v_{ij} \quad \forall\, i = 1, \ldots, n \qquad (8)$$
where now $v_{ij} = \sum_{j=1}^{J-1}(\beta_{ij} - \beta_j)C_j + \lambda_{12}x_{2i} + \lambda_1 r_i + e_{ij}$. The difference between Model 1 and Model 2
is that now observable parts of application profiles x1i are included in the regression. The objects of
interest are the college fixed effect coefficients: βj . In an ideal scenario, we could interpret estimated
coefficients as estimates of the average causal effect relative to the reference college for students
attending Oxford. However, such a causal interpretation requires three further assumptions. I start
with two that are relatively unproblematic.
Assumption 3. “Interval scale metric”. The metric of Yij is interval scaled.
Assumption 3 says that the units of the exam result distribution are on an interval scale (Ballou,
2009; Reardon and Raudenbush, 2009). Interval scales are numeric scales in which we know not only
the order, but also the exact differences between the values. Here the assumption says equal sized
gains at all points on the exam result scale are valued equally. A college that produces two students
with scores of 65 is considered equally as effective as a college producing one with a 50 and another
16 The effect of the bias on variation in college quality would depend on the direction of the bias. The text here presumes the likely scenario with positive selection bias – i.e., where more effective colleges are assigned students with higher expected exam results.
with 80. In comparing mean values of exam results, I implicitly treat exam results as interval-scaled
(the mean has no meaning in a non-interval-scaled metric). If exam results are not interval scaled
then the college effect results will depend on arbitrary scaling decisions.17 However, it is unclear
how to determine whether exam results are interval scaled because there is often no clear reference
metric for cognitive skill (Reardon and Raudenbush, 2009). At a practical level, the importance of
this assumption comes down to the sensitivity of college effects estimates and college rankings to
different transformations of exam results. Prior evidence on this point is reassuring: Papay (2011)
finds test scaling affects teacher rankings only minimally with correlations between teacher effects
using raw and scaled scores exceeding 0.98.18 I proceed as if exam results are interval scaled and in
section 6 test the robustness of my results to various monotonic transformations of the exam results
distribution.
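The dependence of college comparisons on the scale can be seen with the two colleges from the example above. The sketch below uses the scores of 50, 80 and 65 mentioned in the text; the log transformation is chosen purely for illustration as one possible monotonic rescaling.

```python
import numpy as np

scores = {
    "College A": np.array([50.0, 80.0]),   # one student with 50 and another with 80
    "College B": np.array([65.0, 65.0]),   # two students with 65
}
raw_means = {name: s.mean() for name, s in scores.items()}
log_means = {name: np.log(s).mean() for name, s in scores.items()}
```

On the raw scale the two colleges have the same mean of 65, whereas on the log scale College B has the higher mean, so a monotonic transformation alone is enough to break the tie.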
Assumption 4. “Common Support or Functional form”. Either (i) there is adequate observed data in
each college to estimate the distribution of potential exam results for students of all types (“Common
Support”) or (ii) the functional form of Model 2 correctly specifies potential exam results even for
types of students who are not present in a given college (“Functional Form”).
Either “Common Support” or “Functional Form” must hold for college effects to be identified.
The common support assumption is violated if not all colleges contain students with any given set
of characteristics. For instance, if not all colleges have students at all ability levels (or not sufficient
numbers at all levels to provide precise estimates of mean exam results at each ability level), then
the common support assumption will fail. In this case we have identification via functional form -
the model extrapolates from regions with data into regions without data by relying on the estimated
parameters of the specified functional form. If the functional form is also wrong, then regression
estimators will be sensitive to differences in the ability distributions for different colleges. However,
if the distributions of ability are similar across colleges the precise functional form used will not
matter much for estimation (Imbens, 2004). The common support assumption has been questioned
17 This assumption could be relaxed by adopting a non-parametric approach (and comparing, for example, quantiles rather than means) but this would require a very large sample size for accurate estimation.
18 If two colleges have similar students initially, but one produces students with better exam results, it will have a higher measured college effect regardless of the scale chosen. Similarly, if they produce the same exam results, but one began with weaker students, the ranking of the colleges will not depend on the scale.
for schools because student covariates differ significantly across schools. However, the distribution of
ability is likely to be much more similar across Oxford colleges, partly because of student reallocation
across colleges during the admission process.
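One simple way to inspect common support is to count students in each college and ability band. The sketch below is an illustrative check under assumed inputs: a single scalar ability measure, hypothetical bin edges, and a minimum cell count of one. It is not the diagnostic used in the empirical sections.

```python
import numpy as np

def common_support_table(ability, college, n_colleges, bin_edges):
    """Count students in each (college, ability bin) cell."""
    ability = np.asarray(ability, dtype=float)
    college = np.asarray(college)
    bins = np.digitize(ability, bin_edges)        # bin index from 0 to len(bin_edges)
    counts = np.zeros((n_colleges, len(bin_edges) + 1), dtype=int)
    np.add.at(counts, (college, bins), 1)         # accumulate one count per student
    return counts

def has_common_support(counts, min_count=1):
    """True if every college has at least min_count students in every bin."""
    return bool((counts >= min_count).all())

college = [0, 0, 0, 1, 1, 1]
edges = [0.0, 1.0]                                # hypothetical low / middle / high bands
full = common_support_table([-1.2, 0.3, 1.5, -0.4, 0.8, 1.1], college, 2, edges)
gap = common_support_table([-1.2, 0.3, 1.5, 0.2, 0.8, 1.1], college, 2, edges)
```

In the second example college 1 has no student in the lowest band, so any comparison at that ability level would rest on the functional form rather than on data.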
We now come to the most significant problem in estimating college effects: how to deal with
selection bias. I make the following two-part “selection on observables” assumption, which allows
consistent estimation of college effects:
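Assumption 5. “Selection on Observables”
(i) $E\left[\sum_{j=1}^{J-1}(\beta_{ij} - \beta_j)C_j \mid C_j, x_{1i}\right] = 0 \quad \forall\, i = 1, \ldots, n$
(ii) $E\left[\lambda_{12}x_{2i} + \lambda_1 r_i + e_{ij} \mid C_j, x_{1i}\right] = 0 \quad \forall\, i = 1, \ldots, n$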
The selection on observables assumption follows Barnow et al. (1981), who observed in a regression
setting that unbiasedness is attainable only when the variables driving selection are
known, quantified and included in x1.19 Together parts (i) and (ii) imply that potential exam results
are independent of college assignment, given x1.
Part (i) requires the heterogeneous part of college effects to be mean independent of college
enrolment conditional on x1i and Cj . This assumption is similar to, but slightly weaker than, college
effects being the same for every student. It implies there is no interaction of college effects with
student characteristics in x1i. As noted above, if individuals differ in their college effects, and they
know this, they ought to act on it, even conditional on ability. Thus this assumption relies on
students and tutors being unaware of college effects.20 In the empirical work, I test this assumption
by allowing the college effect coefficients to vary with some elements of x1i.
Part (ii) says the observable control variables x1i are sufficiently rich that the remaining variation
in college enrolment that serves to identify college effects is uncorrelated with the error term in
equation (8). This requires two things. First, the observable control variables in x1i must capture,
either directly or as proxies, all the factors that affect both the college enrolment and exam results.
Second, there must exist variables not included in the model that vary college enrolment in ways
unrelated to the unobserved component of exam results (i.e. instrumental variables must exist, even
19 Non-parametric versions of this assumption are variously known as “conditional independence assumption” (Lechner, 2001) and “unconfoundedness” (Rosenbaum and Rubin, 1983). These are also closely related to “strongly ignorable assignment” (Rosenbaum and Rubin, 1983).
20If college effects were obvious to everyone then there would be no need for this thesis!
though we do not observe them, as they produce the conditional variation in college enrolment used
implicitly in the estimation). Intuitively, the aim is to compare two otherwise identical students but
who went to different colleges for a reason completely unrelated to their exam results. Practically, I
would like to measure and condition on any characteristic whose influence on exam results might be
confounded with that of college enrolment due to non-random sorting into different colleges.
I am aware that the selection on observables assumption is somewhat heroic. Unobservable ability
could cause it to be violated. For instance, students with very high unobservable ability x2i (including
excellent school references and personal statements) may be close to certain of receiving an offer from
whichever college they apply to and thus may tend to apply to colleges with larger college effects.
Alternatively more “academically motivated” students may be more likely to apply to colleges
that improve exam results than to colleges that provide large consumption benefits. If students do select
into colleges based on unobservable ability correlated with exam results conditional on observed
characteristics then selection bias results.
Nevertheless, the selection on observables assumption can be justified in a number of ways. First,
the extensive dataset allows me to condition on almost all information available to college admission
tutors when they are selecting students as well as some information not seen by admissions tutors.
Furthermore, there is evidence that the information available to admissions tutors but unavailable
to researchers, the personal statement and school reference, are relatively unimportant in admission
decisions. In the personal statement, students describe the ambitions, skills and experience that
make them suitable for the course (e.g. previous work experience, books students have read and
essay competitions they have entered). However, Oxford admissions are strictly academic so this
only impacts admissions decisions if it is linked to academic potential. The absence of the school
reference is also perhaps of limited significance because, as noted by Bhattacharya et al. (2014),
school references tend to be somewhat generic and within-school ranks are typically unavailable
to admission tutors. This is supported by survey evidence. Bhattacharya et al. (2014) conduct an
anonymised online survey of PPE admissions tutors in Oxford asking how much weight they attach during
admissions to covariates with "1" representing no weight and "5" denoting maximum weight. The
results, based on 52 responses, found that the personal statement and school reference were given
22
the lowest weights.21
Second, two students with the same values for observed characteristics may go to different colleges
without invalidating the selection on observables assumption if the difference in their colleges is driven
by differences in unobserved characteristics that are themselves unrelated to exam results. There are
plenty of potential sources of exogenous variation in college allocations conditional on observables.
For instance, students might care about factors other than the ability of colleges to boost exam
results. Observation indicates that many applicants explicitly choose among colleges, at least at
the margin, for reasons unlikely to be strongly related to exam results. Application decisions may
reflect preferences over college location, architecture, accommodation, facilities and size. These
preferences may not be strongly linked to ability to perform well in exams. Indeed selection based on
preferences over college characteristics is actively encouraged by the University - the Oxford website
recommends students choose colleges based on these non-academic considerations. Alternatively
applicants might be incapable of discerning the size of college effects. While this would not normally
be a comforting thought, it aids the selection on observables assumption. Evidence from university
admissions supports this point. Scott-Clayton (2012) reviews the literature on university admissions
and concludes applicants and parents often know very little about the likely costs and benefits of
university. For instance, small behavioural economics tricks such as whether or not a scholarship has
a formal name and a tiny change in the cost of sending standardised test scores to universities have
been shown to have non-trivial effects on university applications inconsistent with rational choice
(Avery and Hoxby, 2004; Pallais, 2013). The school choice literature also provides evidence that
students and parents do not select schools according to expectations about future test scores - the
typical voucher program does nothing to improve test scores (Epple et al., 2015). Such exogenous
variation is perhaps even more likely in the context of Oxford colleges because Oxford deemphasises
the importance of college choice, stressing that all colleges are similar academically and that the primary
factor when choosing a college should be consumption benefits, not exam results.
21 A-levels appeared to be the most important criterion, followed by the admissions tests and interview scores and then GCSE performance. The choice of subjects at A-level was given a medium weight.
A few final points about Model 2 should be noted. First, since I have multiple cohorts
of students, I pool students across cohorts for each college. Evaluating colleges over multiple years
reduces the selection bias problem (Koedel and Betts, 2011), increases students per college thus
reducing average standard errors (McCaffrey et al., 2009) and increases the predictive value of past
college effects over future college effects (Goldhaber and Hansen, 2013). In pooling across cohorts,
I assume that college effects are fixed over time and thus place equal weight on exam results in all
years.22
Second, I allow for heteroskedastic measurement error in exam results by estimating heteroskedasticity-robust
standard errors. Exam results measure latent achievement with error because of (i)
the limited number of questions on exams, (ii) the imperfect information provided by each question,
(iii) maximum and minimum marks, (iv) subjective marking of exams and (v) individual issues such
as exam anxiety or on-the-day illness (Boyd et al., 2013). Numerous studies find test score measurement
error is larger at the extremes of the distribution (Koedel et al., 2012). The intuition is that
exams are well designed to assess student learning for “targeted” students (near the centre of the
distribution), but not for students whose level of knowledge is not well-aligned with the content
of the exam (in the tails of the distribution). Ignoring heteroskedastic measurement error in the
dependent variable would lead to biased inference. In addition, ignoring measurement error in the
control variables would bias college effect estimates. However, I control for multiple prior test scores
(A-levels, GCSEs, IB, multiple admissions tests and interview scores) which has been shown to help
mitigate the problem (Lockwood and McCaffrey, 2014).
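To illustrate what heteroskedasticity-robust standard errors do, the sketch below compares the classical and the White (HC0) standard error of the slope in a one-regressor OLS regression. The data and the function name `slope_with_standard_errors` are hypothetical, and the single-regressor setting is a simplifying assumption: Model 2 has many regressors, and the exact robust variant used for it is not stated in this section.

```python
from math import sqrt
from statistics import mean

def slope_with_standard_errors(x, y):
    """OLS slope of y on x (with intercept), plus classical and HC0 robust SEs."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE: assumes one common error variance for every observation.
    classical = sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0 SE: each observation contributes its own squared residual.
    robust = sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, classical, robust

# Hypothetical data: the noise is larger at both extremes of x,
# echoing the pattern described above for test scores.
x = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [1.5, -1.5, 0.1, -0.1, 0.1, -0.1, 1.5, -1.5]
y = [2 * xi + e for xi, e in zip(x, noise)]

slope, classical_se, robust_se = slope_with_standard_errors(x, y)
print(f"slope={slope:.3f} classical={classical_se:.3f} robust={robust_se:.3f}")
```

In this made-up sample the robust standard error is the larger of the two, because the noisiest observations are also the ones furthest from the mean of x.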
Third, I treat college effects as fixed effects rather than random effects. Whilst random effects
models are more efficient than fixed effects models, economists have conventionally avoided random
effect approaches (Clarke et al., 2010). This is because their use comes at the cost of an important
additional assumption - that college effectiveness is uncorrelated with the student characteristics
that predict exam results. This “random effects assumption” would fail, for example, if more effective
colleges attracted high ability students measured by prior test scores. Random effect estimators
would be inconsistent for fixed college sizes as the number of colleges grows.23 By contrast, fixed
effect estimators will still be consistent for fixed college sizes as the number of colleges grows. Guarino
et al. (2015) find that under non-random assignment, random effect estimates can suffer from severe
bias and underestimate the magnitudes of college effects. They conclude fixed effect estimators should
be preferred in this situation and I follow their advice and specify college effects as fixed effects. In
section 6, I perform Hausman tests (robust to heteroskedasticity) and the results broadly support
this choice.
22 As the number of cohorts grows, “drift” in college performance may put downward pressure on the predictive power of older college effect estimates. Thus if predicting future college effects is the main aim (relevant for prospective applicants to Oxford) then it may be best to down-weight older data (Chetty et al., 2013a). However, my main aim is to gauge the importance of college effectiveness and thus I do not account for drift.
23 The bias (technically, the inconsistency) disappears as the number of students per college increases, because the random effect estimates converge to fixed effect estimates. However, the bias can still be important in finite samples.
Fourth, I do not apply shrinkage to my college effect estimates. Estimates can be noisy when
there are only a small number of students per college. This means colleges with very few students
could be more likely to end up in the extremes of the distribution (Kane and Staiger, 2002). Shrinkage
is often used as a way to make imprecise estimates more reliable by shrinking them toward the
average estimated college effect in the sample (a Bayesian prior). As the degree of shrinkage depends
on the number of students per college, estimates for colleges with fewer students are more affected,
potentially helping with the misclassification of these colleges. The cost of shrinkage is that the
weight on the prior introduces a bias in estimates of college effects. Shrinkage can be applied to
both random and fixed effects models (so shrinkage is not a reason to favour random effect models
as is sometimes suggested). Despite the promise of shrinkage, two studies use simulations to show
shrinkage does not itself substantially boost performance (Guarino et al., 2015; Herrmann et al.,
2013). Fixed effect models without shrinkage tend to perform well in simulations and should be the
preferred estimator when there is a possibility of non-random assignment.
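For readers unfamiliar with shrinkage, the sketch below shows the mechanics that the thesis chooses not to use. It is an illustration only: the function name `shrink`, the raw effects, the student counts and both variance components are hypothetical, and the variances are treated as known although in practice they would have to be estimated.

```python
from statistics import mean

def shrink(estimates, n_students, var_between, var_noise):
    """Pull each raw college effect toward the average estimated effect.

    The weight on a college's own estimate is
        var_between / (var_between + var_noise / n_j),
    so colleges with fewer students (small n_j) are shrunk more.
    """
    grand_mean = mean(estimates)
    shrunk = []
    for estimate, n in zip(estimates, n_students):
        weight = var_between / (var_between + var_noise / n)
        shrunk.append(grand_mean + weight * (estimate - grand_mean))
    return shrunk

# Hypothetical raw college effects and numbers of students per college.
raw_effects = [0.30, -0.30, 0.05]
students = [5, 50, 50]
shrunk_effects = shrink(raw_effects, students, var_between=0.01, var_noise=1.0)

for raw, n, s in zip(raw_effects, students, shrunk_effects):
    print(f"n={n:3d} raw={raw:+.3f} shrunk={s:+.3f}")
```

The college with only 5 students keeps the smallest share of its raw deviation from the mean, which is how shrinkage trades extra bias for less noise.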
Even though I avoid having to make the random effects assumption, there is still a danger that
the selection on observables assumption is violated. As a result I now move on to Model 3 which can
more effectively deal with unobservables.
4.3 Model 3 – Selection on Observables and Unobservables
In this subsection I use a novel procedure to estimate college effects and account for both selection
on observables and unobservables. To do this, I take the theory model of section 3 as a starting
point and assume that ability $A_i$ is a scalar (with multiple sources of ability, $A_i$ can be interpreted
as a composite scalar index, i.e. a weighted average). When ability $A_i$ is a scalar, I can estimate the
admission thresholds $z_j$ for each college. Admission thresholds can be consistently estimated because
open applicants are randomly allocated to colleges. I then use these threshold estimates and the
linear functional form assumption (5) to obtain estimates for $A_i$ and $\lambda_1$. Colleges with high admissions
thresholds tend to have high ability entrants. This allows me to obtain college effect estimates. I
now explain this procedure in more detail.
First, remember that in the theory model of section 3, the ability of open applicants to Oxford was
distributed $N(0, 1)$. The key to identification is that open applicants are randomly allocated, by
the Undergraduate Admissions Office, to colleges. Intuitively, random allocation means that all
colleges receive open applicants with equal ability on average. If a college accepts a large proportion
of open applicants, this suggests that its cut-off $z_j$ is low and its entrants have relatively low
ability. On the other hand, if a college accepts a small proportion of open applicants then we expect
its cut-off to be high and its entrants to be of relatively high ability. Formally, the ability of open
applicants allocated to college $j$ is also distributed $N(0, 1)$. This means we can consistently estimate
the true cut-off $z_j$ at college $j$ using the estimator:
\[ \hat{z}_j = \Phi^{-1}(1 - p^O_j) \tag{9} \]
where $\Phi$ is the standard normal cdf and $p^O_j$ is the proportion of open applicants allocated to
college $j$ who are offered a place at college $j$ ($p^O_j$ is the area in the upper tail of the standard normal
distribution). When $p^O_j$ is large, $\hat{z}_j$ is small and vice versa. In an infinite sample we could determine
the cut-off value $z_j$ exactly. However, colleges are assigned a finite number of open applicants so we
estimate $z_j$ using $\hat{z}_j$. As a simple example, consider the case where a college accepted 5% of the
open applicants it was allocated by the Undergraduate Admissions Office. Hence $p^O_j = 0.05$ and
the admissions threshold is estimated to be $\hat{z}_j = 1.645$. Since college $j$ uses the same admissions
threshold for both open and direct applicants, we expect applicants with ability $A_i \geq 1.645$ to be
accepted and applicants with ability $A_i < 1.645$ to be rejected.
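The 5% example can be reproduced with a few lines of code. This is a minimal sketch of equation (9) using only Python's standard library; the function name `estimate_cutoff` and the applicant counts are hypothetical.

```python
from statistics import NormalDist

def estimate_cutoff(offers: int, open_applicants: int) -> float:
    """Equation (9): cut-off = Phi^{-1}(1 - p), with p the offer rate among
    the open applicants allocated to the college."""
    if not 0 < offers < open_applicants:
        raise ValueError("the offer rate must lie strictly between 0 and 1")
    p_open = offers / open_applicants
    return NormalDist().inv_cdf(1 - p_open)

# A college that makes offers to 5 of its 100 allocated open applicants.
z_hat = estimate_cutoff(offers=5, open_applicants=100)
print(round(z_hat, 3))  # 1.645
```

The guard on the offer rate matters in practice: a college that accepts none or all of its open applicants would imply an infinite cut-off, so such a college could not be handled by this estimator.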
Second, note again that the ability of open applicants sent to college $j$ is distributed $N(0, 1)$, the ability
of direct applicants to college $j$ is distributed $N(\mu^D_j, 1)$ and each college makes offers to students with
expected exam results above its cut-off. Together these three statements imply that the distribution of
ability for successful open applicants to college $j$ follows a truncated normal distribution, and similarly
for successful direct applicants. The truncations have the same cut-off point $z_j$ but the means of the
truncated normal distributions may differ. This is shown in Figure 1.
Now consider an equation analogous to (9) but this time for direct applicants: $z^D_j = \Phi^{-1}(1 - p^D_j)$,
where $p^D_j$ is the proportion of direct applicants, assigned to college $j$, who are offered a place at
college $j$. I refer to $z^D_j$ as the standardised cut-off for the ability of direct applicants.
Together (i) the true cut-off $z_j$, (ii) the standardised cut-off for the ability of direct applicants $z^D_j$
and (iii) the assumption that the standard deviation of ability for direct applicants is equal to the
standard deviation of the ability of open applicants, $\sigma_D = \sigma_O = 1$, give the mean ability of direct
applicants to college $j$, $\mu^D_j$, through the equation:
\[ z^D_j = \frac{z_j - \mu^D_j}{\sigma_D} \iff \mu^D_j = z_j - \sigma_D z^D_j = z_j - z^D_j \]
Since $z_j$ and $z^D_j$ are unobservable, I use the estimator:
\[ \hat{\mu}^D_j = \hat{z}_j - \hat{z}^D_j \tag{10} \]
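As a numerical illustration of equation (10), the sketch below uses hypothetical offer rates of 5% for open applicants and 10% for direct applicants. The function names are assumptions made for this example. The final line checks the logic: under the estimated mean, the share of direct applicants above the cut-off should equal the direct offer rate.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def standardised_cutoff(offer_rate: float) -> float:
    """Phi^{-1}(1 - p) for an offer rate p strictly between 0 and 1."""
    return STD_NORMAL.inv_cdf(1 - offer_rate)

def direct_applicant_mean(p_open: float, p_direct: float) -> float:
    """Equation (10): mean ability of direct applicants, assuming unit variance."""
    return standardised_cutoff(p_open) - standardised_cutoff(p_direct)

z_hat = standardised_cutoff(0.05)
mu_hat = direct_applicant_mean(p_open=0.05, p_direct=0.10)

# Sanity check: with ability ~ N(mu_hat, 1), the share above z_hat is the direct offer rate.
implied_offer_rate = 1 - NormalDist(mu_hat, 1).cdf(z_hat)
print(round(mu_hat, 3), round(implied_offer_rate, 3))  # 0.363 0.1
```

A higher offer rate for direct applicants than for open applicants therefore translates into a positive estimated mean ability for the college's direct applicants.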
Using the standard result for the mean of a truncated normal distribution gives an estimator for the
average ability of open and direct applicants given offers by college $j$:
\[ E(A^O_j \mid A^O_j > z_j) = \frac{\phi(z_j)}{1 - \Phi(z_j)} ; \qquad E(A^D_j \mid A^D_j > z_j) = \mu^D_j + \frac{\phi(z^D_j)}{1 - \Phi(z^D_j)} \tag{11} \]
where $\phi$ is the standard normal pdf and $\phi(\cdot)/(1 - \Phi(\cdot))$ is the hazard function for the normal distribution.
Equation (11) gives estimates of average student ability for students enrolled at each college (which
is the average of the upper tail in the normal distributions in Figure 1). Next, use the linear functional
form assumption for exam results given in equation (5) to estimate the parameters $\lambda_0$ and $\lambda_1$. By
definition, average realised exam results at college $j$ for enrolled open applicants and enrolled direct
applicants are given by:
\[ \bar{Y}^O_j = \frac{1}{O^*_j} \sum_{i \in E^O_j} (\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij}) ; \qquad \bar{Y}^D_j = \frac{1}{D^*_j} \sum_{i \in E^D_j} (\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij}) \]
where $E^O_j$ is the set of open applicants who were allocated to college $j$ and who enrolled at college
$j$, $E^D_j$ is the set of direct applicants to college $j$ who enrolled at college $j$, $O^*_j$ is the number of
open applicants who were allocated to college $j$ and who enrolled at college $j$, $D^*_j$ is the number of
direct applicants to college $j$ who enrolled at college $j$, $\bar{Y}^O_j$ is the average realised exam result of
open applicants enrolled at college $j$ and $\bar{Y}^D_j$ is the average realised exam result of direct applicants
enrolled at college $j$. Now assume college effects are constant across students so $c_{ij} = c_j$ for all $i$.
Taking differences causes the college effect $c_j$ and the constant term $\lambda_0$ to drop out:
\[ \bar{Y}^O_j - \bar{Y}^D_j = \lambda_1 \left[ \frac{1}{O^*_j} \sum_{i \in E^O_j} A_i - \frac{1}{D^*_j} \sum_{i \in E^D_j} A_i \right] + \frac{1}{O^*_j} \sum_{i \in E^O_j} e_{ij} - \frac{1}{D^*_j} \sum_{i \in E^D_j} e_{ij}. \]
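$E(A^O_j \mid A^O_j > z_j) - E(A^D_j \mid A^D_j > z_j)$ can be used as an estimator for
$\frac{1}{O^*_j} \sum_{i \in E^O_j} A_i - \frac{1}{D^*_j} \sum_{i \in E^D_j} A_i$ for
each college $j$. Thus we can estimate $\lambda_1$ using an OLS regression:
\[ \bar{Y}^O_j - \bar{Y}^D_j = \lambda_1 \left[ E(A^O_j \mid A^O_j > z_j) - E(A^D_j \mid A^D_j > z_j) \right] + \frac{1}{O^*_j} \sum_{i \in E^O_j} e_{ij} - \frac{1}{D^*_j} \sum_{i \in E^D_j} e_{ij} \tag{12} \]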
with $J$ observations, one for each college. This gives the OLS estimate $\hat{\lambda}_1$. Note there is no constant in
this regression because $\lambda_0$ has been differenced away. Unfortunately, heteroskedastic measurement
error in the explanatory variable will cause the OLS estimate of $\lambda_1$ to be biased: the estimates
of mean ability of enrolled students contain estimation error and this estimation error differs across
observations (it is likely to be larger for colleges with fewer open applicants as this means that the
cut-off is less accurately estimated). Whilst methods exist to correct for heteroskedastic measurement
error in simple cases (Sullivan, 2001), correcting the $\lambda_1$ estimate is more complex and, as far as I am
aware, there is no appropriate method to correct for this.
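The chain from offer rates to the slope in regression (12) can be sketched as follows. Everything numerical here is hypothetical: the offer rates are invented, and the score gaps are constructed with a slope of 0.5 and no noise, so the example only shows that a regression without a constant recovers the slope. The function names are assumptions for illustration, not the thesis's code.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # ability of open applicants is N(0, 1) in the model

def hazard(z):
    """Normal hazard phi(z) / (1 - Phi(z)): the mean of a standard normal above z."""
    return STD_NORMAL.pdf(z) / (1 - STD_NORMAL.cdf(z))

def ability_gap(p_open, p_direct):
    """Estimated mean ability of open entrants minus that of direct entrants."""
    z = STD_NORMAL.inv_cdf(1 - p_open)           # equation (9)
    z_direct = STD_NORMAL.inv_cdf(1 - p_direct)  # standardised cut-off, direct applicants
    mu_direct = z - z_direct                     # equation (10)
    mean_open = hazard(z)                        # equation (11), open applicants
    mean_direct = mu_direct + hazard(z_direct)   # equation (11), direct applicants
    return mean_open - mean_direct

def ols_through_origin(x, y):
    """Slope of a regression of y on x with no constant: sum(x*y) / sum(x*x)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical colleges: (offer rate for open applicants, offer rate for direct applicants).
offer_rates = [(0.05, 0.12), (0.10, 0.20), (0.15, 0.22), (0.08, 0.25)]
gaps = [ability_gap(p_o, p_d) for p_o, p_d in offer_rates]

# Hypothetical gaps in mean Prelims scores, built with a slope of 0.5 and no noise.
score_gaps = [0.5 * g for g in gaps]
lambda_1_hat = ols_through_origin(gaps, score_gaps)
print(round(lambda_1_hat, 3))  # 0.5
```

With real data the score gaps contain noise and the ability gaps contain estimation error, so the recovered slope would be subject to the bias discussed above.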
Once we have $\hat{\lambda}_1$, we can back out $\hat{c}^O_j$ and $\hat{c}^D_j$, which are estimates of college effects (inclusive of
the constant term $\lambda_0$) for open applicants and direct applicants:
\[ \hat{c}^O_j = \bar{Y}^O_j - \hat{\lambda}_1 E(A^O_j \mid A^O_j > z_j) ; \qquad \hat{c}^D_j = \bar{Y}^D_j - \hat{\lambda}_1 E(A^D_j \mid A^D_j > z_j) \]
Since we have assumed college effects are constant across students, $\hat{c}^O_j$ and $\hat{c}^D_j$ are also estimates
of the true college effects $c_j$. A single college effect estimate can be obtained by taking a weighted
average of $\hat{c}^O_j$ and $\hat{c}^D_j$, where the weights correspond to the number of students who took Prelims
exams:
\[ \hat{c}_j = \frac{O^*_j}{O^*_j + D^*_j} \hat{c}^O_j + \frac{D^*_j}{O^*_j + D^*_j} \hat{c}^D_j. \tag{13} \]
Finally, to make the results of Model 3 directly comparable to those from Model 1 and Model 2, I
present college effects relative to those of the best performing college, college $J$:
\[ \hat{\beta}_j = \hat{c}_j - \hat{c}_J. \tag{14} \]
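The last two steps, the weighted average in (13) and the normalisation in (14), amount to simple arithmetic. The sketch below uses two hypothetical colleges with made-up mean scores, estimated mean abilities and student counts, and assumes a slope estimate of 0.5; the function names are illustrative only.

```python
def college_effect(ybar_open, ybar_direct, ability_open, ability_direct,
                   n_open, n_direct, lambda_1):
    """Back out a college effect and combine the two groups as in equation (13)."""
    c_open = ybar_open - lambda_1 * ability_open        # open-applicant estimate
    c_direct = ybar_direct - lambda_1 * ability_direct  # direct-applicant estimate
    # Weights are the numbers of enrolled open and direct students.
    return (n_open * c_open + n_direct * c_direct) / (n_open + n_direct)

def relative_to_best(effects):
    """Equation (14): express every effect relative to the best performing college."""
    best = max(effects.values())
    return {name: value - best for name, value in effects.items()}

# Hypothetical inputs (standardised Prelims means and estimated mean abilities).
inputs = {
    "College A": dict(ybar_open=0.10, ybar_direct=0.30, ability_open=1.60,
                      ability_direct=1.90, n_open=10, n_direct=90),
    "College B": dict(ybar_open=-0.20, ybar_direct=0.00, ability_open=1.40,
                      ability_direct=1.50, n_open=20, n_direct=60),
}
effects = {name: college_effect(lambda_1=0.5, **values) for name, values in inputs.items()}
betas = relative_to_best(effects)

for name, beta in betas.items():
    print(f"{name}: {beta:+.4f}")
```

Because effects are measured relative to the best college, every reported value is zero or negative, and the best college itself is exactly zero.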
Implementing Model 3 in practice requires a number of decisions to be taken with regard to the
data. First, I decide to pool across years as done in Model 1 and Model 2. This increases precision
by increasing the number of applicants (particularly open applicants) at each college. Pooling
applications across years is not ideal because it does not reflect how admissions are carried out in
practice. However, open applicants will still be randomly allocated to colleges and, if the distribution
of applicant ability is the same each year, cut-offs will be approximately the same across years.
Second, I only compare the subset of colleges with at least 50 open applicants (again to increase
precision). Third, whereas for Model 1 and Model 2, all students with Prelims scores are included
in the analysis, for Model 3, applicants not selected by the first college they were allocated to (these
students were “Rejected by College 1”) are not used in the analysis because their expected ability is
unknown. This means that Model 3 nests Model 1 as a special case where λ1 = 0 and where Model
1 is estimated on a reduced sample only containing applicants selected by the first college they were
allocated to.
5 Data
5.1 Why use Four Datasets?
I use four different datasets due to a trade-off between sample size and the availability of key covari-
ates. The largest dataset consists of anonymised data on all Oxford applicants in the years 2009-2013.
Information on these students was combined from two different sources. First, application records
were obtained from the Student Data Management and Analysis (SDMA) team at Oxford University.
Second, for enrolled students, the application records were then linked to student records (also held
by the SDMA) through unique student identifiers. Exam results are contained in student records.
I refer to this large dataset as the “All Subjects” dataset because it covers all courses taught at
Oxford. Its obvious advantage is the large number of students. However, focusing exclusively on
this large dataset is limiting for a number of reasons. First, given Model 2 relies on a selection on
observables assumption, it is important to condition on all relevant covariates used in the admissions
process. Time, resource and data availability constraints prevented the SDMA from supplying interview
scores, admissions test scores and specific A-level subjects taken for all students taking every
Oxford course. For courses where this information is missing, the selection on observables assumption
is much less credible. Second, observable ability controls included on the RHS may have a different
impact on exam results depending on the courses taken, e.g. the effect of an A-level in economics is
probably different if a student studies E&M rather than Law at Oxford. Third, college effects may
differ across courses, given that the quality of teaching may vary within colleges. Fourth, admissions
procedures are carried out at a course (department) level so the theoretical model in section 3 implies
open applicants are only randomly allocated to colleges within subjects.
Table 1: Information Available in each Dataset
| Information | PPE | E&M | Law | All Subjects |
|---|---|---|---|---|
| Personal Characteristics | Y | Y | Y | Y |
| Contextual Information | Y | Y | Y | Y |
| Previous School Type | Y | Y | Y | Y |
| GCSEs, A-levels and IB | Y | Y | Y | Y |
| Breakdown of A-levels by Subject | N | Y | Y | N |
| Admissions Test Scores | Y | Y | Y | N |
| Interview Scores | N | N | Y | N |
| School Reference | N | N | N | N |
| Personal Statement | N | N | N | N |
| Individual Paper Marks | Y | Y | Y | N |
For these reasons I also analyse three other datasets containing information on PPE, E&M and
Law students respectively. I choose these courses because very detailed admissions data is available
for each of them and because they all receive large numbers of applications. The information available
in these datasets is summarised in Table 1.
5.2 Choice of Outcome Variable
Preliminary Examinations (“Prelims”) are the exams taken by students at the end of their first year
at Oxford. In PPE, E&M and Law students each take three first year papers, all marked out of 100.
Each script is marked blindly (so the marking tutors do not know which college the student comes
from). The main outcome variable I use is a student’s average Prelims score standardised within
cohort (and course for the All Subjects dataset). For instance, to construct my outcome variable for
PPE, I first take the average score across the three first year papers and then standardise the
result so the mean for each cohort is 0 and the standard deviation for each cohort is 1.
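The construction of the outcome variable can be sketched as follows. The records are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption made so that each cohort's standardised scores have a standard deviation of exactly 1; the text does not say which variant was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardise_by_cohort(records):
    """records: list of (cohort, [paper marks]). Returns one z-score per student,
    in the same order, standardised within the student's own cohort."""
    averages = [(cohort, mean(marks)) for cohort, marks in records]
    by_cohort = defaultdict(list)
    for cohort, avg in averages:
        by_cohort[cohort].append(avg)
    # Cohort-specific mean and (population) standard deviation of Prelims averages.
    stats = {c: (mean(v), pstdev(v)) for c, v in by_cohort.items()}
    return [(avg - stats[c][0]) / stats[c][1] for c, avg in averages]

# Hypothetical students: cohort year and marks on three first year papers.
records = [
    (2009, [60, 65, 70]),
    (2009, [50, 55, 60]),
    (2009, [70, 72, 74]),
    (2010, [58, 62, 66]),
    (2010, [66, 68, 70]),
]
scores = standardise_by_cohort(records)
for (cohort, _), score in zip(records, scores):
    print(cohort, round(score, 3))
```

For the All Subjects dataset the same idea would apply with the grouping key extended from cohort to cohort and course.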
Standardising exam results by cohort is important because the distribution of exam scores var-
ies from year to year (partly due to variation in exam difficulty) even within the same course. I
also standardise by course for the All Subjects dataset because there is significant variation between
subjects in Prelims averages and this variation is mostly unrelated to college effectiveness.24 Stand-
ardising Prelims averages across subjects avoids penalising colleges that teach courses with lower
Prelims averages.25 Using the Prelims average is preferable to estimating separate models for each
Prelims paper taken for two reasons. First, it increases precision. Second, college effectiveness is very
likely to “spill over” across papers.
Research has demonstrated that better university exam performance is closely related to other
desirable outcomes which supports the exam based measurement of college effectiveness (Smith et al.,
2000; Walker and Zhu, 2013; Feng and Graetz, 2015; Naylor et al., 2015). One minor problem is that
interpreting Prelims scores is complicated by the fact that a small number of students retake papers.
Students only retake papers if they fail first time around. In this case the data I have corresponds to the
highest mark they obtained, which may be from the first or second attempt. It would have been preferable
if I had the Prelims scores from first attempts. However retakes are rare so this should not be a
significant problem.
24 The variation may reflect differences between subjects in the nature of the subject matter (arguably, natural science exams are conducive to more extreme patterns of results) and in conventions within subjects of what is of sufficient merit to be awarded a given mark.
25 I don’t standardise marks for each individual paper because students and colleges may optimally concentrate their teaching efforts on the Prelims papers that have a higher variance of marks.
An obvious alternative outcome variable is Final Examination (“Finals”) results such as average
score across Finals papers. However, Prelims results are preferred for a number of reasons. First,
attrition is greater with Finals (because more students drop out over time) and this implies more
missing data which can bias college effect estimates. Second, in Finals not every student takes the
same exams because of different option choices. This is problematic because there are differences in
score distributions across different options. Third, using Finals results involves excluding students
still in their first or second years at Oxford, substantially reducing the power of the analysis.26
However, when interpreting the results one should keep in mind that Prelims are less important to
students than Finals (they are “lower stakes” exams) and that Prelims may over- or underestimate Finals
college effects. They may underestimate them because there is less time for any college effect to become
evident and because college effects may be cumulative. They may overestimate them because teaching is
more college-focused in first year than in later years.
For these reasons I focus on standardised average Prelims scores in the main analysis but also
briefly consider the consequences of using individual first year paper scores and average Finals score
as outcome variables.
5.3 Choice of Control Variables
The control variables included in the analysis are summarised in Table 2.
Most of the controls will be familiar to a UK audience. Less familiar may be contextual in-
formation27, which is provided to admissions tutors in the form of “flags”, identifying disadvantaged
students. Admissions tutors are advised to use the contextual information to suggest extra candid-
ates to interview. The International Baccalaureate (IB) is an alternative to A-levels where students
complete assessments in six subjects. Each student gets a mark out of 45. The Thinking Skills
Assessment (TSA) is the admissions test for PPE and E&M applicants. It includes a 90-minute
multiple-choice test, marked by the Admissions Testing Service, and the marks are made available
to colleges. The Law National Admissions Test (LNAT) is the admissions test for Law applicants.
The LNAT includes a multiple choice section (machine marked out of 42) and an essay section (individually
marked by colleges). Interviews are usually face-to-face with admissions tutors and most
candidates have 2 interviews. Law students are given an interview score out of 10.
Table 2: Description of Control Variables
| Category | Variable | Description |
|---|---|---|
| Personal Characteristics | Gender | Dummy variable indicating whether the student is male or female |
| Personal Characteristics | Ethnicity / Overseas status | Dummy variables indicating: “UK White”; “UK Black”; “UK Asian”; “UK Other ethnic group”; “UK Information refused”; “EU” and; “Non-EU” |
| Contextual Information | Pre-16 School Flag | Performance of applicant’s school at GCSE is below national average |
| Contextual Information | Post-16 School Flag | Performance of applicant’s school at A-level is below national average |
| Contextual Information | Care Flag | Applicant has been in care for more than three months |
| Contextual Information | Polar Flag | Applicant’s postcode is in POLAR quintiles 1 and 2, indicating lowest rate of young people’s participation in Higher Education |
| Contextual Information | Acorn Flag | Applicant’s postcode is in Acorn groups 4 or 5, meaning residents are typically categorised as ‘financially stretched’ or living in ‘urban adversity’ |
| Prior Educational Qualifications | Previous school type | Dummy variables for State, Independent and other school type |
| Prior Educational Qualifications | GCSEs | Dummy variables for proportion of A*s obtained at GCSE (if more than 5 GCSEs). Categories are: “Band 1: 100%”; “Band 2: 75-99%”; “Band 3: 50-74%”; “Band 4: < 50%” and; “Less than 5 GCSEs” |
| Prior Educational Qualifications | A-levels | Dummy variables for A-level bands. The categories are: “Did not take A-levels”, “Applied to start prior to 2010”, “Applied to start in 2010 or later and no A*”, “1 A*”, “2 A*”, “3 A*” and “4 or more A*” |
| Prior Educational Qualifications | A-Level subjects | Dummy variables indicating whether students had taken A-levels in certain subjects. Subjects for E&M are Economics, Maths and Further Maths. Subjects for Law are History and Law |
| Prior Educational Qualifications | A-Level subject grades | Dummy variables indicating the grade achieved in included subjects |
| Prior Educational Qualifications | IB | Dummy variables for IB bands. “Band 1: 45 (full marks)”; “Band 2: {43, 44}”; “Band 3: {41, 42}”; “Band 4: ≤ 40” and; “Did not take IB” |
| Admissions Tests and Interviews | TSA | Variables for TSA critical thinking score and TSA problem solving score |
| Admissions Tests and Interviews | LNAT | Variables for LNAT multiple choice score and LNAT essay score |
| Admissions Tests and Interviews | Interview Score | An interview score is given to each candidate out of 10. |
26 Using final degree class, as in the Norrington table, has the additional problem that it is discrete and thus discards lots of useful information concerning student achievement. This is particularly a problem at Oxford where over 50% of students obtain a 2:1.
27 It is sometimes argued that contextual information (and some personal characteristics such as gender and race) should not be controlled for. This is because controlling for contextual information sets lower expectations for some demographics. However, not taking these differences into account may penalise colleges that serve these students for reasons that may be at least partly out of their control.
A quick note should also be made about using A-level grades, which is complicated by two
factors. First, a new A* grade was introduced in 2010. I create a separate A-level dummy variable
for students who applied before the A* grade was introduced. Second, most applicants are only
halfway through their A-levels when they apply to Oxford. In this case admissions tutors observe
predicted grades, which are not available in the data. This should not be too problematic because
rational admissions tutors will make correct inferences on average about the actual A-level grades
an applicant will achieve. Actual A-levels achieved are also probably a better measure of ability than
predicted grades.
5.4 Sample Selection
Sample selection involves choosing both a sample of applicants (only relevant for estimating cut-
offs in Model 3) and a sample of enrolled students (relevant for all three models). Fortunately, the
datasets contain only a very small amount of missing data. The missing data comes in two forms.
First, missing values of control variables for individuals who otherwise provide relatively complete
data. For example, a small number of students (12 in PPE, 39 in Law and 0 in E&M) are missing
admissions test scores, perhaps because they were ill on the day of the test or because there were no available
test centres in their home countries (the vast majority are international students, with many from
outside the EU). Imputing values for these missing covariates is possible. However, the advantages
of multiple imputation are minimal at best when missing data is less than 5% of the sample (Manly
and Wells, 2015). Multiple imputation also makes interpreting results more difficult ($R^2$ can’t be
reported for example). I thus drop these observations (listwise deletion), which is standard practice in
the value-added literature. This choice should be taken into account when interpreting the resulting
college effect estimates.
Second, and more significantly, some students who matriculated at Oxford have missing Prelims
34
Table 3: Sample Selection: PPEApplicant Sample (2009-2014)a 9867Exclusions
Not Enrolled at Oxford -8404Not in Cohorts 2009-14b -7Withdrew from Oxford -51Exclude Extreme Outliersc -2No Admissions Test Scoresd -12
Final Sample 1391
aApplicant sample excludes 53 studentswho have student records but not applicationrecord. This is likely to be because they ap-plied pre-2009, before the dataset begins.
bThese students were offered deferred entry.c2 students had Economics marks recorded
as 0 or 1. The next lowest mark is 30. It isunclear whether these are typographical errorsor true marks.
d11 of the 12 students with missing ad-missions test scores were international studentswith 10 from non-EU countries.
Table 4: Sample Selection: All SubjectsApplicant Sample (2009-2013)a 75033Exclusions
Not Enrolled at Oxford -61153Not in Cohorts 2009-2013b -76No Prelims Averagec -376St Stephen’s College -1
Final Sample 14427
aExcludes all Medicine and PhysiologicalScience applicants as they are not given “marks”in Prelims. Also excludes Classics I and ClassicsII in the 2013 Ucas Cycle, Biomedical Sciencein 2011 and 2012 and Japanese students in 2009and 2010 as in each case their Prelims scores areall missing.
bThese students were offered deferred entry.c210 of these students have officially with-
drawn from Oxford and 8 are suspended.Numbers per college range from 31 (HarrisManchester) to 5 (Exeter and Hertford).
Table 5: Sample Selection: E&MApplicant Sample (2009-2014) 6874Exclusions
Not Enrolled at OxfordRejected Before Interview -4615Rejected After Interview -1638Declined Offer -24Withdrew during Process -32Failed to meet Offer Grades -30Withdrew After Offer -1
Not in Cohorts 2009-14a -2Withdrew from Oxfordb -15Exclude Extreme Outliersc -1
Final Sample 516
aThese students were offered deferred entry.b4 from Pembroke. No more than 1 at any
other college.cUnusually low TSA score.
Table 6: Sample Selection: LawApplicant Sample (2007-2013) 8148Exclusions
Not Enrolled at OxfordRejected Before Interview -4094Rejected After Interview -2440Declined Offer -59Withdrew during Process -60Failed to meet Offer Grades -136Withdrew After Offer -1
Not in Cohorts 2007-13a -10Skipped Prelimsb -31Withdrew before Prelimsc -49No LNAT/interview scoresd -39
Final Sample 1229
aThese students were offered deferred entry.bMay have come to Oxford with a BA from
overseas and been allowed to transfer automat-ically to year 2 without having to sit Prelims.
c16 from Harris Manchester. Less than 3from most other colleges.
d24 of the 39 students with missing ad-missions test scores were international studentswith 22 from non-EU countries.
35
scores (51 for PPE, 49 in Law and 15 in E&M). The main reasons are (i) students dropping out
of Oxford during their first year and (ii) students taking a year out intending to return and repeat
their first year. I again use listwise deletion. This is not ideal because it rewards “cream skimming”
(encouraging weaker students not to take exams and perhaps to drop out). Bias will result if having
missing Prelims scores is an indicator that the student was likely to under-perform relative to their
expected result given their pre-Oxford characteristics. Imputing missing Prelims scores would also
not fully correct for bias. However, because missing Prelims scores are rare and seem evenly spread
across colleges, I do not expect biases to be large.28,29
The sample selection criteria are summarised in Tables 3-6.
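The cream-skimming check reported in footnote 29 amounts to correlating each college's estimated effect with its share of students who are missing Prelims results. The following is a minimal Python sketch of that calculation, not the code used for this thesis; the college labels and all numbers are invented purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical data: college -> (estimated effect, share missing Prelims).
colleges = {
    "A": (0.00, 0.01),
    "B": (-0.20, 0.03),
    "C": (-0.35, 0.02),
    "D": (-0.50, 0.06),
}
effects = [effect for effect, _ in colleges.values()]
missing_share = [share for _, share in colleges.values()]

r = pearson(effects, missing_share)
# Cream skimming would predict r > 0: colleges that look more effective
# would be the ones losing more students before the exam.
print(f"correlation = {r:.2f}")
```

A negative value, as in this made-up example, would point away from cream skimming, which is the pattern the footnote reports for PPE and E&M.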
5.5 Descriptive Statistics
Tables 7 and 8 present application, offer and enrolment statistics for each college. The first two
columns show that most applicants to Oxford (e.g. over 80% in PPE) are direct applicants. There
is large variation in the numbers of direct applicants received by each college. For example, whereas
Balliol received 985 direct applications for PPE, St Hilda’s received only 69. The colleges with
relatively few direct applicants are allocated large numbers of open applicants (Balliol received 0
open applicants in PPE whereas St Hilda’s received 246). The tables show that almost all colleges
make offers to a higher proportion of direct applicants than they do to open applicants, suggesting
that the direct applicants are on average of higher ability. Consequently, over 90% of students who
take exams at Oxford are direct applicants rather than open applicants.
Tables 9-12 present descriptive statistics for applicants and exam takers for each dataset. Columns
1-3 present mean pre-Oxford characteristics of applicants. Columns 1-3 show that open applicants
are more likely than direct applicants to be international students (whether from the EU or from outside
the EU). Open applicants also tend to perform less well in GCSEs, A-levels and admissions tests.
28 An exception is Harris Manchester, where a disproportionately large number of students drop out. This may be related to the fact that Harris Manchester is a college for “mature students”.
29 If cream skimming is taking place, we might expect to see a positive correlation between college effectiveness estimates and the share of a college’s students that are missing exam results. However, the correlation between the selection on observables estimates and the share of dropouts is −0.86 for PPE, −0.40 for E&M and 0.09 for Law. If anything, the opposite is the case: less effective colleges tend to have larger shares of dropouts.
Table 7: Application, Offer and Enrolment Statistics: PPE and E&M

PPE
            Applicants      % offers   Enrolled with Prelims
College  Direct   Open Direct   Open Direct   Open  Reject College 1
BALL        985      0     9%      -     70      0                 0
BLACKF        3      0    33%      -      0      0                 3
BNC         528      0    11%      -     53      0                 0
CCC         137    110    18%     7%     21      5                 7
CH-CH       368     17    15%     0%     44      0                10
EXETER      257     14    15%     0%     34      0                 2
H-MAN       127      0    17%      -     15      0                20
HERT        264     17    16%     6%     39      1                 7
JESUS       164     50    15%     2%     21      0                14
KEBLE       230     47    14%     6%     32      3                13
LINC        312     31    16%     0%     49      0                 6
LMH         141    136    20%     4%     27      2                16
MAGD        498      0    12%      -     52      0                 0
MANS        132    127    14%     2%     16      1                23
MERT        314      0    14%      -     37      0                 6
NEW         452      0    14%      -     60      0                 1
ORIEL       271     40    17%     8%     42      3                 6
PEMB        164     89    15%     8%     24      7                13
QUEENS      158     94    17%     4%     24      3                13
REGENT       20      0    10%      -      2      0                14
S-ANNE      160    111    14%     9%     20      7                15
S-BEN         8      0     0%      -      0      0                19
S-CATS      241     41    11%     5%     26      1                17
S-HIL        69    246     7%     4%      3      7                31
S-HUGH       71    132    14%     8%      9      9                14
S-JOHN      297      9    13%    22%     34      2                 3
S-PET       161    157    16%     6%     21     10                18
SEH         139     84    13%     8%     17      6                11
SOMER       109    220    17%     5%     15      9                30
TRIN        251      3    13%     0%     24      0                 4
UNIV        431     30    14%     3%     54      0                10
WADH        327      0    16%      -     47      0                 2
WORC        265      5    12%    20%     29      1                 4
Total      8054   1810    14%     6%    961     77               352

E&M
            Applicants      % offers   Enrolled with Prelims
College  Direct   Open Direct   Open Direct   Open  Reject College 1
BALL        321      0     5%      -     14      0                 2
BNC         537      8     7%     0%     32      0                 2
CH-CH       245     18     8%     0%     17      0                 1
EXETER      218      1     4%     0%      9      0                 3
H-MAN        54      0    11%      -      4      0                 9
HERT        408    150     9%     3%     34      3                 7
JESUS       146    138    12%     4%     15      3                 3
KEBLE       217    188    12%     3%     22      4                11
LMH         126    129     9%     4%     10      3                 5
MERT        213     78    11%     1%     22      1                 1
NEW         154     41     8%     2%     12      1                 2
PEMB        525     51     9%     4%     41      2                 2
QUEENS       73     33     3%     3%      2      1                 4
S-ANNE      128    168    11%     4%     12      6                 2
S-CATS      191      1     4%     0%      6      0                 7
S-HIL        52    169     2%     7%      1      8                11
S-HUGH      134    213    14%     3%     17      5                10
S-JOHN      134     16     4%     6%      5      1                 3
S-PET       206    239    13%     3%     26      7                 4
SEH         181    265     9%     5%     14     12                 9
TRIN        236      8     6%     0%     14      0                 3
WADH         89     54    13%     2%     10      1                 1
WORC        312      0     5%      -     16      0                 1
Total      4900   1968     8%     4%    355     58               103

Table 7 shows application, offer and enrolment statistics for PPE and E&M. The first two columns give the number of applications received by each college. The third and fourth columns give the percentage of offers made to open and direct applicants. Columns 1-4 are based on the applicant sample. Columns 5-7 give the number of enrolled students with Prelims results. Column 7, “Reject College 1”, denotes the number of students at each college who were not made an offer by the college they were originally allocated to.
Table 8: Application, Offer and Enrolment Statistics: Law and All Subjects

Law
            Applicants      % offers   Enrolled with Prelims
College  Direct   Open Direct   Open Direct   Open  Reject College 1
BALL        256      0    12%      -     26      0                 6
BNC         509      2    11%     0%     41      0                 3
CCC         109     79    23%    10%     21      8                 9
CH-CH       290     30    16%     7%     36      2                14
EXETER      249     41    17%    15%     33      5                 8
GREYF         3      0    33%      -      0      0                 4
H-MAN       281      5    13%     0%     12      0                 9
HERT        197     59    19%     5%     35      2                 1
JESUS       223     72    17%    10%     28      5                 5
KEBLE       238     69    21%     7%     37      3                 7
LINC        309     44    13%     5%     33      1                 7
LMH         164     43    19%    12%     29      5                 4
MAGD        349      0    14%      -     43      0                 5
MANS        130     42    12%    10%     12      4                14
MERT        179     22    18%     0%     23      0                 7
NEW         227     39    17%    13%     34      4                 8
ORIEL       194    101    21%    10%     31      9                 3
PEMB        173     35    16%     6%     23      1                11
QUEENS      133     48    20%    17%     16      5                 4
REGENT       13      0     0%      -      0      0                13
S-ANNE      130     76    16%     9%     17      5                20
S-CATS      307     36    12%     8%     33      3                18
S-HIL        87    200    11%     6%      8      8                25
S-HUGH       74    125     9%    19%      5     19                15
S-JOHN      196     48    24%    10%     41      4                 3
S-PET       107     81    15%     5%     13      3                19
SEH          97    159    22%    11%     13     16                16
SOMER        62    125    19%    14%     10     13                15
TRIN        195      3    16%     0%     22      0                 5
UNIV        382     10    14%    10%     38      1                 9
WADH        229     90    16%     7%     29      5                17
WORC        371      1    13%     0%     48      0                 4
Total      6463   1685    16%    10%    790    131               308

All Subjects
            Applicants      % offers   Enrolled with Prelims
College  Direct   Open Direct   Open Direct   Open  Reject College 1
BALL       3300     61    15%     8%    421      4                62
BLACKF        7      0    57%      -      0      0                 3
BNC        3761     46    12%    11%    419      2                35
CCC        1009    350    24%     9%    219     27                64
CH-CH      2671    218    18%    11%    399     16               158
EXETER     2420    156    15%    12%    331     11                78
H-MAN       593      0    15%      -     49      0                67
HERT       2602    321    19%     5%    446     13                82
JESUS      2190    339    18%     5%    353     15                87
KEBLE      2832    390    17%     4%    446     12               115
LINC       1904    181    19%     8%    326     11                47
LMH        1779    686    21%     7%    329     34               174
MAGD       3176     90    16%    10%    465      4                42
MANS        998    473    17%     8%    149     24               153
MERT       2103    139    18%     7%    333      8                38
NEW        2601    173    20%     7%    478      9                52
ORIEL      1738    309    17%    10%    276     23                90
PEMB       1913    463    17%    12%    295     43               139
QUEENS     1511    471    20%    11%    268     42               130
REGENT      109      0    21%      -     17      0               129
S-ANNE     1655    846    21%    10%    304     64               186
S-BEN        43      0    19%      -      7      0                63
S-CATS     2389    572    18%    10%    380     42               221
S-HIL       720   1508    18%    11%    113    125               295
S-HUGH     1031   1422    22%    11%    206    122               235
S-JOHN     2781    195    17%    10%    426     18                76
S-PET      1298    855    18%     9%    208     56               191
SEH        1464   1071    21%    11%    279     97               171
SOMER       871   1153    28%    12%    212    106               206
TRIN       2186     36    17%     0%    336      0                38
UNIV       2534    105    19%     5%    406      4                73
WADH       2778    285    18%     7%    455     16               102
WORC       4106     34    13%    12%    477      4                44
Total     63073  12948    18%    10%   9828    952              3646

Table 8 shows application, offer and enrolment statistics for Law and All Subjects. The first two columns give the number of applications received by each college. The third and fourth columns give the percentage of offers made to open and direct applicants. Columns 1-4 are based on the applicant sample. Columns 5-7 give the number of enrolled students with Prelims results. Column 7, “Reject College 1”, denotes the number of students at each college who were not made an offer by the college they were originally allocated to.
Table 9: Mean Applicant and Exam Taker Characteristics: PPE

                               Applicants             Exam Takers
                          Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                    0.38    0.41    0.38    0.33    0.32    0.33
  UK White                  0.38    0.16    0.34    0.63    0.27    0.61
  UK Black                  0.02    0.02    0.02    0.02    0.01    0.02
  UK Asian                  0.08    0.04    0.07    0.09    0.03    0.08
  UK Other Ethnicity        0.01    0.00    0.01    0.01    0.01    0.01
  UK Information Refused    0.02    0.01    0.02    0.02    0.02    0.02
  EU                        0.22    0.34    0.24    0.10    0.29    0.11
  Non EU                    0.24    0.42    0.27    0.12    0.37    0.13
Contextual Factors
  Polar Flag                0.06    0.04    0.05    0.08    0.08    0.08
  Acorn Flag                0.05    0.04    0.05    0.05    0.05    0.05
  Pre-16 School Flag        0.04    0.03    0.04    0.05    0.05    0.05
  Post-16 School Flag       0.09    0.07    0.08    0.10    0.08    0.10
  Care Flag                 0.00    0.00    0.00    0.00    0.00    0.00
  Overall Flag              0.03    0.03    0.03    0.03    0.02    0.03
Previous School Type
  State                     0.31    0.18    0.29    0.43    0.26    0.42
  Independent               0.27    0.09    0.23    0.37    0.12    0.35
  Other School Type         0.42    0.74    0.48    0.20    0.62    0.22
  Took GCSEs                0.56    0.27    0.51    0.78    0.35    0.76
  GCSE Band 4 (lowest)      0.17    0.16    0.17    0.07    0.06    0.07
  GCSE Band 3               0.14    0.06    0.13    0.15    0.13    0.15
  GCSE Band 2               0.16    0.04    0.13    0.31    0.11    0.30
  GCSE Band 1 (highest)     0.09    0.01    0.07    0.26    0.05    0.25
  Took A-levels             0.49    0.34    0.46    0.60    0.40    0.58
  Took IB                   0.08    0.08    0.08    0.06    0.09    0.07
Admissions Tests
  TSA Critical             64.51   60.70   63.85   73.66   72.22   73.56
  TSA Problem              58.57   55.69   58.07   68.15   68.76   68.19
Outcomes
  Prelims Average                                  61.82   61.83   61.82
  Std Prelims Average                              -0.00    0.00   -0.00
Observations                8055    1812    9867    1298      93    1391

Table displays mean characteristics of PPE students. Columns 1-3 give the mean characteristics of applicants to Oxford for PPE between 2009 and 2014. Columns 4-6 give the mean characteristics of students who took Prelims at Oxford in PPE.
Table 10: Mean Applicant and Exam Taker Characteristics: E&M

                               Applicants             Exam Takers
                          Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                    0.37    0.38    0.37    0.30    0.26    0.29
  UK White                  0.31    0.11    0.25    0.59    0.33    0.56
  UK Black                  0.02    0.01    0.02    0.02    0.02    0.02
  UK Asian                  0.14    0.07    0.12    0.17    0.09    0.16
  UK Other Ethnicity        0.01    0.00    0.01    0.02    0.00    0.02
  UK Information Refused    0.01    0.00    0.01    0.02    0.00    0.02
  EU                        0.17    0.28    0.20    0.07    0.20    0.09
  Non EU                    0.31    0.52    0.37    0.11    0.36    0.14
Contextual Factors
  Polar Flag                0.05    0.04    0.05    0.07    0.05    0.07
  Acorn Flag                0.04    0.03    0.04    0.04    0.05    0.04
  Pre-16 School Flag        0.04    0.02    0.03    0.03    0.05    0.03
  Post-16 School Flag       0.07    0.06    0.07    0.09    0.08    0.09
  Care Flag                 0.00    0.00    0.00    0.00    0.00    0.00
  Overall Flag              0.02    0.02    0.02    0.02    0.03    0.02
Previous School Type
  State                     0.29    0.16    0.25    0.39    0.21    0.37
  Independent               0.32    0.12    0.27    0.44    0.27    0.42
  Other School Type         0.38    0.72    0.48    0.16    0.52    0.21
  Took GCSEs                0.57    0.26    0.48    0.83    0.44    0.78
  GCSE Band 4 (lowest)      0.16    0.13    0.15    0.05    0.06    0.05
  GCSE Band 3               0.17    0.07    0.14    0.19    0.17    0.18
  GCSE Band 2               0.17    0.04    0.14    0.33    0.15    0.31
  GCSE Band 1 (highest)     0.07    0.01    0.06    0.26    0.06    0.24
  Took A-levels             0.62    0.37    0.55    0.78    0.44    0.74
  Took IB                   0.07    0.07    0.07    0.04    0.08    0.05
  Economics                 0.52    0.29    0.45    0.68    0.32    0.64
  Maths                     0.61    0.35    0.53    0.77    0.44    0.73
  Further Maths             0.19    0.09    0.16    0.29    0.11    0.27
  A* in Economics           0.16    0.06    0.13    0.30    0.11    0.27
  A* in Maths               0.25    0.12    0.22    0.44    0.24    0.42
  A* in Further Maths       0.05    0.02    0.04    0.13    0.06    0.12
Admissions Tests
  TSA Critical             60.48   56.96   59.52   71.46   70.24   71.30
  TSA Problem              58.15   55.71   57.49   68.34   68.66   68.38
Outcomes
  Prelims Average                                  63.08   64.99   63.33
  Std Prelims Average                              -0.04    0.25   -0.00
Observations                4904    1970    6874     450      66     516

Table displays mean characteristics for Economics and Management students. Columns 1-3 give the mean characteristics of applicants to Oxford for Economics and Management between 2009 and 2014. Columns 4-6 give the mean characteristics of students who took Prelims in Economics and Management.
Table 11: Mean Applicant and Exam Taker Characteristics: Law

                               Applicants             Exam Takers
                          Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                    0.56    0.54    0.55    0.55    0.57    0.55
  UK White                  0.51    0.31    0.47    0.66    0.41    0.63
  UK Black                  0.04    0.03    0.04    0.03    0.03    0.03
  UK Asian                  0.10    0.09    0.10    0.09    0.09    0.09
  UK Other Ethnicity        0.02    0.01    0.02    0.02    0.03    0.02
  UK Information Refused    0.27    0.46    0.31    0.16    0.39    0.19
  EU                        0.05    0.08    0.06    0.04    0.06    0.04
  Non EU                    0.24    0.46    0.29    0.14    0.36    0.17
Contextual Factors
  Polar Flag                0.08    0.08    0.08    0.06    0.10    0.07
  Acorn Flag                0.06    0.07    0.06    0.04    0.06    0.05
Previous School Type
  State                     0.49    0.36    0.47    0.54    0.43    0.53
  Independent               0.21    0.10    0.19    0.29    0.14    0.27
  Other School Type         0.30    0.54    0.34    0.17    0.43    0.20
School Exam Results
  Took GCSEs                0.69    0.47    0.64    0.81    0.58    0.78
  GCSE Band 4 (lowest)      0.31    0.31    0.31    0.10    0.18    0.11
  GCSE Band 3               0.18    0.09    0.16    0.24    0.12    0.22
  GCSE Band 2               0.14    0.05    0.12    0.30    0.19    0.28
  GCSE Band 1 (highest)     0.06    0.02    0.05    0.18    0.09    0.17
  Took A-levels             0.65    0.48    0.62    0.78    0.53    0.74
  Took IB                   0.04    0.04    0.04    0.04    0.04    0.04
  History                   0.38    0.22    0.35    0.49    0.30    0.47
  History A                 0.24    0.11    0.22    0.35    0.22    0.33
  History A*                0.06    0.03    0.06    0.12    0.07    0.12
  Law                       0.12    0.14    0.13    0.08    0.14    0.09
  Law A                     0.08    0.09    0.08    0.07    0.11    0.07
  Law A*                    0.02    0.02    0.02    0.01    0.02    0.01
Admissions Tests
  LNAT Multiple Choice     19.64   18.70   19.45   22.76   22.96   22.79
  LNAT Essay               58.80   56.08   58.24   64.62   64.32   64.58
  Interview Score                                   8.04    8.02    8.04
Outcomes
  Prelims Average                                  65.08   64.67   65.02
  Std Prelims Average                               0.02   -0.13    0.00
Observations                6463    1685    8148    1069     160    1229

Table displays mean characteristics for Law students. Columns 1-3 give the mean characteristics of applicants to Oxford for Law between 2007 and 2013. Columns 4-6 give the mean characteristics of students who took Prelims at Oxford in Law.
Table 12: Mean Applicant and Exam Taker Characteristics: All Subjects

                               Applicants             Exam Takers
                          Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                    0.49    0.46    0.48    0.47    0.43    0.46
  UK White                  0.60    0.31    0.55    0.74    0.48    0.72
  UK Black                  0.02    0.02    0.02    0.02    0.01    0.01
  UK Asian                  0.08    0.06    0.07    0.07    0.06    0.07
  UK Other Ethnicity        0.01    0.01    0.01    0.01    0.01    0.01
  UK Information Refused    0.02    0.01    0.02    0.02    0.01    0.02
  EU                        0.09    0.19    0.11    0.05    0.16    0.06
  Non EU                    0.16    0.38    0.20    0.09    0.27    0.10
Contextual Factors
  Polar Flag                0.09    0.08    0.08    0.08    0.10    0.09
  Acorn Flag                0.06    0.07    0.06    0.05    0.07    0.06
Previous School Type
  State                     0.45    0.33    0.43    0.47    0.42    0.47
  Independent               0.33    0.13    0.29    0.41    0.17    0.39
  Other School Type         0.22    0.55    0.28    0.12    0.41    0.14
School Exam Results
  Took GCSEs                0.73    0.41    0.68    0.86    0.57    0.84
  GCSE Band 4 (lowest)      0.26    0.26    0.26    0.12    0.19    0.13
  GCSE Band 3               0.21    0.11    0.19    0.21    0.18    0.21
  GCSE Band 2               0.19    0.06    0.17    0.32    0.14    0.30
  GCSE Band 1 (highest)     0.09    0.02    0.08    0.21    0.06    0.20
  Took A-levels             0.66    0.43    0.62    0.79    0.56    0.77
  Took IB                   0.05    0.06    0.05    0.04    0.05    0.04
Outcomes
  Prelims Average                                  64.76   64.68   64.75
  Std Prelims Average                               0.01   -0.07   -0.00
Observations               63081   12952   76033   13306    1121   14427

Columns 1-3 give the mean characteristics of applicants to Oxford. Columns 4-6 give the mean characteristics of students who took Prelims at Oxford.
Table 13: Tests for Differences in Mean and Variance of Applicant Ability across Colleges

                                    PPE                         E&M               Law
                        TSA Critical  TSA Problem   TSA Critical  TSA Problem    LNAT
Variance F-statistic           1.888        1.247          1.120        1.533   1.912
Prob > F                       0.002        0.156          0.313        0.050   0.001
Mean F-statistic               0.968        0.978          0.962        0.975   0.983
Prob > F                       0.000        0.000          0.000        0.000   0.000

The robvar command in Stata is used to report Brown’s robust test statistic for the equality of variances of admissions test scores at different colleges. The mvtest command in Stata is used to test for differences in mean admissions test scores across applicants to different colleges.
42
Admissions test scores (TSA and LNAT) and GCSE results provide particularly strong evidence
that direct applicants are on average higher ability than open applicants. Columns 4-6 present
corresponding descriptive statistics for the final sample of students who take Prelims exams.
5.5.1 Testing Assumptions for Selection on Observables and Unobservables
Before moving on to the results, I test two of the key assumptions of Model 3. First, I test the
assumption that the variance of the ability of direct applicants is the same across colleges (and the
same as the variance of ability of open applicants). Since ability is unobservable, I use admissions
test scores as a proxy for ability. Table 13, at the bottom of page 42, reports Brown’s robust test
statistic for the equality of variances which I calculate using the robvar command in Stata (Brown
and Forsythe, 1974). There is relatively strong evidence that the standard deviation of applicant
ability differs across colleges – 2 of the 5 p-values are less than 0.01 and a further p-value is less
than 0.10. This provides some evidence against Model 3 though the importance of the failure of
this assumption is ultimately an empirical question – it is possible that these differences in standard
deviation are not practically important (the large sample size makes it possible for practically small
differences in the variance of ability to be statistically significant). Table 13 also reports the results of
a test for differences in mean admissions test scores across colleges and open applicants, implemented
using the mvtest command in Stata. The results strongly reject the hypothesis that mean admissions
test scores are the same across colleges and open applicants. This justifies the modelling choice to
allow mean applicant ability to differ across colleges.
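To make the variance test concrete, the sketch below computes a Brown-Forsythe style statistic by hand: a one-way ANOVA F-statistic applied to absolute deviations from each group's median. This is an illustration in Python rather than the Stata code behind Table 13. It assumes the median-centred variant (the text does not say which of the statistics reported by robvar is used), and the test scores are invented.

```python
from statistics import median

def brown_forsythe(groups):
    """One-way ANOVA F-statistic on absolute deviations from group medians."""
    # Step 1: replace each score by its distance from its own group's median.
    z = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(z)                        # number of groups (colleges)
    n = sum(len(g) for g in z)        # total number of applicants
    grand_mean = sum(sum(g) for g in z) / n
    group_means = [sum(g) / len(g) for g in z]
    # Step 2: the usual between-group and within-group mean squares.
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(z, group_means)
    ) / (k - 1)
    within = sum(
        (v - m) ** 2 for g, m in zip(z, group_means) for v in g
    ) / (n - k)
    return between / within

# Hypothetical admissions test scores for two colleges.
same_spread = [[60, 62, 64, 66, 68], [70, 72, 74, 76, 78]]
different_spread = [[63, 64, 65, 64, 64], [50, 60, 70, 80, 90]]

f_same = brown_forsythe(same_spread)
f_diff = brown_forsythe(different_spread)
print(f_same, f_diff)
```

The statistic is compared with an F distribution with (k − 1, n − k) degrees of freedom to obtain a p-value. That step is omitted here because it needs a statistics library.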
Second, I test whether open applicants really are randomly allocated to colleges in the admissions
process using a balancing test (randomisation test), analogous to those typically carried out using
pre-treatment outcomes in a randomised trial. I implement it by taking a candidate confounder
(admissions test scores, gender, etc.) and regressing it on a vector of college dummies for the sample
of open applicants. Zero coefficients on the college dummies support the assumption that open
applicants are randomly allocated to colleges. The balancing test is a simple F-test on the college
dummies. The results reported in Table 14 support the randomisation assumption within courses.
Of the 41 p-values in the first 3 columns only 1 is less than 0.05 and only 5 are less than 0.10. A
Table 14: P-values from Balance Tests

                                                   All Subjects   All Subjects
                         PPE       E&M       Law  (all courses)    (by course)
Gender                 0.284     0.467     0.091*         0.000**        0.854
White                  0.244     0.172     0.492          0.000**        0.074
Asian                  0.917     0.181     0.696          0.355          0.536
Black                  0.990     0.339     0.989          0.944          0.848
EU                     0.709     0.548     0.169          0.076          0.619
Non-EU                 0.454     0.408     0.587          0.000**        0.550
Overseas               0.062     0.787     0.612          0.000**        0.831
State                  0.007**   0.587     0.422          0.000**        0.101
Independent            0.519     0.437     0.543          0.288          0.230
Took GCSEs             0.102     0.721     0.624          0.000**        0.443
Took A-levels          0.357     0.598     0.095          0.003**        0.291
Took IB                0.435     0.374     0.264          0.532          0.618
TSA Problem            0.356     0.084         -              -              -
TSA Critical           0.169     0.762         -              -              -
LNAT                       -         -     0.794              -              -
No. Open Applicants     1812      1970      1685          12952          12952

Sample contains all open applicants. Columns 1-4 display the p-values from regressions of candidate confounders on a full set of college dummies. Column 5 displays the p-values from regressions of candidate confounders on a full set of college dummies and course dummies. Significance at the 1 and 5 percent level is denoted by ** and *, respectively.
small number of significant F-statistics does not make randomisation implausible as there are many
candidate confounders. In expectation, p-values should be smaller than 0.05 in approximately 2 of
the 41 tests if the tests were independent (though these tests are not independent). Based on these
tests, and the way the open applicants are allocated to colleges, I believe that the allocation of open
applicants to colleges was random within subjects.
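The benchmark of roughly 2 rejections in 41 tests can be checked with a short calculation. The Python sketch below treats the 41 tests as independent draws under the null, which the text notes is only an approximation, so the numbers are a rough guide rather than an exact reference distribution.

```python
from math import comb

n_tests, alpha = 41, 0.05

# Expected number of p-values below alpha if every null hypothesis is true.
expected_rejections = n_tests * alpha

def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Probability of seeing one rejection or fewer, as observed in columns 1-3,
# under the simplifying assumption that the tests are independent.
p_at_most_one = prob_at_most(1, n_tests, alpha)

print(f"expected rejections: {expected_rejections:.2f}")
print(f"P(at most 1 rejection): {p_at_most_one:.3f}")
```

On this approximation, seeing a single p-value below 0.05 is an unremarkable outcome under random allocation.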
However, the results of column 4 are very different with the null hypothesis convincingly rejected on
multiple occasions. This is because column 4 pools open applicants across courses and colleges teach a
different range of courses. Thus the random assignment assumption does not hold for the All subjects
dataset and the results of Model 3 should be interpreted with caution for this dataset. Column 5,
which adds course dummy variables as controls, again supports the view that open applicants are
randomly assigned to colleges conditional on the course they applied for.
6 Results
6.1 Results for Norrington Table Plus and Selection on Observables
Tables 15-18 show regression results for Models 1 and 2 for PPE, E&M, Law and All Subjects
respectively. The college effect estimates are displayed and coefficients on control variables are
suppressed. Column 1 is the naïve Model 1 with no control variables. Model 2 in the second column
adds all the observable control variables. Our main interest is in the estimates of the coefficients on
college dummy variables.
The coefficients in column 1 can be interpreted as the average differences in (standardised) Prelims
results at various Oxford colleges, relative to students at the college with the highest mean Prelims
scores (the college with the highest mean Prelims scores is St John’s for PPE, Harris Manchester
for E&M, Magdalen for Law and St John’s again when all subjects are combined). For instance, in
Table 15 for PPE, the coefficient of −0.11 in the first row on University College (“UNIV”) can be
interpreted as saying that, on average, students at University College score 0.11 standard deviations
lower on PPE Prelims than students at St John’s.30 These differences in average Prelims scores
amongst students who matriculate at different Oxford colleges are statistically significant. At the
bottom of each table, for each model, I report the results of F-tests under the null hypothesis that
college effects are equal at all colleges. The results for column 1 show, very convincingly, that average
Prelims scores differ across colleges.31 This is my first result.
Result 1. There are statistically significant differences in unconditional Prelims results across
colleges.
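The interpretation of the column 1 coefficients can be illustrated with a small example. In a regression on a constant and college dummies only, each coefficient equals that college's mean outcome minus the mean at the omitted baseline college. The Python sketch below uses invented records and hypothetical college labels, and it standardises with the population standard deviation within each year, which is an assumption about a detail the text does not spell out.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (year, college, raw Prelims average).
records = [
    (2009, "X", 68.0), (2009, "Y", 62.0), (2009, "Z", 59.0), (2009, "Y", 64.0),
    (2010, "X", 70.0), (2010, "Y", 66.0), (2010, "Z", 61.0), (2010, "Z", 63.0),
]

# Standardise scores within each year (mean 0, standard deviation 1).
scores_by_year = defaultdict(list)
for year, _, score in records:
    scores_by_year[year].append(score)
year_stats = {y: (mean(s), pstdev(s)) for y, s in scores_by_year.items()}

z_by_college = defaultdict(list)
for year, college, score in records:
    mu, sd = year_stats[year]
    z_by_college[college].append((score - mu) / sd)

# With dummies only, OLS coefficients are differences in college means
# relative to the baseline (here the college with the highest mean).
college_means = {c: mean(z) for c, z in z_by_college.items()}
baseline = max(college_means, key=college_means.get)
gaps = {c: m - college_means[baseline] for c, m in college_means.items()}

for college, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{college}: {gap:+.2f}")
```

Choosing the top college as the baseline makes every other gap zero or negative, which is why the coefficients in column 1 of Tables 15-18 are all negative.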
Given Model 1 makes no adjustments for observable or unobservable differences in student char-
30 For Oxford-based readers more familiar with raw exam marks, this translates into University College PPE students scoring approximately 0.11 × 5.7 ≈ 0.63 raw marks lower in Prelims than PPE students at St John’s (given the standard deviation in Prelims Average for PPE is 5.7).
31 Some readers may ask why I am conducting statistical tests (and also why all standard errors are not equal to zero) when I am analysing the full population of students. For instance, Berk (2004) argues “If the data are a population, there is no sampling, no uncertainty because of sampling, and no need for statistical inference. Indeed, statistical inference makes no sense.” However, Abadie et al. (2014) show that uncertainty about causal effects rather than sampling justifies the use of standard errors in this context. Even if we observe the entire finite population – so we can estimate the value of regression coefficients in the population with no uncertainty – causal effects are uncertain because for each student, at most one of their potential outcomes is observed.
Table 15: Regressions: PPE

                (1)                 (2)
          Prelims Average     Prelims Average
College        β      SE            β      SE
UNIV       −0.11    (0.20)      0.08    (0.19)
ORIEL      −0.25    (0.19)     −0.08    (0.18)
HERT       −0.25    (0.21)     −0.15    (0.19)
BALL       −0.20    (0.20)     −0.20    (0.18)
BNC        −0.42    (0.21)     −0.23    (0.19)
REGENT     −0.54∗   (0.25)     −0.23    (0.23)
EXETER     −0.47∗   (0.21)     −0.27    (0.20)
JESUS      −0.54∗   (0.25)     −0.27    (0.22)
PEMB       −0.60∗∗  (0.20)     −0.27    (0.20)
SEH        −0.63∗∗  (0.22)     −0.29    (0.21)
S-HIL      −0.66∗∗  (0.21)     −0.31    (0.21)
MERT       −0.36    (0.23)     −0.31    (0.22)
NEW        −0.43∗   (0.20)     −0.31    (0.19)
S-PET      −0.61∗∗  (0.20)     −0.32    (0.19)
SOMER      −0.58∗∗  (0.19)     −0.32    (0.18)
MANS       −0.81∗∗  (0.19)     −0.39∗   (0.19)
LMH        −0.73∗∗  (0.21)     −0.39    (0.20)
MAGD       −0.48∗   (0.20)     −0.40∗   (0.20)
CCC        −0.65∗∗  (0.21)     −0.41∗   (0.20)
LINC       −0.62∗∗  (0.19)     −0.42∗   (0.18)
CH-CH      −0.59∗∗  (0.19)     −0.44∗   (0.19)
KEBLE      −0.63∗∗  (0.20)     −0.45∗   (0.19)
S-CATS     −0.63∗∗  (0.22)     −0.45∗   (0.20)
TRIN       −0.56∗   (0.24)     −0.48∗   (0.22)
WORC       −0.63∗∗  (0.23)     −0.50∗   (0.23)
WADH       −0.68∗∗  (0.21)     −0.53∗∗  (0.19)
H-MAN      −0.76∗∗  (0.27)     −0.54∗   (0.26)
S-ANNE     −0.75∗∗  (0.22)     −0.56∗∗  (0.20)
S-BEN      −0.93∗∗  (0.24)     −0.57∗   (0.23)
S-HUGH     −1.00∗∗  (0.26)     −0.58∗   (0.26)
QUEENS     −0.77∗∗  (0.21)     −0.58∗∗  (0.20)
BLACKF     −2.07∗   (0.98)     −1.89∗   (0.88)

Controls      No                 Yes
Prob > F   0.000               0.024
SD         0.176               0.114
Hausman                        0.224
R-squared  0.053               0.212
N           1391                1391

The baseline college is St John’s. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). ∗ p < 0.05, ∗∗ p < 0.01
Table 16: Regressions: E&M

                (1)                 (2)
          Prelims Average     Prelims Average
College        β      SE            β      SE
S-HIL      −0.59∗   (0.25)     −0.08    (0.24)
LMH        −0.68∗   (0.29)     −0.13    (0.27)
S-CATS     −0.67    (0.36)     −0.13    (0.33)
NEW        −0.68∗∗  (0.26)     −0.19    (0.29)
CH-CH      −0.72∗∗  (0.27)     −0.23    (0.29)
SEH        −0.82∗∗  (0.22)     −0.25    (0.23)
HERT       −0.84∗∗  (0.23)     −0.27    (0.25)
S-PET      −0.96∗∗  (0.24)     −0.36    (0.24)
S-HUGH     −0.62∗   (0.27)     −0.37    (0.24)
QUEENS     −0.98∗   (0.39)     −0.37    (0.27)
JESUS      −0.91∗∗  (0.26)     −0.36    (0.26)
EXETER     −0.99∗∗  (0.38)     −0.37    (0.37)
PEMB       −0.91∗∗  (0.23)     −0.40    (0.23)
KEBLE      −1.06∗∗  (0.24)     −0.51∗   (0.25)
WORC       −0.86∗∗  (0.26)     −0.51    (0.26)
BNC        −0.96∗∗  (0.25)     −0.52∗   (0.24)
S-JOHN     −0.88∗   (0.39)     −0.56    (0.35)
S-ANNE     −1.21∗∗  (0.28)     −0.69∗∗  (0.24)
TRIN       −1.11∗∗  (0.25)     −0.73∗   (0.32)
WADH       −1.20∗∗  (0.33)     −0.82∗∗  (0.29)
MERT       −1.13∗∗  (0.25)     −0.81∗∗  (0.26)
BALL       −1.59∗∗  (0.26)     −1.16∗∗  (0.30)

Controls      No                 Yes
Prob > F   0.000               0.011
SD         0.145               0.146
Hausman                        0.189
R-squared  0.063               0.352
N            516                 516

The baseline college is Harris Manchester. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). ∗ p < 0.05, ∗∗ p < 0.01
Table 17: Regressions: Law

                (1)                 (2)
          Prelims Average     Prelims Average
College        β      SE            β      SE
WORC       −0.04    (0.20)     −0.03    (0.20)
LMH        −0.25    (0.20)     −0.14    (0.19)
HERT       −0.26    (0.19)     −0.16    (0.18)
S-CATS     −0.35    (0.21)     −0.19    (0.20)
UNIV       −0.38    (0.20)     −0.22    (0.20)
MANS       −0.30    (0.22)     −0.22    (0.20)
BNC        −0.18    (0.20)     −0.23    (0.20)
S-ANNE     −0.38    (0.23)     −0.28    (0.22)
TRIN       −0.30    (0.23)     −0.29    (0.22)
SEH        −0.61∗∗  (0.21)     −0.33    (0.20)
H-MAN      −0.27    (0.26)     −0.34    (0.26)
MERT       −0.29    (0.24)     −0.36    (0.23)
LINC       −0.40∗   (0.20)     −0.42∗   (0.20)
PEMB       −0.60∗∗  (0.21)     −0.42∗   (0.19)
CCC        −0.65∗∗  (0.22)     −0.44∗   (0.21)
NEW        −0.56∗∗  (0.20)     −0.48∗   (0.19)
CH-CH      −0.63∗∗  (0.21)     −0.49∗   (0.21)
S-PET      −0.65∗∗  (0.20)     −0.51∗   (0.21)
S-HUGH     −0.55∗   (0.22)     −0.53∗∗  (0.20)
BALL       −0.58∗∗  (0.22)     −0.54∗   (0.21)
JESUS      −0.74∗∗  (0.19)     −0.56∗∗  (0.19)
WADH       −0.68∗∗  (0.20)     −0.57∗∗  (0.19)
S-HIL      −0.59∗∗  (0.20)     −0.58∗∗  (0.21)
S-JOHN     −0.63∗∗  (0.19)     −0.58∗∗  (0.19)
KEBLE      −0.77∗∗  (0.23)     −0.59∗∗  (0.22)
QUEENS     −0.60∗   (0.28)     −0.59∗   (0.29)
EXETER     −0.72∗∗  (0.23)     −0.61∗∗  (0.22)
REGENT     −0.81∗   (0.35)     −0.68    (0.35)
ORIEL      −0.79∗∗  (0.24)     −0.74∗∗  (0.22)
SOMER      −0.86∗∗  (0.24)     −0.77∗∗  (0.23)
GREYF      −1.58∗∗  (0.48)     −1.16∗   (0.58)

Controls      No                 Yes
Prob > F   0.000               0.004
SD         0.180               0.141
Hausman                        0.003
R-squared  0.057               0.166
N           1229                1229

The baseline college is Magdalen. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). ∗ p < 0.05, ∗∗ p < 0.01
Table 18: Regressions: All Subjects

                (1)                 (2)
          Prelims Average     Prelims Average
College        β      SE            β      SE
MAGD       −0.01    (0.07)     −0.04    (0.06)
BNC        −0.05    (0.06)     −0.06    (0.06)
NEW        −0.06    (0.06)     −0.07    (0.06)
MERT       −0.05    (0.07)     −0.09    (0.07)
UNIV       −0.06    (0.06)     −0.10    (0.06)
WORC       −0.13∗   (0.06)     −0.16∗∗  (0.06)
S-CATS     −0.25∗∗  (0.06)     −0.19∗∗  (0.06)
KEBLE      −0.22∗∗  (0.06)     −0.19∗∗  (0.06)
PEMB       −0.27∗∗  (0.07)     −0.19∗∗  (0.06)
BALL       −0.18∗∗  (0.06)     −0.20∗∗  (0.06)
S-ANNE     −0.24∗∗  (0.06)     −0.22∗∗  (0.06)
LINC       −0.19∗∗  (0.07)     −0.22∗∗  (0.06)
HERT       −0.25∗∗  (0.06)     −0.22∗∗  (0.06)
S-HIL      −0.28∗∗  (0.06)     −0.22∗∗  (0.06)
SEH        −0.29∗∗  (0.06)     −0.23∗∗  (0.06)
S-HUGH     −0.29∗∗  (0.06)     −0.24∗∗  (0.06)
JESUS      −0.24∗∗  (0.07)     −0.24∗∗  (0.06)
LMH        −0.29∗∗  (0.06)     −0.25∗∗  (0.06)
MANS       −0.26∗∗  (0.07)     −0.25∗∗  (0.07)
CH-CH      −0.31∗∗  (0.06)     −0.27∗∗  (0.06)
TRIN       −0.21∗∗  (0.07)     −0.29∗∗  (0.06)
S-PET      −0.37∗∗  (0.07)     −0.29∗∗  (0.06)
WADH       −0.29∗∗  (0.06)     −0.30∗∗  (0.06)
S-BEN      −0.37∗∗  (0.12)     −0.30∗   (0.12)
REGENT     −0.47∗∗  (0.09)     −0.32∗∗  (0.09)
ORIEL      −0.33∗∗  (0.07)     −0.33∗∗  (0.07)
CCC        −0.33∗∗  (0.07)     −0.33∗∗  (0.07)
SOMER      −0.39∗∗  (0.06)     −0.34∗∗  (0.06)
H-MAN      −0.31∗∗  (0.12)     −0.35∗∗  (0.11)
EXETER     −0.36∗∗  (0.07)     −0.35∗∗  (0.07)
QUEENS     −0.43∗∗  (0.07)     −0.37∗∗  (0.06)
BLACKF     −2.13    (1.26)     −2.28    (1.22)

Controls      No                 Yes
Prob > F   0.000               0.000
SD         0.112               0.089
Hausman                        1.000
R-squared  0.015               0.126
N          14426               14426

The baseline college is St John’s. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). ∗ p < 0.05, ∗∗ p < 0.01
Figure 2: College Ranking by Course: Norrington Table Plus vs Selection on Observables
[Figure: four scatter plots of Model 2 Rank (vertical axis) against Model 1 Rank (horizontal axis), one labelled point per college. Panels and reported correlations: PPE (0.95), E&M (0.88), Law (0.95), All Subjects (0.99).]
acteristics across colleges, the ranking of colleges provided in column 1 is biased in favour of colleges
that receive intakes of high ability students (relative to other colleges) because part of the effect of
ability is attributed to the impact of the college.
As we move from column 1 to column 2 the coefficients on college dummies tend to shrink – when
controls are added, gaps in average Prelims scores between colleges decline. For instance, in PPE in
31 out of 32 colleges, the coefficients in column 2 are smaller in magnitude than in column 1 (the
exception is Balliol College where the coefficient remains at −0.20). The coefficients in column 1
sum to −19.9 while the coefficients in column 2 sum to −13.0. Thus differences between coefficients
decline by approximately 35%. This trend could be explained by students selecting into colleges
according to ability; students with high observable ability are more likely to attend effective colleges.
In particular, St John’s has PPE students with higher observable ability than PPE students at any
other college except Balliol. Therefore, controlling for observable ability reduces the disparity in
Prelims scores across colleges, bringing the estimates closer to the true causal effects of attending
particular colleges. However, even in Model 2, differences between colleges are statistically significant
at the 5% level in each dataset – the second result.
Result 2. Controlling for selection on observables reduces differences in Prelims results across
colleges but they remain statistically significant.
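The figure of approximately 35% quoted above for PPE follows directly from the two coefficient sums. The short Python check below uses only the sums reported in the text and the 32 non-baseline colleges in Table 15. The per-college averages it prints are a derived illustration and are not figures reported in the thesis.

```python
# Sums of the PPE college coefficients, as reported in the text.
sum_model1 = -19.9  # column 1: no controls
sum_model2 = -13.0  # column 2: observable controls added
n_colleges = 32     # non-baseline colleges in Table 15

# Average gap to the baseline college (St John's) under each model.
avg_gap_model1 = sum_model1 / n_colleges
avg_gap_model2 = sum_model2 / n_colleges

# Proportional fall in the total gap once controls are added.
relative_decline = (sum_model1 - sum_model2) / sum_model1

print(f"average gap, Model 1: {avg_gap_model1:.2f}")
print(f"average gap, Model 2: {avg_gap_model2:.2f}")
print(f"relative decline: {relative_decline:.1%}")
```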
The finding that colleges are statistically significant determinants of Prelims results, does not
provide information about the practical significance of colleges. College effectiveness is practically
significant if some colleges are substantially more effective than others. One way to measure this is
to look at the standard deviation of college effectiveness (the college “effect size”), which indicates
how much adjusted Prelims results differ across colleges. I report the standard deviation of college
effects using a method proposed by Nye et al. (2004), though unlike Nye et al., I also adjust for
estimation error, an important addition.32 Nye et al. recommend calculating two regressions. One
is a regression of Prelims results on only student characteristics yielding a multiple correlation $R_1^2$.
The second regression is Prelims results on the same student characteristics but it also includes a
set of college dummy variables, yielding a multiple correlation $R_2^2$. The difference between the two
regressions in variance accounted for (the change in $R^2$ value or $\Delta R^2 = R_2^2 - R_1^2$) represents the
proportion of variance in (residualised) Prelims results accounted for by college effects. If we regard
the $\Delta R^2$ as the variance accounted for by college effectiveness, then the square root of $\Delta R^2$, namely
$\Delta R$, can be interpreted as the standard deviation of college effectiveness. However, Nye et al.’s
method gives an estimator of the standard deviation of college effects that is biased upwards due to
estimation error. The problem is that $R^2$ (weakly) increases whenever the college dummy variables
are added to the second regression, even if their true coefficients are zero. Given there are over 30
colleges, this bias may be large. The change I make is to use adjusted $R^2$ in place of the simple $R^2$
used by Nye et al. I report the results at the bottom of each column.
32 Various other methods can be used to estimate the standard deviation of college effectiveness (Aaronson et al., 2007; Koedel, 2009; Guarino et al., 2015). Guarino et al.’s method gives almost identical estimates to Nye et al.’s method, but neither adjusts for estimation error. Aaronson et al. (2007) and Koedel (2009) do account for estimation error but estimates are very sensitive to the choice of baseline college.
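The two-regression calculation described above can be illustrated with a minimal Python sketch. It is not the code used for the estimates in this thesis: the data are simulated, there is a single hypothetical ability control, and the names `college_effect_sd` and `simulate` are assumptions introduced for illustration. The sketch returns both the estimate based on simple $R^2$ and the one based on adjusted $R^2$.

```python
import math
import random


def demean(values, groups=None):
    """Subtract the overall mean, or each group's mean when groups are given."""
    if groups is None:
        groups = [0] * len(values)
    total, count = {}, {}
    for v, g in zip(values, groups):
        total[g] = total.get(g, 0.0) + v
        count[g] = count.get(g, 0) + 1
    return [v - total[g] / count[g] for v, g in zip(values, groups)]


def residual_ss(y, x):
    """Residual sum of squares from regressing demeaned y on demeaned x."""
    sxx = sum(b * b for b in x)
    sxy = sum(a * b for a, b in zip(y, x))
    syy = sum(a * a for a in y)
    return syy - sxy * sxy / sxx


def college_effect_sd(score, ability, college):
    """Return (raw, adjusted) estimates of the SD of college effects."""
    n = len(score)
    n_colleges = len(set(college))
    sst = sum(v * v for v in demean(score))
    # Regression 1: intercept + ability control (2 parameters).
    ssr1 = residual_ss(demean(score), demean(ability))
    # Regression 2: ability control + one dummy per college
    # (n_colleges + 1 parameters). By the Frisch-Waugh-Lovell theorem its
    # residuals match those from regressing within-college demeaned score
    # on within-college demeaned ability.
    ssr2 = residual_ss(demean(score, college), demean(ability, college))
    raw_delta = (1 - ssr2 / sst) - (1 - ssr1 / sst)
    adj_r2_first = 1 - (ssr1 / (n - 2)) / (sst / (n - 1))
    adj_r2_second = 1 - (ssr2 / (n - n_colleges - 1)) / (sst / (n - 1))
    adj_delta = adj_r2_second - adj_r2_first
    # A negative change is treated as zero variation across colleges.
    return math.sqrt(max(raw_delta, 0.0)), math.sqrt(max(adj_delta, 0.0))


def simulate(true_sd, n_colleges=30, per_college=20, seed=0):
    """Simulated students whose scores depend on ability and a college effect."""
    rng = random.Random(seed)
    score, ability, college = [], [], []
    for j in range(n_colleges):
        effect = rng.gauss(0.0, true_sd)
        for _ in range(per_college):
            a = rng.gauss(0.0, 1.0)
            ability.append(a)
            college.append(j)
            score.append(0.5 * a + effect + rng.gauss(0.0, 1.0))
    return score, ability, college


for true_sd in (0.0, 0.5):
    raw_sd, adj_sd = college_effect_sd(*simulate(true_sd))
    print(f"true SD {true_sd}: raw {raw_sd:.2f}, adjusted {adj_sd:.2f}")
```

With no true college effects, the estimate based on simple $R^2$ should still come out well above zero in this simulation, which is the upward bias discussed above. The estimates are expressed as a share of the standard deviation of the (unstandardised) simulated score.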
Several points about the results are worth noting. First, accounting for estimation error using
adjusted $R^2$ is important. It dramatically reduces the standard deviation estimates, particularly for
E&M which has a smaller number of students per college than the other datasets.33 Other studies
have also found accounting for estimation error can be important (Aaronson et al., 2007).
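The placebo check described in footnote 33 can be sketched in the same spirit. The example below is a simplified illustration under stated assumptions rather than the procedure actually run: scores are simulated, no ability controls are included (so the first regression contains only an intercept), and `sd_estimates` and `placebo_check` are hypothetical names.

```python
import math
import random
import statistics


def sd_estimates(score, college):
    """Raw and adjusted-R^2 estimates of the SD of college effects (no controls)."""
    n = len(score)
    grand_mean = statistics.fmean(score)
    sst = sum((s - grand_mean) ** 2 for s in score)
    by_college = {}
    for s, c in zip(score, college):
        by_college.setdefault(c, []).append(s)
    # Residual sum of squares from a regression on college dummies alone.
    ssr = 0.0
    for group in by_college.values():
        group_mean = statistics.fmean(group)
        ssr += sum((s - group_mean) ** 2 for s in group)
    k = len(by_college)  # one parameter per college dummy
    raw_r2 = 1 - ssr / sst
    adj_r2 = 1 - (ssr / (n - k)) / (sst / (n - 1))
    # The comparison regression has only an intercept, so both of its R^2
    # values are zero and the change equals the second regression's value.
    return math.sqrt(max(raw_r2, 0.0)), math.sqrt(max(adj_r2, 0.0))


def placebo_check(n_students=600, n_colleges=30, replications=100, seed=1):
    """Average raw and adjusted SD estimates under random college assignment."""
    rng = random.Random(seed)
    score = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    raw, adjusted = [], []
    for _ in range(replications):
        # Random assignment: placebo colleges cannot affect scores.
        placebo = [rng.randrange(n_colleges) for _ in range(n_students)]
        r, a = sd_estimates(score, placebo)
        raw.append(r)
        adjusted.append(a)
    share_zero = sum(a == 0.0 for a in adjusted) / replications
    return statistics.fmean(raw), statistics.fmean(adjusted), share_zero


mean_raw, mean_adj, share_zero = placebo_check()
print(f"mean raw SD {mean_raw:.2f}, mean adjusted SD {mean_adj:.2f}, "
      f"share of adjusted estimates equal to zero {share_zero:.0%}")
```

Because assignment is random, any non-zero estimate here reflects estimation error alone, so the gap between the two averages gives a rough sense of how much the adjustment removes.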
Second, as we move from column 1 to column 2, the standard deviation of college effects falls
slightly. For instance, in PPE the standard deviation of college effects falls from 0.18 in column 1
to 0.11 in column 2. Thus the variation associated with colleges drops as more controls are added,
again reflecting sorting into colleges by ability.
Third, in column 2 the standard deviation of college effects across courses ranges from 0.11 in PPE
to 0.15 in E&M. Differences across courses would be expected if there are differences across courses
in the sensitivity of exam results to teaching. The standard deviation of college effectiveness in the
All Subjects dataset is 0.09 which is lower than in PPE, E&M or Law. This could be because college
effectiveness is imperfectly correlated across courses so the true variation in college effectiveness is
underestimated. Alternatively, exam results in E&M, PPE and Law may be more sensitive to college
teaching than other courses.
Fourth, by most standards, these college effects are moderate in size and are large enough to have
policy significance. For example, for PPE, a standard deviation in college effectiveness of 0.11 says
that a one standard deviation increase in college effectiveness should increase Prelims scores by 0.11
standard deviations. If college effects are normally distributed, these findings would suggest that
the difference in Prelims average between having a 25th percentile college (a not so effective college)
and a 75th percentile college (an effective college) is 0.15 of a standard deviation in Prelims.34 This
would move a student at the middle of the exam result distribution to the 56th percentile. The US
Department of Education defines 0.25 as an effect that is “substantially important” (Seftor et al.,
2011) but what determines whether an effect size is large or small is often context dependent (Hill
et al., 2008). A college effect size of 0.11 can be compared to gaps in Prelims results by demographic
groups. For instance, it is smaller than the raw achievement gap between males and females at Oxford
(0.16 standard deviations) and the raw achievement gap between international and home students
(0.21) but larger than the raw achievement gap between independent school and state school students
(0.05). In PPE, a standard deviation improvement in college effectiveness has a ceteris paribus impact
on Prelims results that is larger than an extra 2 A*s at GCSE35 and comparable to an extra 10 marks
on the TSA. As noted in the introduction, these college effect sizes are also comparable to the effect
of teachers and schools. Thus student achievement could be improved if colleges on the lower end
moved up modestly in the distribution of college effectiveness.
33 I test whether using adjusted $R^2$ is successful in removing estimation error. To do this, I create “placebo colleges” and then randomly assign Oxford students to these colleges and repeat the analysis. That is, I create dummy variables for each placebo college and use them instead of dummy variables for real colleges. Since students are randomly assigned a placebo college, the true standard deviation of placebo college effectiveness should be zero. Of 100 placebo college effectiveness standard deviation estimates I produce for each dataset, over 60% of estimates were identically zero (the adjusted $R^2$ in the second regression was lower than the adjusted $R^2$ in the first regression). This suggests that estimation error is no longer an issue when I use adjusted $R^2$. In contrast, when I use placebo colleges and simple $R^2$, I obtain average values from 100 replications of 0.17 for E&M, 0.16 for Law, 0.14 for PPE and 0.04 for All Subjects, implying large estimation error.
34 The college effect standard deviation in PPE is 0.1136948. The difference between the 25th and 75th percentiles
Finally, a college effect size of 0.11-0.15 implies only 1-3% of the variance in Prelims results is
associated with variation in college effectiveness. Thus although variation in college performance is
non-negligible, the difference in mean Prelims performance between the best- and worst-performing
colleges is not nearly as large as the difference in performance between the best and worst students
in the typical college. The majority of variation in exam results is within, not between, colleges. I
can now state results 3 and 4.
Result 3. Differences in college effectiveness estimates based on selection on observables are
practically significant.
Result 4. The vast majority of variation in Prelims results is within colleges not between colleges.
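The arithmetic behind these statements (the 0.15 gap between a 25th and a 75th percentile college, the move to the 56th percentile, and the share of variance associated with colleges) can be checked with a short Python sketch. It assumes, as the text does, that college effects are normally distributed, and it treats the squared effect size as the variance share; the variable names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

college_sd = 0.1136948  # PPE college effect SD reported in footnote 34

# Distance between the 25th and 75th percentiles of a standard normal.
iqr_width = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)

# Gap in Prelims average (in student-level SDs) between colleges at the
# 75th and 25th percentiles of the college effect distribution.
gap = iqr_width * college_sd

# Percentile reached by a median student whose score rises by that gap.
new_percentile = 100 * std_normal.cdf(gap)

# Share of Prelims variance associated with colleges: the squared effect size.
variance_share = {sd: sd ** 2 for sd in (0.11, 0.15)}

print(f"gap = {gap:.2f} SD, new percentile = {new_percentile:.0f}")
for sd, share in variance_share.items():
    print(f"effect size {sd}: variance share {share:.4f}")
```

The variance shares computed this way fall inside the 1-3% range quoted above.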
At the bottom of column 2, I report the results of a robust Hausman test. I implement it using
the Stata command rhausman and 50,000 replications (see Cameron and Trivedi (2005, p. 718) and
Kaiser et al. (2014) for more details). The robust Hausman test can be used in the presence of
heteroskedasticity in the error term unlike the traditional Hausman test, which makes the auxiliary
assumption that the random effects estimator is asymptotically efficient under the null hypothesis.
The robust Hausman test strongly rejects the null hypothesis that the random effects assumption
holds for the Law dataset but does not reject at the 5% level for E&M, PPE or All Subjects. Thus
the choice of modelling college effects as fixed effects rather than random effects seems important for
Law but less so for PPE, E&M and All Subjects. The Hausman test also has some power to detect
violations of the selection on observables assumption since the Hausman test would be misspecified
if the selection on observables assumption were violated. In this case, the random effects estimator
and the fixed effects estimator typically have different probability limits so the Hausman test may
reject the null because selection on observables is violated. The failure to reject for PPE, E&M and
All Subjects thus provides some encouragement with regard to the selection on observables assumption.
34 (continued) of the standard normal distribution is 1.34 standard deviations, so the difference in Prelims average between a 25th and 75th percentile college is (1.34)(0.1136948) ≈ 0.15.
35 The effect size of moving from Band 3 (with 7.1 A*s on average) to Band 2 (with 9.2 A*s on average) is 0.075.
The $R^2$ for Model 2 across courses ranges from 17 percent for Law to 35 percent for E&M.
Given that the goodness-of-fit measures typically reported by applied researchers working with cross-
sectional data (e.g. Mincer equations) are only 5 percent, this suggests we can explain a significant
proportion of variation in student Prelims results. Following Oster (2013), this also suggests that
unobservables have more potential to bias Law college effectiveness estimates than E&M college
effectiveness estimates.
The regression estimates can be used to form college rankings. Figure 2 shows college effects are
strongly positively correlated across Models 1 and 2. However, the high positive correlations conceal
moderate mean absolute movement in college rankings. More dispersion around the 45-degree line
implies more variation in college rankings. Both tails of the original distributions lie relatively close
to the 45-degree line, but there are big movers elsewhere in the distribution. Even though regression
coefficient changes between Models 1 and 2 are large, ranking changes are modest because students
sort into colleges partly based on observable ability.
Result 5. College rankings change moderately when adjusted for selection on observables.
However, college rankings should acknowledge uncertainty by using the appropriate level of stat-
istical significance. It is tempting to look at the regression results and search for colleges where the
standard p-value is less than 0.05 (in Tables 15-18 these are starred college coefficients) and conclude
that these colleges are statistically worse than the baseline college at the 5% level. However, we must
be careful about making comparisons like this if we are devising hypotheses having already observed
the data, so are, in effect, performing multiple hypothesis tests. To gauge statistical significance, we
want to avoid “data snooping” – basing inference on individual p-values without taking the multitude
of tests into account. Data snooping would likely lead us to falsely declare some pairs of colleges as
significantly different (see Afshartous and Wolf (2007) for a detailed discussion of data snooping and
methods to avoid it). To account for multiple comparisons I define a new (lower) critical value for
hypothesis tests using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) and set the
false discovery rate (the proportion of significant results that are actually false positives) to 5%.36
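For readers unfamiliar with the procedure, the sketch below implements the standard step-up rule from Benjamini and Hochberg (1995), in which the p-value of rank i (smallest first) is compared with i × q / m for false discovery rate q and m tests. This is an assumption about the textbook form only: footnote 36 writes the critical values differently, and the sketch does not attempt to reproduce the exact calculation used here. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which tests are significant.

    Standard step-up rule: find the largest rank i (p-values sorted from
    smallest to largest) with p_(i) <= i * fdr / m, then declare that test
    and every test with a smaller p-value significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    cutoff_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * fdr / m:
            cutoff_rank = rank
    significant = [False] * m
    for rank, j in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[j] = True
    return significant


# Made-up p-values for three pairwise comparisons: the smallest fails its own
# critical value but is still declared significant by the step-up rule.
print(benjamini_hochberg([0.02, 0.03, 0.035]))
```

The example illustrates the point made in footnote 36 that p-values smaller than the largest significant one are also significant, even when they exceed their own critical value.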
Using Benjamini-Hochberg critical values for the All Subjects dataset, 118 of the 528 pairwise
college effectiveness comparisons were statistically significant. However, none of the pairwise com-
parisons of colleges are statistically significant for PPE, Law or E&M (tables not reported). Thus
in these three courses, the top ranked college is not statistically significantly better than the bottom
ranked college! Estimation errors are large because colleges are only observed with relatively small
numbers of students, even after pooling over multiple years. The uncertainty undermines the use of
course-specific league tables to rank colleges. Therefore although the results provide strong evidence
that colleges do matter, both statistically and practically, the sample sizes are not large enough to
say with much certainty that college A is better than college B for a given course.
Result 6. Course-specific college rankings have large confidence intervals and cannot distinguish
between the majority of colleges.
Table 19 shows college effects are not strongly correlated across courses. Indeed PPE college
effects are negatively correlated with E&M college effects and are uncorrelated with Law college
effects. This finding is emphasised in Figure 3 which presents scatter plots of Model 2 rankings
across courses. Colleges appear to have strengths in teaching different subjects. As already discussed,
further evidence consistent with this interpretation is that the standard deviation of college effectiveness
in the All Subjects dataset is lower than in PPE, E&M or Law.
Result 7. College effectiveness rankings differ between courses.
36 The method works as follows. Put individual p-values in order, from smallest to largest. The smallest p-value has a rank of i=1, the next smallest has i=2, etc. Compare each individual p-value to its Benjamini-Hochberg critical value, $\frac{(m - i + 1) \cdot 0.05}{2m}$, where i is the rank, m is the total number of tests, and 0.05 is the false discovery rate. The largest p-value that has $p < \frac{(m - i + 1) \cdot 0.05}{2m}$ is significant, and all of the p-values smaller than it are also significant, even the ones that aren’t less than their Benjamini-Hochberg critical value.
Figure 3: Comparison of Selection on Observables College Ranking across Courses
[Three scatter plots of colleges, labelled by college abbreviation. Panel “Model 2 Rank: E&M vs PPE”: PPE Model 2 Rank against E&M Model 2 Rank. Panel “Model 2 Rank: Law vs PPE”: PPE Model 2 Rank against Law Model 2 Rank. Panel “Model 2 Rank: E&M vs Law”: Law Model 2 Rank against E&M Model 2 Rank.]
Table 19: Correlation in College Effects across Courses
                      Model 1  Model 2
PPE vs E&M             -0.37    -0.10
PPE vs Law             -0.00    -0.05
PPE vs All Subjects     0.84     0.88
E&M vs Law              0.30     0.22
E&M vs All Subjects    -0.17    -0.24
Law vs All Subjects     0.48     0.39
6.2 Robustness Checks for Norrington Table and Selection on Observables
In this subsection I consider several robustness checks. First, I examine how using different outcome
variables alters college effect estimates. Second, I examine results under monotonic transformations
of outcome variables. This helps to assess the interval scale metric assumption. Third, I consider
whether there is any evidence that college effects differ for different types of student.
6.2.1 Alternative Outcome Variables
I regress alternative (standardised) outcome variables on the observable ability controls from Model
2. The results are shown in Tables 20-22. The outcome variables in columns 1-3 are scores in indi-
vidual Prelims papers. These are: Introductory Politics, Introductory Philosophy and Introductory
Economics for PPE; General Management, Financial Management and Introductory Economics for
E&M; Roman Law, Constitutional Law and Criminal Law for Law. Column 4 repeats the Prelims
Average results from earlier. Column 5 also uses Prelims Average as an outcome variable but uses
a restricted sample of students who also have Finals scores which allows easy comparison to column
6 which has Finals Average as the outcome variable. College effects remain jointly statistically sig-
nificant at the 1% level in the majority of cases. The standard deviation of college effectiveness is
larger for Prelims than for Finals for E&M and Law and similar for PPE, though the results are not
precisely estimated due to the smaller sample of Finals students. The larger Prelims effects may arise because students
require more guidance during their first year at Oxford when they are finding their feet but become
more independent in later years which would imply Prelims results were more sensitive to instruction
than Finals results. In addition, more teaching takes place inside colleges in the first year relative to
later years.
College effects are positively correlated under different dependent variables. The correlations
between Prelims Average and Finals Average college rankings are high: 0.50 for PPE, 0.53 for E&M
and 0.78 for Law. Correlations between the three first year paper rankings are similar, between 0.36
and 0.75 in PPE, between 0.29 and 0.53 for E&M and between 0.35 and 0.60 for Law. Correlations
across alternative outcome variables are thus clearly positive which is consistent with there being an
underlying generalisable within-course college effectiveness component embodied in the measures.
Result 8. College effectiveness is positively correlated across Prelims papers and between Prelims
and Finals results.
Table 20: Alternative Dependent Variable Regressions: PPE
          (1)             (2)             (3)             (4)             (5)             (6)
          Philosophy      Politics        Economics       Prelims Avg     Prelims Avg     Finals Avg
          β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)
UNIV      0.20   (0.22)   0.23   (0.20)  −0.15   (0.17)   0.07   (0.19)   0.07   (0.27)  −0.14   (0.11)
ORIEL     0.30   (0.21)   0.01   (0.21)  −0.43∗  (0.17)  −0.10   (0.18)  −0.37   (0.25)  −0.19   (0.11)
HERT      0.37   (0.22)  −0.12   (0.23)  −0.50∗∗ (0.18)  −0.15   (0.19)  −0.48   (0.27)  −0.32∗  (0.13)
BALL      0.05   (0.22)   0.23   (0.19)  −0.54∗∗ (0.16)  −0.21   (0.19)  −0.21   (0.25)  −0.01   (0.10)
BNC       0.26   (0.21)  −0.13   (0.22)  −0.56∗∗ (0.18)  −0.24   (0.19)  −0.37   (0.29)  −0.10   (0.11)
REGENT    0.02   (0.26)   0.08   (0.31)  −0.50∗  (0.22)  −0.24   (0.23)  −0.16   (0.29)  −0.10   (0.11)
EXETER   −0.05   (0.22)  −0.01   (0.21)  −0.45∗  (0.21)  −0.27   (0.20)  −0.47   (0.27)  −0.05   (0.11)
JESUS    −0.05   (0.22)   0.12   (0.22)  −0.51∗  (0.21)  −0.29   (0.22)  −0.90∗∗ (0.30)  −0.15   (0.11)
PEMB     −0.14   (0.21)  −0.41   (0.25)  −0.14   (0.17)  −0.29   (0.20)  −0.43   (0.33)  −0.19   (0.10)
SEH      −0.06   (0.24)  −0.08   (0.24)  −0.45∗  (0.21)  −0.30   (0.21)  −0.35   (0.34)  −0.20   (0.12)
S-HIL    −0.16   (0.23)   0.29   (0.22)  −0.62∗∗ (0.18)  −0.32   (0.21)  −0.43   (0.27)  −0.23∗  (0.10)
MERT     −0.00   (0.25)  −0.16   (0.24)  −0.48∗  (0.19)  −0.32   (0.22)  −0.48   (0.30)  −0.10   (0.10)
NEW      −0.16   (0.21)  −0.03   (0.21)  −0.46∗∗ (0.18)  −0.32   (0.19)  −0.44   (0.25)  −0.10   (0.10)
S-PET    −0.30   (0.22)   0.04   (0.22)  −0.38∗  (0.18)  −0.33   (0.19)  −0.55∗  (0.25)  −0.24∗  (0.10)
SOMER    −0.06   (0.21)  −0.27   (0.22)  −0.40∗  (0.16)  −0.34   (0.18)  −0.51∗  (0.23)  −0.04   (0.09)
MANS     −0.14   (0.22)   0.02   (0.23)  −0.57∗∗ (0.18)  −0.38∗  (0.19)  −0.46   (0.26)  −0.23   (0.12)
LMH       0.00   (0.22)  −0.32   (0.22)  −0.55∗∗ (0.19)  −0.41∗  (0.20)  −0.61∗  (0.31)  −0.28∗  (0.14)
MAGD     −0.13   (0.23)  −0.09   (0.23)  −0.55∗∗ (0.18)  −0.41∗  (0.20)  −0.69∗  (0.28)  −0.20∗  (0.10)
CCC      −0.23   (0.24)  −0.06   (0.21)  −0.55∗∗ (0.18)  −0.42∗  (0.20)  −0.41   (0.32)  −0.26   (0.16)
LINC     −0.08   (0.21)  −0.25   (0.21)  −0.57∗∗ (0.17)  −0.43∗  (0.18)  −0.73∗∗ (0.24)  −0.08   (0.11)
CH-CH    −0.37   (0.22)  −0.11   (0.20)  −0.45∗  (0.18)  −0.45∗  (0.19)  −0.72∗∗ (0.25)  −0.20∗  (0.10)
KEBLE    −0.17   (0.22)  −0.32   (0.22)  −0.51∗∗ (0.18)  −0.46∗  (0.19)  −0.75∗∗ (0.28)  −0.22∗  (0.10)
S-CATS   −0.37   (0.24)  −0.23   (0.22)  −0.40∗  (0.18)  −0.46∗  (0.20)  −0.54   (0.27)  −0.13   (0.11)
TRIN     −0.05   (0.26)  −0.25   (0.24)  −0.70∗∗ (0.23)  −0.49∗  (0.22)  −1.22∗∗ (0.31)  −0.34∗∗ (0.12)
WORC     −0.12   (0.27)  −0.20   (0.26)  −0.72∗∗ (0.18)  −0.51∗  (0.23)  −0.53∗  (0.26)  −0.33∗∗ (0.11)
WADH     −0.28   (0.22)  −0.02   (0.20)  −0.77∗∗ (0.19)  −0.55∗∗ (0.19)  −1.06∗∗ (0.25)  −0.27∗∗ (0.10)
H-MAN    −0.28   (0.29)  −0.12   (0.25)  −0.71∗∗ (0.22)  −0.55∗  (0.26)  −0.70   (0.36)  −0.09   (0.11)
S-ANNE   −0.25   (0.23)  −0.02   (0.22)  −0.81∗∗ (0.19)  −0.57∗∗ (0.20)  −0.95∗∗ (0.31)  −0.30∗∗ (0.11)
S-BEN    −0.33   (0.26)  −0.17   (0.30)  −0.66∗∗ (0.22)  −0.57∗  (0.23)  −0.77∗  (0.31)  −0.19   (0.11)
S-HUGH   −0.26   (0.26)  −0.26   (0.27)  −0.69∗∗ (0.24)  −0.60∗  (0.27)  −0.93∗  (0.37)  −0.42∗∗ (0.13)
QUEENS   −0.13   (0.21)  −0.24   (0.21)  −0.84∗∗ (0.19)  −0.60∗∗ (0.20)  −0.75∗∗ (0.27)  −0.42∗∗ (0.15)
BLACKF   −1.77∗  (0.69)  −0.48   (0.83)  −1.77∗  (0.77)  −1.89∗  (0.87)  −1.16∗∗ (0.36)  −0.17   (0.10)
Controls  Yes             Yes             Yes             Yes             Yes             Yes
Prob>F    0.001           0.015           0.003           0.023           0.000           0.003
SD        0.146           0.091           0.128           0.114           0.186           0.199
R-squared 0.162           0.116           0.185           0.211           0.260           0.267
N         1391            1391            1391            1391            660             660
The baseline college is St John’s college. All dependent variables are standardised. Columns 1-4 use the sample of enrolled PPE students. Columns 5-6 use a reduced sample of PPE students with Finals results. Colleges are ordered based on the coefficients in column 4. Standard errors are heteroskedasticity robust. Prob>F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). ∗ p<0.05, ∗∗ p<0.01
Table 21: Alternative Dependent Variable Regressions: E&M
          (1)             (2)             (3)             (4)             (5)             (6)
          General         Financial
          Management      Management      Economics       Prelims Avg     Prelims Avg     Finals Avg
          β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)
S-HIL    −0.11   (0.33)  −0.20   (0.27)  −0.05   (0.29)  −0.08   (0.24)   0.89   (0.55)   1.12∗  (0.44)
LMH      −0.19   (0.33)  −0.40   (0.32)   0.19   (0.29)  −0.13   (0.27)  −0.24   (0.51)   0.72   (0.61)
S-CATS    0.15   (0.40)  −0.67∗  (0.32)   0.14   (0.41)  −0.13   (0.33)  −0.00   (0.79)   0.68   (0.92)
NEW      −0.22   (0.33)  −0.56   (0.31)   0.08   (0.32)  −0.19   (0.29)  −0.19   (0.63)   0.47   (0.53)
CH-CH     0.00   (0.36)  −0.68∗  (0.28)  −0.01   (0.33)  −0.23   (0.29)   0.07   (0.60)   0.72   (0.37)
SEH      −0.06   (0.33)  −0.61∗  (0.25)  −0.04   (0.28)  −0.25   (0.23)  −0.08   (0.59)   0.21   (0.49)
HERT     −0.01   (0.29)  −0.50   (0.27)  −0.09   (0.29)  −0.27   (0.25)  −0.79   (0.59)   0.42   (0.44)
S-PET    −0.40   (0.31)  −0.68∗  (0.27)  −0.03   (0.28)  −0.36   (0.24)  −0.32   (0.53)   0.48   (0.42)
S-HUGH   −0.49   (0.31)  −0.43   (0.27)  −0.14   (0.28)  −0.37   (0.24)  −0.22   (0.52)   0.44   (0.44)
QUEENS   −0.06   (0.41)  −0.31   (0.41)  −0.31   (0.30)  −0.37   (0.27)  −0.74   (0.48)   0.42   (0.52)
JESUS    −0.26   (0.34)  −0.64∗  (0.29)  −0.12   (0.31)  −0.36   (0.26)  −0.73   (0.60)   0.64   (0.53)
EXETER   −0.03   (0.34)  −0.53   (0.39)  −0.29   (0.39)  −0.37   (0.37)   0.53   (0.76)   0.88∗  (0.43)
PEMB     −0.40   (0.29)  −0.66∗  (0.26)  −0.11   (0.27)  −0.40   (0.23)  −0.08   (0.50)   0.40   (0.47)
KEBLE    −0.38   (0.29)  −0.58∗  (0.25)  −0.32   (0.29)  −0.51∗  (0.25)  −0.99∗  (0.47)   0.13   (0.31)
WORC     −0.43   (0.31)  −0.54   (0.30)  −0.41   (0.31)  −0.51   (0.26)  −0.95   (0.55)   0.41   (0.34)
BNC      −0.53   (0.33)  −0.60∗  (0.27)  −0.20   (0.28)  −0.52∗  (0.24)  −0.66   (0.47)   0.62   (0.34)
S-JOHN   −0.71∗  (0.35)  −0.68   (0.42)  −0.18   (0.39)  −0.56   (0.35)  −1.11∗  (0.55)   0.31   (0.54)
S-ANNE   −0.22   (0.34)  −1.00∗∗ (0.30)  −0.41   (0.29)  −0.69∗∗ (0.24)  −0.64   (0.51)   0.20   (0.30)
TRIN     −0.22   (0.28)  −1.17∗∗ (0.35)  −0.41   (0.36)  −0.73∗  (0.32)  −0.76   (0.49)   0.83∗  (0.38)
WADH     −0.34   (0.40)  −1.01∗∗ (0.29)  −0.51   (0.36)  −0.82∗∗ (0.29)  −1.29   (0.78)   0.34   (0.49)
MERT     −0.49   (0.34)  −0.75∗∗ (0.27)  −0.74∗  (0.32)  −0.81∗∗ (0.26)  −1.23∗∗ (0.45)   0.17   (0.30)
BALL     −0.41   (0.34)  −0.98∗∗ (0.32)  −1.26∗∗ (0.34)  −1.16∗∗ (0.30)  −2.06∗∗ (0.57)   0.32   (0.51)
Controls  Yes             Yes             Yes             Yes             Yes             Yes
Prob>F    0.465           0.093           0.002           0.011           0.000           0.240
SD        0.000           0.121           0.193           0.146           0.422           0.000
R-squared 0.190           0.314           0.311           0.352           0.571           0.364
N         516             516             516             516             161             161
The baseline college is Harris Manchester college. All dependent variables are standardised. Columns 1-4 use the sample of enrolled Economics and Management students. Columns 5-6 use a reduced sample of Economics and Management students with Finals results. Colleges are ordered based on the coefficients in column 4. Standard errors are heteroskedasticity robust. Prob>F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). ∗ p<0.05, ∗∗ p<0.01
Table 22: Alternative Dependent Variable Regressions: Law
          (1)             (2)             (3)             (4)             (5)             (6)
          Roman           Constitutional  Criminal        Prelims Avg     Prelims Avg     Finals Avg
          β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)
WORC      0.03   (0.23)  −0.08   (0.18)  −0.06   (0.17)  −0.03   (0.20)  −0.14   (0.21)  −0.19   (0.21)
LMH      −0.16   (0.22)   0.05   (0.19)  −0.21   (0.19)  −0.14   (0.19)  −0.44∗  (0.21)  −0.19   (0.19)
HERT     −0.31   (0.21)   0.08   (0.20)  −0.15   (0.17)  −0.16   (0.18)  −0.38∗  (0.19)  −0.12   (0.18)
S-CATS   −0.15   (0.22)  −0.20   (0.18)  −0.08   (0.18)  −0.19   (0.20)  −0.40   (0.23)  −0.32   (0.17)
UNIV     −0.22   (0.23)  −0.02   (0.18)  −0.24   (0.19)  −0.22   (0.20)  −0.52∗  (0.24)  −0.47∗  (0.21)
MANS     −0.44   (0.23)   0.16   (0.19)  −0.20   (0.22)  −0.22   (0.20)  −0.40   (0.23)  −0.27   (0.16)
BNC      −0.07   (0.22)  −0.18   (0.18)  −0.25   (0.17)  −0.23   (0.20)  −0.52∗  (0.22)  −0.23   (0.17)
S-ANNE   −0.02   (0.22)  −0.23   (0.21)  −0.38   (0.22)  −0.28   (0.22)  −0.40   (0.22)  −0.38   (0.20)
TRIN     −0.38   (0.27)  −0.04   (0.20)  −0.18   (0.20)  −0.29   (0.22)  −0.49∗  (0.25)  −0.22   (0.19)
SEH      −0.26   (0.22)  −0.17   (0.19)  −0.34   (0.20)  −0.33   (0.20)  −0.66∗∗ (0.24)  −0.43∗  (0.20)
H-MAN    −0.16   (0.31)  −0.13   (0.25)  −0.46∗  (0.22)  −0.34   (0.26)  −0.59   (0.34)  −1.06   (0.68)
MERT     −0.13   (0.24)  −0.47∗  (0.22)  −0.22   (0.24)  −0.36   (0.23)  −0.73∗∗ (0.25)  −0.35   (0.20)
LINC     −0.30   (0.24)  −0.35   (0.19)  −0.30   (0.18)  −0.42∗  (0.20)  −0.71∗∗ (0.21)  −0.32   (0.17)
PEMB     −0.51∗  (0.22)  −0.26   (0.19)  −0.19   (0.17)  −0.42∗  (0.19)  −0.58∗∗ (0.21)  −0.32   (0.19)
CCC      −0.42   (0.22)  −0.17   (0.19)  −0.45∗  (0.21)  −0.44∗  (0.21)  −0.47∗  (0.24)  −0.31   (0.23)
NEW      −0.25   (0.24)  −0.40∗  (0.18)  −0.47∗∗ (0.17)  −0.48∗  (0.19)  −0.65∗∗ (0.20)  −0.37∗  (0.18)
CH-CH    −0.31   (0.22)  −0.44∗  (0.20)  −0.36   (0.20)  −0.49∗  (0.21)  −0.73∗∗ (0.21)  −0.33   (0.17)
S-PET    −0.27   (0.24)  −0.36∗  (0.18)  −0.49∗  (0.21)  −0.51∗  (0.21)  −0.74∗∗ (0.23)  −0.53∗  (0.23)
S-HUGH   −0.24   (0.22)  −0.40∗  (0.19)  −0.54∗∗ (0.20)  −0.53∗∗ (0.20)  −0.67∗∗ (0.21)  −0.42   (0.22)
BALL     −0.43   (0.22)  −0.26   (0.20)  −0.55∗∗ (0.21)  −0.54∗  (0.21)  −0.81∗∗ (0.25)  −0.71∗∗ (0.23)
JESUS    −0.22   (0.23)  −0.14   (0.17)  −0.86∗∗ (0.19)  −0.56∗∗ (0.19)  −0.84∗∗ (0.21)  −0.81∗∗ (0.18)
WADH     −0.45∗  (0.22)  −0.25   (0.16)  −0.55∗∗ (0.17)  −0.57∗∗ (0.19)  −0.84∗∗ (0.21)  −0.29   (0.19)
S-HIL    −0.36   (0.24)  −0.36   (0.19)  −0.57∗∗ (0.18)  −0.58∗∗ (0.21)  −0.80∗∗ (0.21)  −0.51∗  (0.21)
S-JOHN   −0.26   (0.21)  −0.45∗  (0.21)  −0.63∗∗ (0.18)  −0.58∗∗ (0.19)  −0.75∗∗ (0.20)  −0.46∗  (0.19)
KEBLE    −0.57∗  (0.25)  −0.39∗  (0.19)  −0.44∗  (0.21)  −0.59∗∗ (0.22)  −0.77∗∗ (0.25)  −0.45   (0.27)
QUEENS   −0.46   (0.30)  −0.16   (0.28)  −0.72∗∗ (0.26)  −0.59∗  (0.29)  −0.66∗  (0.32)  −0.35   (0.22)
EXETER   −0.38   (0.25)  −0.31   (0.22)  −0.70∗∗ (0.22)  −0.61∗∗ (0.22)  −0.84∗∗ (0.25)  −0.69∗∗ (0.22)
REGENT   −0.58   (0.39)  −0.40   (0.32)  −0.59   (0.31)  −0.68   (0.35)  −1.30∗∗ (0.35)  −0.88∗  (0.36)
ORIEL    −0.54∗  (0.24)  −0.50∗  (0.20)  −0.62∗∗ (0.19)  −0.74∗∗ (0.22)  −1.22∗∗ (0.25)  −0.84∗∗ (0.24)
SOMER    −0.43   (0.23)  −0.41   (0.22)  −0.91∗∗ (0.24)  −0.77∗∗ (0.23)  −0.93∗∗ (0.24)  −0.69∗∗ (0.21)
GREYF    −0.75   (0.70)  −0.44   (0.45)  −1.36∗∗ (0.42)  −1.16∗  (0.58)  −1.27∗  (0.62)  −1.23∗∗ (0.43)
Controls  Yes             Yes             Yes             Yes             Yes             Yes
Prob>F    0.343           0.098           0.000           0.004           0.000           0.004
SD        0.056           0.072           0.179           0.141           0.195           0.132
R-squared 0.127           0.104           0.148           0.166           0.185           0.188
N         1229            1229            1229            1229            854             854
The baseline college is Magdalen college. All dependent variables are standardised. Columns 1-4 use the sample of enrolled Law students. Columns 5-6 use a reduced sample of Law students with Finals results. Colleges are ordered based on the coefficients in column 4. Standard errors are heteroskedasticity robust. Prob>F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). ∗ p<0.05, ∗∗ p<0.01
6.2.2 Interval Scale Metric Assumption
If Prelims Average is not an interval scale metric then there is a danger college rankings are not
invariant to the scale of Prelims Average. The importance of this assumption is an empirical issue.
To test it, I compare the results of Model 2 using different monotonic transformations of Prelims
Average. I consider (i) standardised Prelims Average (as in the main analysis), (ii) squaring Prelims
Average and then standardising and (iii) taking logarithms of Prelims Average and then
standardising. I reexamine the coefficients and college rankings in each case. The rankings of colleges are
unchanged in the majority of cases (no college moves by more than 3 places) and changes in the size
of the coefficients tend to be small. For PPE, the correlation between college effect estimates
across the three models is between 0.992 and 0.998. The results for other courses are similar.
Therefore, whilst the monotonic transformations chosen are clearly only a tiny proportion of all possible
rescalings, the fact that the rankings change only slightly does provide comfort.
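The logic of this check can be illustrated with a small Python sketch on simulated marks. It is a simplified stand-in rather than the analysis itself: it omits the Model 2 controls and uses college means of the transformed, standardised outcome in place of regression coefficients. The mark distribution, the hypothetical college effects and all function names are assumptions made for illustration.

```python
import math
import random
import statistics


def standardise(values):
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def college_means(values, college):
    groups = {}
    for v, c in zip(values, college):
        groups.setdefault(c, []).append(v)
    return {c: statistics.fmean(vs) for c, vs in groups.items()}


def ranks(means):
    """Rank colleges from 1 (highest mean) downwards."""
    ordered = sorted(means, key=means.get, reverse=True)
    return {c: position for position, c in enumerate(ordered, start=1)}


def pearson(a, b):
    za, zb = standardise(a), standardise(b)
    return statistics.fmean([x * y for x, y in zip(za, zb)])


rng = random.Random(42)
college, marks = [], []
for j in range(30):
    effect = rng.gauss(0.0, 1.5)  # hypothetical college effect, in marks
    for _ in range(40):
        college.append(j)
        # Marks are kept positive so that logarithms are defined.
        marks.append(max(1.0, 62.0 + effect + rng.gauss(0.0, 6.0)))

scales = {
    "level": standardise(marks),
    "square": standardise([m ** 2 for m in marks]),
    "log": standardise([math.log(m) for m in marks]),
}
effects = {name: college_means(vals, college) for name, vals in scales.items()}
colleges = sorted(effects["level"])


def compare(name):
    """Correlation and largest rank change relative to the untransformed scale."""
    base, other = effects["level"], effects[name]
    corr = pearson([base[c] for c in colleges], [other[c] for c in colleges])
    r_base, r_other = ranks(base), ranks(other)
    max_move = max(abs(r_base[c] - r_other[c]) for c in colleges)
    return corr, max_move


for name in ("square", "log"):
    corr, max_move = compare(name)
    print(f"{name}: correlation {corr:.3f}, largest rank change {max_move}")
```

Reporting both the correlation and the largest rank movement mirrors the two summaries used in the text.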
6.2.3 Heterogeneity in College Effectiveness across Students of Different Types
To test for heterogeneity in college effects by gender, I include in the regression interaction terms
between gender and college attended. An F-test can then determine if these interaction terms are
significant which would indicate evidence that college effects differed by gender. I repeat the process
for overseas status, cohort, previous school type, GCSE results and A-level results. Table 23 displays
the p-values for the F-tests. For PPE, E&M and Law the F-test cannot reject the hypothesis that
college effects are invariant to ability (in terms of prior GCSE and A-level results). Thus there is little
evidence to support a “mismatch” hypothesis that college quality and ability interact in substantively
important ways. Students of all abilities benefit from attending higher quality colleges. However, for
both Law and PPE there is strong evidence that college effects change over time. The All Subjects
model is less well specified than the other models. The evidence from F-tests suggests that college
effects could be heterogeneous across gender, overseas status, cohort, previous school type and GCSE
bands.
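The interaction test can be sketched as a comparison of restricted and unrestricted regressions. The Python example below is illustrative only and rests on several assumptions: the data are simulated, gender is the only student characteristic, no other controls are included, and the statistic is the classical F statistic rather than a heteroskedasticity-robust version. Converting it to a p-value would require the F distribution, which is not in the standard library and is therefore omitted. The names `interaction_f` and `simulate` are hypothetical.

```python
import random
import statistics


def within_ss(values, cells):
    """Sum of squared deviations of values from their cell means."""
    groups = {}
    for v, c in zip(values, cells):
        groups.setdefault(c, []).append(v)
    total = 0.0
    for vs in groups.values():
        cell_mean = statistics.fmean(vs)
        total += sum((v - cell_mean) ** 2 for v in vs)
    return total


def demean_by(values, groups):
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def interaction_f(score, female, college):
    """F statistic for H0: the gender gap is the same in every college."""
    n = len(score)
    n_colleges = len(set(college))
    # Restricted model: college dummies + one common gender coefficient,
    # estimated on within-college demeaned data (Frisch-Waugh-Lovell).
    y = demean_by(score, college)
    g = demean_by(female, college)
    sgg = sum(v * v for v in g)
    syg = sum(a * b for a, b in zip(y, g))
    ssr_restricted = sum(v * v for v in y) - syg * syg / sgg
    # Unrestricted model: college dummies + college-by-gender interactions,
    # equivalent to a separate mean for every college-gender cell.
    ssr_unrestricted = within_ss(score, list(zip(college, female)))
    q = n_colleges - 1        # number of interaction restrictions
    df = n - 2 * n_colleges   # residual degrees of freedom
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df)


def simulate(het_sd, n_colleges=20, per_cell=20, seed=0):
    """Simulated scores with a gender gap that varies across colleges."""
    rng = random.Random(seed)
    score, female, college = [], [], []
    for j in range(n_colleges):
        effect = rng.gauss(0.0, 0.3)
        gap = 0.2 + rng.gauss(0.0, het_sd)  # college-specific gender gap
        for f in (0, 1):
            for _ in range(per_cell):
                score.append(effect + gap * f + rng.gauss(0.0, 1.0))
                female.append(f)
                college.append(j)
    return score, female, college


for het_sd in (0.0, 1.0):
    f_stat = interaction_f(*simulate(het_sd))
    print(f"heterogeneity SD {het_sd}: F = {f_stat:.2f}")
```

A large F statistic indicates that allowing the gender gap to differ by college improves the fit by more than chance would suggest, which is the sense in which the interaction terms are "significant" in the text.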
Table 23: P-values from Tests for Heterogeneity in College Effects across Students
                      PPE      E&M      Law      All Subjects
Gender                0.24     0.82     0.02*    0.06
Overseas Status       0.00**   0.15     0.09     0.00**
Cohort                0.00**   0.68     0.00**   0.05*
Previous School type  0.02*    0.06     0.50     0.01*
GCSE Band             0.33     0.23     0.99     0.01*
A-level Band          0.65     0.72     0.98     0.22
This table shows the results of tests of the null hypothesis that college effects are invariant to student characteristics. Each cell gives the p-value from an F-test on the coefficients of interaction terms between student characteristics and college dummies. Significance at the 1 and 5 percent level is denoted by ** and *, respectively.
6.3 Results for Selection on Observables and Unobservables
The estimation of college effects for PPE, Law and E&M using Model 3 turned out to be problem-
atic because OLS regression estimates of the scale parameter λ1 from equation (5) were negative.
Given that the theoretical model constrains λ1 > 0, this is troubling and prevents us from obtaining point
estimates of college effects for these courses. Mechanically this is because (i) colleges tend to select
lower proportions of open applicants than direct applicants, implying that direct applicants are of
higher ability on average relative to open applicants and (ii) at most colleges, open applicants slightly
outperform direct applicants in Prelims. There are a number of possibilities as to why λ1 estimates
are negative. The first is estimation error. There are only a few open applicants at each college so
our estimates of λ1 are not precise. Second, the “Fair Admissions” assumption may not hold. A
negative estimate of λ1 could be generated if colleges were biased against open applicants relative to
direct applicants (open applicants face a higher cut-off). Direct discrimination seems unlikely given
admissions tutors are unaware whether an applicant applied directly or made an open application.
However, discrimination could occur indirectly, for instance if admissions tutors were biased against
international applicants relative to UK applicants and open applicants are disproportionately inter-
national students. In each of PPE, E&M and Law, international students do score more highly on
average than UK students in Prelims but an analysis of marginal students is needed to determine
the validity of the “Fair Admissions” assumption (Bhattacharya et al., 2014). Third, assumptions
made about the distribution of ability (normality and equal variance) may not hold. Evidence that
the negative estimates are due to estimation error is that the All Subjects dataset, which draws on many more students,
estimates a positive λ1.
As a result of this problem I do two things. First, in Table 24, I present college effect estimates for
PPE, E&M and Law for different values of λ1. Cut-off estimates are provided in the first column. For
PPE they range from 1.34 at St Anne’s to 2.15 at Mansfield (remembering that the average ability
of open applicants in the whole population is zero) reflecting that some colleges accept a much larger
proportion of open applicants than others. The second and third columns give the average ability of
enrolled students at each college. In each case direct applicants are estimated to be of higher ability
on average. Columns 4 and 5 give the number of enrolled students at each college. It is notable how
few open applicants attend each college – the most for PPE is 10 at St Peter’s – and this means
that college effect estimates are imprecise. The remaining columns of Table 24 present college effect
estimates for Models 1, 2 and 3 based on the reduced sample of students who were offered a place at
the first college they were allocated to. This makes college effectiveness estimates directly comparable
across models. Colleges are ordered by their Model 1 college effectiveness estimate. Since the estimate
of the scale parameter λ1 < 0 for these courses, the Model 3 estimates are reported for different values
of λ1 (λ1 = 0.5, λ1 = 1, λ1 = 2 and λ1 = 5). Model 1, or equivalently Model 3 with λ1 ≈ 0, is a
baseline value where the entire difference in Prelims scores is attributed to colleges. As λ1 increases,
Prelims results become more sensitive to ability. This improves the college effect estimates for colleges
with low estimated cut-offs and low estimated enrolled student ability relative to colleges with high
estimated cut-offs and enrolled students of high average ability. In the limit as λ1 → ∞ college are
ranked based solely on the estimated cut-offs. The results for PPE, E&M and Law seem plausible
if, for example, λ1 = 0.5. In this case differences in college effectiveness are similar to differences
that result from controlling for observables in Model 2, with correlations in college effectiveness
estimates of 0.71 for PPE, 0.93 for E&M and 0.79 for Law. However, college effectiveness estimates
for some colleges are quite sensitive to the value of λ1 and the estimate of the cut-off zj . Sensitivity
to the cut-off estimate creates uncertainty about true college effectiveness because the cut-offs are
imprecisely estimated due to the low number of open applicants at each college. For example, the
cut-off estimate for Mansfield (MANS) for PPE of 2.15 seems unrealistically high given estimates
for other colleges range from 1.34 to 1.79. Overall, the results in Table 24 suggest similar results to
Table 24: Selection on Observables and Unobservables Results for various λ1: PPE, E&M and Law

                                            College effects βj = cj − cJ
        Cutoff    Ability      No. Enrolled   Model 1  Model 2              Model 3
College     zj   Open  Direct   Open  Direct     λ1≈0        -   λ1=0.5     λ1=1     λ1=2     λ1=5
PPE
S-HIL     1.70   2.11    2.14      7       3     0.00     0.00     0.00     0.00     0.00     0.00
SOMER     1.60   2.03    2.13      9      15    -0.16    -0.19    -0.14    -0.13    -0.10    -0.02
PEMB      1.41   1.87    1.93      7      24    -0.32    -0.32    -0.22    -0.12     0.09     0.69
SEH       1.38   1.84    1.88      6      17    -0.35    -0.19    -0.22    -0.10     0.15     0.89
S-PET     1.52   1.96    2.05     10      21    -0.42    -0.43    -0.37    -0.32    -0.23     0.07
CCC       1.46   1.90    2.00      5      21    -0.42    -0.40    -0.35    -0.28    -0.15     0.27
LMH       1.79   2.19    2.35      2      27    -0.45    -0.42    -0.56    -0.67    -0.89    -1.53
MANS      2.15   2.51    2.66      1      16    -0.54    -0.30    -0.80    -1.07    -1.60    -3.19
QUEENS    1.72   2.13    2.26      3      24    -0.56    -0.59    -0.62    -0.68    -0.80    -1.17
S-ANNE    1.34   1.80    1.85      7      20    -0.70    -0.71    -0.56    -0.42    -0.13     0.72
S-HUGH    1.43   1.88    1.94      9       9    -0.75    -0.53    -0.65    -0.55    -0.34     0.28
E&M
S-HIL     1.51   1.95    1.88      8       1     0.00     0.00     0.00     0.00     0.00     0.00
SEH       1.62   2.04    2.09     12      14    -0.29    -0.08    -0.35    -0.41    -0.53    -0.90
S-HUGH    1.84   2.23    2.35      5      17    -0.33    -0.22    -0.52    -0.71    -1.10    -2.24
HERT      1.83   2.23    2.30      3      34    -0.46    -0.19    -0.63    -0.81    -1.16    -2.21
JESUS     1.80   2.19    2.28      3      15    -0.48    -0.22    -0.64    -0.80    -1.13    -2.12
KEBLE     1.93   2.31    2.43      4      22    -0.58    -0.36    -0.81    -1.04    -1.51    -2.91
PEMB      1.76   2.16    2.22      2      41    -0.59    -0.36    -0.73    -0.87    -1.15    -1.99
S-PET     1.83   2.23    2.33      7      26    -0.65    -0.32    -0.84    -1.02    -1.39    -2.48
LMH       1.77   2.17    2.23      3      10    -0.68    -0.30    -0.81    -0.95    -1.22    -2.03
MERT      2.23   2.58    2.72      1      22    -0.78    -0.73    -1.17    -1.55    -2.32    -4.63
WADH      2.09   2.45    2.59      1      10    -0.90    -0.73    -1.22    -1.54    -2.17    -4.08
S-ANNE    1.73   2.14    2.21      6      12    -0.91    -0.72    -1.03    -1.15    -1.40    -2.14
Law
S-ANNE    1.33   1.79    1.86      5      17     0.00     0.00     0.00     0.00     0.00     0.00
HERT      1.64   2.06    2.19      2      35    -0.04     0.05    -0.21    -0.38    -0.72    -1.75
SEH       1.24   1.72    1.81     16      13    -0.16     0.08    -0.12    -0.08     0.00     0.23
S-HIL     1.55   1.99    2.04      8       8    -0.22    -0.31    -0.30    -0.39    -0.56    -1.08
WADH       1.5   1.94    2.03      5      29    -0.29    -0.22    -0.38    -0.46    -0.63    -1.15
S-PET     1.65   2.07    2.17      3      13    -0.31    -0.24    -0.47    -0.62    -0.93    -1.85
CCC       1.27   1.75    1.86      8      21    -0.42    -0.26    -0.41    -0.40    -0.39    -0.34
JESUS      1.3   1.77    1.83      5      28    -0.49    -0.33    -0.49    -0.48    -0.46    -0.40
S-HUGH    0.87   1.42    1.34     19       5    -0.50    -0.49    -0.29    -0.07     0.37     1.68
ORIEL     1.29   1.76    1.85      9      31    -0.55    -0.54    -0.55    -0.54    -0.54    -0.51
KEBLE     1.46    1.9    2.02      3      37    -0.62    -0.42    -0.71    -0.79    -0.96    -1.47
SOMER     1.06   1.58    1.62     13      10    -0.77    -0.70    -0.65    -0.52    -0.27     0.47

The estimated value of λ1 for each of PPE, E&M and Law is negative. Columns 2-3 give the estimated ability of enrolled students. Columns 4-5 give the number of enrolled students. Columns 6-11 give college effect estimates relative to the college with the highest average Prelims results from Model 1. Estimates are based on a restricted sample that does not include students who were not offered a place at the first college they were allocated to. Colleges are only included if they have at least 50 open applicants.
Table 25: Selection on Observables and Unobservables Results: All Subjects, English, Maths and History

        Cutoff    Ability      No. Enrolled      College effects βj
College     zj   Open  Direct   Open  Direct  Model 1  Model 2  Model 3
All Subjects
MAGD      1.28   1.75    1.81      4     465     0.00     0.00     0.00
S-JOHN    1.27   1.74    1.81     18     426    -0.03     0.02    -0.02
PEMB      1.17   1.66    1.71     43     295    -0.31    -0.18    -0.19
CH-CH     1.25   1.73    1.79     16     399    -0.26    -0.17    -0.23
SEH       1.23   1.71    1.79     97     279    -0.30    -0.17    -0.25
EXETER    1.17   1.66    1.69     11     331    -0.26    -0.14    -0.26
S-CATS    1.27   1.75    1.82     42     380    -0.40    -0.33    -0.26
S-HIL     1.23   1.71    1.78    125     113    -0.34    -0.23    -0.27
S-HUGH    1.22   1.71    1.80    122     206    -0.33    -0.21    -0.28
MERT      1.46   1.91    2.00      8     333    -0.09    -0.08    -0.30
BALL      1.39   1.85    1.91      4     421    -0.20    -0.18    -0.30
ORIEL     1.26   1.74    1.80     23     276    -0.33    -0.27    -0.31
S-ANNE    1.29   1.76    1.86     64     304    -0.29    -0.20    -0.33
SOMER     1.18   1.67    1.80    106     212    -0.41    -0.30    -0.35
NEW       1.48   1.92    2.04      9     478    -0.12    -0.08    -0.36
LINC      1.42   1.87    1.97     11     326    -0.24    -0.22    -0.41
QUEENS    1.24   1.72    1.79     42     268    -0.46    -0.35    -0.42
S-PET     1.37   1.83    1.91     56     208    -0.37    -0.21    -0.46
MANS      1.43   1.88    1.97     24     149    -0.30    -0.20    -0.46
CCC       1.37   1.83    1.96     27     219    -0.37    -0.31    -0.51
UNIV      1.67   2.08    2.22      4     406    -0.11    -0.11    -0.55
LMH       1.47   1.91    2.03     34     329    -0.32    -0.22    -0.55
WADH      1.47   1.92    2.02     16     455    -0.34    -0.28    -0.56
JESUS     1.64   2.06    2.19     15     353    -0.27    -0.23    -0.67
HERT      1.65   2.06    2.19     13     446    -0.28    -0.20    -0.70
KEBLE     1.74   2.14    2.28     12     446    -0.26    -0.19    -0.77
English
S-HUGH    1.24   1.72    1.85     12      17     0.00     0.00     0.00
LMH       1.12   1.63    1.72      5      42    -0.28    -0.48    -0.41
SOMER     1.07   1.58    1.73     13      35    -0.64    -0.71    -0.62
S-HIL     1.09   1.60    1.67     15      16    -0.71    -0.69    -0.67
Maths
S-HIL     1.41   1.86    1.92      5       6     0.00     0.00     0.00
S-PET     1.10   1.60    1.59      7       6    -0.63    -0.57    -0.03
SEH       1.38   1.84    1.86      5       4    -0.17    -0.27    -0.08
S-HUGH    1.40   1.85    1.95      3      20    -0.25    -0.13    -0.35
QUEENS    1.12   1.63    1.72     11      17    -0.98    -0.89    -0.56
MANS      1.75   2.15    2.23      2       7    -0.81    -0.93    -1.48
LMH       2.08   2.44    2.62      1      21    -0.39    -0.31    -1.85
History
S-HIL     1.11   1.61    1.67      5       8     0.00     0.00     0.00
S-HUGH    0.99   1.52    1.64      7      27    -0.27     0.03    -0.20
S-ANNE    1.21   1.69    1.76      5      12    -0.14     0.07    -0.32
MANS      1.27   1.74    1.82      4      10    -0.04     0.05    -0.32
SOMER     1.14   1.64    1.85     11      30    -0.13     0.10    -0.41

Columns 2-3 give the estimated ability of enrolled students. Columns 4-5 give the number of enrolled students. All college effect estimates are based on a restricted sample that does not include students who were not offered a place at the first college they were allocated to. College effect estimates are given relative to the college with the largest Model 3 college effect estimate. Colleges are only included if they have at least 50 open applicants. Estimated value of λ1: λ1 = 1.09 for All Subjects; λ1 = 0.24 for English; λ1 = 2.04 for Maths; λ1 = 1.92 for History.
Figure 4: Comparison of College Rankings across Models: All Subjects
[Two scatter plots of college ranks, each point labelled with its college code, based on the restricted sample of colleges. First panel, "Model 1 vs Model 2": x-axis Model 1 Rank, y-axis Model 2 Rank. Second panel, "Model 2 vs Model 3": x-axis Model 2 Rank, y-axis Model 3 Rank. Both axes run from 0 to 25.]
selection on observables estimates for some parameterisations. With more students per college, this
method could produce useful effectiveness estimates but currently there is considerable uncertainty
surrounding college effectiveness estimates.
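The sensitivity of the Model 3 estimates to λ1 can be illustrated with a few PPE rows of Table 24. The sketch below is not the estimation procedure itself. It assumes, as an inference from the pattern in the table rather than a formula stated in the text, that the Model 3 effect equals the Model 1 effect minus λ1 times the gap in average enrolled ability between a college and the baseline college. The function names are illustrative.

```python
# Sketch: how a Model 3 college effect moves with the ability scale parameter lambda_1.
# Assumption (inferred from the pattern in Table 24, not stated in the text):
#   effect_M3(lambda_1) = effect_M1 - lambda_1 * (mean_ability_j - mean_ability_baseline)

# PPE rows of Table 24: (ability open, ability direct, n open, n direct, Model 1 effect)
PPE = {
    "S-HIL": (2.11, 2.14, 7, 3, 0.00),    # baseline college
    "SOMER": (2.03, 2.13, 9, 15, -0.16),
    "PEMB":  (1.87, 1.93, 7, 24, -0.32),
    "MANS":  (2.51, 2.66, 1, 16, -0.54),
}


def mean_ability(row):
    """Enrolment-weighted average ability of open and direct entrants."""
    a_open, a_direct, n_open, n_direct, _ = row
    return (a_open * n_open + a_direct * n_direct) / (n_open + n_direct)


def model3_effect(college, lam, baseline="S-HIL", table=PPE):
    """Model 1 effect adjusted for the ability gap to the baseline, scaled by lam."""
    gap = mean_ability(table[college]) - mean_ability(table[baseline])
    return table[college][4] - lam * gap


for college in ("SOMER", "PEMB", "MANS"):
    print(college, [round(model3_effect(college, lam), 2) for lam in (0.5, 1, 2, 5)])
```

Under this assumption the computed values track the λ1 columns of Table 24 up to rounding: Mansfield, with the highest estimated ability, falls steeply as λ1 grows, while Pembroke rises.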
Second, in Table 25, I present All Subjects selection on observables and unobservables results.
Again the first five columns present the cut-off, average ability of enrolled students and the number
of enrolled students. The final three columns give college effect estimates for Models 1, 2 and 3 again
based on the reduced sample of students and colleges. Colleges are ranked by their Model 3 college
effectiveness estimate. Comparing Model 1, 2 and 3 college effect estimates suggests that taking
into account unobservable ability may actually slightly increase variation in adjusted Prelims results
between colleges. Taking the results at face value also suggests that in many cases, unobservable
ability is not well correlated with observable ability (differences in Prelims results between colleges
usually fall when moving from Model 1 to Model 2 but often rise when moving from Model 2 to Model
3). However, I cannot rule out that these results are mainly due to estimation error. Indeed, for
All Subjects, the estimate of the ability scale parameter is λ1 = 1.09, which seems implausibly high
because it leads to a strong negative correlation between the estimated cut-off zj and the Model 3
college effect estimates βj (the 6 lowest places in the table are occupied by colleges that have cut-offs
in the top 7). Figure 4 illustrates how college effectiveness rankings change across Models. Although
college rankings change very little when we compare Model 1 and Model 2 results, Model 3 results
are quite different. The correlation in college effectiveness estimates is 0.94 for Model 1 vs Model
2 but falls to 0.44 for Model 2 vs Model 3. As shown in section 5.5.1, random assignment of open
applicants is less convincing when pooled across subjects so I also give college effect estimates for
English, Maths and History in the lower 3 panels in Table 25. These estimates largely tell the same
story – the Model 3 estimates are at times quite different to the Model 1 and Model 2 estimates
indicating unobservable ability is very important or that these estimates are very imprecise.
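The link between estimated cut-offs and Model 3 effects for All Subjects can be checked directly from the rounded figures in Table 25. The following is a minimal sketch using only those published values, so the correlation it prints is approximate and is not a figure reported in the text.

```python
from math import sqrt

# All Subjects panel of Table 25: (college, cut-off z_j, Model 3 effect)
ROWS = [
    ("MAGD", 1.28, 0.00), ("S-JOHN", 1.27, -0.02), ("PEMB", 1.17, -0.19),
    ("CH-CH", 1.25, -0.23), ("SEH", 1.23, -0.25), ("EXETER", 1.17, -0.26),
    ("S-CATS", 1.27, -0.26), ("S-HIL", 1.23, -0.27), ("S-HUGH", 1.22, -0.28),
    ("MERT", 1.46, -0.30), ("BALL", 1.39, -0.30), ("ORIEL", 1.26, -0.31),
    ("S-ANNE", 1.29, -0.33), ("SOMER", 1.18, -0.35), ("NEW", 1.48, -0.36),
    ("LINC", 1.42, -0.41), ("QUEENS", 1.24, -0.42), ("S-PET", 1.37, -0.46),
    ("MANS", 1.43, -0.46), ("CCC", 1.37, -0.51), ("UNIV", 1.67, -0.55),
    ("LMH", 1.47, -0.55), ("WADH", 1.47, -0.56), ("JESUS", 1.64, -0.67),
    ("HERT", 1.65, -0.70), ("KEBLE", 1.74, -0.77),
]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


cutoffs = [z for _, z, _ in ROWS]
effects = [b for _, _, b in ROWS]
r = pearson(cutoffs, effects)

# Colleges in the six lowest Model 3 places, and those with the seven highest cut-offs.
bottom6 = {name for name, _, _ in sorted(ROWS, key=lambda t: t[2])[:6]}
top7 = {name for name, _, _ in sorted(ROWS, key=lambda t: t[1], reverse=True)[:7]}

print(round(r, 2), bottom6 <= top7)
```

The set comparison reproduces the observation that the six lowest-placed colleges all have cut-offs in the top seven.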
7 Characteristics of Effective Colleges
What characteristics are associated with effective colleges? To answer this question I implement a
two-step procedure, because Moulton (1986) shows that estimating the impact of college characteristics in
one step is problematic for the precision of estimated effects. I would like to estimate γ from the
college-level equation:
$\beta_j = Z_j \gamma + u_j \quad \forall\, j = 1, 2, \ldots, J$ (15)
where $\beta_j$ is the true college effect for college j, $Z_j$ is a vector of college characteristics and
$u_j$ is a homoskedastic random error term with $E(u_j) = 0$ and $Var(u_j) = \sigma^2$. However, true college
effectiveness $\beta_j$ is not observable. Rather we observe college effect coefficient estimates $\hat{\beta}_j$ from the
first stage models. Using first stage estimated regression coefficients implies an additional error in
the second stage regression because of estimation error:
$\hat{\beta}_j = \beta_j + \varepsilon_j \quad \forall\, j = 1, 2, \ldots, J.$ (16)
Thus the second stage regression becomes:
$\hat{\beta}_j = Z_j \gamma + u_j + \varepsilon_j \quad \forall\, j = 1, 2, \ldots, J$ (17)
where $Var(\varepsilon_j) = w_j^2$. Since the dependent variable in the second stage, $\hat{\beta}_j$, is itself estimated, the
second stage regression residual can be thought of as having two components. One, $u_j$, is the random
shock that would have obtained even if the college effects were directly observed and could well be
homoskedastic. The second component, $\varepsilon_j$, is the estimation error from the first stage regression. Even
if uj is homoscedastic, this εj will be heteroskedastic because estimation error differs across colleges.
Therefore the regression errors in (17) will be heteroskedastic and OLS will produce inconsistent
standard error estimates.37
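The source of the heteroskedasticity can be written out in one line. The derivation below adds one assumption that is not stated above, namely that the random shock and the first stage estimation error are uncorrelated:

```latex
\operatorname{Var}(u_j + \varepsilon_j)
  = \operatorname{Var}(u_j) + \operatorname{Var}(\varepsilon_j)
  = \sigma^2 + w_j^2 .
```

The first term is common to all colleges while the second varies with j, so the composite error variance differs across colleges whenever first stage precision does.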
I follow Hanushek et al. (1996), assume $w_j^2$ is proportional to the sampling variance of $\hat{\beta}_j$ and
use a specialised form of feasible generalised least squares (FGLS).38 First,
I estimate equation (17) using OLS and calculate the squared residuals for j = 1, ..., J − 1, where
college J is the baseline college in the first stage. Next, I regress the squared residuals on the squared
standard errors from the college effect estimates. Finally, I use the inverse of the predicted squared
residuals from this auxiliary regression as the weights in the FGLS estimation of (17). Estimates
from this regression will be asymptotically efficient.
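The three steps can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the code used for the thesis: the function name, the inclusion of a constant in the auxiliary regression, the floor applied to the fitted variances and the simulated data are all my own choices.

```python
import numpy as np


def fgls_second_stage(beta_hat, Z, se):
    """Three-step FGLS sketch for regressing estimated effects on characteristics.

    beta_hat : (J,) first stage effect estimates
    Z        : (J, k) characteristics, including a constant column
    se       : (J,) first stage standard errors of beta_hat
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    Z = np.asarray(Z, dtype=float)
    se = np.asarray(se, dtype=float)

    # Step 1: OLS of the estimated effects on characteristics; keep squared residuals.
    gamma_ols, *_ = np.linalg.lstsq(Z, beta_hat, rcond=None)
    resid_sq = (beta_hat - Z @ gamma_ols) ** 2

    # Step 2: auxiliary regression of squared residuals on a constant and squared SEs.
    A = np.column_stack([np.ones_like(se), se ** 2])
    delta, *_ = np.linalg.lstsq(A, resid_sq, rcond=None)
    fitted_var = A @ delta

    # Guard (an assumption of this sketch): floor the fitted variances so that the
    # weights stay positive and finite.
    floor = max(1e-3 * resid_sq.mean(), 1e-12)
    fitted_var = np.clip(fitted_var, floor, None)

    # Step 3: weighted least squares with weights 1 / fitted variance, implemented
    # by scaling each row by the square root of its weight.
    w = 1.0 / np.sqrt(fitted_var)
    gamma_fgls, *_ = np.linalg.lstsq(Z * w[:, None], beta_hat * w, rcond=None)
    return gamma_fgls


# Simulated data (hypothetical; many more groups than there are Oxford colleges).
rng = np.random.default_rng(0)
J = 500
Z = np.column_stack([np.ones(J), rng.normal(size=J)])
se = rng.uniform(0.1, 0.3, size=J)
u = rng.normal(scale=0.2, size=J)   # homoskedastic shock
eps = rng.normal(size=J) * se       # heteroskedastic estimation error
beta_hat = Z @ np.array([0.5, 1.0]) + u + eps

gamma = fgls_second_stage(beta_hat, Z, se)
print(gamma)
```

With only around 30 colleges the auxiliary regression is itself noisy, which is one reason to treat the resulting weights with caution.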
The small number of colleges has two implications for inference (Donald and Lang, 2007). First,
the assumption that college effects are normally distributed is crucial for hypothesis testing because
we cannot rely on large sample sizes to provide an asymptotically normal distribution of the parameter
estimates. Second, there is a practical limit on the number of variables that can be included in Z.
For the college characteristics in Zj I use endowment39, the number of students on the course
and the college average admissions test scores. These latter two variables are included, as in Bratti
(2002), to proxy for peer effects. A positive coefficient on the number of students on the course
would suggest that students benefit from being surrounded by lots of other students studying the
same course within the same college. A positive coefficient on college average admissions test scores
would suggest that students benefit from being surrounded by high ability students studying the
same course within the same college. I also consider specifications with a dummy variable for being
a former All Women’s college (LMH, St Anne’s, St Hugh’s and Somerville), dummy variables based
on location and dummy variables for “Old” colleges (foundation pre-1500) and “Young” colleges
37 OLS standard errors will be inconsistent with a fixed number of students per college as the number of colleges tends to infinity. However, when the number of students per college is large, the second component (the estimation error) is small and the first component (the random shock) is assumed homoskedastic, so OLS produces consistent standard errors (Donald and Lang, 2007).
38 A common approach is to use weighted least squares with weights 1/wj in the second stage regression. However, like OLS, this is inefficient and may produce inconsistent estimates of parameter uncertainty because it implicitly assumes that the entire residual uj + εj, and not just the second component εj, is heteroskedastic (Hanushek, 1974).
39 Specifically 2011 endowment, approximately the midpoint of the time period. I collected this information from college Financial Reports publicly available on the Oxford website.
Table 26: Second Stage Regression Results: Impact of Endowment

                     PPE                 EM                  Law                 All Subjects
                     Model 1   Model 2   Model 1   Model 2   Model 1   Model 2   Model 1   Model 2
Endowment            0.048∗∗   0.028     -0.022    -0.024    0.003     0.006     0.027∗    0.018∗
                     (0.015)   (0.015)   (0.018)   (0.023)   (0.018)   (0.018)   (0.010)   (0.008)
Endowment Sq         -0.002∗∗  -0.001    0.001     0.001     -0.000    -0.000    -0.001∗∗  -0.001∗∗
                     (0.000)   (0.001)   (0.001)   (0.001)   (0.000)   (0.000)   (0.000)   (0.000)
Prob > F             0.01      0.08      0.33      0.59      0.19      0.07      0.01      0.01
R-squared            0.25      0.13      0.06      0.04      0.01      0.01      0.24      0.15
N                    32        32        22        22        31        31        32        32

The dependent variable in columns 1, 3, 5 and 7 is the Model 1 college effectiveness estimate. The dependent variable in columns 2, 4, 6 and 8 is the Model 2 college effectiveness estimate. Standard errors are FGLS standard errors calculated as in Hanushek, Rivkin and Taylor (1996). Prob > F gives the p-value for the F-test of the null hypothesis that the coefficients on endowment and endowment squared are equal to zero. The units for endowment are £10m. ∗ p < 0.05, ∗∗ p < 0.01
(foundation post-1850). However, including lots of dummy variables severely reduces the degrees of
freedom available and I do not report these results.
Tables 26 and 27 present FGLS estimates of the determinants of the college effects. Regression
results are reported using college effects estimates from Model 1 and Model 2 (no standard errors are
available for Model 3).
The impact of endowment on college effectiveness is best evaluated through regressions with no
other controls, as in Table 26. Table 26 provides evidence that endowment is related to both raw
Prelims scores (Model 1) and college effectiveness adjusted for observables (Model 2). The
null hypothesis that endowment has no impact on Model 2 college effectiveness can be rejected for
PPE and Law (at the 10% level) and for All Subjects (at the 1% level), though not for E&M, where
the estimated effect is negative and insignificant. The estimated relationship between endowment
and college effectiveness is increasing and concave for PPE, Law and All Subjects. Richer colleges on
average tend to be more effective. For example, the top 6 most effective colleges have endowments
in the top 9.40 For PPE and All Subjects, endowment is more closely related to raw Prelims scores
40 This point is made periodically in the media, e.g. Times Higher Education: “Oxford inequalities exposed”, 2nd May 2003, and Cherwell: “Rich colleges enjoy more academic success”, 29th October 2010.
Table 27: Second Stage Regression Results: Evidence of Peer Effects

                                PPE                 EM                  Law                 All Subjects
                                Model 1   Model 2   Model 1   Model 2   Model 1   Model 2   Model 1   Model 2
Endowment                       0.017     0.020     -0.033    -0.031    0.005     0.002     0.024∗    0.015
                                (0.016)   (0.016)   (0.021)   (0.028)   (0.020)   (0.015)   (0.011)   (0.008)
Endowment Sq                    -0.001    -0.001    0.001     0.001     -0.000    -0.000    -0.001∗   -0.000
                                (0.000)   (0.001)   (0.001)   (0.001)   (0.001)   (0.000)   (0.000)   (0.000)
No. PPE students                0.011∗∗   0.008∗∗
                                (0.003)   (0.002)
Avg TSA Critical PPE            0.013     -0.011
                                (0.019)   (0.020)
Avg TSA Problem PPE             0.010     -0.013
                                (0.013)   (0.013)
No. EM students                                     0.003     0.004
                                                    (0.005)   (0.006)
Avg TSA Critical EM                                 -0.003    -0.019
                                                    (0.051)   (0.051)
Avg TSA Problem EM                                  0.046     0.057
                                                    (0.059)   (0.064)
No. Law students                                                        0.003     0.006
                                                                        (0.006)   (0.004)
Avg LNAT                                                                0.124     0.107∗
                                                                        (0.064)   (0.042)
Total Students per year                                                                     0.001     0.001∗
                                                                                            (0.001)   (0.001)
Endowment: Prob > F             0.08      0.00      0.16      0.49      0.28      0.11      0.09      0.21
Ability peer effects: Prob > F  0.57      0.49      0.53      0.63      0.06      0.02
All variables: Prob > F         0.00      0.00      0.28      0.85      0.22      0.03      0.02      0.01
R-squared                       0.54      0.40      0.16      0.13      0.19      0.31      0.25      0.21
N                               32        32        21        21        30        30        32        32

The dependent variable in columns 1, 3, 5 and 7 is the Model 1 college effectiveness estimate. The dependent variable in columns 2, 4, 6 and 8 is the Model 2 college effectiveness estimate. Standard errors are FGLS standard errors calculated as in Hanushek, Rivkin and Taylor (1996). Endowment: Prob > F gives the p-value for the F-test of the null hypothesis that the coefficients on endowment and endowment squared are equal to zero. Ability peer effects: Prob > F gives the p-value for the F-test of the null hypothesis that the coefficients on average admissions test scores are equal to zero. ∗ p < 0.05, ∗∗ p < 0.01
from Model 1 than to college effectiveness adjusted for observables in Model 2 (as shown by smaller
coefficient estimates and smaller R2 estimates), which suggests high ability students sort into richer
colleges. For PPE, an increase in endowment from £15 million to £25 million is related to a 0.046
standard deviation increase in raw Prelims scores and an improvement of 0.027 standard deviations
in Prelims scores after accounting for observables. The effect is also slightly underestimated because I
must exclude the baseline college from the analysis, since its college effect has no associated standard
error. For PPE and All Subjects the baseline college is St John’s, which has both the largest endowment and
high college effectiveness. These results are consistent with richer colleges attracting higher ability
students and teaching them more effectively than other colleges. More effective colleges may also
receive more and larger donations from alumni.
Table 27 includes endowment and peer effect proxies as explanatory variables. Evidence on peer
effects, holding endowment fixed, is mixed. Focusing on the regressions with Model 2 college effects
as the dependent variable, average admissions test scores have a positive and significant effect for
Law. This suggests students can learn from other high ability students within the same college or
perhaps benefit from competition with them. However, average admissions test score coefficients are
insignificant for E&M and PPE and are even negative in some cases, which would suggest students
benefit more from lower ability peers. Thus there is little evidence of ability peer effects.
There is however evidence of peer effects operating through the number of students at each college
per course. The coefficients for the Model 2 college effect regressions are positive and statistically
significant at the 1% level for PPE and are positive but insignificant for E&M and Law (they are also
positive and insignificant for All Subjects). Colleges that take large numbers of students in a given
course perform well in that course both before and after accounting for observables. Colleges taking
one extra student per year over 5 years in PPE, ceteris paribus, are associated with an improvement
of 0.055 standard deviations in raw Prelims scores and an improvement of 0.040 standard deviations
in Prelims scores after accounting for observables – a large effect for such a small intervention. A
similar size effect is measured for Law – colleges taking one extra student per year over 7 years in Law
are associated with an improvement of 0.042 standard deviations in Prelims scores after accounting
for observables – however it is not statistically significant. One interpretation is that students benefit from
interacting with college peers within their subjects. Alternatively, colleges may accept more students
in courses that they are stronger in or close down poorly performing courses.
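The magnitudes quoted for the number-of-students effect follow from multiplying the per-student coefficients in Table 27 by the number of additional students. A small worked sketch (the helper name is illustrative, and it assumes the effect is linear in the number of students, as the regression specification implies):

```python
# Coefficients on "No. students" from Table 27, in standard deviations of Prelims score
# per additional student on the course.
COEF = {
    ("PPE", "Model 1"): 0.011,
    ("PPE", "Model 2"): 0.008,
    ("Law", "Model 2"): 0.006,
}


def implied_change(course, model, extra_per_year, years):
    """Change in the college effect implied by admitting extra students each year."""
    return COEF[(course, model)] * extra_per_year * years


for key, years in ((("PPE", "Model 1"), 5), (("PPE", "Model 2"), 5), (("Law", "Model 2"), 7)):
    print(key, round(implied_change(*key, extra_per_year=1, years=years), 3))
```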
Overall, these college characteristics explain only a fraction of differences in college effectiveness.
This does not mean college effectiveness cannot ever be explained – the estimates are very imprecise
and further work may benefit from using data on a wider range of college characteristics that I was
unable to obtain. I discuss this further in the conclusion.
8 Discussion and Limitations
Even though my college effectiveness estimates are an improvement over the Norrington table, their
interpretation should include various caveats and cautions.
First, my first stage college effectiveness estimates are more directly relevant to students than
to college administrators. To understand why, following Raudenbush and Willms (1995), imagine
decomposing my college effect estimates into two parts: (i) college context (the resources available to a
college) and (ii) college practice (the efficiency with which those resources are used). Context includes
college endowment, location and peer interactions. Practice includes teaching style, organisational
structure and college leadership. My college effectiveness estimates include both context and practice
and are known as “Type A effects” (Raudenbush and Willms, 1995), appropriate for students who wish
to ascertain their expected exam results at different colleges conditional on their own characteristics,
but are unconcerned about whether exam results come from college context or college practice. In
contrast, “Type B effects” include only the effect of college practice and not college context. Type B
effects are appropriate for college administrators interested in college accountability and instructional
practice because they measure the efficiency with which colleges exploit the resources available to
them. Removing college “context” ensures colleges are not held accountable for factors mostly outside
their control.41 Strictly, this means that my college effectiveness estimates are, at best, type A effects,
of interest to students selecting colleges, not type B effects, of interest to administrators analysing
instructional practice. Certainly I hope the estimates have the potential to stimulate useful discussion
41Type A effects are often known as value-added whereas Type B effects are known as contextual value-added.
about how to improve practice within colleges and my second stage estimates also contribute to this.
However, the first stage estimates should not be taken as direct evidence of instructional practice.
Second, and relatedly, my college effect estimates are inclusive of any student-effort/input ad-
justments (Bratti, 2002; Todd and Wolpin, 2003). In an optimising behavioural model, changing
a student’s college may change their effort level. Thus colleges exert a twofold impact on students’
exam performance, first directly through college characteristics and second through students’ optimal
effort input. This second effect may be positive or negative. Good teaching may motivate students to
put more effort into studying (college teaching and student inputs are complements). Alternatively,
students may work harder to make up for ineffective teaching (college teaching and student inputs
are substitutes). Thus student behaviour could potentially mute or exacerbate differences in college
effectiveness. This is not necessarily a limitation, as the total college effect is precisely the desired
effect for answering most policy questions (Todd and Wolpin, 2003).
Third, my college effectiveness estimates are relative by construction. The colleges are only
compared to other Oxford colleges – they do not assess the value of going to an Oxford college
as compared to going to a different university or no university at all.
Fourth, my college effect estimates are backward looking – they measure how effective colleges
were in the past. Potential students are interested in future, rather than past, effectiveness and this
implies larger uncertainty around college effectiveness estimates. I have not examined the stability
of college effects in detail but specification tests did suggest some evidence that college effects change
over time, perhaps due to tutor turnover. The less stable college effects are, the more noise in the
signal of college quality they provide to prospective students.
Fifth, my college effect estimates concentrate on exam results which are only one of many elements
that contribute to college quality (for a discussion of production with multiple outputs see Chizmar
and Zak (1983)). Colleges aim to produce a wide range of private benefits for students from increased
cognitive skills and improved labour market outcomes to an improved ability to make informed life
decisions about marriage, health, and parenting, and even perhaps increased happiness. Colleges also
produce an array of social benefits. Positive externalities from colleges operate through proximity to
knowledgeable people (Acemoglu and Angrist, 2001; Moretti, 2004), reduced crime, propensity to vote
and support for free speech (Dee, 2004). Finally, colleges aim to instil ethical values in their students.
Yet exams focus on measuring students’ cognitive skills and neglect other dimensions of college quality.
Exams may also mismeasure student cognitive skills because they reward college practices that may
not be considered desirable. These include “cream skimming” – encouraging weaker students not
to take exams and perhaps drop out; and “teaching to the test” – focusing teaching on test-taking
strategies and a narrow range of topics likely to be examined. Since exam results do not fully capture
everything students and society care about, my college effect estimates should not serve as the sole
criterion used by students or administrators to make decisions but should be seen as a starting point
to be complemented by other sources of information. A full appraisal of college effectiveness requires
a broad set of outcomes that proxy for the various dimensions of college effectiveness. I leave this for
further research.
Finally, adjusting exam results for ability makes it impossible to statistically distinguish between
the majority of colleges. Thus ordinal rankings of colleges should have large confidence intervals
around the point estimates. One alternative is to determine a benchmark Prelims / Finals average
and classify colleges relative to that benchmark (e.g. significantly below, within normal statistical
variance of, or statistically above).
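One way such a benchmark classification could work is sketched below. It assumes a 95% interval based on a normal approximation, which is my choice for illustration and may be optimistic given the small number of colleges. The college labels, estimates and standard errors are hypothetical.

```python
def classify(estimate, std_error, benchmark=0.0, z_crit=1.96):
    """Place a college relative to a benchmark using a normal-approximation interval."""
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    if upper < benchmark:
        return "significantly below"
    if lower > benchmark:
        return "significantly above"
    return "within normal statistical variance"


# Hypothetical (estimate, standard error) pairs, for illustration only.
colleges = {"A": (0.30, 0.10), "B": (-0.05, 0.12), "C": (-0.40, 0.15)}
labels = {name: classify(est, se) for name, (est, se) in colleges.items()}
print(labels)
```

Reporting three bands rather than an ordinal rank avoids implying distinctions between colleges whose intervals overlap the benchmark.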
9 Conclusion and Future Work
The Oxford college system gives students the benefits of belonging both to a large, internationally
renowned institution and to a small, interdisciplinary academic community. The benefits include
more personal tuition and more support than most other universities can give. The college system
naturally raises questions about whether students at different colleges benefit equally.
I find that although most of the variation in exam results is explained by differences in student
ability, colleges also play an important role, comparable to the role played by schools in boosting
student GCSE results. Across models and courses, there is evidence college effectiveness impacts
exam results. However, college effectiveness differs across courses which suggests focusing on the
effectiveness of colleges as a whole may be too simplistic – it is better to focus on courses within
colleges. OLS selection on observables results suggest a one standard deviation improvement in
college effectiveness corresponds to an increase of 0.11 standard deviations in Prelims for PPE, 0.15
for E&M and 0.14 for Law. Selection on unobservables results broadly support this but are imprecisely
estimated.
The finding that effectiveness differs across colleges is encouraging: it implies we can identify the
relevant factors and then improve educational outcomes. I find evidence that high endowment and
large numbers of students studying a given course are associated with more effective colleges. Overall
however, college effectiveness is not easy to explain with available college characteristics.
I hope this study will spark further research on college effectiveness. Future work could build on
this study in a number of ways. First, it would be interesting to examine the external validity of the
findings in this paper by studying colleges at different universities. Cambridge would be an obvious
example, but also universities where colleges exist but play less of a role in teaching than they do
at Oxford. Second, more accurate college effectiveness estimates may be achievable if the data used
here were complemented with measures of the quality of students’ personal statements and school
references. Interview scores for a wider range of courses would also be beneficial. Third, future work
could examine the effect of colleges on a broader set of outcome variables such as post-graduate earnings
or student satisfaction ratings. Different outcomes may lead to different college effectiveness estimates
as different outcomes capture different dimensions of college quality. These multiple measures of
college effectiveness could then be combined (with an appropriate set of weights) to produce a better
overall measure of college effectiveness. Fourth, future work could use better information on college
characteristics which may help to better explain differences in college effectiveness. In particular,
using data on tutorial group sizes, tutor qualifications and hours of tuition may be interesting. Much
of this data is available from “Oxford Colleges On-line Reports for Tutorials” (OxCORT) which is a
web application for the collection and processing of tutorial reports for undergraduate teaching. I
would have used this data for my second stage estimation but gaining access to it proved difficult
as each college owns its own data and must be asked for it individually. Finally, further work
could study tutor value-added at Oxford. While teacher value-added has been extensively studied
at a primary school level and to a lesser extent a secondary school level, there have been only a few
studies of tutor value-added. Two findings in this study potentially imply tutors may play an
important role in the educational production function: (i) college effectiveness varies across courses
and (ii) aggregate college characteristics leave much variation in college effectiveness unexplained.
Again OxCORT data would make such a study feasible.
References
Aaronson, D., L. Barrow, and W. Sander (2007). Teachers and student achievement in the Chicago public high schools.
Journal of Labor Economics 25 (1), 95–135.
Abadie, A., S. Athey, G. W. Imbens, and J. M. Wooldridge (2014). Finite population causal standard errors. Technical
report, National Bureau of Economic Research.
Acemoglu, D. and J. Angrist (2001). How large are human-capital externalities? evidence from compulsory-schooling
laws. In NBER Macroeconomics Annual 2000, Volume 15, pp. 9–74. MIT Press.
Afshartous, D. and M. Wolf (2007). Avoiding ‘data snooping’ in multilevel and mixed effects models. Journal of the
Royal Statistical Society: Series A (Statistics in Society) 170 (4), 1035–1059.
Aitkin, M. and N. Longford (1986). Statistical modelling issues in school effectiveness studies. Journal of the Royal
Statistical Society. Series A (General), 1–43.
Avery, C. and C. M. Hoxby (2004). Do and should financial aid packages affect students’ college choices? In College
choices: The economics of where to go, when to go, and how to pay for it, pp. 239–302. University of Chicago Press.
Ballou, D. (2009). Test scaling and value-added measurement. Education 4 (4), 351–383.
Barnow, B., G. Cain, and A. Goldberger (1981). Selection on observables. Evaluation Studies Review Annual 5 (1),
43–59.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to
multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.
Berk, R. A. (2004). Regression analysis: A constructive critique, Volume 11. Sage.
Bhattacharya, D., S. Kanaya, and M. Stevens (2014). Are university admissions academically fair? Available at SSRN
2082976.
Black, D., J. Smith, and K. Daniel (2005). College quality and wages in the united states. German Economic
Review 6 (3), 415–443.
Black, D. A. and J. A. Smith (2004). How robust is the evidence on the effects of college quality? evidence from
matching. Journal of Econometrics 121 (1), 99–124.
Black, D. A. and J. A. Smith (2006). Estimating the returns to college quality with multiple proxies for quality. Journal
of Labor Economics 24 (3), 701–728.
Boyd, D., H. Lankford, S. Loeb, and J. Wyckoff (2013). Measuring test measurement error: a general approach. Journal
of Educational and Behavioral Statistics 38 (6), 629–663.
Braga, M., M. Paccagnella, and M. Pellizzari (2014). Evaluating students’ evaluations of professors. Economics of
Education Review 41, 71–88.
Bratti, M. (2002). Does the choice of university matter?: a study of the differences across uk universities in life sciences
students’ degree performance. Economics of Education Review 21 (5), 431–443.
Broecke, S. (2012). University selectivity and earnings: Evidence from uk data on applications and admissions to
university. Economics of Education Review 31 (3), 96–107.
Brown, M. B. and A. B. Forsythe (1974). Robust tests for the equality of variances. Journal of the American Statistical
Association 69 (346), 364–367.
Burgess, S. (2015). Human capital and education: The state of the art in the economics of education.
Cameron, A. C. and P. K. Trivedi (2005). Microeconometrics: methods and applications. Cambridge university press.
Carrell, S. E. and J. E. West (2008). Does professor quality matter? evidence from random assignment of students to
professors. Technical report, National Bureau of Economic Research.
Cheng, J. H. and H. W. Marsh (2010). National student survey: are differences between universities and courses
reliable and meaningful? Oxford Review of Education 36 (6), 693–712.
Chetty, R., J. N. Friedman, and J. E. Rockoff (2013a). Measuring the impacts of teachers i: Evaluating bias in teacher
value-added estimates. Technical report, National Bureau of Economic Research.
Chetty, R., J. N. Friedman, and J. E. Rockoff (2013b). Measuring the impacts of teachers ii: Teacher value-added and
student outcomes in adulthood. Technical report, National Bureau of Economic Research.
Chevalier, A. (2014). Does higher education quality matter in the uk. Research in Labor Economics 40, 257–292.
Chevalier, A. and X. Jia (2015). Subject-specific league tables and students’ application decisions. The Manchester
School.
Chizmar, J. F. and T. A. Zak (1983). Modeling multiple outputs in educational production functions. American
Economic Review 73 (2), 18–22.
Clarke, P., C. Crawford, F. Steele, and A. F. Vignoles (2010). The choice between fixed and random effects models:
some considerations for educational research.
Cox, D. R. (1958). Planning of experiments.
Cunha, J. M. and T. Miller (2014). Measuring value-added in higher education: Possibilities and limitations in the
use of administrative data. Economics of Education Review 42, 64–77.
Dale, S. B. and A. B. Krueger (1999). Estimating the payoff to attending a more selective college: An application of
selection on observables and unobservables. Technical report, National Bureau of Economic Research.
Dale, S. B. and A. B. Krueger (2014). Estimating the effects of college characteristics over the career using
administrative earnings data. Journal of Human Resources 49 (2), 323–358.
Davison, K. K. (2012). Propensity score methods as alternatives to value-added modeling for the estimation of teacher
contributions to student achievement.
Dee, T. S. (2004). Are there civic returns to education? Journal of Public Economics 88 (9), 1697–1720.
Deming, D. J. (2014). Using school choice lotteries to test measures of school effectiveness. Technical report, National
Bureau of Economic Research.
Deutsch, J. (2012). Using school lotteries to evaluate the value-added model. Unpublished working paper.
Donald, S. G. and K. Lang (2007). Inference with difference-in-differences and other panel data. Review of Economics
and Statistics 89 (2), 221–233.
Epple, D., R. E. Romano, and M. Urquiola (2015). School vouchers: a survey of the economics literature. Technical
report, National Bureau of Economic Research.
Feld, J. and U. Zölitz (2015). Understanding peer effects: on the nature, estimation and channels of peer effects.
Feng, A. and G. Graetz (2015). A question of degree: the effects of degree class on labor market outcomes. Technical
report, IZA Discussion Papers.
Fitz-Gibbon, C. T. (1991). Multilevel modelling in an indicator system. Schools, classrooms and pupils: international
studies of schooling from multilevel perspective, 67–83.
Fu, C. (2014). Equilibrium tuition, applications, admissions, and enrollment in the college market. Journal of Political
Economy 122 (2), 225–281.
Goldhaber, D. and M. Hansen (2013). Is it just a bad class? assessing the long-term stability of estimated teacher
performance. Economica 80 (319), 589–612.
Goldhaber, D. D. and D. J. Brewer (1997). Why don’t schools and teachers seem to matter? assessing the impact of
unobservables on educational productivity. Journal of Human Resources, 505–523.
Goldstein, H. and P. Sammons (1997). The influence of secondary and junior schools on sixteen year examination
performance: A cross-classified multilevel analysis. School Effectiveness and School Improvement 8 (2), 219–230.
Goldstein, H. and D. J. Spiegelhalter (1996). League tables and their limitations: statistical issues in comparisons of
institutional performance. Journal of the Royal Statistical Society. Series A (Statistics in Society), 385–443.
Guarino, C. M., M. Maxfield, M. D. Reckase, P. N. Thompson, and J. M. Wooldridge (2015). An evaluation of
empirical bayes’s estimation of value-added teacher performance measures. Journal of Educational and Behavioral
Statistics 40 (2), 190–222.
Hanushek, E. (1971). Teacher characteristics and gains in student achievement: Estimation using micro data. American
Economic Review 61 (2), 280–288.
Hanushek, E. A. (1974). Efficient estimators for regressing regression coefficients. American Statistician 28 (2), 66–67.
Hanushek, E. A. (2006). School resources. Handbook of the Economics of Education 2, 865–908.
Hanushek, E. A. and S. G. Rivkin (2010). Generalizations about using value-added measures of teacher quality.
American Economic Review 100 (2), 267–271.
Hanushek, E. A., S. G. Rivkin, and L. L. Taylor (1996). Aggregation and the estimated effects of school resources.
Technical report, National Bureau of Economic Research.
Herrmann, M., E. Walsh, E. Isenberg, A. Resch, et al. (2013). Shrinkage of value-added estimates and characteristics
of students with hard-to-predict achievement levels. Washington, DC: Mathematica Policy Research.
Hill, C. J., H. S. Bloom, A. R. Black, and M. W. Lipsey (2008). Empirical benchmarks for interpreting effect sizes in
research. Child Development Perspectives 2 (3), 172–177.
Hoekstra, M. (2009). The effect of attending the flagship state university on earnings: A discontinuity-based approach.
Review of Economics and Statistics 91 (4), 717–724.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association 81 (396),
945–960.
Illanes, G., C. Sapelli, et al. (2012). Class size and teacher effects in higher education. Technical report.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of
Economics and Statistics 86 (1), 4–29.
James, E., N. Alsalam, J. C. Conaty, and D.-L. To (1989). College quality and future earnings: where should you send
your child to college? American Economic Review 79 (2), 247–252.
Kaiser, B. et al. (2014). Rhausman: Stata module to perform robust hausman specification test. Statistical Software
Components.
Kane, T. J. and D. O. Staiger (2002). The promise and pitfalls of using imprecise school accountability measures.
Journal of Economic Perspectives 16 (4), 91–114.
Klein, S. P., G. Kuh, M. Chun, L. Hamilton, and R. Shavelson (2005). An approach to measuring cognitive outcomes
across higher education institutions. Research in Higher Education 46 (3), 251–276.
Koedel, C. (2009). An empirical analysis of teacher spillover effects in secondary school. Economics of Education
Review 28 (6), 682–692.
Koedel, C. and J. R. Betts (2011). Does student sorting invalidate value-added models of teacher effectiveness? an
extended analysis of the rothstein critique. Education 6 (1), 18–42.
Koedel, C., R. Leatherman, and E. Parsons (2012). Test measurement error and inference from value-added models.
BE Journal of Economic Analysis & Policy 12 (1).
Koedel, C., K. Mihaly, and J. Rockoff (2015). Value-added modeling: A review. Economics of Education Review.
Konstantopoulos, S. (2005). Trends of school effects on student achievement: Evidence from nls: 72, hsb: 82, and nels:
92.
Ladd, H. F. (2008). Teacher effects: What do we know. Teacher quality: Broadening and deepening the debate, 3–26.
Ladd, H. F. and R. P. Walsh (2002). Implementing value-added measures of school effectiveness: getting the incentives
right. Economics of Education Review 21 (1), 1–17.
Lankester, T. et al. (2005). Undergraduate admissions: policy and procedures. Technical report, Working Party
on Selection and Admissions.
Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional
independence assumption. Springer.
Lockwood, J. and D. F. McCaffrey (2014). Correcting for test score measurement error in ancova models for estimating
treatment effects. Journal of Educational and Behavioral Statistics 39 (1), 22–52.
Long, M. C. (2008). College quality and early adult outcomes. Economics of Education Review 27 (5), 588–602.
Lucas, J. (1980). Norrington blues.
Manly, C. A. and R. S. Wells (2015). Reporting the use of multiple imputation for missing data in higher education
research. Research in Higher Education 56 (4), 397–409.
McCaffrey, D. F., T. R. Sass, J. Lockwood, and K. Mihaly (2009). The intertemporal variability of teacher effect
estimates. Education 4 (4), 572–606.
Miller III, D. W. (2009). Essays on higher education policy. Ph. D. thesis, Stanford University.
Moretti, E. (2004). Estimating the social return to higher education: evidence from longitudinal and repeated
cross-sectional data. Journal of Econometrics 121 (1), 175–212.
Moulton, B. R. (1986). Random group effects and the precision of regression estimates. Journal of Econometrics 32 (3),
385–397.
Naylor, R., J. Smith, and S. Telhaj (2015). Graduate returns, degree class premia and higher education expansion in
the uk. Oxford Economic Papers, gpv070.
Nye, B., S. Konstantopoulos, and L. V. Hedges (2004). How large are teacher effects? Educational Evaluation and
Policy Analysis 26 (3), 237–257.
O’Hara, R. (2016). The collegiate way. http://collegiateway.org/. Accessed: 2016-04-15.
Oster, E. (2013). Unobservable selection and coefficient stability: Theory and validation. Technical report, National
Bureau of Economic Research.
Pallais, A. (2013). Small differences that matter: Mistakes in applying to college. Technical report, National Bureau
of Economic Research.
Papay, J. P. (2011). Different tests, different answers: the stability of teacher value-added estimates across outcome
measures. American Educational Research Journal 48 (1), 163–193.
Raudenbush, S. W. and J. Willms (1995). The estimation of school effects. Journal of Educational and Behavioral
Statistics 20 (4), 307–335.
Reardon, S. F. and S. W. Raudenbush (2009). Assumptions of value-added models for estimating school effects.
Education 4 (4), 492–519.
Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal
effects. Biometrika 70 (1), 41–55.
Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables.
Education 4 (4), 537–571.
Roy, A. D. (1951). Some thoughts on the distribution of earnings. Oxford economic papers 3 (2), 135–146.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 34–58.
Rubin, D. B., E. A. Stuart, and E. L. Zanutto (2004). A potential outcomes view of value-added assessment in
education. Journal of Educational and Behavioral Statistics, 103–116.
Saavedra, J. E. (2009). The learning and early labor market effects of college quality: A regression discontinuity
analysis. Investigaciones del ICFES.
Scott-Clayton, J. (2012). Information constraints and financial aid policy. Technical report, National Bureau of
Economic Research.
Seftor, N., J. Constantine, S. Cody, M. Ponza, J. Knab, J. Deke, and S. Monahan (2011). What works clearinghouse:
Procedures and standards handbook 2011 (ncee 2011-xxxx). Washington, DC: National Center for Education
Evaluation and Regional Assistance, Institute of Education Sciences, US Department of Education.
Smith, J., A. McKnight, and R. Naylor (2000). Graduate employability: policy and performance in higher education
in the uk. Economic Journal 110 (464), 382–411.
Sullivan, D. G. (2001). A note on the estimation of linear regression models with heteroskedastic measurement errors.
Thomas, S., P. Sammons, P. Mortimore, and R. Smees (1997). Stability and consistency in secondary schools’ effects
on students’ gcse outcomes over three years. School effectiveness and school improvement 8 (2), 169–197.
Todd, P. E. and K. I. Wolpin (2003). On the specification and estimation of the production function for cognitive
achievement. Economic Journal 113 (485), F3–F33.
Waldinger, F. (2010). Quality matters: The expulsion of professors and the consequences for phd student outcomes in
nazi germany. Journal of Political Economy 118 (4), 787–831.
Walker, I. and Y. Zhu (2013). The impact of university degrees on the lifecycle of earnings: some further analysis.
A Proof of Proposition 1
This proof is adapted from Bhattacharya et al. (2014). Consider any feasible admissions policy $p_j$ for
college $j$, that is, any policy satisfying the capacity constraint. Since the optimal admissions policy
$p_j^{OPT}$ for college $j$ satisfies the capacity constraint with equality (see the definitions of $z_j$)
and $p_j$ is feasible, we must have:
$$\sum_{x \in X_j} p_j^{OPT}(x)\,\alpha_j(x)\,\eta_j(x) = K_j \geq \sum_{x \in X_j} p_j(x)\,\alpha_j(x)\,\eta_j(x) \;\Rightarrow\; \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right] \alpha_j(x)\,\eta_j(x) \geq 0. \tag{18}$$
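Let $W(p_j) = \sum_{x \in X_j} p_j(x)\,\alpha_j(x)\,\eta_j(x)\,Y_j(x)$. College welfare under $p_j^{OPT}$ differs from college welfare under $p_j$ by:
$$\begin{aligned}
W(p_j^{OPT}) - W(p_j) &= \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right] \alpha_j(x)\,\eta_j(x)\,Y_j(x) \\
&= \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right] \alpha_j(x)\,\eta_j(x)\,\left[Y_j(x) - z_j\right] + z_j \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right] \alpha_j(x)\,\eta_j(x) \\
&\geq \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right] \alpha_j(x)\,\eta_j(x)\,\left[Y_j(x) - z_j\right] \\
&= \sum_{Y_j(x) \geq z_j} \left[p_j^{OPT}(x) - p_j(x)\right] \alpha_j(x)\,\eta_j(x)\,\left[Y_j(x) - z_j\right] + \sum_{Y_j(x) < z_j} \left[p_j^{OPT}(x) - p_j(x)\right] \alpha_j(x)\,\eta_j(x)\,\left[Y_j(x) - z_j\right] \\
&= \sum_{Y_j(x) \geq z_j} \left[1 - p_j(x)\right] \alpha_j(x)\,\eta_j(x)\,\left[Y_j(x) - z_j\right] + \sum_{Y_j(x) < z_j} p_j(x)\,\alpha_j(x)\,\eta_j(x)\,\left[z_j - Y_j(x)\right] \geq 0
\end{aligned} \tag{19}$$
where the first inequality holds by (18) and because, by condition 1, $z_j > 0$. The final equality uses the fact that $p_j^{OPT}(x) = 1$ when $Y_j(x) \geq z_j$ and $p_j^{OPT}(x) = 0$ otherwise, and the final inequality holds because every term in both sums is non-negative. Therefore we have
$W(p_j^{OPT}) \geq W(p_j)$ for any feasible $p_j$ and the solution given in Proposition 1 is optimal.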
To show uniqueness, argue by contradiction. Consider any feasible rule $p_j$ which differs from
$p_j^{OPT}$ for some admissions profiles $x$ in a non-empty set $X(p_j) := \{x \in X_j \mid p_j^{OPT}(x) \neq p_j(x)\}$, and
let $W(p_j^{OPT}) = W(p_j)$. Then the last inequality on the RHS of (19) holds with equality, so $p_j$
must take the form:
$$p_j(x) = \begin{cases} 1 & \text{if } Y_j(x) \geq z_j \\ 0 & \text{if } Y_j(x) < z_j \end{cases}$$
However, this implies $p_j(x) = p_j^{OPT}(x)$ for all $x$. This contradicts the assumption that $X(p_j)$ is
non-empty. Therefore $W(p_j^{OPT}) > W(p_j)$ for any feasible $p_j$ that differs from $p_j^{OPT}$, leading to the
desired uniqueness property of $p_j^{OPT}$.
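The optimality argument can be checked numerically on a toy example. The sketch below is illustrative only and is not part of the original analysis: the five admissions profiles, their weights (alpha times eta) and their outcomes Y are made-up numbers, the threshold z is chosen by hand, and capacity K is set so that the threshold rule fills it exactly, as the proof assumes. Random feasible policies are then drawn and their welfare is compared with that of the threshold rule.

```python
import random

# Hypothetical toy data (not from the thesis): five admissions profiles x.
alpha = [0.9, 0.8, 0.7, 0.6, 0.5]   # assumed non-negative weights alpha_j(x)
eta = [10, 20, 30, 20, 10]          # assumed non-negative weights eta_j(x)
y = [72.0, 68.0, 64.0, 60.0, 55.0]  # assumed expected outcomes Y_j(x)
z = 64.0                            # assumed threshold z_j > 0


def seats_used(p):
    """Left-hand side of the capacity constraint: sum of p * alpha * eta."""
    return sum(pi * a * e for pi, a, e in zip(p, alpha, eta))


def welfare(p):
    """W(p) = sum of p * alpha * eta * Y over all profiles."""
    return sum(pi * a * e * yi for pi, a, e, yi in zip(p, alpha, eta, y))


# Threshold rule: admit with probability 1 if Y >= z, otherwise 0.
p_opt = [1.0 if yi >= z else 0.0 for yi in y]

# Capacity is chosen so that the threshold rule binds with equality.
capacity = seats_used(p_opt)


def random_feasible_policy(rng):
    """Draw admission probabilities in [0, 1], scaled down to fit capacity."""
    p = [rng.random() for _ in y]
    used = seats_used(p)
    if used > capacity:
        p = [pi * capacity / used for pi in p]
    return p


rng = random.Random(0)
w_opt = welfare(p_opt)
# Largest amount by which any sampled feasible policy beats the threshold rule.
max_gap = max(welfare(random_feasible_policy(rng)) - w_opt for _ in range(10000))
print("welfare of threshold rule:", w_opt)
print("best sampled improvement over it:", max_gap)
```

If the argument in (19) is right, no sampled feasible policy should achieve higher welfare than the threshold rule, so the reported improvement should not be positive (up to floating-point rounding). A random search of this kind illustrates the result but does not replace the proof.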