
Gryffindor or Slytherin? The effect of an Oxford College

David Lawrence∗

Supervisor: Dr Johannes Abeler

Submitted in partial fulfilment of the requirements for the degree of

Master of Philosophy in Economics

Department of Economics

University of Oxford

Trinity Term 2016

∗ I would like to thank my supervisor, Johannes Abeler, for the patient guidance, encouragement and advice he has provided throughout my time as his student. I have been extremely lucky to have a supervisor who cared so much about my work, and who responded to my questions and queries so enthusiastically and promptly. I am very grateful to Dr Gosia Turner in Student Data Management and Analysis at Oxford University for providing the data and answering my many questions about it. Valuable comments were received from Theres Lessing, Jonas Mueller-Gastell, Leon Musolff and Matthew Ridley. This work was supported by the Economic and Social Research Council. Word count: 29,904 (356 words on page 2, including footnotes, multiplied by 84 pages, including the title page)

Abstract

Students at Oxford University attend different colleges. Does the college a student

attends matter for their examination results? To answer this question, I use data on all

Oxford applicants and entrants between 2009 and 2013, focusing primarily on Preliminary

Examination (Prelims) results for 3 courses: Philosophy, Politics and Economics (PPE),

Economics and Management (E&M) and Law. I use two methods to account for the

possibility that student ability differs systematically between colleges. First, I control for

“selection on observables” by running an OLS regression on college dummy variables

and variables capturing almost all information available to admissions tutors. Results

show that colleges matter statistically and practically. Colleges have a modest impact on

average Prelims scores, similar to the impact secondary schools have on GCSE results. A

one standard deviation increase in college effectiveness leads to a 0.11 standard deviation

increase in PPE average Prelims score. The equivalent figures are 0.15 for E&M, 0.14 for

Law and 0.09 for all courses combined. Second, I take advantage of a special feature of the

Oxford admissions process – that “open applicants” are randomly assigned to colleges –

to control for “selection on observables and unobservables”. Results suggest differences in

college effectiveness are large and accounting for unobservable ability can change college

effectiveness estimates considerably. However, the results are very imprecise so it is

difficult to draw strong conclusions. I also test whether my college effectiveness estimates

can be explained by college characteristics and find college endowment and peer effects,

operating through the number of students per course within a college, are related to college

effectiveness.

Keywords: Oxford, college effectiveness, selection bias, selection on observables and

unobservables, examination results


Contents

1 Introduction
    1.1 Prior Literature
2 Institutional Background
3 Theoretical Model
    3.1 Defining College Effects
    3.2 College Admissions
        3.2.1 Applications and Applicant Ability
        3.2.2 Application Profiles
        3.2.3 Enrolment Probabilities and Expected Exam Results
        3.2.4 The College Admissions Problem
4 Econometric Models
    4.1 Model 1 – Norrington Table
    4.2 Model 2 – Selection on Observables
    4.3 Model 3 – Selection on Observables and Unobservables
5 Data
    5.1 Why use Four Datasets?
    5.2 Choice of Outcome Variable
    5.3 Choice of Control Variables
    5.4 Sample Selection
    5.5 Descriptive Statistics
        5.5.1 Testing Assumptions for Selection on Observables and Unobservables
6 Results
    6.1 Results for Norrington Table Plus and Selection on Observables
    6.2 Robustness Checks for Norrington Table and Selection on Observables
        6.2.1 Alternative Outcome Variables
        6.2.2 Interval Scale Metric Assumption
        6.2.3 Heterogeneity in College Effectiveness across Students of Different Types
    6.3 Results for Selection on Observables and Unobservables
7 Characteristics of Effective Colleges
8 Discussion and Limitations
9 Conclusion and Future Work
A Proof of Proposition 1


List of Tables

1 Information Available in each Dataset
2 Description of Control Variables
3 Sample Selection: PPE
4 Sample Selection: All Subjects
5 Sample Selection: E&M
6 Sample Selection: Law
7 Application, Offer and Enrolment Statistics: PPE and E&M
8 Application, Offer and Enrolment Statistics: Law and All Subjects
9 Mean Applicant and Exam Taker Characteristics: PPE
10 Mean Applicant and Exam Taker Characteristics: E&M
11 Mean Applicant and Exam Taker Characteristics: Law
12 Mean Applicant and Exam Taker Characteristics: All Subjects
13 Tests for Differences in Mean and Variance of Applicant Ability across Colleges
14 P-values from Balance Tests
15 Regressions: PPE
16 Regressions: E&M
17 Regressions: Law
18 Regressions: All Subjects
19 Correlation in College Effects across Courses
20 Alternative Dependent Variable Regressions: PPE
21 Alternative Dependent Variable Regressions: E&M
22 Alternative Dependent Variable Regressions: Law
23 P-values from Tests for Heterogeneity in College Effects across Students
24 Selection on Observables and Unobservables Results for various λ1: PPE, E&M and Law
25 Selection on Observables and Unobservables Results: All Subjects, English, Maths and History
26 Second Stage Regression Results: Impact of Endowment
27 Second Stage Regression Results: Evidence of Peer Effects

List of Figures

1 Applicant Ability and College Admissions Decisions
2 College Ranking by Course: Norrington Table Plus vs Selection on Observables
3 Comparison of Selection on Observables College Ranking across Courses
4 Comparison of College Rankings across Models: All Subjects


1 Introduction

The popular Harry Potter novels of J.K. Rowling are set in the fictional Hogwarts School of Witchcraft

and Wizardry where all the students are magically assigned by a “sorting hat” to one of four houses:

Gryffindor, Slytherin, Hufflepuff, and Ravenclaw. Oxford University is organised in a similar way to

Hogwarts. Oxford divides students into colleges, just as Hogwarts divides students into houses. The

college a student attends can influence not only the facilities available to them (like catering services

and libraries), their accommodation and their peers but also the teaching they receive.

In this paper I address two basic questions that arise in the context of Oxford colleges. First,

to what extent do colleges “make a difference” to student outcomes? Second, are any differences in

college effectiveness1 captured by college characteristics such as endowment, age and size? To answer

these questions I use admissions and examination (exam) data on all Oxford applicants and entrants

between 2009 and 2013, focusing on how exam results (specifically first year “Prelims” results) vary

across colleges in three particular courses: Philosophy, Politics and Economics (PPE), Economics

and Management (E&M) and Law as well as across all courses (“All Subjects”).

The key complication in answering these questions is selection bias. Selection into colleges is

non-random and thus student ability may differ systematically between colleges. Selection occurs:

(i) at the application stage (students choose to apply to one college and not to others); (ii) at the

admissions stage (admission tutors take decisions to make offers to some students and not others);

and (iii) at the enrolment stage (students with offers decide whether they want to accept the offer).

Non-random selection into colleges can be based on observable characteristics (e.g. prior attainment)

and unobservable characteristics (e.g. motivation) which may themselves be correlated with exam

results. Failure to adequately control for such selection would lead to biased estimates of college

effectiveness, favouring colleges with higher ability students.

To overcome the problem of selection bias I employ two empirical methods. First, I estimate

an OLS regression which identifies college effects only under a “selection on observables” assumption. Detailed data on almost all variables used by admissions tutors provides some support for this assumption. Nevertheless, concern remains that “selection on unobservables” may bias college effectiveness estimates.

1 I use the term “college effectiveness” to mean the contribution of colleges to student examination results. I use “college effectiveness”, “college effect” and “college quality” interchangeably.
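The selection-on-observables idea can be illustrated with a minimal simulation. This is a sketch on made-up data, not the estimation code or the data used in this paper: the three colleges, their effects, the ability shifts and the single observed ability control are all hypothetical. It regresses exam score on college dummies plus ability (computed here through the equivalent within-college transformation) and compares the adjusted college gaps with raw differences in means.

```python
import random
from statistics import mean

# Hypothetical values chosen for illustration only (not estimates from the paper).
TRUE_EFFECT = {"A": 0.0, "B": 0.3, "C": -0.2}    # college effects, A is the reference
ABILITY_SHIFT = {"A": 0.0, "B": 0.5, "C": -0.5}  # non-random sorting on observed ability


def simulate(n_per_college=2000, seed=0):
    """Simulate (college, ability, score) rows with sorting on an observable."""
    rng = random.Random(seed)
    rows = []
    for college, effect in TRUE_EFFECT.items():
        for _ in range(n_per_college):
            ability = rng.gauss(ABILITY_SHIFT[college], 1.0)
            score = ability + effect + rng.gauss(0.0, 0.5)
            rows.append((college, ability, score))
    return rows


def raw_gaps(rows, ref="A"):
    """Difference in mean score relative to the reference college (no controls)."""
    colleges = sorted({c for c, _, _ in rows})
    y_bar = {c: mean(s for cc, _, s in rows if cc == c) for c in colleges}
    return {c: y_bar[c] - y_bar[ref] for c in colleges}


def adjusted_gaps(rows, ref="A"):
    """OLS of score on college dummies and ability, via within-college demeaning."""
    colleges = sorted({c for c, _, _ in rows})
    x_bar = {c: mean(a for cc, a, _ in rows if cc == c) for c in colleges}
    y_bar = {c: mean(s for cc, _, s in rows if cc == c) for c in colleges}
    sxy = sum((a - x_bar[c]) * (s - y_bar[c]) for c, a, s in rows)
    sxx = sum((a - x_bar[c]) ** 2 for c, a, _ in rows)
    slope = sxy / sxx  # common coefficient on ability
    intercept = {c: y_bar[c] - slope * x_bar[c] for c in colleges}
    return {c: intercept[c] - intercept[ref] for c in colleges}


rows = simulate()
raw = raw_gaps(rows)            # mixes college effects with ability differences
adjusted = adjusted_gaps(rows)  # close to TRUE_EFFECT when ability is fully observed
```

In this simulated setting the raw gap for college B overstates its effect because B also attracts abler students, while the adjusted gap is close to the assumed value. The adjustment works here only because the sorting variable is fully observed, which is exactly what the assumption requires.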

Second, I take advantage of a special feature of the Oxford admissions process: some applicants

choose to make an “open application”. These applicants do not apply directly to a college; instead

their application profiles are randomly allocated between those colleges that receive relatively few

direct applicants. Intuitively, random assignment implies all colleges receive open applicants with

equal ability on average. Hence, in relative terms, colleges accepting a large proportion of open

applicants allocated to them must have received weak direct applications and have low admissions

standards, while colleges that accept a low proportion of open applicants must have received strong

direct applications and have high admissions standards. I formalise this intuition in a theoretical

model. Given additional assumptions concerning the distribution of applicant ability, this method

can account for “selection on observables and unobservables”. Exam result differences across colleges

remaining after controlling for both observables and unobservables can be considered a measure of

college effectiveness or alternatively college “value-added”.2
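The intuition behind the open-applicant approach can be sketched in a few lines. The following is an illustrative simulation under strong assumptions that are mine rather than the paper's model: open applicants' ability is standard normal, each college admits every open applicant above a single ability threshold, and the two colleges and their thresholds are hypothetical. Because open applicants are randomly assigned, the share a college accepts reveals its threshold.

```python
import random
from statistics import NormalDist


def open_acceptance_share(threshold, n_open=20000, seed=0):
    """Share of randomly assigned open applicants whose ability clears the threshold."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_open)]  # assumed N(0, 1) ability
    return sum(a >= threshold for a in abilities) / n_open


def implied_threshold(share):
    """Invert the acceptance share under the assumed standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - share)


# Hypothetical admissions thresholds, in ability standard deviations.
thresholds = {"lenient college": 0.5, "demanding college": 1.2}
shares = {
    name: open_acceptance_share(t, seed=k)
    for k, (name, t) in enumerate(thresholds.items())
}
recovered = {name: implied_threshold(s) for name, s in shares.items()}
```

The demanding college accepts a smaller share of its open applicants, and inverting each share recovers a threshold close to the one assumed. The inversion depends entirely on the distributional assumption, which is why assumptions about the distribution of applicant ability are needed.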

My results reveal colleges matter. A simple comparison of average exam results suggests large

differences between colleges. When I account for observable student characteristics, exam result

differences shrink because high ability students tend to attend more effective colleges. The vast

majority of variation in exam results is due to between-student differences. However, even after

controlling for observables there remains strong evidence that colleges differ in their effectiveness

in boosting student exam results – college effectiveness differences are statistically and practically

significant in all courses I consider. A one standard deviation increase in college effectiveness leads

to a 0.11 standard deviation increase in Prelims average score in PPE (a 0.65 mark increase). This

would be enough to move a 50th percentile student up to the 55th percentile. The estimated standard

deviation of college effectiveness is 0.15 for E&M, 0.14 for Law and 0.09 across All Subjects. College

effectiveness differences are comparable to school effectiveness differences and slightly lower than

teacher effectiveness differences.

2 Although widely used, the “value-added” term is questionable because inputs and outputs are measured in different units (Goldstein and Spiegelhalter, 1996).
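As a rough check on the magnitudes quoted above for PPE, the arithmetic can be reproduced under a normal approximation. The normality of Prelims scores is an assumption made here for illustration; the 0.11 standard deviation and 0.65 mark figures are taken from the text.

```python
from statistics import NormalDist

effect_in_sd = 0.11     # PPE: effect of a one SD rise in college effectiveness (from the text)
effect_in_marks = 0.65  # the same effect expressed in marks (from the text)

# Implied standard deviation of PPE Prelims average scores, in marks.
implied_sd_marks = effect_in_marks / effect_in_sd

# Percentile reached by a median student after a 0.11 SD gain,
# assuming (for illustration) that scores are normally distributed.
new_percentile = 100 * NormalDist().cdf(effect_in_sd)
```

Under these assumptions the implied standard deviation of scores is a little under six marks, and the median student moves to between the 54th and 55th percentile, broadly in line with the figure quoted above.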


I also produce course-specific college rankings that improve on the Norrington table3 as they

account for observable student characteristics. College rankings at an aggregate level are of limited

use because college effectiveness differs across courses – hence I focus attention on courses within

colleges. Course-specific college rankings are subject to large confidence intervals because of the low

number of students per course at each college.

Accounting for selection on unobservable student characteristics would likely further change the

results. Unfortunately for PPE, E&M and Law, estimation error prevents me from obtaining point

estimates for the effectiveness of each college (as only a small number of open applicants enrol

at Oxford). Instead I present college effectiveness estimates for different parameterisations of the

relationship between prior ability and exam results. I do obtain college effectiveness estimates for

some other courses (English, Maths and History) and for All Subjects combined.4 The results suggest

variation in college effectiveness remains large and that unobservable ability can dramatically change

college effectiveness estimates. However, the estimates are imprecise so it is difficult to reach strong

conclusions.

Having established that college effects exist, I use a second stage regression to examine whether

they can be explained by college characteristics. The most interesting finding is evidence that peer

effects, operating through the number of students per college studying the same course, contribute

to college effectiveness. Reverse causality is also possible – if a college happens to be strong in one

subject for whatever reason, it is likely to hire more fellows and thus increase the size of

the cohort at that college. If there are benefits to clustering together students studying the same

subject then a potential policy implication would be to close small, under-performing courses within

a college. There is also evidence that richer colleges are more effective than poorer colleges. However,

given that college effectiveness is imperfectly correlated across courses, it seems likely that college

effectiveness is primarily determined by course-specific variables related to teaching and peer effects.

Overall, much of the variation in college effectiveness remains unexplained.

3 The Norrington table, published each year, documents the degree outcomes of students at each Oxford college. It ranks colleges using the Norrington score, devised in the 1960s by Sir Arthur Norrington, which attaches a score to degree classifications and expresses the overall calculation for each college as a percentage.

4 Though aggregating across courses makes the random assignment of open applicants far less credible.

The results of this study may be of interest to a number of different audiences. First, it may interest economists studying the educational production function. At a school level, economists have

struggled to identify a systematic relationship between school resources and academic performance.

This study informs us about the relationship between college resources and academic performance.

Second, this study can help prospective students deciding which college to apply to. An Oxford

college education is an experience good, with quality difficult to observe in advance and only really

ascertained upon consumption. Thus the application decisions of prospective students are likely to

be based on imperfect information. This paper shows attending a high quality college can boost

students’ exam results which is important given the substantial economic return to better university

exam performance. Better exam performance at UK universities is closely related to entering further

study (Smith et al., 2000), employment (Smith et al., 2000), industry choice (Feng and Graetz,

2015), short-run earnings (Feng and Graetz, 2015; Naylor et al., 2015) and lifecycle earnings (Walker

and Zhu, 2013). For example, Feng and Graetz (2015) study students from the London School of

Economics and find the causal wage payoff 12 months after graduating with a First compared with an

Upper Second is a 3% higher expected wage. The difference between an Upper Second and a Lower

Second is 7% higher wages. Thus there should be demand by applicants for third-party evaluations

of college quality just as there is demand for league tables of university quality (Chevalier and Jia,

2015). My college effectiveness estimates help to fill this gap in the market – they improve on the

unadjusted college rankings currently available to prospective students in the Norrington table.56

5 Of course, exam based rankings are only a starting point for application decisions and should complement other information about colleges’ quality (such as cost, location, accommodation and facilities) from publications, older siblings, friends at Oxford and personal visits to colleges.

6 More informed students may create dynamic effects as they would then be able to “vote with their feet” like consumers in a Tiebout model. On the one hand, this may drive up college quality by increasing competition between colleges. On the other hand, as pointed out by Lucas (1980), when criticising the Norrington table, it may increase inequality in raw exam results between colleges because lower ranked colleges would find it difficult to recruit high ability students. Increased competition may also discourage colleges from cooperating with each other.

Third, my analysis may be of interest to Oxford colleges themselves. Colleges need to measure past effectiveness relative to other colleges for a number of reasons. It allows them to learn best practices from, and share problems with, other colleges, evaluate their own practices, allocate resources more efficiently and plan and set targets for the future. Yet currently colleges receive scant feedback on their past performance in raising exam results and the information they do receive from the Norrington table can be misleading or demoralising due to selection bias – Norrington table rank may be more informative about who their students are than how they were taught. My estimates provide a better picture of a college’s performance. Furthermore, my analysis suggests college effectiveness may be increased by admitting a larger number of students per course, so perhaps colleges should concentrate on a narrower range of courses. Even small improvements in college effectiveness are important, because they might be cumulative and because they refer to a large number of students.7

1.1 Prior Literature

This is the first study of differences between Oxford colleges. However, my paper is related to various

literatures interested in measuring differences in effectiveness across teachers, schools and universities.

First, there is a large and active literature (much done by economists) on the value-added of teachers in schools (Hanushek, 1971; Chetty et al., 2013a,b; Koedel et al., 2015) and universities (Carrell and West, 2008; Waldinger, 2010; Illanes et al., 2012; Braga et al., 2014). Empirical evidence shows students are not randomly assigned to teachers, even within schools or universities (e.g. Rothstein, 2009). To account for non-random assignment, teacher value-added models use similar methods to those in this paper – either “selection on observables”, where observables include student and family input measures and a lagged standardised test score, or random assignment of students to teachers (Nye et al., 2004; Carrell and West, 2008). The main conclusions of teacher value-added studies also mirror my findings. Teachers, like colleges, vary in their effectiveness (Nye et al., 2004; Ladd, 2008; Hanushek and Rivkin, 2010; Braga et al., 2014). Within schools, Nye et al. (2004) review 18 early studies of teacher value-added. Using the same method I use (though I correct for measurement error), they find a median standard deviation of teacher effectiveness of 0.34. Hanushek and Rivkin (2010) review more recent studies and report estimates, adjusted for measurement error, that range from 0.08 to 0.26 (average 0.11) using reading tests and 0.11 to 0.36 (average 0.15) in maths. They conclude the literature leaves “little doubt that there are significant differences in teacher effectiveness” (p. 269). Within universities, Braga et al. (2014) find a one standard deviation increase in teacher quality leads to a 0.14 standard deviation increase in Economics test scores and a 0.22 standard deviation increase in Law and Management test scores. Overall, teacher effects appear slightly larger than the college effects I find (0.09–0.15). However, there is no consistent relationship between teacher effectiveness and observable teacher characteristics such as education, experience or salary (Burgess, 2015).

7 Estimates of effectiveness similar to mine are often used for teacher and school accountability purposes. However, for reasons detailed in section 8, I do not believe my college effect estimates should be used to hold colleges to account.

Second, there is a literature on the value-added of schools (though only some by economists)

(Aitkin and Longford, 1986; Goldhaber and Brewer, 1997; Ladd and Walsh, 2002; Rubin et al., 2004;

Reardon and Raudenbush, 2009). Again similar empirical strategies are used, though non-economists

tend to use random effect models whereas economists favour fixed effect models. Although school

effectiveness is found to impact test scores, there is a consistent finding that schools, like colleges,

have less impact on test scores than teachers with most estimates in the range 0.05-0.20 (Nye et al.,

2004; Konstantopoulos, 2005; Deutsch, 2012; Deming, 2014).8 In one of the most credible studies,

Deutsch (2012) takes advantage of a school choice lottery to estimate a school effect size, adjusted

for measurement error, of 0.12. School effect sizes seem similar to college effect sizes. Thomas et al.

(1997), for example, find the standard deviation in total GCSE performance between schools is 0.10

when pooled across all subjects and is higher in individual subjects ranging from 0.13 in English to

0.28 in History. This closely mirrors my results in terms of the size of school (college) effects,

the variation across subjects (courses) and the fact there is less variation in effectiveness once subjects

(courses) are pooled together. Therefore the impact of colleges on exam results appears similar to

the impact of schools on GCSE results. This literature also finds school resources have only a weak

relationship with test scores, leaving much variation in school effectiveness unexplained (Hanushek,

2006; Burgess, 2015).

8 School effect sizes differ depending on the age of the students – they are highest in kindergarten, fall as students become older until bottoming out around GCSE age and rise again in the 6th form (e.g. Goldstein and Sammons, 1997; Fitz-Gibbon, 1991).

Third, a small number of studies have attempted to measure university effects on degree outcomes (Bratti, 2002), student satisfaction (Cheng and Marsh, 2010), standardised test scores (Klein et al., 2005) and earnings (Miller III, 2009; Cunha and Miller, 2014). In the attempt to account for selection bias, “selection on observables” methods have been used exclusively. Results suggest large unconditional differences in outcomes across universities, with observable student covariates accounting for a substantial portion, but not all, of these differences (Miller III, 2009; Cunha and Miller, 2014). Observable university characteristics explain only a small proportion of variation in university value-added (Bratti, 2002).

Beyond “value-added”, this paper is related to the research done by economists on the effect on

earnings from attending a higher “quality” university, where “quality” is usually defined in terms of

mean entry grade, expenditure per student, student/staff ratio and/or ranking in popular league

tables (Dale and Krueger, 1999; Black and Smith, 2004, 2006). Conceptually, measuring the return

to institution quality is quite different to my analysis focusing on institution effectiveness. Whereas I

attempt to estimate quality directly, this literature takes quality as given and attempts to estimate the

labour market return to a higher quality. Nevertheless, the university quality literature is worth

considering because it has found interesting ways to tackle the non-random selection of students into

universities (better students sort into higher quality colleges). Studies tend to aggregate universities

into a small number of quality groups, thereby reducing the dimensionality of the selection problem.

This facilitates the use of selection on observables based on OLS (James et al., 1989; Black et al.,

2005), selection on observables based on matching (Black and Smith, 2004; Chevalier, 2014) and

methods to account for selection on unobservables including regression discontinuity (Saavedra, 2009;

Hoekstra, 2009), instrumental variables (Long, 2008) and applicant group fixed effects (Dale and

Krueger, 1999, 2014; Broecke, 2012).9 However, no study in this literature has had the opportunity

to exploit random assignment, as I am able to do.

9 I considered, but ultimately rejected, using these methods to account for selection on unobservables. For instance, matching could be applied to Oxford colleges with only minimal complications, such as in Davison (2012), but would do nothing to help account for unobservables. Instrumental variables requires finding over 30 valid instruments, one for each college, which is a formidable challenge. Applicant group fixed effects work better in a university context than a college context because they face a multicollinearity problem when students apply to only one college (see discussion in Miller III (2009)). In addition, applicant group fixed effects make the strong assumption that students apply to colleges in a rational way. I did estimate regressions with applicant group fixed effects but the results were unconvincing and are not reported.

The rest of the paper is organised as follows. Section 2 briefly explains the institutional background. Section 3 lays out a theoretical model of Oxford admissions that defines college effects. Section 4 explains the problem of selection bias and outlines econometric models that account for “selection on observables” and “selection on observables and unobservables” respectively. Section 5 describes the data. Section 6 presents the results. Section 7 considers whether college characteristics can explain differences in college effectiveness. Section 8 discusses limitations and Section 9 concludes. Proofs are collected in the appendix.

2 Institutional Background

The college model is one of the oldest forms of academic organisation in existence. It originated 700

years ago in the UK and was long confined to the universities of Oxford, Cambridge, and Durham.

Today however, college systems have spread worldwide. College systems now operate at several other

British universities including Bristol, Kent and Lancaster. In the US, Harvard, Yale and others have

established similar college systems. College systems are also common in Canada, Australia, and New

Zealand and are present in numerous other countries from Mexico to China (O’Hara, 2016).

Oxford University can be thought of as consisting of two parts – (1) a Central Administration and (2) the 32 colleges.10 The Central Administration is composed of academic departments, research centres, administrative departments, libraries and museums. The Central Administration (i) determines the content of the courses within which college teaching takes place, (ii) organises lectures, seminars and lab work, (iii) provides resources for teaching and learning such as libraries, laboratories, museums and computing facilities, (iv) provides administrative services and centrally managed student services such as counselling and careers and (v) sets and marks exams, and awards degrees. The colleges are self-governing, financially independent and are related to the Central Administration in a federal system not unlike the federal relationship between the 50 states of America and the US Federal Government. The colleges (i) select and admit undergraduate students, (ii) provide accommodation, meals, common rooms, libraries, sports and social facilities, and pastoral care for their students and (iii) are responsible for tutorial teaching for undergraduates. Thus Oxford colleges play a significant role in university life, making Oxford an ideal place to study college effects.

10 There are also five Permanent Private Halls at Oxford admitting undergraduates. They tend to be smaller than colleges, and offer fewer subjects but are otherwise similar. From now on I include them when I refer to “colleges”.


3 Theoretical Model

In this section I develop a theoretical model of college admissions. The model serves two main

purposes. First, it allows me to formally define the “effect” of attending an Oxford college. A failure

to clearly define the causal effect of interest has been a criticism of much of the school effect literature

(Rubin et al., 2004; Reardon and Raudenbush, 2009). Second, the model motivates the empirical

strategies I employ to identify college effects in section 4.

3.1 Defining College Effects

There are a total of N applicants to Oxford indexed i = 1, 2, ..., N and J colleges indexed j =

1, 2, . . . , J . For each student i there exist J potential exam results Y 1i , Y

2i , . . . .Y

Ji , where Y ji denotes

the exam result at some specified time (such as end of year 1) that would be realised by individual i

if he or she attended college j. Let each potential exam result depend on pre-admission ability Ai, a

1 x K row vector. Ai permits multiple sources of ability which may be observable or unobservable.

It should be interpreted broadly to include not only cognitive ability but also motivation. Potential

exam results also depend on college effects cij , which are allowed to vary across students, and a

possibly heteroskedastic random shock eij , uncorrelated with ability and representing measurement

error in exam results such as illness on the day of the exam and subjective marking of exams. The

potential exam result obtained by an individual i who attends college j is:

$$Y_i^j = Y_i^j(A_i, c_{ij}, e_{ij}). \tag{1}$$

For student i the causal effect of attending college j as opposed to college k is the difference in

potential outcomes $Y_i^j - Y_i^k$. The main focus of this paper is on estimating the average causal effect

of college j relative to a reference college k for the subpopulation of n ≤ N students who actually enrol

at Oxford (denoted by the set E). This average causal effect of college j relative to college k is:

$$\beta_j = c_j - c_k = \frac{1}{n} \sum_{i \in E} c_{ij} - \frac{1}{n} \sum_{i \in E} c_{ik}. \tag{2}$$

Focusing on the subpopulation of students who attend Oxford, rather than the full population

of applicants, makes sense because many applicants (perhaps due to weak prior achievement at


school) may have only a low chance of attending Oxford. The definition of college effects relies on two

assumptions.

Assumption 1. “Manipulability”: $Y_i^j$ exists for all i and j

Assumption 1 is the assumption of manipulable college assignment (Rosenbaum and Rubin, 1983;

Reardon and Raudenbush, 2009). It says each student has at least one potential outcome per college.

Intuitively to talk about the effect of college j one needs to be able to imagine student i attending

college j, without changing the student’s prior characteristics Ai. “Manipulability” would be violated,

for instance, if a college only accepted women implying the potential outcome of a male student at that

college may not exist. This assumption is relatively unproblematic at Oxford (certainly compared

to schools or universities). Oxford colleges are not generally segregated by student characteristics11

so it is not difficult to imagine Oxford applicants attending different colleges. Randomness in the

admissions process also makes it possible that all applicants have at least some chance, however

small, of being offered a place at an Oxford college.

Assumption 2. “No interference between units”: $Y_i^j$ is unique for all i and j

Assumption 2 says each student possesses a maximum of one potential exam result in each college,

regardless of the colleges attended by other students (Reardon and Raudenbush, 2009). The “no

interference between units” assumption of Cox (1958) is one part of the “Stable Unit Treatment Value

Assumption” (or SUTVA; Rubin, 1978). Strictly speaking, this means that a given student’s exam

result in a particular college does not depend on who his college peers are (or even how many of them

there are). Evidence of peer effects in education makes this assumption questionable (e.g. Feld and

Zölitz, 2015). Without it, however, we must treat each student as having $J^N$ potential outcomes,

one for each possible permutation of students across colleges. Thus adopting the no interference

assumption makes the problem of causal inference tractable (at the cost of some plausibility). The

consequences of violations of this assumption on the estimates of college effects are unclear, since

without it the causal effects of interest are not well-defined.
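
The tractability gain from the no interference assumption can be illustrated with a minimal Python sketch that counts potential outcomes per student with and without it. The values of `J` and `N` below are hypothetical toy numbers, chosen only so that the enumeration stays small; they are not taken from the Oxford data.

```python
from itertools import product

# Hypothetical toy sizes (assumptions for illustration only).
J = 3  # number of colleges
N = 4  # number of applicants

# With no interference, student i has one potential result per college.
outcomes_with_no_interference = J

# Without it, student i's result may depend on where every student goes,
# so there is one potential result per allocation of all N students.
allocations = list(product(range(J), repeat=N))
outcomes_without_no_interference = len(allocations)

print(outcomes_with_no_interference)     # 3
print(outcomes_without_no_interference)  # 81, i.e. J ** N
```

Even in this toy case the number of potential outcomes per student rises from 3 to 81, and it grows exponentially in N.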

11St Hilda’s, the last all-women’s college, started accepting men in 2008. An exception is colleges that accept only mature students, such as Harris Manchester.


3.2 College Admissions

3.2.1 Applications and Applicant Ability

Responsibility for admissions is devolved to the college level, then again to the course level. To save

notation, let all applicants apply for the same course. College j is allocated (receives the application

profiles of) Dj direct applicants and Oj open applicants to consider for admission.

The direct applicants received by college j are the students who expressed a preference for college j

on their application forms - they applied directly to college j. In total there are $D_1 + D_2 + \ldots + D_J = D$

direct applicants to Oxford. Let the ability of direct applicants to each college be normally distributed

with the mean ability of direct applicants allowed to differ between colleges but with the variance

constrained to be the same for all colleges. In particular, let the ability of direct applicants to college

j be distributed $A_j^D \sim N(\mu_j^D, 1)$ where $A_j^D$ is the ability of a direct applicant to college j and $\mu_j^D$ is

the mean ability of direct applicants to college j.

Colleges also receive open applicants. In total there are $O_1 + O_2 + \ldots + O_J = O$ open applicants

to Oxford and their ability follows a standard normal distribution: $A^O \sim N(0, 1)$. Oxford admissions

procedures require that all open applicants are pooled together by the Undergraduate Admissions

Office. Open applicants are then randomly drawn out, one at a time, and allocated to the college

with the lowest direct applicant to place ratio. This random assignment to colleges is the key to my

selection on unobservables identification procedure. I present evidence in section 5.5.1 that supports

random assignment. Since each college receives a random sample (of size Oj) of open applicants, the

ability of open applicants sent to college j, denoted $A_j^O$, is also distributed $N(0, 1)$.
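
The allocation of open applicants described above can be sketched in a few lines of Python. This is a stylised illustration, not the Undergraduate Admissions Office's actual procedure: the function name, the college labels and the applicant and place counts are all hypothetical, and I assume that a college's ratio is updated to count open applicants it has already been sent, which the description above does not specify.

```python
import random


def allocate_open_applicants(direct, places, n_open, seed=0):
    """Send pooled open applicants, one at a time, to the college with the
    lowest applicant-to-place ratio (hypothetical, stylised procedure)."""
    rng = random.Random(seed)
    # Open applicant ability is drawn from N(0, 1), as in the model.
    pool = [rng.gauss(0.0, 1.0) for _ in range(n_open)]
    rng.shuffle(pool)  # applicants are drawn from the pool in random order
    allocated = {college: [] for college in direct}
    for ability in pool:
        # Assumption: the ratio counts direct applicants plus open
        # applicants already allocated to the college.
        target = min(
            direct,
            key=lambda c: (direct[c] + len(allocated[c])) / places[c],
        )
        allocated[target].append(ability)
    return allocated


# Hypothetical example: three colleges with ten places each.
direct = {"A": 30, "B": 10, "C": 20}
places = {"A": 10, "B": 10, "C": 10}
allocated = allocate_open_applicants(direct, places, n_open=30)
counts = {college: len(a) for college, a in allocated.items()}
print(counts)  # {'A': 0, 'B': 20, 'C': 10}
```

Which applicant a college receives depends only on the random draw order, not on ability, so each college's open applicants remain a random sample from the pool. The number of open applicants a college receives, by contrast, depends on how undersubscribed it is.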

3.2.2 Application Profiles

Admissions at Oxford colleges are conducted by faculty, who are also researchers and teachers, in the

subject a student applies for (referred to as “admissions tutors”). Applicant ability Ai and college

effects cij are not perfectly observable to admissions tutors. Instead colleges observe an applicant’s

application profile (“UCAS form”) which includes both “hard characteristics” such as GCSE results, A-

level results and the results of Oxford-specific admission tests and “soft” characteristics such as school


reference letters and evidence of enthusiasm in the personal statement.12 The application profile does

not include whether an applicant was a direct applicant or an open applicant. Application profiles

can be thought of as a noisy signal of the ability of each applicant. Denote the characteristics of

applicant i seen by admission tutors as a 1 x K row vector $x_i = A_i - r_i$ where $r_i$ is a 1 x K row vector.

Each of the K elements in xi provides a signal about a component of ability Ai. For example, maths

GCSE result provides a signal of maths ability. Assume that each element of xi is an unbiased signal

for its equivalent element in Ai such that E(Ai|xi) = Ai. Also assume xi and cij are independent,

that is, application profile xi provides admissions tutors with no information about college effects cij

(This assumption is relaxed in some of the empirical work). Let X denote the support of x and let

Xj denote the support of the application profiles for students allocated to college j. Let ηj(x) be the

number of students allocated to college j with application profile x.

3.2.3 Enrolment Probabilities and Expected Exam Results

Let αj(x) denote the probability that a student with application profile x, upon being offered admission

at college j, eventually enrols. Let Yj(x) denote the expected exam result of an applicant with

application profile x who enrols at college j. This allows acceptance or rejection of an offer from

college j to provide extra information about the ability (and expected exam result) of an applicant.

Colleges need to condition on acceptance when making admissions decisions in order to make a

correct inference about the student’s ability because of an “acceptance curse”: the student might

accept college j’s admission because she is of low ability and is rejected by other universities (either

UK or foreign).

3.2.4 The College Admissions Problem

Define an admission protocol for college j as a probability pj : Xj → [0, 1] such that an applicant

allocated to college j with application profile x is offered admission at college j with probability

pj(x). Each college has a capacity constraint, Kj (the maximum number of students college j can

12Information on ethnicity and parental social class is also collected on the UCAS form but this information is not available to admissions tutors when they decide on admissions.


admit). College j thus chooses the set of $p_j(x) \in [0, 1]$ to maximise their objective function:

$$\max_{p_j(x)} \left\{ \sum_{x \in X_j} p_j(x)\, \alpha_j(x)\, \eta_j(x)\, Y_j(x) \right\} \tag{3}$$

subject to their capacity constraint:

$$\sum_{x \in X_j} p_j(x)\, \alpha_j(x)\, \eta_j(x) \le K_j. \tag{4}$$

This is almost identical to the university admissions decision problem studied by Bhattacharya

et al. (2014) (see also Fu (2014)). The college objective is to maximise total expected exam results

among the admitted applicants. It implicitly assumes “Fair Admissions” (Bhattacharya et al., 2014),

in the sense that it gives equal weight to the exam results of all applicants, regardless of pre-admission

characteristics. This assumption is plausible at Oxford because Oxford emphasises that applicants

are admitted strictly based on academic potential. Extra-curricular activities, such as sport and

charity work are given no weight unless they are related to academic potential. “Fair Admissions”

is consistent with the “Common Framework” which guides undergraduate admissions at Oxford:

“Admissions procedures in all subjects and in all colleges should [. . . ] ensure applicants are selected

for admission on the basis that they are well qualified and have the most potential to excel in their

chosen course of study” (Lankester et al., 2005).

The solution to college j’s admissions problem takes the form described below in Proposition 1,

which holds under Condition 1: admitting everyone with an expected exam result Yj(x) ≥ 0 will

exceed capacity in expectation (Bhattacharya et al., 2014).

Condition 1. $\alpha_j(x) > 0$ for any $x \in X_j$ and for some $\delta > 0$ we have

$$\sum_{x \in X_j} \alpha_j(x)\, \eta_j(x)\, 1\{Y_j(x) \ge 0\} \ge K_j + \delta.$$

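Proposition 1. Under Condition 1 the solution to college j’s admissions problem is:

$$p_j^{OPT}(x) = \begin{cases} 1 & \text{if } Y_j(x) \ge z_j \\ 0 & \text{if } Y_j(x) < z_j \end{cases}$$

where

$$z_j = \min\left\{ r : \sum_{x \in X_j} \alpha_j(x)\, \eta_j(x)\, 1\{Y_j(x) \ge r\} \le K_j \right\}$$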

Proof in Appendix.

The model shows that college j uses a cut-off rule (admission threshold). The result is intuitive.

Colleges first rank applicants by their expected exam results (conditional on acceptance). Colleges

then admit applicants whose expected exam results are the largest, followed by those for whom it is

the next largest and so on till all places are filled. An admissions policy for the ranked groups {pj(x)}

takes the form {1, . . . , 1, 0, . . . , 0}. Since ability is continuously distributed and x is an unbiased signal,

x is also continuously distributed. Hence there are no point masses in the distribution of Yj(x) and

there is no need to account for ties.
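
The cut-off rule can be illustrated with a small Python sketch for a college facing a finite number of applicant types. The function name and the numbers are hypothetical. As a simplifying assumption, the search for the threshold is restricted to the expected exam results actually observed among applicant types, and ties are ruled out as in the text.

```python
def admission_cutoff(groups, capacity):
    """Return (z_j, expected enrolment) for a stylised college.

    groups: list of (expected_result, enrol_prob, count) tuples, one per
            application profile x, i.e. (Y_j(x), alpha_j(x), eta_j(x)).
    capacity: K_j, the maximum expected number of enrolled students.
    """
    # Rank application profiles by expected exam result, best first.
    ranked = sorted(groups, key=lambda g: g[0], reverse=True)
    expected_enrolment = 0.0
    cutoff = None  # stays None if even the top group exceeds capacity
    for expected_result, enrol_prob, count in ranked:
        # Only applicants who accept their offer take up capacity.
        extra = enrol_prob * count
        if expected_enrolment + extra > capacity:
            break
        expected_enrolment += extra
        cutoff = expected_result
    return cutoff, expected_enrolment


# Hypothetical applicant types: (Y_j(x), alpha_j(x), eta_j(x)).
groups = [(70.0, 0.9, 10), (65.0, 0.8, 10), (60.0, 0.5, 10), (55.0, 1.0, 10)]
z_j, enrolled = admission_cutoff(groups, capacity=20)
print(z_j, enrolled)  # 65.0 17.0
```

In this example, admitting the third group would raise expected enrolment to 22, above the capacity of 20, so the threshold is the second group's expected result. Lower acceptance probabilities free up capacity and therefore lower the cut-off, which is the only channel through which αj(x) enters the rule.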

As noted by Bhattacharya et al. (2014), the probability of a student enrolling having received an

offer from college j affects the admission rule only through its impact on the cut-off; the intuition is

that individuals who do not accept an offer of admission do not take up any capacity and this is taken

into account in the admission process. Also note that the assumptions imply, perhaps unrealistically,

no role for risk in admissions decisions.

The Fair Admissions assumption implies that student characteristics influence the admission process

only through their effect on expected exam results. The same cut-off zj is used for open and direct

applicants - there is no discrimination against open/direct applicants (or any demographic group).

Discrimination would occur if colleges had a higher cut-off for open applicants than direct applicants

as this would imply that a direct applicant with the same expected exam result as an open applicant

is more likely to be admitted. Equal cut-offs for open and direct applicants are plausible because,

as noted above, colleges are not provided with any information about whether an applicant applied

directly or was an open applicant.

The solution is illustrated in Figure 1 for the case where applicant ability is fully observed by

admissions tutors: xi = Ai (ri = 0 for all i).13

13This model is a highly stylised model of admissions. For simplicity, it ignores a number of features of the admissions process. Oxford admissions actually involve multiple stages. In the first stage colleges choose which applicants to “short-list” and “deselect” and which applicants to “reserve”. Deselected applicants are rejected. Short-listed and reserved applicants are given interviews at the college they were allocated. Shortlisted but unreserved applicants may be reallocated to another college for interview. After first interviews colleges make some admissions decisions about which applicants to accept. However, a small number of applicants are given second interviews. Second interviews provide applicants not selected by their first college the chance to be accepted by another college (known as “pooling”). It should also be noted that application procedures vary slightly between courses. Capturing all these points would involve a more complex dynamic game played between colleges. Nevertheless, my empirical work relies only on the


Figure 1: Applicant Ability and College Admissions Decisions

[Figure: distributions of direct applicant ability $A_j^D \sim N(\mu_j^D, 1)$ and open applicant ability $A_j^O \sim N(0, 1)$ plotted against ability (horizontal axis from −4 to 4), with the cut-off $z_j$ and the mean $\mu_j^D$ marked on the axis and the admitted proportions $p_j^D$ and $p_j^O$ labelled.]

Figure 1 shows how colleges would make admissions decisions if ability was fully observable (i.e. $A_i = x_i$). Direct applicant ability to college j is distributed $A_j^D \sim N(\mu_j^D, 1)$. The graph is drawn such that $\mu_j^D = 0.5$. Open applicant ability to college j is distributed $A_j^O \sim N(0, 1)$. $z_j$ is the cut-off (admissions threshold). All students with ability above the cut-off (the shaded area) are admitted. The distribution of ability for successful open applicants to college j follows a truncated normal distribution and similarly for successful direct applicants. A proportion $p_j^D$ of direct applicants and a proportion $p_j^O$ of open applicants are accepted.
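
The quantities in Figure 1 can be computed directly from the normal distribution. The sketch below is a minimal illustration using only the Python standard library. The mean of 0.5 for direct applicants is the value used in the figure, while the cut-off of 1.0 and the function names are hypothetical choices for illustration.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def admitted_share(mu, z):
    """Share of applicants with ability ~ N(mu, 1) above the cut-off z."""
    return 1.0 - norm_cdf(z - mu)


def admitted_mean(mu, z):
    """Mean ability of admitted applicants (truncated normal mean)."""
    return mu + norm_pdf(z - mu) / (1.0 - norm_cdf(z - mu))


mu_direct = 0.5  # as drawn in Figure 1
z_j = 1.0        # hypothetical cut-off, chosen for illustration

p_direct = admitted_share(mu_direct, z_j)
p_open = admitted_share(0.0, z_j)
print(round(p_direct, 3), round(p_open, 3))  # 0.309 0.159
print(admitted_mean(mu_direct, z_j) > admitted_mean(0.0, z_j))  # True
```

With a common cut-off, the group with the higher mean ability has both a higher admission rate and a higher mean ability among those admitted. This link between admission rates and the cut-off is what the third empirical strategy exploits.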

With this admissions model in mind, the goal is to estimate the college effects cij . I consider three

different empirical models. First, as a simple baseline, I consider differences in mean exam results

between colleges in the spirit of the Norrington table. Second, I use a “selection on observables”

strategy that attempts to estimate college effects by conditioning on almost all the information

available to admissions tutors in the student’s application profile. Third, I take advantage of the

random assignment of open applicants and estimate the thresholds zj for each college. I then use

these threshold estimates together with the assumptions of the theoretical model to obtain estimates

of college effects. The next section explains these strategies in detail.

result that colleges use a cut-off rule and that the cut-off is equal across all applicants. This result would continue to hold if, for example, (i) no new information about applicant ability was revealed at interview, (ii) colleges could correctly predict the admissions decisions of other colleges and (iii) the reallocation of rejected applicants was known in advance by the colleges.


4 Econometric Models

The econometric models in this section must acknowledge some objects in the theory model are un-

observable. First, exam results for applicants who do not attend Oxford are not observed. Second,

even for the applicants who enrol at Oxford, at most one potential exam result per student is ob-

servable (the potential exam result from the college they actually attend). This is the “fundamental

problem of causal inference” (Holland, 1986). With a slight abuse of notation I denote observed exam

results of student i at college j as Yij for i = 1, ..., n. Third, not all the information in an applicant’s

application profile is observable. Decompose the information in application profiles into two parts:

x = (x1, x2) where x1 and x2 are 1 x G and 1 x (K - G) row vectors (with K > G and remembering

x is 1 x K). “Hard” information x1 is assumed observable to admissions tutors and researchers. “Soft”

information x2 is assumed observable to admissions tutors but not researchers.

The aim is to identify college effects given the available data. All three empirical strategies take

the potential exam results function (1) specified in section 3 and assume observed exam results take

the linear form:

Yij = λ0 + λ1Ai + cij + eij (5)

where λ0 and λ1 are K x 1 column vectors that map ability onto potential exam results and all

elements of λ1 are strictly positive. I can now decompose Ai into x1i, x2i and ri and rewrite (5) as:

Yij = λ0 + λ11x1i + λ12x2i + cij + λ1ri + eij (6)

where λ11 is a G x 1 column vector of the first G elements of λ1 and λ12 is a (K - G) x 1 column vector

of the last K - G elements of λ1. Student ability unobserved even by admissions tutors is captured

by ri.

4.1 Model 1 – Norrington Table

The first empirical strategy is to estimate college effects using a student-level fixed effects regression

with no control variables for observable or unobservable ability. That is, Model 1 estimates for


enrolled students:

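$$Y_{ij} = \lambda_0 + \sum_{j=1}^{J-1} \beta_j C_j + v_{ij} \quad \forall\, i = 1, \ldots, n \tag{7}$$

where $v_{ij} = \sum_{j=1}^{J-1} (\beta_{ij} - \beta_j) C_j + \lambda_1 A_i + e_{ij}$, $C_j$ is a dummy variable denoting enrolment at college j, $\beta_{ij}$ is a college fixed effect coefficient which may differ across i and $\beta_j = \frac{1}{n} \sum_{i=1}^{n} \beta_{ij}$ is the average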

over students of the college fixed effects. College J is the reference college. Model 1 can be estimated

by regressing exam results on a set of college dummy variables. The fixed effect coefficients βj are

the objects of interest; they give mean differences in exam results relative to the baseline college.

Model 1 is thus similar in spirit to the Norrington table.14
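
That a regression on college dummies alone reproduces mean differences can be checked with a minimal Python sketch. The exam marks below are invented for illustration, college 3 plays the role of the reference college, and the small `ols` helper is a hypothetical stand-in (solving the normal equations with the standard library) for whatever regression routine is used in practice.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    M = [
        [sum(row[a] * row[b] for row in X) for b in range(k)]
        + [sum(row[a] * yi for row, yi in zip(X, y))]
        for a in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [M[r][k] / M[r][r] for r in range(k)]


# Hypothetical exam marks by college; college 3 is the reference college.
results = {1: [62.0, 66.0, 70.0], 2: [58.0, 60.0], 3: [60.0, 64.0, 65.0]}

X, y = [], []
for college, marks in results.items():
    for mark in marks:
        # Columns: constant, dummy for college 1, dummy for college 2.
        X.append([1.0, float(college == 1), float(college == 2)])
        y.append(mark)

lambda0, beta1, beta2 = ols(X, y)
means = {c: sum(m) / len(m) for c, m in results.items()}
print(round(lambda0, 6), round(beta1, 6), round(beta2, 6))
print(means[3], means[1] - means[3], means[2] - means[3])
```

The intercept equals the reference college's mean mark, and each dummy coefficient equals the gap between that college's mean and the reference mean, with no adjustment for differences in student ability.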

The most important problem with Model 1 (and the Norrington table) is selection bias. Selection

bias prevents us from interpreting the fixed effect coefficient estimates as causal effects. Randomised

experiments are the gold standard for estimating causal effects and imagining a hypothetical random-

ised experiment helps to conceptualise the selection bias problem. Consider a two stage admissions

process. In stage 1 it is decided which students will attend Oxford. In stage 2 admitted students are

randomly assigned to colleges. In this ideal scenario, college assignment is independent of student

ability among the population of enrolled students, so the simple mean difference in observed exam

results gives an unbiased estimate of differences in college effects for students attending Oxford.

Unfortunately for researchers selection into colleges is non-random in ways that are correlated

with exam results. Students and admission tutors deliberately and systematically select who enrols.

At the application stage, students choose where to apply to. At the admissions stage, admission

tutors take decisions to accept some students and not others. There could also be selection at

the enrolment stage (in practice, very few students reject offers from Oxford colleges). The selection

bias problem makes it difficult to attribute student exam results to the effect of the college attended

separately from the effect of preexisting student ability.

Formally, since we have assumed λ11 ≠ 0 and λ12 ≠ 0, selection bias occurs if:

14Model 1 does differ from the Norrington table in some ways. For instance, the Norrington table does not take account of differences across courses (getting a First in E&M may be easier or more difficult than getting a First in Law). As I explain in section 5 below, I standardise exam results by course and year which mitigates this problem.


$$E\left[ \sum_{j=1}^{J-1} (\beta_{ij} - \beta_j) C_j + \lambda_1 A_i + e_{ij} \mid c_{ij} \right] \ne 0.$$

Model 1 embodies two types of non-random selection into colleges. First, selection on the het-

erogeneous college effect βij . This occurs if individuals differ in their potential exam results, holding

ability Ai constant, and if they choose a college (or colleges choose them) in part on that basis.15

Selection on heterogeneous college effects captures the intuition that students and colleges are looking

for a good “match”. The economics of the problem suggest students will tend to apply to colleges that

are relatively good at boosting their exam results - a form of selection bias that bears similarities to

Roy’s model of occupational choice (Roy, 1951). Similarly colleges will tend to make offers to students

who tend to benefit more than average from the college’s teaching. Students enrolled at college j

may thus have higher expected exam results from attending college j than the average student. This

biases college fixed effect coefficients and it would not be appropriate to interpret such estimates

as causal effects for the average student enrolled at Oxford (though college effect estimates biased in

this way may still be of interest).

Second, selection on ability Ai. Determinants of exam results may be correlated with college

enrolment even if college effects are constant across students (βij = βj for all i). This occurs if

individuals choose colleges or colleges choose students in ways correlated with prior ability. Rational

applicants will choose to apply to the college that maximises their expected utility. Expected utility

is likely to depend on a number of factors including the perceived probability of receiving an offer

from each college, risk aversion, the value of their outside option if they did not attend Oxford

and preferences over college characteristics (including college effectiveness and other characteristics

contributing towards consumption benefits). Observable and unobservable ability are likely to impact

the college a student applies to. Furthermore college admissions decisions are based on student ability.

Positive selection seems likely, though not inevitable, with students of higher ability tending to go

to more effective colleges. In the presence of such selection, estimates of the college fixed effect

coefficients will be biased in favour of colleges with higher ability students.

15This assumes students and tutors have an idea of their own student/college-specific coefficient.


Selection bias causes three problems. First, as discussed, college effectiveness estimates are biased.

Second, the importance of variation in college effectiveness in determining exam results could be

exaggerated. The total effect of colleges on student exam results could be overstated because some of

the omitted ability will be included in the portion of the variance in student exam results explained by

college effects.16 Third, bias would lead to errors in supplementary analyses that aim to identify the

characteristics of effective colleges. Selection bias implies Model 1 is best used as a basis for comparison

with other models that control for observables and unobservables.
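
A small simulation illustrates how selection on ability biases Model 1. The sketch below is a stylised example under assumptions of my own choosing: there are two colleges with identical (zero) college effects, ability is standard normal, and higher ability students are more likely to enrol at college 1. None of the parameter values come from the Oxford data.

```python
import random

rng = random.Random(42)
n = 20000  # hypothetical number of enrolled students

results = {0: [], 1: []}
for _ in range(n):
    ability = rng.gauss(0.0, 1.0)
    taste = rng.gauss(0.0, 1.0)  # other reasons for choosing a college
    # Sorting on ability: abler students are more likely to be at college 1.
    college = 1 if ability + taste > 0 else 0
    # Exam result depends on ability and noise only: both colleges have
    # the same (zero) college effect by construction.
    exam = ability + rng.gauss(0.0, 0.5)
    results[college].append(exam)

mean_0 = sum(results[0]) / len(results[0])
mean_1 = sum(results[1]) / len(results[1])
naive_gap = mean_1 - mean_0  # what Model 1 would report for college 1

print(round(naive_gap, 2))  # well above the true college effect of zero
```

The difference in mean exam results is large even though neither college adds anything, because it reflects the ability of the students each college enrols rather than what the college does with them.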

4.2 Model 2 – Selection on Observables

The second empirical strategy is to estimate college effects using a conditional OLS regression. Model

2 estimates for enrolled students:

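$$Y_{ij} = \lambda_0 + \lambda_{11} x_{1i} + \sum_{j=1}^{J-1} \beta_j C_j + v_{ij} \quad \forall\, i = 1, \ldots, n \tag{8}$$

where now $v_{ij} = \sum_{j=1}^{J-1} (\beta_{ij} - \beta_j) C_j + \lambda_{12} x_{2i} + \lambda_1 r_i + e_{ij}$. The difference between Model 1 and Model 2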

is that now observable parts of application profiles x1i are included in the regression. The objects of

interest are the college fixed effect coefficients: βj . In an ideal scenario, we could interpret estimated

coefficients as estimates of the average causal effect relative to the reference college for students

attending Oxford. However, such a causal interpretation requires three further assumptions. I start

with two that are relatively unproblematic.

Assumption 3. “Interval scale metric”. The metric of Yij is interval scaled.

Assumption 3 says that the units of the exam result distribution are on an interval scale (Ballou,

2009; Reardon and Raudenbush, 2009). Interval scales are numeric scales in which we know not only

the order, but also the exact differences between the values. Here the assumption says equal sized

gains at all points on the exam result scale are valued equally. A college that produces two students

with scores of 65 is considered equally as effective as a college producing one with a 50 and another

16The effect of the bias on variation in college quality would depend on the direction of the bias. The text here presumes the likely scenario with positive selection bias – i.e., where more effective colleges are assigned students with higher expected exam results.


with 80. In comparing mean values of exam results, I implicitly treat exam results as interval-scaled

(the mean has no meaning in a non-interval-scaled metric). If exam results are not interval scaled

then the college effect results will depend on arbitrary scaling decisions.17 However, it is unclear

how to determine whether exam results are interval scaled because there is often no clear reference

metric for cognitive skill (Reardon and Raudenbush, 2009). At a practical level, the importance of

this assumption comes down to the sensitivity of college effects estimates and college rankings to

different transformations of exam results. Prior evidence on this point is reassuring: Papay (2011)

finds test scaling affects teacher rankings only minimally with correlations between teacher effects

using raw and scaled scores exceeding 0.98.18 I proceed as if exam results are interval scaled and in

section 6 test the robustness of my results to various monotonic transformations of the exam results

distribution.

Assumption 4. “Common Support or Functional form”. Either (i) there is adequate observed data in

each college to estimate the distribution of potential exam results for students of all types (“Common

Support”) or (ii) the functional form of Model 2 correctly specifies potential exam results even for

types of students who are not present in a given college (“Functional Form”).

Either “Common Support” or “Functional Form” must hold for college effects to be identified.

The common support assumption is violated if not all colleges contain students with any given set

of characteristics. For instance, if not all colleges have students at all ability levels (or not sufficient

numbers at all levels to provide precise estimates of mean exam results at each ability level), then

the common support assumption will fail. In this case we have identification via functional form -

the model extrapolates from regions with data into regions without data by relying on the estimated

parameters of the specified functional form. If the functional form is also wrong, then regression

estimators will be sensitive to differences in the ability distributions for different colleges. However,

if the distributions of ability are similar across colleges the precise functional form used will not

matter much for estimation (Imbens, 2004). The common support assumption has been questioned

17This assumption could be relaxed by adopting a non-parametric approach (and comparing, for example, quantiles rather than means) but this would require a very large sample size for accurate estimation.

18If two colleges have similar students initially, but one produces students with better exam results, it will have a higher measured college effect regardless of the scale chosen. Similarly, if they produce the same exam results, but one began with weaker students, the ranking of the colleges will not depend on the scale.


for schools because student covariates differ significantly across schools. However, the distribution of

ability is likely to be much more similar across Oxford colleges, partly because of student reallocation

across colleges during the admission process.

We now come to the most significant problem in estimating college effects: how to deal with

selection bias. I make the following two-part “selection on observables” assumption, which allows

consistent estimation of college effects:

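Assumption 5. “Selection on Observables”

(i) $E\left[ \sum_{j=1}^{J-1} (\beta_{ij} - \beta_j) C_j \mid C_j, x_{1i} \right] = 0 \quad \forall\, i = 1, \ldots, n$

(ii) $E\left[ \lambda_{12} x_{2i} + \lambda_1 r_i + e_{ij} \mid C_j, x_{1i} \right] = 0 \quad \forall\, i = 1, \ldots, n$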

The selection on observables assumption follows Barnow et al. (1981), who observed in a regression

setting that unbiasedness is attainable only when the variables driving selection are

known, quantified and included in x1.19 Together parts (i) and (ii) imply that potential exam results

are independent of college assignment, given x1.

Part (i) requires the heterogeneous part of college effects to be mean independent of college

enrolment conditional on x1i and Cj . This assumption is similar to, but slightly weaker than, college

effects being the same for every student. It implies there is no interaction of college effects with

student characteristics in x1i. As noted above, if individuals differ in their college effects, and they

know this, they ought to act on it, even conditional on ability. Thus this assumption relies on

students and tutors being unaware of college effects.20 In the empirical work, I test this assumption

by allowing the college effect coefficients to vary with some elements of x1i.

Part (ii) says the observable control variables x1i are sufficiently rich that the remaining variation

in college enrolment that serves to identify college effects is uncorrelated with the error term in

equation (8). This requires two things. First, the observable control variables in x1i must capture,

either directly or as proxies, all the factors that affect both the college enrolment and exam results.

Second, there must exist variables not included in the model that vary college enrolment in ways

unrelated to the unobserved component of exam results (i.e. instrumental variables must exist, even

19Non-parametric versions of this assumption are variously known as the “conditional independence assumption” (Lechner, 2001) and “unconfoundedness” (Rosenbaum and Rubin, 1983). These are also closely related to “strongly ignorable assignment” (Rosenbaum and Rubin, 1983).

20If college effects were obvious to everyone then there would be no need for this thesis!


though we do not observe them, as they produce the conditional variation in college enrolment used

implicitly in the estimation). Intuitively, the aim is to compare two otherwise identical students but

who went to different colleges for a reason completely unrelated to their exam results. Practically, I

would like to measure and condition on any characteristic whose influence on exam results might be

confounded with that of college enrolment due to non-random sorting into different colleges.
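
The logic of part (ii) can be illustrated with a simulation in which the assumption holds by construction. In the hypothetical sketch below, enrolment at college 1 depends on an observed characteristic x1 and on a taste shock unrelated to exam results, and college 1 has a true effect of 0.3. All parameter values and the small `ols` helper (a standard-library stand-in for a regression routine) are assumptions made for illustration.

```python
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    M = [
        [sum(row[a] * row[b] for row in X) for b in range(k)]
        + [sum(row[a] * yi for row, yi in zip(X, y))]
        for a in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [M[r][k] / M[r][r] for r in range(k)]


rng = random.Random(7)
n, true_beta = 5000, 0.3  # hypothetical sample size and college effect

X, y, by_college = [], [], {0: [], 1: []}
for _ in range(n):
    x1 = rng.gauss(0.0, 1.0)     # observed part of the application profile
    taste = rng.gauss(0.0, 1.0)  # e.g. architecture; unrelated to exams
    c = 1.0 if x1 + taste > 0 else 0.0
    exam = 1.0 * x1 + true_beta * c + rng.gauss(0.0, 0.5)
    X.append([1.0, x1, c])
    y.append(exam)
    by_college[int(c)].append(exam)

# Model 1: raw difference in mean exam results.
naive_gap = (sum(by_college[1]) / len(by_college[1])
             - sum(by_college[0]) / len(by_college[0]))
# Model 2: college dummy coefficient after conditioning on x1.
_, _, beta_hat = ols(X, y)

print(round(naive_gap, 2), round(beta_hat, 2))
```

The raw gap is far above 0.3 because it picks up sorting on x1, whereas the conditional estimate is close to the true effect. The taste shock here plays the role of the unobserved source of exogenous variation in college enrolment; if it were correlated with exam results the conditional estimate would be biased as well.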

I am aware that the selection on observables assumption is somewhat heroic. Unobservable ability

could cause it to be violated. For instance, students with very high unobservable ability x2i (including

excellent school references and personal statements) may be close to certain of receiving an offer from

whichever college they apply to and thus may tend to apply to colleges with larger college effects.

Alternatively, more “academically motivated” students may be more likely to apply to colleges

that improve exam results than to colleges that provide large consumption benefits. If students do select

into colleges based on unobservable ability correlated with exam results conditional on observed

characteristics then selection bias results.

Nevertheless, the selection on observables assumption can be justified in a number of ways. First,

the extensive dataset allows me to condition on almost all information available to college admission

tutors when they are selecting students as well as some information not seen by admissions tutors.

Furthermore, there is evidence that the information available to admissions tutors but unavailable

to researchers, the personal statement and school reference, are relatively unimportant in admission

decisions. In the personal statement, students describe the ambitions, skills and experience that

make them suitable for the course (e.g. previous work experience, books students have read and

essay competitions they have entered). However, Oxford admissions are strictly academic so this

only impacts admissions decisions if it is linked to academic potential. The absence of the school

reference is also perhaps of limited significance because, as noted by Bhattacharya et al. (2014),

school references tend to be somewhat generic and within-school ranks are typically unavailable

to admission tutors. This is supported by survey evidence. Bhattacharya et al. (2014) conduct an

anonymised online survey of PPE admissions tutors in Oxford asking how much weight they attach during

admissions to covariates with "1" representing no weight and "5" denoting maximum weight. The

results, based on 52 responses, found that the personal statement and school reference were given


the lowest weights.21

Second, two students with the same values for observed characteristics may go to different colleges

without invalidating the selection on observables assumption if the difference in their colleges is driven

by differences in unobserved characteristics that are themselves unrelated to exam results. There are

plenty of potential sources of exogenous variation in college allocations conditional on observables.

For instance, students might care about factors other than the ability of colleges to boost exam

results. Observation indicates that many applicants explicitly choose among colleges, at least at

the margin, for reasons unlikely to be strongly related to exam results. Application decisions may

reflect preferences over college location, architecture, accommodation, facilities and size. These

preferences may not be strongly linked to ability to perform well in exams. Indeed selection based on

preferences over college characteristics is actively encouraged by the University - the Oxford website

recommends students choose colleges based on these non-academic considerations. Alternatively

applicants might be incapable of discerning the size of college effects. While this would not normally

be a comforting thought, it aids the selection on observables assumption. Evidence from university

admissions supports this point. Scott-Clayton (2012) reviews the literature on university admissions

and concludes applicants and parents often know very little about the likely costs and benefits of

university. For instance, small behavioural economics tricks such as whether or not a scholarship has

a formal name and a tiny change in the cost of sending standardised test scores to universities have

been shown to have non-trivial effects on university applications inconsistent with rational choice

(Avery and Hoxby, 2004; Pallais, 2013). The school choice literature also provides evidence that

students and parents do not select schools according to expectations about future test scores - the

typical voucher program does nothing to improve test scores (Epple et al., 2015). Such exogenous

variation is perhaps even more likely in the context of Oxford colleges because Oxford deemphasises

the importance of college choice, stressing all colleges are similar academically and that the primary

factor when choosing a college should be consumption benefits, not exam results.

Four final points about Model 2 should be noted. First, since I have multiple cohorts

of students, I pool students across cohorts for each college. Evaluating colleges over multiple years

reduces the selection bias problem (Koedel and Betts, 2011), increases students per college thus

reducing average standard errors (McCaffrey et al., 2009) and increases the predictive value of past

college effects over future college effects (Goldhaber and Hansen, 2013). In pooling across cohorts,

I assume that college effects are fixed over time and thus place equal weight on exam results in all

years.22

21 A-levels appeared to be the most important criterion, followed by the admissions tests and interview scores and then GCSE performance. The choice of subjects at A-level was given a medium weight.

Second, I allow for heteroskedastic measurement error in exam results by estimating

heteroskedasticity-robust standard errors. Exam results measure latent achievement with error because of (i)

the limited number of questions on exams, (ii) the imperfect information provided by each question,

(iii) maximum and minimum marks, (iv) subjective marking of exams and (v) individual issues such

as exam anxiety or on-the-day illness (Boyd et al., 2013). Numerous studies find test score

measurement error is larger at the extremes of the distribution (Koedel et al., 2012). The intuition is

exams are well-designed to assess student learning for “targeted” students (near the centre of the

distribution), but not for students whose level of knowledge is not well-aligned with the content

of the exam (in the tails of the distribution). Ignoring heteroskedastic measurement error in the

dependent variable would lead to biased inference. In addition, ignoring measurement error in the

control variables would bias college effect estimates. However, I control for multiple prior test scores

(A-levels, GCSEs, IB, multiple admissions tests and interview scores) which has been shown to help

mitigate the problem (Lockwood and McCaffrey, 2014).

Third, I treat college effects as fixed effects rather than random effects. Whilst random effects

models are more efficient than fixed effects models, economists have conventionally avoided random

effect approaches (Clarke et al., 2010). This is because their use comes at the cost of an important

additional assumption - that college effectiveness is uncorrelated with the student characteristics

that predict exam results. This “random effects assumption” would fail, for example, if more effective

colleges attracted high ability students measured by prior test scores. Random effect estimators

would be inconsistent for fixed college sizes as the number of colleges grows.23 By contrast, fixed

22 As the number of cohorts grows, “drift” in college performance may put downward pressure on the predictive power of older college effect estimates. Thus if predicting future college effects is the main aim (relevant for prospective applicants to Oxford) then it may be best to down-weight older data (Chetty et al., 2013a). However, my main aim is to gauge the importance of college effectiveness and thus do not account for drift.

23 The bias (technically, the inconsistency) disappears as the number of students per college increases - because the random effect estimates converge to fixed effect estimates. However, the bias still can be important in finite samples.

effect estimators will still be consistent for fixed college sizes as the number of colleges grows. Guarino

et al. (2015) find that under non-random assignment, random effect estimates can suffer from severe

bias and underestimate the magnitudes of college effects. They conclude fixed effect estimators should

be preferred in this situation and I follow their advice and specify college effects as fixed effects. In

section 6, I perform Hausman tests (robust to heteroskedasticity) and the results broadly support

this choice.

Fourth, I do not apply shrinkage to my college effect estimates. Estimates can be noisy when

there are only a small number of students per college. This means colleges with very few students

could be more likely to end up in the extremes of the distribution (Kane and Staiger, 2002). Shrinkage

is often used as a way to make imprecise estimates more reliable by shrinking them toward the

average estimated college effect in the sample (a Bayesian prior). As the degree of shrinkage depends

on the number of students per college, estimates for colleges with fewer students are more affected,

potentially helping with the misclassification of these colleges. The cost of shrinkage is that the

weight on the prior introduces a bias in estimates of college effects. Shrinkage can be applied to

both random and fixed effects models (so shrinkage is not a reason to favour random effect models

as is sometimes suggested). Despite the promise of shrinkage, two studies use simulations to show

shrinkage does not itself substantially boost performance (Guarino et al., 2015; Herrmann et al.,

2013). Fixed effect models without shrinkage tend to perform well in simulations and should be the

preferred estimator when there is a possibility of non-random assignment.
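
Although shrinkage is not applied in this thesis, the mechanics described above can be illustrated with a minimal Python sketch of one common empirical Bayes form. The function name, the variance inputs and the approximation of the sampling variance by noise_var / n_students are illustrative assumptions, not quantities estimated here.

```python
def shrink_estimate(estimate, grand_mean, signal_var, noise_var, n_students):
    """Pull a noisy college effect estimate toward the grand mean (illustrative only).

    Assumes the sampling variance of the estimate is noise_var / n_students,
    so colleges with fewer students place less weight on their own estimate.
    """
    sampling_var = noise_var / n_students
    weight = signal_var / (signal_var + sampling_var)  # lies between 0 and 1
    return grand_mean + weight * (estimate - grand_mean)


# Hypothetical inputs: the same raw estimate for a small and a large college.
small = shrink_estimate(0.4, 0.0, signal_var=0.01, noise_var=1.0, n_students=10)
large = shrink_estimate(0.4, 0.0, signal_var=0.01, noise_var=1.0, n_students=100)
print(small, large)  # the small college is shrunk further toward the mean
```

The weight on the prior (one minus `weight`) is what introduces the bias discussed above: the shrunken estimate is less noisy but is no longer centred on the college's own estimate.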

Even though I avoid having to make the random effects assumption, there is still a danger that

the selection on observables assumption is violated. As a result I now move on to Model 3 which can

more effectively deal with unobservables.

4.3 Model 3 – Selection on Observables and Unobservables

In this subsection I use a novel procedure to estimate college effects and account for both selection

on observables and unobservables. To do this, I take the theory model of section 3 as a starting

point and assume the ability Ai is a scalar (with multiple sources of ability, Ai can be interpreted

as a composite scalar index, i.e. a weighted average). When ability Ai is a scalar, I can estimate the


admission thresholds zj for each college. Admission thresholds can be consistently estimated because

open applicants are randomly allocated to colleges. I then use these threshold estimates and the

linear functional form assumption (5) to obtain estimates for Ai and λ1. Colleges with high admissions

thresholds tend to have high ability entrants. This allows me to obtain college effect estimates. I

now explain this procedure in more detail.

First, remember in the theory model of section 3, the ability of open applicants to Oxford was

distributed N(0, 1). The key to identification is that open applicants are randomly allocated, by

the Undergraduate Admissions Office, to colleges. Intuitively, the random allocation means that all

colleges receive open applicants with equal ability on average. If a college accepts a large proportion

of open applicants, this suggests that their cut-off zj is low and their entrants have relatively low

ability. On the other hand, if a college accepts a small proportion of open applicants then we expect

their cut-off to be high and their entrants to be of relatively high ability. Formally, the ability of open

applicants allocated to college j is also distributed N(0,1). This means we can consistently estimate

the true cut-off zj at college j using the estimator:

$$\hat{z}_j = \Phi^{-1}\left(1 - p^O_j\right) \tag{9}$$

where $\Phi$ is the standard normal cdf and $p^O_j$ is the proportion of open applicants allocated to

college j who are offered a place at college j ($p^O_j$ is the area in the upper tail of the standard normal

distribution). When $p^O_j$ is large, $\hat{z}_j$ is small and vice versa. In an infinite sample we could determine

the cut-off value $z_j$ exactly. However, colleges are assigned a finite number of open applicants so we

estimate $z_j$ using $\hat{z}_j$. As a simple example, consider the case where a college accepted 5% of the

open applicants it was allocated by the Undergraduate Admissions Office. Hence $p^O_j = 0.05$ and

the admissions threshold is estimated to be $\hat{z}_j = 1.645$. Since college j uses the same admissions

threshold for both open and direct applicants, we expect applicants with ability $A_i \geq 1.645$ to be

accepted and applicants with ability $A_i < 1.645$ to be rejected.
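
The 5% example can be reproduced with a minimal Python sketch of equation (9), using only the standard library. The function name estimate_cutoff is an illustrative choice, not part of the thesis.

```python
from statistics import NormalDist


def estimate_cutoff(offer_rate):
    """Estimated admission threshold implied by the share of open applicants given offers."""
    if not 0.0 < offer_rate < 1.0:
        raise ValueError("offer_rate must lie strictly between 0 and 1")
    # Inverse standard normal cdf evaluated at 1 - p, as in equation (9).
    return NormalDist().inv_cdf(1.0 - offer_rate)


z_hat = estimate_cutoff(0.05)
print(round(z_hat, 3))  # 1.645
```

A higher offer rate maps to a lower estimated cut-off; an offer rate of 50% would imply a cut-off of zero, the mean ability of open applicants.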

Second, note again the ability of open applicants sent to college j is distributed N(0, 1), the ability

of direct applicants to college j is distributed N(µDj , 1) and each college makes offers to students with

expected exam results above their cut-off. Together these three statements imply the distribution of


ability for successful open applicants to college j follows a truncated normal distribution and similarly

for successful direct applicants. The truncations have the same cut-off point zj but the mean of the

truncated normal distributions may differ. This is shown in Figure 1.

Now consider an equation analogous to (9) but this time for direct applicants: $z^D_j = \Phi^{-1}(1 - p^D_j)$

where $p^D_j$ is the proportion of direct applicants to college j who are offered a place at

college j. I refer to $z^D_j$ as the standardised cut-off for the ability of direct applicants.

Together (i) the true cut-off zj , (ii) the standardised cut-off for the ability of direct applicants zDj

and (iii) the assumption that the standard deviation of ability for direct applicants is equal to the

standard deviation of the ability of open applicants: σD = σO = 1, give the mean ability of direct

applicants to college j µDj through the equation:

$$z^D_j = \frac{z_j - \mu^D_j}{\sigma_D} \iff \mu^D_j = z_j - \sigma_D z^D_j = z_j - z^D_j$$

Since zj and zDj are unobservable, I use the estimator:

$$\hat{\mu}^D_j = \hat{z}_j - \hat{z}^D_j \tag{10}$$
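
As an illustration of equation (10), the following Python sketch computes the estimated mean ability of direct applicants from the two offer rates. The function name and the offer rates in the example are hypothetical and are not taken from the data.

```python
from statistics import NormalDist


def direct_applicant_mean(open_offer_rate, direct_offer_rate):
    """Estimated mean ability of direct applicants to a college, following equation (10)."""
    std_normal = NormalDist()
    z_hat = std_normal.inv_cdf(1.0 - open_offer_rate)       # cut-off from open applicants
    z_d_hat = std_normal.inv_cdf(1.0 - direct_offer_rate)   # standardised cut-off, direct applicants
    return z_hat - z_d_hat


# Hypothetical college: offers to 5% of open applicants and 10% of direct applicants.
mu_d_hat = direct_applicant_mean(0.05, 0.10)
print(round(mu_d_hat, 3))
```

Because both groups face the same cut-off, a higher offer rate among direct applicants implies a positive estimated mean, that is, direct applicants to this college who are stronger on average than open applicants.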

Using the standard result for the mean of a truncated normal distribution gives an estimator for the

average ability of open and direct applicants given offers by college j:

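$$\hat{E}(A^O_j \mid A^O_j > z_j) = \frac{\phi(\hat{z}_j)}{1 - \Phi(\hat{z}_j)}; \qquad \hat{E}(A^D_j \mid A^D_j > z_j) = \hat{\mu}^D_j + \frac{\phi(\hat{z}^D_j)}{1 - \Phi(\hat{z}^D_j)} \tag{11}$$

where $\phi$ is the standard normal pdf and $\phi(\cdot)/(1 - \Phi(\cdot))$ is the hazard function for the normal distribution.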

Equation (11) gives estimates of average student ability for students enrolled at each college (which

is the average of the upper tail in the normal distributions in Figure 1). Next, I use the linear functional

form assumption for exam results given in equation (5) to estimate the parameters λ0 and λ1. By

definition, average realised exam results at college j for enrolled open applicants and enrolled direct

applicants are given by:

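$$\bar{Y}^O_j = \frac{1}{O^*_j} \sum_{i \in E^O_j} \left(\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij}\right); \qquad \bar{Y}^D_j = \frac{1}{D^*_j} \sum_{i \in E^D_j} \left(\lambda_0 + \lambda_1 A_i + c_{ij} + e_{ij}\right)$$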

where $E^O_j$ is the set of open applicants who were allocated to college j and enrolled at college

j, $E^D_j$ is the set of direct applicants to college j who enrolled at college j, $O^*_j$ is the number of

open applicants who were allocated to college j and enrolled at college j, $D^*_j$ is the number of

direct applicants to college j who enrolled at college j, $\bar{Y}^O_j$ is the average realised exam result of

open applicants enrolled at college j and $\bar{Y}^D_j$ is the average realised exam result of direct applicants

enrolled at college j. Now assume college effects are constant across students so $c_{ij} = c_j$ for all i.

Taking differences causes the college effect cj and the constant term λ0 to drop out:

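$$\bar{Y}^O_j - \bar{Y}^D_j = \lambda_1 \left( \frac{1}{O^*_j} \sum_{i \in E^O_j} A_i - \frac{1}{D^*_j} \sum_{i \in E^D_j} A_i \right) + \frac{1}{O^*_j} \sum_{i \in E^O_j} e_{ij} - \frac{1}{D^*_j} \sum_{i \in E^D_j} e_{ij}.$$

$\hat{E}(A^O_j \mid A^O_j > z_j) - \hat{E}(A^D_j \mid A^D_j > z_j)$ can be used as an estimator for $\frac{1}{O^*_j} \sum_{i \in E^O_j} A_i - \frac{1}{D^*_j} \sum_{i \in E^D_j} A_i$ for

each college j. Thus we can estimate λ1 using an OLS regression:

$$\bar{Y}^O_j - \bar{Y}^D_j = \lambda_1 \left[ \hat{E}(A^O_j \mid A^O_j > z_j) - \hat{E}(A^D_j \mid A^D_j > z_j) \right] + \frac{1}{O^*_j} \sum_{i \in E^O_j} e_{ij} - \frac{1}{D^*_j} \sum_{i \in E^D_j} e_{ij} \tag{12}$$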

with J observations, one for each college. This gives the OLS estimate $\hat{\lambda}_1$. Note there is no constant in

this regression because λ0 has been differenced away. Unfortunately, heteroskedastic measurement

error in the explanatory variable will cause the OLS estimate of λ1 to be biased – the estimates

of mean ability of enrolled students contain estimation error and this estimation error differs across

observations (it is likely to be larger for colleges with fewer open applicants as this means that the

cut-off is less accurately estimated). Whilst methods exist to correct for heteroskedastic measurement

error in simple cases (Sullivan, 2001), correcting λ1 estimates is more complex and, as far as I am

aware, there is no appropriate method to correct for this.
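
The estimation of λ1 can be sketched in a few lines of Python. This is a minimal illustration under the assumptions of the model (unit variance of ability, a common cut-off for open and direct applicants), not the code used for the thesis. The function names and the three colleges' offer rates and score gaps are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def truncated_mean(mu, z_std):
    """Mean of N(mu, 1) above a cut-off with standardised value z_std, as in equation (11)."""
    return mu + STD_NORMAL.pdf(z_std) / (1.0 - STD_NORMAL.cdf(z_std))


def ability_gap(open_offer_rate, direct_offer_rate):
    """Estimated mean ability of successful open minus successful direct applicants."""
    z_hat = STD_NORMAL.inv_cdf(1.0 - open_offer_rate)        # equation (9)
    z_d_hat = STD_NORMAL.inv_cdf(1.0 - direct_offer_rate)    # direct-applicant analogue
    mu_d_hat = z_hat - z_d_hat                               # equation (10)
    return truncated_mean(0.0, z_hat) - truncated_mean(mu_d_hat, z_d_hat)


def ols_through_origin(x, y):
    """Slope of a regression of y on x with no constant, as in equation (12)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical (open offer rate, direct offer rate, open-minus-direct score gap) per college.
colleges = [(0.05, 0.10, -0.03), (0.20, 0.15, 0.02), (0.10, 0.25, -0.06)]
gaps = [ability_gap(p_open, p_direct) for p_open, p_direct, _ in colleges]
score_gaps = [score_gap for _, _, score_gap in colleges]
lambda_1_hat = ols_through_origin(gaps, score_gaps)
print(lambda_1_hat)
```

The sketch makes no attempt to correct for the estimation error in `gaps`, so it inherits the bias from measurement error in the explanatory variable described above.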

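Once we have $\hat{\lambda}_1$, we can back out $\hat{c}^O_j$ and $\hat{c}^D_j$, which are estimates of college effects (inclusive of

the constant term λ0) for open applicants and direct applicants:

$$\hat{c}^O_j = \bar{Y}^O_j - \hat{\lambda}_1 \hat{E}(A^O_j \mid A^O_j > z_j); \qquad \hat{c}^D_j = \bar{Y}^D_j - \hat{\lambda}_1 \hat{E}(A^D_j \mid A^D_j > z_j)$$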

Since we have assumed college effects are constant across students, cOj and cDj are also estimates

of the true college effects cj . A single college effect estimate can be obtained by taking a weighted

average of cOj and cDj , where the weights correspond to the number of students who took Prelims

exams:

$$\hat{c}_j = \frac{O^*_j}{O^*_j + D^*_j} \hat{c}^O_j + \frac{D^*_j}{O^*_j + D^*_j} \hat{c}^D_j. \tag{13}$$

Finally, to make the results of Model 3 directly comparable to those from Model 1 and Model 2, I

present college effects relative to those of the best performing college, college J:

$$\hat{\beta}_j = \hat{c}_j - \hat{c}_J. \tag{14}$$
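
The final steps, backing out the two college effect estimates, combining them as in equation (13) and expressing them relative to a reference college as in equation (14), can be sketched as follows. All function names, college labels and numerical inputs are hypothetical.

```python
def college_effect(y_open, y_direct, ability_open, ability_direct,
                   n_open, n_direct, lambda_1_hat):
    """Enrolment-weighted college effect from the open and direct estimates, as in equation (13)."""
    c_open = y_open - lambda_1_hat * ability_open        # effect implied by open applicants
    c_direct = y_direct - lambda_1_hat * ability_direct  # effect implied by direct applicants
    total = n_open + n_direct
    return (n_open / total) * c_open + (n_direct / total) * c_direct


def relative_effects(effects, reference):
    """College effects measured relative to a reference college, as in equation (14)."""
    return {name: value - effects[reference] for name, value in effects.items()}


# Hypothetical inputs for two colleges; none of these numbers come from the data.
lambda_1_hat = 0.5
effects = {
    "College A": college_effect(0.10, 0.20, 1.40, 1.36, 20, 60, lambda_1_hat),
    "College B": college_effect(0.30, 0.35, 2.06, 2.12, 10, 70, lambda_1_hat),
}
print(relative_effects(effects, reference="College B"))
```

Setting `lambda_1_hat` to zero reduces the college effect to the enrolment-weighted average of raw exam results, so no adjustment for ability is made.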

Implementing Model 3 in practice requires a number of decisions to be taken with regard to the

data. First, I decide to pool across years as done in Model 1 and Model 2. This increases precision

by increasing the number of applicants (particularly open applicants) at each college. Pooling

applications across years is not ideal because it does not reflect how admissions are carried out in

practice; however, open applicants will still be randomly allocated to colleges and if the distribution

of applicant ability is the same each year then cut-offs will be approximately the same across years.

Second, I only compare the subset of colleges with at least 50 open applicants (again to increase

precision). Third, whereas for Model 1 and Model 2, all students with Prelims scores are included

in the analysis, for Model 3, applicants not selected by the first college they were allocated to (these

students were “Rejected by College 1”) are not used in the analysis because their expected ability is

unknown. This means that Model 3 nests Model 1 as a special case where λ1 = 0 and where Model

1 is estimated on a reduced sample only containing applicants selected by the first college they were

allocated to.

5 Data

5.1 Why use Four Datasets?

I use four different datasets due to a trade-off between sample size and the availability of key covariates.

The largest dataset consists of anonymised data on all Oxford applicants in the years 2009-2013.

Information on these students was combined from two different sources. First, application records

were obtained from the Student Data Management and Analysis (SDMA) team at Oxford University.

Second, for enrolled students, the application records were then linked to student records (also held

by the SDMA) through unique student identifiers. Exam results are contained in student records.

Table 1: Information Available in each Dataset

                                   PPE   E&M   Law   All Subjects
Personal Characteristics           Y     Y     Y     Y
Contextual Information             Y     Y     Y     Y
Previous School Type               Y     Y     Y     Y
GCSEs, A-levels and IB             Y     Y     Y     Y
Breakdown of A-levels by Subject   N     Y     Y     N
Admissions Test Scores             Y     Y     Y     N
Interview Scores                   N     N     Y     N
School Reference                   N     N     N     N
Personal Statement                 N     N     N     N
Individual Paper Marks             Y     Y     Y     N

I refer to this large dataset as the “All Subjects” dataset because it covers all courses taught at

Oxford. Its obvious advantage is the large number of students. However focusing exclusively on

this large dataset is limiting for a number of reasons. First, given Model 2 relies on a selection on

observables assumption, it is important to condition on all relevant covariates used in the admissions

process. Time, resource and data availability constraints prevented the SDMA from supplying interview

scores, admissions test scores and specific A-level subjects taken for all students taking every

Oxford course. For courses where this information is missing, the selection on observables assumption

is much less credible. Second, observable ability controls included on the RHS may have a different

impact on exam results depending on the courses taken, e.g. the effect of an A-level in economics is

probably different if a student studies E&M rather than Law at Oxford. Third, college effects may

differ across courses, given that the quality of teaching may vary within colleges. Fourth, admissions

procedures are carried out at a course (department) level so the theoretical model in section 3 implies

open applicants are only randomly allocated to colleges within subjects.

For these reasons I also analyse three other datasets containing information on PPE, E&M and

Law students respectively. I choose these courses because very detailed admissions data is available

for each of them and because they all receive large numbers of applications. The information available

in these datasets is summarised in Table 1.


5.2 Choice of Outcome Variable

Preliminary Examinations (“Prelims”) are the exams taken by students at the end of their first year

at Oxford. In PPE, E&M and Law students each take three first year papers, all marked out of 100.

Each script is marked blindly (so the marking tutors do not know which college the student comes

from). The main outcome variable I use is a student’s average Prelims score standardised within

cohort (and course for the All Subjects dataset). For instance, to construct my outcome variable for

PPE, I first take the average score across the three first year papers and then standardise the

result so the mean for each cohort is 0 and the standard deviation for each cohort is 1.
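
The construction of the outcome variable can be illustrated with a short Python sketch. The function name, the marks and the use of the population standard deviation are illustrative assumptions. The text does not specify whether the population or sample standard deviation is used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise_within_cohort(records):
    """Average each student's paper marks, then convert to z-scores within cohort.

    records is a list of (cohort, [paper marks]) pairs; results keep the input order.
    """
    averages = [(cohort, mean(marks)) for cohort, marks in records]
    by_cohort = defaultdict(list)
    for cohort, average in averages:
        by_cohort[cohort].append(average)
    # Cohort mean and (population) standard deviation of the Prelims averages.
    stats = {cohort: (mean(values), pstdev(values)) for cohort, values in by_cohort.items()}
    return [(average - stats[cohort][0]) / stats[cohort][1] for cohort, average in averages]


# Hypothetical marks for five students in two cohorts, three papers each.
records = [
    (2009, [60, 62, 64]), (2009, [70, 68, 72]), (2009, [50, 55, 60]),
    (2010, [65, 65, 65]), (2010, [55, 60, 62]),
]
z_scores = standardise_within_cohort(records)
print(z_scores)
```

For the All Subjects dataset the same idea would apply with the grouping key extended from cohort to the pair of cohort and course.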

Standardising exam results by cohort is important because the distribution of exam scores varies

from year to year (partly due to variation in exam difficulty) even within the same course. I

also standardise by course for the All Subjects dataset because there is significant variation between

subjects in Prelims averages and this variation is mostly unrelated to college effectiveness.24 Standardising

Prelims averages across subjects avoids penalising colleges that teach courses with lower

Prelims averages.25 Using Prelims average is preferable to estimating separate models for each Prelims

paper taken for two reasons. First, it increases precision. Second, college effectiveness is very

likely to “spill over” across papers.

Research has demonstrated that better university exam performance is closely related to other

desirable outcomes which supports the exam based measurement of college effectiveness (Smith et al.,

2000; Walker and Zhu, 2013; Feng and Graetz, 2015; Naylor et al., 2015). One minor problem is that

interpreting Prelims scores is complicated by the fact that a small number of students retake papers.

Students only retake papers if they fail first time around. In this case the data I have corresponds to the

highest mark they obtained which may be the first or second attempt. It would have been preferable

if I had the Prelims scores from first attempts. However retakes are rare so this should not be a

significant problem.

An obvious alternative outcome variable is Final Examination (“Finals”) results such as average

24 The variation may reflect differences between subjects in the nature of the subject matter (arguably, natural science exams are conducive to more extreme patterns of results) and in conventions within subjects of what is of sufficient merit to be awarded a given mark.

25 I don’t standardise marks for each individual paper because students and colleges may optimally concentrate their teaching efforts on the Prelims papers that have a higher variance of marks.

score across Finals papers. However, Prelims results are preferred for a number of reasons. First,

attrition is greater with Finals (because more students drop out over time) and this implies more

missing data which can bias college effect estimates. Second, in Finals not every student takes the

same exams because of different option choices. This is problematic because there are differences in

score distributions across different options. Third, using Finals results involves excluding students

still in their first or second years at Oxford, substantially reducing the power of the analysis.26

However, when interpreting the results one should keep in mind that Prelims are less important to

students than Finals (they are “lower stakes” exams) and Prelims may over or underestimate Finals

college effects (underestimate because they give less time for any college effect to become evident and

because college effects may be cumulative; overestimate because teaching is more college-focused in

the first year than in later years).

For these reasons I focus on standardised average Prelims scores in the main analysis but also

briefly consider the consequences of using individual first year paper scores and average Finals score

as outcome variables.

5.3 Choice of Control Variables

The control variables included in the analysis are summarised in Table 2.

Most of the controls will be familiar to a UK audience. Less familiar may be contextual information,27

which is provided to admissions tutors in the form of “flags”, identifying disadvantaged

students. Admissions tutors are advised to use the contextual information to suggest extra candidates

to interview. The International Baccalaureate (IB) is an alternative to A-levels where students

complete assessments in six subjects. Each student gets a mark out of 45. The Thinking Skills

Assessment (TSA) is the admissions test for PPE and E&M applicants. It includes a 90-minute

multiple-choice test, marked by the Admissions Testing Service and the marks are made available

26 Using final degree class, as in the Norrington table, has the additional problem that it is discrete and thus discards lots of useful information concerning student achievement. This is particularly a problem at Oxford where over 50% of students obtain a 2:1.

27 It is sometimes argued that contextual information (and some personal characteristics such as gender and race) should not be controlled for. This is because controlling for contextual information sets lower expectations for some demographics. However, not taking these differences into account may penalise colleges that serve these students for reasons that may be at least partly out of their control.

Table 2: Description of Control Variables

Personal Characteristics
Gender                        Dummy variable indicating whether the student is male or female
Ethnicity / Overseas status   Dummy variables indicating: “UK White”; “UK Black”; “UK Asian”; “UK Other ethnic group”; “UK Information refused”; “EU” and; “Non-EU”

Contextual Information
Pre-16 School Flag            Performance of applicant’s school at GCSE is below national average
Post-16 School Flag           Performance of applicant’s school at A-level is below national average
Care Flag                     Applicant has been in-care for more than three months
Polar Flag                    Applicant’s postcode is in POLAR quintiles 1 and 2 - indicating lowest rate of young people’s participation in Higher Education
Acorn Flag                    Applicant’s postcode is in Acorn groups 4 or 5 meaning residents are typically categorised as ‘financially stretched’ or living in ‘urban adversity’

Prior Educational Qualifications
Previous school type          Dummy variables for State, Independent and other school type
GCSEs                         Dummy variables for proportion of A*s obtained at GCSE (if more than 5 GCSEs). Categories are: “Band 1: 100%”; “Band 2: 75-99%”; “Band 3: 50-74%”; “Band 4: < 50%” and; “Less than 5 GCSEs”
A-levels                      Dummy variables for A-level bands. The categories are: “Did not take A-levels”, “Applied to start prior to 2010”, “Applied to start in 2010 or later and no A*”, “1 A*”, “2 A*”, “3 A*” and “4 or more A*”
A-Level subjects              Dummy variables indicating whether students had taken A-levels in certain subjects. Subjects for E&M are Economics, Maths and Further Maths. Subjects for Law are History and Law
A-Level subject grades        Dummy variables indicating the grade achieved in included subjects
IB                            Dummy variables for IB bands. “Band 1: 45 (full marks)”; “Band 2: {43, 44}”; “Band 3: {41, 42}”; “Band 4: ≤ 40” and; “Did not take IB”

Admissions Tests and Interviews
TSA                           Variables for TSA critical thinking score and TSA problem solving score
LNAT                          Variables for LNAT multiple choice score and LNAT essay score
Interview Score               An interview score is given to each candidate out of 10.

to colleges. The Law National Admissions Test (LNAT) is the admissions test for Law applicants.

The LNAT includes a multiple choice section (machine marked out of 42) and an essay section (individually

marked by colleges). Interviews are usually face-to-face with admissions tutors and most

candidates have 2 interviews. Law students are given an interview score out of 10.

A quick note should also be made about using A-level grades, which is complicated by two

factors. First, a new A* grade was introduced in 2010. I create a separate A-level dummy variable

for students who applied before the A* grade was introduced. Second, most applicants are only

halfway through their A-levels when they apply to Oxford. In this case admissions tutors observe

predicted grades which are not available in the data. This should not be too problematic because

rational admissions tutors will make correct inferences on average about the actual A-level grades

an applicant will achieve. Actual A-levels achieved are also probably a better measure of ability than

predicted grades.

5.4 Sample Selection

Sample selection involves choosing both a sample of applicants (only relevant for estimating cut-offs

in Model 3) and a sample of enrolled students (relevant for all three models). Fortunately, the

datasets contain only a very small amount of missing data. The missing data comes in two forms.

First, missing values of control variables for individuals who otherwise provide relatively complete

data. For example, a small number of students (12 in PPE, 39 in Law and 0 in E&M) are missing

admissions test scores perhaps because they were ill on the day of the test or if there were no available

test centres in their home countries (the vast majority are international students with many from

outside the EU). Imputing values for these missing covariates is possible. However, the advantages

of multiple imputation are minimal at best when missing data is less than 5% of the sample (Manly

and Wells, 2015). Multiple imputation also makes interpreting results more difficult (R2 can’t be

reported for example). I thus drop these observations (listwise deletion), which is standard practice in

the value-added literature. This choice should be taken into account when interpreting the resulting

college effect estimates.
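
Listwise deletion, and the comparison against the 5% threshold, can be illustrated with a minimal Python sketch. The record layout and field names are hypothetical. The denominators below (final sample plus dropped students, taken from Tables 3 and 6) are one possible way of defining the share of missing data and are an assumption on my part.

```python
def listwise_delete(rows, required):
    """Keep only rows where every required covariate is observed (not None)."""
    return [row for row in rows if all(row.get(key) is not None for key in required)]


# Hypothetical student records; field names are illustrative.
students = [
    {"id": 1, "tsa_critical": 68.0, "tsa_problem": 71.5},
    {"id": 2, "tsa_critical": None, "tsa_problem": 64.0},  # missing test score
    {"id": 3, "tsa_critical": 72.3, "tsa_problem": 69.1},
]
complete = listwise_delete(students, required=["tsa_critical", "tsa_problem"])
print([row["id"] for row in complete])  # [1, 3]

# Share of students dropped for missing admissions test scores (assumed denominators).
ppe_share = 12 / (1391 + 12)
law_share = 39 / (1229 + 39)
print(ppe_share < 0.05, law_share < 0.05)  # True True
```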

Second, and more significantly, some students who matriculated at Oxford have missing Prelims


Table 3: Sample Selection: PPEApplicant Sample (2009-2014)a 9867Exclusions

Not Enrolled at Oxford -8404Not in Cohorts 2009-14b -7Withdrew from Oxford -51Exclude Extreme Outliersc -2No Admissions Test Scoresd -12

Final Sample 1391

aApplicant sample excludes 53 studentswho have student records but not applicationrecord. This is likely to be because they ap-plied pre-2009, before the dataset begins.

bThese students were offered deferred entry.c2 students had Economics marks recorded

as 0 or 1. The next lowest mark is 30. It isunclear whether these are typographical errorsor true marks.

d11 of the 12 students with missing ad-missions test scores were international studentswith 10 from non-EU countries.

Table 4: Sample Selection: All SubjectsApplicant Sample (2009-2013)a 75033Exclusions

Not Enrolled at Oxford -61153Not in Cohorts 2009-2013b -76No Prelims Averagec -376St Stephen’s College -1

Final Sample 14427

aExcludes all Medicine and PhysiologicalScience applicants as they are not given “marks”in Prelims. Also excludes Classics I and ClassicsII in the 2013 Ucas Cycle, Biomedical Sciencein 2011 and 2012 and Japanese students in 2009and 2010 as in each case their Prelims scores areall missing.

bThese students were offered deferred entry.c210 of these students have officially with-

drawn from Oxford and 8 are suspended.Numbers per college range from 31 (HarrisManchester) to 5 (Exeter and Hertford).

Table 5: Sample Selection: E&M

Applicant Sample (2009-2014)             6874
Exclusions
  Not Enrolled at Oxford
    Rejected Before Interview           -4615
    Rejected After Interview            -1638
    Declined Offer                        -24
    Withdrew during Process               -32
    Failed to meet Offer Grades           -30
    Withdrew After Offer                   -1
  Not in Cohorts 2009-14 (a)               -2
  Withdrew from Oxford (b)                -15
  Exclude Extreme Outliers (c)             -1
Final Sample                              516

(a) These students were offered deferred entry.
(b) 4 from Pembroke. No more than 1 at any other college.
(c) Unusually low TSA score.

Table 6: Sample Selection: Law

Applicant Sample (2007-2013)             8148
Exclusions
  Not Enrolled at Oxford
    Rejected Before Interview           -4094
    Rejected After Interview            -2440
    Declined Offer                        -59
    Withdrew during Process               -60
    Failed to meet Offer Grades          -136
    Withdrew After Offer                   -1
  Not in Cohorts 2007-13 (a)              -10
  Skipped Prelims (b)                     -31
  Withdrew before Prelims (c)             -49
  No LNAT/interview scores (d)            -39
Final Sample                             1229

(a) These students were offered deferred entry.
(b) May have come to Oxford with a BA from overseas and been allowed to transfer automatically to year 2 without having to sit Prelims.
(c) 16 from Harris Manchester. Fewer than 3 from most other colleges.
(d) 24 of the 39 students with missing admissions test scores were international students, with 22 from non-EU countries.


scores (51 for PPE, 49 in Law and 15 in E&M). The main reasons are (i) students dropping out

of Oxford during their first year and (ii) students taking a year out intending to return and repeat

their first year. I again use listwise deletion. This is not ideal because it rewards “cream skimming”

(encouraging weaker students not to take exams and perhaps to drop out). Bias will result if having

missing Prelims scores is an indicator that the student was likely to under-perform relative to their

expected result given their pre-Oxford characteristics. Imputing missing Prelims scores would also

not fully correct for bias. However, because missing Prelims scores are rare and seem evenly spread across

colleges, I do not expect biases to be large.28,29

The sample selection criteria are summarised in Tables 3-6.
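
The sample flows for the three course-level datasets can be checked with simple arithmetic. The sketch below is a Python illustration, not part of the Stata analysis used in this thesis; the names SAMPLE_FLOWS and final_sample are hypothetical, and the counts are copied from Tables 3, 5 and 6. It subtracts each exclusion from the applicant sample and recovers the reported final sample sizes.

```python
# Applicant sample and exclusion counts as reported in Tables 3, 5 and 6.
# Names are illustrative; negative numbers are students dropped at each step.
SAMPLE_FLOWS = {
    "PPE": (9867, [-8404, -7, -51, -2, -12]),
    "E&M": (6874, [-4615, -1638, -24, -32, -30, -1, -2, -15, -1]),
    "Law": (8148, [-4094, -2440, -59, -60, -136, -1, -10, -31, -49, -39]),
}


def final_sample(applicants, exclusions):
    """Return the sample size left after applying every exclusion."""
    return applicants + sum(exclusions)


final_sizes = {
    course: final_sample(applicants, exclusions)
    for course, (applicants, exclusions) in SAMPLE_FLOWS.items()
}

for course, size in final_sizes.items():
    print(f"{course}: final sample = {size}")
```

The three results agree with the final samples used in the regressions (1391 for PPE, 516 for E&M and 1229 for Law).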

5.5 Descriptive Statistics

Tables 7 and 8 present application, offer and enrolment statistics for each college. The first two

columns show that most applicants to Oxford (e.g. over 80% in PPE) are direct applicants. There

is large variation in the numbers of direct applicants received by each college. For example, whereas

Balliol received 985 direct applications for PPE, St Hilda’s received only 69. The colleges with

relatively few direct applicants are allocated large numbers of open applicants (Balliol received 0

open applicants in PPE whereas St Hilda’s received 246). The tables show that almost all colleges

make offers to a higher proportion of direct applicants than they do to open applicants, suggesting

that the direct applicants are on average of higher ability. Consequently, over 90% of students who

take exams at Oxford are direct applicants rather than open applicants.

Tables 9-12 present descriptive statistics for applicants and exam takers for each dataset. Columns

1-3 present mean pre-Oxford characteristics of applicants. They show that open applicants

are more likely than direct applicants to be international students (both from the EU and from outside

the EU). Open applicants also tend to perform less well in GCSEs, A-levels and admissions tests.

28 An exception is Harris Manchester, where a disproportionately large number of students drop out, which may be related to the fact that Harris Manchester is a college for “mature students”.

29 If cream skimming is taking place, we might expect to see a positive correlation between college effectiveness estimates and the share of a college’s students that are missing exam results. However, the correlation between the selection on observables estimates and the share of dropouts is −0.86 for PPE, −0.40 for E&M and 0.09 for Law. If anything, the opposite is the case: less effective colleges tend to have larger shares of dropouts.


Table 7: Application, Offer and Enrolment Statistics: PPE and E&M

                               PPE                                                E&M
          Applicants     % offers   Enrolled with Prelims    Applicants     % offers   Enrolled with Prelims
College  Direct   Open Direct   Open Direct   Open Reject   Direct   Open Direct   Open Direct   Open Reject
                                                 College1                                           College1
BALL        985      0     9%      -     70      0      0      321      0     5%      -     14      0      2
BLACKF        3      0    33%      -      0      0      3
BNC         528      0    11%      -     53      0      0      537      8     7%     0%     32      0      2
CCC         137    110    18%     7%     21      5      7
CH-CH       368     17    15%     0%     44      0     10      245     18     8%     0%     17      0      1
EXETER      257     14    15%     0%     34      0      2      218      1     4%     0%      9      0      3
H-MAN       127      0    17%      -     15      0     20       54      0    11%      -      4      0      9
HERT        264     17    16%     6%     39      1      7      408    150     9%     3%     34      3      7
JESUS       164     50    15%     2%     21      0     14      146    138    12%     4%     15      3      3
KEBLE       230     47    14%     6%     32      3     13      217    188    12%     3%     22      4     11
LINC        312     31    16%     0%     49      0      6
LMH         141    136    20%     4%     27      2     16      126    129     9%     4%     10      3      5
MAGD        498      0    12%      -     52      0      0
MANS        132    127    14%     2%     16      1     23
MERT        314      0    14%      -     37      0      6      213     78    11%     1%     22      1      1
NEW         452      0    14%      -     60      0      1      154     41     8%     2%     12      1      2
ORIEL       271     40    17%     8%     42      3      6
PEMB        164     89    15%     8%     24      7     13      525     51     9%     4%     41      2      2
QUEENS      158     94    17%     4%     24      3     13       73     33     3%     3%      2      1      4
REGENT       20      0    10%      -      2      0     14
S-ANNE      160    111    14%     9%     20      7     15      128    168    11%     4%     12      6      2
S-BEN         8      0     0%      -      0      0     19
S-CATS      241     41    11%     5%     26      1     17      191      1     4%     0%      6      0      7
S-HIL        69    246     7%     4%      3      7     31       52    169     2%     7%      1      8     11
S-HUGH       71    132    14%     8%      9      9     14      134    213    14%     3%     17      5     10
S-JOHN      297      9    13%    22%     34      2      3      134     16     4%     6%      5      1      3
S-PET       161    157    16%     6%     21     10     18      206    239    13%     3%     26      7      4
SEH         139     84    13%     8%     17      6     11      181    265     9%     5%     14     12      9
SOMER       109    220    17%     5%     15      9     30
TRIN        251      3    13%     0%     24      0      4      236      8     6%     0%     14      0      3
UNIV        431     30    14%     3%     54      0     10
WADH        327      0    16%      -     47      0      2       89     54    13%     2%     10      1      1
WORC        265      5    12%    20%     29      1      4      312      0     5%      -     16      0      1
Total      8054   1810    14%     6%    961     77    352     4900   1968     8%     4%    355     58    103

Table 7 shows application, offer and enrolment statistics for PPE and E&M. The first two columns give the number of applications received by each college. The third and fourth columns give the percentage of offers made to direct and open applicants. Columns 1-4 are based on the applicant sample. Columns 5-7 give the number of enrolled students with Prelims results. Column 7, “Reject College1”, denotes the number of students at each college who were not made an offer by the college they were originally allocated to.


Table 8: Application, Offer and Enrolment Statistics: Law and All Subjects

                               Law                                           All Subjects
          Applicants     % offers   Enrolled with Prelims    Applicants     % offers   Enrolled with Prelims
College  Direct   Open Direct   Open Direct   Open Reject   Direct   Open Direct   Open Direct   Open Reject
                                                 College1                                           College1
BALL        256      0    12%      -     26      0      6     3300     61    15%     8%    421      4     62
BLACKF                                                           7      0    57%      -      0      0      3
BNC         509      2    11%     0%     41      0      3     3761     46    12%    11%    419      2     35
CCC         109     79    23%    10%     21      8      9     1009    350    24%     9%    219     27     64
CH-CH       290     30    16%     7%     36      2     14     2671    218    18%    11%    399     16    158
EXETER      249     41    17%    15%     33      5      8     2420    156    15%    12%    331     11     78
GREYF         3      0    33%      -      0      0      4
H-MAN       281      5    13%     0%     12      0      9      593      0    15%      -     49      0     67
HERT        197     59    19%     5%     35      2      1     2602    321    19%     5%    446     13     82
JESUS       223     72    17%    10%     28      5      5     2190    339    18%     5%    353     15     87
KEBLE       238     69    21%     7%     37      3      7     2832    390    17%     4%    446     12    115
LINC        309     44    13%     5%     33      1      7     1904    181    19%     8%    326     11     47
LMH         164     43    19%    12%     29      5      4     1779    686    21%     7%    329     34    174
MAGD        349      0    14%      -     43      0      5     3176     90    16%    10%    465      4     42
MANS        130     42    12%    10%     12      4     14      998    473    17%     8%    149     24    153
MERT        179     22    18%     0%     23      0      7     2103    139    18%     7%    333      8     38
NEW         227     39    17%    13%     34      4      8     2601    173    20%     7%    478      9     52
ORIEL       194    101    21%    10%     31      9      3     1738    309    17%    10%    276     23     90
PEMB        173     35    16%     6%     23      1     11     1913    463    17%    12%    295     43    139
QUEENS      133     48    20%    17%     16      5      4     1511    471    20%    11%    268     42    130
REGENT       13      0     0%      -      0      0     13      109      0    21%      -     17      0    129
S-ANNE      130     76    16%     9%     17      5     20     1655    846    21%    10%    304     64    186
S-BEN                                                           43      0    19%      -      7      0     63
S-CATS      307     36    12%     8%     33      3     18     2389    572    18%    10%    380     42    221
S-HIL        87    200    11%     6%      8      8     25      720   1508    18%    11%    113    125    295
S-HUGH       74    125     9%    19%      5     19     15     1031   1422    22%    11%    206    122    235
S-JOHN      196     48    24%    10%     41      4      3     2781    195    17%    10%    426     18     76
S-PET       107     81    15%     5%     13      3     19     1298    855    18%     9%    208     56    191
SEH          97    159    22%    11%     13     16     16     1464   1071    21%    11%    279     97    171
SOMER        62    125    19%    14%     10     13     15      871   1153    28%    12%    212    106    206
TRIN        195      3    16%     0%     22      0      5     2186     36    17%     0%    336      0     38
UNIV        382     10    14%    10%     38      1      9     2534    105    19%     5%    406      4     73
WADH        229     90    16%     7%     29      5     17     2778    285    18%     7%    455     16    102
WORC        371      1    13%     0%     48      0      4     4106     34    13%    12%    477      4     44
Total      6463   1685    16%    10%    790    131    308    63073  12948    18%    10%   9828    952   3646

Table 8 shows application, offer and enrolment statistics for Law and All Subjects. The first two columns give the number of applications received by each college. The third and fourth columns give the percentage of offers made to direct and open applicants. Columns 1-4 are based on the applicant sample. Columns 5-7 give the number of enrolled students with Prelims results. Column 7, “Reject College1”, denotes the number of students at each college who were not made an offer by the college they were originally allocated to.


Table 9: Mean Applicant and Exam Taker Characteristics: PPE

                                 Applicants             Exam Takers
                            Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                      0.38    0.41    0.38    0.33    0.32    0.33
  UK White                    0.38    0.16    0.34    0.63    0.27    0.61
  UK Black                    0.02    0.02    0.02    0.02    0.01    0.02
  UK Asian                    0.08    0.04    0.07    0.09    0.03    0.08
  UK Other Ethnicity          0.01    0.00    0.01    0.01    0.01    0.01
  UK Information Refused      0.02    0.01    0.02    0.02    0.02    0.02
  EU                          0.22    0.34    0.24    0.10    0.29    0.11
  Non EU                      0.24    0.42    0.27    0.12    0.37    0.13
Contextual Factors
  Polar Flag                  0.06    0.04    0.05    0.08    0.08    0.08
  Acorn Flag                  0.05    0.04    0.05    0.05    0.05    0.05
  Pre-16 School Flag          0.04    0.03    0.04    0.05    0.05    0.05
  Post-16 School Flag         0.09    0.07    0.08    0.10    0.08    0.10
  Care Flag                   0.00    0.00    0.00    0.00    0.00    0.00
  Overall Flag                0.03    0.03    0.03    0.03    0.02    0.03
Previous School Type
  State                       0.31    0.18    0.29    0.43    0.26    0.42
  Independent                 0.27    0.09    0.23    0.37    0.12    0.35
  Other School Type           0.42    0.74    0.48    0.20    0.62    0.22
  Took GCSEs                  0.56    0.27    0.51    0.78    0.35    0.76
  GCSE Band 4 (lowest)        0.17    0.16    0.17    0.07    0.06    0.07
  GCSE Band 3                 0.14    0.06    0.13    0.15    0.13    0.15
  GCSE Band 2                 0.16    0.04    0.13    0.31    0.11    0.30
  GCSE Band 1 (highest)       0.09    0.01    0.07    0.26    0.05    0.25
  Took A-levels               0.49    0.34    0.46    0.60    0.40    0.58
  Took IB                     0.08    0.08    0.08    0.06    0.09    0.07
Admissions Tests
  TSA Critical               64.51   60.70   63.85   73.66   72.22   73.56
  TSA Problem                58.57   55.69   58.07   68.15   68.76   68.19
Outcomes
  Prelims Average                                    61.82   61.83   61.82
  Std Prelims Average                                -0.00    0.00   -0.00

Observations                  8055    1812    9867    1298      93    1391

Table displays mean characteristics of PPE students. Columns 1-3 give the mean characteristics of applicants to Oxford for PPE between 2009 and 2014. Columns 4-6 give the mean characteristics of students who took Prelims at Oxford in PPE.


Table 10: Mean Applicant and Exam Taker Characteristics: E&M

                                 Applicants             Exam Takers
                            Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                      0.37    0.38    0.37    0.30    0.26    0.29
  UK White                    0.31    0.11    0.25    0.59    0.33    0.56
  UK Black                    0.02    0.01    0.02    0.02    0.02    0.02
  UK Asian                    0.14    0.07    0.12    0.17    0.09    0.16
  UK Other Ethnicity          0.01    0.00    0.01    0.02    0.00    0.02
  UK Information Refused      0.01    0.00    0.01    0.02    0.00    0.02
  EU                          0.17    0.28    0.20    0.07    0.20    0.09
  Non EU                      0.31    0.52    0.37    0.11    0.36    0.14
Contextual Factors
  Polar Flag                  0.05    0.04    0.05    0.07    0.05    0.07
  Acorn Flag                  0.04    0.03    0.04    0.04    0.05    0.04
  Pre-16 School Flag          0.04    0.02    0.03    0.03    0.05    0.03
  Post-16 School Flag         0.07    0.06    0.07    0.09    0.08    0.09
  Care Flag                   0.00    0.00    0.00    0.00    0.00    0.00
  Overall Flag                0.02    0.02    0.02    0.02    0.03    0.02
Previous School Type
  State                       0.29    0.16    0.25    0.39    0.21    0.37
  Independent                 0.32    0.12    0.27    0.44    0.27    0.42
  Other School Type           0.38    0.72    0.48    0.16    0.52    0.21
  Took GCSEs                  0.57    0.26    0.48    0.83    0.44    0.78
  GCSE Band 4 (lowest)        0.16    0.13    0.15    0.05    0.06    0.05
  GCSE Band 3                 0.17    0.07    0.14    0.19    0.17    0.18
  GCSE Band 2                 0.17    0.04    0.14    0.33    0.15    0.31
  GCSE Band 1 (highest)       0.07    0.01    0.06    0.26    0.06    0.24
  Took A-levels               0.62    0.37    0.55    0.78    0.44    0.74
  Took IB                     0.07    0.07    0.07    0.04    0.08    0.05
  Economics                   0.52    0.29    0.45    0.68    0.32    0.64
  Maths                       0.61    0.35    0.53    0.77    0.44    0.73
  Further Maths               0.19    0.09    0.16    0.29    0.11    0.27
  A* in Economics             0.16    0.06    0.13    0.30    0.11    0.27
  A* in Maths                 0.25    0.12    0.22    0.44    0.24    0.42
  A* in Further Maths         0.05    0.02    0.04    0.13    0.06    0.12
Admissions Tests
  TSA Critical               60.48   56.96   59.52   71.46   70.24   71.30
  TSA Problem                58.15   55.71   57.49   68.34   68.66   68.38
Outcomes
  Prelims Average                                    63.08   64.99   63.33
  Std Prelims Average                                -0.04    0.25   -0.00

Observations                  4904    1970    6874     450      66     516

Table displays mean characteristics for Economics and Management students. Columns 1-3 give the mean characteristics of applicants to Oxford for Economics and Management between 2009 and 2014. Columns 4-6 give the mean characteristics of students who took Prelims in Economics and Management.


Table 11: Mean Applicant and Exam Taker Characteristics: Law

                                 Applicants             Exam Takers
                            Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                      0.56    0.54    0.55    0.55    0.57    0.55
  UK White                    0.51    0.31    0.47    0.66    0.41    0.63
  UK Black                    0.04    0.03    0.04    0.03    0.03    0.03
  UK Asian                    0.10    0.09    0.10    0.09    0.09    0.09
  UK Other Ethnicity          0.02    0.01    0.02    0.02    0.03    0.02
  UK Information Refused      0.27    0.46    0.31    0.16    0.39    0.19
  EU                          0.05    0.08    0.06    0.04    0.06    0.04
  Non EU                      0.24    0.46    0.29    0.14    0.36    0.17
Contextual Factors
  Polar Flag                  0.08    0.08    0.08    0.06    0.10    0.07
  Acorn Flag                  0.06    0.07    0.06    0.04    0.06    0.05
Previous School Type
  State                       0.49    0.36    0.47    0.54    0.43    0.53
  Independent                 0.21    0.10    0.19    0.29    0.14    0.27
  Other School Type           0.30    0.54    0.34    0.17    0.43    0.20
School Exam Results
  Took GCSEs                  0.69    0.47    0.64    0.81    0.58    0.78
  GCSE Band 4 (lowest)        0.31    0.31    0.31    0.10    0.18    0.11
  GCSE Band 3                 0.18    0.09    0.16    0.24    0.12    0.22
  GCSE Band 2                 0.14    0.05    0.12    0.30    0.19    0.28
  GCSE Band 1 (highest)       0.06    0.02    0.05    0.18    0.09    0.17
  Took A-levels               0.65    0.48    0.62    0.78    0.53    0.74
  Took IB                     0.04    0.04    0.04    0.04    0.04    0.04
  History                     0.38    0.22    0.35    0.49    0.30    0.47
  History A                   0.24    0.11    0.22    0.35    0.22    0.33
  History A*                  0.06    0.03    0.06    0.12    0.07    0.12
  Law                         0.12    0.14    0.13    0.08    0.14    0.09
  Law A                       0.08    0.09    0.08    0.07    0.11    0.07
  Law A*                      0.02    0.02    0.02    0.01    0.02    0.01
Admissions Tests
  LNAT Multiple Choice       19.64   18.70   19.45   22.76   22.96   22.79
  LNAT Essay                 58.80   56.08   58.24   64.62   64.32   64.58
  Interview Score                                     8.04    8.02    8.04
Outcomes
  Prelims Average                                    65.08   64.67   65.02
  Std Prelims Average                                 0.02   -0.13    0.00

Observations                  6463    1685    8148    1069     160    1229

Table displays mean characteristics for Law students. Columns 1-3 give the mean characteristics of applicants to Oxford for Law between 2007 and 2013. Columns 4-6 give the mean characteristics of students who took Prelims at Oxford in Law.


Table 12: Mean Applicant and Exam Taker Characteristics: All Subjects

                                 Applicants             Exam Takers
                            Direct    Open     All  Direct    Open     All
Personal Characteristics
  Female                      0.49    0.46    0.48    0.47    0.43    0.46
  UK White                    0.60    0.31    0.55    0.74    0.48    0.72
  UK Black                    0.02    0.02    0.02    0.02    0.01    0.01
  UK Asian                    0.08    0.06    0.07    0.07    0.06    0.07
  UK Other Ethnicity          0.01    0.01    0.01    0.01    0.01    0.01
  UK Information Refused      0.02    0.01    0.02    0.02    0.01    0.02
  EU                          0.09    0.19    0.11    0.05    0.16    0.06
  Non EU                      0.16    0.38    0.20    0.09    0.27    0.10
Contextual Factors
  Polar Flag                  0.09    0.08    0.08    0.08    0.10    0.09
  Acorn Flag                  0.06    0.07    0.06    0.05    0.07    0.06
Previous School Type
  State                       0.45    0.33    0.43    0.47    0.42    0.47
  Independent                 0.33    0.13    0.29    0.41    0.17    0.39
  Other School Type           0.22    0.55    0.28    0.12    0.41    0.14
School Exam Results
  Took GCSEs                  0.73    0.41    0.68    0.86    0.57    0.84
  GCSE Band 4 (lowest)        0.26    0.26    0.26    0.12    0.19    0.13
  GCSE Band 3                 0.21    0.11    0.19    0.21    0.18    0.21
  GCSE Band 2                 0.19    0.06    0.17    0.32    0.14    0.30
  GCSE Band 1 (highest)       0.09    0.02    0.08    0.21    0.06    0.20
  Took A-levels               0.66    0.43    0.62    0.79    0.56    0.77
  Took IB                     0.05    0.06    0.05    0.04    0.05    0.04
Outcomes
  Prelims Average                                    64.76   64.68   64.75
  Std Prelims Average                                 0.01   -0.07   -0.00

Observations                 63081   12952   76033   13306    1121   14427

Columns 1-3 give the mean characteristics of applicants to Oxford. Columns 4-6 give the mean characteristics of students who took Prelims at Oxford.

Table 13: Tests for Differences in Mean and Variance of Applicant Ability across Colleges

                                  PPE                         E&M                       Law
                        TSA Critical   TSA Problem  TSA Critical   TSA Problem          LNAT
Variance  F-statistic          1.888         1.247         1.120         1.533         1.912
          Prob > F             0.002         0.156         0.313         0.050         0.001

Mean      F-statistic          0.968         0.978         0.962         0.975         0.983
          Prob > F             0.000         0.000         0.000         0.000         0.000

The robvar command in Stata is used to report Brown’s robust test statistic for the equality of variances of admissions test scores at different colleges. The mvtest command in Stata is used to test for differences in mean admissions test scores across applicants to different colleges.


Admissions test scores (TSA and LNAT) and GCSE results provide particularly strong evidence

that direct applicants are on average of higher ability than open applicants. Columns 4-6 present

corresponding descriptive statistics for the final sample of students who take Prelims exams.

5.5.1 Testing Assumptions for Selection on Observables and Unobservables

Before moving on to the results, I test two of the key assumptions of Model 3. First, I test the

assumption that the variance of the ability of direct applicants is the same across colleges (and the

same as the variance of ability of open applicants). Since ability is unobservable, I use admissions

test scores as a proxy for ability. Table 13, at the bottom of page 42, reports Brown’s robust test

statistic for the equality of variances which I calculate using the robvar command in Stata (Brown

and Forsythe, 1974). There is relatively strong evidence that the standard deviation of applicant

ability differs across colleges – 2 of the 5 p-values are less than 0.01 and a further p-value is less

than 0.10. This provides some evidence against Model 3 though the importance of the failure of

this assumption is ultimately an empirical question – it is possible that these differences in standard

deviation are not practically important (the large sample size makes it possible for practically small

differences in the variance of ability to be statistically significant). Table 13 also reports the results of

a test for differences in mean admissions test scores across colleges and open applicants, implemented

using the mvtest command in Stata. The results strongly reject the hypothesis that mean admissions

test scores are the same across colleges and open applicants. This justifies the modelling choice to

allow mean applicant ability to differ across colleges.
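
For readers who want to see what a robust equality-of-variances statistic of this kind involves, the sketch below computes the median-centred Brown and Forsythe (1974) W statistic in plain Python. It is an illustration rather than the Stata code used for Table 13: the function name brown_forsythe_w and the example scores are hypothetical, and the sketch returns only the statistic, which would be compared with an F distribution with k − 1 and N − k degrees of freedom (k groups, N observations).

```python
from statistics import fmean, median


def brown_forsythe_w(groups):
    """Median-centred Brown-Forsythe W statistic for equality of variances.

    `groups` is a list of lists, one list of scores per college.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each score from its own group's median.
    z = []
    for g in groups:
        centre = median(g)
        z.append([abs(x - centre) for x in g])
    z_means = [fmean(zi) for zi in z]
    z_grand = fmean([v for zi in z for v in zi])
    # One-way ANOVA F statistic computed on the absolute deviations.
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical admissions test scores at two colleges (not real data).
example_scores = [
    [60, 62, 64, 66, 68],   # tightly clustered applicants
    [50, 58, 64, 70, 78],   # more dispersed applicants
]
print(f"W = {brown_forsythe_w(example_scores):.3f}")
```

Centring on the median rather than the mean is what makes the statistic robust to skewed score distributions; a large W relative to the F reference distribution indicates that the spread of scores differs across colleges.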

Second, I test whether open applicants really are randomly allocated to colleges in the admissions

process using a balancing test (randomisation test), analogous to those typically carried out using

pre-treatment outcomes in a randomised trial. I implement it by taking a candidate confounder

(admissions test scores, gender, etc.) and regressing it on a vector of college dummies for the sample

of open applicants. Zero coefficients on the college dummies support the assumption that open

applicants are randomly allocated to colleges. The balancing test is a simple F-test on the college

dummies. The results reported in Table 14 support the randomisation assumption within courses.

Of the 41 p-values in the first 3 columns only 1 is less than 0.05 and only 5 are less than 0.10. A


Table 14: P-values from Balance Tests

                      PPE         E&M         Law         All Subjects    All Subjects
                                                          (all courses)   (by course)
Gender                0.284       0.467       0.091*      0.000**         0.854
White                 0.244       0.172       0.492       0.000**         0.074
Asian                 0.917       0.181       0.696       0.355           0.536
Black                 0.990       0.339       0.989       0.944           0.848
EU                    0.709       0.548       0.169       0.076           0.619
Non-EU                0.454       0.408       0.587       0.000**         0.550
Overseas              0.062       0.787       0.612       0.000**         0.831
State                 0.007**     0.587       0.422       0.000**         0.101
Independent           0.519       0.437       0.543       0.288           0.230
Took GCSEs            0.102       0.721       0.624       0.000**         0.443
Took A-levels         0.357       0.598       0.095       0.003**         0.291
Took IB               0.435       0.374       0.264       0.532           0.618
TSA Problem           0.356       0.084       -           -               -
TSA Critical          0.169       0.762       -           -               -
LNAT                  -           -           0.794       -               -
No. Open Applicants   1812        1970        1685        12952           12952

Sample contains all open applicants. Columns 1-4 display the p-values from regressions of candidate confounders on a full set of college dummies. Column 5 displays the p-values from regressions of candidate confounders on a full set of college dummies and course dummies. Significance at the 1 and 5 percent level is denoted by ** and *, respectively.

small number of significant F-statistics does not make randomisation implausible as there are many

candidate confounders. In expectation, p-values should be smaller than 0.05 in approximately 2 of

the 41 tests if the tests were independent (though these tests are not independent). Based on these

tests, and the way the open applicants are allocated to colleges, I believe that the allocation of open

applicants to colleges was random within subjects.
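
The logic of the balancing test can be sketched in a few lines of Python. Under classical (homoskedastic) assumptions, the F-test that all college dummy coefficients are zero in a regression of a covariate on college dummies is the one-way ANOVA F statistic, which is what the hypothetical function balance_f_stat below computes. The example data are invented, and the sketch is not the code used to produce Table 14. The final line reproduces the back-of-envelope count of rejections expected by chance.

```python
from statistics import fmean


def balance_f_stat(values_by_college):
    """One-way ANOVA F statistic for H0: equal covariate means at all colleges.

    `values_by_college` maps a college label to the covariate values
    (e.g. admissions test scores) of the open applicants allocated to it.
    """
    groups = list(values_by_college.values())
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [fmean(g) for g in groups]
    grand_mean = fmean([v for g in groups for v in g])
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Invented covariate values for open applicants at three colleges.
example = {
    "College A": [61.0, 58.5, 63.0, 60.5],
    "College B": [59.0, 62.5, 60.0, 61.5],
    "College C": [60.0, 57.5, 64.0, 62.0],
}
print(f"F = {balance_f_stat(example):.3f}")

# With 41 tests at the 5% level, about two rejections are expected by chance
# even under perfect randomisation (treating the tests as independent).
expected_false_rejections = 41 * 0.05
print(f"Expected rejections under the null: {expected_false_rejections:.2f}")
```

A small F statistic (large p-value) for a given covariate is what random allocation predicts, which is the pattern in the first three columns of Table 14.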

However, the results of column 4 are very different, with the null hypothesis convincingly rejected on

multiple occasions. This is because column 4 pools open applicants across courses and colleges teach a

different range of courses. Thus the random assignment assumption does not hold for the All Subjects

dataset and the results of Model 3 should be interpreted with caution for this dataset. Column 5,

which adds course dummy variables as controls, again supports the view that open applicants are

randomly assigned to colleges conditional on the course they applied for.


6 Results

6.1 Results for Norrington Table Plus and Selection on Observables

Tables 15-18 show regression results for Models 1 and 2 for PPE, E&M, Law and All Subjects

respectively. The college effect estimates are displayed and coefficients on control variables are

suppressed. Column 1 is the naïve Model 1 with no control variables. Model 2 in the second column

adds all the observable control variables. Our main interest is in the estimates of the coefficients on

college dummy variables.

The coefficients in column 1 can be interpreted as the average differences in (standardised) Prelims

results at various Oxford colleges, relative to students at the college with the highest mean Prelims

scores (the college with the highest mean Prelims scores is St John’s for PPE, Harris Manchester

for E&M, Magdalen for Law and St John’s again when all subjects are combined). For instance, in

Table 15 for PPE, the coefficient of −0.11 in the first row on University College (“UNIV”) can be

interpreted as saying that, on average, students at University College score 0.11 standard deviations

lower on PPE Prelims than students at St John’s.30 These differences in average Prelims scores

amongst students who matriculate at different Oxford colleges are statistically significant. At the

bottom of each table, for each model, I report the results of F-tests under the null hypothesis that

college effects are equal at all colleges. The results for column 1 show, very convincingly, that average

Prelims scores differ across colleges.31 This is my first result.

Result 1. There are statistically significant differences in unconditional Prelims results across

colleges.

Given Model 1 makes no adjustments for observable or unobservable differences in student char-

30 For Oxford-based readers more familiar with raw exam marks, this translates into University College PPE students scoring approximately 0.11 × 5.7 ≈ 0.627 raw marks lower in Prelims than PPE students at St John’s (given that the standard deviation of the Prelims Average for PPE is 5.7).

31 Some readers may ask why I am conducting statistical tests (and also why all standard errors are not equal to zero) when I am analysing the full population of students. For instance, Berk (2004) argues “If the data are a population, there is no sampling, no uncertainty because of sampling, and no need for statistical inference. Indeed, statistical inference makes no sense.” However, Abadie et al. (2014) show that uncertainty about causal effects, rather than sampling, justifies the use of standard errors in this context. Even if we observe the entire finite population, so that we can estimate the value of regression coefficients in the population with no uncertainty, causal effects are uncertain because, for each student, at most one of their potential outcomes is observed.


Table 15: Regressions: PPE

          (1)                 (2)
          Prelims Average     Prelims Average
          β         SE        β         SE
UNIV      −0.11     (0.20)     0.08     (0.19)
ORIEL     −0.25     (0.19)    −0.08     (0.18)
HERT      −0.25     (0.21)    −0.15     (0.19)
BALL      −0.20     (0.20)    −0.20     (0.18)
BNC       −0.42     (0.21)    −0.23     (0.19)
REGENT    −0.54∗    (0.25)    −0.23     (0.23)
EXETER    −0.47∗    (0.21)    −0.27     (0.20)
JESUS     −0.54∗    (0.25)    −0.27     (0.22)
PEMB      −0.60∗∗   (0.20)    −0.27     (0.20)
SEH       −0.63∗∗   (0.22)    −0.29     (0.21)
S-HIL     −0.66∗∗   (0.21)    −0.31     (0.21)
MERT      −0.36     (0.23)    −0.31     (0.22)
NEW       −0.43∗    (0.20)    −0.31     (0.19)
S-PET     −0.61∗∗   (0.20)    −0.32     (0.19)
SOMER     −0.58∗∗   (0.19)    −0.32     (0.18)
MANS      −0.81∗∗   (0.19)    −0.39∗    (0.19)
LMH       −0.73∗∗   (0.21)    −0.39     (0.20)
MAGD      −0.48∗    (0.20)    −0.40∗    (0.20)
CCC       −0.65∗∗   (0.21)    −0.41∗    (0.20)
LINC      −0.62∗∗   (0.19)    −0.42∗    (0.18)
CH-CH     −0.59∗∗   (0.19)    −0.44∗    (0.19)
KEBLE     −0.63∗∗   (0.20)    −0.45∗    (0.19)
S-CATS    −0.63∗∗   (0.22)    −0.45∗    (0.20)
TRIN      −0.56∗    (0.24)    −0.48∗    (0.22)
WORC      −0.63∗∗   (0.23)    −0.50∗    (0.23)
WADH      −0.68∗∗   (0.21)    −0.53∗∗   (0.19)
H-MAN     −0.76∗∗   (0.27)    −0.54∗    (0.26)
S-ANNE    −0.75∗∗   (0.22)    −0.56∗∗   (0.20)
S-BEN     −0.93∗∗   (0.24)    −0.57∗    (0.23)
S-HUGH    −1.00∗∗   (0.26)    −0.58∗    (0.26)
QUEENS    −0.77∗∗   (0.21)    −0.58∗∗   (0.20)
BLACKF    −2.07∗    (0.98)    −1.89∗    (0.88)

Controls  No                  Yes
Prob > F  0.000               0.024
SD        0.176               0.114
Hausman                       0.224
R-squared 0.053               0.212
N         1391                1391

The baseline college is St John’s. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004).
∗ p < 0.05, ∗∗ p < 0.01

Table 16: Regressions: E&M

          (1)                 (2)
          Prelims Average     Prelims Average
          β         SE        β         SE
S-HIL     −0.59∗    (0.25)    −0.08     (0.24)
LMH       −0.68∗    (0.29)    −0.13     (0.27)
S-CATS    −0.67     (0.36)    −0.13     (0.33)
NEW       −0.68∗∗   (0.26)    −0.19     (0.29)
CH-CH     −0.72∗∗   (0.27)    −0.23     (0.29)
SEH       −0.82∗∗   (0.22)    −0.25     (0.23)
HERT      −0.84∗∗   (0.23)    −0.27     (0.25)
S-PET     −0.96∗∗   (0.24)    −0.36     (0.24)
S-HUGH    −0.62∗    (0.27)    −0.37     (0.24)
QUEENS    −0.98∗    (0.39)    −0.37     (0.27)
JESUS     −0.91∗∗   (0.26)    −0.36     (0.26)
EXETER    −0.99∗∗   (0.38)    −0.37     (0.37)
PEMB      −0.91∗∗   (0.23)    −0.40     (0.23)
KEBLE     −1.06∗∗   (0.24)    −0.51∗    (0.25)
WORC      −0.86∗∗   (0.26)    −0.51     (0.26)
BNC       −0.96∗∗   (0.25)    −0.52∗    (0.24)
S-JOHN    −0.88∗    (0.39)    −0.56     (0.35)
S-ANNE    −1.21∗∗   (0.28)    −0.69∗∗   (0.24)
TRIN      −1.11∗∗   (0.25)    −0.73∗    (0.32)
WADH      −1.20∗∗   (0.33)    −0.82∗∗   (0.29)
MERT      −1.13∗∗   (0.25)    −0.81∗∗   (0.26)
BALL      −1.59∗∗   (0.26)    −1.16∗∗   (0.30)

Controls  No                  Yes
Prob > F  0.000               0.011
SD        0.145               0.146
Hausman                       0.189
R-squared 0.063               0.352
N         516                 516

The baseline college is Harris Manchester. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004).
∗ p < 0.05, ∗∗ p < 0.01


Table 17: Regressions: Law

          (1)                 (2)
          Prelims Average     Prelims Average
          β         SE        β         SE
WORC      −0.04     (0.20)    −0.03     (0.20)
LMH       −0.25     (0.20)    −0.14     (0.19)
HERT      −0.26     (0.19)    −0.16     (0.18)
S-CATS    −0.35     (0.21)    −0.19     (0.20)
UNIV      −0.38     (0.20)    −0.22     (0.20)
MANS      −0.30     (0.22)    −0.22     (0.20)
BNC       −0.18     (0.20)    −0.23     (0.20)
S-ANNE    −0.38     (0.23)    −0.28     (0.22)
TRIN      −0.30     (0.23)    −0.29     (0.22)
SEH       −0.61∗∗   (0.21)    −0.33     (0.20)
H-MAN     −0.27     (0.26)    −0.34     (0.26)
MERT      −0.29     (0.24)    −0.36     (0.23)
LINC      −0.40∗    (0.20)    −0.42∗    (0.20)
PEMB      −0.60∗∗   (0.21)    −0.42∗    (0.19)
CCC       −0.65∗∗   (0.22)    −0.44∗    (0.21)
NEW       −0.56∗∗   (0.20)    −0.48∗    (0.19)
CH-CH     −0.63∗∗   (0.21)    −0.49∗    (0.21)
S-PET     −0.65∗∗   (0.20)    −0.51∗    (0.21)
S-HUGH    −0.55∗    (0.22)    −0.53∗∗   (0.20)
BALL      −0.58∗∗   (0.22)    −0.54∗    (0.21)
JESUS     −0.74∗∗   (0.19)    −0.56∗∗   (0.19)
WADH      −0.68∗∗   (0.20)    −0.57∗∗   (0.19)
S-HIL     −0.59∗∗   (0.20)    −0.58∗∗   (0.21)
S-JOHN    −0.63∗∗   (0.19)    −0.58∗∗   (0.19)
KEBLE     −0.77∗∗   (0.23)    −0.59∗∗   (0.22)
QUEENS    −0.60∗    (0.28)    −0.59∗    (0.29)
EXETER    −0.72∗∗   (0.23)    −0.61∗∗   (0.22)
REGENT    −0.81∗    (0.35)    −0.68     (0.35)
ORIEL     −0.79∗∗   (0.24)    −0.74∗∗   (0.22)
SOMER     −0.86∗∗   (0.24)    −0.77∗∗   (0.23)
GREYF     −1.58∗∗   (0.48)    −1.16∗    (0.58)

Controls  No                  Yes
Prob > F  0.000               0.004
SD        0.180               0.141
Hausman                       0.003
R-squared 0.057               0.166
N         1229                1229

The baseline college is Magdalen. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004).
∗ p < 0.05, ∗∗ p < 0.01

Table 18: Regressions: All Subjects

          (1)                 (2)
          Prelims Average     Prelims Average
          β         SE        β         SE
MAGD      −0.01     (0.07)    −0.04     (0.06)
BNC       −0.05     (0.06)    −0.06     (0.06)
NEW       −0.06     (0.06)    −0.07     (0.06)
MERT      −0.05     (0.07)    −0.09     (0.07)
UNIV      −0.06     (0.06)    −0.10     (0.06)
WORC      −0.13∗    (0.06)    −0.16∗∗   (0.06)
S-CATS    −0.25∗∗   (0.06)    −0.19∗∗   (0.06)
KEBLE     −0.22∗∗   (0.06)    −0.19∗∗   (0.06)
PEMB      −0.27∗∗   (0.07)    −0.19∗∗   (0.06)
BALL      −0.18∗∗   (0.06)    −0.20∗∗   (0.06)
S-ANNE    −0.24∗∗   (0.06)    −0.22∗∗   (0.06)
LINC      −0.19∗∗   (0.07)    −0.22∗∗   (0.06)
HERT      −0.25∗∗   (0.06)    −0.22∗∗   (0.06)
S-HIL     −0.28∗∗   (0.06)    −0.22∗∗   (0.06)
SEH       −0.29∗∗   (0.06)    −0.23∗∗   (0.06)
S-HUGH    −0.29∗∗   (0.06)    −0.24∗∗   (0.06)
JESUS     −0.24∗∗   (0.07)    −0.24∗∗   (0.06)
LMH       −0.29∗∗   (0.06)    −0.25∗∗   (0.06)
MANS      −0.26∗∗   (0.07)    −0.25∗∗   (0.07)
CH-CH     −0.31∗∗   (0.06)    −0.27∗∗   (0.06)
TRIN      −0.21∗∗   (0.07)    −0.29∗∗   (0.06)
S-PET     −0.37∗∗   (0.07)    −0.29∗∗   (0.06)
WADH      −0.29∗∗   (0.06)    −0.30∗∗   (0.06)
S-BEN     −0.37∗∗   (0.12)    −0.30∗    (0.12)
REGENT    −0.47∗∗   (0.09)    −0.32∗∗   (0.09)
ORIEL     −0.33∗∗   (0.07)    −0.33∗∗   (0.07)
CCC       −0.33∗∗   (0.07)    −0.33∗∗   (0.07)
SOMER     −0.39∗∗   (0.06)    −0.34∗∗   (0.06)
H-MAN     −0.31∗∗   (0.12)    −0.35∗∗   (0.11)
EXETER    −0.36∗∗   (0.07)    −0.35∗∗   (0.07)
QUEENS    −0.43∗∗   (0.07)    −0.37∗∗   (0.06)
BLACKF    −2.13     (1.26)    −2.28     (1.22)

Controls  No                  Yes
Prob > F  0.000               0.000
SD        0.112               0.089
Hausman                       1.000
R-squared 0.015               0.126
N         14426               14426

The baseline college is St John’s. Dependent variable is standardised by year. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. Hausman gives the p-value for a robust Hausman test. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004).
∗ p < 0.05, ∗∗ p < 0.01

Figure 2: College Ranking by Course: Norrington Table Plus vs Selection on Observables

[Four scatter plots, one per panel (PPE, E&M, Law, All Subjects), of each college’s Model 2 rank (vertical axis) against its Model 1 rank (horizontal axis), with points labelled by college. Correlation between the two rankings: PPE 0.95; E&M 0.88; Law 0.95; All Subjects 0.99.]

acteristics across colleges, the ranking of colleges provided in column 1 is biased in favour of colleges

that receive intakes of high ability students (relative to other colleges) because part of the effect of

ability is attributed to the impact of the college.

As we move from column 1 to column 2 the coefficients on college dummies tend to shrink – when

controls are added, gaps in average Prelims scores between colleges decline. For instance, in PPE in

31 out of 32 colleges, the coefficients in column 2 are smaller in magnitude than in column 1 (the

exception is Balliol College where the coefficient remains at −0.20). The coefficients in column 1

sum to −19.9 while the coefficients in column 2 sum to −13.0. Thus differences between coefficients

decline by approximately 35%. This trend could be explained by students selecting into colleges


according to ability; students with high observable ability are more likely to attend effective colleges.

In particular, St John’s has PPE students with higher observable ability than PPE students at any

other college except Balliol. Therefore, controlling for observable ability reduces the disparity in

Prelims scores across colleges, bringing the estimates closer to the true causal effects of attending

particular colleges. However, even in Model 2, differences between colleges are statistically significant

at the 5% level in each dataset – the second result.

Result 2. Controlling for selection on observables reduces differences in Prelims results across

colleges but they remain statistically significant.

The finding that colleges are statistically significant determinants of Prelims results does not

provide information about the practical significance of colleges. College effectiveness is practically

significant if some colleges are substantially more effective than others. One way to measure this is

to look at the standard deviation of college effectiveness (the college “effect size”), which indicates

how much adjusted Prelims results differ across colleges. I report the standard deviation of college

effects using a method proposed by Nye et al. (2004), though unlike Nye et al., I also adjust for

estimation error, an important addition.32 Nye et al. recommend estimating two regressions. The first is a regression of Prelims results on student characteristics only, yielding a squared multiple correlation $R_1^2$. The second is a regression of Prelims results on the same student characteristics plus a set of college dummy variables, yielding a squared multiple correlation $R_2^2$. The difference between the two regressions in variance accounted for (the change in $R^2$, or $\Delta R^2 = R_2^2 - R_1^2$) represents the proportion of variance in (residualised) Prelims results accounted for by college effects. If we regard $\Delta R^2$ as the variance accounted for by college effectiveness, then its square root, $\Delta R = \sqrt{\Delta R^2}$, can be interpreted as the standard deviation of college effectiveness. However, Nye et al.'s method gives an estimator of the standard deviation of college effects that is biased upwards due to estimation error. The problem is that $R^2$ (weakly) increases whenever the college dummy variables are added to the second regression, even if their true coefficients are zero. Given there are over 30 colleges, this bias may be large. The change I make is to use adjusted $R^2$ in place of the simple $R^2$ used by Nye et al. I report the results at the bottom of each column.

32 Various other methods can be used to estimate the standard deviation of college effectiveness (Aaronson et al., 2007; Koedel, 2009; Guarino et al., 2015). Guarino et al.'s method gives almost identical estimates to Nye et al.'s method, but neither adjusts for estimation error. Aaronson et al. (2007) and Koedel (2009) do account for estimation error, but their estimates are very sensitive to the choice of baseline college.
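The adjusted estimator described above can be written compactly. This is a sketch under stated assumptions rather than a formula taken from the thesis: the symbols $n$ (number of students) and $k_j$ (number of regressors in regression $j$, excluding the constant) are notation introduced here, and the truncation at zero is inferred from the placebo exercise, in which many estimates are reported as exactly zero.

```latex
\bar{R}_j^2 = 1 - \left(1 - R_j^2\right)\frac{n-1}{n-k_j-1}, \qquad j = 1, 2,
\qquad\qquad
\hat{\sigma}_{\text{college}} = \sqrt{\max\!\left(0,\; \bar{R}_2^2 - \bar{R}_1^2\right)}
```

Because the outcome is standardised, $\hat{\sigma}_{\text{college}}$ is expressed in standard deviations of Prelims results.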

Several points about the results are worth noting. First, accounting for estimation error using

adjusted $R^2$ is important. It dramatically reduces the standard deviation estimates, particularly for

E&M which has a smaller number of students per college than the other datasets.33 Other studies

have also found accounting for estimation error can be important (Aaronson et al., 2007).
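The placebo check described in footnote 33 can be illustrated with a small simulation. This is a minimal sketch, not the thesis code: it omits the student controls (so the first regression is intercept-only and its $R^2$ is zero), draws standard normal scores, and truncates a negative adjusted $\Delta R^2$ at zero. The function names, the sample size and the number of colleges are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict


def college_sd_estimates(scores, colleges):
    """Return (simple, adjusted) estimates of the SD of college effects.

    Assumption: no student controls, so the first regression is
    intercept-only (R^2 = 0) and Delta R^2 equals the R^2 from
    regressing scores on college dummies (between-college SS / total SS).
    """
    n = len(scores)
    grand_mean = sum(scores) / n
    by_college = defaultdict(list)
    for score, college in zip(scores, colleges):
        by_college[college].append(score)

    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    between_ss = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in by_college.values()
    )
    r2 = between_ss / total_ss
    k = len(by_college) - 1  # number of college dummies
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    # Truncate at zero: a negative variance share has no SD interpretation.
    return math.sqrt(r2), math.sqrt(max(0.0, adj_r2))


def placebo_experiment(n_students=1500, n_colleges=30, reps=200, seed=0):
    """Randomly assign students to placebo colleges; the true effect SD is zero."""
    rng = random.Random(seed)
    simple, adjusted = [], []
    for _ in range(reps):
        scores = [rng.gauss(0, 1) for _ in range(n_students)]
        colleges = [rng.randrange(n_colleges) for _ in range(n_students)]
        s, a = college_sd_estimates(scores, colleges)
        simple.append(s)
        adjusted.append(a)
    return simple, adjusted


if __name__ == "__main__":
    simple, adjusted = placebo_experiment()
    print("mean simple estimate:  ", sum(simple) / len(simple))
    print("mean adjusted estimate:", sum(adjusted) / len(adjusted))
    print("share exactly zero:    ", sum(a == 0.0 for a in adjusted) / len(adjusted))
```

Under random assignment the expected simple $R^2$ is $(G-1)/(n-1)$ for $G$ colleges, so with these hypothetical settings the simple estimate should sit near 0.14 even though the true standard deviation is zero, while the adjusted estimate is frequently truncated to zero.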

Second, as we move from column 1 to column 2, the standard deviation of college effects falls

slightly. For instance, in PPE the standard deviation of college effects falls from 0.18 in column 1

to 0.11 in column 2. Thus the variation associated with colleges drops as more controls are added,

again reflecting sorting into colleges by ability.

Third, in column 2 the standard deviation of college effects across courses ranges from 0.11 in PPE

to 0.15 in E&M. Differences across courses would be expected if there are differences across courses

in the sensitivity of exam results to teaching. The standard deviation of college effectiveness in the

All Subjects dataset is 0.09 which is lower than in PPE, E&M or Law. This could be because college

effectiveness is imperfectly correlated across courses so the true variation in college effectiveness is

underestimated. Alternatively, exam results in E&M, PPE and Law may be more sensitive to college

teaching than other courses.

Fourth, by most standards, these college effects are moderate in size and are large enough to have

policy significance. For example, for PPE, a standard deviation in college effectiveness of 0.11 says

that a one standard deviation increase in college effectiveness should increase Prelims scores by 0.11

standard deviations. If college effects are normally distributed, these findings would suggest that

the difference in Prelims average between having a 25th percentile college (a not so effective college)

and a 75th percentile college (an effective college) is 0.15 of a standard deviation in Prelims.34 This

would move a student at the middle of the exam result distribution to the 56th percentile. The US Department of Education defines 0.25 as an effect that is “substantially important” (Seftor et al., 2011) but what determines whether an effect size is large or small is often context dependent (Hill et al., 2008). A college effect size of 0.11 can be compared to gaps in Prelims results by demographic groups. For instance, it is smaller than the raw achievement gap between males and females at Oxford (0.16 standard deviations) and the raw achievement gap between international and home students (0.21) but larger than the raw achievement gap between independent school and state school students (0.05). In PPE, a standard deviation improvement in college effectiveness has a ceteris paribus impact on Prelims results that is larger than an extra 2 A*s at GCSE35 and comparable to an extra 10 marks on the TSA. As noted in the introduction, these college effect sizes are also comparable to the effect of teachers and schools. Thus student achievement could be improved if colleges on the lower end moved up modestly in the distribution of college effectiveness.

33 I test whether using adjusted $R^2$ is successful in removing estimation error. To do this, I create “placebo colleges”, randomly assign Oxford students to these colleges and repeat the analysis. That is, I create dummy variables for each placebo college and use them instead of dummy variables for real colleges. Since students are randomly assigned a placebo college, the true standard deviation of placebo college effectiveness should be zero. Of the 100 placebo college effectiveness standard deviation estimates I produce for each dataset, over 60% were identically zero (the adjusted $R^2$ in the second regression was not greater than the adjusted $R^2$ in the first regression). This suggests that estimation error is no longer an issue when I use adjusted $R^2$. In contrast, when I use placebo colleges and simple $R^2$, I obtain average values from 100 replications of 0.17 for E&M, 0.16 for Law, 0.14 for PPE and 0.04 for All Subjects, implying large estimation error.

34 The college effect standard deviation in PPE is 0.1136948. The difference between the 25th and 75th percentiles of the standard normal distribution is 1.34 standard deviations, so the difference in Prelims average between a 25th and 75th percentile college is (1.34)(0.1136948) ≈ 0.15.

35 The effect size of moving from Band 3 (with 7.1 A*s on average) to Band 2 (with 9.2 A*s on average) is 0.075.

Finally, a college effect size of 0.11-0.15 implies only 1-3% of the variance in Prelims results is associated with variation in college effectiveness. Thus although variation in college performance is non-negligible, the difference in mean Prelims performance between the best- and worst-performing colleges is not nearly as large as the difference in performance between the best and worst students in the typical college. The majority of variation in exam results is within, not between, colleges. I can now state results 3 and 4.

Result 3. Differences in college effectiveness estimates based on selection on observables are practically significant.

Result 4. The vast majority of variation in Prelims results is within colleges not between colleges.

At the bottom of column 2, I report the results of a robust Hausman test. I implement it using the Stata command rhausman and 50,000 replications (see Cameron and Trivedi (2005, p. 718) and Kaiser et al. (2014) for more details). The robust Hausman test can be used in the presence of heteroskedasticity in the error term, unlike the traditional Hausman test, which makes the auxiliary assumption that the random effects estimator is asymptotically efficient under the null hypothesis. The robust Hausman test strongly rejects the null hypothesis that the random effects assumption holds for the Law dataset but does not reject at the 5% level for E&M, PPE or All Subjects. Thus the choice of modelling college effects as fixed effects rather than random effects seems important for Law but less so for PPE, E&M and All Subjects. The Hausman test also has some power to detect violations of the selection on observables assumption since the Hausman test would be misspecified if the selection on observables assumption were violated. In this case, the random effects estimator and the fixed effect estimator typically have different probability limits so the Hausman test may reject the null because selection on observables is violated. The fact that the test does not reject for PPE, E&M and All Subjects thus provides some encouragement with regard to the selection on observables assumption.
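Returning briefly to the size of the college effects: the interquartile comparison in footnote 34 and the 56th-percentile figure quoted for PPE can be checked with a few lines of arithmetic. This is a minimal sketch that assumes, as the text does, normally distributed college effects and a standardised, normally distributed Prelims score; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

SD_COLLEGE_PPE = 0.1136948  # college effect SD for PPE, from footnote 34

# Distance between the 25th and 75th percentiles, in standard deviations.
iqr_in_sd = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)

# Gap in Prelims average between a 25th and a 75th percentile college.
gap = iqr_in_sd * SD_COLLEGE_PPE

# Percentile reached by a median student whose score rises by `gap`.
new_percentile = 100 * std_normal.cdf(gap)

print(round(iqr_in_sd, 2), round(gap, 2), round(new_percentile))  # prints: 1.35 0.15 56
```

The interquartile distance is about 1.349 standard deviations, which the footnote reports as 1.34; either value gives a gap of roughly 0.15.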

The $R^2$ for Model 2 across courses ranges from 17 percent for Law to 35 percent for E&M.

Given that the goodness-of-fit measures typically reported by applied researchers working with cross-

sectional data (e.g. Mincer equations) are only 5 percent, this suggests we can explain a significant

proportion of variation in student Prelims results. Following Oster (2013), this also suggests that

unobservables have more potential to bias Law college effectiveness estimates than E&M college

effectiveness estimates.

The regression estimates can be used to form college rankings. Figure 2 shows college effects are

strongly positively correlated across Models 1 and 2. However, the high positive correlations conceal

moderate mean absolute movement in college rankings. More dispersion around the 45-degree line

implies more variation in college rankings. Both tails of the original distributions lie relatively close

to the 45-degree line, but there are big movers elsewhere in the distribution. Even though regression

coefficient changes between Models 1 and 2 are large, ranking changes are modest because students

sort into colleges partly based on observable ability.

Result 5. College rankings change moderately when adjusted for selection on observables.

However, college rankings should acknowledge uncertainty by using the appropriate level of stat-

istical significance. It is tempting to look at the regression results and search for colleges where the

standard p-value is less than 0.05 (in Tables 15-18 these are starred college coefficients) and conclude


that these colleges are statistically worse than the baseline college at the 5% level. However, we must

be careful about making comparisons like this if we are devising hypotheses having already observed

the data, so are, in effect, performing multiple hypothesis tests. To gauge statistical significance, we

want to avoid “data snooping” – basing inference on individual p-values without taking the multitude

of tests into account. Data snooping would likely lead us to falsely declare some pairs of colleges as

significantly different (see Afshartous and Wolf (2007) for a detailed discussion of data snooping and

methods to avoid it). To account for multiple comparisons I define a new (lower) critical value for

hypothesis tests using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) and set the

false discovery rate (the proportion of significant results that are actually false positives) to 5%.36

Using Benjamini-Hochberg critical values for the All Subjects dataset, 118 of the 528 pairwise

college effectiveness comparisons were statistically significant. However, none of the pairwise com-

parisons of colleges are statistically significant for PPE, Law or E&M (tables not reported). Thus

in these three courses, the top ranked college is not statistically significantly better than the bottom

ranked college! Estimation errors are large because colleges are only observed with relatively small

numbers of students, even after pooling over multiple years. The uncertainty undermines the use of

course-specific league tables to rank colleges. Therefore although the results provide strong evidence

that colleges do matter, both statistically and practically, the sample sizes are not large enough to

say with much certainty that college A is better than college B for a given course.

Result 6. Course-specific college rankings have large confidence intervals and cannot distinguish

between the majority of colleges.
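A step-up multiple-testing procedure of the kind used above can be sketched as follows. This is an illustration only: it uses the textbook Benjamini-Hochberg critical value $(i/m)\,q$ for the $i$-th smallest of $m$ p-values, which is an assumption of the sketch. The critical value printed in the thesis footnote is written in a different form, and the sketch does not attempt to reproduce it. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is declared significant.

    Step-up rule: find the largest rank i with p_(i) <= (i / m) * q and
    declare that p-value and every smaller one significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])  # indices, smallest p first
    cutoff_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank / m * q:
            cutoff_rank = rank
    significant = [False] * m
    for rank, j in enumerate(order, start=1):
        significant[j] = rank <= cutoff_rank
    return significant


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    print(benjamini_hochberg([0.020, 0.021, 0.030]))
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20]))
```

In the first example the smallest p-value (0.020) exceeds its own critical value (about 0.0167) but is still declared significant, because a larger p-value passes its threshold. This is the step-up property the footnote describes.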

Table 19 shows college effects are not strongly correlated across courses. Indeed PPE college

effects are negatively correlated with E&M college effects and are uncorrelated with Law college

effects. This finding is emphasised in Figure 3 which presents scatter plots of Model 2 rankings

across courses. Colleges appear to have strengths in teaching different subjects. As already discussed, further evidence consistent with this interpretation is that the standard deviation of college effectiveness in the All Subjects dataset is lower than in PPE, E&M or Law.

36 The method works as follows. Put the individual p-values in order, from smallest to largest. The smallest p-value has a rank of i = 1, the next smallest has i = 2, etc. Compare each individual p-value to its Benjamini-Hochberg critical value, $\frac{(m-i+1) \cdot 0.05}{2m}$, where i is the rank, m is the total number of tests, and 0.05 is the false discovery rate. The largest p-value that has $p < \frac{(m-i+1) \cdot 0.05}{2m}$ is significant, and all of the p-values smaller than it are also significant, even the ones that aren't less than their Benjamini-Hochberg critical value.

Figure 3: Comparison of Selection on Observables College Ranking across Courses

[Three scatter plots of Model 2 college rankings, one labelled point per college. Panels: “Model 2 Rank: E&M vs PPE” (x-axis E&M Model 2 Rank, y-axis PPE Model 2 Rank); “Model 2 Rank: Law vs PPE” (x-axis Law Model 2 Rank, y-axis PPE Model 2 Rank); “Model 2 Rank: E&M vs Law” (x-axis E&M Model 2 Rank, y-axis Law Model 2 Rank).]

Table 19: Correlation in College Effects across Courses

                        Model 1   Model 2
PPE vs E&M               -0.37     -0.10
PPE vs Law               -0.00     -0.05
PPE vs All Subjects       0.84      0.88
E&M vs Law                0.30      0.22
E&M vs All Subjects      -0.17     -0.24
Law vs All Subjects       0.48      0.39

Result 7. College effectiveness rankings differ between courses.

6.2 Robustness Checks for Norrington Table and Selection on Observables

In this subsection I consider several robustness checks. First, I examine how using different outcome

variables alters college effect estimates. Second, I examine results under monotonic transformations


of outcome variables. This helps to assess the interval scale metric assumption. Third, I consider

whether there is any evidence that college effects differ for different types of student.

6.2.1 Alternative Outcome Variables

I regress alternative (standardised) outcome variables on the observable ability controls from Model

2. The results are shown in Tables 20-22. The outcome variables in columns 1-3 are scores in indi-

vidual Prelims papers. These are: Introductory Politics, Introductory Philosophy and Introductory

Economics for PPE; General Management, Financial Management and Introductory Economics for

E&M; Roman Law, Constitutional Law and Criminal Law for Law. Column 4 repeats the Prelims

Average results from earlier. Column 5 also uses Prelims Average as an outcome variable but uses

a restricted sample of students who also have Finals scores which allows easy comparison to column

6 which has Finals Average as the outcome variable. College effects remain jointly statistically sig-

nificant at the 1% level in the majority of cases. The standard deviation of college effectiveness is

larger for Prelims than for Finals for E&M and Law and similar for PPE, though the results are not

precisely estimated because of the smaller sample of Finals students. This may be because students

require more guidance during their first year at Oxford when they are finding their feet but become

more independent in later years which would imply Prelims results were more sensitive to instruction

than Finals results. In addition, more teaching takes place inside colleges in the first year relative to

later years.

College effects are positively correlated under different dependent variables. The correlations

between Prelims Average and Finals Average college rankings are high: 0.50 for PPE, 0.53 for E&M

and 0.78 for Law. Correlations between the three first year paper rankings are similar, between 0.36

and 0.75 in PPE, between 0.29 and 0.53 for E&M and between 0.35 and 0.60 for Law. Correlations

across alternative outcome variables are thus clearly positive which is consistent with there being an

underlying generalisable within-course college effectiveness component embodied in the measures.

Result 8. College effectiveness is positively correlated across Prelims papers and between Prelims

and Finals results.


Table 20: Alternative Dependent Variable Regressions: PPE

College    (1)             (2)             (3)             (4)             (5)             (6)
           Philosophy      Politics        Economics       Prelims Avg     Prelims Avg     Finals Avg
           β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)
UNIV       0.20   (0.22)   0.23   (0.20)  −0.15   (0.17)   0.07   (0.19)   0.07   (0.27)  −0.14   (0.11)
ORIEL      0.30   (0.21)   0.01   (0.21)  −0.43*  (0.17)  −0.10   (0.18)  −0.37   (0.25)  −0.19   (0.11)
HERT       0.37   (0.22)  −0.12   (0.23)  −0.50** (0.18)  −0.15   (0.19)  −0.48   (0.27)  −0.32*  (0.13)
BALL       0.05   (0.22)   0.23   (0.19)  −0.54** (0.16)  −0.21   (0.19)  −0.21   (0.25)  −0.01   (0.10)
BNC        0.26   (0.21)  −0.13   (0.22)  −0.56** (0.18)  −0.24   (0.19)  −0.37   (0.29)  −0.10   (0.11)
REGENT     0.02   (0.26)   0.08   (0.31)  −0.50*  (0.22)  −0.24   (0.23)  −0.16   (0.29)  −0.10   (0.11)
EXETER    −0.05   (0.22)  −0.01   (0.21)  −0.45*  (0.21)  −0.27   (0.20)  −0.47   (0.27)  −0.05   (0.11)
JESUS     −0.05   (0.22)   0.12   (0.22)  −0.51*  (0.21)  −0.29   (0.22)  −0.90** (0.30)  −0.15   (0.11)
PEMB      −0.14   (0.21)  −0.41   (0.25)  −0.14   (0.17)  −0.29   (0.20)  −0.43   (0.33)  −0.19   (0.10)
SEH       −0.06   (0.24)  −0.08   (0.24)  −0.45*  (0.21)  −0.30   (0.21)  −0.35   (0.34)  −0.20   (0.12)
S-HIL     −0.16   (0.23)   0.29   (0.22)  −0.62** (0.18)  −0.32   (0.21)  −0.43   (0.27)  −0.23*  (0.10)
MERT      −0.00   (0.25)  −0.16   (0.24)  −0.48*  (0.19)  −0.32   (0.22)  −0.48   (0.30)  −0.10   (0.10)
NEW       −0.16   (0.21)  −0.03   (0.21)  −0.46** (0.18)  −0.32   (0.19)  −0.44   (0.25)  −0.10   (0.10)
S-PET     −0.30   (0.22)   0.04   (0.22)  −0.38*  (0.18)  −0.33   (0.19)  −0.55*  (0.25)  −0.24*  (0.10)
SOMER     −0.06   (0.21)  −0.27   (0.22)  −0.40*  (0.16)  −0.34   (0.18)  −0.51*  (0.23)  −0.04   (0.09)
MANS      −0.14   (0.22)   0.02   (0.23)  −0.57** (0.18)  −0.38*  (0.19)  −0.46   (0.26)  −0.23   (0.12)
LMH        0.00   (0.22)  −0.32   (0.22)  −0.55** (0.19)  −0.41*  (0.20)  −0.61*  (0.31)  −0.28*  (0.14)
MAGD      −0.13   (0.23)  −0.09   (0.23)  −0.55** (0.18)  −0.41*  (0.20)  −0.69*  (0.28)  −0.20*  (0.10)
CCC       −0.23   (0.24)  −0.06   (0.21)  −0.55** (0.18)  −0.42*  (0.20)  −0.41   (0.32)  −0.26   (0.16)
LINC      −0.08   (0.21)  −0.25   (0.21)  −0.57** (0.17)  −0.43*  (0.18)  −0.73** (0.24)  −0.08   (0.11)
CH-CH     −0.37   (0.22)  −0.11   (0.20)  −0.45*  (0.18)  −0.45*  (0.19)  −0.72** (0.25)  −0.20*  (0.10)
KEBLE     −0.17   (0.22)  −0.32   (0.22)  −0.51** (0.18)  −0.46*  (0.19)  −0.75** (0.28)  −0.22*  (0.10)
S-CATS    −0.37   (0.24)  −0.23   (0.22)  −0.40*  (0.18)  −0.46*  (0.20)  −0.54   (0.27)  −0.13   (0.11)
TRIN      −0.05   (0.26)  −0.25   (0.24)  −0.70** (0.23)  −0.49*  (0.22)  −1.22** (0.31)  −0.34** (0.12)
WORC      −0.12   (0.27)  −0.20   (0.26)  −0.72** (0.18)  −0.51*  (0.23)  −0.53*  (0.26)  −0.33** (0.11)
WADH      −0.28   (0.22)  −0.02   (0.20)  −0.77** (0.19)  −0.55** (0.19)  −1.06** (0.25)  −0.27** (0.10)
H-MAN     −0.28   (0.29)  −0.12   (0.25)  −0.71** (0.22)  −0.55*  (0.26)  −0.70   (0.36)  −0.09   (0.11)
S-ANNE    −0.25   (0.23)  −0.02   (0.22)  −0.81** (0.19)  −0.57** (0.20)  −0.95** (0.31)  −0.30** (0.11)
S-BEN     −0.33   (0.26)  −0.17   (0.30)  −0.66** (0.22)  −0.57*  (0.23)  −0.77*  (0.31)  −0.19   (0.11)
S-HUGH    −0.26   (0.26)  −0.26   (0.27)  −0.69** (0.24)  −0.60*  (0.27)  −0.93*  (0.37)  −0.42** (0.13)
QUEENS    −0.13   (0.21)  −0.24   (0.21)  −0.84** (0.19)  −0.60** (0.20)  −0.75** (0.27)  −0.42** (0.15)
BLACKF    −1.77*  (0.69)  −0.48   (0.83)  −1.77*  (0.77)  −1.89*  (0.87)  −1.16** (0.36)  −0.17   (0.10)

Controls   Yes             Yes             Yes             Yes             Yes             Yes
Prob > F   0.001           0.015           0.003           0.023           0.000           0.003
SD         0.146           0.091           0.128           0.114           0.186           0.199
R-squared  0.162           0.116           0.185           0.211           0.260           0.267
N          1391            1391            1391            1391            660             660

The baseline college is St John’s college. All dependent variables are standardised. Columns 1-4 use the sample of enrolled PPE students. Columns 5-6 use a reduced sample of PPE students with Finals results. Colleges are ordered based on the coefficients in column 4. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). * p < 0.05, ** p < 0.01

Table 21: Alternative Dependent Variable Regressions: E&M

College    (1)             (2)             (3)             (4)             (5)             (6)
           General         Financial       Economics       Prelims Avg     Prelims Avg     Finals Avg
           Management      Management
           β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)
S-HIL     −0.11   (0.33)  −0.20   (0.27)  −0.05   (0.29)  −0.08   (0.24)   0.89   (0.55)   1.12*  (0.44)
LMH       −0.19   (0.33)  −0.40   (0.32)   0.19   (0.29)  −0.13   (0.27)  −0.24   (0.51)   0.72   (0.61)
S-CATS     0.15   (0.40)  −0.67*  (0.32)   0.14   (0.41)  −0.13   (0.33)  −0.00   (0.79)   0.68   (0.92)
NEW       −0.22   (0.33)  −0.56   (0.31)   0.08   (0.32)  −0.19   (0.29)  −0.19   (0.63)   0.47   (0.53)
CH-CH      0.00   (0.36)  −0.68*  (0.28)  −0.01   (0.33)  −0.23   (0.29)   0.07   (0.60)   0.72   (0.37)
SEH       −0.06   (0.33)  −0.61*  (0.25)  −0.04   (0.28)  −0.25   (0.23)  −0.08   (0.59)   0.21   (0.49)
HERT      −0.01   (0.29)  −0.50   (0.27)  −0.09   (0.29)  −0.27   (0.25)  −0.79   (0.59)   0.42   (0.44)
S-PET     −0.40   (0.31)  −0.68*  (0.27)  −0.03   (0.28)  −0.36   (0.24)  −0.32   (0.53)   0.48   (0.42)
S-HUGH    −0.49   (0.31)  −0.43   (0.27)  −0.14   (0.28)  −0.37   (0.24)  −0.22   (0.52)   0.44   (0.44)
QUEENS    −0.06   (0.41)  −0.31   (0.41)  −0.31   (0.30)  −0.37   (0.27)  −0.74   (0.48)   0.42   (0.52)
JESUS     −0.26   (0.34)  −0.64*  (0.29)  −0.12   (0.31)  −0.36   (0.26)  −0.73   (0.60)   0.64   (0.53)
EXETER    −0.03   (0.34)  −0.53   (0.39)  −0.29   (0.39)  −0.37   (0.37)   0.53   (0.76)   0.88*  (0.43)
PEMB      −0.40   (0.29)  −0.66*  (0.26)  −0.11   (0.27)  −0.40   (0.23)  −0.08   (0.50)   0.40   (0.47)
KEBLE     −0.38   (0.29)  −0.58*  (0.25)  −0.32   (0.29)  −0.51*  (0.25)  −0.99*  (0.47)   0.13   (0.31)
WORC      −0.43   (0.31)  −0.54   (0.30)  −0.41   (0.31)  −0.51   (0.26)  −0.95   (0.55)   0.41   (0.34)
BNC       −0.53   (0.33)  −0.60*  (0.27)  −0.20   (0.28)  −0.52*  (0.24)  −0.66   (0.47)   0.62   (0.34)
S-JOHN    −0.71*  (0.35)  −0.68   (0.42)  −0.18   (0.39)  −0.56   (0.35)  −1.11*  (0.55)   0.31   (0.54)
S-ANNE    −0.22   (0.34)  −1.00** (0.30)  −0.41   (0.29)  −0.69** (0.24)  −0.64   (0.51)   0.20   (0.30)
TRIN      −0.22   (0.28)  −1.17** (0.35)  −0.41   (0.36)  −0.73*  (0.32)  −0.76   (0.49)   0.83*  (0.38)
WADH      −0.34   (0.40)  −1.01** (0.29)  −0.51   (0.36)  −0.82** (0.29)  −1.29   (0.78)   0.34   (0.49)
MERT      −0.49   (0.34)  −0.75** (0.27)  −0.74*  (0.32)  −0.81** (0.26)  −1.23** (0.45)   0.17   (0.30)
BALL      −0.41   (0.34)  −0.98** (0.32)  −1.26** (0.34)  −1.16** (0.30)  −2.06** (0.57)   0.32   (0.51)

Controls   Yes             Yes             Yes             Yes             Yes             Yes
Prob > F   0.465           0.093           0.002           0.011           0.000           0.240
SD         0.000           0.121           0.193           0.146           0.422           0.000
R-squared  0.190           0.314           0.311           0.352           0.571           0.364
N          516             516             516             516             161             161

The baseline college is Harris Manchester college. All dependent variables are standardised. Columns 1-4 use the sample of enrolled Economics and Management students. Columns 5-6 use a reduced sample of Economics and Management students with Finals results. Colleges are ordered based on the coefficients in column 4. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). * p < 0.05, ** p < 0.01

Table 22: Alternative Dependent Variable Regressions: Law

College    (1)             (2)             (3)             (4)             (5)             (6)
           Roman           Constitutional  Criminal        Prelims Avg     Prelims Avg     Finals Avg
           β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)     β      (SE)
WORC       0.03   (0.23)  −0.08   (0.18)  −0.06   (0.17)  −0.03   (0.20)  −0.14   (0.21)  −0.19   (0.21)
LMH       −0.16   (0.22)   0.05   (0.19)  −0.21   (0.19)  −0.14   (0.19)  −0.44*  (0.21)  −0.19   (0.19)
HERT      −0.31   (0.21)   0.08   (0.20)  −0.15   (0.17)  −0.16   (0.18)  −0.38*  (0.19)  −0.12   (0.18)
S-CATS    −0.15   (0.22)  −0.20   (0.18)  −0.08   (0.18)  −0.19   (0.20)  −0.40   (0.23)  −0.32   (0.17)
UNIV      −0.22   (0.23)  −0.02   (0.18)  −0.24   (0.19)  −0.22   (0.20)  −0.52*  (0.24)  −0.47*  (0.21)
MANS      −0.44   (0.23)   0.16   (0.19)  −0.20   (0.22)  −0.22   (0.20)  −0.40   (0.23)  −0.27   (0.16)
BNC       −0.07   (0.22)  −0.18   (0.18)  −0.25   (0.17)  −0.23   (0.20)  −0.52*  (0.22)  −0.23   (0.17)
S-ANNE    −0.02   (0.22)  −0.23   (0.21)  −0.38   (0.22)  −0.28   (0.22)  −0.40   (0.22)  −0.38   (0.20)
TRIN      −0.38   (0.27)  −0.04   (0.20)  −0.18   (0.20)  −0.29   (0.22)  −0.49*  (0.25)  −0.22   (0.19)
SEH       −0.26   (0.22)  −0.17   (0.19)  −0.34   (0.20)  −0.33   (0.20)  −0.66** (0.24)  −0.43*  (0.20)
H-MAN     −0.16   (0.31)  −0.13   (0.25)  −0.46*  (0.22)  −0.34   (0.26)  −0.59   (0.34)  −1.06   (0.68)
MERT      −0.13   (0.24)  −0.47*  (0.22)  −0.22   (0.24)  −0.36   (0.23)  −0.73** (0.25)  −0.35   (0.20)
LINC      −0.30   (0.24)  −0.35   (0.19)  −0.30   (0.18)  −0.42*  (0.20)  −0.71** (0.21)  −0.32   (0.17)
PEMB      −0.51*  (0.22)  −0.26   (0.19)  −0.19   (0.17)  −0.42*  (0.19)  −0.58** (0.21)  −0.32   (0.19)
CCC       −0.42   (0.22)  −0.17   (0.19)  −0.45*  (0.21)  −0.44*  (0.21)  −0.47*  (0.24)  −0.31   (0.23)
NEW       −0.25   (0.24)  −0.40*  (0.18)  −0.47** (0.17)  −0.48*  (0.19)  −0.65** (0.20)  −0.37*  (0.18)
CH-CH     −0.31   (0.22)  −0.44*  (0.20)  −0.36   (0.20)  −0.49*  (0.21)  −0.73** (0.21)  −0.33   (0.17)
S-PET     −0.27   (0.24)  −0.36*  (0.18)  −0.49*  (0.21)  −0.51*  (0.21)  −0.74** (0.23)  −0.53*  (0.23)
S-HUGH    −0.24   (0.22)  −0.40*  (0.19)  −0.54** (0.20)  −0.53** (0.20)  −0.67** (0.21)  −0.42   (0.22)
BALL      −0.43   (0.22)  −0.26   (0.20)  −0.55** (0.21)  −0.54*  (0.21)  −0.81** (0.25)  −0.71** (0.23)
JESUS     −0.22   (0.23)  −0.14   (0.17)  −0.86** (0.19)  −0.56** (0.19)  −0.84** (0.21)  −0.81** (0.18)
WADH      −0.45*  (0.22)  −0.25   (0.16)  −0.55** (0.17)  −0.57** (0.19)  −0.84** (0.21)  −0.29   (0.19)
S-HIL     −0.36   (0.24)  −0.36   (0.19)  −0.57** (0.18)  −0.58** (0.21)  −0.80** (0.21)  −0.51*  (0.21)
S-JOHN    −0.26   (0.21)  −0.45*  (0.21)  −0.63** (0.18)  −0.58** (0.19)  −0.75** (0.20)  −0.46*  (0.19)
KEBLE     −0.57*  (0.25)  −0.39*  (0.19)  −0.44*  (0.21)  −0.59** (0.22)  −0.77** (0.25)  −0.45   (0.27)
QUEENS    −0.46   (0.30)  −0.16   (0.28)  −0.72** (0.26)  −0.59*  (0.29)  −0.66*  (0.32)  −0.35   (0.22)
EXETER    −0.38   (0.25)  −0.31   (0.22)  −0.70** (0.22)  −0.61** (0.22)  −0.84** (0.25)  −0.69** (0.22)
REGENT    −0.58   (0.39)  −0.40   (0.32)  −0.59   (0.31)  −0.68   (0.35)  −1.30** (0.35)  −0.88*  (0.36)
ORIEL     −0.54*  (0.24)  −0.50*  (0.20)  −0.62** (0.19)  −0.74** (0.22)  −1.22** (0.25)  −0.84** (0.24)
SOMER     −0.43   (0.23)  −0.41   (0.22)  −0.91** (0.24)  −0.77** (0.23)  −0.93** (0.24)  −0.69** (0.21)
GREYF     −0.75   (0.70)  −0.44   (0.45)  −1.36** (0.42)  −1.16*  (0.58)  −1.27*  (0.62)  −1.23** (0.43)

Controls   Yes             Yes             Yes             Yes             Yes             Yes
Prob > F   0.343           0.098           0.000           0.004           0.000           0.004
SD         0.056           0.072           0.179           0.141           0.195           0.132
R-squared  0.127           0.104           0.148           0.166           0.185           0.188
N          1229            1229            1229            1229            854             854

The baseline college is Magdalen college. All dependent variables are standardised. Columns 1-4 use the sample of enrolled Law students. Columns 5-6 use a reduced sample of Law students with Finals results. Colleges are ordered based on the coefficients in column 4. Standard errors are heteroskedasticity robust. Prob > F gives the p-value from an F-test of the null hypothesis that all colleges are equally effective. SD gives the standard deviation of college effectiveness using the method of Nye et al. (2004). * p < 0.05, ** p < 0.01

6.2.2 Interval Scale Metric Assumption

If Prelims Average is not an interval scale metric then there is a danger that college rankings are not invariant to the scale of Prelims Average. The importance of this assumption is an empirical issue. To test it, I compare the results of Model 2 using different monotonic transformations of Prelims Average. I consider (i) standardised Prelims Average (as in the main analysis), (ii) squaring Prelims Average and then standardising and (iii) taking logarithms of Prelims Average and then standardising. I reexamine the coefficients and college rankings in each case. The rankings of colleges are unchanged in the majority of cases (no college moves by more than 3 places) and changes in the size of the coefficients tend to be small. For PPE, the correlation between college effect estimates across the three models is between 0.992 and 0.998. The results for other courses are similar. Therefore, whilst the monotonic transformations chosen are clearly only a tiny proportion of all possible rescalings, the fact that the rankings change only slightly does provide comfort.
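The logic of this check can be illustrated on simulated data. The sketch below is not the thesis analysis: it uses hypothetical marks with no student controls, so that each college's "effect" is simply its mean standardised outcome, and all function names and parameter values are assumptions made for illustration. It applies the same three transformations (level, square, logarithm), standardises each, and correlates the resulting college effects.

```python
import math
import random
import statistics


def standardise(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def college_means(values, colleges, n_colleges):
    """Average outcome per college (a stand-in for estimated college effects)."""
    totals = [0.0] * n_colleges
    counts = [0] * n_colleges
    for v, c in zip(values, colleges):
        totals[c] += v
        counts[c] += 1
    return [t / k for t, k in zip(totals, counts)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def transformation_check(n_students=3000, n_colleges=30, seed=1):
    """Correlate college effects across three rescalings of simulated marks."""
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_colleges)]  # in marks
    colleges = [i % n_colleges for i in range(n_students)]  # equal-sized colleges
    marks = [65 + true_effect[c] + rng.gauss(0, 6) for c in colleges]

    outcomes = {
        "level": standardise(marks),
        "square": standardise([m ** 2 for m in marks]),
        "log": standardise([math.log(m) for m in marks]),
    }
    effects = {
        name: college_means(y, colleges, n_colleges) for name, y in outcomes.items()
    }
    return {
        ("level", "square"): pearson(effects["level"], effects["square"]),
        ("level", "log"): pearson(effects["level"], effects["log"]),
        ("square", "log"): pearson(effects["square"], effects["log"]),
    }


if __name__ == "__main__":
    for pair, corr in transformation_check().items():
        print(pair, round(corr, 3))
```

When marks are concentrated in a fairly narrow range, as in this simulation, squaring and taking logarithms are close to linear over that range, which is one reason such transformations tend to leave college effects almost unchanged.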

6.2.3 Heterogeneity in College Effectiveness across Students of Different Types

To test for heterogeneity in college effects by gender, I include in the regression interaction terms

between gender and college attended. An F-test can then determine if these interaction terms are

significant which would indicate evidence that college effects differed by gender. I repeat the process

for overseas status, cohort, previous school type, GCSE results and A-level results. Table 23 displays

the p-values for the F-tests. For PPE, E&M and Law the F-test cannot reject the hypothesis that

college effects are invariant to ability (in terms of prior GCSE and A-level results). Thus there is little

evidence to support a “mismatch” hypothesis that college quality and ability interact in substantively

important ways. Students of all abilities benefit from attending higher quality colleges. However, for

both Law and PPE there is strong evidence that college effects change over time. The All Subjects

model is less well specified than the other models. The evidence from F-tests suggests that college

effects could be heterogeneous across gender, overseas status, cohort, previous school type and GCSE

bands.
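The interaction test for gender can be sketched as a comparison of a restricted and an unrestricted regression. The notation below is introduced here for illustration and is not taken from the thesis: $x_i$ collects the Model 2 controls (including the gender main effect), $D_{ij}$ indicates that student $i$ attends college $j$, $F_i$ is a female indicator, $J$ is the number of interaction terms and $K$ the number of parameters in the unrestricted model. The classical homoskedastic form of the statistic is an assumption of the sketch; given the heteroskedasticity-robust standard errors used elsewhere, a robust Wald version would be the natural counterpart.

```latex
y_i = x_i'\beta + \sum_{j} \alpha_j D_{ij} + \sum_{j} \gamma_j \left(D_{ij} \times F_i\right) + \varepsilon_i,
\qquad H_0:\ \gamma_j = 0 \ \text{for all } j

F = \frac{\left(\mathrm{RSS}_{r} - \mathrm{RSS}_{u}\right)/J}{\mathrm{RSS}_{u}/(n-K)} \sim F_{J,\,n-K} \quad \text{under } H_0
```

Here $\mathrm{RSS}_r$ and $\mathrm{RSS}_u$ are the residual sums of squares from the regressions without and with the interaction terms. The same construction applies to overseas status, cohort, school type and the GCSE and A-level bands by replacing $F_i$ with the relevant characteristic.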

Table 23: P-values from Tests for Heterogeneity in College Effects across Students

                       PPE      E&M      Law      All Subjects
Gender                 0.24     0.82     0.02*    0.06
Overseas Status        0.00**   0.15     0.09     0.00**
Cohort                 0.00**   0.68     0.00**   0.05*
Previous School type   0.02*    0.06     0.50     0.01*
GCSE Band              0.33     0.23     0.99     0.01*
A-level Band           0.65     0.72     0.98     0.22

This table shows the results of tests of the null hypothesis that college effects are invariant to student characteristics. Each cell gives the p-value from an F-test on the coefficients of interaction terms between student characteristics and college dummies. Significance at the 1 and 5 percent level is denoted by ** and *, respectively.

6.3 Results for Selection on Observables and Unobservables

The estimation of college effects for PPE, Law and E&M using Model 3 turned out to be problematic

because OLS regression estimates of the scale parameter λ1 from equation (5) were negative.

Given that the theoretical model constrains λ1 > 0, this is troubling and prevents us from obtaining point

estimates of college effects for these courses. Mechanically this is because (i) colleges tend to select

lower proportions of open applicants than direct applicants, implying that direct applicants are of

higher ability on average relative to open applicants and (ii) at most colleges, open applicants slightly

outperform direct applicants in Prelims. There are a number of possibilities as to why λ1 estimates

are negative. The first is estimation error. There are only a few open applicants at each college so

our estimates of λ1 are not precise. Second, the “Fair Admissions” assumption may not hold. A

negative estimate of λ1 could be generated if colleges were biased against open applicants relative to

direct applicants (open applicants face a higher cut-off). Direct discrimination seems unlikely given

admissions tutors are unaware whether an applicant applied directly or made an open application.

However, discrimination could occur indirectly, for instance if admissions tutors were biased against

international applicants relative to UK applicants and open applicants are disproportionately

international students. In each of PPE, E&M and Law, international students do score more highly on

average than UK students in Prelims but an analysis of marginal students is needed to determine

the validity of the “Fair Admissions” assumption (Bhattacharya et al., 2014). Third, assumptions

made about the distribution of ability (normality and equal variance) may not hold. Evidence in favour

of estimation error is that the All Subjects dataset, which draws on many more students,

yields a positive estimate of λ1.
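The normality assumption can be made concrete. Assuming the ability of open applicants is standard normal and a college admits those above its cut-off zj, the mean ability of its admitted open applicants is the inverse Mills ratio φ(zj)/(1 − Φ(zj)). This is my reading of the model rather than a formula quoted from it, but it is in line with the open-applicant ability values reported alongside the cut-offs in Table 24 (for PPE, a cut-off of 1.70 with ability 2.11 at St Hilda's, and 2.15 with 2.51 at Mansfield). The function names, and the mapping from an acceptance rate to a cut-off, are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed ability distribution of open applicants


def cutoff_from_acceptance_rate(rate):
    """Cut-off z such that a share `rate` of open applicants lies above it."""
    return STD_NORMAL.inv_cdf(1.0 - rate)


def mean_ability_above(cutoff):
    """Mean of a standard normal variable truncated below at `cutoff`."""
    return STD_NORMAL.pdf(cutoff) / (1.0 - STD_NORMAL.cdf(cutoff))


# PPE cut-offs taken from Table 24
for college, z in [("St Anne's", 1.34), ("St Hilda's", 1.70), ("Mansfield", 2.15)]:
    print(f"{college}: cut-off {z:.2f}, mean ability {mean_ability_above(z):.2f}")
```

If ability is not normal, or its variance differs between open and direct applicants, this mapping from cut-off to ability no longer holds, which is the third concern raised above.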

As a result of this problem I do two things. First, in Table 24, I present college effect estimates for

PPE, E&M and Law for different values of λ1. Cut-off estimates are provided in the first column. For

PPE they range from 1.34 at St Anne’s to 2.15 at Mansfield (remembering that the average ability

of open applicants in the whole population is zero) reflecting that some colleges accept a much larger

proportion of open applicants than others. The second and third columns give the average ability of

enrolled students at each college. In each case direct applicants are estimated to be of higher ability

on average. Columns 4 and 5 give the number of enrolled students at each college. It is notable how

few open applicants attend each college – the most for PPE is 10 at St Peter’s – and this means

that college effect estimates are imprecise. The remaining columns of Table 24 present college effect

estimates for Models 1, 2 and 3 based on the reduced sample of students who were offered a place at

the first college they were allocated to. This makes college effectiveness estimates directly comparable

across models. Colleges are ordered by their Model 1 college effectiveness estimate. Since the estimate

of the scale parameter λ1 < 0 for these courses, the Model 3 estimates are reported for different values

of λ1 (λ1 = 0.5, λ1 = 1, λ1 = 2 and λ1 = 5). Model 1, or equivalently Model 3 with λ1 ≈ 0, is a

baseline value where the entire difference in Prelims scores is attributed to colleges. As λ1 increases,

Prelims results become more sensitive to ability. This improves the college effect estimates for colleges

with low estimated cut-offs and low estimated enrolled student ability relative to colleges with high

estimated cut-offs and enrolled students of high average ability. In the limit as λ1 → ∞, colleges are

ranked based solely on the estimated cut-offs. The results for PPE, E&M and Law seem plausible

if, for example, λ1 = 0.5. In this case differences in college effectiveness are similar to differences

that result from controlling for observables in Model 2, with correlations in college effectiveness

estimates of 0.71 for PPE, 0.93 for E&M and 0.79 for Law. However, college effectiveness estimates

for some colleges are quite sensitive to the value of λ1 and the estimate of the cut-off zj . Sensitivity

to the cut-off estimate creates uncertainty about true college effectiveness because the cut-offs are

imprecisely estimated due to the low number of open applicants at each college. For example, the

cut-off estimate for Mansfield (MANS) for PPE of 2.15 seems unrealistically high given estimates

for other colleges range from 1.34 to 1.79. Overall, the results in Table 24 are similar to the

selection on observables estimates for some parameterisations. With more students per college, this

method could produce useful effectiveness estimates, but currently there is considerable uncertainty

surrounding college effectiveness estimates.

Table 24: Selection on Observables and Unobservables Results for various λ1: PPE, E&M and Law

         Cutoff Ability        No. Enrolled  College effects βj = cj − cJ
                                             Model 1 Model 2 Model 3
College  zj     Open   Direct  Open  Direct  λ1≈0    -       λ1=0.5  λ1=1    λ1=2    λ1=5
PPE
S-HIL    1.70   2.11   2.14    7     3       0.00    0.00    0.00    0.00    0.00    0.00
SOMER    1.60   2.03   2.13    9     15      -0.16   -0.19   -0.14   -0.13   -0.10   -0.02
PEMB     1.41   1.87   1.93    7     24      -0.32   -0.32   -0.22   -0.12   0.09    0.69
SEH      1.38   1.84   1.88    6     17      -0.35   -0.19   -0.22   -0.10   0.15    0.89
S-PET    1.52   1.96   2.05    10    21      -0.42   -0.43   -0.37   -0.32   -0.23   0.07
CCC      1.46   1.90   2.00    5     21      -0.42   -0.40   -0.35   -0.28   -0.15   0.27
LMH      1.79   2.19   2.35    2     27      -0.45   -0.42   -0.56   -0.67   -0.89   -1.53
MANS     2.15   2.51   2.66    1     16      -0.54   -0.30   -0.80   -1.07   -1.60   -3.19
QUEENS   1.72   2.13   2.26    3     24      -0.56   -0.59   -0.62   -0.68   -0.80   -1.17
S-ANNE   1.34   1.80   1.85    7     20      -0.70   -0.71   -0.56   -0.42   -0.13   0.72
S-HUGH   1.43   1.88   1.94    9     9       -0.75   -0.53   -0.65   -0.55   -0.34   0.28
E&M
S-HIL    1.51   1.95   1.88    8     1       0.00    0.00    0.00    0.00    0.00    0.00
SEH      1.62   2.04   2.09    12    14      -0.29   -0.08   -0.35   -0.41   -0.53   -0.90
S-HUGH   1.84   2.23   2.35    5     17      -0.33   -0.22   -0.52   -0.71   -1.10   -2.24
HERT     1.83   2.23   2.30    3     34      -0.46   -0.19   -0.63   -0.81   -1.16   -2.21
JESUS    1.80   2.19   2.28    3     15      -0.48   -0.22   -0.64   -0.80   -1.13   -2.12
KEBLE    1.93   2.31   2.43    4     22      -0.58   -0.36   -0.81   -1.04   -1.51   -2.91
PEMB     1.76   2.16   2.22    2     41      -0.59   -0.36   -0.73   -0.87   -1.15   -1.99
S-PET    1.83   2.23   2.33    7     26      -0.65   -0.32   -0.84   -1.02   -1.39   -2.48
LMH      1.77   2.17   2.23    3     10      -0.68   -0.30   -0.81   -0.95   -1.22   -2.03
MERT     2.23   2.58   2.72    1     22      -0.78   -0.73   -1.17   -1.55   -2.32   -4.63
WADH     2.09   2.45   2.59    1     10      -0.90   -0.73   -1.22   -1.54   -2.17   -4.08
S-ANNE   1.73   2.14   2.21    6     12      -0.91   -0.72   -1.03   -1.15   -1.40   -2.14
Law
S-ANNE   1.33   1.79   1.86    5     17      0.00    0.00    0.00    0.00    0.00    0.00
HERT     1.64   2.06   2.19    2     35      -0.04   0.05    -0.21   -0.38   -0.72   -1.75
SEH      1.24   1.72   1.81    16    13      -0.16   0.08    -0.12   -0.08   0.00    0.23
S-HIL    1.55   1.99   2.04    8     8       -0.22   -0.31   -0.30   -0.39   -0.56   -1.08
WADH     1.50   1.94   2.03    5     29      -0.29   -0.22   -0.38   -0.46   -0.63   -1.15
S-PET    1.65   2.07   2.17    3     13      -0.31   -0.24   -0.47   -0.62   -0.93   -1.85
CCC      1.27   1.75   1.86    8     21      -0.42   -0.26   -0.41   -0.40   -0.39   -0.34
JESUS    1.30   1.77   1.83    5     28      -0.49   -0.33   -0.49   -0.48   -0.46   -0.40
S-HUGH   0.87   1.42   1.34    19    5       -0.50   -0.49   -0.29   -0.07   0.37    1.68
ORIEL    1.29   1.76   1.85    9     31      -0.55   -0.54   -0.55   -0.54   -0.54   -0.51
KEBLE    1.46   1.90   2.02    3     37      -0.62   -0.42   -0.71   -0.79   -0.96   -1.47
SOMER    1.06   1.58   1.62    13    10      -0.77   -0.70   -0.65   -0.52   -0.27   0.47

The estimated value of λ1 for each of PPE, E&M and Law is negative. Columns 2-3 give the estimated ability of enrolled students. Columns 4-5 give the number of enrolled students. Columns 6-11 give college effect estimates relative to the college with the highest average Prelims results from Model 1. Estimates are based on a restricted sample that does not include students who were not offered a place at the first college they were allocated to. Colleges are only included if they have at least 50 open applicants.

Table 25: Selection on Observables and Unobservables Results: All Subjects, English, Maths and History

         Cutoff Ability        No. Enrolled  College effects βj
College  zj     Open   Direct  Open  Direct  Model 1  Model 2  Model 3
All Subjects
MAGD     1.28   1.75   1.81    4     465     0.00     0.00     0.00
S-JOHN   1.27   1.74   1.81    18    426     -0.03    0.02     -0.02
PEMB     1.17   1.66   1.71    43    295     -0.31    -0.18    -0.19
CH-CH    1.25   1.73   1.79    16    399     -0.26    -0.17    -0.23
SEH      1.23   1.71   1.79    97    279     -0.30    -0.17    -0.25
EXETER   1.17   1.66   1.69    11    331     -0.26    -0.14    -0.26
S-CATS   1.27   1.75   1.82    42    380     -0.40    -0.33    -0.26
S-HIL    1.23   1.71   1.78    125   113     -0.34    -0.23    -0.27
S-HUGH   1.22   1.71   1.80    122   206     -0.33    -0.21    -0.28
MERT     1.46   1.91   2.00    8     333     -0.09    -0.08    -0.30
BALL     1.39   1.85   1.91    4     421     -0.20    -0.18    -0.30
ORIEL    1.26   1.74   1.80    23    276     -0.33    -0.27    -0.31
S-ANNE   1.29   1.76   1.86    64    304     -0.29    -0.20    -0.33
SOMER    1.18   1.67   1.80    106   212     -0.41    -0.30    -0.35
NEW      1.48   1.92   2.04    9     478     -0.12    -0.08    -0.36
LINC     1.42   1.87   1.97    11    326     -0.24    -0.22    -0.41
QUEENS   1.24   1.72   1.79    42    268     -0.46    -0.35    -0.42
S-PET    1.37   1.83   1.91    56    208     -0.37    -0.21    -0.46
MANS     1.43   1.88   1.97    24    149     -0.30    -0.20    -0.46
CCC      1.37   1.83   1.96    27    219     -0.37    -0.31    -0.51
UNIV     1.67   2.08   2.22    4     406     -0.11    -0.11    -0.55
LMH      1.47   1.91   2.03    34    329     -0.32    -0.22    -0.55
WADH     1.47   1.92   2.02    16    455     -0.34    -0.28    -0.56
JESUS    1.64   2.06   2.19    15    353     -0.27    -0.23    -0.67
HERT     1.65   2.06   2.19    13    446     -0.28    -0.20    -0.70
KEBLE    1.74   2.14   2.28    12    446     -0.26    -0.19    -0.77
English
S-HUGH   1.24   1.72   1.85    12    17      0.00     0.00     0.00
LMH      1.12   1.63   1.72    5     42      -0.28    -0.48    -0.41
SOMER    1.07   1.58   1.73    13    35      -0.64    -0.71    -0.62
S-HIL    1.09   1.60   1.67    15    16      -0.71    -0.69    -0.67
Maths
S-HIL    1.41   1.86   1.92    5     6       0.00     0.00     0.00
S-PET    1.10   1.60   1.59    7     6       -0.63    -0.57    -0.03
SEH      1.38   1.84   1.86    5     4       -0.17    -0.27    -0.08
S-HUGH   1.40   1.85   1.95    3     20      -0.25    -0.13    -0.35
QUEENS   1.12   1.63   1.72    11    17      -0.98    -0.89    -0.56
MANS     1.75   2.15   2.23    2     7       -0.81    -0.93    -1.48
LMH      2.08   2.44   2.62    1     21      -0.39    -0.31    -1.85
History
S-HIL    1.11   1.61   1.67    5     8       0.00     0.00     0.00
S-HUGH   0.99   1.52   1.64    7     27      -0.27    0.03     -0.20
S-ANNE   1.21   1.69   1.76    5     12      -0.14    0.07     -0.32
MANS     1.27   1.74   1.82    4     10      -0.04    0.05     -0.32
SOMER    1.14   1.64   1.85    11    30      -0.13    0.10     -0.41

Columns 2-3 give the estimated ability of enrolled students. Columns 4-5 give the number of enrolled students. All college effect estimates are based on a restricted sample that does not include students who were not offered a place at the first college they were allocated to. College effect estimates are given relative to the college with the largest Model 3 college effect estimate. Colleges are only included if they have at least 50 open applicants. Estimated value of λ1: λ1 = 1.09 for All Subjects; λ1 = 0.24 for English; λ1 = 2.04 for Maths; λ1 = 1.92 for History.

Figure 4: Comparison of College Rankings across Models: All Subjects

[Two scatter plots of college ranks, each based on the restricted sample of colleges, with both axes running from 0 to 25. First panel, “Model 1 vs Model 2”: Model 2 Rank against Model 1 Rank. Second panel, “Model 2 vs Model 3”: Model 3 Rank against Model 2 Rank.]
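The sensitivity to λ1 can be checked against Table 24 itself. The printed Model 3 columns appear to follow βj(λ1) = βj(Model 1) − λ1 · (āj − āJ), where āj is the enrolment-weighted mean of the open and direct ability columns and J is the baseline college. This rule is inferred from the printed values, not quoted from equation (5), so it should be treated as an assumption. The sketch below applies it to three PPE rows; the dictionary layout and function names are illustrative.

```python
# PPE rows from Table 24:
# (ability_open, ability_direct, n_open, n_direct, model_1_effect,
#  printed Model 3 effects keyed by lambda_1)
PPE = {
    "S-HIL": (2.11, 2.14, 7, 3, 0.00,
              {0.5: 0.00, 1: 0.00, 2: 0.00, 5: 0.00}),
    "MANS": (2.51, 2.66, 1, 16, -0.54,
             {0.5: -0.80, 1: -1.07, 2: -1.60, 5: -3.19}),
    "S-ANNE": (1.80, 1.85, 7, 20, -0.70,
               {0.5: -0.56, 1: -0.42, 2: -0.13, 5: 0.72}),
}
BASELINE = "S-HIL"


def mean_ability(row):
    """Enrolment-weighted mean ability of a college's enrolled students."""
    a_open, a_direct, n_open, n_direct = row[:4]
    return (n_open * a_open + n_direct * a_direct) / (n_open + n_direct)


def model3_effect(college, lam):
    """Assumed rule: Model 1 effect minus lam times the ability gap to baseline."""
    row = PPE[college]
    gap = mean_ability(row) - mean_ability(PPE[BASELINE])
    return row[4] - lam * gap


def max_abs_error():
    """Largest gap between the assumed rule and the printed Model 3 values."""
    return max(
        abs(model3_effect(college, lam) - printed)
        for college, row in PPE.items()
        for lam, printed in row[5].items()
    )


print(f"largest discrepancy: {max_abs_error():.3f}")
```

Under this reading, a college whose enrolled students have higher estimated ability than the baseline loses ground as λ1 rises, which is why Mansfield, with the highest PPE cut-off, falls furthest.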

Second, in Table 25, I present All Subjects selection on observables and unobservables results.

Again the first five columns present the cut-off, average ability of enrolled students and the number

of enrolled students. The final three columns give college effect estimates for Models 1, 2 and 3 again

based on the reduced sample of students and colleges. Colleges are ranked by their Model 3 college

effectiveness estimate. Comparing Model 1, 2 and 3 college effect estimates suggests that taking

into account unobservable ability may actually slightly increase variation in adjusted Prelims results

between colleges. Taking the results at face value also suggests that in many cases, unobservable

ability is not well correlated with observable ability (differences in Prelims results between colleges

usually fall when moving from Model 1 to Model 2 but often rise when moving from Model 2 to Model

3). However, I cannot rule out that these results are mainly due to estimation error. Indeed, for

All Subjects, the estimate of the ability scale parameter is λ1 = 1.09, which seems implausibly high

because it leads to a strong negative correlation between the estimated cut-off zj and the Model 3

college effect estimates βj (the 6 lowest places in the table are occupied by colleges that have cut-offs

in the top 7). Figure 4 illustrates how college effectiveness rankings change across models. Although

college rankings change very little when we compare Model 1 and Model 2 results, Model 3 results

are quite different. The correlation in college effectiveness estimates is 0.94 for Model 1 vs Model

2 but falls to 0.44 for Model 2 vs Model 3. As shown in section 5.5.1, random assignment of open

applicants is less convincing when pooled across subjects so I also give college effect estimates for

English, Maths and History in the lower 3 panels in Table 25. These estimates largely tell the same

story – the Model 3 estimates are at times quite different from the Model 1 and Model 2 estimates,

indicating either that unobservable ability is very important or that these estimates are very imprecise.
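The All Subjects correlations quoted above can be recomputed from the Model 1, 2 and 3 columns of Table 25. The sketch below uses the values as printed (rounded to two decimals, baseline college included) and a hand-written Pearson correlation; the function and variable names are illustrative. With these rounded inputs the results should come out close to the 0.94 and 0.44 reported in the text.

```python
from math import sqrt

# All Subjects college effects from Table 25, in table order (MAGD ... KEBLE)
MODEL_1 = [0.00, -0.03, -0.31, -0.26, -0.30, -0.26, -0.40, -0.34, -0.33,
           -0.09, -0.20, -0.33, -0.29, -0.41, -0.12, -0.24, -0.46, -0.37,
           -0.30, -0.37, -0.11, -0.32, -0.34, -0.27, -0.28, -0.26]
MODEL_2 = [0.00, 0.02, -0.18, -0.17, -0.17, -0.14, -0.33, -0.23, -0.21,
           -0.08, -0.18, -0.27, -0.20, -0.30, -0.08, -0.22, -0.35, -0.21,
           -0.20, -0.31, -0.11, -0.22, -0.28, -0.23, -0.20, -0.19]
MODEL_3 = [0.00, -0.02, -0.19, -0.23, -0.25, -0.26, -0.26, -0.27, -0.28,
           -0.30, -0.30, -0.31, -0.33, -0.35, -0.36, -0.41, -0.42, -0.46,
           -0.46, -0.51, -0.55, -0.55, -0.56, -0.67, -0.70, -0.77]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_12 = pearson(MODEL_1, MODEL_2)
r_23 = pearson(MODEL_2, MODEL_3)
print(f"Model 1 vs Model 2: {r_12:.2f}")
print(f"Model 2 vs Model 3: {r_23:.2f}")
```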

7 Characteristics of Effective Colleges

What characteristics are associated with effective colleges? To answer this question I implement a

two-step procedure because Moulton (1986) shows that estimating the impact of college characteristics in

one step is problematic for the precision of estimated effects. I would like to estimate γ from the

college level equation:

βj = Zjγ + uj ∀ j = 1, 2, . . . , J (15)

where βj is the true college effect for college j, determined by a vector of college characteristics Zj and

a homoskedastic random error term uj with E(uj) = 0 and Var(uj) = σ². However, true college

effectiveness βj is not observable. Rather, we observe the college effect coefficient estimates β̂j from the

first stage models. Using first stage estimated regression coefficients implies an additional error in

the second stage regression because of estimation error:

β̂j = βj + εj ∀ j = 1, 2, . . . , J. (16)

Thus the second stage regression becomes:

β̂j = Zjγ + uj + εj ∀ j = 1, 2, . . . , J (17)

where Var(εj) = wj². Since the dependent variable in the second stage, β̂j, is itself estimated, the

second stage regression residual can be thought of as having two components. The first, uj, is the random

shock that would have obtained even if the college effects were directly observed and could well be

homoskedastic. The second component, εj, is the estimation error from the first stage regression. Even

if uj is homoskedastic, εj will be heteroskedastic because estimation error differs across colleges.

Therefore the regression errors in (17) will be heteroskedastic and OLS will produce inconsistent

standard error estimates.³⁷

I follow Hanushek et al. (1996), assuming that wj² is proportional to the sampling variance of β̂j and

using a specialised form of feasible generalised least squares (FGLS).³⁸ First,

I estimate equation (17) using OLS and calculate the squared residuals for j = 1, ..., J − 1, where

college J is the baseline college in the first stage. Next, I regress the squared residuals on the squared

standard errors from the college effect estimates. Finally, I use the inverse of the predicted squared

residuals from this auxiliary regression as the weights in the FGLS estimation of (17). Estimates

from this regression will be asymptotically efficient.
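The three steps above can be sketched as follows. This is a minimal illustration with a single college characteristic plus an intercept, not the code used for this thesis. The function names, the toy inputs, and the small positive floor applied to the predicted variances (so that the weights stay finite and positive) are my assumptions.

```python
def wls_line(x, y, w=None):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    w = w or [1.0] * len(x)
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fgls_second_stage(z, beta_hat, std_err, floor=1e-8):
    """Regress estimated college effects on one characteristic by FGLS."""
    # Step 1: OLS of the estimated college effects on the characteristic.
    a, g = wls_line(z, beta_hat)
    resid_sq = [(b - a - g * zi) ** 2 for zi, b in zip(z, beta_hat)]

    # Step 2: auxiliary regression of the squared residuals on the
    # squared first-stage standard errors (these must not all be equal).
    se_sq = [s * s for s in std_err]
    c0, c1 = wls_line(se_sq, resid_sq)
    predicted_var = [max(c0 + c1 * v, floor) for v in se_sq]

    # Step 3: weighted regression, weights = inverse predicted variances.
    weights = [1.0 / v for v in predicted_var]
    return wls_line(z, beta_hat, weights)


# Hypothetical inputs: endowment (in units of 10m pounds), estimated
# effects and their first-stage standard errors for six colleges.
endowment = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
effects = [0.12, 0.25, 0.22, 0.35, 0.33, 0.45]
std_errs = [0.05, 0.10, 0.08, 0.20, 0.12, 0.15]
print(fgls_second_stage(endowment, effects, std_errs))
```

Colleges whose effects are estimated imprecisely receive less weight in the final step, provided the auxiliary regression finds that residual variance rises with the first-stage standard error.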

The small number of colleges has two implications for inference (Donald and Lang, 2007). First,

the assumption that college effects are normally distributed is crucial for hypothesis testing because

we cannot rely on large sample sizes to provide an asymptotically normal distribution of the parameter

estimates. Second, there is a practical limit on the number of variables that can be included in Z.

For the college characteristics in Zj I use endowment,³⁹ the number of students on the course

and the college average admissions test scores. These latter two variables are included, as in Bratti

(2002), to proxy for peer effects. A positive coefficient on the number of students on the course

would suggest that students benefit from being surrounded by lots of other students studying the

same course within the same college. A positive coefficient on college average admissions test scores

would suggest that students benefit from being surrounded by high ability students studying the

same course within the same college. I also consider specifications with a dummy variable for being

a former All Women’s college (LMH, St Anne’s, St Hugh’s and Somerville), dummy variables based

on location and dummy variables for “Old” colleges (foundation pre-1500) and “Young” colleges

(foundation post-1850). However, including lots of dummy variables severely reduces the degrees of

freedom available and I do not report these results.

³⁷ OLS standard errors will be inconsistent with a fixed number of students per college as the number of colleges tends to infinity. However, when the number of students per college is large, the second component (the estimation error) is small and the first component (the random shock) is assumed homoskedastic, so OLS produces consistent standard errors (Donald and Lang, 2007).

³⁸ A common approach is to use weighted least squares with weights 1/wj in the second stage regression. However, like OLS, this is inefficient and may produce inconsistent estimates of parameter uncertainty because it implicitly assumes that the entire residual uj + εj, and not just the second component εj, is heteroskedastic (Hanushek, 1974).

³⁹ Specifically 2011 endowment, approximately the midpoint of the time period. I collected this information from college Financial Reports publicly available on the Oxford website.

Table 26: Second Stage Regression Results: Impact of Endowment

                PPE                 EM                  Law                 All Subjects
                Model 1   Model 2   Model 1   Model 2   Model 1   Model 2   Model 1   Model 2
Endowment       0.048**   0.028     -0.022    -0.024    0.003     0.006     0.027*    0.018*
                (0.015)   (0.015)   (0.018)   (0.023)   (0.018)   (0.018)   (0.010)   (0.008)
Endowment Sq    -0.002**  -0.001    0.001     0.001     -0.000    -0.000    -0.001**  -0.001**
                (0.000)   (0.001)   (0.001)   (0.001)   (0.000)   (0.000)   (0.000)   (0.000)
Prob > F        0.01      0.08      0.33      0.59      0.19      0.07      0.01      0.01
R-squared       0.25      0.13      0.06      0.04      0.01      0.01      0.24      0.15
N               32        32        22        22        31        31        32        32

The dependent variable in columns 1, 3, 5 and 7 is the Model 1 college effectiveness estimate. The dependent variable in columns 2, 4, 6 and 8 is the Model 2 college effectiveness estimate. Standard errors are FGLS standard errors calculated as in Hanushek, Rivkin and Taylor (1996). Prob > F gives the p-value for the F-test of the null hypothesis that the coefficients on endowment and endowment squared are equal to zero. The units for endowment are £10m. * p < 0.05, ** p < 0.01

Tables 26 and 27 present FGLS estimates of the determinants of the college effects. Regression

results are reported using college effect estimates from Model 1 and Model 2 (no standard errors are

available for Model 3).

The impact of endowment on college effectiveness is best evaluated through regressions with no

other controls, as in Table 26. Table 26 provides evidence that endowment is related to both raw

Prelims scores (Model 1) and college effectiveness adjusted for observables (Model 2). The

null hypothesis that endowment has no impact on Model 2 college effectiveness can be rejected by F-tests for

PPE and Law (at the 10% level) and for All Subjects (at the 1% level), though not for E&M where

the estimated effect is negative and insignificant. The estimated relationship between endowment

and college effectiveness is increasing and concave for PPE, Law and All Subjects. Richer colleges on

average tend to be more effective. For example, the top 6 most effective colleges have endowments

in the top 9.⁴⁰ For PPE and All Subjects, endowment is more closely related to raw Prelims scores

from Model 1 than to college effectiveness adjusted for observables in Model 2 (as shown by smaller

coefficient estimates and smaller R² estimates), which suggests high ability students sort into richer

colleges. For PPE, an increase in endowment from £15 million to £25 million is related to a 0.046

standard deviation increase in raw Prelims scores and an improvement of 0.027 standard deviations

in Prelims scores after accounting for observables. The effect is also slightly underestimated because I

must exclude the baseline college from the analysis, since its college effect has no associated standard

error; for PPE and All Subjects this is St John’s, which has both the largest endowment and

high college effectiveness. These results are consistent with richer colleges attracting higher ability

students and teaching them more effectively than other colleges. More effective colleges may also

receive more and larger donations from alumni.

⁴⁰ This point is made periodically in the media, e.g. Times Higher Education: “Oxford inequalities exposed”, 2nd May 2003, and Cherwell: “Rich colleges enjoy more academic success”, 29th October 2010.

Table 27: Second Stage Regression Results: Evidence of Peer Effects

                                PPE                 EM                  Law                 All Subjects
                                Model 1   Model 2   Model 1   Model 2   Model 1   Model 2   Model 1   Model 2
Endowment                       0.017     0.020     -0.033    -0.031    0.005     0.002     0.024*    0.015
                                (0.016)   (0.016)   (0.021)   (0.028)   (0.020)   (0.015)   (0.011)   (0.008)
Endowment Sq                    -0.001    -0.001    0.001     0.001     -0.000    -0.000    -0.001*   -0.000
                                (0.000)   (0.001)   (0.001)   (0.001)   (0.001)   (0.000)   (0.000)   (0.000)
No. PPE students                0.011**   0.008**
                                (0.003)   (0.002)
Avg TSA Critical PPE            0.013     -0.011
                                (0.019)   (0.020)
Avg TSA Problem PPE             0.010     -0.013
                                (0.013)   (0.013)
No. EM students                                     0.003     0.004
                                                    (0.005)   (0.006)
Avg TSA Critical EM                                 -0.003    -0.019
                                                    (0.051)   (0.051)
Avg TSA Problem EM                                  0.046     0.057
                                                    (0.059)   (0.064)
No. Law students                                                        0.003     0.006
                                                                        (0.006)   (0.004)
Avg LNAT                                                                0.124     0.107*
                                                                        (0.064)   (0.042)
Total Students per year                                                                     0.001     0.001*
                                                                                            (0.001)   (0.001)
Endowment: Prob > F             0.08      0.00      0.16      0.49      0.28      0.11      0.09      0.21
Ability peer effects: Prob > F  0.57      0.49      0.53      0.63      0.06      0.02
All variables: Prob > F         0.00      0.00      0.28      0.85      0.22      0.03      0.02      0.01
R-squared                       0.54      0.40      0.16      0.13      0.19      0.31      0.25      0.21
N                               32        32        21        21        30        30        32        32

The dependent variable in columns 1, 3, 5 and 7 is the Model 1 college effectiveness estimate. The dependent variable in columns 2, 4, 6 and 8 is the Model 2 college effectiveness estimate. Standard errors are FGLS standard errors calculated as in Hanushek, Rivkin and Taylor (1996). Endowment: Prob > F gives the p-value for the F-test of the null hypothesis that the coefficients on endowment and endowment squared are equal to zero. Ability peer effects: Prob > F gives the p-value for the F-test of the null hypothesis that the coefficients on average admissions test scores are equal to zero. * p < 0.05, ** p < 0.01

Table 27 includes endowment and peer effect proxies as explanatory variables. Evidence on peer

effects, holding endowment fixed, is mixed. Focusing on the regressions with Model 2 college effects

as the dependent variable, average admissions test scores have a positive and significant effect for

Law. This suggests students can learn from other high ability students within the same college or

perhaps benefit from competition with them. However, average admissions test score coefficients are

insignificant for E&M and PPE and are even negative in some cases, which would suggest students

benefit more from lower ability peers. Thus there is little evidence of ability peer effects.

There is, however, evidence of peer effects operating through the number of students at each college

per course. The coefficients for the Model 2 college effect regressions are positive and statistically

significant at the 1% level for PPE and are positive but insignificant for E&M and Law (they are also

positive and insignificant for All Subjects). Colleges that take large numbers of students in a given

course perform well in that course both before and after accounting for observables. Colleges taking

one extra student per year over 5 years in PPE, ceteris paribus, are associated with an improvement

of 0.055 standard deviations in raw Prelims scores and an improvement of 0.040 standard deviations

in Prelims scores after accounting for observables – a large effect for such a small intervention. A

similar size effect is measured for Law – colleges taking one extra student per year over 7 years in Law

are associated with an improvement of 0.042 standard deviations in Prelims scores after accounting

for observables – however it is not statistically significant. One interpretation is students benefit from

interacting with college peers within their subjects. Alternatively colleges may accept more students

in courses that they are stronger in or close down poorly performing courses.
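The magnitudes quoted above follow from simple arithmetic on the Table 27 coefficients, assuming the student-count regressor is the total number of students on the course over the sample period, so that one extra student per year for T years adds T students. That reading, and the function name, are my assumptions.

```python
def implied_effect(coef_per_student, extra_per_year, years):
    """Implied change in the college effect, in standard deviations of Prelims scores."""
    return coef_per_student * extra_per_year * years


ppe_raw = implied_effect(0.011, 1, 5)       # PPE, Model 1 coefficient
ppe_adjusted = implied_effect(0.008, 1, 5)  # PPE, Model 2 coefficient
law_adjusted = implied_effect(0.006, 1, 7)  # Law, Model 2 coefficient
print(round(ppe_raw, 3), round(ppe_adjusted, 3), round(law_adjusted, 3))
```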

Overall, these college characteristics explain only a fraction of differences in college effectiveness.

This does not mean college effectiveness cannot ever be explained – the estimates are very imprecise

and further work may benefit from using data on a wider range of college characteristics that I was

unable to obtain. I discuss this further in the conclusion.

8 Discussion and Limitations

Even though my college effectiveness estimates are an improvement over the Norrington table, their

interpretation should include various caveats and cautions.

First, my first stage college effectiveness estimates are more directly relevant to students than

to college administrators. To understand why, following Raudenbush and Willms (1995), imagine

decomposing my college effect estimates into two parts: (i) college context (the resources available to a

college) and (ii) college practice (the efficiency with which those resources are used). Context includes

college endowment, location and peer interactions. Practice includes teaching style, organisational

structure and college leadership. My college effectiveness estimates include both context and practice

and are known as “Type A effects” (Raudenbush and Willms, 1995), appropriate for students who wish

to ascertain their expected exam results at different colleges conditional on their own characteristics,

but are unconcerned about whether exam results come from college context or college practice. In

contrast, “Type B effects” include only the effect of college practice and not college context. Type B

effects are appropriate for college administrators interested in college accountability and instructional

practice because they measure the efficiency with which colleges exploit the resources available to

them. Removing college “context” ensures colleges are not held accountable for factors mostly outside

their control.⁴¹ Strictly, this means that my college effectiveness estimates are, at best, Type A effects,

of interest to students selecting colleges, not Type B effects, of interest to administrators analysing

instructional practice. Certainly I hope the estimates have the potential to stimulate useful discussion

about how to improve practice within colleges and my second stage estimates also contribute to this.

However, the first stage estimates should not be taken as direct evidence of instructional practice.

⁴¹ Type A effects are often known as value-added whereas Type B effects are known as contextual value-added.

Second, and relatedly, my college effect estimates are inclusive of any student-effort/input

adjustments (Bratti, 2002; Todd and Wolpin, 2003). In an optimising behavioural model, changing

a student’s college may change their effort level. Thus colleges exert a twofold impact on students’

exam performance, first directly through college characteristics and second through students’ optimal

effort input. This second effect may be positive or negative. Good teaching may motivate students to

put more effort into studying (college teaching and student inputs are complements). Alternatively,

students may work harder to make up for ineffective teaching (college teaching and student inputs

are substitutes). Thus student behaviour could potentially mute or exacerbate differences in college

effectiveness. This is not necessarily a limitation, as the total college effect is precisely the desired

effect for answering most policy questions (Todd and Wolpin, 2003).

Third, my college effectiveness estimates are relative by construction. The colleges are only

compared to other Oxford colleges – they do not assess the value of going to an Oxford college

as compared to going to a different university or no university at all.

Fourth, my college effect estimates are backward looking – they measure how effective colleges

were in the past. Potential students are interested in future, rather than past, effectiveness and this

implies larger uncertainty around college effectiveness estimates. I have not examined the stability

of college effects in detail but specification tests did suggest some evidence college effects do change

over time, perhaps due to tutor turnover. The less stable college effects are, the more noise in the

signal of college quality they provide to prospective students.

Fifth, my college effect estimates concentrate on exam results which are only one of many elements

that contribute to college quality (for a discussion of production with multiple outputs see Chizmar

and Zak (1983)). Colleges aim to produce a wide range of private benefits for students from increased

cognitive skills and improved labour market outcomes to an improved ability to make informed life

decisions about marriage, health, and parenting, and even perhaps increased happiness. Colleges also

produce an array of social benefits. Positive externalities from colleges operate through proximity to

knowledgeable people (Acemoglu and Angrist, 2001; Moretti, 2004), reduced crime, propensity to vote

and support for free speech (Dee, 2004). Finally, colleges aim to instil ethical values in their students.

Yet exams focus on measuring students’ cognitive skills and neglect other dimensions of college quality.

Exams may also mismeasure student cognitive skills because they reward college practices that may

not be considered desirable. These include “cream skimming” – encouraging weaker students not

to take exams and perhaps drop out; and “teaching to the test” – focusing teaching on test-taking

strategies and a narrow range of topics likely to be examined. Since exam results do not fully capture

everything students and society care about, my college effect estimates should not serve as the sole

criterion used by students or administrators to make decisions but should be seen as a starting point

to be complemented by other sources of information. A full appraisal of college effectiveness requires

a broad set of outcomes that proxy for the various dimensions of college effectiveness. I leave this for

further research.

Finally, adjusting exam results for ability makes it impossible to statistically distinguish between

the majority of colleges. Thus ordinal rankings of colleges should have large confidence intervals

around the point estimates. One alternative is to determine a benchmark Prelims / Finals average

and classify colleges relative to that benchmark (e.g. significantly below, within normal statistical

variance of, or statistically above).
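The benchmark classification suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the college names, point estimates and standard errors are hypothetical, the benchmark is set to zero, and a two-sided 95% normal interval (critical value 1.96) is assumed rather than taken from the thesis.

```python
def classify(estimate, std_error, benchmark=0.0, critical=1.96):
    """Place a college relative to a benchmark using a normal confidence interval."""
    lower = estimate - critical * std_error
    upper = estimate + critical * std_error
    if upper < benchmark:
        return "significantly below"
    if lower > benchmark:
        return "significantly above"
    # The interval contains the benchmark, so the college cannot be distinguished from it.
    return "within normal statistical variance"


# Hypothetical (estimate, standard error) pairs in standard deviation units.
colleges = {
    "College A": (0.30, 0.10),
    "College B": (0.05, 0.12),
    "College C": (-0.28, 0.11),
}

labels = {name: classify(est, se) for name, (est, se) in colleges.items()}
for name, label in labels.items():
    print(name, "->", label)
```

Reporting three bands rather than an ordinal ranking avoids implying differences between colleges whose intervals overlap the benchmark.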

9 Conclusion and Future Work

The Oxford college system gives students the benefits of belonging both to a large, internationally

renowned institution and to a small, interdisciplinary academic community. The benefits include

more personal tuition and more support than most other universities can give. The college system

naturally raises questions about whether students at different colleges benefit equally.

I find that although most of the variation in exam results is explained by differences in student

ability, colleges also play an important role, comparable to the role played by schools in boosting

student GCSE results. Across models and courses, there is evidence college effectiveness impacts

exam results. However, college effectiveness differs across courses which suggests focusing on the

effectiveness of colleges as a whole may be too simplistic – it is better to focus on courses within


colleges. OLS selection on observables results suggest a one standard deviation improvement in

college effectiveness corresponds to an increase of 0.11 standard deviations in Prelims for PPE, 0.15

for E&M and 0.14 for Law. Selection on unobservables results broadly support this but are imprecisely

estimated.

The finding that effectiveness differs across colleges is encouraging: it implies we can identify the

relevant factors and then improve educational outcomes. I find evidence that high endowment and

large numbers of students studying a given course are associated with more effective colleges. Overall

however, college effectiveness is not easy to explain with available college characteristics.

I hope this study will spark further research on college effectiveness. Future work could build on

this study in a number of ways. First, it would be interesting to examine the external validity of the

findings in this paper by studying colleges at different universities. Cambridge would be an obvious

example, but also universities where colleges exist but play less of a role in teaching than they do

at Oxford. Second, more accurate college effectiveness estimates may be achievable if the data used

here were complemented with measures of the quality of students’ personal statements and school

references. Interview scores for a wider range of courses would also be beneficial. Third, future work

could examine the effect of colleges on a broader set of outcome variables such as post-graduate earnings

or student satisfaction ratings. Different outcomes may lead to different college effectiveness estimates

as different outcomes capture different dimensions of college quality. These multiple measures of

college effectiveness could then be combined (with an appropriate set of weights) to produce a better

overall measure of college effectiveness. Fourth, future work could use better information on college

characteristics which may help to better explain differences in college effectiveness. In particular,

using data on tutorial group sizes, tutor qualifications and hours of tuition may be interesting. Much

of this data is available from “Oxford Colleges On-line Reports for Tutorials” (OxCORT) which is a

web application for the collection and processing of tutorial reports for undergraduate teaching. I

would have used this data for my second stage estimation but gaining access to it proved difficult

as each college owns its own data and must be asked for it individually. Finally, further work

could study tutor value-added at Oxford. While teacher value-added has been extensively studied

at a primary school level and to a lesser extent a secondary school level, there have been only a few


studies of tutor value-added. Two findings in this study potentially imply tutors may play an

important role in the educational production function: (i) college effectiveness varies across courses

and (ii) aggregate college characteristics leave much variation in college effectiveness unexplained.

Again OxCORT data would make such a study feasible.
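The idea mentioned above of combining several effectiveness measures with a set of weights could be sketched as follows. Everything here is hypothetical: the college names, the two outcome measures, their values and the weights are invented for illustration, and standardising each measure before weighting is an assumption rather than a method proposed in the thesis.

```python
from statistics import mean, pstdev


def standardise(values):
    """Convert raw values to z-scores so measures on different scales are comparable."""
    centre, spread = mean(values), pstdev(values)
    return [(v - centre) / spread for v in values]


colleges = ["College A", "College B", "College C"]

# Hypothetical college-level measures: an exam-based effect and a satisfaction rating.
measures = {
    "prelims_effect": [0.2, 0.0, -0.2],
    "satisfaction": [3.0, 5.0, 4.0],
}

# Hypothetical weights; choosing them is a value judgement, not a statistical result.
weights = {"prelims_effect": 0.7, "satisfaction": 0.3}

z_scores = {name: standardise(vals) for name, vals in measures.items()}
composite = {
    college: sum(weights[m] * z_scores[m][i] for m in measures)
    for i, college in enumerate(colleges)
}
ranking = sorted(composite, key=composite.get, reverse=True)
print(ranking)
```

Different weights can reorder the colleges, which is one reason a composite measure would need to state its weights explicitly.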

References

Aaronson, D., L. Barrow, and W. Sander (2007). Teachers and student achievement in the Chicago public high schools.

Journal of Labor Economics 25 (1), 95–135.

Abadie, A., S. Athey, G. W. Imbens, and J. M. Wooldridge (2014). Finite population causal standard errors. Technical

report, National Bureau of Economic Research.

Acemoglu, D. and J. Angrist (2001). How large are human-capital externalities? evidence from compulsory-schooling

laws. In NBER Macroeconomics Annual 2000, Volume 15, pp. 9–74. MIT Press.

Afshartous, D. and M. Wolf (2007). Avoiding ‘data snooping’ in multilevel and mixed effects models. Journal of the

Royal Statistical Society: Series A (Statistics in Society) 170 (4), 1035–1059.

Aitkin, M. and N. Longford (1986). Statistical modelling issues in school effectiveness studies. Journal of the Royal

Statistical Society. Series A (General), 1–43.

Avery, C. and C. M. Hoxby (2004). Do and should financial aid packages affect students’ college choices? In College

choices: The economics of where to go, when to go, and how to pay for it, pp. 239–302. University of Chicago Press.

Ballou, D. (2009). Test scaling and value-added measurement. Education Finance and Policy 4 (4), 351–383.

Barnow, B., G. Cain, and A. Goldberger (1981). Selection on observables. Evaluation Studies Review Annual 5 (1),

43–59.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to

multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

Berk, R. A. (2004). Regression analysis: A constructive critique, Volume 11. Sage.

Bhattacharya, D., S. Kanaya, and M. Stevens (2014). Are university admissions academically fair? Available at SSRN

2082976 .

Black, D., J. Smith, and K. Daniel (2005). College quality and wages in the United States. German Economic

Review 6 (3), 415–443.

Black, D. A. and J. A. Smith (2004). How robust is the evidence on the effects of college quality? evidence from

matching. Journal of Econometrics 121 (1), 99–124.

Black, D. A. and J. A. Smith (2006). Estimating the returns to college quality with multiple proxies for quality. Journal

of Labor Economics 24 (3), 701–728.


Boyd, D., H. Lankford, S. Loeb, and J. Wyckoff (2013). Measuring test measurement error: a general approach. Journal

of Educational and Behavioral Statistics 38 (6), 629–663.

Braga, M., M. Paccagnella, and M. Pellizzari (2014). Evaluating students’ evaluations of professors. Economics of

Education Review 41, 71–88.

Bratti, M. (2002). Does the choice of university matter?: a study of the differences across UK universities in life sciences

students’ degree performance. Economics of Education Review 21 (5), 431–443.

Broecke, S. (2012). University selectivity and earnings: Evidence from UK data on applications and admissions to

university. Economics of Education Review 31 (3), 96–107.

Brown, M. B. and A. B. Forsythe (1974). Robust tests for the equality of variances. Journal of the American Statistical

Association 69 (346), 364–367.

Burgess, S. (2015). Human capital and education: The state of the art in the economics of education.

Cameron, A. C. and P. K. Trivedi (2005). Microeconometrics: methods and applications. Cambridge University Press.

Carrell, S. E. and J. E. West (2008). Does professor quality matter? evidence from random assignment of students to

professors. Technical report, National Bureau of Economic Research.

Cheng, J. H. and H. W. Marsh (2010). National student survey: are differences between universities and courses

reliable and meaningful? Oxford Review of Education 36 (6), 693–712.

Chetty, R., J. N. Friedman, and J. E. Rockoff (2013a). Measuring the impacts of teachers I: Evaluating bias in teacher

value-added estimates. Technical report, National Bureau of Economic Research.

Chetty, R., J. N. Friedman, and J. E. Rockoff (2013b). Measuring the impacts of teachers II: Teacher value-added and

student outcomes in adulthood. Technical report, National Bureau of Economic Research.

Chevalier, A. (2014). Does higher education quality matter in the UK. Research in Labor Economics 40, 257–292.

Chevalier, A. and X. Jia (2015). Subject-specific league tables and students’ application decisions. The Manchester

School .

Chizmar, J. F. and T. A. Zak (1983). Modeling multiple outputs in educational production functions. American

Economic Review 73 (2), 18–22.

Clarke, P., C. Crawford, F. Steele, and A. F. Vignoles (2010). The choice between fixed and random effects models:

some considerations for educational research.

Cox, D. R. (1958). Planning of experiments.

Cunha, J. M. and T. Miller (2014). Measuring value-added in higher education: Possibilities and limitations in the

use of administrative data. Economics of Education Review 42, 64–77.

Dale, S. B. and A. B. Krueger (1999). Estimating the payoff to attending a more selective college: An application of

selection on observables and unobservables. Technical report, National Bureau of Economic Research.

Dale, S. B. and A. B. Krueger (2014). Estimating the effects of college characteristics over the career using

administrative earnings data. Journal of Human Resources 49 (2), 323–358.


Davison, K. K. (2012). Propensity score methods as alternatives to value-added modeling for the estimation of teacher

contributions to student achievement.

Dee, T. S. (2004). Are there civic returns to education? Journal of Public Economics 88 (9), 1697–1720.

Deming, D. J. (2014). Using school choice lotteries to test measures of school effectiveness. Technical report, National

Bureau of Economic Research.

Deutsch, J. (2012). Using school lotteries to evaluate the value-added model. Unpublished working paper .

Donald, S. G. and K. Lang (2007). Inference with difference-in-differences and other panel data. Review of Economics

and Statistics 89 (2), 221–233.

Epple, D., R. E. Romano, and M. Urquiola (2015). School vouchers: a survey of the economics literature. Technical

report, National Bureau of Economic Research.

Feld, J. and U. Zölitz (2015). Understanding peer effects: on the nature, estimation and channels of peer effects.

Feng, A. and G. Graetz (2015). A question of degree: the effects of degree class on labor market outcomes. Technical

report, IZA Discussion Papers.

Fitz-Gibbon, C. T. (1991). Multilevel modelling in an indicator system. Schools, classrooms and pupils: international

studies of schooling from multilevel perspective, 67–83.

Fu, C. (2014). Equilibrium tuition, applications, admissions, and enrollment in the college market. Journal of Political

Economy 122 (2), 225–281.

Goldhaber, D. and M. Hansen (2013). Is it just a bad class? assessing the long-term stability of estimated teacher

performance. Economica 80 (319), 589–612.

Goldhaber, D. D. and D. J. Brewer (1997). Why don’t schools and teachers seem to matter? assessing the impact of

unobservables on educational productivity. Journal of Human Resources, 505–523.

Goldstein, H. and P. Sammons (1997). The influence of secondary and junior schools on sixteen year examination

performance: A cross-classified multilevel analysis. School Effectiveness and School Improvement 8 (2), 219–230.

Goldstein, H. and D. J. Spiegelhalter (1996). League tables and their limitations: statistical issues in comparisons of

institutional performance. Journal of the Royal Statistical Society. Series A (Statistics in Society), 385–443.

Guarino, C. M., M. Maxfield, M. D. Reckase, P. N. Thompson, and J. M. Wooldridge (2015). An evaluation of

empirical Bayes’s estimation of value-added teacher performance measures. Journal of Educational and Behavioral

Statistics 40 (2), 190–222.

Hanushek, E. (1971). Teacher characteristics and gains in student achievement: Estimation using micro data. American

Economic Review 61 (2), 280–288.

Hanushek, E. A. (1974). Efficient estimators for regressing regression coefficients. American Statistician 28 (2), 66–67.

Hanushek, E. A. (2006). School resources. Handbook of the Economics of Education 2, 865–908.

Hanushek, E. A. and S. G. Rivkin (2010). Generalizations about using value-added measures of teacher quality.

American Economic Review 100 (2), 267–271.


Hanushek, E. A., S. G. Rivkin, and L. L. Taylor (1996). Aggregation and the estimated effects of school resources.

Technical report, National Bureau of Economic Research.

Herrmann, M., E. Walsh, E. Isenberg, A. Resch, et al. (2013). Shrinkage of value-added estimates and characteristics

of students with hard-to-predict achievement levels. Washington, DC: Mathematica Policy Research.

Hill, C. J., H. S. Bloom, A. R. Black, and M. W. Lipsey (2008). Empirical benchmarks for interpreting effect sizes in

research. Child Development Perspectives 2 (3), 172–177.

Hoekstra, M. (2009). The effect of attending the flagship state university on earnings: A discontinuity-based approach.

Review of Economics and Statistics 91 (4), 717–724.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association 81 (396),

945–960.

Illanes, G., C. Sapelli, et al. (2012). Class size and teacher effects in higher education. Technical report.

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of

Economics and Statistics 86 (1), 4–29.

James, E., N. Alsalam, J. C. Conaty, and D.-L. To (1989). College quality and future earnings: where should you send

your child to college? American Economic Review 79 (2), 247–252.

Kaiser, B. et al. (2014). Rhausman: Stata module to perform robust Hausman specification test. Statistical Software

Components.

Kane, T. J. and D. O. Staiger (2002). The promise and pitfalls of using imprecise school accountability measures.

Journal of Economic Perspectives 16 (4), 91–114.

Klein, S. P., G. Kuh, M. Chun, L. Hamilton, and R. Shavelson (2005). An approach to measuring cognitive outcomes

across higher education institutions. Research in Higher Education 46 (3), 251–276.

Koedel, C. (2009). An empirical analysis of teacher spillover effects in secondary school. Economics of Education

Review 28 (6), 682–692.

Koedel, C. and J. R. Betts (2011). Does student sorting invalidate value-added models of teacher effectiveness? an

extended analysis of the Rothstein critique. Education Finance and Policy 6 (1), 18–42.

Koedel, C., R. Leatherman, and E. Parsons (2012). Test measurement error and inference from value-added models.

BE Journal of Economic Analysis & Policy 12 (1).

Koedel, C., K. Mihaly, and J. Rockoff (2015). Value-added modeling: A review. Economics of Education Review .

Konstantopoulos, S. (2005). Trends of school effects on student achievement: Evidence from NLS:72, HSB:82, and NELS:92.

Ladd, H. F. (2008). Teacher effects: What do we know. Teacher quality: Broadening and deepening the debate, 3–26.

Ladd, H. F. and R. P. Walsh (2002). Implementing value-added measures of school effectiveness: getting the incentives

right. Economics of Education Review 21 (1), 1–17.

Lankester, T. et al. (2005). Undergraduate admissions: policy and procedures. Technical report, Working Party

on Selection and Admissions.


Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional

independence assumption. Springer.

Lockwood, J. and D. F. McCaffrey (2014). Correcting for test score measurement error in ANCOVA models for estimating

treatment effects. Journal of Educational and Behavioral Statistics 39 (1), 22–52.

Long, M. C. (2008). College quality and early adult outcomes. Economics of Education Review 27 (5), 588–602.

Lucas, J. (1980). Norrington blues.

Manly, C. A. and R. S. Wells (2015). Reporting the use of multiple imputation for missing data in higher education

research. Research in Higher Education 56 (4), 397–409.

McCaffrey, D. F., T. R. Sass, J. Lockwood, and K. Mihaly (2009). The intertemporal variability of teacher effect

estimates. Education Finance and Policy 4 (4), 572–606.

Miller III, D. W. (2009). Essays on higher education policy. Ph.D. thesis, Stanford University.

Moretti, E. (2004). Estimating the social return to higher education: evidence from longitudinal and repeated

cross-sectional data. Journal of Econometrics 121 (1), 175–212.

Moulton, B. R. (1986). Random group effects and the precision of regression estimates. Journal of Econometrics 32 (3),

385–397.

Naylor, R., J. Smith, and S. Telhaj (2015). Graduate returns, degree class premia and higher education expansion in

the UK. Oxford Economic Papers, gpv070.

Nye, B., S. Konstantopoulos, and L. V. Hedges (2004). How large are teacher effects? Educational Evaluation and

Policy Analysis 26 (3), 237–257.

O’Hara, R. (2016). The collegiate way. http://collegiateway.org/. Accessed: 2016-04-15.

Oster, E. (2013). Unobservable selection and coefficient stability: Theory and validation. Technical report, National

Bureau of Economic Research.

Pallais, A. (2013). Small differences that matter: Mistakes in applying to college. Technical report, National Bureau

of Economic Research.

Papay, J. P. (2011). Different tests, different answers: the stability of teacher value-added estimates across outcome

measures. American Educational Research Journal 48 (1), 163–193.

Raudenbush, S. W. and J. Willms (1995). The estimation of school effects. Journal of Educational and Behavioral

Statistics 20 (4), 307–335.

Reardon, S. F. and S. W. Raudenbush (2009). Assumptions of value-added models for estimating school effects.

Education Finance and Policy 4 (4), 492–519.

Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal

effects. Biometrika 70 (1), 41–55.

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables.

Education Finance and Policy 4 (4), 537–571.

Roy, A. D. (1951). Some thoughts on the distribution of earnings. Oxford Economic Papers 3 (2), 135–146.


Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 34–58.

Rubin, D. B., E. A. Stuart, and E. L. Zanutto (2004). A potential outcomes view of value-added assessment in

education. Journal of Educational and Behavioral Statistics, 103–116.

Saavedra, J. E. (2009). The learning and early labor market effects of college quality: A regression discontinuity

analysis. Investigaciones del ICFES .

Scott-Clayton, J. (2012). Information constraints and financial aid policy. Technical report, National Bureau of

Economic Research.

Seftor, N., J. Constantine, S. Cody, M. Ponza, J. Knab, J. Deke, and S. Monahan (2011). What works clearinghouse:

Procedures and standards handbook 2011 (ncee 2011-xxxx). Washington, DC: National Center for Education

Evaluation and Regional Assistance, Institute of Education Sciences, US Department of Education.

Smith, J., A. McKnight, and R. Naylor (2000). Graduate employability: policy and performance in higher education

in the UK. Economic Journal 110 (464), 382–411.

Sullivan, D. G. (2001). A note on the estimation of linear regression models with heteroskedastic measurement errors.

Thomas, S., P. Sammons, P. Mortimore, and R. Smees (1997). Stability and consistency in secondary schools’ effects

on students’ GCSE outcomes over three years. School Effectiveness and School Improvement 8 (2), 169–197.

Todd, P. E. and K. I. Wolpin (2003). On the specification and estimation of the production function for cognitive

achievement. Economic Journal 113 (485), F3–F33.

Waldinger, F. (2010). Quality matters: The expulsion of professors and the consequences for PhD student outcomes in

Nazi Germany. Journal of Political Economy 118 (4), 787–831.

Walker, I. and Y. Zhu (2013). The impact of university degrees on the lifecycle of earnings: some further analysis.

A Proof of Proposition 1

This proof is adapted from Bhattacharya et al. (2014). Consider any feasible admissions policy $p_j$ for college $j$ satisfying the capacity constraint. Since the optimal admissions policy $p_j^{OPT}$ for college $j$ satisfies the capacity constraint with equality (see the definition of $z_j$) and $p_j$ is feasible, we must have:

\[
\sum_{x \in X_j} p_j^{OPT}(x)\,\alpha_j(x)\,\eta_j(x) = K_j \ge \sum_{x \in X_j} p_j(x)\,\alpha_j(x)\,\eta_j(x)
\;\Rightarrow\;
\sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right]\alpha_j(x)\,\eta_j(x) \ge 0. \tag{18}
\]


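Let $W(p_j) = \sum_{x \in X_j} p_j(x)\,\alpha_j(x)\,\eta_j(x)\,Y_j(x)$. Now college welfare resulting from $p_j^{OPT}$ differs from that resulting from $p_j$ by:

\[
\begin{aligned}
W(p_j^{OPT}) - W(p_j)
&= \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right]\alpha_j(x)\,\eta_j(x)\,Y_j(x) \\
&= \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right]\alpha_j(x)\,\eta_j(x)\left[Y_j(x) - z_j\right]
 + z_j \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right]\alpha_j(x)\,\eta_j(x) \\
&\ge \sum_{x \in X_j} \left[p_j^{OPT}(x) - p_j(x)\right]\alpha_j(x)\,\eta_j(x)\left[Y_j(x) - z_j\right] \\
&= \sum_{x:\,Y_j(x) \ge z_j} \left[p_j^{OPT}(x) - p_j(x)\right]\alpha_j(x)\,\eta_j(x)\left[Y_j(x) - z_j\right]
 + \sum_{x:\,Y_j(x) < z_j} \left[p_j^{OPT}(x) - p_j(x)\right]\alpha_j(x)\,\eta_j(x)\left[Y_j(x) - z_j\right] \\
&= \sum_{x:\,Y_j(x) \ge z_j} \left[1 - p_j(x)\right]\alpha_j(x)\,\eta_j(x)\left[Y_j(x) - z_j\right]
 + \sum_{x:\,Y_j(x) < z_j} p_j(x)\,\alpha_j(x)\,\eta_j(x)\left[z_j - Y_j(x)\right] \ge 0
\end{aligned}
\tag{19}
\]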

where the first inequality holds by (18) and the fact that, by condition 1, $z_j > 0$. Therefore we have $W(p_j^{OPT}) \ge W(p_j)$ for any feasible $p_j$ and the solution given in Proposition 1 is optimal.

To show uniqueness, argue by contradiction. Consider any feasible rule $p_j$ which differs from $p_j^{OPT}$ for some admissions profiles $x$ in a non-empty set $X(p_j) := \{x \in X_j \mid p_j^{OPT}(x) \ne p_j(x)\}$, and let $W(p_j^{OPT}) = W(p_j)$. Then the final inequality in (19) holds with equality, so $p_j$ must take the form:

\[
p_j(x) =
\begin{cases}
1 & \text{if } Y_j(x) \ge z_j \\
0 & \text{if } Y_j(x) < z_j
\end{cases}
\]

However, this implies $p_j(x) = p_j^{OPT}(x)$ for all $x$. This contradicts the assumption that $X(p_j)$ is non-empty. Therefore $W(p_j^{OPT}) > W(p_j)$ for any feasible $p_j$ that differs from $p_j^{OPT}$, leading to the desired uniqueness property of $p_j^{OPT}$.
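As an informal numerical check of the argument above, the sketch below enumerates a coarse grid of admissions policies for one college and confirms that the threshold rule is the unique welfare maximiser among the feasible ones. The applicant types, their masses (standing in for $\alpha_j(x)\,\eta_j(x)$), their outcomes (standing in for $Y_j(x)$), the capacity and the cutoff are all hypothetical numbers chosen so that the capacity constraint binds exactly; this illustrates the proposition on one example and is not a proof.

```python
from itertools import product

# Hypothetical applicant types for a single college (illustrative values only).
mass = [2, 1, 3, 2]      # stands in for alpha_j(x) * eta_j(x)
outcome = [5, 4, 3, 1]   # stands in for Y_j(x); all positive, so z_j > 0
capacity = 3             # K_j, chosen so the top two types fill the college exactly


def enrolment(policy):
    """Expected number of entrants under an admissions policy."""
    return sum(p * m for p, m in zip(policy, mass))


def welfare(policy):
    """College welfare W(p_j): expected outcome summed over entrants."""
    return sum(p * m * y for p, m, y in zip(policy, mass, outcome))


def threshold_policy(z):
    """Admit with probability 1 if the outcome is at least z, otherwise 0."""
    return tuple(1.0 if y >= z else 0.0 for y in outcome)


z = 4  # assumed cutoff at which capacity binds in this example
p_opt = threshold_policy(z)

# Enumerate admission probabilities on a coarse grid and keep the feasible policies.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
feasible = [p for p in product(grid, repeat=len(mass)) if enrolment(p) <= capacity]

best_welfare = max(welfare(p) for p in feasible)
maximisers = [p for p in feasible if welfare(p) == best_welfare]
print(p_opt, best_welfare, len(maximisers))
```

The grid uses multiples of 0.25 and integer masses so that all sums are exact in floating point, which keeps the equality comparison on welfare safe in this particular example.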
