University of Groningen

Latent instrumental variables

Ebbes, P.

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version: Publisher's PDF, also known as Version of record

Publication date: 2004

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity. s.n.

Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 08-12-2020

https://www.rug.nl/research/portal/en/publications/latent-instrumental-variables(2bb49ff4-cfff-4dcb-bc51-544f0ca9f823).html

Latent Instrumental Variables

A New Approach to Solve for Endogeneity

Peter Ebbes

Published by: Labyrinth Publications

P.O. Box 334

2984 AX Ridderkerk

The Netherlands

Tel.: +31 180 463 962

Printed by: Offsetdrukkerij Ridderprint B.V., Ridderkerk

ISBN 90-5335-029-2

© 2004, Peter Ebbes

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system of any nature, or transmitted in any form or by any means, electronic, mechanical, now known or hereafter invented, including photocopying or recording, without prior written permission of the publisher.

Supervisors (promotores): Prof. dr. M. Wedel, Prof. dr. U. Bockenholt, Prof. dr. drs. A. G. M. Steerneman

Assessment committee (beoordelingscommissie): Prof. dr. P. S. H. Leeflang, Prof. dr. P. J. Lenk, Prof. dr. P. H. Franses

Acknowledgements

This dissertation introduces, develops, and illustrates the latent instrumental variables (LIV) approach, which is a new method to correct for potential endogeneity bias in commonly used linear models. It has several advantages over traditional methods, such as ordinary least squares and instrumental variables techniques: it estimates the regression parameters unbiasedly regardless of the presence of regressor-error dependencies, it allows testing for such dependencies in a straightforward manner, and it does not require the availability of observed instrumental variables. This dissertation is the result of four-and-a-half years of work and could not have been completed without the help and support of many people.

First and foremost, I am deeply grateful to my advisors, Michel Wedel, Ulf Bockenholt, and Ton Steerneman, for their continuous guidance and support and for sharing their vast knowledge and experience. In particular, I would like to thank Michel Wedel for introducing me to the (international) marketing community, for showing me what it takes to pursue a career in academia, and for the considerable effort and time he invested in me. As for Ulf Bockenholt, I am grateful to him for the inspiring and constructive discussions, after which I always had the feeling of being (back) on a great avenue with a lot of further work ahead and many interesting new research questions. Ton Steerneman’s excellent technical skills and his ability to teach me about the ‘abstract world of statistics’ were essential for the well-being of this project and for enhancing my understanding of statistics. There was always enough time to talk about the other aspects of life as well. I am looking forward to collaborating with them on many future research projects.

Next, I would like to express my appreciation to the members of my Ph.D. committee for carefully reading my manuscript. This committee consists of Peter Leeflang (University of Groningen), Philip Hans Franses (Erasmus University Rotterdam), and Peter Lenk (University of Michigan). Their comments, questions, and suggestions were very constructive.

This research has benefitted from the comments and suggestions made by the editors and reviewers of papers that were based on parts of this thesis. In addition, this research gained from suggestions made by seminar participants at the University of Groningen, Tilburg University, Erasmus University Rotterdam, the Durham Business School, McGill University, and the University of Michigan. In particular, I would like to thank Tom Wansbeek for his effort in studying earlier versions of our manuscripts and for providing us with valuable suggestions to improve and position our research. I thank Paul Bekker for his time in the early phases of my dissertation work, helping me study Jan van der Ploeg’s thesis.

The department of Marketing and the SOM graduate school at the University of Groningen provided me with an inspiring research and social environment, which is one of the most important aspects of completing a dissertation project. I would like to thank my colleagues for the warm contacts, the keen interest, and the many pleasant “off-hours” social happenings. Being part of the board of the GAIOO for two years gave me the opportunity to be actively involved in Ph.D. student matters, and to organize sufficient social events to stimulate the interdisciplinary character of doctoral research at the University of Groningen. Many special thanks go to my paranymphs (paranimfen), Frits Wijbenga and Bart van de Aa, for their help in organizing my defense and for all the pleasant social and intellectual moments that we shared, in whatever form or combination.

I have special memories of my stay at the Ross School of Business at the University of Michigan. I acknowledge Michel Wedel’s efforts and the financial support of both the Netherlands Organization for Scientific Research and The Prince Bernhard Cultural Foundation, which made it possible for me to spend a considerable amount of time at this top-tier school. In addition, Dirk Pieter van Donk from the bureau of the SOM graduate school was of great help. My research and my personal development have benefitted greatly from this stay. I enjoyed being part of the community and working with faculty members and doctoral students of the University of Michigan. My special thanks go to Fred Feinberg for involving me in a challenging project on product line development, for helping me with job market issues, and for the pleasant and joyful talks (among other things) about ‘the Dutch’ and ‘the Americans’. I am grateful to Peter Lenk, who collaborated with me on the Bayesian chapter (chapter 7) of my thesis and whose particular sense of humor made the meetings always enjoyable. Jie Zhang’s support is greatly acknowledged and appreciated and I look forward to (finally) starting with our self-selection project on purchase decisions in on- and offline stores.

My family and friends have always been very supportive, which is essential to me, and for which I cannot thank them enough. In particular, and most importantly, the unconditional support of my parents and sister throughout all these years, their interest in my study and work, and their encouragement in pursuing my (international) ambitions, are invaluable to me.

Ann Arbor, MI, October 2004.

Contents

1 Introduction

2 Instrumental variables: a survey
  2.1 Introduction and bias in OLS
    2.1.1 Relevant omitted explanatory variables
    2.1.2 Measurement error
    2.1.3 Self-selection
    2.1.4 The simultaneous equation model
    2.1.5 Lagged dependent variables
    2.1.6 Bias in OLS when $E(\varepsilon \mid X) \neq 0$
  2.2 The IV approach
    2.2.1 Considerations when using Instrumental Variables
    2.2.2 IV based solutions to the weak instrument problem
  2.3 Alternative approaches to solve for regressor-error dependencies
  2.4 Conclusions and positioning of research

3 The LIV model
  3.1 Introduction
  3.2 The LIV model
  3.3 Identifiability and information
    3.3.1 Identifiability
    3.3.2 Information matrix
  3.4 A test-statistic to test for regressor-error dependencies
  3.5 Monte Carlo experiments
    3.5.1 Design of the simulation study: data generation
    3.5.2 Results for the simple LIV model ($m = 2$)
    3.5.3 Sensitivity analysis: using $m = 3$ and $m = 4$
    3.5.4 Results for $\pi$, $\lambda$, $\sigma_\nu^2$, and $\sigma_{\varepsilon\nu}$
  3.6 An illustrative example: a simple measurement error model
  3.7 Conclusions
  Appendix 3A Basic theorems on identifiability of mixtures
  Appendix 3B 1st and 2nd order derivatives log-likelihood

4 LIV implementation issues
  4.1 Introduction
  4.2 Additional regressors and identifiability
  4.3 Investigating observed instrumental variables
    4.3.1 Testing for weak instruments
    4.3.2 Testing for endogenous instruments
  4.4 A simulation study
    4.4.1 Results for the regression parameters
    4.4.2 Results test $H_0$: observed IV is exogenous
    4.4.3 Results test $H_0$: observed IV has no effect on $x$
    4.4.4 Concluding remarks simulation study
  4.5 LIV model diagnostics
    4.5.1 Selection of the number of categories of the discrete instrument
    4.5.2 Residuals, outliers, and influential observations
  4.6 The Hausman-LIV test revised
  4.7 Conclusions
  Appendix 4A 1st and 2nd order derivatives log-likelihood of the general LIV model
  Appendix 4B Simulation results for the exogenous regressor

5 Estimating the return to education using LIV
  5.1 Introduction
  5.2 Sources of bias in the OLS estimate of the return to education
    5.2.1 Ability bias
    5.2.2 Measurement error bias
    5.2.3 Heterogeneity bias
    5.2.4 Optimizing behavior bias
  5.3 IV estimation of the returns to education
    5.3.1 Institutional features of the schooling system
    5.3.2 Family background
    5.3.3 Alternative, non-IV approaches
  5.4 Empirical results
    5.4.1 Data description
    5.4.2 LIV results for schooling
    5.4.3 Relative biases and comparison with classical IV
    5.4.4 Wrap-up
  5.5 Conclusions
  Appendix 5A Descriptive statistics datasets used
    5A.1 NLSY data
    5A.2 Brabant data
    5A.3 PSID data
  Appendix 5B Results optimal LIV model for the three datasets

6 Regressor and random-effects dependencies
  6.1 Introduction
  6.2 Biases caused by level-1 ($X\alpha$)– and level-2 ($X\eta$)–dependencies
  6.3 The case of level-2 ($X\alpha$) dependencies only
    6.3.1 Testing for $X\alpha$–dependencies
    6.3.2 Mundlak’s approach for $X\alpha$–dependencies
    6.3.3 The Hausman-Taylor estimator under $X\alpha$–dependencies
  6.4 Limitations in the presence of level-1 ($X\eta$)–dependencies
  6.5 Testing and solving for $X\eta$–dependencies
    6.5.1 External Instruments
    6.5.2 Internal instruments: Lewbel’s approach
  6.6 Discussion and future research
  Appendix 6A Classical instrumental variables (IV) estimation
  Appendix 6B Estimation for the hierarchical linear model
  Appendix 6C Lewbel’s instruments in a simple multilevel model

7 A Nonparametric Bayesian LIV approach
  7.1 Introduction
  7.2 A simple multilevel model with a general latent instrument
    7.2.1 The Dirichlet process prior for $\theta_{ij}$
    7.2.2 MCMC estimation
  7.3 Endogenous subject-level covariates and random coefficients
    7.3.1 Estimating the hierarchical model with general latent instruments
  7.4 A simulation study
    7.4.1 Simulation results for the simple multilevel model
    7.4.2 Simulation results for the hierarchical model
  7.5 Discussion nonparametric Bayesian LIV approach
  Appendix 7A The Dirichlet process
  Appendix 7B Full conditionals: the simple multilevel model with general LIV
  Appendix 7C Full conditionals: the hierarchical model with general LIV
  Appendix 7D Iteration plots

8 Discussion
  8.1 Summary and conclusions
  8.2 Limitations and future research
    8.2.1 Methodological (technical) issues
    8.2.2 Substantive issues

Bibliography
Author index
Subject index
Samenvatting (summary in Dutch)

Chapter 1

Introduction

In this thesis we propose a new method to estimate regression coefficients in linear regression models where regressor-error correlations are likely to be present. This method, the Latent Instrumental Variables (LIV) method, utilizes a discrete latent variable model that accounts for dependencies between regressors and the error term. As a result, observed exogenous instrumental variables are not required. In the following chapters we introduce and illustrate the LIV method on both simulated data and empirical applications. We show that the LIV method has desirable properties over existing methods, such as ordinary regression and instrumental variables methods, when regressor-error dependencies are present. Each chapter is more or less self-contained and based on articles. In the following we present the scope and outline of the thesis.

The starting point of this research is the simple linear regression model given by

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \tag{1.1} $$

where $y_i$ is the dependent variable, $x_i$ the explanatory variable (regressor), and $\varepsilon_i$ is the error term or disturbance with mean zero and variance $\sigma^2$, all independent. The regression parameters $\beta_0$ and $\beta_1$ are the objects of inference. We focus on a situation where the regressor is random and possibly correlated with the disturbance¹, in which case it is not ‘exogenous’ but ‘endogenous’. Regressor-error correlations may be the result of several causes and arise in a wide variety of models, e.g. when relevant explanatory variables are omitted, when the dependent variable influences the explanatory variable (simultaneity), when the sampling process is non-random (self-selection), or when the explanatory variable is measured with error.
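To make the consequence of such a correlation concrete, the following sketch works out the probability limit of the OLS slope in model (1.1). It is an illustration added here, not a derivation taken from the thesis: it assumes i.i.d. observations with finite second moments and $\operatorname{var}(x_i) > 0$, and the symbols $\hat\beta_1$, $\bar x$, and $\bar y$ (the OLS slope and the sample means) are introduced only for this purpose.

```latex
\hat\beta_1
  = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}
  = \beta_1
    + \frac{\tfrac{1}{n}\sum_i (x_i - \bar x)\,\varepsilon_i}
           {\tfrac{1}{n}\sum_i (x_i - \bar x)^2}
  \;\xrightarrow{\;p\;}\;
  \beta_1 + \frac{\operatorname{cov}(x_i, \varepsilon_i)}{\operatorname{var}(x_i)} .
```

Under these assumptions the second term vanishes only if $\operatorname{cov}(x_i, \varepsilon_i) = 0$. With an endogenous regressor it does not shrink as the sample grows, so collecting more data does not remove the error in the OLS slope.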

The standard inferential methods are invalid if regressor-error dependencies exist. For instance, the ordinary least squares estimator for the regression parameters $(\beta_0, \beta_1)$ suffers from inconsistency, in which case the true effect of the explanatory variable on the dependent variable is systematically over- or underestimated, leading to false conclusions and erroneous decision making. The instrumental variables (IV) methods were developed to overcome these problems and have a long history in econometrics (Bowden and Turkington, 1984, Greene, 2000, or Judge et al., 1985). Instruments $z$ are variables that mimic the endogenous regressor $x$ as well as possible, but are uncorrelated² with the error term $\varepsilon$. Once ‘valid’ instruments are available, the regression parameters can be consistently estimated via, for instance, two-stage least squares techniques. However, finding exogenous instruments is hard work, and empirical researchers are often confronted with weak instruments. An instrument is ‘weak’ when it only weakly correlates with the endogenous regressors. If instruments are weak and/or not exogenous, the standard instrumental variables estimation and inferential procedures are inaccurate and produce “bad results” that are potentially worse than simply ignoring the endogeneity problem and relying on biased ordinary least squares. Hence, small biases in ordinary least squares estimates can become large biases when invalid instruments are used (Stock, Wright, and Yogo, 2002, or Hahn and Hausman, 2003)³. Besides the problems of potentially weak and/or endogenous instruments, these variables may simply not be available to a researcher, whereas collecting them is time consuming and expensive. The main purpose of this research is to develop a new method (the latent instrumental variables (LIV) method) that does not require observed instrumental variables. As such, the difficult task of finding instruments and the inferential issues in the presence of bad quality instruments are circumvented. In fact, the ‘optimal’ LIV instruments are estimated as a “by-product” from the available data.

¹ At least in the weak sense that $\operatorname{plim} \sum_i x_i \varepsilon_i \neq 0$, or that $E(x_i \varepsilon_i) \neq 0$, implying $E(\varepsilon_i \mid x_i) \neq 0$; see e.g. White (2001) or Ferguson (1996).

² The instrument is said to be ‘exogenous’.

³ This was already observed by Sargan in the 1950s, see e.g. Arellano (2002).
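The contrast between OLS, IV with a strong instrument, and IV with a weak instrument can be illustrated with a small simulation. The sketch below is not taken from the thesis: the data-generating process (a single standard normal instrument `z`, a first-stage coefficient `gamma`, and an error correlation `rho`), the parameter values, and all function names are assumptions chosen for illustration. With one instrument, the ratio cov(z, y)/cov(z, x) used here coincides with the two-stage least squares slope.

```python
import numpy as np

def draw_sample(n, gamma, rng, beta0=1.0, beta1=1.0, rho=0.5):
    """Simulate y = beta0 + beta1*x + eps with x = gamma*z + nu and corr(eps, nu) = rho."""
    z = rng.standard_normal(n)                  # observed instrument, independent of eps
    nu = rng.standard_normal(n)
    eps = rho * nu + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    x = gamma * z + nu                          # endogenous: nu and eps are correlated
    y = beta0 + beta1 * x + eps
    return y, x, z

def ols_slope(y, x):
    """OLS slope of y on x (with intercept)."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

def iv_slope(y, x, z):
    """Simple IV slope using a single instrument z."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

def iqr(a):
    """Interquartile range, a spread measure that tolerates heavy tails."""
    q75, q25 = np.percentile(a, [75, 25])
    return q75 - q25

rng = np.random.default_rng(0)

# One large sample with a strong instrument: OLS is off, IV is close to beta1 = 1.
y, x, z = draw_sample(50_000, gamma=1.0, rng=rng)
beta1_ols = ols_slope(y, x)
beta1_iv = iv_slope(y, x, z)

def iv_draws(gamma, reps=500, n=500):
    """IV slope estimates over repeated samples of size n."""
    return np.array([iv_slope(*draw_sample(n, gamma, rng)) for _ in range(reps)])

# Repeated small samples: the IV estimator becomes erratic when gamma is tiny.
iqr_strong = iqr(iv_draws(gamma=1.0))
iqr_weak = iqr(iv_draws(gamma=0.02))

print(f"OLS slope: {beta1_ols:.3f}, IV slope (strong instrument): {beta1_iv:.3f}")
print(f"IQR of IV slope, strong vs weak instrument: {iqr_strong:.3f} vs {iqr_weak:.3f}")
```

In this assumed design the OLS slope converges to 1 + rho/(gamma² + 1) rather than to 1. The interquartile range is used instead of the variance because the weak-instrument IV estimator has very heavy tails.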

The above discussion on the problems surrounding instrumental variables estimation is considered in greater detail in chapter 2. The literature review presented in that chapter covers most of the recent studies on weak instruments and contains several references to empirical research (labor economics, marketing, industrial economics) that aims at solving regressor-error dependencies. Furthermore, we point out a few alternative approaches to instrumental variables estimation that may be useful in solving regressor-error dependencies. This overview of the literature is a selection of issues that motivates the development of the latent instrumental variables (LIV) method. We conclude chapter 2 by highlighting the relevance and contribution of this research.

In chapter 3 we introduce the latent instrumental variable (LIV) model. It solves regressor-error correlations in linear models by postulating that the instrumental variable is discrete and latent. As a byproduct, the method allows for testing for endogeneity without requiring access to observable instruments. Our simulation results show that the LIV method yields consistent estimates for the model parameters without having observable instrumental variables at hand. These results are superior to OLS estimates, which are biased when the regressors are not exogenous. The proposed test statistic to test for exogeneity is shown to have reasonable power throughout a wide range of settings. Furthermore, we prove identifiability of all model parameters. We apply the LIV method to an empirical measurement error application where a laboratory dummy instrumental variable is available. We show that the predicted LIV dummy instrument is identical to this observed laboratory instrument. Hence, the LIV estimate for the regression parameter, without using the observed instrument, is identical to the classical IV estimate that does require the existence of an observed instrument. We conclude that our ‘instrument-free’ approach can be successfully used to estimate regression parameters in the presence of regressor-error correlations, and to test for this dependency without the necessity of first finding valid instruments.
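As an illustration of the idea of a discrete latent instrument, the sketch below simulates data from a two-category version of such a model and recovers the slope without using any observed instrument. Everything specific here is an assumption made for illustration, not the thesis's own specification or estimator (those are developed in chapter 3): the regressor is written as a latent group mean plus an error, the two errors are taken to be bivariate normal, and the function names, parameter values, and plain EM routine are hypothetical. Under these assumptions the pairs (y, x) follow a mixture of two bivariate normals with a common covariance matrix whose component means lie on the regression line, so the slope between the two fitted means estimates the regression slope.

```python
import numpy as np

def simulate_latent_iv(n, beta0, beta1, pi, lam, rho, rng):
    """Draw (y, x) with x = pi[group] + nu and y = beta0 + beta1*x + eps (assumed form)."""
    group = rng.random(n) < lam                 # latent membership, never used in estimation
    cov = [[1.0, rho], [rho, 1.0]]              # unit variances, corr(eps, nu) = rho
    eps, nu = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    x = np.where(group, pi[1], pi[0]) + nu
    y = beta0 + beta1 * x + eps
    return y, x

def fit_two_group_mixture(y, x, n_iter=200):
    """EM for a two-component bivariate normal mixture with a shared covariance matrix."""
    d = np.column_stack([y, x])
    n = len(d)
    # Crude starting values: split the sample at the median of x.
    resp = np.zeros((n, 2))
    upper = x > np.median(x)
    resp[upper, 1] = 1.0
    resp[~upper, 0] = 1.0
    for _ in range(n_iter):
        # M-step: group weights, group means, pooled covariance.
        w = resp.sum(axis=0)
        mu = (resp.T @ d) / w[:, None]          # row j holds (mean of y, mean of x) in group j
        S = np.zeros((2, 2))
        for j in range(2):
            c = d - mu[j]
            S += (resp[:, j, None] * c).T @ c
        S /= n
        # E-step: posterior membership probabilities (log-determinant cancels, shared S).
        S_inv = np.linalg.inv(S)
        logp = np.empty((n, 2))
        for j in range(2):
            c = d - mu[j]
            logp[:, j] = np.log(w[j] / n) - 0.5 * np.einsum("ij,jk,ik->i", c, S_inv, c)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
    return mu, S, w / n

rng = np.random.default_rng(1)
y, x = simulate_latent_iv(20_000, beta0=0.5, beta1=1.0, pi=(0.0, 4.0), lam=0.5, rho=0.5, rng=rng)

c = np.cov(x, y)
beta1_ols = c[0, 1] / c[0, 0]                   # ignores the regressor-error correlation

mu, S, lam_hat = fit_two_group_mixture(y, x)
beta1_latent = (mu[1, 0] - mu[0, 0]) / (mu[1, 1] - mu[0, 1])   # slope between fitted means

print(f"OLS slope: {beta1_ols:.3f}, latent-instrument slope: {beta1_latent:.3f}")
```

The two groups are deliberately well separated in this example so that the simple median-split start is adequate. With overlapping groups, a sketch like this would need multiple starting values and a convergence check.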

The method proposed in chapter 3 is extended in chapter 4 to more general settings. We extend the model to a situation where several exogenous regressors are available. Furthermore, we allow for the possibility that observed instrumental variables are available. Using techniques similar to those for the simpler LIV model, we prove that all model parameters can be identified. Importantly, from this proof it follows that the general LIV model is still identified, even when possible observed instruments have no or very small effects on the endogenous regressor. In such a case, the classical IV model is unidentified or weakly identified, respectively. This identifiability result suggests a straightforward approach to examine instrument weakness that is based on existing testing principles. Furthermore, using similar reasoning, it suggests a straightforward test of instrument exogeneity (validity). To the best of our knowledge, such tests to independently investigate instrument exogeneity and weakness for each instrument have not appeared in the literature before. We illustrate both tests by means of a simulation example and show that the proposed tests have reasonable power under a variety of settings. In addition, we propose several diagnostics to complete an LIV analysis. We propose several statistics to choose the number of categories of the discrete LIV instrument. Furthermore, we examine the robustness of the LIV estimates towards misspecification of the likelihood equation and suggest how to examine residuals. We adapt standard methods from regression models to detect outliers and influential observations.

The proposed LIV model, tests, and diagnostics are applied in chapter 5. We examine the effect of education on income, where the variable ‘education’ is potentially endogenous due to omitted ‘ability’ or other causes. We review part of the schooling literature and discuss the problems associated with classical instrumental variables estimation. As will become clear, the classical IV method has produced a less than satisfactory solution in estimating the return to education. Importantly, researchers who use different sets of instruments arrive at different conclusions in terms of size and magnitude of the bias found in the OLS estimate for the return to education. We examine three empirical datasets. In all three applications, we find an upward bias in the OLS estimates of approximately 7%. Our conclusions agree closely with recent results obtained in studies with twins that find an upward bias in OLS of about 10% (Card, 1999). Diagnostic evaluations demonstrate that the LIV method provides a satisfactory fit of the data. We also find that for each of the three datasets the classical IV estimates for the return to education point to biases in OLS that are not consistent in terms of size and magnitude. The proposed diagnostics and tests to examine the validity of available observed instruments indicate that in two of the three datasets the instruments used are potentially weak and/or endogenous. Our conclusion is that LIV estimates are preferable to the classical OLS and IV estimates in understanding the effects of education on income.

In chapter 6 we consider endogeneity problems in multilevel models, i.e. when data have a hierarchical structure. As before, the explanatory variables are assumed to be independent of the random components at various levels. However, in many applications this is an unrealistic assumption. When the same cross-section units are observed over time, for instance, or when data on siblings or twins are available, multilevel models may in fact be used to solve regressor-error correlations at a lower level. In this chapter we show that much care is required in relying on these methods in actual applications. First, we review methods that can be used to test for different types of dependencies between random effects and regressors. Secondly, we present results from Monte Carlo studies designed to investigate the performance of these methods, and, finally, we discuss estimation methods that can be used when some, but not all, of the independence assumptions between random effects and regressors are violated. Because current methods are limited in various ways, we will also present a list of open problems and suggest solutions for some of them. As we will show, the issue of independence between regressors and random effects has received some attention in the econometrics literature, but this important work has had little impact on current research practices in the social and behavioral sciences.

In chapter 7 we take parts of the results of chapter 6 a step further and develop sophisticated nonparametric Bayesian methods (Dey, Muller and Sinha, 1998) to solve regressor-error dependencies in multilevel models at various stages of the model. This method solves some of the problems addressed in chapter 6 and is a generalization of the standard LIV model in the sense that we do not impose restrictions (discreteness) on the distribution of the instruments. In fact, we let the data determine the best distribution. This is an important advantage as it does not require an a priori specification of the ‘right’ number of categories of the unobserved discrete instrument. Because we take full advantage of Bayesian estimation methods, the proposed model can readily be adapted and extended to more general and more complex model structures. Furthermore, insight into small sample properties of the estimation results is more easily obtained and inference does not rely on asymptotic results. This chapter is still work-in-progress and the results are preliminary, yet promising. We illustrate the potential usefulness of this approach to regressor-error dependencies and suggest steps for further research.

In chapter 8 we present a discussion of the proposed LIV method and the results found. Furthermore, we present future research directions.

Chapter 2

Instrumental variables: a survey

2.1 Introduction and bias in OLS

The standard linear regression model $y = X\beta + \varepsilon$ is an important tool in (applied) statistical science to model the effect of a set of explanatory variables on a dependent variable. Here $y = (y_1, \ldots, y_n)'$ denotes the $n \times 1$ vector of observations on the dependent variable, $X \in \mathbb{R}^{n \times k}$ denotes the $n \times k$ matrix of observations on the explanatory variables (regressors), $\beta$ is the unknown $k \times 1$ vector of regression parameters, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ is an unobserved stochastic disturbance. For identifiability it is assumed that $\operatorname{rank} X = k < n$. Although the standard linear regression model is frequently used in cross-sectional applications, in many situations data have a hierarchical structure (see also chapter 6). For instance, when it is investigated how workplace characteristics affect a worker’s productivity, both workers and firms are units in the analysis. Similarly, hierarchical data arise in the context of panel data, when multiple observations are available on the ‘objects’ under study. This type of data is modeled through multilevel models, panel data models, or hierarchical linear models, which generalize the standard linear regression model (Judge et al., 1985, Wooldridge, 2002, Snijders and Bosker, 1999, Bryk and Raudenbush, 1992, or Greene, 2000).

An important assumption in these models is the independence of the explanatory variables ($X$) and the random error components ($\varepsilon$). In this case the regressors are said to be ‘exogenous’ and are assumed to be determined outside the model. Failure of this assumption may lead to biased or inconsistent estimates for the parameters of interest and therefore to wrong conclusions and erroneous decision-making.

Unfortunately, in many situations the assumption of independence between regressors and errors is not satisfied. In this case the regressors are often said to be ‘endogenous’. Endogeneity can arise from a number of different sources: (1) relevant omitted variables, (2) measurement error in the regressors, (3) the problem of self-selection, (4) simultaneity, and (5) serially correlated errors in the presence of a lagged dependent variable in the set of regressors. Ruud (2000) shows that possibilities (2)-(5) can be viewed as special cases of (1). A similar argument is put forward by Wooldridge (2002), who notes that the distinction among the possible causes of regressor-error correlation is not always clear. Card (1999), for instance, argues that measurement error in the education variable¹, on the one hand, results in a downward bias of the effect of education on income, whereas omitted ability may, on the other hand, result in a positive bias in OLS. Similarly, Nevo (2000) states that price endogeneity can be generated by a price-setting firm taking unobserved product attributes into account, or can be a result of the mechanics of the consumer’s optimization problem. These causes may reinforce or offset each other to an extent that depends on the empirical context. In the following subsections we briefly illustrate the previously mentioned causes and provide some references to empirical studies in labor economics, marketing, and industrial economics that are confronted with these problems.

2.1.1 Relevant omitted explanatory variables

Card (1999, 2001) and Uusitalo (1999), among others, consider the estimation of the causal effect of education on earnings, where ability is a typical omitted variable. Individuals with a higher ability are potentially more successful on the labor market by earning higher wages, and these individuals may also acquire more education. As such, unobserved ability affects both education and earnings, causing a dependency between the regressor ‘education’ and the model error term (see also chapter 5).

¹ ‘Education’ is usually measured by ‘years of schooling’, see also chapter 5.

Marketing modelers are often faced with omitted variables. Wansbeek and Wedel (1999) put forward that the exogeneity assumption of regressors, including price, is a shortcoming of standard market response models². Shugan (2004) observes an increasing focus of reviewers on endogeneity. The lack of exogeneity of regressors due to the omission of ‘key’ aspects in marketing models is gaining more interest in marketing research studies. As store managers set the marketing mix variables (e.g. price or advertising variables), their decision is based on (local) market information or product characteristics unknown to the researcher. This unobserved information may affect consumer behavior, which induces a correlation between the error term and the regressors, usually price, in a typical marketing model. Examples of unobserved local market information are competition, word-of-mouth effects, taste changes, local market shares, or coupon availability. For recent omitted variables studies in marketing, see e.g. Villas-Boas and Winer (1999), Chintagunta (2001), Nevo (2001), Petrin and Train (2002), or Vilcassim and Chintagunta (1995).

An omitted variable model is given by (Judge et al., 1985)

$$E(y_i \mid x_i, w_i) = x_i'\beta + w_i'\gamma, \qquad (2.1)$$

where the $w_i$'s are the latent or unobserved variables. Conditioning on the observable $x_i$ but omitting $w_i$ gives

$$E(y_i \mid x_i) = x_i'\beta + E(w_i' \mid x_i)\gamma, \qquad (2.2)$$

2 Models that relate sales to marketing mix variables.


which is unequal to $x_i'\beta$ whenever (i) $E(w_i' \mid x_i) \neq 0$ (i.e. when omitted and included regressors are not orthogonal) and (ii) $\gamma \neq 0$ (i.e. when the omitted regressors are relevant). The resulting bias in the OLS estimator for $\beta$ is equal to $E(\hat\beta_n^{OLS} - \beta) = \Pi\gamma$, the magnitude and size of which depend on $\Pi = (X'X)^{-1}X'W$ and $\gamma$. As can be seen, all estimated coefficients in $\beta$ are affected by the omission of relevant explanatory variables.
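The bias expression above can be checked numerically. The following is a minimal sketch, assuming simulated data with two observed regressors and one omitted regressor; the function name `omitted_variable_demo` and all parameter values are hypothetical choices for illustration. It relies on the exact in-sample identity that the short-regression coefficients equal the long-regression coefficients on $X$ plus $\Pi$ times the long-regression coefficient on $W$.

```python
import numpy as np

def omitted_variable_demo(n=20_000, seed=0):
    """Compare the short regression (w omitted) with the long regression."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))                       # observed regressors
    # Omitted regressor, correlated with both observed regressors.
    w = x @ np.array([[0.7], [0.3]]) + rng.normal(size=(n, 1))
    beta, gamma = np.array([1.0, -0.5]), np.array([2.0])
    y = x @ beta + w @ gamma + rng.normal(size=n)

    # Long regression of y on (x, w) and short regression of y on x only.
    long_coef = np.linalg.lstsq(np.hstack([x, w]), y, rcond=None)[0]
    short_coef = np.linalg.lstsq(x, y, rcond=None)[0]
    # Pi = (X'X)^{-1} X'W: regression of the omitted on the included variables.
    Pi = np.linalg.solve(x.T @ x, x.T @ w)
    return short_coef, long_coef, Pi

if __name__ == "__main__":
    short_coef, long_coef, Pi = omitted_variable_demo()
    print("short regression:", short_coef)
    print("long regression + Pi * gamma_hat:", long_coef[:2] + Pi @ long_coef[2:])
```

In this design both coefficients of the short regression are shifted, because the omitted variable is correlated with both included regressors.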

2.1.2 Measurement error

Measurement error in regressors arises when the variables specified in the re-

gression model are not similar to the observed measure. This may arise, for

instance, due to method- or instrument-error, the absence of a ‘physical’ mea-

sure for the true construct, like IQ, ability, perceptions, ‘total price’ versus

‘money price’, or incorrectly aggregated and combined measures from differ-

ent data-sources, like GDP, price inflation or productivity of employees. When

the regressors used do not conform to the variables included in regression mod-

els, it is unlikely that they are independent of the random components.

As stated before, a good measure of education that corresponds to the qualities

that employers are willing to pay for, needs to be available when estimating the

effect of education on income. It is common practice to use ‘years of schooling completed’ as a measure for ‘total education’. Apart from recall or recording errors in ‘years of schooling completed’, it can be questioned whether this measure fully represents education levels, because individuals may, for instance, educate themselves with evening courses or on-the-job

training. Besides, as most studies on labor economics rely on household in-

terview data, all of the variables are subject to some error (Griliches, 1977).

Even if the errors are small, their effect may be magnified if more variables are

added in an attempt to control for e.g. omitted ability bias3 (Card, 1999, 2001,

or Griliches, 1977).

Nevo (2000) and Sudhir (2001) argue that the measure for price used in es-

3See also subsection 5.2.2.


timating aggregate logit demand in inferring competition may be measured

with error. The price variable used in these studies is often ‘list price’ or an

aggregated price measure, whereas the model specification assumes that all

consumers face the same product characteristics. However, if consumers face

different prices in different stores, regions, or weeks, depending on the data,

the price measure used exhibits measurement error. Instead, it would be ideal

to estimate the model with transaction prices (cf. Sudhir, 2001). Bagozzi, Yi

and Nassen (1999) explore measurement error in marketing research data. For

instance, questionnaire items or rating scales that are used to measure percep-

tions, beliefs, attitudes, judgements, or other theoretical constructs are likely to

reflect measurement error because of the absence of physical measures corre-

sponding to these variables. Besides, marketing research data may be subject

to method errors like halo effects4, interviewer effects, or social desirability

distortions. Their findings suggest that measurement error in marketing data

may be large and needs to be corrected for in empirical applications to improve

decision making and inferences.

We illustrate the problem of measurement error in the simple bivariate case. Consider

$$y_i = \beta_0 + \beta_1\chi_i + \varepsilon_i,$$

where $\chi_i$ is the ‘true’ unobserved construct. Instead, $x_i$ is observed and $x_i = \chi_i + \nu_i$, with $E(\varepsilon_i) = E(\nu_i) = 0$, $E(\varepsilon_i^2) = \sigma_\varepsilon^2 > 0$, $E(\nu_i^2) = \sigma_\nu^2 > 0$, and $E(\varepsilon_i\nu_i) = E(\chi_i\varepsilon_i) = E(\chi_i\nu_i) = 0$. These two equations can be combined, giving $y_i = \beta_0 + \beta_1 x_i + \tilde\varepsilon_i$, with $\tilde\varepsilon_i = \varepsilon_i - \beta_1\nu_i$. The OLS estimator for $\beta_1$ is biased towards zero as $E(\tilde\varepsilon_i x_i) = -\beta_1\sigma_\nu^2 \neq 0$, which implies that $E(\tilde\varepsilon_i \mid x_i) \neq 0$. For more details, see e.g. Plat (1988), Wansbeek and Meijer (2000), or Carroll, Ruppert and Stefanski (1995).
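The attenuation towards zero can be illustrated with a small simulation. This is a sketch under stated assumptions: normally distributed construct and errors, and hypothetical parameter values. Dividing the covariance between the composite error and $x_i$, $-\beta_1\sigma_\nu^2$, by the variance of $x_i$ gives the standard large-sample OLS slope $\beta_1\sigma_\chi^2/(\sigma_\chi^2 + \sigma_\nu^2)$, where $\sigma_\chi^2$ (not named in the text) denotes the variance of the true construct.

```python
import numpy as np

def attenuation_demo(beta1=2.0, var_chi=1.0, var_nu=0.5, n=200_000, seed=1):
    """OLS slope of y on the error-ridden x, and its large-sample value."""
    rng = np.random.default_rng(seed)
    chi = rng.normal(scale=np.sqrt(var_chi), size=n)     # true construct
    x = chi + rng.normal(scale=np.sqrt(var_nu), size=n)  # observed with error
    y = 1.0 + beta1 * chi + rng.normal(size=n)
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    # Large-sample OLS slope: beta1 * var_chi / (var_chi + var_nu).
    expected = beta1 * var_chi / (var_chi + var_nu)
    return slope, expected

if __name__ == "__main__":
    slope, expected = attenuation_demo()
    print(f"OLS slope {slope:.3f}, large-sample value {expected:.3f}, true beta1 2.0")
```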

4 A problem that arises in data collection when there is carry-over from one judgement to another (source: www.marketingpower.com).


2.1.3 Self-selection

The problem of self-selection arises when individuals tend to select themselves

in a certain state, like union vs. non-union member (Vella and Verbeek, 1998),

or treated vs. not treated (Angrist, Imbens and Rubin, 1996), on the basis of

economic, or other, usually unknown, arguments. For instance, Angrist (1990)

considers the effect of Vietnam veteran status on civilian income to investigate

whether these veterans ought to be compensated by the US government for

their possible loss of personal income caused by serving in the army. However,

civilian earnings are not easily compared by Vietnam veteran status simply be-

cause certain individuals with fewer civilian opportunities are more likely to

enlist than others, and such individuals would have earned less income regardless
of serving in the army.

Hamilton and Nickerson (2003) give an overview of endogenous decision-

making in strategic management, where managers often make strategic orga-

nizational choices between several competing strategies not ‘randomly’ but

based on expectations and experience. Similarly, data collected on the internet

may suffer from self-selection. Certain individuals are more likely to be on

the internet and are therefore more likely to fill in the web survey, to click on

the web pages or to purchase products online. If these unobserved individual

characteristics also influence web behavior, preferences or perceptions5, then

part of the effect of these latent characteristics is falsely attributed to internet

usage. One could argue that these individuals would have reacted differently

regardless of their frequency of being on the internet. These issues are important, for instance, when investigating purchase quantity decisions in online

stores versus brick-and-mortar (“offline”) stores, whether or not to buy a cer-

tain product category, or whether or not to buy a certain brand, given category

and shopping environment.

To illustrate, a simple self-selection model is given by

5Or our phenomenon under study.


$$y_i = \begin{cases} x_i'(\beta + \delta) + \varepsilon_i & \text{if } i \in I, \\ x_i'\beta + \varepsilon_i & \text{if } i \in II, \end{cases}$$

where I and II denote a certain state (e.g. treated vs. non-treated, or web-user versus non-web-user). More compactly,

$$y_i = x_i'\beta + d_i x_i'\delta + \varepsilon_i,$$

with $d_i = 1$ if $i \in I$ and $d_i = 0$ otherwise. From this representation it can be seen that $d_i$ is a dummy regressor and standard estimation fails when $E(\varepsilon_i \mid d_i) \neq 0$. The assumption $E(\varepsilon_i \mid d_i) = 0$ is possibly violated for the examples given in

the previous paragraph. For more details on self-selection problems, see e.g.

Vella (1998), Wooldridge (2002), or Bowden and Turkington (1984).
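A minimal sketch of the self-selection problem follows. It assumes the simplest version of the model above, with $x_i$ reduced to a constant, and a hypothetical selection rule in which individuals with a low unobserved component $\varepsilon_i$ are more likely to end up in state I. The function name and parameter values are illustrative only. With an intercept and a dummy, the OLS coefficient on $d_i$ is the difference in group means, so it picks up the difference in the mean of $\varepsilon_i$ between the two states.

```python
import numpy as np

def self_selection_demo(delta=0.0, n=100_000, seed=2):
    """OLS estimate of delta when state membership depends on eps."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=n)                      # unobserved component of y
    # Hypothetical selection rule: low eps makes state I (d = 1) more likely.
    d = (-eps + rng.normal(size=n) > 0).astype(float)
    y = 1.0 + delta * d + eps
    X = np.column_stack([np.ones(n), d])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef[1], eps[d == 1].mean(), eps[d == 0].mean()

if __name__ == "__main__":
    delta_hat, mean_eps_I, mean_eps_II = self_selection_demo()
    print(f"true delta 0.0, OLS estimate {delta_hat:.3f}")
    print(f"mean eps in state I {mean_eps_I:.3f}, in state II {mean_eps_II:.3f}")
```

Although the true effect is zero in this design, the estimate is clearly negative, which mirrors the veteran status example: the difference is attributed to the state rather than to who selects into it.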

2.1.4 The simultaneous equation model

Ordinary (or hierarchical) regression analysis will not be appropriate when

the right-hand variables are simultaneously determined along with the depen-

dent variables. However, it is often hard to rule out such feedback loops. Ex-

amples are an economic agent making choices regarding education or labor

market participation (Card, 1999, 2001) or the price setting behavior of firms

while interacting with competition. Several studies consider simultaneity in

prices and demand for markets with differentiated products, given a structure

for competition. The price-setting behavior of firms due to e.g. unobserved

product characteristics like coupon availability, national advertising, shelf-

space (al)location, and other retail environment characteristics, or competitor’s

(re)actions causes endogeneity. Berry’s (1994) work in dealing with price en-

dogeneity in aggregated models while using instrumental variables has been

widely applied and adapted. For instance, Nevo (2001) estimates a structural

demand-supply model for the ready-to-eat cereal industry; Besanko, Gupta and

Jain (1998) consider a scanner-data application for the two categories yoghurt

and catsup; Berry, Levinsohn and Pakes (1995) and Sudhir (2001) develop a

market equilibrium model with competitive pricing for an automobile market

to investigate automobile pricing and competition. Using a simpler model

for demand, Gasmi, Laffont and Vuong (1992) model collusive behavior on


price and advertising in a soft-drink market.

A simple supply and demand model for a product or good is given by

$$y_t^d = (x_t^d)'\beta^d + \gamma^d p_t + \varepsilon_t^d$$

$$y_t^s = (x_t^s)'\beta^s + \gamma^s p_t + \varepsilon_t^s,$$

where variables in $x_t^d$ are factors that affect the demand or behavior of consumers, whereas the variables $x_t^s$ only influence the behavior of producers. The price $p_t$ is determined such that $y_t^d = y_t^s = y_t$. When the demand equation $y_t^d = (x_t^d)'\beta^d + \gamma^d p_t + \varepsilon_t^d$ is estimated, it cannot be assumed that $E(\varepsilon_t^d \mid p_t) = 0$, because price is simultaneously determined with the demanded quantity, i.e. unobserved positive shocks in demand or competitor (re)actions shift the demand curve upward, implying a higher equilibrium price (ceteris paribus) (van der Ploeg, 1997, or Asher, 1983). In this case, OLS cannot be used to estimate

the parameters of the demand equation. For more technical details, see e.g.

Judge et al. (1985) or Davidson and MacKinnon (1993).
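The simultaneity argument can be made concrete by solving the two equations for the market-clearing price. The sketch below assumes constant intercepts in place of $(x_t^d)'\beta^d$ and $(x_t^s)'\beta^s$, independent standard normal shocks, and hypothetical slopes $\gamma^d = -1$ and $\gamma^s = 1$. Under these assumptions the equilibrium price is $p_t = (a^d - a^s + \varepsilon_t^d - \varepsilon_t^s)/(\gamma^s - \gamma^d)$, so the price is positively correlated with the demand shock.

```python
import numpy as np

def simultaneity_demo(gamma_d=-1.0, gamma_s=1.0, n=200_000, seed=3):
    """Covariance of price with the demand shock, and OLS demand slope."""
    rng = np.random.default_rng(seed)
    e_d = rng.normal(size=n)       # demand shocks
    e_s = rng.normal(size=n)       # supply shocks
    a_d, a_s = 10.0, 2.0           # intercepts stand in for (x_t)'beta
    # Market clearing: a_d + gamma_d * p + e_d = a_s + gamma_s * p + e_s.
    p = (a_d - a_s + e_d - e_s) / (gamma_s - gamma_d)
    y = a_d + gamma_d * p + e_d    # traded quantity
    cov_p_ed = np.cov(p, e_d)[0, 1]
    ols_slope = np.cov(p, y)[0, 1] / np.var(p, ddof=1)
    return cov_p_ed, ols_slope

if __name__ == "__main__":
    cov_p_ed, ols_slope = simultaneity_demo()
    print(f"cov(p, demand shock) = {cov_p_ed:.3f}")
    print(f"OLS slope of quantity on price = {ols_slope:.3f} (true gamma_d = -1.0)")
```

With equal shock variances and these slopes, the OLS slope is pulled from -1 to roughly 0, a mixture of the demand and supply slopes.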

2.1.5 Lagged dependent variables

The presence of lagged dependent variables in the set of regressors violates the

exogeneity assumption when serial correlation is present. It is well known that

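OLS estimation should not be used in that case; see, for instance, White (2001). Consider

$$y_t = x_t'\beta_1 + y_{t-1}\beta_2 + \varepsilon_t$$

$$\varepsilon_t = \phi\varepsilon_{t-1} + v_t, \qquad (2.3)$$

where e.g. $y_t$ are the sales at time $t$, $x_t$ are promotional activities at time $t$, and $y_{t-1}$ is included to represent lagged effects of promotional activities held in the past. Suppose that the $v_t$ are i.i.d., $|\phi| < 1$, $|\beta_2| < 1$, $E(v_t) = 0$, and $v_t$ is independent of $y_{t-1}$ and $x_t$, and assume that all second order moments exist. Now $\varepsilon_t y_{t-1} = \phi\varepsilon_{t-1}y_{t-1} + v_t y_{t-1}$, so that $E(\varepsilon_t y_{t-1}) = \phi E(\varepsilon_{t-1}y_{t-1})$. Furthermore, $E(\varepsilon_t y_t) = x_t'\beta_1 E(\varepsilon_t) + \beta_2 E(y_{t-1}\varepsilon_t) + \mathrm{var}(\varepsilon_t)$. By using the stationarity of $\varepsilon_t$ it follows that

$$E(y_{t-1}\varepsilon_t) = \frac{\phi}{1 - \phi\beta_2}\,\mathrm{var}(\varepsilon_t),$$

and $E(\varepsilon_t \mid y_{t-1}) \neq 0$, unless $\phi = 0$. Davidson and MacKinnon (1993) (p.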

681) make a stronger statement and argue that the OLS estimator is biased in

all models when lagged dependent variables are present (yet consistent when

φ = 0).
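The moment derived above can be verified by simulation. The sketch below is illustrative only: it drops $x_t$ from (2.3) for simplicity, sets the variance of $v_t$ to one so that $\mathrm{var}(\varepsilon_t) = 1/(1 - \phi^2)$, and uses hypothetical values $\phi = \beta_2 = 0.5$. It also reports the OLS estimate of $\beta_2$ to show the direction of the resulting distortion.

```python
import numpy as np

def lagged_dependent_demo(phi=0.5, beta2=0.5, T=200_000, burn=1_000, seed=4):
    """Sample and theoretical E(y_{t-1} eps_t), plus the OLS estimate of beta2."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T + burn)
    eps = np.zeros(T + burn)
    y = np.zeros(T + burn)
    for t in range(1, T + burn):
        eps[t] = phi * eps[t - 1] + v[t]       # AR(1) errors
        y[t] = beta2 * y[t - 1] + eps[t]       # no x_t, for simplicity
    eps, y = eps[burn:], y[burn:]              # discard start-up values
    sample_moment = np.mean(y[:-1] * eps[1:])  # estimate of E(y_{t-1} eps_t)
    var_eps = 1.0 / (1.0 - phi ** 2)           # var(eps_t) when var(v_t) = 1
    theory = phi / (1.0 - phi * beta2) * var_eps
    ols_beta2 = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
    return sample_moment, theory, ols_beta2

if __name__ == "__main__":
    sample_moment, theory, ols_beta2 = lagged_dependent_demo()
    print(f"E(y_(t-1) eps_t): sample {sample_moment:.3f}, formula {theory:.3f}")
    print(f"OLS estimate of beta2: {ols_beta2:.3f} (true value 0.5)")
```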

In certain situations explanatory variables may ‘act’ as lagged dependent variables, which can easily be overlooked. This is illustrated by Gonul, Kim and Shi (2000), who examine the effect of sending out catalogues on the probability to buy products from that catalogue. The mailing variable and other customer-shaped promotional activities are often functions of past sales, which implicitly introduces problems of the nature described above.

2.1.6 Bias in OLS when E(ε|X) ≠ 0

From the preceding subsections it can be concluded that regressor-error de-

pendencies may exist for many different applications. It follows immediately

that the OLS estimator, given by $\hat\beta_n^{OLS} = \beta + (X'X)^{-1}X'\varepsilon$, where $E(\varepsilon) = 0$, is biased when $E(\varepsilon \mid X) \neq 0$, and it loses its attractiveness as an estimator. Similarly, in the absence of heteroscedasticity and autocorrelation, the usual degrees-of-freedom-corrected estimator for the error variance that is based on the OLS residuals is unbiased when $E(\varepsilon \mid X) = 0$, see e.g. Verbeek (2000) (p. 19). Otherwise it can be expected that the true value is underestimated, since, on average, conditioning reduces the variance of the variable subject to the conditioning (cf. Greene, 2000) (p. 81).

Unfortunately, the bias in the OLS estimates does not reduce when the sample size gets larger. More specifically, the OLS estimates are inconsistent, and $\mathrm{plim}(\hat\beta_n^{OLS}) \neq \beta$ and $\mathrm{plim}(\hat\sigma^2_{n,OLS}) < \sigma^2$, but this inconsistency can be reduced,


at least in large samples, by using instrumental variables (White, 2001, or Fer-

guson, 1996). The instrumental variables (IV) approach is discussed next.
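The two claims of subsection 2.1.6, that the OLS bias does not vanish as the sample grows and that the residual-based variance estimator understates the error variance, can be illustrated as follows. This is a sketch under assumed values: a single regressor $x = z + v$ without intercept, an error with unit variance and covariance 0.8 with $v$, and $\beta = 1$. Under these assumptions the large-sample OLS value is $1 + 0.8/2 = 1.4$ and the large-sample variance estimate is $1 - 0.8^2/2 = 0.68$.

```python
import numpy as np

def ols_endogenous(n, rho=0.8, beta=1.0, seed=5):
    """OLS slope and df-corrected error variance with an endogenous regressor."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    v = rng.normal(size=n)
    x = z + v                                     # regressor, var(x) = 2
    # Error with var(eps) = 1 and cov(eps, v) = rho, hence cov(eps, x) = rho.
    eps = rho * v + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    y = beta * x + eps
    b = (x @ y) / (x @ x)                         # OLS without intercept
    resid = y - b * x
    sigma2_hat = resid @ resid / (n - 1)          # df-corrected variance estimate
    return b, sigma2_hat

if __name__ == "__main__":
    for n in (1_000, 200_000):
        b, s2 = ols_endogenous(n)
        print(f"n = {n}: OLS estimate {b:.3f} (beta = 1), sigma2_hat {s2:.3f} (sigma2 = 1)")
```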

2.2 The IV approach

The instrumental variable (IV) method assumes that a set of variables $Z$, called instrumental variables, is available. These instruments should be uncorrelated with the error term $\varepsilon$, i.e. $E(\varepsilon \mid Z) = 0$, and explain part of the variability in the endogenous regressors. This implies that the instruments $Z$ cannot have a direct effect on $y$ (the instruments $Z$ are ‘exogenous’). The standard IV regression model is obtained by augmenting the standard linear regression model with a model for the endogenous regressors and the instruments, namely

$$y = X\beta + \varepsilon$$

$$X = Z\Pi + V \qquad (2.4)$$

where $y$, $X$, and $\beta$ are defined as before, $Z$ is an $n \times q$ matrix containing the instrumental variables, and $V$ is an $n \times k$ matrix containing the error terms. The matrix $\Pi$ represents the effect of the instruments on the endogenous regressors. The exogenous variables in $X$ are assumed to appear in $Z$ as well and should not be omitted (Wooldridge, 2002). It is assumed for identifiability that $q \geq k$ and $\mathrm{rank}\,Z = q < n$. The correlation between $X$ and $\varepsilon$, i.e. the degree of endogeneity, arises because of nonzero covariances between $\varepsilon$ and $V$. The errors are assumed to have mean zero. It can be seen from (2.4) that

the endogenous regressors are ‘split’ into an exogenous part and an endoge-

nous part. This IV model is a special case of a simultaneous equation model

(SEM), which is well-known in econometrics. The most common estimators

for β are the 2SLS estimator (or a method of moments estimator) and the lim-

ited information maximum likelihood (LIML) estimator, which is in fact the

maximum likelihood estimator of (2.4). 2SLS is most frequently used because

of its availability in many standard computing packages.

Once instruments are available, the IV estimator is given by


$$\hat\beta_n^{IV} = (X'P_Z X)^{-1}X'P_Z y, \qquad (2.5)$$

where $P_Z = Z(Z'Z)^{-1}Z'$, and is consistent and approximately normally distributed for large $n$ when (i) $\mathrm{plim}\,(1/n)Z'\varepsilon = 0$, and (ii) both $\mathrm{plim}\,(1/n)Z'Z$ and $\mathrm{plim}\,(1/n)Z'X$ exist and have full column rank. Unbiasedness of the IV

estimator is discussed in the next subsection. One often relies on large-sample

analysis in examining this estimator because its expected value does not exist

when the number of instruments equals the number of explanatory variables

(cf. Wooldridge, 2002, p.101). Standard inferential procedures can be em-

ployed to learn about the model parameters or to test hypotheses (Bowden and

Turkington, 1984, or White, 2001). The maximum likelihood (LIML) estima-

tor can be computed with a little more effort and, provided that the instruments

are not too weak, the asymptotic properties of the 2SLS and LIML estimator

are the same (Davidson and MacKinnon, 1993, van der Ploeg, 1997, Kleiber-

gen and Zivot, 2003).
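The estimator in (2.5) is straightforward to compute. The sketch below is one possible implementation, not a reference one: the helper names `two_sls` and `simulate_iv_data` and the simulated design (two instruments, one endogenous regressor, no intercept) are assumptions made for illustration. It uses the fact that $P_Z$ is symmetric and idempotent, so $X'P_Z X = \hat X'X$ and $X'P_Z y = \hat X'y$ with $\hat X = P_Z X$ the first-stage fitted values, which avoids forming the $n \times n$ matrix $P_Z$.

```python
import numpy as np

def two_sls(y, X, Z):
    """IV/2SLS estimate (X'P_Z X)^{-1} X'P_Z y without forming the n x n P_Z."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # P_Z X: first-stage fit
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

def simulate_iv_data(n=20_000, beta=1.0, rho=0.8, seed=6):
    """Hypothetical design following (2.4) with q = 2 and k = 1."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 2))
    v = rng.normal(size=n)
    eps = rho * v + np.sqrt(1 - rho ** 2) * rng.normal(size=n)  # correlated with v
    X = (Z @ np.array([1.0, 0.5]) + v).reshape(-1, 1)
    y = X[:, 0] * beta + eps
    return y, X, Z

if __name__ == "__main__":
    y, X, Z = simulate_iv_data()
    print("OLS :", np.linalg.lstsq(X, y, rcond=None)[0])
    print("2SLS:", two_sls(y, X, Z))
```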

2.2.1 Considerations when using Instrumental Variables

The problem in empirical applications is how or where to find ‘valid’ instru-

ments. In general, there are no clear guidelines, and instruments may not be

easy to obtain. Besides, it can be very expensive to obtain additional data. As

such, instruments are often chosen by ad hoc arguments or even by availability,

resulting in potentially invalid instruments. The condition E(ε|Z) = 0 requires

that there is no direct association between the instruments and the dependent

variable, which is debatable in many empirical situations.

Wooldridge (2002, p.88), for instance, discusses the (in)validity of the draft

lottery number instrument used in Angrist (1990) to estimate the effect of Vietnam veteran status on personal income. Although the draft lottery number appears to be random, individuals who are more likely to get drafted may choose

to obtain more education to increase the chance of obtaining a draft postpone-

ment or employers may be more willing to invest in educating and training

individuals who are unlikely to be drafted. Bound, Jaeger and Baker (1995)


question the exogeneity of the quarter of birth instruments used by Angrist and

Krueger (1991) who estimate the effect of schooling on income. They present

evidence that a weak correlation between quarter of birth and wages, indepen-

dently of the effect of quarter of birth on education, exists that is sufficiently

strong to have an effect on the IV results. Card (1999, 2001) provides more

extensive summaries of debates on the validity of family background variables,

like parental education, and institutional features of the schooling system vari-

ables, like the presence of a nearby college, as instruments for the endogenous

regressor schooling, see also chapter 5. In estimating demand, lagged prices

or promotional variables are often used as instruments in marketing response

models, but these are not valid, for instance, when reference prices exist6 and

are historically formed (cf. Bronnenberg and Mahajan, 2001). Yang, Chen and

Allenby (2003) note that lagged prices may not be appropriate for reasons such as forward buying and stockpiling. Besides, treating lagged variables as ‘exogenous’ is a potential source of endogeneity itself (see also Arellano, 2002

(p.455)). Nevo (2001) used price data from other markets as instrumental vari-

ables for price, but notes that these instruments are invalid when common (na-

tional) demand shocks occur, or when advertising or promotion activities are

coordinated across markets. This is more likely when the same manufacturer

or retailer is active in several markets. Although cost drivers may be potential

instrumental variables for price, Nevo (2000) (p. 546) concludes that these are

rarely observed, while proxies for cost usually do not exhibit sufficient varia-

tion.

Exogeneity of instruments is only one of the two criteria for an instrument
to be valid. In addition, available instruments may be weak in the sense that

they are poorly correlated with the endogenous regressors. Stock, Wright and

Yogo (2002) state: “Empirical researchers often confront weak instruments.

Finding exogenous instruments is hard work, and the features that make an IV

6 The reference price is the ‘expected price’ of a product. Several studies have found asymmetric effects when the perceived price differs from the reference price. The effect of the reference price on demand depends (among other things) on the convenience during the buying process, on the familiarity of the brand, and on the type of store where the product is bought (Leeflang, 1994) (in Dutch).


plausibly exogenous [...] can also work to make the instrument weak”. Unfortunately, statistical properties of IV estimators and inferential procedures based

on these, turn out to be sensitive to the choice and validity of the instruments,

even for large sample sizes. Consequently, researchers who study the same

substantive question but use different instruments may reach different conclusions. In the following we will review some recent results on the problem of weak instruments that appeared in the econometric literature. Most of

the following discussion is developed for the linear (non-hierarchical) regres-

sion model for cross sectional studies, see Stock, Wright and Yogo (2002), and

Hahn and Hausman (2003) for more details7.

Weak instruments

Recent results in the econometric literature have shown that the presence of weak instruments does not only reduce the precision of the estimates, but may also lead to bias and inconsistency that are potentially larger than for OLS. Furthermore, standard asymptotic approximations break down (Staiger

and Stock, 1997, Bound, Jaeger and Baker, 1995, Hahn and Hausman, 2002 or

Kleibergen and Zivot, 2003). As a consequence, standard hypothesis tests and

confidence intervals are unreliable. Weak instruments may arise when the in-

struments do not have a high degree of explanatory power for the endogenous

regressors or when the number of instruments is large (cf. Hahn and Hausman,

2002, 2003). In the following we discuss three potential pitfalls with IV esti-

mation in the presence of weak instruments: (1) the finite sample bias of 2SLS,

(2) situations where the instruments are potentially correlated withε, i.e. they

are not exogenous, and (3) the poor asymptotic approximation to the sampling

distribution of IV estimators.

In finite samples the IV estimator (2SLS) is biased in the same direction as

OLS. This fact is often unnoted in empirical studies. Even when E(ε|Z) = 0,

7 In a survey article on Sargan’s work on instrumental variables estimation, Arellano (2002) observes that “Many of the themes [on instrumental variables estimation] that appeared [...] in the econometrics literature of the 1980s and 1990s were presented in a surprisingly mature way in Sargan’s 1958 and 1959 articles”.


$\hat\beta_n^{IV} = \beta + (X'P_Z X)^{-1}X'P_Z\varepsilon$ is, in general, biased as $E[(X'P_Z X)^{-1}X'P_Z\varepsilon] \neq 0$. This bias arises because the coefficients $\Pi$ in (2.4) are not observed. If we had observed $Z\Pi$, an OLS regression of $y$ on $Z\Pi$ would be unbiased. Instead, an estimate of $\Pi$ has to be obtained from a regression of $X$ on $Z$. Hahn and Hausman

(2002, 2003), Buse (1992), Stock, Wright and Yogo (2002), or Bound, Jaeger

and Baker (1995) show that this finite sample bias8 is a function of (among

other things) the number of instruments, which suggests that augmenting the

set of instruments increases the bias in the estimator. However, as Buse (1992)

shows, the bias will only be proportionally larger when the number of instru-

ments grows faster than the rate of explained variance of the endogenous re-

gressors. As a consequence, adding important or strong instruments does not

necessarily increase the bias; however, adding less important instruments, or

having weak instruments, will undoubtedly lead to more biased results. Bound,

Jaeger and Baker (1995) and Hahn and Hausman (2003) show that the bias is

inversely related to the F-statistic (the Fisher statistic) of the regression of the endogenous explanatory variable on the instruments. These results suggest that the (partial) $R^2$ and F-statistic of the first stage regression (i.e. the regression of $X$ on $Z$ in (2.4)) are useful as rough guides to the quality of IV estimates

and should routinely be reported (cf. Bound, Jaeger and Baker, 1995). The

LIML estimator is known to have no finite moments and has thicker tails. As

such, it is generally less sensitive to the addition of superfluous instruments

(cf. Kleibergen and Zivot, 2003). Nevertheless, when the IVs are weak, even

LIML may not solve the problem (cf. Hahn and Hausman, 2003).
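The relation between instrument strength, the first-stage F-statistic, and the finite-sample bias of 2SLS can be explored with a small Monte Carlo experiment. The sketch below is an illustration under assumed settings, not a replication of any cited study: one endogenous regressor, no intercept, $n = 200$, and two hypothetical instrument sets (three strong instruments versus twenty weak ones). The median is reported rather than the mean because moments of the estimator need not exist. The F-statistic here tests that all first-stage coefficients are zero in a model without intercept.

```python
import numpy as np

def first_stage_F(x, Z):
    """F-statistic for H0: all first-stage coefficients are zero (no intercept)."""
    n, q = Z.shape
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    rss = np.sum((x - x_hat) ** 2)
    return (x_hat @ x_hat / q) / (rss / (n - q))

def median_2sls_bias(pi, n=200, reps=500, beta=1.0, rho=0.8, seed=7):
    """Median of (2SLS estimate - beta) and mean first-stage F over replications."""
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    errors, F_values = [], []
    for _ in range(reps):
        Z = rng.normal(size=(n, pi.size))
        v = rng.normal(size=n)
        eps = rho * v + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
        x = Z @ pi + v                              # first stage of (2.4)
        y = beta * x + eps
        x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        errors.append((x_hat @ y) / (x_hat @ x) - beta)
        F_values.append(first_stage_F(x, Z))
    return np.median(errors), np.mean(F_values)

if __name__ == "__main__":
    for label, pi in (("3 strong", [0.6] * 3), ("20 weak", [0.05] * 20)):
        bias, F = median_2sls_bias(pi)
        print(f"{label} instruments: median bias {bias:.3f}, mean first-stage F {F:.1f}")
```

In this design the OLS bias is positive, and the 2SLS estimates based on many weak instruments are pulled in the same direction, while the first-stage F-statistic stays far below the values obtained with the strong instruments.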

A second problem associated with weak instruments is the inconsistency of

the IV estimator relative to OLS when the instrument is potentially correlated

with ε, i.e. it is endogenous itself. Bound, Jaeger and Baker (1995) show that

the relative inconsistency of IV to OLS is equal to (for simplicity it is assumed

that $k = q = 1$)

8 They find that for one endogenous regressor, the expectation does not exist when only one instrument is available. See also Wooldridge (2002) who states that the number of moments that exist is one less than the number of overidentifying restrictions.


$$\frac{\mathrm{plim}\,\hat\beta_n^{IV} - \beta}{\mathrm{plim}\,\hat\beta_n^{OLS} - \beta} = \frac{\rho_{z,\varepsilon}/\rho_{x,\varepsilon}}{\rho_{x,z}},$$

where $\rho_{x,z}$ indicates the correlation between $x$ and $z$, the other terms being defined similarly. When the instrument is weak, $\rho_{x,z} \to 0$, implying that even a small correlation between $z$ and $\varepsilon$ can produce a large relative inconsistency in the IV estimator, making the inconsistency of IV potentially larger than in OLS9.
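A short numerical check of the relative inconsistency formula is sketched below. The correlation values are hypothetical, and the simulation assumes jointly normal $x$, $z$, and $\varepsilon$ with zero means and unit variances in a model without intercept, so that $\hat\beta^{IV} - \beta = z'\varepsilon / z'x$ and $\hat\beta^{OLS} - \beta = x'\varepsilon / x'x$.

```python
import numpy as np

def relative_inconsistency(rho_ze, rho_xe, rho_xz):
    """(plim b_IV - beta) / (plim b_OLS - beta) for k = q = 1."""
    return (rho_ze / rho_xe) / rho_xz

def simulated_ratio(rho_ze, rho_xe, rho_xz, n=1_000_000, seed=8):
    """Large-sample analogue of the ratio, from simulated (x, z, eps)."""
    rng = np.random.default_rng(seed)
    corr = np.array([[1.0, rho_xz, rho_xe],
                     [rho_xz, 1.0, rho_ze],
                     [rho_xe, rho_ze, 1.0]])        # order: x, z, eps
    x, z, eps = (rng.standard_normal((n, 3)) @ np.linalg.cholesky(corr).T).T
    iv_error = (z @ eps) / (z @ x)                  # b_IV - beta
    ols_error = (x @ eps) / (x @ x)                 # b_OLS - beta
    return iv_error / ols_error

if __name__ == "__main__":
    args = (0.05, 0.3, 0.1)   # slightly endogenous, weak instrument
    print("formula   :", relative_inconsistency(*args))
    print("simulation:", simulated_ratio(*args))
```

With these values the instrument is only mildly correlated with the error, yet the IV estimator is more inconsistent than OLS because the ratio exceeds one.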

Thirdly, if the instruments are weak, then, even in large samples, classical

(first-order) asymptotic approximations are poor. This is illustrated by (among

others) Nelson and Startz (1990). As they note, conventional wisdom suggests

that when the instruments are weak, the classical asymptotic variance matrix

will be large and the asymptotic distribution of $\hat\beta$ is dispersed. However, it is also shown that the asymptotic distribution is a very poor approximation to the exact finite-sample density function (which is bimodal, fat-tailed and concentrated closer to the probability limit of least squares than the true value). If the asymptotic variance of $\hat\beta_n^{IV}$ decreases, i.e. when the instruments are generally stronger, the classical approximation becomes better. As a consequence,

with weak instruments inferential procedures based on classical asymptotic

results are unreliable. Although finite sample methods could be used in these

situations, their use in practice is limited due to restrictive assumptions, com-

putationally intractable distributions, or the absence of a clear framework for

testing or constructing confidence intervals. The weak instruments problem is

not only relevant for “small samples” and it cannot be ignored in large sam-

ples. This is illustrated by Bound, Jaeger and Baker (1995), who show that for

the Angrist and Krueger (1991) study it is possible to obtain similar results if

artificial random (dummy) instrumental variables are used, despite the sample

size of 329500 observations.

9See also Hahn and Hausman (2003), section V.


Examining instrument validity

As (asymptotic) properties of IV estimators are sensitive to the choice of valid

instruments, regardless of the sample size, measures for ‘weakness’ are desirable. Recently, several studies have suggested reporting F-statistics and $R^2$ measures of the first stage regression routinely. Stock, Wright and Yogo (2002), for instance, suggest that the first-stage F-statistic must be larger than 10 for 2SLS inference to be reliable. Furthermore, Bowden and Turkington (1984) argue that one should find instruments that maximize all of the canonical correlations with $X$. Staiger and Stock (1997) develop a data-based

measure for the relative bias, where large values should alert the researcher

to potential problems of correlations between the instruments and the random

components. Bowden and Turkington (1984) and Verbeek (2000) (among others) present a test for instrument admissibility when $q > k$ (overidentified).

If the test rejects, there is sample evidence against the joint validity of the in-

struments, although it is not possible to determine which one is incorrect. The

method in Bowden and Turkington can be used to examine whether an addi-

tional set of instruments is admissible, but this test does not address potential

weakness of the instruments. In fact, Hahn and Hausman (2003) argue that

this test rejects too often when weak instruments are present, which is a major

drawback since it is often used to test economic theory embodied in the model.
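A test of overidentifying restrictions of the kind discussed above can be sketched as follows. The text does not give the formula, so this example assumes the common Sargan form: $n$ times the (uncentered) $R^2$ from regressing the 2SLS residuals on $Z$, compared with a $\chi^2$ distribution with $q - k$ degrees of freedom. Whether this matches the exact statistic in the cited references is an assumption. The function names and the simulated design, in which the third instrument may have a hypothetical direct effect on $y$, are illustrative.

```python
import numpy as np

def sargan_statistic(y, X, Z):
    """n * R^2 from regressing the 2SLS residuals on Z; returns (stat, q - k)."""
    n, q = Z.shape
    k = X.shape[1]
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
    u = y - X @ b_iv                                # 2SLS residuals (use X, not X_hat)
    u_fit = Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
    return n * (u_fit @ u_fit) / (u @ u), q - k

def simulate(direct_effect, n=5_000, seed=9):
    """Three instruments; the third enters the error when direct_effect != 0."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 3))
    v = rng.normal(size=n)
    eps = 0.8 * v + 0.6 * rng.normal(size=n) + direct_effect * Z[:, 2]
    X = (Z.sum(axis=1) + v).reshape(-1, 1)
    y = X[:, 0] + eps
    return y, X, Z

if __name__ == "__main__":
    for effect in (0.0, 0.3):
        stat, df = sargan_statistic(*simulate(effect))
        print(f"direct effect {effect}: statistic {stat:.1f} on {df} degrees of freedom")
```

As noted in the text, a large statistic signals that the instruments are not jointly valid but does not identify which instrument is at fault.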

Hahn and Hausman (2002) have recently developed a test for the validity of

instrumental variables, which jointly addresses exogeneity and strength. It

is based on the general Hausman specification test approach (Hausman, 1978)

and adopts the second order asymptotic approximations of Bekker (1994). The

idea is to compare forward and backward 2SLS estimators, which are shown to

be equivalent under the null hypothesis that conventional asymptotics is valid.

The test statistic is fairly simple to compute and is shown to have a t distribution under the null hypothesis. Rejection of the null hypothesis might indicate

a failure of the orthogonality assumption of the instruments or that the instru-

ments could be weak. Hahn and Hausman (2002) suggest a two step approach

based on this test to decide whether 2SLS, LIML, or none should be used.


In chapter 4 we propose another method that can be used to investigate the

validity of observed instruments, which is based on the LIV model and exist-

ing test principles. Contrary to the Hahn and Hausman test, it can be used to

separately investigate either instrument weakness or instrument endogeneity,

or both. Furthermore, if the instruments are found to be invalid, the estimates

for the regression parameters can still be used because the LIV results do not

rely on the quality of, nor require access to, observed instruments. Our simulation

evidence suggests that this approach does not yield size problems in presence

of weak instruments, as opposed to the classical test of overidentifying restric-

tions.

Choosing the (number of) instruments

The finite sample bias in IV estimators is a function of the number of instru-

ments, which suggests that one should not include too many, although identifi-

cation requires that at least as many instruments as endogenous regressors are

included (q ≥ k). Furthermore, increasing the number of instrumental vari-

ables results in a loss of degrees of freedom and the first stage regression (X

on Z) suffers from overfitting. Sargan (1958) concludes that “if the first few

instrumental variables are well chosen, there is usually no improvement, and

even a deterioration, in the confidence regions as the number of instrumental

variables is increased beyond three or four”. Besides, similar to the results

presented above, he also notes that “estimates [may] have large biases if the

number of instrumental variables becomes too large” (p.400). As opposed to

these finite sample results, large sample theory, however, shows that an IV es-

timator with one more instrument is at least as efficient, which suggests that

we can add as many instruments as we please without doing worse (see e.g.

Davidson and MacKinnon, 1993 (p.220-p.224)).

Bowden and Turkington (1984) suggest performing a principal components analysis on $Z'Z$ and choosing the first $p$ principal components as instruments. This approach, however, does not address the correlation of $X$ and $Z$, i.e. the

strength of the instruments. Donald and Newey (2001) developed a mean-


squared error criterion that can be minimized to choose a set of instrumen-

tal variables. They find that this method of choosing instruments generally

yields an improvement in performance. In the leading cases, LIML outper-

forms 2SLS, although they find that 2SLS performs better in situations of little

endogeneity. For the weak instruments case, there is a clear tendency to use

fewer instruments.

Testing regressor-disturbance problems

Given the potential pitfalls when using IV results and the problem of finding

instruments at all, one would like to test for potential regressor-error correlation a priori. Unfortunately, it is not possible to examine $X'\varepsilon$ directly, as $\varepsilon$ is unobserved and OLS estimation yields $X'\hat\varepsilon = 0$ for the residuals $\hat\varepsilon$ by definition10. In order to test for endogeneity, valid instrumental variables are required. A test based on the general test procedure of Hausman (1978) can then be used. This test is based on the difference between $\hat\beta_n^{OLS}$ and $\hat\beta_n^{IV}$, and Hausman proposed a test statistic that has approximately a $\chi^2$ distribution under the null hypothesis. A drawback of this test procedure is that external instruments have to be available in order to compute $\hat\beta_n^{IV}$. As a consequence, the researcher may conclude

that the obtained instruments were not needed after all. Furthermore, this test

is potentially sensitive to weak instruments (see e.g. Staiger and Stock, 1997,

or Bowden and Turkington, 1984, for more details). In fact, the Hausman test

may incorrectly fail to reject the use of the OLS estimator because of the bias

(cf. Hahn and Hausman, 2003). In chapter 3 we propose an instrument-free

test that solves this circular problem. We show that this test has reasonable

power over a wide variety of settings.
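The Hausman contrast described above can be sketched in a few lines. The following is a minimal illustration, not the procedure used in this thesis: it assumes a single endogenous regressor, one observed instrument `z` that is valid and relevant, and a common variant of the test in which the OLS residual variance is used for both covariance matrices. The function name and the simulated design are hypothetical.

```python
import math
import numpy as np


def hausman_slope_test(y, x, z):
    """Contrast the OLS and IV slope estimates for y = b0 + b1*x + eps.

    z is a single observed instrument (assumed valid and relevant).
    Returns (statistic, p_value); the statistic has one degree of freedom.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    Z = np.column_stack([np.ones(n), z])

    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # just-identified IV estimator

    # Error variance from the OLS residuals (OLS is efficient under H0).
    resid = y - X @ b_ols
    s2 = resid @ resid / (n - 2)

    # Var(b_iv) - Var(b_ols), both evaluated with the same s2.
    XtPzX = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    v_diff = s2 * (np.linalg.inv(XtPzX) - np.linalg.inv(X.T @ X))

    d = b_iv[1] - b_ols[1]
    stat = float(d**2 / v_diff[1, 1])
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square(1)
    return stat, p_value


# Simulated example; all numbers are illustrative assumptions.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)                  # shared component creates endogeneity
x = z + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u + rng.normal(size=n)

stat, p_value = hausman_slope_test(y, x, z)
```

Note that the function cannot be evaluated without `z`, which is exactly the circularity discussed above: the test for endogeneity presupposes the instrument whose necessity is in question.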

2.2.2 IV based solutions to the weak instrument problem

Hahn and Hausman (2003), and Stock, Wright and Yogo (2002) surveyed most

of the econometric literature on solutions to the weak instruments problem in

empirical applications. In the following we present a brief summary, since

most of the technicalities and the number of results that have appeared in the

literature are beyond the scope of this thesis.

¹⁰ An exception is testing for X′α = 0 in random intercept models, where α = (α1, ..., αn)′

are the unit-specific random intercepts, as a test statistic is readily available (chapter 6).

As mentioned before, first-order asymptotic approximations are poor in the

presence of weak instruments. Several studies have presented improved asymp-

totic approximations to finite-sample distributions in this situation. Staiger and

Stock (1997) developed an alternative asymptotic framework that models the

coefficients of the first stage regression as locally zero, i.e. weakly correlated

without assuming normality. In this framework they showed that if the in-

struments are weak, the 2SLS and LIML estimators have nonstandard asymp-

totic distributions and are not consistent, where the bias is less problematic for

LIML than for 2SLS, particularly in small samples. Furthermore, results on

properties of various inferential procedures (like t tests, coverage rates of confidence

intervals and tests of overidentifying restrictions) are obtained. Bekker

(1994) developed an asymptotic approximation for models with normal errors

in which both the number of instruments and the sample size increase. Simulation

evidence shows that these asymptotics provide good approximations for

moderate and large values of the number of instruments, and that LIML is to

be preferred over the standard IV estimator. However, Bekker’s results apply

only to normal cases and do not capture the nonnormality observed in the exact

finite sample density, see also Staiger and Stock (1997).

Besides work on finding better alternatives to the first-order asymptotics, several

fully robust hypothesis tests and methods have been developed to construct confidence

sets for β that have approximately the correct size and coverage rates

under weak instruments. One such robust test of β = β0 is the

Anderson-Rubin statistic (Anderson and Rubin, 1949), which is not affected

by the degree of underidentification. However, it may lack power because of

a loss of degrees of freedom when the number of instruments is large. The

K statistic (Kleibergen, 2002) has similar asymptotic properties with a mini-

mal number of degrees of freedom. Bekker and Kleibergen (2003) investigate

the finite-sample distribution under normality. Other tests have been proposed

as well, see e.g. Staiger and Stock (1997). Stock, Wright and Yogo (2002)


present results of power comparisons for several tests under different condi-

tions. Given the duality of hypothesis tests and the construction of confidence

sets, the robust tests can be used to obtain confidence intervals. When the in-

struments are weak, these sets can have infinite volume, indicating that there

simply is limited information to use in order to make inferences about β (cf.

Stock, Wright and Yogo, 2002).
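The duality between the Anderson-Rubin test and confidence sets can be illustrated by inverting the test over a grid of candidate values. The sketch below is an illustration under stated assumptions only: a single endogenous regressor, an intercept that is partialled out by demeaning, normal errors so that the F reference distribution applies, and a simulated design with hypothetical function names and parameter values.

```python
import numpy as np
from scipy import stats


def ar_statistic(y, x, Z, beta0):
    """Anderson-Rubin statistic for H0: beta = beta0 in y = a + x*beta + eps."""
    n, k = Z.shape
    # Partial out the intercept by demeaning all variables.
    Zc = Z - Z.mean(axis=0)
    u0 = (y - y.mean()) - (x - x.mean()) * beta0
    fitted = Zc @ np.linalg.lstsq(Zc, u0, rcond=None)[0]
    explained = fitted @ fitted
    unexplained = u0 @ u0 - explained
    return (explained / k) / (unexplained / (n - k - 1))


def ar_confidence_set(y, x, Z, grid, level=0.95):
    """Grid points that the AR test does not reject at the given level."""
    n, k = Z.shape
    crit = stats.f.ppf(level, k, n - k - 1)
    keep = [b for b in grid if ar_statistic(y, x, Z, b) <= crit]
    return np.array(keep)


def simulate(first_stage, n=1000, seed=0):
    """One endogenous regressor, one instrument; true slope equals 2."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 1))
    u = rng.normal(size=n)                      # source of endogeneity
    x = first_stage * z[:, 0] + u + rng.normal(size=n)
    y = 1.0 + 2.0 * x + u + rng.normal(size=n)
    return y, x, z


grid = np.linspace(-5.0, 10.0, 1501)
y_s, x_s, Z_s = simulate(first_stage=1.0)       # strong instrument
y_w, x_w, Z_w = simulate(first_stage=0.02)      # weak instrument
set_strong = ar_confidence_set(y_s, x_s, Z_s, grid)
set_weak = ar_confidence_set(y_w, x_w, Z_w, grid)
```

With a weak first stage the retained set is typically much wider and may cover the whole grid, which is the finite-grid counterpart of the unbounded confidence sets mentioned above.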

The previous methods to carry out tests or construct confidence intervals do

not readily provide point estimates for β. In addition, they may be difficult to

compute. Several alternatives to 2SLS are proposed, that ought to be more ro-

bust and reliable if the instruments are weak. Second-order unbiased estimates,

such as LIML or Nagar estimators, are often suggested as robust alternatives.

These estimators, however, do not have finite sample moments which may

present a problem in empirical situations (Hahn and Hausman, 2003). Other al-

ternatives are Jackknife Instrumental Variables (Angrist, Imbens and Krueger,

1999), Fuller-k Estimator (Fuller, 1977), or bias-adjusted 2SLS (Donald and

Newey, 2001). Stock, Wright and Yogo (2002) find that these partially robust

estimators provide relatively reliable alternatives to 2SLS in applications with

weak instruments. However, Hahn and Hausman (2003) recommend, based on

Monte Carlo evidence, extreme caution using “no moment” estimators (LIML

or Nagar). Considering mean-squared error and IQR measures, they conclude

that 2SLS, jackknife 2SLS, and Fuller-based estimators perform best, and state

that “instrument pessimism seems overstated for 2SLS, which may be why

2SLS often performs better than expected in terms of MSE in the weak instru-

ment situation”. The specification test suggested by Hahn and Hausman (2002)

may be used to decide among the alternatives. Both Stock, Wright and Yogo

(2002) and Hahn and Hausman (2003) stress that most of the analysis in the

weak instruments literature is conditional on instrument exogeneity. Failure of

the exogeneity restriction, in particular in combination with weak instruments,

leads to additional complications and situations in which OLS may do better

than the above suggested remedies against weak instruments (see also section

4.4).


2.3 Alternative approaches to solve for regressor-error dependencies

In some applications the nature of the data generating process or the suspected

cause of endogeneity itself suggests suitable instruments or even a different es-

timation approach. Wooldridge (2002) suggests three other solutions to solve

omitted variable problems, including the proxy-variable OLS method (p.63)

and using indicators of the unobservables (p.105-p.107) that require IV estimation,¹¹

where the latter method also applies to measurement error models.

Furthermore, observing the same cross-sectional units over time, and applying

fixed effects estimation could also eliminate endogeneity due to omitted vari-

ables, if the endogeneity arises from time-invariant sources, see also chapter

6, or the example given by Verbeek (2000) (p.312). Card (1999) presents an

overview of studies using sibling and twin data to estimate the return to ed-

ucation and argues that omitted ability is eliminated when computing within-

family estimators. Stern (2004) uses data composed of multiple job offers to

postdoctoral students and a fixed-effects approach to estimate the relation be-

tween wages and the scientific orientation of organizations. His results suggest

a negative relation between science and income, that is biased upward when

unobserved quality of researchers is not controlled for. For measurement er-

ror models, autoregressive models, and simultaneous equation models the data

generating process may suggest suitable instruments. It is beyond the scope of

this thesis to review all the literature on these topics. For measurement error

models we refer to e.g. Wansbeek and Meijer (2000), Carroll, Ruppert and

Stefanski (1995), or Bowden and Turkington (1984) for extensive overviews.

These models can be estimated using IV techniques, for instance by using other

(potentially) mismeasured variables (White, 2001). Another method is based

on Wald (1940), that assumes that the observations can be divided into groups.

¹¹ The indicator IV solution is different from the classical IV solution discussed previously. The indicator IV solution assumes the existence of a possibly mismeasured proxy for the missing variable w, that needs to be instrumented, whereas the classical IV solution leaves the omitted variable w in the error term and all elements of x correlated with w need to be instrumented. See also Petrin and Train’s (2002) control function approach and the discussion in Chintagunta, Dube and Goh (2004) (p.6).


This classification should be independent of the error terms and should dis-

criminate between high and low values of the unobservable true construct (see

also Madansky, 1959, or chapter 3). Similarly, higher order lags may serve as

instruments for a model that has lagged dependent variables as regressors in

the presence of serial correlation. In simultaneous equation models exogenous

variables that are not included in the equation of interest can often serve as

instruments and are readily available (see e.g. Greene, 2000).
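The grouping estimator of Wald (1940) mentioned above can be written compactly. As a sketch, assume the observations are split into two groups with group means $(\bar x_1, \bar y_1)$ and $(\bar x_2, \bar y_2)$ and overall means $(\bar x, \bar y)$; this notation is introduced here for illustration only. The fitted line passes through the two group means, which coincides with the IV estimator that uses the group dummy as instrument:

```latex
\hat\beta_1^{W} = \frac{\bar y_2 - \bar y_1}{\bar x_2 - \bar x_1},
\qquad
\hat\beta_0^{W} = \bar y - \hat\beta_1^{W}\,\bar x .
```

The denominator shows why the grouping has to discriminate between high and low values of the true construct: if $\bar x_2 - \bar x_1$ is close to zero, the estimate becomes unstable.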

In the following we briefly consider three other interesting methods to solve for

endogeneity that have recently appeared in the literature: (1) Lewbel’s method,

(2) methods that model demand, cost and competition, and (3) spatial econo-

metrics.

Lewbel’s method. Lewbel (1997) showed that for measurement error models

instruments can be constructed from available data by exploiting higher order

moments. Hence, observed exogenous instrumental variables are not required.

Erickson and Whited (2002) extend this method and propose a two-step gen-

eralized method of moments estimator for a multiple mismeasured regressor

errors-in-variables model. Consistent estimation requires, among other things,

that measurement and equation errors are independent and have moments of

every order, but no assumptions have to be made about distributional forms.

Hence, information contained in third- and higher order moments of the data

is fully exploited to identify the regression parameters.

This interesting approach is developed for measurement error applications, but

may be applicable to more general regressor-error dependency models as well.

In appendix 6C we show for a simple linear multilevel model how these ideas

can potentially be extended to more ‘general’ endogeneity applications. De-

pending on the empirical situation, it may provide an easy way to construct

instruments from the available data and, hence, deserves more attention.

Methods that model demand, cost and competition. Several studies have

attempted to solve for ‘price endogeneity’ in markets with differentiated prod-


ucts. Price is endogenously determined by the interaction of demand and sup-

ply. The idea is to solve this form of endogeneity by jointly modeling demand

and supply equations, using a profit maximization model. Berry (1994) and

Berry, Levinsohn and Pakes (1995) develop a market equilibrium model, based

on a logit demand function, that is adapted to make it suitable for traditional

instrumental variables estimation. This method is applicable to aggregate

or disaggregate data, or a combination of both. The resulting system is obtained

by aggregating a discrete choice model of individual consumer behavior,

which is combined with a cost function. These two models are embedded in a

system of price setting firms in differentiated markets. Joint estimation leads

to potentially more efficient estimates than focusing on the demand side only

with instruments for price. Furthermore, the system provides detailed information

on cost structures and the nature of competition. Using equilibrium

models, however, places greater demands on the data, and incorrect specification of

the firm’s behavior could lead to biased estimates.¹² This approach has widely

been applied and adapted, with differences in e.g. data-aggregation, type of

heterogeneity, or method of estimation, see, for instance, Besanko, Gupta and

Jain (1998), Besanko, Dube and Gupta (2000), Nevo (2001), or Sudhir (2001).

Most studies employ an instrumental-variables based simultaneous equations

estimation procedure, which is a generalized method of moments estimator.

However, as opposed to homogenous goods models, in differentiated markets

most of the exogenous variables in the model are product characteristics affect-

ing both cost and demand. Traditional exclusion restrictions therefore cannot

be used to form instruments. Sudhir (2001) (section 4.2) and Nevo (2001) (sec-

tion 4.3), for instance, discuss this in more detail and report having instruments

of potential poor quality, see also the discussion in Berry (2003) (section 1).

¹² See e.g. the discussion in Yang, Chen and Allenby (2003) (section 2.3) or Dube and Chintagunta (2003).

Recently, Draganska and Jain (2004) proposed a new maximum likelihood

based method for simultaneous estimation of supply-demand. Their proposed

algorithm uses individual-level data for a heterogenous demand model, together

with a supply equation derived from profit maximization behavior of

firms, assuming a Bertrand-Nash equilibrium. The resulting likelihood equa-

tion cannot be maximized straightforwardly because the equilibrium model is

highly nonlinear. Their estimation procedure is based on simulating prices

and choice probabilities to solve for the market equilibrium. The obtained

smoothed empirical distribution can be used for maximization. On the other

hand, Yang, Chen and Allenby (2003) (with discussions) proposed a Bayesian

approach to estimate a simultaneous heterogenous demand and supply model.

The method incorporates consumer heterogeneity and allows for a wide va-

riety of supply model specifications. The advantage of their approach is that

it can handle non-linear model structures and allows for exact small sample

inference. See also Chintagunta, Dube and Goh (2004) for a recent overview

and discussion.

Spatial econometrics. Recently, two studies in marketing appeared that solve

endogeneity of marketing mix variables using spatial dependencies in observed

market data, where no or limited time variation is present. These dependen-

cies are caused by the fact that economic agents are spatially organized, or have

similar store profiles. Bronnenberg and Mahajan (2001) identify correlations

between marketing mix variables and the error term by imposing a measurable

spatial structure on the random terms in the model. This spatial map is a con-

sequence of unobserved actions of retailers that are faced by trade territories

consisting of multiple neighboring markets. Bronnenberg and Mahajan con-

struct a spatial map by making use of geographic proximities. By accounting

for this space in an econometric model, it is possible to correct and test for the

effect of unobserved retailer’s behavior. Their results for Mexican food items

suggest that unobserved components of the dependent variables are related to

the marketing mix variables. Van Dijk et al. (2004) consider the estimation of

shelf-space elasticities based on endogenous shelf space data. Estimation of

shelf-space elasticities is hampered due to minimal (time) variation in shelf-

space measures. The authors build on the work of Bronnenberg and Maha-

jan and propose to model the correlation between shelf space and the random

terms by using a spatial structure based on similarities in store-, consumer-,

and competitor characteristics. Their results for frequently bought daily care

products provide face valid shelf-space elasticities estimates that outperform

a model with a spatial structure based on geographic proximities in terms of

predictive validity. Since retailers generally decide about shelf space based on

store, customers and competitor characteristics, it is expected that the similar-

ity of two geographically similar stores in this case is lower than the similarity

of two stores with similar profiles in distinct regions.

2.4 Conclusions and positioning of research

It is clear from this review that traditional instrumental variable methods, that

rely on economic theory or intuition to find additional observable instruments,

suffer from at least two problems: (i) in many situations no such variables

are available, and (ii) once available, performance of the inferential procedures

critically relies on the quality of these variables. In particular the latter

has recently been the topic of several studies in econometrics. Although many

important contributions to the weak instrument problem have been made, the

problem of having potentially endogenous instruments has not yet been solved

(see e.g. concluding remarks of Hahn and Hausman, 2003, or Stock, Wright

and Yogo, 2002). For most empirical researchers the question where to find

suitable instruments is still open and usually there is not much choice when

selecting instrumental variables. Without having valid instrumental variables

at hand, classical instrumental variables estimation techniques cannot be re-

lied on. Furthermore, there is a bit of a dilemma: theory suggests that the

best choice of instruments are variables that are highly correlated with the en-

dogenous regressors. However, the more highly correlated they are, the less

defensible is the claim that these variables themselves are uncorrelated with

the disturbances (cf. Greene, 2000, p. 375).

The latent instrumental variables (LIV) method proposed in the next chapter at-

tempts to solve this circular problem. Similar to the classical IV model in (2.4),

we assume that the endogenous regressor can be separated into an exogenous

part and an endogenous part. However, we propose to model the exogenous


part as an unobserved discrete variable, which is a nuisance parameter. We

prove that the model parameters are identified through the likelihood. Hence,

observed instrumental variables are not required to estimate the regression pa-

rameters. In econometrics, instruments frequently take the form of categor-

ical variables and, in addition, continuous instruments are often transformed

into dummy variables (van der Ploeg, 1997, Bowden and Turkington, 1984, or

Verbeek, 2000). We show that the parameters in the LIV model can be iden-

tified and estimated through maximum likelihood methods. As a by-product,

‘optimal’ LIV instruments are estimated from the data and regressor-error dependencies

can be tested for straightforwardly without needing observed instruments

at hand. Furthermore, the proposed likelihood framework allows for

straightforward extensions to different applications.

The LIV approach shares some features with two methods developed in

the measurement error literature. Wald (1940) and Madansky (1959) assume

that data is divided into two groups according to certain (statistical) criteria.

Then a straight line can be fitted because it is determined by two points. If

the grouping criteria are satisfied, the fitted line can be shown to be a consis-

tent estimate of the true line. Randomly assigning the observations into two

groups, for instance, or simply assigning the observations with high x values

to one group and with low x values to the other group, does not provide

valid groupings. Ideally, group construction should be based on some knowl-

edge of the pattern of the underlying variation (cf. Bowden and Turkington,

1984). The LIV model does not require the existence of an a priori group-

ing of the data but estimates such a grouping simultaneously with the other

parameters using mixture modeling techniques. Secondly, Lewbel’s idea to

construct instruments from the available data and, hence, solving the circular

problem of needing observed instrumental variables at hand, is similar to the

motivation of the LIV model. Lewbel (1997), and Erickson and Whited (2002)

propose method-of-moments based estimators and show that, under certain

higher-order moment conditions, instrumental variables can be obtained from

the available data. Hence, instruments are constructed based on ‘statistical’

moment conditions and the resulting variables will generally not correspond to

an economic theory or interpretation. Although these methods are developed

for measurement error applications, we believe that they are more generally

applicable, like the LIV model, although this requires further research (see ap-

pendix 6C, subsection 6.5.2, or subsection 8.2.2).

On the other hand, we propose a likelihood-based approach, which constitutes

a very general framework that can be easily adapted to more general situa-

tions, for instance to the Bayesian setting in chapter 7. Furthermore, the pre-

dicted LIV instruments can be used to investigate the nature of the endogene-

ity more thoroughly, since these instruments are estimated from the available

data rather than being constructed based on higher-order moment assumptions

that may or may not be valid. The likelihood-approach has desirable opti-

mality properties and can be expected to be more efficient than method-of-

moments estimation. We agree with Yang, Chen and Allenby (2003) who state

that “likelihood-based inference offers a distinct advantage over a method-of-

moments approach because it makes precise statements about the probability

of the observed data. In a likelihood-based analysis, the researcher is con-

fronted with the correspondence between the model and the data, and cannot

fit a model that is not supported by the data”. Besides, the LIV model belongs

to the class of mixture models, that are often employed to estimate probabil-

ity density functions, and mixture models can be seen as a flexible and robust

approach to approximate them. Kim, Menzefricke and Feinberg (2004), for

instance, provide evidence that mixtures of normals are a simple and effective

way of density estimation, in particular in a Bayesian framework. Tittering-

ton, Smith and Makov (1985) find that finite mixtures (of normals) have often

been used in robustness studies to investigate non-normal conditions in ‘nor-

mal’ inference, or to provide a procedure to reduce the influence of outlying

observations. Hence, it is expected that some of these aspects translate to the

LIV model, and we will show that the LIV results are relatively insensitive to

different choices of the shape of the distribution of the data and, hence, to the

(non)existence of higher-order moments.

In the next chapter we introduce the simple LIV model which is further devel-


oped in subsequent chapters.

Chapter 3

The LIV model

3.1 Introduction

In applying the classical linear regression model, the assumption that E(ε|x) = 0,

with $x \in \mathbb{R}^{k \times 1}$, may not hold. As a consequence, the OLS estimator is

biased. A more or less standard approach in econometrics to obtain unbiased

estimates is to find instrumental variables z (see for example Bowden and

Turkington, 1984). Instruments mimic the troublesome regressors but are uncorrelated

with the error term (i.e. E(ε|z) = 0, $z \in \mathbb{R}^{g \times 1}$). Once g ≥ k, the

regression coefficients can be estimated by 2SLS or LIML (see the standard

textbooks, e.g. Greene, 2000).

In empirical work it is not always obvious whether it is necessary to search for

instruments and if so where to find them. Thus, one would like to test a priori

whether E(ε|x) = 0 holds. However, as OLS always yields X′e = 0, it is

fruitless to use the OLS estimates for that purpose. One way to test for exo-

geneity is through the use of instruments. Once instruments have been found,

a Hausman test can be applied to determine post-hoc whether they were ac-

tually needed (see e.g. Bowden and Turkington, 1984). This method has as

drawbacks that instruments need to be available and that once they are avail-

able, they may be weak and/or correlate with the error (i.e., E(ε|z) = 0 may

not hold). Several authors have examined problems associated with weak in-

struments (Bound, Jaeger, and Baker, 1995, Staiger and Stock, 1997, or van

der Ploeg, 1997), and results on asymptotics, efficiency bounds and tests for

instrument validity for IV models are available (Staiger and Stock, 1997, Hahn

and Hausman, 2002, Hahn, 2002).
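The remark that OLS always yields X′e = 0 can be checked numerically. The snippet below is a small illustration on simulated data; the design, in which the regressor and the error share a common component `u`, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Endogenous regressor: x and the error eps share the component u.
u = rng.normal(size=n)
x = rng.normal(size=n) + u
eps = u + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_ols                # OLS residuals

moment_resid = X.T @ e / n          # zero up to rounding, by construction
moment_true = X.T @ eps / n         # slope entry estimates E(x * eps), here nonzero
```

The residual moment vanishes even though the true errors are correlated with x, so the residuals carry no information about the violation of E(ε|x) = 0, while the OLS slope is pulled away from its true value of 2.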

We propose an “instrument-free” approach to estimate regression parameters

in a situation with potential regressor-error correlations. As this method does

not rely on observable instruments, issues such as availability, validity, or

weakness of the instruments can be circumvented. The proposed method,

which we call the ‘latent instrumental variable (LIV) method’, utilizes a la-

tent variable model to account for dependencies between the regressors and

the error. The method introduces an (unobserved) discrete binary variable to

decompose x into a systematic part that is uncorrelated with ε and one that is

possibly correlated with ε.

Although the idea of latent instruments is new, discrete instruments have been

used before. Frequently, observable instruments are categorical instruments.

For example, in the measurement error literature, grouping methods have been

used to construct instruments based on the method of Wald (cf. Madansky,

1959). Van der Ploeg (1997) uses instruments generated by an a priori group-

ing of the data to apply the group-asymptotics developed by Bekker (1994).

Lewbel (1997) proposed a method to construct internal instruments for regres-

sors with measurement errors by taking simple functions of the model data. As

such, in his approach no additional external instruments are needed to iden-

tify and estimate the regression parameters. Under nonnormality, the use of

third order moments of the data identifies the model parameters and 2SLS or a

GMM approach can be shown to yield consistent estimators (Erickson, 2001,

or Wansbeek and Meijer, 2000). In our LIV model the parameters are not

identified by the first two moments, but can be identified by the likelihood.

However, as Hennig (2000) observes, identifiability of mixture models (the

LIV model belongs to this class of models) is not straightforward and we have

to be careful in claiming identifiability.

This chapter is organized as follows. In section 3.2 we introduce the LIV

model and in section 3.3 we prove identifiability and present results on the

information matrix. Furthermore, we suggest a method which is based on a

Hausman-test (Hausman, 1978) to test directly for regressor-error dependen-

cies, without needing observed instrumental variables. This instrument-free

test can be used to assess a priori the presence of regressor-error correlations

(section 3.4). The model estimators and test-statistic are evaluated on the basis

of a simulation study (section 3.5). We demonstrate that this latent instrumental

variable method yields approximately unbiased¹ results for the regression

parameters over a wide range of regressor-error correlations and several dis-

tributions of the instruments. Furthermore, the test-statistic has a reasonable

power across these settings. In section 3.6 we empirically illustrate the LIV

model for a measurement error application where an observed ‘natural’ (lab-

oratory) discrete instrument is available. We show that in this case the LIV

estimate and the IV estimate, computed with this natural instrument, coincide.

Section 3.7 concludes this chapter.

3.2 The LIV model

The structural form of the assumed LIV model is given by

\[
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \\
x_i &= \pi' z_i + \nu_i,
\end{aligned}
\tag{3.1}
\]

with i = 1, ..., n and π an (m × 1)-vector of category means. Here we assume

a single unobserved categorical discrete instrument z. In subsection 3.3.1 we

show that in this case the categorical instrument should be at least of dimension

two, which is in accordance with van der Ploeg’s (1997) result for the standard

IV model with discrete instruments. It is assumed that z is independent of the

error terms (ε, ν), that are specified to follow a joint normal distribution F

with mean zero and variance-covariance matrix

\[
\Sigma = \begin{bmatrix} \sigma_\varepsilon^2 & \sigma_{\varepsilon\nu} \\ \sigma_{\varepsilon\nu} & \sigma_\nu^2 \end{bmatrix}. \tag{3.2}
\]

¹ Since we have no formal proof of the unbiasedness or consistency of the LIV estimator (see also chapter 8 for a discussion), we sometimes use ‘approximately’ unbiased or consistent to denote a result from Monte Carlo studies.


If we had observed the instruments, z would separate the sample into m groups,

with known category membership for each observation. We assume, however,

that the category indicators are unknown a priori and have a multinomial distribution

with parameters (n, λ), where n = 1 and λ = (λ1, λ2, ..., λm)′, with

$\lambda_j > 0$ and $\sum_j \lambda_j = 1$. Conditionally on category j = 1, 2, ..., m, the reduced

form distribution corresponding to (3.1) has mean

\[
\mu_j = \begin{pmatrix} \beta_0 + \beta_1 \pi_j \\ \pi_j \end{pmatrix} \tag{3.3}
\]

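and variance-covariance matrix

\[
\Omega = \begin{bmatrix}
\beta_1^2 \sigma_\nu^2 + 2\beta_1 \sigma_{\varepsilon\nu} + \sigma_\varepsilon^2 & \beta_1 \sigma_\nu^2 + \sigma_{\varepsilon\nu} \\
\beta_1 \sigma_\nu^2 + \sigma_{\varepsilon\nu} & \sigma_\nu^2
\end{bmatrix}. \tag{3.4}
\]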

The simple LIV model assumes the existence of a dummy instrument, i.e. m = 2. The assumption of a single dummy instrument prevents overfitting and adds

tractability to our specification. Furthermore, it is known from the IV literature

that the number of IVs should not be too large (Bowden and Turkington, 1984,

Buse, 1992, or Bound, Jaeger, and Baker, 1995). Moreover, we will show

below that the simple LIV model is robust against misspecification of the true

number of categories of the instrument and performs well for several types of

distributions for x. If m = 2, then conditionally on category j = 1, 2, the

reduced form distribution corresponding to (3.1) is

\[
\mathcal{L}(y_i, x_i \mid z_i = e_j) = N_2(\mu_j, \Omega) \tag{3.5}
\]

with mean (3.3) and variance (3.4), and where e1 = (1, 0)′ and e2 = (0, 1)′.

If $f_j$ denotes the bivariate normal probability density function conditionally

given $z_i = e_j$, then the unconditional (marginal) probability density function

for $(y_i, x_i)$ can be computed as

\[
f(y_i, x_i) = \lambda f_1(y_i, x_i) + (1 - \lambda) f_2(y_i, x_i). \tag{3.6}
\]

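Thus, $f(y_i, x_i)$ is a mixture of bivariate homoscedastic normal distributions

and it has expectation²

\[
\mu_{y,x} = \begin{pmatrix} \beta_0 + \beta_1 \bigl( \lambda \pi_1 + (1 - \lambda) \pi_2 \bigr) \\ \lambda \pi_1 + (1 - \lambda) \pi_2 \end{pmatrix}
\]

and variance-covariance matrix³

\[
\Omega_{y,x} = \Omega + \lambda (1 - \lambda) (\pi_1 - \pi_2)^2 (\beta_1, 1)' (\beta_1, 1). \tag{3.7}
\]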

The parameters to be estimated are β0, β1, π1, π2, Σ, and λ. The parameters

are identified in our model. We will demonstrate this in subsection 3.3.1.
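The variance-covariance matrix in (3.7) follows from the decomposition of the variance into a within-category and a between-category part. As a brief worked step, writing $\Omega$ for the within-category covariance matrix in (3.4), $\mu_1, \mu_2$ for the category means in (3.3), and $\bar\mu = \lambda\mu_1 + (1-\lambda)\mu_2$ (a symbol introduced here for convenience), the between-category part is

```latex
\begin{aligned}
\operatorname{Var}\bigl(E[(y_i, x_i)' \mid z_i]\bigr)
  &= \lambda (\mu_1 - \bar\mu)(\mu_1 - \bar\mu)' + (1-\lambda)(\mu_2 - \bar\mu)(\mu_2 - \bar\mu)' \\
  &= \bigl[\lambda (1-\lambda)^2 + (1-\lambda)\lambda^2\bigr] (\mu_1 - \mu_2)(\mu_1 - \mu_2)' \\
  &= \lambda (1-\lambda) (\pi_1 - \pi_2)^2 \, (\beta_1, 1)'(\beta_1, 1),
\end{aligned}
```

where the second line uses $\mu_1 - \bar\mu = (1-\lambda)(\mu_1 - \mu_2)$ and $\mu_2 - \bar\mu = -\lambda(\mu_1 - \mu_2)$, and the last line uses $\mu_1 - \mu_2 = (\pi_1 - \pi_2)(\beta_1, 1)'$. Adding the within-category part $\Omega$ gives (3.7).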

For estimation of the parameters, assume that a sample of n i.i.d. observations

$(y_i, x_i)$ is available, but we do not require the availability of observed

instruments $z_i$. The method of maximum likelihood can be used to estimate

the model parameters. The likelihood function is obtained as the product of

(3.6) across the observations. The resulting (log) likelihood equations, how-

ever, are nonlinear and do not allow a closed-form expression. Therefore we

use quasi-Newton numerical optimization routines (the BFGS-method) for the

maximization of the likelihood function that are provided with the GAUSS

package (Aptech, 2000). In the following section we discuss some statistical

properties of the LIV model.
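The estimation step can be sketched as follows. The computations in this thesis use the BFGS routine in GAUSS; the sketch below instead uses the BFGS implementation in SciPy and should be read as an illustration only. The reparameterization (log standard deviations, a tanh-transformed correlation and a logit-transformed mixing weight), the starting values, the function names, and the simulated data are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize


def unpack(theta):
    """Map an unconstrained vector to the simple LIV parameters (m = 2)."""
    b0, b1, pi1, pi2, log_se, log_sv, z_rho, z_lam = theta
    s_e, s_v = np.exp(log_se), np.exp(log_sv)   # standard deviations > 0
    rho = np.tanh(z_rho)                        # error correlation in (-1, 1)
    lam = 1.0 / (1.0 + np.exp(-z_lam))          # mixing weight in (0, 1)
    return b0, b1, pi1, pi2, s_e, s_v, rho, lam


def neg_loglik(theta, y, x):
    """Negative log-likelihood built from the mixture density (3.6)."""
    b0, b1, pi1, pi2, s_e, s_v, rho, lam = unpack(theta)
    s_ev = rho * s_e * s_v
    # Elements of the reduced-form covariance matrix in (3.4).
    o11 = b1**2 * s_v**2 + 2.0 * b1 * s_ev + s_e**2
    o12 = b1 * s_v**2 + s_ev
    o22 = s_v**2
    det = o11 * o22 - o12**2
    log_f = []
    for pi_j in (pi1, pi2):
        dy = y - (b0 + b1 * pi_j)               # deviations from the mean (3.3)
        dx = x - pi_j
        quad = (o22 * dy**2 - 2.0 * o12 * dy * dx + o11 * dx**2) / det
        log_f.append(-np.log(2.0 * np.pi) - 0.5 * np.log(det) - 0.5 * quad)
    log_mix = np.logaddexp(np.log(lam) + log_f[0], np.log1p(-lam) + log_f[1])
    return -np.sum(log_mix)


# Simulated data; the latent group indicator is never used in the fit.
rng = np.random.default_rng(0)
n = 2000
group = rng.random(n) < 0.5
pi_i = np.where(group, -2.0, 2.0)
eps, nu = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n).T
x = pi_i + nu
y = 1.0 + 2.0 * x + eps

# Starting values: OLS coefficients and quartiles of x (an ad hoc choice).
X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
start = np.array([b_ols[0], b_ols[1],
                  np.quantile(x, 0.25), np.quantile(x, 0.75),
                  0.0, 0.0, 0.0, 0.0])

res = minimize(neg_loglik, start, args=(y, x), method="BFGS")
b0_hat, b1_hat = res.x[0], res.x[1]
```

Because mixture likelihoods can have several local maxima, it is advisable in practice to repeat the optimization from a number of different starting values and to keep the solution with the highest likelihood.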

3.3 Identifiability and information

In this section we discuss some statistical properties of the LIV model. Firstly,

we prove the identifiability of all the parameters of the LIV model. Identifiability

is proved for a general number of categories and normal regressor-error

distributions. Furthermore, we discuss the estimation of the information ma-

trix.

² Use E(U) = E[E(U|V)].

³ Use Var(U) = E[Var(U|V)] + Var(E[U|V]).


3.3.1 Identifiability

The parameters π and $\sigma_\nu^2$ are identified using first and second order moments,

but the parameters β0, β1, $\sigma_\varepsilon^2$, and $\sigma_{\varepsilon\nu}$ are not identified. However, these

parameters become identified by considering finite mixtures. Let

\[
\mathcal{F} = \{ F(x, \theta),\ \theta \in \Theta,\ x \in \mathbb{R}^d \}
\]

be the class of d-dimensional distribution functions from which mixtures are

to be formed. Here Θ will be a Borel measurable set in $\mathbb{R}^q$ and θ is formed

from the elements of the mean (3.3) and variance (3.4). For the simple LIV

model above, d = 2, q = 5, and F(., θ) is a bivariate normal c.d.f.
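A short calculation, given here as an illustration rather than as part of the formal proof, shows why the first two moments alone cannot determine $\beta_1$. Suppose there is a single category (m = 1), so that $(y_i, x_i)$ is bivariate normal with mean $(\mu_y, \mu_x)'$ and covariance elements $\omega_{yy}, \omega_{yx}, \omega_{xx}$ (notation introduced for this example). Matching these moments with (3.3) and (3.4) gives, for any value of $\beta_1$,

```latex
\begin{aligned}
\pi &= \mu_x, & \sigma_\nu^2 &= \omega_{xx}, \\
\beta_0 &= \mu_y - \beta_1 \mu_x, & \sigma_{\varepsilon\nu} &= \omega_{yx} - \beta_1 \omega_{xx}, \\
 & & \sigma_\varepsilon^2 &= \omega_{yy} - 2\beta_1 \omega_{yx} + \beta_1^2 \omega_{xx}.
\end{aligned}
```

Five moments thus have to determine six parameters, and every value of $\beta_1$ reproduces the same normal distribution. With two categories and $\pi_1 \neq \pi_2$, the two component means both lie on the line $y = \beta_0 + \beta_1 x$, so that $\beta_1$ equals the difference of the component means of y divided by $\pi_1 - \pi_2$.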

The class of finite mixtures $\mathcal{H}$ generated by $\mathcal{F}$ is defined as

\[
\mathcal{H} = \Bigl\{ H(x) : H(x) = \sum_{j=1}^{m} \psi_j F(x, \theta_j),\ \psi_j > 0,\ \sum_{j=1}^{m} \psi_j = 1,\ F(x, \theta_j) \in \mathcal{F},\ \forall j,\ m = 1, 2, \ldots;\ x \in \mathbb{R}^d \Bigr\}. \tag{3.8}
\]

So, $\mathcal{H}$ is the convex hull of $\mathcal{F}$. For the sake of simplicity, we will use some

abbreviations: $F(x, \theta_j)$ will be written as $F_j(x)$ or just $F_j$ and the (corresponding)

mixture as $H = \sum_{j=1}^{m} \psi_j F_j$. We use definition 3.1 for the identifiability

of mixtures in $\mathcal{H}$. Here we are interested in pure mixtures ($\psi_j > 0$) of order

m = 1, 2, ....

Definition 3.1 Suppose $H$ and $H'$ are any two members of $\mathcal{H}$, given by

\[ H = \sum_{j=1}^{m} \psi_j F_j, \qquad H' = \sum_{j=1}^{m'} \psi'_j F'_j. \tag{3.9} \]

$\mathcal{H}$ is identifiable when $H \equiv H'$ if and only if $m = m'$, and the order of summation can be chosen such that $\psi_j = \psi'_j$, $F_j = F'_j$, $j = 1, \ldots, m$.


Stated differently, $H \in \mathcal{H}$ is identifiable if there is a unique solution (up to a permutation of subscripts) of the identity defining $H$ in (3.9). Several theorems on the identifiability of finite mixtures are available, and linear independence of the members of $\mathcal{F}$ is the key to answering the question (cf. Titterington, Smith, and Makov, 1985). Core papers on this issue are Teicher (1963) and Yakowitz and Spragins (1968), from which we present the most important results in appendix 3A.

In several studies it is proved that certain families $\mathcal{F}$ of $d$-dimensional c.d.f.'s, for instance Gaussian c.d.f.'s, generate identifiable finite mixtures. As we show below, these results do not carry over directly to the LIV model and we have to prove identifiability in two steps. Similarly, related work by Hennig (2000) on the identifiability of mixtures of regressions cannot be extended in a straightforward way because of structural differences between his framework and ours, and because of differing model assumptions (in the context of Hennig at least $\sigma_{\varepsilon\nu} \neq 0$ is required, whereas $\sigma_{\varepsilon\nu} = 0$ is also of interest in our model).

Let

\[ \mathcal{F}_{\beta,\Sigma} = \{F \mid F \text{ is a bivariate normal c.d.f. on } \mathbb{R}^2 \text{ of the pair } (y_i, x_i) \text{ with mean and variance } (\mu(\beta, \pi), \Omega(\beta, \Sigma)),\ \pi \in \mathbb{R}\}, \tag{3.10} \]

where $\mu(\beta, \pi) = (\beta_0 + \beta_1 \pi, \pi)'$ and $\Omega(\beta, \Sigma) = \Omega$ as in (3.4), be the class of general LIV models, where $\beta$ and $\Sigma$ are known, and $\mathcal{H}_{\beta,\Sigma}$ the set of all pure finite mixtures of order $m$ of the class $\mathcal{F}_{\beta,\Sigma}$. We will consider general $m > 1$. For the simple LIV model we used $m = 2$. We apply standard results on the identifiability of finite mixtures to establish identifiability of the class $\mathcal{H}_{\beta,\Sigma}$ in terms of the $\pi$'s and the mixture probabilities. However, the identifiability of the parameters $\beta$ and $\Sigma$ does not follow immediately. In fact, we are not seeking the identifiability of $\mathcal{H}_{\beta,\Sigma}$ but the identifiability of the larger class

\[ \mathcal{G} = \bigcup_{\beta,\Sigma} \mathcal{H}_{\beta,\Sigma}. \tag{3.11} \]

In the following, we first prove the identifiability of the class $\mathcal{H}_{\beta,\Sigma}$. Subsequently, we use the identifiability of $\mathcal{H}_{\beta,\Sigma}$ to prove identifiability of $\mathcal{G}$, which is the class of LIV models. Identifiability of $\mathcal{G}$ is equivalent to identifiability of the parameters in the general LIV model.

Proof of identifiability of $\mathcal{H}_{\beta,\Sigma}$. Let $F_{j,X}$ be the marginal distribution function of $F_j$ for $X$. More specifically, $F_{j,X}(x) = \lim_{y \to \infty} F_j(y, x)$. From (3.1) and (3.3) it can be seen that $X$ has mean $\pi_j$ and variance $\omega_{22}$. Here, all $F_{j,X}$, $j = 1, 2, \ldots$, are normal distribution functions with different location parameters but with the same variance.

Since we assume for the moment that $\beta$ and $\Sigma$ are known, identifiability of the class $\mathcal{H}_{\beta,\Sigma}$ is established if there is a unique solution of $F(y, x) = \sum_{j=1}^{m} a_j F_j(y, x)$ in terms of $a_j$ and $\pi_j$ for $m = 1, 2, \ldots$. But now we only have to look at the marginal distribution of $X$, since this distribution contains all the relevant parameters. The marginal c.d.f. is a finite mixture of $N(\pi, \sigma_\nu^2)$ distribution functions with $\pi \in \mathbb{R}$, where $\sigma_\nu^2$ is fixed for the moment. According to proposition 1 of Teicher (1963), or proposition 2 of Yakowitz and Spragins (1968) (see appendix 3A), $F$ is identifiable. It follows that there is a unique solution in terms of $m$, $a_j$, and $\pi_j$, for $j = 1, 2, \ldots, m$, for any $F \in \mathcal{H}_{\beta,\Sigma}$.

In the preceding we assumed $\beta$ and $\Sigma$ known, which will not be the case in general. But the previous result can be used to prove identifiability of the larger class $\mathcal{G} = \bigcup_{\beta,\Sigma} \mathcal{H}_{\beta,\Sigma}$, which is the union across all possible values of $(\beta, \Sigma)$. If a distribution from this class has a unique solution in terms of its unknown parameters, then the parameters of the general LIV model are identified (including the relative class sizes).

Proof of identifiability of $\mathcal{G}$. In the following we prove that $\mathcal{G}$ is also identified, i.e. we prove the following theorem.

Theorem 3.1 Assume that $m \geq 2$. $\mathcal{H}_{\beta,\Sigma}$ is identified for all $\beta$ and $\Sigma$ positive semi-definite $\iff$ $\mathcal{G}$ is identified.

Proof ($\Rightarrow$) Let $F, G \in \mathcal{G}$ such that $F \equiv G$, where

\[ F = \sum_{i=1}^{m} a_i F_i \in \mathcal{H}_{\beta,\Sigma}, \qquad G = \sum_{j=1}^{k} b_j G_j \in \mathcal{H}_{\delta,\Psi}, \]

and $a_1, \ldots, a_m$ and $b_1, \ldots, b_k$ are the positive mixing proportions, and the distributions $F_1, \ldots, F_m \in \mathcal{F}_{\beta,\Sigma}$ and $G_1, \ldots, G_k \in \mathcal{F}_{\delta,\Psi}$ are different in terms of their means and variances, i.e.

\[ F_i \text{ is the c.d.f. of } N_2\big(\mu(\beta, \pi_i), \Omega(\beta, \Sigma)\big), \qquad G_i \text{ is the c.d.f. of } N_2\big(\mu(\delta, \gamma_i), \Omega(\delta, \Psi)\big). \tag{3.12} \]

By definition 3.1, $\mathcal{G}$ is identified if $F \equiv G$ implies that $m = k$, $a_i = b_i$, and $F_i = G_i$ for $i = 1, \ldots, k$, possibly after relabeling (and vice versa). This is what we need to show.

$F \equiv G$ implies that

\[ \sum_{i=1}^{m} a_i F_i = \sum_{j=1}^{k} b_j G_j. \tag{3.13} \]

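Both $F$ and $G$ have unique representations (up to a permutation of indices) in terms of $m$, $a_i$, and $\pi_i$, respectively $k$, $b_j$, and $\gamma_j$, because $\mathcal{H}_{\beta,\Sigma}$ and $\mathcal{H}_{\delta,\Psi}$ are assumed to be identified ($\Rightarrow$). Hence, we know that, given $\beta$ and $\Sigma$ ($\delta$ and $\Psi$), there is no other representation in terms of $m$, $a_i$, and $\pi_i$ ($k$, $b_j$, and $\gamma_j$) that yields $F$ ($G$). In addition, in (3.13) we have two finite mixtures of bivariate normal distribution functions. According to proposition 2 of Yakowitz and Spragins (1968) such mixtures are identifiable, hence we must have $m = k$, $a_i = b_i$, and $F_i = G_i$ (possibly after relabeling). But $F_i = G_i$ implies that $\mu(\beta, \pi_i) = \mu(\delta, \gamma_i)$ and $\Omega(\beta, \Sigma) = \Omega(\delta, \Psi)$. Thus,

\[ \beta_0 + \beta_1 \pi_i = \delta_0 + \delta_1 \gamma_i \tag{3.14} \]
\[ \pi_i = \gamma_i \tag{3.15} \]
\[ \beta_1^2 \sigma_\nu^2 + \sigma_\varepsilon^2 + 2\beta_1 \sigma_{\varepsilon\nu} = \delta_1^2 \psi_\nu^2 + \psi_\varepsilon^2 + 2\delta_1 \psi_{\varepsilon\nu} \tag{3.16} \]
\[ \beta_1 \sigma_\nu^2 + \sigma_{\varepsilon\nu} = \delta_1 \psi_\nu^2 + \psi_{\varepsilon\nu} \tag{3.17} \]
\[ \sigma_\nu^2 = \psi_\nu^2 \tag{3.18} \]

for $i = 1, \ldots, m$. Since the $F_i = G_i$ are different for $i = 1, \ldots, m$, we have $\pi_i \neq \pi_j$ and $\gamma_i \neq \gamma_j$ for all $i \neq j$. Using this and $m \geq 2$, it follows from (3.14) and (3.15) that $\beta_0 = \delta_0$ and $\beta_1 = \delta_1$. Subsequently, from (3.16) - (3.18), we have $\sigma_\varepsilon^2 = \psi_\varepsilon^2$, $\sigma_\nu^2 = \psi_\nu^2$, and $\sigma_{\varepsilon\nu} = \psi_{\varepsilon\nu}$. So, $F \in \mathcal{G}$ has a unique representation and $\mathcal{G}$ is thus identified.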
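The last step of the proof can be spelled out as follows. This is a short worked derivation that uses only (3.14) - (3.18) and the assumption that at least two category means differ; the indices $i$ and $j$ denote any two such categories.

```latex
% Substitute (3.15) into (3.14) for two categories i and j with \pi_i \neq \pi_j:
\begin{align*}
(\beta_0 - \delta_0) + (\beta_1 - \delta_1)\,\pi_i &= 0, \\
(\beta_0 - \delta_0) + (\beta_1 - \delta_1)\,\pi_j &= 0.
\end{align*}
% Subtracting the second line from the first gives
\begin{align*}
(\beta_1 - \delta_1)(\pi_i - \pi_j) = 0
  \;\Longrightarrow\; \beta_1 = \delta_1
  \;\Longrightarrow\; \beta_0 = \delta_0 .
\end{align*}
% With \beta_1 = \delta_1 and (3.18), equation (3.17) reduces to
\begin{align*}
\sigma_{\varepsilon\nu} = \psi_{\varepsilon\nu},
\end{align*}
% and (3.16) then reduces to
\begin{align*}
\sigma_\varepsilon^2 = \psi_\varepsilon^2 .
\end{align*}
```

With a single category ($m = 1$) only one equation of the form (3.14) is available, so $\beta_0$ and $\beta_1$ cannot be separated; this is where the condition $m \geq 2$ enters.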

The reverse of the proof ($\Leftarrow$) follows immediately (i.e. a subset of an identified set must be identified as well).

To conclude, from theorem 3.1 it follows that if $m \geq 2$ and all the group means $\pi_j$, $j = 1, \ldots, m$, are different, the parameters of the LIV model in (3.1) with normally distributed errors are identifiable, including the mixture probabilities.

3.3.2 Information matrix

The Fisher information matrix quantifies the 'information' in a sample. In addition, the Cramér-Rao lower bound on the (asymptotic) variance of unbiased estimators is based on this matrix, and the matrix plays a role in determining the asymptotic distribution of the maximum likelihood estimator. The information matrix is defined as

\[ I(\theta) = -E\left[\frac{\partial^2 \ln L(\theta)}{\partial\theta\,\partial\theta'}\right] = E\left[\left(\frac{\partial \ln L(\theta)}{\partial\theta}\right)\left(\frac{\partial \ln L(\theta)}{\partial\theta'}\right)\right], \tag{3.19} \]

provided that this expression is well-defined. From the mixture literature (e.g.

Titterington, Smith, and Makov, 1985) it is known that the information loss is

larger when the mixing weights are unbalanced or when the component den-

sities are not well separated. In such cases, larger samples or adding a small

portion of fully categorized data may result in large improvements (Tittering-

ton, Smith, and Makov, 1985, or Redner and Walker, 1984). Although in some

situations the calculation of the information matrix can be simplified by using

lower-dimensional numerical (quadrature) integration, or by using existing ta-

bles that report on quantities which can be used to approximate the information

matrix quite accurately, in general the derivatives of the log-likelihood will be

complicated nonlinear functions of the data whose expected values will be un-

known.

Using the law of large numbers, the information matrix can be estimated by evaluating the actual (not expected) second-order derivatives of the log-likelihood at the maximum likelihood estimate (i.e. the negative of the Hessian at that point). A second estimate can be obtained using the first order derivatives (which are necessary to solve the likelihood equation). This estimator of the information matrix is known as the outer product of gradients (OPG) estimator (e.g. Greene, 2000, or Davidson and MacKinnon, 1993). In the following we examine the first- and second-order derivatives of the log-likelihood function of the LIV model in (3.1) more closely.
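The two estimators of the information matrix can be compared on a deliberately simple example. The sketch below uses a univariate normal model with parameters (mean, variance) rather than the LIV model, because its scores and Hessian are short enough to write out in full; the function names and the simulated values are assumptions made for this illustration.

```python
import numpy as np


def normal_scores(x, mu, var):
    """Per-observation scores of the N(mu, var) log-density w.r.t. (mu, var)."""
    d = x - mu
    return np.column_stack([d / var, -0.5 / var + 0.5 * d**2 / var**2])


def opg_information(x, mu, var):
    """Outer product of gradients: sum over observations of score times score'."""
    s = normal_scores(x, mu, var)
    return s.T @ s


def observed_information(x, mu, var):
    """Negative Hessian of the log-likelihood w.r.t. (mu, var)."""
    d = x - mu
    n = x.size
    h_mm = n / var
    h_mv = d.sum() / var**2
    h_vv = -0.5 * n / var**2 + (d**2).sum() / var**3
    return np.array([[h_mm, h_mv], [h_mv, h_vv]])


rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=np.sqrt(2.0), size=20000)
mu_hat, var_hat = x.mean(), x.var()          # maximum likelihood estimates

info_opg = opg_information(x, mu_hat, var_hat)
info_hessian = observed_information(x, mu_hat, var_hat)

# Estimated asymptotic covariance matrices of (mu_hat, var_hat)
cov_opg = np.linalg.inv(info_opg)
cov_hessian = np.linalg.inv(info_hessian)
print(np.round(info_opg / x.size, 3))
print(np.round(info_hessian / x.size, 3))
```

Both matrices, divided by the sample size, approximate the per-observation information matrix with diagonal elements $1/\sigma^2$ and $1/(2\sigma^4)$. For the LIV model the same two constructions apply, with the scores and second derivatives given by the expressions that follow.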

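Let

\[ L(\theta \mid y, x) = \prod_{i=1}^{n} f(y_i, x_i \mid \theta), \tag{3.20} \]

where the parameter $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)'$ is defined as follows: $\theta_1 = (\beta_0, \beta_1)'$, $\theta_2 = (\pi_1, \ldots, \pi_m)'$, $\theta_3 = (\sigma_\varepsilon^2, \sigma_{\varepsilon\nu}, \sigma_\nu^2)'$, and $\theta_4 = (\lambda_1, \ldots, \lambda_m)'$, with $\sum_{j=1}^{m} \lambda_j = 1$, and

\[ f(y_i, x_i \mid \theta) = \sum_{j=1}^{m} \lambda_j f_j(y_i, x_i \mid \theta_1, \theta_{2j}, \theta_3). \tag{3.21} \]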

In this notation, $\theta_1$ contains the (regression) parameters that do not differ with $j$, and $\theta_2$ contains the parameters that depend on $j$, except the class sizes $\lambda_j$, which are contained in $\theta_4$. The vector $\theta_3$ consists of the elements of $\Sigma$. We also write $f_{i|j}$ instead of $f_j(y_i, x_i \mid \theta_1, \theta_{2j}, \theta_3)$ and $f_i$ instead of $f(y_i, x_i \mid \theta)$.

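The general structure of the first order derivatives of the log-likelihood of the LIV model (gradient) with respect to $\theta_1$, $\theta_2$, and $\theta_3$ is

\[ \frac{\partial}{\partial\theta_h} \log L(\theta) = \sum_{i=1}^{n} \frac{1}{f_i} \sum_{j=1}^{m} \lambda_j \frac{\partial f_{i|j}}{\partial\theta_h}, \tag{3.22} \]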

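where $\theta_h$ is an element of $\theta_1$, $\theta_2$, or $\theta_3$. For elements of $\theta_4$ (i.e. the group sizes) we have for $l = 1, \ldots, m-1$,

\[ \frac{\partial}{\partial\theta_{4l}} \log L(\theta) = \sum_{i=1}^{n} \frac{f_{i|l} - f_{i|m}}{f_i}, \tag{3.23} \]

since $\lambda_m = 1 - \sum_{j=1}^{m-1} \lambda_j$. In appendix 3B we present more detailed results when the errors have a joint normal distribution.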
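The structure of (3.22) and (3.23) can be checked numerically against finite differences. The sketch below does this for a univariate normal mixture with a common variance, which is a simplification of the bivariate LIV density chosen to keep the example short; the function names and all numerical values are assumptions for this illustration. For $\theta_h = \pi_j$, only component $j$ depends on the parameter, so the inner sum in (3.22) has a single nonzero term.

```python
import numpy as np


def component_pdfs(x, pi, var):
    """n-by-m matrix with entries f_{i|j}: the N(pi_j, var) density at x_i."""
    d = x[:, None] - pi[None, :]
    return np.exp(-0.5 * d**2 / var) / np.sqrt(2.0 * np.pi * var)


def full_weights(lam_free):
    """Append lambda_m = 1 - sum of the first m-1 class sizes."""
    return np.append(lam_free, 1.0 - lam_free.sum())


def loglik(x, pi, lam_free, var):
    return np.log(component_pdfs(x, pi, var) @ full_weights(lam_free)).sum()


def grad_pi(x, pi, lam_free, var):
    """Equation (3.22) with theta_h = pi_j."""
    comp = component_pdfs(x, pi, var)
    lam = full_weights(lam_free)
    f = comp @ lam                                        # mixture density f_i
    dcomp = comp * (x[:, None] - pi[None, :]) / var       # d f_{i|j} / d pi_j
    return (lam * dcomp / f[:, None]).sum(axis=0)


def grad_lambda(x, pi, lam_free, var):
    """Equation (3.23): sum_i (f_{i|l} - f_{i|m}) / f_i for l = 1, ..., m-1."""
    comp = component_pdfs(x, pi, var)
    f = comp @ full_weights(lam_free)
    return ((comp[:, :-1] - comp[:, [-1]]) / f[:, None]).sum(axis=0)


def numeric_grad(fun, point, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(point)
    for k in range(point.size):
        step = np.zeros_like(point)
        step[k] = h
        grad[k] = (fun(point + step) - fun(point - step)) / (2.0 * h)
    return grad


rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1.0, 1.0, 200), rng.normal(2.0, 1.0, 300)])
pi = np.array([-1.0, 0.5, 2.0])
lam_free = np.array([0.3, 0.5])            # lambda_3 = 0.2 is implied
var = 1.0

analytic_pi = grad_pi(x, pi, lam_free, var)
numeric_pi = numeric_grad(lambda p: loglik(x, p, lam_free, var), pi)
analytic_lam = grad_lambda(x, pi, lam_free, var)
numeric_lam = numeric_grad(lambda l: loglik(x, pi, l, var), lam_free)
print(np.max(np.abs(analytic_pi - numeric_pi)), np.max(np.abs(analytic_lam - numeric_lam)))
```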

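The general structure of the mixed partial derivatives of the log-likelihood of the LIV model (Hessian) with respect to the elements of $\theta_1$, $\theta_2$, and $\theta_3$ is given by

\[ \frac{\partial^2}{\partial\theta_h\,\partial\theta_k} \log L(\theta) = \frac{\partial}{\partial\theta_h} \sum_{i=1}^{n} \frac{1}{f_i} \sum_{j=1}^{m} \lambda_j \frac{\partial f_{i|j}}{\partial\theta_k} \tag{3.24} \]
\[ \qquad = \sum_{i=1}^{n} \left[ -\left( \sum_{l=1}^{m} \frac{\lambda_l}{f_i} \frac{\partial f_{i|l}}{\partial\theta_h} \right) \left( \sum_{j=1}^{m} \frac{\lambda_j}{f_i} \frac{\partial f_{i|j}}{\partial\theta_k} \right) + \sum_{j=1}^{m} \frac{\lambda_j}{f_i} \frac{\partial^2 f_{i|j}}{\partial\theta_h\,\partial\theta_k} \right], \]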


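where $\theta_h$ and $\theta_k$ are elements of $\theta_1$, $\theta_2$, or $\theta_3$. The second order derivatives with respect to the group sizes $\theta_4$ have the following structure:

\[ \frac{\partial^2}{\partial\theta_h\,\partial\theta_{4l}} \log L(\theta) = \frac{\partial}{\partial\theta_h} \left\{ \frac{\partial}{\partial\theta_{4l}} \log L(\theta) \right\} = \sum_{i=1}^{n} \frac{ f_i \frac{\partial}{\partial\theta_h} \{ f_{i|l} - f_{i|m} \} - \{ f_{i|l} - f_{i|m} \} \frac{\partial}{\partial\theta_h} f_i }{ (f_i)^2 }, \tag{3.25} \]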

where the results from the gradients can be used. In appendix 3B we present

more detailed results on the derivation of the second-order partial derivatives

for the normal case.

The gradient vector and Hessian matrix can be programmed along with the

log-likelihood function and numerical optimization techniques can be used to

find the maximum likelihood estimates of the parameters of the LIV model.

We found that using these analytical expressions drastically increases the speed

of convergence of the numerical optimization routine. Furthermore, the final

results were found to be more stable than using numerical approximations of

the gradient and Hessian. This holds in particular for the Hausman test in the

next section.

3.4 A test-statistic to test for regressor-error dependencies

We propose to apply a Hausman test directly to test for exogeneity of the regressor (see Greene, 2000) based on the parameter estimates $\beta_{LIV}$. The null hypothesis is that both the OLS and the LIV estimates are consistent. The alternative hypothesis states that only LIV is consistent. The Hausman-LIV statistic is defined as

\[ H_{LIV} = (\beta_{LIV} - \beta_{OLS})'\, \Sigma_{H_{LIV}}^{-1}\, (\beta_{LIV} - \beta_{OLS}), \tag{3.26} \]

where $\Sigma_{H_{LIV}}$ is the estimated asymptotic covariance of the difference $\beta_{LIV} - \beta_{OLS}$. Hausman shows that this covariance can be computed by subtracting the estimated asymptotic OLS covariance matrix from the estimated asymptotic LIV covariance matrix. The latter can be obtained as the outer product of the gradients, from the Hessian matrix, or, in case of potential misspecification, by White's misspecification-consistent covariance matrix (White, 1982). We found that the estimated asymptotic covariance matrix based on the analytical first- and second-order derivatives (see subsection 3.3.2) is more stable and gives more accurate results than a numerical approximation of the gradient or Hessian. Under the null hypothesis, the statistic asymptotically follows a $\chi^2(1)$-distribution.
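The computation in (3.26) is small enough to sketch directly. The function below is an illustrative Python version, not the implementation used in this study; its name and the numbers in the usage lines are hypothetical. It takes the degrees of freedom equal to the number of coefficients compared, which is one when only the regression coefficient $\beta_1$ is tested.

```python
import numpy as np
from scipy.stats import chi2


def hausman_statistic(b_liv, b_ols, cov_liv, cov_ols):
    """Hausman statistic and asymptotic p-value for the difference b_liv - b_ols.

    cov_liv and cov_ols are the estimated asymptotic covariance matrices
    (or variances, for a single coefficient) of the two estimators.
    """
    diff = np.atleast_1d(np.asarray(b_liv, dtype=float) - np.asarray(b_ols, dtype=float))
    cov_diff = np.atleast_2d(np.asarray(cov_liv, dtype=float) - np.asarray(cov_ols, dtype=float))
    stat = float(diff @ np.linalg.solve(cov_diff, diff))   # diff' * inv(cov_diff) * diff
    p_value = float(chi2.sf(stat, df=diff.size))
    return stat, p_value


# Hypothetical estimates for a single regression coefficient
stat, p_value = hausman_statistic(b_liv=1.0, b_ols=1.2, cov_liv=0.010, cov_ols=0.002)
print(f"H = {stat:.2f}, p = {p_value:.4f}")
```

In finite samples the difference of the two covariance estimates is not guaranteed to be positive definite; the sketch does not handle that case.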

This Hausman-LIV test we propose has a great practical advantage over clas-

sical IV methods. In the classical case, one would first need to find good

observable instruments, after which a test to investigate whether or not the in-

struments were needed can be performed. If the test does not reject the null

hypothesis, the instrumental variables are simply discarded since the OLS es-

timator is used in that case. Besides, weak and/or endogenous instruments

will bias the test leading to false conclusions. The LIV variant of the test cir-

cumvents this circular problem and observed instrumental variables are not

required to perform the test.

3.5 Monte Carlo experiments

This section presents the results of a Monte Carlo experiment to demonstrate

that the proposed simple LIV model and Hausman-LIV test are well suited to

identify and resolve regressor–error dependencies. Even when the true num-

ber of the categories of the instrument is larger than two and for various dis-

tributions of the endogenous regressor, the LIV estimates are approximately

consistent and the power⁴ of the test appears to be satisfactory.

We present the results as follows. First we discuss the results concerning the main parameters of interest: $\beta_1$ and $\sigma_\varepsilon^2$. We start with the results for fitting the simple LIV model, i.e. assuming a dummy instrument ($m = 2$). We show that in all cases the parameters of interest can be recovered well, in contrast with OLS. Furthermore, we discuss the results of the Hausman-LIV test. Subsequently, we present a sensitivity analysis for the simple LIV model by assuming that the unobserved discrete instrument has three or four categories. As will become clear, this has no significant impact on the results for $\beta_1$ and $\sigma_\varepsilon^2$ in most cases. The same result holds for the power of the Hausman-LIV test for these specifications. This is an important result, since it illustrates that the exact choice of the number of categories has only a limited impact on the final outcomes. Finally, in subsection 3.5.4 we present the results for the $\pi$'s, the $\lambda$'s, $\sigma_\nu^2$, and $\sigma_{\varepsilon\nu}$.

⁴ The power of a test is the probability of getting a positive result for a given test which should produce a positive result (cf. Weisstein, 2004c).

3.5.1 Design of the simulation study: data generation

In the simulation study the data were generated as follows. The error terms $(\varepsilon, \nu)$ are drawn from a bivariate normal distribution with unit variances. The endogenous regressor $x$ was constructed by varying the correlation between $x$ and $\varepsilon$ and the true number of instruments $m$. We considered three specifications for $m$: 2, 4, and 8, using equal group sizes $1/m$. This results in a bimodal distribution with $m = 2$ (bim2), and two unimodal distributions with $m = 4$ (unim4) and $m = 8$ (unim8) for the endogenous $x$. Furthermore, we consider two other distributions for the instruments, both with $m = 8$ support points, resulting in a bimodal distribution (bim8) and a skewed distribution (skew8). The values for $m$, $\sigma_{\varepsilon\nu}$, and $\pi_1, \ldots, \pi_m$ are chosen such that the mean of $x$ is zero, its variance is 2.5, and the correlation between $x$ and $\varepsilon$ is $0, 0.1, \ldots, 0.5$. Since in all simulations the endogenous regressor has mean zero, the OLS estimate of the constant is unbiased and it can be used as an estimate for $\beta_0$. Hence, we omit further details on $\beta_0$ in the following. The Hausman-LIV test statistic for the regression coefficient has a $\chi^2(1)$-distribution under the null hypothesis of no regressor-error dependency. Data were generated for 1000 observations and 250 Monte Carlo replications. To maximize the likelihood function we use the quasi-Newton numerical optimization routines (the BFGS method) that are provided with the GAUSS package (Aptech, 2000). For computing the gradient and Hessian we use the analytical expressions discussed in subsection 3.3.2 and derived in appendix 3B.

Figure 3.1: Bias plots $\beta_1$ for the simple LIV model, where 1: OLS, 2: bimodal $m = 2$ (bim2), 3: bimodal $m = 8$ (bim8), 4: skewed $m = 8$ (skew8), 5: unimodal $m = 4$ (unim4), and 6: unimodal $m = 8$ (unim8).
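The design for the bim2 case can be sketched in a few lines of Python. The sketch assumes that the constraints stated above pin down the design as follows: with $\sigma_\nu^2 = 1$ and two equally likely categories at $\pm\pi$, a variance of 2.5 for $x$ requires $\pi^2 = 1.5$, and a correlation $\rho_{x\varepsilon}$ requires $\sigma_{\varepsilon\nu} = \rho_{x\varepsilon}\sqrt{2.5}$. The function name and the true regression coefficients (`beta0`, `beta1`) are assumptions for this illustration; the text above does not state the values used.

```python
import numpy as np


def generate_bim2(n, rho, rng, beta0=1.0, beta1=1.0):
    """Draw (y, x, eps) for the bimodal m = 2 design with corr(x, eps) = rho."""
    pi = np.sqrt(1.5) * np.array([-1.0, 1.0])     # category means, Var(x) = 1.5 + 1
    sigma_en = rho * np.sqrt(2.5)                 # cov(eps, nu) implied by rho
    cov = np.array([[1.0, sigma_en], [sigma_en, 1.0]])
    groups = rng.integers(0, 2, size=n)           # equal group sizes 1/m
    errors = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    eps, nu = errors[:, 0], errors[:, 1]
    x = pi[groups] + nu
    y = beta0 + beta1 * x + eps
    return y, x, eps


rng = np.random.default_rng(3)
y, x, eps = generate_bim2(200_000, rho=0.3, rng=rng)
print(round(x.mean(), 3), round(x.var(), 3), round(np.corrcoef(x, eps)[0, 1], 3))
```

A large sample is used here only to make the design moments visible; the simulation study itself uses 1000 observations per replication.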

3.5.2 Results for the simple LIV model ($m = 2$)

Results for $\beta_1$ and $\sigma_\varepsilon^2$

Figure 3.1 shows the bias plots for $\beta_1$ estimated for the six different correlations between $x$ and $\varepsilon$ by, respectively, OLS and the LIV method for data generated with $m = 2, 4, 8$. Two observations are in order. First, increasing the degree of endogeneity decreases the amount of uncertainty in the LIV estimator. This result is expected as the proposed method is designed for situations with endogeneity. In the case of a perfectly exogenous regressor, OLS provides the 'best linear unbiased' estimator and outperforms LIV, but OLS performs worse as the correlation between $x$ and $\varepsilon$ increases. Secondly, when there are four or eight instruments in the unimodal distribution (unim4 and unim8), some efficiency is lost with the LIV approach since the model is misspecified under these conditions. Furthermore, the distribution of $x$ tends more to a normal distribution with mean 0 and variance 2.5 when the true number of categories is larger. From theorem 3.1 we know that a normal distribution for the unobserved instrument results in an unidentified model. Furthermore, less well separated mixture components may lead to lower efficiency, as became clear from results on information in mixture models (see subsection 3.3.2). However, when the true instrument has an obvious grouped structure, as in the case of the two bimodal (bim2 and bim8) and the skewed (skew8) distributions of the instrument, it is well approximated by the assumed discrete instrument. In these cases, the LIV model represents the true instruments quite accurately, resulting in more efficient estimates. In all cases, the LIV estimator appears to be consistent.

In figure 3.2 we present the results for $\sigma_\varepsilon^2$. It can be seen that more or less similar conclusions hold as for $\beta_1$. The OLS estimator for $\sigma_\varepsilon^2$ is the 'best' estimator for $\rho_{x\varepsilon} = 0$. For $\rho_{x\varepsilon} = 0.1$ and $\rho_{x\varepsilon} = 0.2$ the bias in the OLS estimator is not large, but it becomes more substantial for larger regressor-error correlations, in which case the OLS estimator exhibits a significant downward bias. For the skewed and the two bimodal distributions, the LIV results are very precise even for a situation with no endogeneity. For the two symmetric unimodal cases (unim4 and unim8) the distribution across the simulations of the estimates for $\sigma_\varepsilon^2$ has a small positive skew. As the correlation between $x$ and $\varepsilon$ increases, the variance of the estimate for $\sigma_\varepsilon^2$ across the simulations is slightly higher for all cases, unlike the results for $\beta_1$. This can be expected since a larger correlation between $x$ and $\varepsilon$ implies that $y$ and $E(y|x)$ are further apart, which will adversely affect the precision of an estimate for $\sigma_\varepsilon^2$. Nevertheless, the LIV estimates for $\sigma_\varepsilon^2$ appear to be approximately consistent.

Figure 3.2: Bias plots $\sigma_\varepsilon^2$ for the simple LIV model, where 1: OLS, 2: bimodal $m = 2$ (bim2), 3: bimodal $m = 8$ (bim8), 4: skewed $m = 8$ (skew8), 5: unimodal $m = 4$ (unim4), and 6: unimodal $m = 8$ (unim8).

Results for the Hausman test for exogeneity

Table 3.1 shows the results for the Hausman-LIV test. The degrees of endogeneity are presented row-wise; each entry represents the fraction of rejections of the null hypothesis. Increasing the correlation between $x$ and $\varepsilon$ increases the number of times the null hypothesis is rejected, as is to be expected. Comparing the two bimodal distributions, the test performs slightly better for bim2, in which case the number of instruments is correctly specified, although the results are very close⁵. Comparing the two unimodal distributions (unim4 and unim8), the test tends to reject the null hypothesis somewhat too often when the true correlation is zero for the case with $m = 8$. As before, this is caused by efficiency loss due to misspecification and the approximation to a normal distribution. When the true instrument has a skewed distribution, the power of the Hausman-LIV test for $\rho_{x\varepsilon} > 0$ is higher than for any other distribution that we investigated. But in this case the test is also too liberal for zero regressor-error correlations because the model is misspecified. The power of the test is highest for the bimodal distributions and the skewed distribution, and lowest for the unimodal distributions. If the instrument has a bimodal or a skewed distribution, the two groups imposed on the endogenous $x$ by the simple LIV model are a more adequate representation, allowing for precise LIV estimates.

⁵ We did not report the standard deviations. These can be computed easily as $\sqrt{f(1-f)/l}$, where $f$ are the reported fractions and $l$ is the total number of simulation runs.
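The standard deviation formula in footnote 5 can be applied to any entry of table 3.1. A minimal Python sketch, with a hypothetical function name and using the 250 Monte Carlo replications stated in subsection 3.5.1:

```python
import math


def rejection_fraction_sd(fraction, runs):
    """Binomial standard deviation sqrt(f (1 - f) / l) of a rejection fraction."""
    return math.sqrt(fraction * (1.0 - fraction) / runs)


# Entry for bim2, alpha = 0.05, rho = 0.1 in table 3.1
sd_example = rejection_fraction_sd(0.44, 250)
print(round(sd_example, 4))
```

A reported fraction of 0.44 thus carries a standard deviation of roughly 0.03, which is useful to keep in mind when comparing neighbouring columns of the table.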


Table 3.1: Power of the Hausman-LIV test using the simple LIV model for various degrees of endogeneity, for sizes (respectively) α = 0.50, 0.05, and 0.01.

                      Distribution
α      ρx,ε    bim2    bim8    skew8   unim4   unim8

0.50   0       0.53    0.51    0.54    0.47    0.54
       0.1     0.92    0.86    0.96    0.71    0.59
       0.2     1.00    1.00    1.00    0.95    0.86
       0.3     1.00    1.00    1.00    0.99    0.97
       0.4     1.00    1.00    1.00    1.00    1.00
       0.5     1.00    1.00    1.00    1.00    1.00

0.05   0       0.04    0.06    0.08    0.06    0.07
       0.1     0.44    0.40    0.56    0.20    0.13
       0.2     0.95    0.97    0.99    0.52    0.42
       0.3     1.00    1.00    1.00    0.85    0.76
       0.4     1.00    1.00    1.00    1.00    0.97
       0.5     1.00    1.00    1.00    1.00    1.00

0.01   0       0.02    0.01    0.01    0.02    0.02
       0.1     0.22    0.22    0.37    0.08    0.06
       0.2     0.87    0.82    0.96    0.27    0.24
       0.3     1.00    1.00    1.00    0.68    0.61
       0.4     1.00    1.00    1.00    0.97    0.94
       0.5     1.00    1.00    1.00    1.00    1.00

The results of our simulation studies for the Hausman-LIV test suggest that the

test has a reasonable power across a wide range of regressor-error correlations

and for different kinds of distributions of the instruments. Furthermore, the

proposed model test and estimation work well even if the number of instru-

ments is misspecified. We find the test to be fairly robust under misspecifica-

tion with a small bias towards rejecting the null hypothesis somewhat too often.

These results are obtained without requiring observed instrumental variables.


3.5.3 Sensitivity analysis: using $m = 3$ and $m = 4$

The simple LIV model assumes a dummy instrument. We saw that, regardless of the true number of categories of the instrument and the shape of the distribution of the endogenous $x$, the main parameters are estimated approximately consistently. Here we illustrate for the skewed (skew8), bimodal (bim8), and unimodal (unim8) distributions from the previous subsections the effect of estimating the LIV model with $m = 3$ and $m = 4$ categories. The results demonstrate that the conclusions for the main parameters ($\beta_1$ and $\sigma_\varepsilon^2$) and the Hausman-LIV test are fairly robust for different choices of $m$, alleviating the burden of choosing a value for $m$.

The results for $\beta_1$ for six different regressor-error correlations are shown in figure 3.3. In each figure, the first panel shows the results for the bimodal distribution when the LIV model is estimated for $m = 2, 3$ and 4, the second panel gives the results for the skewed distribution, and the third panel shows the results for the unimodal distribution. For all three distributions the true number of instruments is eight. It is clear that changing the number of categories of the instrument hardly affects the estimation results for $\beta_1$, and the bias is approximately zero in all cases. The results for the unimodal distribution are slightly more sensitive to different choices of $m$, in particular when $\rho_{x\varepsilon} = 0$. As before, a larger value for $\rho_{x\varepsilon}$ results in more precise estimates. For $\sigma_\varepsilon^2$ we also found that the results are fairly robust against different choices for $m$ and we omit further details.

Results for the power of the Hausman test are presented⁶ in table 3.2. Note that the $m = 2$ columns are copied from table 3.1. It can be seen that for the bimodal and skewed cases the power of the Hausman-LIV test is fairly stable for different choices of $m$. For the bimodal case with $m = 4$ and the skewed case with $m = 2$ the power is slightly lower, although differences are small. For the unimodal distribution, the test has a lower power for $m = 3$ and $m = 4$, in particular when $\rho_{x\varepsilon}$ is small. This can be explained from the results for $\beta_1$ that showed larger amounts of uncertainty in the estimates (figure 3.3) than for $m = 2$. In this case, taking $m = 2$ or increasing the sample size would contribute to the power of the test.

⁶ For the sake of clarity we omitted α = 0.50 from the table.

Figure 3.3: Bias plots $\beta_1$ LIV model for the 1: skewed, 2: bimodal, and 3: unimodal cases, with $m = 8$, and estimated with $m = 2, 3, 4$, respectively.

Table 3.2: Power of the Hausman-LIV test for different choices of m.

                       α = 0.05                  α = 0.01
         ρx,ε    m = 2   m = 3   m = 4     m = 2   m = 3   m = 4

bim8     0       0.06    0.06    0.09      0.01    0.02    0.02
         0.1     0.40    0.37    0.38      0.22    0.18    0.20
         0.2     0.97    0.99    0.88      0.82    0.99    0.72
         0.3     1.00    0.99    0.99      1.00    0.99    0.99
         0.4     1.00    1.00    1.00      1.00    1.00    1.00
         0.5     1.00    1.00    0.99      1.00    1.00    0.99

skew8    0       0.08    0.06    0.07      0.01    0.02    0.01
         0.1     0.56    0.63    0.62      0.37    0.42    0.40
         0.2     0.99    1.00    0.99      0.96    0.99    0.96
         0.3     1.00    1.00    1.00      1.00    1.00    1.00
         0.4     1.00    1.00    1.00      1.00    1.00    1.00
         0.5     1.00    1.00    1.00      1.00    1.00    1.00

unim8    0       0.07    0.12    0.24      0.02    0.05    0.18
         0.1     0.13    0.17    0.24      0.06    0.08    0.14
         0.2     0.42    0.43    0.47      0.24    0.25    0.29
         0.3     0.76    0.73    0.68      0.61    0.55    0.55
         0.4     0.97    0.93    0.91      0.94    0.89    0.85
         0.5     1.00    0.99    1.00      1.00    0.99    0.99

When the LIV model is estimated with $m = 3$ or $m = 4$ we found degenerate solutions. A degenerate solution occurs when two (or more) estimated category means ($\pi_j$) are equal or when one category (or more) contains no observations ($\lambda_j = 0$). In such a case, the determinant of the Hessian matrix evaluated at the maximum likelihood estimate is zero. Degeneracy for a fitted model can be expected to occur more often when the different components of the mixture distribution show a large overlap. Results from, among others, Redner and Walker (1984) illustrate that if the sample is from a mixture of poorly separated components (using the Mahalanobis distance as a criterion), the estimation problem becomes more difficult and impractically large samples may be needed in order to expect moderately precise estimates for the category means and sizes. For the LIV model⁷ this does not present a problem for the main parameters, since in such a case a nondegenerate solution with a lower number of categories can be used. This can be done without harm, as the previous results illustrate that the estimates for the main parameters ($\beta_1$ and $\sigma_\varepsilon^2$) are hardly affected by changing the number of categories.
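The definition of a degenerate solution given above translates into a simple screening rule for fitted models. The following Python sketch is illustrative only: the function name and both tolerances are assumptions, since the text defines degeneracy through exact equality and does not state numerical thresholds.

```python
import numpy as np


def is_degenerate(pi, lam, mean_tol=1e-3, weight_tol=1e-3):
    """Flag a fitted LIV solution as degenerate.

    A solution is flagged when two estimated category means coincide
    (within mean_tol) or when a category is empty (lambda_j below weight_tol).
    """
    pi = np.sort(np.asarray(pi, dtype=float))
    lam = np.asarray(lam, dtype=float)
    means_coincide = pi.size > 1 and bool(np.any(np.diff(pi) < mean_tol))
    empty_category = bool(np.any(lam < weight_tol))
    return means_coincide or empty_category


print(is_degenerate([-0.5, 2.0, 2.0], [0.8, 0.1, 0.1]))                  # two equal means
print(is_degenerate([-0.94, 0.51, 2.85, 5.14], [0.54, 0.36, 0.08, 0.03]))  # well separated
```

When a fit with $m$ categories is flagged, the procedure described above amounts to refitting with $m - 1$ categories and using that solution for the main parameters.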

3.5.4 Results for $\pi$, $\lambda$, $\sigma_\nu^2$, and $\sigma_{\varepsilon\nu}$

For the sake of simplicity and overview we postponed discussing the results for the other parameters ($\pi$, $\lambda$, $\sigma_\nu^2$, and $\sigma_{\varepsilon\nu}$) until here. In performing a LIV analysis, interest will usually be in the parameters $\beta_1$ and $\sigma_\varepsilon^2$, since the linear regression model is the object of study. However, the other parameters can give valuable insight. We give the results for the simple LIV model for the bimodal distribution with $m = 2$ (bim2), in which case the number of categories is correctly specified. We also present the results for these parameters for $m = 2, 3$, and 4 for the skewed distribution with $m = 8$ (skew8), and report for the other cases only substantially different or noteworthy outcomes⁸.

In table 3.3 we present the results for the bimodal $m = 2$ distribution (means and standard deviations across the 250 Monte Carlo simulations) for which the simple LIV model is correctly specified. It can be seen that the group means and sizes are estimated approximately without bias. Similar results hold for the variance $\sigma_\nu^2$ and the covariance $\sigma_{\varepsilon\nu}$. As for $\beta_1$, a higher degree of endogeneity results in lower standard deviations for the estimates.

⁷ The degenerate solutions are left out of the analysis. The largest number of degenerate solutions was found for the unimodal distribution (unim8) when estimated with $m = 4$ (about 35% of the Monte Carlo replications).

⁸ The results can be obtained from the author upon request.

Table 3.3: Results for the bimodal m = 2 case. The true values are π1 = −π2 = −1.23, λ1 = λ2 = 0.5, σ²ν = 1, and the true values for σεν are given in the last column. Standard deviations are in parentheses.

                              Parameters
ρx,ε    π1       π2       λ1       λ2       σ²ν      σεν      true σεν

0       -1.23    1.23     0.50     0.50     1.00     0.01     0
        (0.08)   (0.08)   (0.03)   (0.03)   (0.08)   (0.10)
0.1     -1.22    1.23     0.50     0.50     1.00     0.16     0.16
        (0.08)   (0.07)   (0.03)   (0.03)   (0.07)   (0.09)
0.2     -1.23    1.22     0.50     0.50     0.99     0.31     0.32
        (0.06)   (0.06)   (0.02)   (0.02)   (0.07)   (0.09)
0.3     -1.22    1.23     0.50     0.50     1.00     0.48     0.47
        (0.06)   (0.06)   (0.02)   (0.02)   (0.06)   (0.08)
0.4     -1.23    1.22     0.50     0.50     1.01     0.64     0.63
        (0.06)   (0.05)   (0.02)   (0.02)   (0.06)   (0.06)
0.5     -1.23    1.23     0.50     0.50     1.00     0.79     0.79
        (0.05)   (0.05)   (0.02)   (0.02)   (0.05)   (0.06)

For the skewed distribution with $m = 8$, the true number of categories is larger than the numbers used in estimation ($m = 2, 3, 4$), in which case the estimates for $\pi$ and $\lambda$ cannot be 'consistent'. Table 3.4 and table 3.5 show the results for $m = 2, 3, 4$ for $\pi$ and $\lambda$, and for $\sigma_{\varepsilon\nu}$ and $\sigma_\nu^2$, respectively, when the distribution of $x$ is skewed. Two comments are in order. First, across the simulations we found that the observed means $\bar{x} = (1/n)\sum_i x_i$ and variances $s_x^2 = (1/n)\sum_i (x_i - \bar{x})^2$ are (almost) equal to the estimated means $\mu_x = \sum_j \lambda_j \pi_j$ and variances $\sigma_x^2 = \sum_j \lambda_j (\pi_j - \mu_x)^2 + \sigma_\nu^2$, regardless of the choice for $m$. This means that the estimated value for $\sigma_\nu^2$ decreases when $m$ increases because of a trade-off between 'within-group' and 'between-group' variance (see table 3.5). Secondly, for larger correlations between $x$ and $\varepsilon$, the difference between the largest and smallest estimated group mean gets smaller. This is the effect of the covariance on the estimated group means and illustrates, for instance, that both equations cannot be estimated independently of each other. As a consequence, the between-group variance is smaller and the estimated value for $\sigma_\nu^2$ is higher for larger values of $\rho_{x\varepsilon}$, which holds in particular for $m = 4$, see table 3.5.
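The trade-off between 'within-group' and 'between-group' variance can be made concrete with the averaged estimates reported in table 3.4 for $\rho_{x\varepsilon} = 0$. The Python sketch below computes the between-group part $\sum_j \lambda_j (\pi_j - \mu_x)^2$ for $m = 2, 3, 4$ and subtracts it from the design variance of 2.5. The function name is hypothetical, and because the inputs are rounded averages across replications rather than individual fits, the implied within-group values are only indicative and are not the entries of table 3.5.

```python
import numpy as np


def between_group_variance(pi, lam):
    """Between-group variance sum_j lam_j (pi_j - mu_x)^2 with mu_x = sum_j lam_j pi_j."""
    pi = np.asarray(pi, dtype=float)
    lam = np.asarray(lam, dtype=float)
    lam = lam / lam.sum()                 # the tabulated class sizes are rounded
    mu_x = lam @ pi
    return float(lam @ (pi - mu_x) ** 2)


# Averaged estimates for the skewed (m = 8) case at rho = 0, taken from table 3.4
fits = {
    2: ([-0.34, 3.64], [0.92, 0.08]),
    3: ([-0.49, 1.99, 4.64], [0.84, 0.12, 0.04]),
    4: ([-0.94, 0.51, 2.85, 5.14], [0.54, 0.36, 0.08, 0.03]),
}
between = {m: between_group_variance(pi, lam) for m, (pi, lam) in fits.items()}
implied_within = {m: 2.5 - b for m, b in between.items()}   # design variance of x is 2.5

for m in fits:
    print(m, round(between[m], 3), round(implied_within[m], 3))
```

The between-group part grows with $m$, so the part of the variance of $x$ that is left for $\sigma_\nu^2$ shrinks, which is the pattern described in the first comment above.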

Regardless of the choice of $m$, the mean and variance of $x$ are estimated approximately without bias by $\mu_x$ and $\sigma_x^2$. Furthermore, the parameters $\pi$ and $\lambda$ are estimated consistently if the number of categories in the LIV model is correctly specified. But, importantly, the results for $\beta_1$ and $\sigma_\varepsilon^2$ are approximately unbiased regardless of the choice for $m$. Hence, a number of categories in estimation that differs from the true $m$ does not present a problem in estimating the effect of $x$ on $y$ in the presence of regressor-error correlations.

In this section we presented the results of a simulation study to investigate the

performance of the simple latent instrumental variable (LIV) model and the

power of the Hausman-LIV test. We showed that the regression coefficients

can be estimated approximately unbiased and that the Hausman-LIV test has

a reasonable power, without the requirement of having observed instrumental

variables at hand. Furthermore, the results are fairly robust against different

specifications for $m$. In the next section the simple LIV model is applied to a

measurement error problem. In this application a natural grouping variable ex-

ists which is likely to be a ‘perfect’ instrumental variable since it was obtained

in a laboratory setting. We show that the LIV model gives in this situation

identical results to the classical IV estimator, but without using the observed

dummy instrument. The simulation results and this empirical result illustrate

that the LIV model can be successfully used in applications where perfect in-

strumental variables are not available.

3.6 An illustrative example: a simple measurement error model

The data we use in this section is taken from Madansky (1959). Madansky

considers several methods to fit a straight line when both variables are sub-

ject to error. One method is based on the grouping method of Wald (see also

Bowden and Turkington, 1984) which can be viewed as an IV method. The

advantage of this dataset is that a ‘natural’ discrete instrument comes with the


Table 3.4: Results for the skewed (m = 8) case.

ρx,ε   m   π1       π2       π3       π4       λ1       λ2       λ3       λ4
0      2   -0.34     3.64                       0.92     0.08
           (0.05)   (0.25)                     (0.01)   (0.01)
       3   -0.49     1.99     4.64              0.84     0.12     0.04
           (0.08)   (0.50)   (0.42)            (0.04)   (0.04)   (0.01)
       4   -0.94     0.51     2.85     5.14     0.54     0.36     0.08     0.03
           (0.43)   (0.54)   (0.59)   (0.48)   (0.20)   (0.18)   (0.02)   (0.01)
0.1    2   -0.34     3.66                       0.92     0.08
           (0.04)   (0.27)                     (0.01)   (0.01)
       3   -0.50     1.91     4.60              0.84     0.12     0.04
           (0.07)   (0.45)   (0.38)            (0.04)   (0.04)   (0.01)
       4   -1.00     0.46     2.72     5.01     0.51     0.38     0.08     0.03
           (0.51)   (0.58)   (0.64)   (0.57)   (0.24)   (0.21)   (0.03)   (0.01)
0.2    2   -0.33     3.68                       0.91     0.09
           (0.05)   (0.25)                     (0.01)   (0.01)
       3   -0.48     2.01     4.64              0.84     0.12     0.04
           (0.07)   (0.44)   (0.35)            (0.03)   (0.03)   (0.01)
       4   -0.89     0.63     2.92     5.10     0.59     0.31     0.07     0.03
           (0.55)   (0.60)   (0.64)   (0.52)   (0.22)   (0.20)   (0.03)   (0.01)
0.3    2   -0.34     3.60                       0.91     0.09
           (0.05)   (0.23)                     (0.01)   (0.01)
       3   -0.48     2.01     4.58              0.85     0.11     0.04
           (0.06)   (0.35)   (0.31)            (0.03)   (0.02)   (0.01)
       4   -0.84     0.56     2.73     4.93     0.59     0.31     0.07     0.03
           (0.39)   (0.57)   (0.57)   (0.44)   (0.22)   (0.20)   (0.03)   (0.01)
0.4    2   -0.35     3.52                       0.91     0.09
           (0.04)   (0.25)                     (0.01)   (0.01)
       3   -0.47     1.97     4.52              0.85     0.11     0.04
           (0.05)   (0.28)   (0.28)            (0.02)   (0.02)   (0.01)
       4   -0.63     0.93     2.87     4.88     0.73     0.18     0.06     0.03
           (0.24)   (0.49)   (0.49)   (0.32)   (0.13)   (0.11)   (0.02)   (0.01)
0.5    2   -0.35     3.40                       0.91     0.09
           (0.04)   (0.23)                     (0.01)   (0.01)
       3   -0.47     1.89     4.43              0.85     0.11     0.04
           (0.04)   (0.18)   (0.22)            (0.02)   (0.01)   (0.01)
       4   -0.55     1.13     2.83     4.76     0.79     0.12     0.05     0.03
           (0.05)   (0.28)   (0.37)   (0.25)   (0.03)   (0.02)   (0.01)   (0.01)


Table 3.5: Results for the skewed m = 8 case. The true value of σ²ν is 1 and the true values of σεν are the same as in table 3.3.

         m = 2             m = 3             m = 4
ρx,ε     σ²ν      σεν      σ²ν      σεν      σ²ν      σεν
0        1.33     0.01     1.08     0.01     0.85     0.01
        (0.08)   (0.08)   (0.08)   (0.07)   (0.12)   (0.07)
0.1      1.34     0.15     1.07     0.16     0.86     0.15
        (0.08)   (0.07)   (0.09)   (0.07)   (0.13)   (0.07)
0.2      1.33     0.32     1.08     0.32     0.90     0.32
        (0.07)   (0.07)   (0.09)   (0.07)   (0.12)   (0.07)
0.3      1.34     0.47     1.10     0.47     0.93     0.47
        (0.07)   (0.07)   (0.07)   (0.06)   (0.09)   (0.06)
0.4      1.34     0.63     1.10     0.63     0.99     0.64
        (0.07)   (0.07)   (0.07)   (0.06)   (0.07)   (0.06)
0.5      1.35     0.78     1.10     0.78     1.02     0.78
        (0.07)   (0.07)   (0.06)   (0.06)   (0.06)   (0.06)

experiment. This makes a comparison between the LIV method and classical

IV method in an empirical setting of great interest, since the ‘true’ instrument

is known.

The dataset contains a random sample of 50 measures on yield strength of ar-

tillery shells (x) and measures for hardness of the shells (y). Artillery shells are

projectiles for large guns, whose quality depends, among other things, on the

hardness of steel of which they are composed. A low hardness, for instance,

may cause premature explosions in or near the muzzle of the gun, and projec-

tiles are less effective against armor when composed of low quality metal9.

The yield strength was measured by pulling a piece of steel from two sides

for a period of time and converting the new dimension into a measure of yield

strength. The (Brinell) hardness was obtained by making a dent in each shell

with a device from which the hardness could be read from a dial. The shells

9Source: www.civilwarartillery.com and www.winterwar.com


were manufactured from two different heats of steel, which constitutes the

‘natural’ data-grouping criterion (25 observations in each group). This vari-

able can serve as an instrumental variable as the different manufacturing tem-

peratures do not affect the measuring of yield strength or hardness afterwards

(Madansky, 1959). We do not attempt to discuss any of the technicalities associated with measuring y and x, which is beyond our scope. Madansky argues

that measurement errors in yield strength are due to inhomogeneity of steel

and other errors in the process of measuring (human or measuring-instrument

errors). This dataset is in particular interesting because of the presence of a

laboratory instrument that allows for direct comparison of the classical IV es-

timator and the LIV estimator.

Table 3.6: Results for the Madansky measurement error data (n = 50).

                 Method
                 OLS       IV        LIV
β                3.288     3.204     3.204
                (0.426)   (0.440)   (0.431)
Hausman test     -         0.551     1.403

The results for OLS, IV, and the simple LIV model are given in table 3.6. The

IV estimate with the temperature dummy as natural instrument for the effect

of yield strength on hardness is βIV = 3.204. The simple LIV estimate (with

m = 2) is equal to 3.204, i.e. exactly identical. The latter estimate is ob-

tained without using the observed instrument. More importantly, when the a

posteriori classification found by the LIV model is compared to the ‘natural’

dummy instrument (high heat - low heat), we find that the LIV classification

and the observed classification are identical, i.e. we are able to predict the

observed instrument exactly, see table 3.7. All posterior probabilities for the

two categories of the latent instrument are either zero or one. In both cases, the

Hausman-test does not indicate a significant bias in the OLS estimate, which is

βOLS = 3.288. The similarity between LIV and IV with an observed ‘natural’


dummy instrument illustrates the power of the LIV method: the LIV method

which does not rely on an observed instrument gives results that are similar

to a situation where a perfect instrument is observed. A ‘natural’ instrument,

however, will rarely be available in empirical studies in economics or market-

ing10.

Table 3.7: Predicted instrument versus observed instrument (n = 50).

                                Observed instrument
Predicted instrument (LIV)      Low heat   High heat   Total
Group 1                         25         0           25
Group 2                         0          25          25
Total                           25         25          50
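The a posteriori classification compared in table 3.7 can be illustrated with a small sketch. It assumes the usual mixture-model posterior, i.e. the probability that observation i belongs to category j is λ_j f_{i|j} / f_i, with the component density written in the quadratic-form parameterization of Appendix 3B. The function names and all parameter values below are hypothetical and chosen only for illustration; they are not the estimates obtained for the Madansky data.

```python
import math

def component_density(y, x, beta0, beta1, pi_j, s_ee, s_vv, s_ev):
    """Bivariate normal density of (y, x) given category mean pi_j,
    written as exp(-q / 2d) / (2 * pi * sqrt(d))."""
    a = y - (beta0 + beta1 * pi_j)      # y minus its conditional mean
    b = x - pi_j                        # x minus its conditional mean
    q = (a * a * s_vv
         + b * b * (beta1 ** 2 * s_vv + s_ee + 2.0 * beta1 * s_ev)
         - 2.0 * a * b * (beta1 * s_vv + s_ev))
    d = s_ee * s_vv - s_ev ** 2
    return math.exp(-q / (2.0 * d)) / (2.0 * math.pi * math.sqrt(d))

def posterior_probabilities(y, x, beta0, beta1, pi, lam, s_ee, s_vv, s_ev):
    """Posterior membership probabilities lambda_j * f_{i|j} / f_i."""
    weighted = [l * component_density(y, x, beta0, beta1, p, s_ee, s_vv, s_ev)
                for l, p in zip(lam, pi)]
    total = sum(weighted)               # this is the mixture density f_i
    return [w / total for w in weighted]

# Hypothetical parameter values with two well-separated categories.
params = dict(beta0=1.0, beta1=3.0, pi=[0.0, 10.0], lam=[0.5, 0.5],
              s_ee=1.0, s_vv=1.0, s_ev=0.2)
post = posterior_probabilities(y=1.3, x=0.1, **params)
predicted_group = post.index(max(post)) + 1
print(post, predicted_group)
```

When the categories are well separated relative to the error variance, as in this hypothetical setting, the posterior probabilities are numerically close to zero or one, which is the pattern reported for the shell data.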

Although the IV and LIV are similar, the results raise a few questions. Firstly,

both IV and LIV indicate an upward bias in OLS, whereas the classical mea-

surement error model predicts a downward bias. The upward bias that we find,

however, is in both cases not significant. Secondly, precise estimation of the

variance covariance matrix for the LIV estimate is problematic in samples of

this size. LIV is slightly more efficient than IV11 when the Hessian matrix,

evaluated at the maximum likelihood estimate, is used12 to compute the stan-

dard errors. In this example the yield strength has an obvious group structure

and a lack of knowledge of a priori group-membership (i.e. knowing the ob-

served instrument) does not result in a large loss of information. Furthermore,

since the LIV model estimates variances and covariances simultaneously, contrary to IV, more efficient results may be obtained. However, when the outer product of gradients is used to estimate the standard deviation, the classical IV estimator is more efficient (in this case we find that the estimated standard error for βLIV is 0.531).

10 In this particular case, another estimator for β can be constructed, if we can assume that X = 0 implies Y = 0, so that the intercept is zero (cf. Madansky, 1954). When this is true, a consistent estimate for β is given by y/x = 3.434 (0.04). Fixing the intercept to zero in the LIV model yields 3.432 (0.433), which is similar but much less efficient than Madansky's simple estimator.

11 Madansky reports a standard error for the grouping method of 0.22, which is much smaller than our result. For least squares he reports 0.47. We do not know exactly how these estimates were obtained.

12 Greene (2000) states that the available estimators for the asymptotic covariance matrix usually give different results in small samples, but when a sample is small or moderate sized the Hessian is preferable.

As we saw, the small size of the dataset does not allow for very precise esti-

mation of the relation between yield strength and hardness of artillery shells.

If precise inferences on the relation between yield strength and hardness are

desirable, more datapoints are needed. Nevertheless, this example shows that

even in a small dataset the LIV method yields results similar to IV when an

observed ‘natural’ dummy instrument is available. However, in absence of

such an instrument the IV estimator can not be used, whereas LIV still gives

unbiased results.
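The comparison between OLS and the classical IV estimator with an observed dummy instrument can be sketched as follows. The data below are synthetic and generated under assumed parameter values (true slope 3, unit error variances); they are not the Madansky data, and the variable names are hypothetical. The sketch shows that, for a binary instrument, the IV estimator coincides with Wald's grouping estimator, i.e. the ratio of the differences in group means.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

random.seed(0)
n = 2000
z = [i % 2 for i in range(n)]                        # observed dummy instrument
x_true = [1.0 + 2.0 * zi + random.gauss(0.0, 1.0) for zi in z]
x = [xt + random.gauss(0.0, 1.0) for xt in x_true]   # regressor measured with error
y = [3.0 * xt + random.gauss(0.0, 1.0) for xt in x_true]

beta_ols = cov(x, y) / cov(x, x)                     # attenuated by measurement error
beta_iv = cov(z, y) / cov(z, x)                      # IV with the dummy as instrument

def group_mean(values, group):
    return mean([v for v, zi in zip(values, z) if zi == group])

# Wald's grouping estimator: ratio of between-group differences in means.
beta_wald = ((group_mean(y, 1) - group_mean(y, 0))
             / (group_mean(x, 1) - group_mean(x, 0)))

print(round(beta_ols, 3), round(beta_iv, 3), round(beta_wald, 3))
```

In this synthetic setting the OLS estimate is biased towards zero, as the classical measurement error model predicts. The LIV approach discussed in the text aims at the same target as `beta_iv` without observing `z`.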

3.7 Conclusions

Searching for valid instruments is a long-standing problem in estimating IV

models in economics that account for regressor-error problems. In addition,

the identification of regressor-error correlation has been impossible without

such valid instruments. Our proposed instrument-free approach presents a

practical solution to this circular problem: it can be used to estimate regres-

sion parameters and test for regressor-error correlations without the necessity

of first finding valid instruments.

In this chapter we introduced the LIV model. We proved that the LIV model

is identified through the likelihood, which is a necessary condition for the ex-

istence of a consistent estimator. Furthermore, we discussed estimation of the

information matrix. The Monte Carlo studies show that the model yields unbiased results for several types of distributions for x, and outperforms OLS whenever there exists a correlation between x and ε. The Hausman-LIV test

detects departures from independence of regressor and model error with a rea-

sonable power across a wide range of regressor-error correlations. In the case

of severe model violations, the Hausman-LIV test becomes too strict in reject-


ing the null hypothesis when it is true. As a result, in applications of this test

researchers may search unnecessarily for manifest instruments in a small frac-

tion of cases. However, we feel that this is a small price to pay in view of the

simplicity and ease of implementing the proposed test. Importantly, the test

results and estimates for β1 are robust against different choices for the number of categories of the instrument. Finally, we analyzed a measurement error

application with the LIV model. We showed that the LIV model gives similar

results to the classical IV estimator when a ‘natural’ dummy instrument exists.

The results in this chapter are convergent and add credibility to the LIV ap-

proach. The LIV model presents a solution to the circular problem of search-

ing for valid instruments in empirical studies. The model and Hausman-LIV

test are fairly simple and easy to implement, and the results in this chapter il-

lustrate its usefulness across a wide variety of problems. In the next chapter

we consider several extensions of the LIV model. Furthermore, we suggest

diagnostics to examine model fit.


Appendix 3A Basic theorems on identifiability of mixtures

Let $\mathcal{F}$ and $\mathcal{H}$ be defined as in subsection 3.3.1. Yakowitz and Spragins (1968) prove the following theorem.

Theorem [Yakowitz and Spragins, 1968]. A necessary and sufficient condition for the class $\mathcal{H}$ of all finite mixtures of the family $\mathcal{F}$ to be identifiable is that $\mathcal{F}$ be a linearly independent set over the field of real numbers.

A corollary to this theorem is the following.

Corollary [Yakowitz and Spragins, 1968]. A necessary and sufficient condition for the class $\mathcal{H}$ of all finite mixtures of the family $\mathcal{F}$ to be identifiable is that the image of $\mathcal{F}$ under any vector isomorphism on span($\mathcal{F}$) be linearly independent in the image space.

This corollary tends to be easier to apply than the theorem itself as it allows us to work in terms of generating functions, which are often more convenient to handle than the corresponding c.d.f.'s. Yakowitz and Spragins (1968) use this corollary to prove that the family $\mathcal{F}$ of n-dimensional Gaussian c.d.f.'s generates identifiable finite mixtures (their proposition 2).

For detailed proofs and extensions to other distributions we refer to Yakowitz and Spragins (1968) or Teicher (1963). For a recent overview and discussion, see Hennig (2000).

Appendix 3B First and second order derivatives of the log-likelihood

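Define $z_{i|j} = (y_i - \beta_0 - \beta_1\pi_j,\; x_i - \pi_j)'$ and assume that the errors of the LIV model in (3.1) have a bivariate normal distribution. Then, using (3.3) and (3.4), we have for $f_j(y_i, x_i \mid \theta_1, \theta_{2j}, \theta_3)$ the following:

$$
f_j(y_i, x_i \mid \theta_1, \theta_{2j}, \theta_3) = \frac{1}{2\pi\sqrt{|\Omega|}} \exp\left(-\frac{1}{2}\, z_{i|j}'\,\Omega^{-1} z_{i|j}\right), \tag{3B.1}
$$

where $\theta_1$, $\theta_2$, and $\theta_3$ are defined in subsection 3.3.2. The quadratic form $z_{i|j}'\Omega^{-1}z_{i|j}$ can be rewritten as the quotient $q_i(\theta_1, \theta_{2j}, \theta_3)/d(\theta_3)$ with

$$
\begin{aligned}
q_i(\theta_1, \theta_{2j}, \theta_3) &= \left(y_i - \mu^{y}_{i|j}\right)^2 \sigma_\nu^2 + \left(x_i - \mu^{x}_{i|j}\right)^2 \left(\beta_1^2\sigma_\nu^2 + \sigma_\varepsilon^2 + 2\beta_1\sigma_{\varepsilon\nu}\right) \\
&\quad - 2\left(y_i - \mu^{y}_{i|j}\right)\left(x_i - \mu^{x}_{i|j}\right)\left(\beta_1\sigma_\nu^2 + \sigma_{\varepsilon\nu}\right) \\
d(\theta_3) &= \sigma_\varepsilon^2\sigma_\nu^2 - \sigma_{\varepsilon\nu}^2,
\end{aligned} \tag{3B.2}
$$

and $\mu^{y}_{i|j} = \beta_0 + \beta_1\pi_j$ and $\mu^{x}_{i|j} = \pi_j$. Now (3B.1) can be written as

$$
f_j(y_i, x_i \mid \theta_1, \theta_{2j}, \theta_3) = \frac{\exp\left[-q_i(\theta_1, \theta_{2j}, \theta_3) / 2d(\theta_3)\right]}{2\pi\sqrt{d(\theta_3)}}. \tag{3B.3}
$$

To simplify notation we will write $f_{i|j}$ for $f_j(y_i, x_i \mid \theta_1, \theta_{2j}, \theta_3)$, $f_i$ for $f(y_i, x_i \mid \theta)$, $q_{i|j}$ for $q_i(\theta_1, \theta_{2j}, \theta_3)$, and $d$ for $d(\theta_3)$.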
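As a numerical sanity check on the rewriting of (3B.1) as (3B.3), the following sketch evaluates the density once through the reduced-form covariance matrix of (3.4) and once through the quotient q/d. It assumes that the reduced-form covariance has elements β1²σν² + σε² + 2β1σεν, β1σν² + σεν and σν², as implied by (3B.2); the function names and the numerical inputs are hypothetical.

```python
import math

def density_via_omega(y, x, beta0, beta1, pi_j, s_ee, s_vv, s_ev):
    """(3B.1): bivariate normal density using the 2x2 reduced-form covariance."""
    o11 = beta1 ** 2 * s_vv + s_ee + 2.0 * beta1 * s_ev   # var(y | group)
    o12 = beta1 * s_vv + s_ev                              # cov(y, x | group)
    o22 = s_vv                                             # var(x | group)
    det = o11 * o22 - o12 ** 2
    a = y - (beta0 + beta1 * pi_j)
    b = x - pi_j
    # z' Omega^{-1} z with the explicit inverse of a 2x2 matrix
    quad = (o22 * a * a - 2.0 * o12 * a * b + o11 * b * b) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def density_via_q_d(y, x, beta0, beta1, pi_j, s_ee, s_vv, s_ev):
    """(3B.3): the same density written with q and d from (3B.2)."""
    a = y - (beta0 + beta1 * pi_j)
    b = x - pi_j
    q = (a * a * s_vv
         + b * b * (beta1 ** 2 * s_vv + s_ee + 2.0 * beta1 * s_ev)
         - 2.0 * a * b * (beta1 * s_vv + s_ev))
    d = s_ee * s_vv - s_ev ** 2
    return math.exp(-q / (2.0 * d)) / (2.0 * math.pi * math.sqrt(d))

args = dict(y=2.0, x=1.5, beta0=1.0, beta1=2.0, pi_j=1.0,
            s_ee=1.0, s_vv=1.0, s_ev=0.3)
f_omega = density_via_omega(**args)
f_qd = density_via_q_d(**args)
print(f_omega, f_qd)
```

The two values agree because the determinant of the reduced-form covariance matrix simplifies to d = σε²σν² − σεν², so β1 drops out of the normalizing constant.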

First order partial derivatives (Gradient)

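First derivatives with respect to elements of $\theta_1$ (fixed parameters)

We have

$$
\frac{\partial}{\partial\theta_{1s}} \log L(\theta) = \sum_{i=1}^{n} \frac{1}{f_i} \sum_{j=1}^{m} \lambda_j \frac{\partial f_{i|j}}{\partial\theta_{1s}}, \tag{3B.4}
$$

where

$$
\frac{\partial f_{i|j}}{\partial\theta_{1s}} = -f_{i|j}\,\frac{q'_{i|j}(1s)}{2d}, \tag{3B.5}
$$

with $q'_{i|j}(1s) = \partial q_{i|j}/\partial\theta_{1s}$, and $\theta_{11} = \beta_0$ and $\theta_{12} = \beta_1$. It follows that:

$$
\begin{aligned}
\frac{\partial q_{i|j}}{\partial\beta_0} &= -2\sigma_\nu^2\left(y_i - \mu^{y}_{i|j}\right) + 2\left(x_i - \mu^{x}_{i|j}\right)\left(\beta_1\sigma_\nu^2 + \sigma_{\varepsilon\nu}\right) \\
\frac{\partial q_{i|j}}{\partial\beta_1} &= -2\pi_j\sigma_\nu^2\left(y_i - \mu^{y}_{i|j}\right) + 2\left(x_i - \mu^{x}_{i|j}\right)^2\left\{\sigma_\nu^2\beta_1 + \sigma_{\varepsilon\nu}\right\} \\
&\quad - 2\left(x_i - \mu^{x}_{i|j}\right)\left\{-\pi_j\left(\beta_1\sigma_\nu^2 + \sigma_{\varepsilon\nu}\right) + \left(y_i - \mu^{y}_{i|j}\right)\sigma_\nu^2\right\}.
\end{aligned}
$$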
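The two analytic derivatives of q with respect to β0 and β1 can be checked against central finite differences. The sketch below is only such a check, not part of the estimation procedure; function names and parameter values are hypothetical. Since q is a quadratic polynomial in both β0 and β1, the central difference is exact up to rounding.

```python
def q_value(beta0, beta1, y, x, pi_j, s_ee, s_vv, s_ev):
    """q from (3B.2) for one observation and one category."""
    a = y - (beta0 + beta1 * pi_j)
    b = x - pi_j
    return (a * a * s_vv
            + b * b * (beta1 ** 2 * s_vv + s_ee + 2.0 * beta1 * s_ev)
            - 2.0 * a * b * (beta1 * s_vv + s_ev))

def dq_dbeta0(beta0, beta1, y, x, pi_j, s_ee, s_vv, s_ev):
    a = y - (beta0 + beta1 * pi_j)
    b = x - pi_j
    return -2.0 * s_vv * a + 2.0 * b * (beta1 * s_vv + s_ev)

def dq_dbeta1(beta0, beta1, y, x, pi_j, s_ee, s_vv, s_ev):
    a = y - (beta0 + beta1 * pi_j)
    b = x - pi_j
    return (-2.0 * pi_j * s_vv * a
            + 2.0 * b * b * (s_vv * beta1 + s_ev)
            - 2.0 * b * (-pi_j * (beta1 * s_vv + s_ev) + a * s_vv))

data = dict(y=2.0, x=1.5, pi_j=1.0, s_ee=1.0, s_vv=0.8, s_ev=0.3)
b0, b1, h = 0.5, 2.0, 1e-5

# Central differences in beta0 and beta1.
num_d0 = (q_value(b0 + h, b1, **data) - q_value(b0 - h, b1, **data)) / (2.0 * h)
num_d1 = (q_value(b0, b1 + h, **data) - q_value(b0, b1 - h, **data)) / (2.0 * h)
ana_d0 = dq_dbeta0(b0, b1, **data)
ana_d1 = dq_dbeta1(b0, b1, **data)
print(num_d0, ana_d0, num_d1, ana_d1)
```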

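First derivatives with respect to elements of $\theta_2$ (group means)

We have for $\theta_{2l}$, $l = 1, \ldots, m$,

$$
\frac{\partial}{\partial\theta_{2l}} \log L(\theta) = \sum_{i=1}^{n} \frac{\lambda_l}{f_i} \frac{\partial f_{i|l}}{\partial\theta_{2l}}, \tag{3B.6}
$$

since

$$
\frac{\partial}{\partial\theta_{2l}} f_{i|j} = 0, \quad \text{for } l \neq j.
$$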


Estimating the model via numerical optimization

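When the log-likelihood function obtained from (3.20) is maximized via numerical optimization techniques, the estimates obtained for the variances and the group sizes $\lambda_j$, $j = 1, \ldots, m$, do not in general satisfy the constraints $\sigma_\varepsilon^2 \geq 0$, $\sigma_\nu^2 \geq 0$, $\sigma_{\varepsilon\nu}^2 \leq \sigma_\varepsilon^2\sigma_\nu^2$, $0 < \lambda_j < 1$, $j = 1, \ldots, m$, and $\sum_j \lambda_j = 1$. This can be circumvented by optimizing the log-likelihood function not for these parameters directly, but for the transformed parameters (say) $a$, $b$, $c$ and $\tilde\lambda_j$, $j = 1, \ldots, m-1$, with

$$
\begin{bmatrix} a & b \\ 0 & c \end{bmatrix} = \mathrm{chol}(\Sigma),
$$

where 'chol' denotes the Cholesky decomposition, implying that

$$
\begin{aligned}
\sigma_\varepsilon^2 &= a^2 \\
\sigma_{\varepsilon\nu} &= ab \\
\sigma_\nu^2 &= b^2 + c^2,
\end{aligned}
$$

and

$$
\tilde\lambda_j = \ln\left(\frac{\lambda_j}{\lambda_m}\right),
$$

for $j = 1, \ldots, m-1$ and $\tilde\lambda_m = 0$, implying

$$
\lambda_j = \frac{\exp(\tilde\lambda_j)}{1 + \sum_k \exp(\tilde\lambda_k)},
$$

for $j = 1, \ldots, m-1$ and $\lambda_m = 1 - \sum_k \lambda_k$. These expressions can be substituted in (3.21). The derivatives should now be taken with respect to $\theta_3 = (a, b, c)$ and $\theta_4 = (\tilde\lambda_1, \ldots, \tilde\lambda_{m-1})$. This does not affect the general expressions of $\frac{\partial}{\partial\theta_3}\log L(\theta)$ in (3B.8) and (3B.9), but the derivatives in (3B.10) and (3B.11) need to be taken with respect to $a$, $b$ and $c$. The first order derivative of the log-likelihood with respect to the elements of $\theta_4$ given in (3.23) has to be changed for $\theta_4 = (\tilde\lambda_1, \ldots, \tilde\lambda_{m-1})$ as follows

$$
\begin{aligned}
\frac{\partial}{\partial\theta_{4j}} \log L(\theta) &= \sum_{i=1}^{n} \frac{1}{f_i} \frac{\partial}{\partial\theta_{4j}} \sum_{l=1}^{m} \lambda_l(\tilde\lambda_1, \ldots, \tilde\lambda_{m-1})\, f_{i|l} \\
&= \sum_{i=1}^{n} \frac{1}{f_i} \left\{ \sum_{l \neq j}^{m-1} \left[ -\left(f_{i|l} - f_{i|m}\right) \frac{\exp(\tilde\lambda_l + \tilde\lambda_j)}{\left(1 + \sum_k \exp(\tilde\lambda_k)\right)^2} \right] \right. \\
&\qquad \left. + \left(f_{i|j} - f_{i|m}\right) \frac{\exp(\tilde\lambda_j)\left(1 + \sum_{k \neq j}^{m-1} \exp(\tilde\lambda_k)\right)}{\left(1 + \sum_k \exp(\tilde\lambda_k)\right)^2} \right\}.
\end{aligned} \tag{3B.12}
$$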
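The reparameterization can be sketched in a few lines. The code below maps unconstrained values (a, b, c) and the m − 1 log-ratios to a covariance matrix and group sizes that satisfy the constraints by construction. Function names and the numerical inputs are hypothetical, and the sketch covers only the transformation, not the likelihood maximization itself.

```python
import math

def sigma_from_cholesky(a, b, c):
    """Map the Cholesky factor [[a, b], [0, c]] to (s_ee, s_ev, s_vv)."""
    s_ee = a * a
    s_ev = a * b
    s_vv = b * b + c * c
    return s_ee, s_ev, s_vv

def group_sizes_from_log_ratios(lam_tilde):
    """Map m-1 unconstrained log-ratios ln(lambda_j / lambda_m) to m group sizes."""
    denom = 1.0 + sum(math.exp(t) for t in lam_tilde)
    lam = [math.exp(t) / denom for t in lam_tilde]
    lam.append(1.0 - sum(lam))          # lambda_m is the remainder
    return lam

# Any real-valued inputs give admissible parameters.
s_ee, s_ev, s_vv = sigma_from_cholesky(-1.3, 0.4, 0.7)
lam = group_sizes_from_log_ratios([0.5, -2.0])
print(s_ee, s_ev, s_vv, lam)
```

An unconstrained optimizer can then search over (a, b, c) and the log-ratios freely; the determinant σε²σν² − σεν² equals (ac)², so the covariance matrix is positive semi-definite for any real inputs.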


Second order partial derivatives (Hessian)

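In the following we examine the second order derivatives in more detail:

$$
\begin{array}{rrrrr}
\dfrac{\partial^2}{\partial\theta_{1s}^2} & \dfrac{\partial^2}{\partial\theta_{1s}\partial\theta_{1t}} & \dfrac{\partial^2}{\partial\theta_{1s}\partial\theta_{2t}} & \dfrac{\partial^2}{\partial\theta_{1s}\partial\theta_{3t}} & \dfrac{\partial^2}{\partial\theta_{1s}\partial\theta_{4t}} \\[2ex]
 & \dfrac{\partial^2}{\partial\theta_{2s}^2} & \dfrac{\partial^2}{\partial\theta_{2s}\partial\theta_{2t}} & \dfrac{\partial^2}{\partial\theta_{2s}\partial\theta_{3t}} & \dfrac{\partial^2}{\partial\theta_{2s}\partial\theta_{4t}} \\[2ex]
 & & \dfrac{\partial^2}{\partial\theta_{3s}^2} & \dfrac{\partial^2}{\partial\theta_{3s}\partial\theta_{3t}} & \dfrac{\partial^2}{\partial\theta_{3s}\partial\theta_{4t}} \\[2ex]
 & & & \dfrac{\partial^2}{\partial\theta_{4s}^2} & \dfrac{\partial^2}{\partial\theta_{4s}\partial\theta_{4t}},
\end{array} \tag{3B.13}
$$

where $s$ and $t$ indicate the different elements in $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$. Due to the continuity of the log-likelihood (except for some boundary points where $d(\theta_3) = 0$), we have equality of the mixed partial derivatives (i.e. $\partial^2 f(x, y)/\partial x\partial y = \partial^2 f(x, y)/\partial y\partial x$, see Apostol, 1969). We do not give the expressions for the second order derivatives of the $q_{i|j}$'s and $d$ here, which can be derived easily from (3B.2) and the first order partial derivatives of $q_{i|j}$ and $d$ above.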

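Second order partial derivatives elements of $\theta_1$ (fixed effects)

The second order partial derivative $\partial^2/\partial\theta_{1s}\partial\theta_{1t}$ can be computed using (3.24), where

$$
\begin{aligned}
\frac{\partial^2 f_{i|j}}{\partial\theta_{1s}\partial\theta_{1t}} &= \frac{\partial}{\partial\theta_{1s}} \left\{ \frac{\partial f_{i|j}}{\partial\theta_{1t}} \right\} \\
&= -\left\{ \frac{\partial f_{i|j}}{\partial\theta_{1s}} \right\} \frac{\partial q_{i|j}}{\partial\theta_{1t}} \frac{1}{2d} - \frac{f_{i|j}}{2d} \frac{\partial^2 q_{i|j}}{\partial\theta_{1s}\partial\theta_{1t}},
\end{aligned}
$$

which gives, on using (3B.5)

$$
\begin{aligned}
\frac{\partial^2 \log L(\theta)}{\partial\theta_{1s}\partial\theta_{1t}} = \sum_{i=1}^{n} \Biggl\{ &-\Biggl( \sum_{l=1}^{m} \frac{\lambda_l f_{i|l}}{2d f_i} \frac{\partial q_{i|l}}{\partial\theta_{1s}} \Biggr) \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{2d f_i} \frac{\partial q_{i|j}}{\partial\theta_{1t}} \\
&+ \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{4d^2 f_i} \left\{ \frac{\partial q_{i|j}}{\partial\theta_{1s}} \frac{\partial q_{i|j}}{\partial\theta_{1t}} - 2d \frac{\partial^2 q_{i|j}}{\partial\theta_{1s}\partial\theta_{1t}} \right\} \Biggr\}.
\end{aligned} \tag{3B.14}
$$

The second order partial derivative with respect to the same elements of $\theta_1$ (i.e. $s = t$) is almost equal to (3B.14) but can be simplified a bit more, namely

$$
\begin{aligned}
\frac{\partial^2 \log L(\theta)}{\partial\theta_{1s}^2} = \sum_{i=1}^{n} \Biggl\{ &-\Biggl( \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{2d f_i} \frac{\partial q_{i|j}}{\partial\theta_{1s}} \Biggr)^2 \\
&+ \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{4d^2 f_i} \left\{ \left( \frac{\partial q_{i|j}}{\partial\theta_{1s}} \right)^2 - 2d \frac{\partial^2 q_{i|j}}{\partial\theta_{1s}^2} \right\} \Biggr\}.
\end{aligned} \tag{3B.15}
$$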

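The second order partial derivative of the log-likelihood with respect to the elements of $\theta_1$ and $\theta_2$ is given by

$$
\frac{\partial^2 \log L(\theta)}{\partial\theta_{1s}\partial\theta_{2l}} = \sum_{i=1}^{n} -\sum_{j'=1}^{m} \frac{\lambda_{j'}}{f_i} \frac{\partial f_{i|j'}}{\partial\theta_{1s}} \left[ \frac{\lambda_l}{f_i} \frac{\partial f_{i|l}}{\partial\theta_{2l}} \right] + \sum_{i=1}^{n} \frac{\lambda_l}{f_i} \frac{\partial^2 f_{i|l}}{\partial\theta_{1s}\partial\theta_{2l}}, \tag{3B.16}
$$

with

$$
\begin{aligned}
\frac{\partial^2 f_{i|l}}{\partial\theta_{1s}\partial\theta_{2l}} &= \frac{\partial}{\partial\theta_{1s}} \left\{ \frac{\partial f_{i|l}}{\partial\theta_{2l}} \right\} \\
&= \frac{f_{i|l}}{2d} \left\{ \frac{1}{2d} \frac{\partial q_{i|l}}{\partial\theta_{1s}} \frac{\partial q_{i|l}}{\partial\theta_{2l}} - \frac{\partial^2 q_{i|l}}{\partial\theta_{1s}\partial\theta_{2l}} \right\}.
\end{aligned}
$$

The result follows using (3B.5) and (3B.7) for $s = 1, 2$ and $l = 1, \ldots, m$.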

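For $\partial^2 \log L(\theta)/\partial\theta_{1s}\partial\theta_{3t}$, $s = 1, 2$ and $t = 1, 2, 3$, we first need to compute

$$
\begin{aligned}
\frac{\partial^2 f_{i|j}}{\partial\theta_{1s}\partial\theta_{3t}} &= \frac{\partial}{\partial\theta_{1s}} \left\{ \frac{f_{i|j}}{2d^2} \left[ q_{i|j} \frac{\partial d}{\partial\theta_{3t}} - d \left( \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial d}{\partial\theta_{3t}} \right) \right] \right\} \\
&= \frac{f_{i|j}}{2d^2} \left\{ \left( \frac{3d - q_{i|j}}{2d} \right) \frac{\partial q_{i|j}}{\partial\theta_{1s}} \frac{\partial d}{\partial\theta_{3t}} + \frac{1}{2} \frac{\partial q_{i|j}}{\partial\theta_{1s}} \frac{\partial q_{i|j}}{\partial\theta_{3t}} - d \frac{\partial^2 q_{i|j}}{\partial\theta_{1s}\partial\theta_{3t}} \right\},
\end{aligned} \tag{3B.17}
$$

from which it follows that

$$
\begin{aligned}
\frac{\partial^2 \log L(\theta)}{\partial\theta_{1s}\partial\theta_{3t}} &= \sum_{i=1}^{n} \frac{-1}{(f_i)^2} \sum_{j'=1}^{m} \left( -\frac{\lambda_{j'} f_{i|j'}}{2d} \frac{\partial q_{i|j'}}{\partial\theta_{1s}} \right) \times \\
&\qquad \times \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{2d^2} \left[ q_{i|j} \frac{\partial d}{\partial\theta_{3t}} - d \left( \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial d}{\partial\theta_{3t}} \right) \right] \\
&\quad + \sum_{i=1}^{n} \frac{1}{f_i} \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{2d^2} \left\{ \left( \frac{3d - q_{i|j}}{2d} \right) \frac{\partial q_{i|j}}{\partial\theta_{1s}} \frac{\partial d}{\partial\theta_{3t}} \right. \\
&\qquad \left. + \frac{1}{2} \frac{\partial q_{i|j}}{\partial\theta_{1s}} \frac{\partial q_{i|j}}{\partial\theta_{3t}} - d \frac{\partial^2 q_{i|j}}{\partial\theta_{1s}\partial\theta_{3t}} \right\}.
\end{aligned} \tag{3B.18}
$$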

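The second order partial derivative of the LIV log-likelihood with respect to the elements of $\theta_1$ and $\theta_4$ is obtained using (3.25),

$$
\frac{\partial^2 \log L(\theta)}{\partial\theta_{1s}\partial\theta_{4l}} = \sum_{i=1}^{n} \frac{ f_i \frac{\partial}{\partial\theta_{1s}} \left( f_{i|l} - f_{i|m} \right) - \left( f_{i|l} - f_{i|m} \right) \frac{\partial}{\partial\theta_{1s}} f_i }{(f_i)^2}, \tag{3B.19}
$$

where $\partial f_{i|l}/\partial\theta_{1s}$ and $\partial f_{i|m}/\partial\theta_{1s}$ are given in (3B.5) and $\partial f_i/\partial\theta_{1s} = \sum_{j=1}^{m} \lambda_j\, \partial f_{i|j}/\partial\theta_{1s}$, for $s = 1, 2$ and $l = 1, \ldots, m-1$.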

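Second order partial derivatives elements of $\theta_2$ (group means)

The structure of the second order partial derivatives of the log-likelihood with respect to the elements of $\theta_2$ is similar to the results for $\theta_1$. In fact, several simplifications can be made because $\partial f_{i|j}/\partial\theta_{2l}$ and $\partial q_{i|j}/\partial\theta_{2l}$ are both zero for $j \neq l$, $j = 1, \ldots, m$. For $\partial^2 \log L(\theta)/\partial\theta_{2l}^2$ we have

$$
\frac{\partial^2 \log L(\theta)}{\partial\theta_{2l}^2} = \sum_{i=1}^{n} \frac{-1}{(f_i)^2} \left( \lambda_l \frac{\partial f_{i|l}}{\partial\theta_{2l}} \right)^2 + \sum_{i=1}^{n} \frac{\lambda_l}{f_i} \frac{\partial^2 f_{i|l}}{\partial\theta_{2l}^2}, \tag{3B.20}
$$

where $\partial f_{i|l}/\partial\theta_{2l}$ is given in (3B.7) and

$$
\begin{aligned}
\frac{\partial^2 f_{i|l}}{\partial\theta_{2l}^2} &= \frac{\partial}{\partial\theta_{2l}} \left\{ -\frac{f_{i|l}}{2d} \frac{\partial q_{i|l}}{\partial\theta_{2l}} \right\} \\
&= \frac{f_{i|l}}{2d} \left\{ \frac{1}{2d} \left( \frac{\partial q_{i|l}}{\partial\theta_{2l}} \right)^2 - \frac{\partial^2 q_{i|l}}{\partial\theta_{2l}^2} \right\},
\end{aligned}
$$

for $l = 1, \ldots, m$. The second order partial derivative with respect to different elements in $\theta_2$ can be simplified, since

$$
\frac{\partial^2 f_{i|j}}{\partial\theta_{2k}\partial\theta_{2l}} = \frac{\partial}{\partial\theta_{2k}} \left\{ -\frac{f_{i|l}}{2d} \frac{\partial q_{i|l}}{\partial\theta_{2l}} \right\} = 0,
$$

for $j \neq k$, $j = 1, \ldots, m$, from which it follows that

$$
\frac{\partial^2 \log L(\theta)}{\partial\theta_{2k}\partial\theta_{2l}} = \sum_{i=1}^{n} -\left( \frac{\lambda_k f_{i|k}}{2d f_i} \frac{\partial q_{i|k}}{\partial\theta_{2k}} \right) \left( \frac{\lambda_l f_{i|l}}{2d f_i} \frac{\partial q_{i|l}}{\partial\theta_{2l}} \right). \tag{3B.21}
$$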

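The second order partial derivatives with respect to $\theta_{2l}$ and $\theta_{3t}$ are

$$
\frac{\partial^2 \log L(\theta)}{\partial\theta_{2l}\partial\theta_{3t}} = \sum_{i=1}^{n} -\left( \frac{\lambda_l}{f_i} \frac{\partial f_{i|l}}{\partial\theta_{2l}} \right) \sum_{j=1}^{m} \frac{\lambda_j}{f_i} \frac{\partial f_{i|j}}{\partial\theta_{3t}} + \sum_{i=1}^{n} \frac{\lambda_l}{f_i} \frac{\partial^2 f_{i|l}}{\partial\theta_{2l}\partial\theta_{3t}}, \tag{3B.22}
$$

because $\partial^2 f_{i|j}/\partial\theta_{2l}\partial\theta_{3t} = 0$, for $j \neq l$, and equal to

$$
\frac{f_{i|l}}{2d^2} \left\{ \left( \frac{3d - q_{i|l}}{2d} \right) \frac{\partial q_{i|l}}{\partial\theta_{2l}} \frac{\partial d}{\partial\theta_{3t}} + \frac{1}{2} \frac{\partial q_{i|l}}{\partial\theta_{2l}} \frac{\partial q_{i|l}}{\partial\theta_{3t}} - d \frac{\partial^2 q_{i|l}}{\partial\theta_{2l}\partial\theta_{3t}} \right\},
$$

for $j = l$, similar to (3B.17). Using (3.25) we have

$$
\frac{\partial^2 \log L(\theta)}{\partial\theta_{2k}\partial\theta_{4l}} = \sum_{i=1}^{n} \frac{ f_i \frac{\partial}{\partial\theta_{2k}} \left\{ f_{i|l} - f_{i|m} \right\} - \left\{ f_{i|l} - f_{i|m} \right\} \frac{\partial}{\partial\theta_{2k}} f_i }{(f_i)^2}, \tag{3B.23}
$$

where

$$
\frac{\partial f_i}{\partial\theta_{2k}} = \frac{\partial}{\partial\theta_{2k}} \sum_{j=1}^{m} \lambda_j f_{i|j} = \lambda_k \frac{\partial f_{i|k}}{\partial\theta_{2k}},
$$

which is given in (3B.7), and

$$
\frac{\partial \left\{ f_{i|l} - f_{i|m} \right\}}{\partial\theta_{2k}} =
\begin{cases}
0 & k \neq l, m \\
\partial f_{i|l}/\partial\theta_{2l} & k = l \\
-\,\partial f_{i|m}/\partial\theta_{2m} & k = m.
\end{cases}
$$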

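Second order partial derivatives elements of $\theta_3$ (variances-covariance)

Here we derive the second order partial derivatives of the log-likelihood with respect to different elements in $\theta_3$. The second order partial derivatives with respect to the same elements of $\theta_3$ can be derived in a similar way and are not given here. We have

$$
\begin{aligned}
\frac{\partial^2 f_{i|j}}{\partial\theta_{3s}\partial\theta_{3t}} = \frac{\partial}{\partial\theta_{3s}} \left\{ \frac{\partial f_{i|j}}{\partial\theta_{3t}} \right\} &= \frac{\partial f_{i|j}}{\partial\theta_{3s}} \, \frac{ q_{i|j} \frac{\partial d}{\partial\theta_{3t}} - d \left( \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial d}{\partial\theta_{3t}} \right) }{2d^2} \\
&\quad + f_{i|j} \frac{\partial}{\partial\theta_{3s}} \left\{ \frac{ q_{i|j} \frac{\partial d}{\partial\theta_{3t}} - d \left( \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial d}{\partial\theta_{3t}} \right) }{2d^2} \right\},
\end{aligned} \tag{3B.24}
$$

where the latter factor

$$
\begin{aligned}
\frac{\partial}{\partial\theta_{3s}} \left\{ \frac{ q_{i|j} \frac{\partial d}{\partial\theta_{3t}} - d \left( \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial d}{\partial\theta_{3t}} \right) }{2d^2} \right\} = \frac{1}{2d^2} \Biggl\{ & \left( q_{i|j} - d \right) \frac{\partial^2 d}{\partial\theta_{3s}\partial\theta_{3t}} + \frac{\partial q_{i|j}}{\partial\theta_{3t}} \frac{\partial d}{\partial\theta_{3s}} + \frac{\partial q_{i|j}}{\partial\theta_{3s}} \frac{\partial d}{\partial\theta_{3t}} \\
& - \frac{(2q_{i|j} - d)}{d} \frac{\partial d}{\partial\theta_{3s}} \frac{\partial d}{\partial\theta_{3t}} - d \frac{\partial^2 q_{i|j}}{\partial\theta_{3s}\partial\theta_{3t}} \Biggr\},
\end{aligned}
$$

and (3B.24) becomes

$$
\begin{aligned}
\frac{\partial^2 f_{i|j}}{\partial\theta_{3s}\partial\theta_{3t}} = \frac{f_{i|j}}{4d^4} \Biggl\{ & \left( 3d^2 + q_{i|j}^2 - 6d\,q_{i|j} \right) \frac{\partial d}{\partial\theta_{3s}} \frac{\partial d}{\partial\theta_{3t}} + \left( 3d^2 - d\,q_{i|j} \right) \times \\
& \times \left( \frac{\partial d}{\partial\theta_{3s}} \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial q_{i|j}}{\partial\theta_{3s}} \frac{\partial d}{\partial\theta_{3t}} \right) + d^2 \frac{\partial q_{i|j}}{\partial\theta_{3s}} \frac{\partial q_{i|j}}{\partial\theta_{3t}} \\
& + 2d^2 \left( q_{i|j} - d \right) \frac{\partial^2 d}{\partial\theta_{3s}\partial\theta_{3t}} - 2d^3 \frac{\partial^2 q_{i|j}}{\partial\theta_{3s}\partial\theta_{3t}} \Biggr\}.
\end{aligned} \tag{3B.25}
$$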

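Then combining (3.24), (3B.9), and (3B.25) gives

$$
\begin{aligned}
\frac{\partial^2 \log L(\theta)}{\partial\theta_{3s}\partial\theta_{3t}} = \sum_{i=1}^{n} \Biggl( & -\sum_{j'=1}^{m} \frac{\lambda_{j'} f_{i|j'}}{2d^2 f_i} \left\{ q_{i|j'} \frac{\partial d}{\partial\theta_{3s}} - d \left( \frac{\partial q_{i|j'}}{\partial\theta_{3s}} + \frac{\partial d}{\partial\theta_{3s}} \right) \right\} \times \\
& \times \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{2d^2 f_i} \left\{ q_{i|j} \frac{\partial d}{\partial\theta_{3t}} - d \left( \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial d}{\partial\theta_{3t}} \right) \right\} \\
& + \sum_{j=1}^{m} \frac{\lambda_j f_{i|j}}{4d^4 f_i} \Biggl\{ \left( 3d^2 + q_{i|j}^2 - 6d\,q_{i|j} \right) \frac{\partial d}{\partial\theta_{3s}} \frac{\partial d}{\partial\theta_{3t}} \\
& \quad + \left( 3d^2 - d\,q_{i|j} \right) \left( \frac{\partial d}{\partial\theta_{3s}} \frac{\partial q_{i|j}}{\partial\theta_{3t}} + \frac{\partial q_{i|j}}{\partial\theta_{3s}} \frac{\partial d}{\partial\theta_{3t}} \right) \\
& \quad + d^2 \frac{\partial q_{i|j}}{\partial\theta_{3s}} \frac{\partial q_{i|j}}{\partial\theta_{3t}} + 2d^2 \left( q_{i|j} - d \right) \frac{\partial^2 d}{\partial\theta_{3s}\partial\theta_{3t}} - 2d^3 \frac{\partial^2 q_{i|j}}{\partial\theta_{3s}\partial\theta_{3t}} \Biggr\} \Biggr),
\end{aligned} \tag{3B.26}
$$

for $s, t = 1, 2, 3$, $s \neq t$. The results for $s = t$ can be obtained in a similar way and the result is more or less similar to (3B.26). The second order partial derivatives with respect to $\theta_{3s}$, $s = 1, 2, 3$, and $\theta_{4l}$, $l = 1, \ldots, m-1$, are equal to

$$
\begin{aligned}
\frac{\partial^2 \log L(\theta)}{\partial\theta_{3s}\partial\theta_{4l}} = \sum_{i=1}^{n} \frac{1}{2d^2 f_i} \Biggl\{ & \left( f_{i|l}\, q_{i|l} - f_{i|m}\, q_{i|m} - d \left( f_{i|l} - f_{i|m} \right) \right) \frac{\partial d}{\partial\theta_{3s}} \\
& - d \left( f_{i|l} \frac{\partial q_{i|l}}{\partial\theta_{3s}} - f_{i|m} \frac{\partial q_{i|m}}{\partial\theta_{3s}} \right) - \frac{\left( f_{i|l} - f_{i|m} \right)}{f_i} \sum_{j=1}^{m} \lambda_j f_{i|j} \times \\
& \times \left[ q_{i|j} \frac{\partial d}{\partial\theta_{3s}} - d \left( \frac{\partial q_{i|j}}{\partial\theta_{3s}} + \frac{\partial d}{\partial\theta_{3s}} \right) \right] \Biggr\}.
\end{aligned} \tag{3B.27}
$$

Here we applied formula (3.25).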

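Second order partial derivatives elements of $\theta_4$ (group-sizes)

The second-order partial derivatives for the group-sizes are given by

$$
\frac{\partial^2 \log L(\theta)}{\partial\theta_{4k}\partial\theta_{4l}} =
\begin{cases}
\displaystyle \sum_{i=1}^{n} \frac{ -\left( f_{i|l} - f_{i|m} \right)^2 }{ f_i^2 } & \text{if } k = l \\[3ex]
\displaystyle \sum_{i=1}^{n} \frac{ -\left( f_{i|l} - f_{i|m} \right) \left( f_{i|k} - f_{i|m} \right) }{ f_i^2 } & \text{if } k \neq l,
\end{cases} \tag{3B.28}
$$

for $k, l = 1, \ldots, m-1$.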

Chapter 4

LIV implementation issues

4.1 Introduction

In this chapter we extend the LIV model proposed in the previous chapter and

we present several diagnostics to facilitate an LIV analysis. In section 4.2 we

include several exogenous regressors in the model and allow for the possibility

that observed instrumental variables (IVs) are available. Both are important

generalizations for empirical work. We prove in section 4.2 that the parame-

ters of the extended LIV model are identifiable, using a similar approach as in

subsection 3.3.1.

When a researcher has access to observed instrumental variables, an important

question is whether these instrumental variables are valid. Valid instruments

explain a considerable amount of the variance of the endogenous regressor

and have no direct effect on the dependent variable. Unfortunately, the per-

formance of the classical IV estimator, which has been used extensively in

empirical applications, critically relies on the quality of observed instruments,

see chapter 2. Classical IV models are identified using additional variables in

the form of instruments that are constructed on basis of a priori grounds, such

as economic sense or intuition. The assumptions made, however, are often

questionable, see for instance Card’s (1999, 2001) discussion on the validity

of the instrumental variables used in estimating the return to schooling. Un-

fortunately, examining instrument validity is not straightforward in a classical


IV model. In section 4.3 we propose a new approach based on Wald tests

(see e.g. Greene, 2000) to investigate the validity of observed instrumental

variables, and we show by means of a simulation study that the proposed tests

have a reasonable power in identifying weak or endogenous instruments. If the

observed instruments are found to be valid, a classical IV estimator (2SLS or

LIML) can be used, or the observed IVs can be combined with a latent discrete

instrument in the LIV model, yielding potentially more efficient estimates. If

the null hypothesis of having valid observed IVs is rejected, then the classical

IV estimator is known to be biased but the LIV estimator can still be used to

make valid inferences.

Thirdly, in carrying out an LIV analysis several implementational issues need

to be addressed. Here we propose diagnostics to choose the number of categories m, to examine the LIV residuals, and to identify outliers or influential observations. In the previous chapter we showed using synthetic data that the existence of a latent dummy instrumental variable (m = 2) allows for consistent estimation of the regression parameters and we performed a sensitivity analysis using m = 3 or m = 4. It was shown that the main results and conclusions are robust against different choices of m. In empirical studies one may wish to choose one particular value for m and we present several diagnostics that can be used for this purpose. Furthermore, the normality assumption

made to compute the maximum likelihood estimates may be invalid. We inves-

tigate in a simulation study the sensitivity of the LIV estimates for misspecifi-

cation of the distribution of the error terms and we find that the results for the

regression parameters are fairly robust against a misspecified likelihood. Fur-

thermore, we propose a way to compute the ‘LIV residuals’ that can be used to

investigate the normality assumption of the disturbances and to examine het-

eroscedasticity. Outliers and influential observations may present a problem in

estimation because of their large influence on the results, and their presence in

large numbers may point out that the used model failed to capture important

aspects of the data. The available regression diagnostics (e.g. Fox, 1991, Bel-

sley, Kuh, and Welsch, 1980, or Cook and Weisberg, 1982) are not applicable

but can be extended in a straightforward manner (see also Wang et al., 1996).


We address these issues in section 4.5.

Finally, in section 4.6 we present another method to test for regressor-error

dependencies (section 3.4). This method is based on Wald’s test-principle and

can be obtained as a by-product of an IV analysis. In section 4.7 we conclude

this chapter.

4.2 Including exogenous regressors in the LIV model and identifiability

The simple LIV model in (3.1) is extended by including additional exogenous

regressors and instrumental variables as follows

$$
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + x_{2i}'\beta_2 + \varepsilon_i, \\
x_i &= \pi' z_i + z_{2i}'\gamma_2 + \nu_i,
\end{aligned} \tag{4.1}
$$

where $i = 1, \ldots, n$. The $l_1 \times 1$ vector $x_{2i}$ contains the observations on the exogenous regressors, and $z_{2i}$ is the $l_2 \times 1$ vector of observations on the exogenous instruments. The regression parameter $\beta_2$ is an $l_1 \times 1$ vector of unknowns and represents the effect of $x_{2i}$ on $y_i$. Similarly, the unknown vector $\gamma_2$ has dimension $l_2 \times 1$, and denotes the effect of the exogenous regressors $z_{2i}$ on the endogenous variable $x_i$. As before, $\pi$ is an $m \times 1$ vector of category means and $z_i$ is the unobserved discrete instrument with $m$ categories, which have sizes $\lambda_j > 0$, $j = 1, \ldots, m$, where $\sum_{j=1}^{m} \lambda_j = 1$. The errors $(\varepsilon_i, \nu_i)$ are independently and identically distributed according to a bivariate normal distribution with mean zero and variance-covariance matrix (3.2). The vector $z_{2i}$ contains the elements of $x_{2i}$ and, in addition, possible other exogenous regressors that do not have a direct effect on $y_i$, i.e. $z_{2i} = (x_{2i}', x_{3i}')'$ and $l_2 \geq l_1$. As will be shown later, $z_{2i}$ cannot contain a constant term. The variables in $x_{3i}$ can be interpreted as the 'traditional' instrumental variables. We define $X_2 = (x_{21}', \ldots, x_{2n}')'$, $X_3 = (x_{31}', \ldots, x_{3n}')'$, and $Z_2 = [X_2 \; X_3]$.

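It can be seen that the general LIV model is a mixture of normal distribution functions. Conditionally on group $j$, and the sets $x_{2i}$ and $x_{3i}$, the reduced form mean is given by

$$
\begin{aligned}
\mu^{y}_{i|j} &= \beta_0 + \beta_1\pi_j + x_{2i}'\beta_2 + z_{2i}'\gamma_2\beta_1 \\
\mu^{x}_{i|j} &= \pi_j + z_{2i}'\gamma_2,
\end{aligned} \tag{4.2}
$$

and the variance is equal to (3.4). The unconditional mean of $(y_i, x_i)$ is

$$
\mu_{yx} = \begin{pmatrix} \beta_0 + \beta_1 \sum_{j=1}^{m} \lambda_j\pi_j + x_{2i}'\beta_2 + z_{2i}'\gamma_2\beta_1 \\ \sum_{j=1}^{m} \lambda_j\pi_j + z_{2i}'\gamma_2 \end{pmatrix},
$$

and the unconditional variance-covariance matrix is given by¹

$$
\Omega_{yx} = \Omega + \begin{pmatrix} \beta_1\pi' \\ \pi' \end{pmatrix} \mathrm{var}(z_i) \begin{pmatrix} \beta_1\pi' \\ \pi' \end{pmatrix}',
$$

where $\mathrm{var}(z_i) = \mathrm{diag}(\lambda) - \lambda\lambda'$, $\lambda = (\lambda_1, \ldots, \lambda_m)'$ (e.g. Weisstein, 2004a). In the following we apply a similar approach as in subsection 3.3.1 to prove identifiability of all parameters of the LIV model in (4.1). Identifiability is now conditional on a set of observations $S_i = (x_{2i}, x_{3i}) = z_{2i}$, $i = 1, \ldots, n$.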
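The unconditional variance-covariance matrix can be illustrated numerically. The sketch below builds var(z_i) = diag(λ) − λλ′ and adds the between-category term to the reduced-form covariance, assuming the latter has the elements implied by (4.6)-(4.8). The function name and all parameter values are hypothetical. Because the loading matrix has rows β1π′ and π′, the between-category term reduces to the scalar π′var(z_i)π multiplied by the outer product of (β1, 1)′.

```python
def unconditional_cov(beta1, pi, lam, s_ee, s_vv, s_ev):
    """Omega_yx = Omega + (beta1*pi', pi')' var(z) (beta1*pi', pi')."""
    m = len(pi)
    # var(z) = diag(lam) - lam lam' for a one-of-m indicator vector
    var_z = [[(lam[j] if j == k else 0.0) - lam[j] * lam[k] for k in range(m)]
             for j in range(m)]
    # scalar pi' var(z) pi: the variance of the category mean across groups
    between = sum(pi[j] * var_z[j][k] * pi[k] for j in range(m) for k in range(m))
    omega = [[beta1 ** 2 * s_vv + s_ee + 2.0 * beta1 * s_ev, beta1 * s_vv + s_ev],
             [beta1 * s_vv + s_ev, s_vv]]
    load = [beta1, 1.0]
    return [[omega[r][c] + load[r] * load[c] * between for c in range(2)]
            for r in range(2)]

pi = [-0.5, 2.0, 4.5]
lam = [0.85, 0.11, 0.04]
omega_yx = unconditional_cov(beta1=2.0, pi=pi, lam=lam, s_ee=1.0, s_vv=1.0, s_ev=0.3)

# Direct computation of var(x) for comparison: within plus between variance.
mean_pi = sum(l * p for l, p in zip(lam, pi))
var_pi = sum(l * p * p for l, p in zip(lam, pi)) - mean_pi ** 2
print(omega_yx, 1.0 + var_pi)
```

The lower-right element of the result is the total variance of x, i.e. σν² plus the variance of the category means, which is the decomposition used in the footnote.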

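Let $\beta = (\beta_0, \beta_1, \beta_2')'$ and define the set

$$
\begin{aligned}
\mathcal{F}_{\beta,\gamma_2,\Sigma,S_i} = \bigl\{ F^{S_i} \mid\; & F^{S_i} \text{ is a bivariate normal c.d.f. on } \mathbb{R}^2 \text{ of the pair } (y_i, x_i) \\
& \text{with mean and variance } \left( \mu_i(\beta, \pi, \gamma_2),\, \Omega(\beta, \Sigma) \right),\; \pi \in \mathbb{R} \bigr\},
\end{aligned} \tag{4.3}
$$

where $\mu_i(\beta, \pi, \gamma_2) = (\beta_0 + \beta_1\pi + x_{2i}'\beta_2 + z_{2i}'\gamma_2\beta_1,\; \pi + z_{2i}'\gamma_2)'$ and $\Omega(\beta, \Sigma) = \Omega$ as in (3.4). This defines the class of general LIV models with given $\beta$, $\gamma_2$ and $\Sigma$. Let us now focus on the mixture distribution obtained from (4.3). It is defined by the parameters $\pi = (\pi_1, \ldots, \pi_m)'$, where $\pi_j \in \mathbb{R}$ and $\pi_i \neq \pi_j$ for $i \neq j$, and $\lambda = (\lambda_1, \ldots, \lambda_m)'$ with $\lambda_j > 0$, $\sum_j \lambda_j = 1$, where $m = 1, 2, \ldots$. According to the mixture distribution, the outcome $\pi_j$ occurs with probability $\lambda_j$. Let $H_{\beta,\gamma_2,\Sigma,S_i;\pi,\lambda}$ be a mixture from $\mathcal{F}_{\beta,\gamma_2,\Sigma,S_i}$ of order $m$, defined by the parameters $\pi$ and $\lambda$, for $i = 1, \ldots, n$. For the complete dataset we define the class

$$
\begin{aligned}
\mathcal{H}_{\beta,\gamma_2,\Sigma,S} = \Bigl\{ H_{\beta,\gamma_2,\Sigma,S;\pi,\lambda} \mid\; & H_{\beta,\gamma_2,\Sigma,S;\pi,\lambda} = \bigotimes_{i=1}^{n} H_{\beta,\gamma_2,\Sigma,S_i;\pi,\lambda},\; \pi_j \in \mathbb{R}, \\
& \pi_i \neq \pi_j \text{ for } i \neq j;\; \lambda_j > 0,\; \sum_{j=1}^{m} \lambda_j = 1, \text{ for } i, j = 1, \ldots, m,\; m = 1, 2, \ldots \Bigr\},
\end{aligned}
$$

where $\bigotimes$ denotes the independent product of distributions (the observations $(y_i, x_i)$ are independently, but not identically distributed). We consider $m = 1$ or $m > 1$ depending on whether an observed instrumental variable is available. Identifiability of the general LIV model is established by proving that the class $\mathcal{G}_S = \bigcup_{\beta,\gamma_2,\Sigma} \mathcal{H}_{\beta,\gamma_2,\Sigma,S}$ is identifiable. We first prove identifiability of $\mathcal{H}_{\beta,\gamma_2,\Sigma,S}$.

¹ Use the reduced form of (4.1), the relation $\mathrm{var}(y, x) = E[\mathrm{var}(y, x \mid z)] + \mathrm{var}[E(y, x \mid z)]$ and $\mathrm{var}(a'X) = a'\mathrm{var}(X)a$.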

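Proof of identifiability of $\mathcal{H}_{\beta,\gamma_2,\Sigma,S}$. The set $\mathcal{H}_{\beta,\gamma_2,\Sigma,S}$ is identifiable if and only if $H_{\beta,\gamma_2,\Sigma,S;\pi,\lambda} \in \mathcal{H}_{\beta,\gamma_2,\Sigma,S}$ has a unique representation in terms of the mixing proportions $\lambda_j$ and the $\pi_j$'s, and, equivalently, each $H_{\beta,\gamma_2,\Sigma,S_i;\pi,\lambda}$, $i = 1, \ldots, n$, has a unique representation. Hence, we can apply a similar reasoning to prove identifiability of $\mathcal{H}_{\beta,\gamma_2,\Sigma,S}$ as proving the identifiability of $\mathcal{H}_{\beta,\Sigma}$ in subsection 3.3.1.

More specifically, define $\tilde{y}_i = y_i - x_{2i}'\beta_2 - z_{2i}'\gamma_2\beta_1 = \beta_0 + \beta_1\pi_j$ and $\tilde{x}_i = x_i - z_{2i}'\gamma_2 = \pi_j$. Given $\beta$, $\gamma_2$, $\Sigma$, and $S$, the model for $(\tilde{y}_i, \tilde{x}_i)$ is similar to (3.1), and, hence, $H_{\beta,\gamma_2,\Sigma,S_i}$ has a unique representation. Therefore, the set $\mathcal{H}_{\beta,\gamma_2,\Sigma,S}$ is identifiable.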

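Proof of identifiability of $\mathcal{G}_S$. In the following we prove theorem 4.1.

Theorem 4.1 $\mathcal{H}_{\beta,\gamma_2,\Sigma,S}$ is identifiable for all $\beta$, $\gamma_2$, and $\Sigma$ positive semi-definite $\Longleftrightarrow$ $\mathcal{G}_S$ is identifiable.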


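Proof ($\Rightarrow$) Let $F^S, G^S \in \mathcal{G}_S$ such that $F^S \equiv G^S$, with

$$
\begin{aligned}
F^S &= \bigotimes_{i=1}^{n} F^{S_i} \in \mathcal{H}_{\beta,\gamma_2,\Sigma,S} \\
G^S &= \bigotimes_{i=1}^{n} G^{S_i} \in \mathcal{H}_{\delta,\zeta_2,\Psi,S},
\end{aligned}
$$

where $F^{S_i} = \sum_{j=1}^{m} a_j F^{S_i}_j$ and $G^{S_i} = \sum_{l=1}^{k} b_l G^{S_i}_l$, $a_1, \ldots, a_m$ and $b_1, \ldots, b_k$ are the mixing proportions, the distributions $F^{S_i}_1, \ldots, F^{S_i}_m \in \mathcal{F}_{\beta,\gamma_2,\Sigma,S_i}$ are all different, and $G^{S_i}_1, \ldots, G^{S_i}_k \in \mathcal{F}_{\delta,\zeta_2,\Psi,S_i}$ are all different, where

$$
\begin{aligned}
F^{S_i}_j &\text{ is the c.d.f. of } N_2\left( \mu_i(\beta, \gamma_2, \pi_j),\, \Omega(\beta_1, \Sigma) \right) \\
G^{S_i}_l &\text{ is the c.d.f. of } N_2\left( \mu_i(\delta, \zeta_2, \tau_l),\, \Omega(\delta_1, \Psi) \right).
\end{aligned}
$$

By definition 3.1, identifiability of $\mathcal{G}_S$ follows if it is proven that $F^S$ can be written uniquely in terms of the parameters $m$, $\beta$, $\gamma_2$, $\Sigma$, $\pi_j$, and $a_j$, $j = 1, \ldots, m$ (modulo permutation). $F^S \equiv G^S$ implies that $F^{S_i} = G^{S_i}$ for $i = 1, \ldots, n$, since $F^S \equiv G^S$ if and only if the marginals for each $i = 1, \ldots, n$ are identical. Both $\mathcal{H}_{\beta,\gamma_2,\Sigma,S}$ and $\mathcal{H}_{\delta,\zeta_2,\Psi,S}$ constitute identified sets (by assumption), and, hence, $\sum_{j=1}^{m} a_j F^{S_i}_j$ and $\sum_{l=1}^{k} b_l G^{S_i}_l$ both have unique representations in terms of $\pi$ and $\tau$, the mixing proportions $a$ and $b$, and the number of components $m$ and $k$. So, given $\beta$, $\gamma_2$, $\Sigma$ ($\delta$, $\zeta_2$, $\Psi$) and $S_i$, there are no two sets of parameters $\pi$, $a$, and $m$ ($\tau$, $b$, and $k$) that lead to the same distribution function $F^{S_i}$ ($G^{S_i}$), which is just a mixture of bivariate normal distributions. Using this and identifiability of bivariate normal mixtures (e.g. appendix 3A), $F^S \equiv G^S$ implies that $m = k$, $a_j = b_j$ and $F^{S_i}_j = G^{S_i}_j$, modulo permutation. Subsequently, $F^{S_i}_j = G^{S_i}_j$ implies that $\mu_i(\beta, \gamma_2, \pi_j) = \mu_i(\delta, \zeta_2, \tau_j)$ and $\Omega(\beta_1, \Sigma) = \Omega(\delta_1, \Psi)$. Combining for all $i = 1, \ldots, n$, and writing $\gamma_2 = (\gamma_{21}', \gamma_{22}')'$ and $\zeta_2 = (\zeta_{21}', \zeta_{22}')'$ we have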

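\begin{align}
(\beta_0 + \beta_1\pi_j)\iota_n + [X_2\ X_3]\begin{pmatrix}\beta_2 + \gamma_{21}\beta_1\\ \beta_1\gamma_{22}\end{pmatrix} &= (\delta_0 + \delta_1\tau_j)\iota_n + [X_2\ X_3]\begin{pmatrix}\delta_2 + \zeta_{21}\delta_1\\ \delta_1\zeta_{22}\end{pmatrix} \tag{4.4}\\
\pi_j\iota_n + Z_2\gamma_2 &= \tau_j\iota_n + Z_2\zeta_2 \tag{4.5}\\
\beta_1^2\sigma_\nu^2 + \sigma_\varepsilon^2 + 2\beta_1\sigma_{\varepsilon\nu} &= \delta_1^2\psi_\nu^2 + \psi_\varepsilon^2 + 2\delta_1\psi_{\varepsilon\nu} \tag{4.6}\\
\beta_1\sigma_\nu^2 + \sigma_{\varepsilon\nu} &= \delta_1\psi_\nu^2 + \psi_{\varepsilon\nu} \tag{4.7}\\
\sigma_\nu^2 &= \psi_\nu^2 \tag{4.8}
\end{align}

for $j = 1, \ldots, m$. We need to prove that $\beta_s = \delta_s$, $s = 0, 1, 2$, $\gamma_2 = \zeta_2$, $\sigma_\varepsilon^2 = \psi_\varepsilon^2$, $\sigma_{\varepsilon\nu} = \psi_{\varepsilon\nu}$, and $\sigma_\nu^2 = \psi_\nu^2$.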

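From (4.5) we have that $(\pi_j - \tau_j)\iota_n = Z_2(\zeta_2 - \gamma_2)$, for $j = 1, \ldots, m$. Suppose $\pi_j - \tau_j = c \neq 0$, $\forall j$, then $c\,\iota_n = Z_2(\zeta_2 - \gamma_2)$, or

$$[\iota_n\ Z_2]\begin{pmatrix} c\\ \gamma_2 - \zeta_2 \end{pmatrix} = 0,$$

where $Z_2 = [X_2\ X_3]$. If $[\iota_n\ Z_2]$ has full column rank, then $c = 0$, and it follows that $\pi_j = \tau_j$, and $\zeta_2 = \gamma_2$.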

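Using this result with (4.4), we obtain

\begin{align*}
&(\beta_0 + \beta_1\pi_j)\iota_n + [X_2\ X_3]\begin{pmatrix}\beta_2 + \gamma_{21}\beta_1\\ \beta_1\gamma_{22}\end{pmatrix} = (\delta_0 + \delta_1\pi_j)\iota_n + [X_2\ X_3]\begin{pmatrix}\delta_2 + \gamma_{21}\delta_1\\ \delta_1\gamma_{22}\end{pmatrix}\\
&\iff [\iota_n\ X_2\ X_3]\begin{pmatrix}(\delta_0 - \beta_0) + \pi_j(\delta_1 - \beta_1)\\ \delta_2 - \beta_2 + (\delta_1 - \beta_1)\gamma_{21}\\ (\delta_1 - \beta_1)\gamma_{22}\end{pmatrix} = 0.
\end{align*}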

Again, if $[\iota_n\ Z_2]$ has full column rank, then

\begin{align}
(\delta_0 - \beta_0) + \pi_j(\delta_1 - \beta_1) &= 0 \tag{4.9}\\
(\delta_2 - \beta_2) + (\delta_1 - \beta_1)\gamma_{21} &= 0 \tag{4.10}\\
(\delta_1 - \beta_1)\gamma_{22} &= 0, \tag{4.11}
\end{align}

for $j = 1, \ldots, m$. By definition all $\pi_j$'s are different. If $m > 1$, then (4.9) yields $\beta_1 = \delta_1$ and $\beta_0 = \delta_0$. Subsequently, from (4.10) it follows that $\delta_2 = \beta_2$. Regardless of the value of $\gamma_{22}$, (4.11) is satisfied. If $m = 1$, it can be seen that $\gamma_{22} \neq 0$ is needed to establish identifiability (this situation is identical to classical IV and it means that instruments $x_{3i}$ that explain part of the variance in $x_i$ need to be available).

Since $\beta_1 = \delta_1$, equality of the variances and covariances follows from (4.6)-(4.8). We conclude that any $F^S \in \mathcal{G}^S$ has a unique representation and hence $\mathcal{G}^S$ constitutes an identified set. The reverse of the proof ($\Leftarrow$) follows immediately (i.e. a subset of an identified set must be identified as well).
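The step from $\beta_1 = \delta_1$ to equality of the variance components can be spelled out as follows. This is only a rearrangement of (4.6)-(4.8), working from the last equation upwards:

```latex
\begin{align*}
\text{(4.8):}\quad & \sigma_\nu^2 = \psi_\nu^2,\\
\text{(4.7) with } \beta_1 = \delta_1:\quad
 & \sigma_{\varepsilon\nu} = \delta_1\psi_\nu^2 + \psi_{\varepsilon\nu} - \beta_1\sigma_\nu^2 = \psi_{\varepsilon\nu},\\
\text{(4.6) with the two results above:}\quad
 & \sigma_\varepsilon^2 = \delta_1^2\psi_\nu^2 + \psi_\varepsilon^2 + 2\delta_1\psi_{\varepsilon\nu}
   - \beta_1^2\sigma_\nu^2 - 2\beta_1\sigma_{\varepsilon\nu} = \psi_\varepsilon^2.
\end{align*}
```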

From theorem 4.1 it can be concluded that all parameters of the general LIV

model given in (4.1) are identifiable, assuming that the errors have a bivariate

normal distribution. The following remarks are in place:

1. When $x_{2i} \subset z_{2i}$ and $\gamma_{22} \neq 0$ (i.e. there is a valid set of instruments $x_{3i}$ available), $m$ may be equal to 1. In this case, the LIML estimate is identical to the classical LIML estimate. When $x_{2i} = z_{2i}$ (i.e. there is no valid set of instruments available), $m > 1$ is required for identifiability.

2. The model parameters are also identified when the elements of $\gamma_2$ corresponding to $x_2$ are zero. This implies that the regressors $x_2$ are independent of the endogenous regressor $x$, i.e. there is no multicollinearity between $x$ and $x_2$. In practice this is unlikely to be the case, and in order to avoid biased estimates from imposing false restrictions it is not advisable to restrict $\gamma_2$ (partly) to zero (see also Wooldridge (2002), p. 91).


3. From the above proof it follows that $[\iota_n\ X_2\ X_3]$ should have full column rank. This implies that $n$ has to be larger than $m$.

In appendix 4A we discuss how the first and second order derivatives presented

in subsection 3.3.2 can be extended to the more general model. The general

LIV model is estimated by maximizing the log-likelihood obtained from (4.1).

Next we argue that this model can be used to investigate the validity of available observed instrumental variables.

4.3 Investigating observed instrumental variables

In chapter 2 we discussed classical instrumental variables estimation and the problems associated with using potentially weak and endogenous instruments. In a classical framework, instrument strength and instrument exogeneity cannot be tested for straightforwardly. For instance, when an instrument is weak it does not explain any (or at most a small amount) of the variance of the endogenous regressor. In this case, classical asymptotic theory gives inferior approximations to the finite sample distribution and standard test procedures are not applicable (see also chapter 2). If an observed instrument is not truly exogenous, the IV estimator (2SLS or LIML) is biased (see e.g. Bound, Jaeger, and Baker, 1995). In this case, the instrument has a direct effect on the dependent variable, in addition to the usual indirect effect through the endogenous regressor. Here we propose a simple approach based on the Wald test principle to investigate whether weak instruments are present (subsection 4.3.1). Furthermore, we propose a test for the presence of endogenous instruments (subsection 4.3.2).

4.3.1 Testing for weak instruments

In section 4.2 we proved that the general LIV model is identifiable whenever $m > 1$, and, hence, also when the observed instrumental variables have no effect on the endogenous regressor (i.e. the observed instrumental variables are weak). A Wald test can be used to test whether $\gamma_{22} = 0$ or not. In general, the Wald test statistic to test for the validity of $r$ linear restrictions, i.e. $H_0 : R\theta = q$, is given by $W = (R\hat{\theta} - q)'[R\,\widehat{\mathrm{var}}(\hat{\theta})\,R']^{-1}(R\hat{\theta} - q)$, and has approximately a $\chi^2$-distribution with $r$ degrees of freedom under $H_0$ (Greene, 2000). In our case, $\theta$ contains the elements of $\gamma_{22}$, $R = I_{l_2 - l_1}$, $q = 0$, and $\widehat{\mathrm{var}}(\hat{\theta})$ can be estimated using the methods in subsection 3.3.2.
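As an illustration, the Wald statistic above can be computed with a few lines of standard-library Python. This is a minimal sketch, not the implementation used for the reported results: the function names `solve`, `chi2_sf` and `wald_test` are hypothetical, and the estimate and covariance matrix passed in are assumed to come from a fitted model.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def chi2_sf(w, df):
    """P(chi-square with integer df > w), built up from the df = 1 and df = 2 cases."""
    z = w / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))   # df = 1
    else:
        a, q = 1.0, math.exp(-z)              # df = 2
    while 2 * a < df:                          # Q(a+1, z) = Q(a, z) + z^a e^{-z} / Gamma(a+1)
        if z > 0:
            q += math.exp(a * math.log(z) - z - math.lgamma(a + 1))
        a += 1.0
    return q


def wald_test(theta, cov, R, q):
    """Return (W, p-value) for H0: R theta = q, given an estimate and its covariance matrix."""
    r, k = len(R), len(theta)
    d = [sum(R[i][j] * theta[j] for j in range(k)) - q[i] for i in range(r)]
    rv = [[sum(R[i][l] * cov[l][j] for l in range(k)) for j in range(k)] for i in range(r)]
    rvr = [[sum(rv[i][l] * R[j][l] for l in range(k)) for j in range(r)] for i in range(r)]
    w = sum(di * si for di, si in zip(d, solve(rvr, d)))
    return w, chi2_sf(w, r)
```

With a single instrument, as in the simulation study of section 4.4, `R` is the 1 x 1 identity and the statistic reduces to the squared estimate divided by its estimated variance, e.g. `wald_test([0.12], [[0.0016]], [[1.0]], [0.0])`.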

From the simulation studies presented in the following it will become clear that this test should be used with a conservative significance level $\alpha$ in order to be effective. Furthermore, it is advisable to accompany this test with more traditional methods like the $R^2$ and the $F$-statistic of the regression of the endogenous regressor on the set of instrumental variables (see also subsection 4.5.2 and chapter 2).

4.3.2 Testing for endogenous instruments

An endogenous instrument is correlated with the error term of the regression equation, in which case it has a direct effect on the dependent variable. That is, the error term also contains an effect (say) $e$ of the observed instruments, in addition to the usual unobserved effects. The 'total' error is, in this case, given by $u = e(x_3) + \varepsilon$. We propose the following procedure to investigate possible instrument endogeneity.

Instruments that are suspected to be correlated with the error term should be included in the main regression equation. Subsequently, a Wald test can be used to investigate whether or not they have a non-zero effect on the dependent variable. To be more specific, suppose that $\tilde{l}_2 \leq l_2 - l_1$ instrumental variables in (4.1) are possibly endogenous. Including these variables in the main regression equation yields

\begin{align}
y_i &= \beta_0 + \beta_1 x_i + x_{2i}'\beta_2 + \tilde{x}_{3i}'\beta_3 + \varepsilon_i \notag\\
x_i &= \pi' z_i + z_{2i}'\gamma_2 + \nu_i, \tag{4.12}
\end{align}

where all variables are defined as before, and $\tilde{x}_{3i}$ is an $\tilde{l}_2 \times 1$ vector containing the elements of $x_{3i}$ that are possibly endogenous. The null hypothesis of instrument exogeneity, given by $H_0 : \beta_3 = 0$, can be tested using the Wald approach discussed in the previous subsection.


4.4 A simulation study

In this section we present the results of a simulation study to demonstrate that the general LIV model in (4.1) can be used successfully to estimate the regression parameters and the error variance in the presence of an endogenous regressor. Furthermore, we show that the proposed Wald tests have reasonable power to detect invalid instruments.

Data were generated using model (4.12) with $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 0.25$. Furthermore, we assume that $x_2$ also has a moderate effect on $x$ by taking $\gamma_{2x_2} = -0.25$. Throughout the simulations, we took a bimodal distribution with two categories for the latent instrument. To investigate the performance of the proposed tests we assumed that one potentially weak and/or endogenous instrument is available. Its effect $\gamma_{2z_2}$ on the endogenous regressor is specified as 0, 0.1, 0.2, 0.3, 0.4, and 0.5, where $\gamma_{2z_2} = 0$ represents that the instrument has no effect on $x$, and 0.5 indicates that it has a relatively strong effect. The value for $\sigma_\nu^2$ is chosen such that $\mathrm{var}(x) = 3$ in all cases. The correlation between $x_{3i}$ and the total error term $u_i = \beta_3 x_{3i} + \varepsilon_i$ is controlled via $\beta_3$, and its values are chosen such that the correlation coefficient is equal to 0 (i.e. the instrument is exogenous), 0.05, 0.10, 0.15, and 0.20. The total variance of the error $u_i$ is fixed to 1 by adjusting the value for $\sigma_\varepsilon^2$. The covariance $\sigma_{\varepsilon\nu}$ is adjusted such that the correlation between the endogenous regressor $x_i$ and $u_i$ is equal to 0.5 in all cases. Hence, across the $5 \times 6 = 30$ settings, the bias in OLS will, on average, be the same, and provides a benchmark with which the results can be compared. We took $n = 1000$ and a total of 250 simulated datasets. In the following we first present the OLS, IV, and LIV results for the regression coefficients $\beta_1$ and $\beta_2$. Subsequently, we discuss the power of the proposed Wald tests to investigate instrument validity.
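The data-generating process described above can be sketched in standard-library Python as follows. Several details are not given in the text and are assumptions made here purely for illustration: $x_2$ and $x_3$ are independent standard normal, the latent instrument takes the values $-1$ and $+1$ with equal probability, and the function names are hypothetical. The sketch only produces the OLS and 2SLS benchmarks; it does not estimate the LIV model.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(cols, y):
    """Least-squares coefficients of y on the regressor columns in cols."""
    k = len(cols)
    xtx = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)] for i in range(k)]
    xty = [sum(u * v for u, v in zip(cols[i], y)) for i in range(k)]
    return solve(xtx, xty)


def simulate(n, gamma_z, rho_zu, rng):
    """Draw one dataset (y, x, x2, x3) for instrument strength gamma_z and corr(x3, u) = rho_zu."""
    beta0, beta1, beta2, gamma_x2 = 1.0, 2.0, 0.25, -0.25  # values from the text
    beta3 = rho_zu                       # assumption: var(x3) = 1, and var(u) = 1
    s2_eps = 1.0 - beta3 ** 2            # keeps var(u) = 1
    s2_nu = 3.0 - 1.0 - gamma_x2 ** 2 - gamma_z ** 2       # keeps var(x) = 3
    s_en = 0.5 * math.sqrt(3.0) - gamma_z * beta3          # keeps corr(x, u) = 0.5
    y, x, x2, x3 = [], [], [], []
    for _ in range(n):
        latent = rng.choice((-1.0, 1.0))  # assumption: two-point latent instrument, variance 1
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        eps = math.sqrt(s2_eps) * e1
        nu = s_en / math.sqrt(s2_eps) * e1 + math.sqrt(s2_nu - s_en ** 2 / s2_eps) * e2
        xi = latent + gamma_x2 * a + gamma_z * b + nu
        x.append(xi)
        x2.append(a)
        x3.append(b)
        y.append(beta0 + beta1 * xi + beta2 * a + beta3 * b + eps)
    return y, x, x2, x3


def beta1_ols_2sls(y, x, x2, x3):
    """Estimates of beta1 by OLS (x3 omitted) and by 2SLS with x3 as the only instrument."""
    one = [1.0] * len(y)
    b_ols = ols([one, x, x2], y)[1]
    g = ols([one, x2, x3], x)                       # first stage
    xhat = [g[0] + g[1] * a + g[2] * b for a, b in zip(x2, x3)]
    b_2sls = ols([one, xhat, x2], y)[1]             # second stage
    return b_ols, b_2sls
```

A single replication for one of the 30 settings would be `beta1_ols_2sls(*simulate(1000, 0.5, 0.1, random.Random(0)))`. Under the assumptions stated above, the large-sample OLS bias is $0.5\sqrt{3}/(3 - 0.25^2) \approx 0.295$ in every setting, and the large-sample 2SLS bias is $\beta_3/\gamma_{2z_2}$ whenever $\gamma_{2z_2} \neq 0$.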

4.4.1 Results for the regression parameters

In table 4.1 we present the means and standard deviations of the biases in the estimated values for $\beta_1$ by LIV, OLS, and 2SLS. We estimated the LIV model by including the observed instrumental variable in the equation for $y$ and the equation for $x$. The 2SLS results are obtained using only the observed instrument.

Table 4.1: Means and standard deviations (in parentheses) of the bias in the estimates for $\beta_1$.

                                        $\gamma_{2z_2}$
       $\rho_{zu}$        0        0.1        0.2        0.3        0.4        0.5

LIV    0             -0.003     -0.003     -0.002      0.000     -0.001      0.002
                     (0.033)    (0.034)    (0.033)    (0.032)    (0.029)    (0.027)
       0.05          -0.003      0.003      0.000      0.000     -0.001      0.000
                     (0.031)    (0.033)    (0.034)    (0.034)    (0.030)    (0.031)
       0.1           -0.001      0.000      0.002     -0.004      0.002      0.001
                     (0.033)    (0.035)    (0.030)    (0.030)    (0.030)    (0.027)
       0.15          -0.001     -0.004     -0.001     -0.003      0.001      0.003
                     (0.030)    (0.033)    (0.033)    (0.029)    (0.032)    (0.030)
       0.2            0.000      0.001     -0.002     -0.001     -0.001     -0.005
                     (0.031)    (0.031)    (0.033)    (0.030)    (0.029)    (0.028)

OLS    0              0.295      0.295      0.294      0.294      0.296      0.294
                     (0.014)    (0.015)    (0.015)    (0.014)    (0.014)    (0.014)
       0.05           0.295      0.296      0.296      0.294      0.294      0.295
                     (0.015)    (0.015)    (0.014)    (0.014)    (0.015)    (0.015)
       0.1            0.295      0.293      0.295      0.294      0.295      0.294
                     (0.015)    (0.016)    (0.015)    (0.014)    (0.014)    (0.014)
       0.15           0.294      0.293      0.293      0.294      0.295      0.296
                     (0.014)    (0.015)    (0.015)    (0.013)    (0.015)    (0.015)
       0.2            0.294      0.294      0.295      0.293      0.295      0.294
                     (0.015)    (0.015)    (0.014)    (0.014)    (0.015)    (0.015)

IV     0              0.242     -0.007     -0.027     -0.011     -0.003     -0.004
                     (0.833)    (0.455)    (0.182)    (0.121)    (0.082)    (0.070)
       0.05           0.058      0.498      0.253      0.158      0.131      0.100
                     (2.444)    (0.479)    (0.162)    (0.097)    (0.076)    (0.060)
       0.1           -0.216      0.966      0.520      0.337      0.245      0.199
                     (6.241)    (0.610)    (0.152)    (0.099)    (0.066)    (0.053)
       0.15          -0.800      1.474      0.814      0.503      0.374      0.305
                     (8.279)    (0.907)    (0.237)    (0.105)    (0.070)    (0.056)
       0.2            1.821      1.848      1.058      0.687      0.500      0.398
                    (10.794)    (1.616)    (0.294)    (0.130)    (0.080)    (0.058)

It can be seen that, in all cases, the average bias of the LIV estimate for $\beta_1$ is approximately zero, whereas the bias in OLS is approximately 0.29 (which is a 15% upward bias). Hence, the general LIV model gives approximately unbiased results in the presence of regressor-error dependencies, regardless of the quality (and the availability) of the observed instrument. The results for LIV are slightly more efficient when the observed instrument is stronger (i.e. when $\gamma_{2z_2}$ is larger) and when the observed instrument has a larger direct effect on $y$ (i.e. when $\rho_{zu}$ is larger). We note that the LIV results for $\beta_1$ will be biased if the observed instrument $x_3$ is wrongfully omitted from the main equation.

From the 2SLS results it can be seen that when the instrument is exogenous ($\rho_{zu} = 0$) and not too weak (e.g. $\gamma_{2z_2} > 0.2$), the 2SLS estimate for $\beta_1$ is approximately unbiased, where the standard deviations are smaller when the instrument used is stronger. In all other cases, the 2SLS method yields biased results and the bias is larger when the correlation between $z$ and $u$ is higher, and when the instrument used is weaker$^2$. As can be seen, in many cases the results for 2SLS are more biased than OLS, an observation that was also made by Bound, Jaeger, and Baker (1995).

The biases found in the estimates for $\beta_2$ show a similar pattern to the results for $\beta_1$, and these can be found in appendix 4B. The simulation results presented in this subsection illustrate the problems associated with classical IV estimation in the presence of weak or endogenous instruments, and indicate that its results cannot be relied upon without a proper investigation of the validity of the instruments used. In all cases, however, the LIV model gives approximately unbiased results. In the next subsections we present the results of investigating the validity of the available observed instrumental variable using the Wald test approach proposed in section 4.3.

$^2$ In fact, the first two columns for the 2SLS results, which correspond to a situation with a weak instrument, are the median bias and the IQR across the simulation results, since the mean and standard deviation in these situations gave extreme results because of the presence of many large outliers.

4.4.2 Results test $H_0$: observed IV is exogenous

Table 4.2: Power of test for instrument exogeneity ($H_0 : \rho_{zu} = 0$).

                                      $\gamma_{2z_2}$
                  $\rho_{zu}$      0      0.1     0.2     0.3     0.4     0.5

$\alpha = 0.5$    0             0.44    0.49    0.48    0.52    0.50    0.56
                  0.05          0.76    0.87    0.81    0.81    0.82    0.76
                  0.1           1.00    1.00    1.00    1.00    0.99    0.99
                  0.15          1.00    1.00    1.00    1.00    1.00    1.00
                  0.2           1.00    1.00    1.00    1.00    1.00    1.00

$\alpha = 0.05$   0             0.05    0.04    0.03    0.06    0.06    0.07
                  0.05          0.32    0.35    0.34    0.27    0.34    0.30
                  0.1           0.84    0.88    0.90    0.86    0.82    0.84
                  0.15          0.99    1.00    1.00    0.99    1.00    0.99
                  0.2           1.00    1.00    1.00    1.00    1.00    1.00

$\alpha = 0.01$   0             0.02    0.01    0.01    0.02    0.02    0.01
                  0.05          0.12    0.17    0.18    0.13    0.15    0.14
                  0.1           0.71    0.65    0.73    0.69    0.59    0.63
                  0.15          0.98    0.99    0.99    0.97    0.98    0.96
                  0.2           1.00    1.00    1.00    1.00    1.00    1.00

Table 4.2 gives the fractions of rejections of the null hypothesis of instrument exogeneity, as computed by the proposed Wald test$^3$. It can be seen that, under $H_0 : \rho_{zu} = 0$, the size of the test is fairly close to the true sizes $\alpha = 0.50$, 0.05, and 0.01. For $\alpha = 0.5$ and 0.05 the rejection rate falls slightly below the true size when the instrument is weak ($\gamma_{2z_2} < 0.3$) and tends to exceed it slightly when the instrument is stronger ($\gamma_{2z_2} \geq 0.3$). If the null hypothesis is not rejected, the LIV model can be re-estimated with the observed instrumental variable excluded from the main equation, but this is not necessary. The probability of detecting an endogenous instrument increases when $\rho_{zu}$ gets larger, and the power to detect $\rho_{zu} > 0$ is not very much affected by the strength of the instrument, although it is slightly lower for an instrument that has no effect on $x$, and for the strongest instrument. When the observed instrument in the LIV model is relatively strong compared to the unobserved discrete instrument, the group structure gets more contaminated. This may have a negative effect on the precision of the estimates and, therefore, may reduce the power of the test. Nevertheless, in all cases the proposed test has reasonable power to detect instrument endogeneity.

$^3$ Based on the Hessian matrix.

The bias in the 2SLS estimates that arises from using endogenous instruments,

illustrates the importance of examining instrument exogeneity. However, the

2SLS model is exactly identified in our case, and it is not possible to test for

instrument exogeneity within the classical IV framework. As was shown, it is

straightforward to use the LIV model for this purpose, which is an important

extension of LIV. In the next subsection we present the results for testing for

the presence of an instrument that does not explain any of the variance of the

endogenous regressor (the instrument is weak).

4.4.3 Results test $H_0$: observed IV has no effect on $x$

In table 4.3 we present the results for testing$^4$ for the presence of an instrument that has no effect on the endogenous regressor $x$. We find for an exogenous instrument (i.e. $\rho_{zu} = 0$) that has no effect on $x$ (i.e. $\gamma_{2z_2} = 0$), that the fractions of rejections of the proposed test are close to the true sizes $\alpha = 0.5$, 0.05, and 0.01. When the instrument is endogenous (i.e. $\rho_{zu} > 0$), the test still performs well under $H_0$, but gives results that are sometimes larger and sometimes smaller than the true sizes of the test. Furthermore, the test has large power in detecting departures from $H_0 : \gamma_{2z_2} = 0$, regardless of whether the instrument used is exogenous or not.

The advantage of the LIV approach in examining the weakness of observed instrumental variables is that identifiability of the LIV model does not depend on the strength of the instruments. A test to investigate whether the instruments explain part of the variance of $x$ is not readily available in the classical IV framework, due to lack of identifiability. Instead, instrument weakness is usually examined via the $R^2$ of the first stage regression of $x$ on the set of instrumental variables. We note, however, that the $R^2$ also reflects the effect of the exogenous regressors $x_2$ on $x$, since $x_2$ is typically included in the set of instruments. Hence, the researcher may conclude that the instruments are strong enough, while, in fact, most of the contribution to the $R^2$ is from the exogenous regressors and not from the instruments.

Table 4.3: Power of test zero-effect instrument ($H_0 : \gamma_{2z_2} = 0$).

                                      $\gamma_{2z_2}$
                  $\rho_{zu}$      0      0.1     0.2     0.3     0.4     0.5

$\alpha = 0.5$    0             0.52    0.94    1.00    1.00    1.00    1.00
                  0.05          0.52    0.92    1.00    1.00    1.00    1.00
                  0.1           0.46    0.94    1.00    1.00    1.00    1.00
                  0.15          0.56    0.93    1.00    1.00    1.00    1.00
                  0.2           0.52    0.94    1.00    1.00    1.00    1.00

$\alpha = 0.05$   0             0.05    0.60    1.00    1.00    1.00    1.00
                  0.05          0.08    0.53    0.99    1.00    1.00    1.00
                  0.1           0.04    0.57    1.00    1.00    1.00    1.00
                  0.15          0.05    0.62    0.98    1.00    1.00    1.00
                  0.2           0.07    0.59    0.99    1.00    1.00    1.00

$\alpha = 0.01$   0             0.01    0.35    0.97    1.00    1.00    1.00
                  0.05          0.02    0.33    0.97    1.00    1.00    1.00
                  0.1           0.01    0.32    0.97    1.00    1.00    1.00
                  0.15          0.00    0.38    0.96    1.00    1.00    1.00
                  0.2           0.00    0.34    0.96    1.00    1.00    1.00

$^4$ Based on the Hessian matrix.

It can be seen from the simulation studies that the proposed test has fairly strong power in detecting departures from $H_0 : \gamma_{2z_2} = 0$. This is, however, not necessarily an advantage in this case, since a rejection of the null hypothesis suggests that the instrument used has a nonzero effect on $x$, and the researcher may be tempted to use it in a classical IV regression. However, it was seen from table 4.1 that the 2SLS results for $\gamma_{2z_2} = 0.1$ and 0.2 still suffer from a possible bias and, importantly, large standard deviations. Although the instrument has a nonzero effect, it is considerably weak and results from using such an instrument should be interpreted with extreme caution. Hence, in applying this test, we are more willing to tolerate a false acceptance of $H_0$ than a false rejection. We therefore recommend using this test only with conservative significance levels (i.e. $\alpha \leq 0.01$), in particular for large sample sizes, and in combination with traditional measures of instrument weakness.

4.4.4 Concluding remarks simulation study

The simulation study illustrates the potential usefulness of the LIV model in estimating the regression parameters of a linear model in the presence of an endogenous regressor and several exogenous regressors. Furthermore, we introduced two new tests to examine instrument validity. We presented simulation results for a situation where the OLS estimates for the regression coefficients are biased and IV or LIV estimation is desirable.

We showed that the proposed tests have reasonable power in detecting 'invalid' instruments for a wide variety of settings. However, the results of the test to investigate whether the instrument has a zero effect on the endogenous regressor suggest choosing a more conservative value for $\alpha$. Nevertheless, the simulation study illustrates that the LIV model can be used successfully to examine the validity of a set of observed instrumental variables. Importantly, the results clearly point out the problems with classical IV estimation when the instrument used is endogenous or weak, and in particular when it is both weak and endogenous. The bias that arises in the 2SLS estimates from using invalid instruments can be larger than the bias in OLS. Hence, in the absence of good quality instrumental variables, 2SLS may actually do more harm than good, in which case it is better to simply ignore the endogeneity of $x$ and use the biased OLS estimates (see also the remarks of Shugan, 2004). The simulation study presented here, however, illustrates that the LIV model can be used successfully to estimate the regression parameters, because it does not require the availability of observed instruments.

We also examined the performance of the LIV model and the two tests in the absence of regressor-error dependencies, in which case OLS is the best linear unbiased estimator. As in the previous chapter, we found that the LIV results exhibit more variability across the simulation results than in a situation where $x$ is endogenous. The performance of the tests to examine instrument validity does not differ substantially from the results reported in tables 4.2 and 4.3, but the power to detect a non-zero effect of the instrument on $x$ is slightly lower when $\gamma_{2z_2} = 0.1$ than the results reported in table 4.3. Importantly, we find that the 2SLS estimates for the regression parameters are biased and have large standard deviations when invalid instruments are used, even though no regressor-error dependencies are present and OLS yields unbiased results. Obviously, it is important to test for regressor-error dependencies to decide whether OLS can be used, but it becomes clear from these results that a test based on classical IV estimates with invalid instruments is likely to lead to false conclusions. In section 4.6 we discuss testing for regressor-error dependencies using the model in (4.1) without relying on observed instruments. In the following we present several diagnostic tools that can be used to complete a LIV analysis when using empirical data.

4.5 LIV model diagnostics

4.5.1 Selection of the number of categories of the discrete instrument

In empirical studies, it has to be decided how many categories of the discrete instruments are needed, i.e. how large $m$ should be. From classical IV estimation it is known that the number of instruments should not be too large, since the finite sample bias is a function of the number of instruments (see also chapter 2). Besides, a large set of instruments reduces the degrees of freedom and the first-stage regression ($X$ on $Z$) can overfit the data (Buse, 1992, Bound, Jaeger and Baker, 1995, Bowden and Turkington, 1984). Standard model selection methods, like AIC, CAIC, or BIC, are often found to overestimate the number of groups. Furthermore, they are ad hoc and their performance may depend on the usage context (Naik, Shi and Tsai, 2003, Andrews and Currim, 2003, or Biernacki, Celeux and Govaert, 2000). Naik, Shi and Tsai (2003) argue that information criteria like AIC are designed for selecting regressors, but not groups. Indirect support is provided by a recent study of Andrews and Currim (2003) who note that AIC3 performs well in finding groups in finite mixture regression models. The integrated classification likelihood (ICL) criterion (Biernacki, Celeux and Govaert, 2000) has also been shown to be suitable for selecting the number of components in mixture models. Since our aim is to select the number of categories for the discrete instruments, i.e. the number of groups representing the endogenous regressor best, and given the importance of not overestimating the number of components, we prefer the ICL criterion, which is more conservative than the other statistics.

The ICL criterion is a modification of BIC. Biernacki, Celeux and Govaert (2000) find that BIC often overestimates the true number of clusters in mixture models. They suggest choosing the model that maximizes the complete integrated maximum likelihood and show that the resulting ICL criterion is essentially the BIC statistic penalized by the subtraction of the mean entropy $-2\sum_i \sum_j z_{ij} \log p_{ij}$, where $p_{ij}$ are the posterior probabilities that observation $i$ comes from category $j$, and $z_{ij} = 1$ whenever $p_{ij} = \max_j p_{ij}$, and zero otherwise. It follows that if the categories are not well separated, this term has a large value and BIC is penalized more severely. If the groups found by the LIV model are not well separated, this resembles a situation in classical IV where the instruments are weak. Furthermore, overfitting in terms of the number of groups results in using too large a number of instruments, which is not preferred, since degrees of freedom are lost, which reduces efficiency. Further support for this strategy is provided in chapter 3 where it is shown that the LIV results are relatively insensitive to an under-specification of the true number of instruments.
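The entropy penalty described above is simple to compute once the posterior probabilities are available. The sketch below assumes a "smaller is better" convention, BIC $= -2\log L + k\log n$, so that the non-negative term $-2\sum_i\sum_j z_{ij}\log p_{ij}$ is added to BIC; sign conventions differ between sources, and the function name `icl` and its arguments are hypothetical.

```python
import math


def icl(loglik, n_params, posteriors):
    """ICL on the -2 log-likelihood scale (smaller is better), as assumed in this sketch.

    loglik     : maximized log-likelihood of the fitted mixture model
    n_params   : number of free parameters k
    posteriors : one row per observation with the posterior probabilities p_ij
    """
    n = len(posteriors)
    bic = -2.0 * loglik + n_params * math.log(n)
    # z_ij = 1 only for the category with the largest posterior probability,
    # so each observation contributes -2 * log(max_j p_ij) >= 0.
    entropy_term = -2.0 * sum(math.log(max(p)) for p in posteriors)
    return bic + entropy_term
```

For perfectly separated groups every row has a posterior of 1 for one category and the criterion equals BIC; for a two-category model with posteriors of 0.5 each, every observation adds $-2\log 0.5 \approx 1.39$ to the criterion, which is the sense in which fuzzy groupings are penalized more severely.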

Although the ICL criterion can be used in determining the number of instruments, we emphasize that several choices of $m$ should be examined. When the estimated regression coefficients are substantially different for these choices of $m$, or when the diagnostic measures point towards solutions that lead to substantially different conclusions, the final results have to be interpreted with caution. In this case, it may prove useful to consider different model specifications by changing the set of exogenous regressors. However, in general, we advocate against over-fitting by using the extra penalization term proposed by Biernacki, Celeux and Govaert (2000) that adjusts the BIC statistic more severely when the instruments yield posterior groupings that are fuzzy.

4.5.2 Residuals, outliers, and influential observations

In this section we extend several diagnostics, originally proposed for the clas-

sical regression model (Fox, 1991, Belsley, Kuh and Welsch, 1980, Cook and

Weisberg, 1982) to the LIV case. Outliers and influential observations can be

problematic because they may influence estimation results, and their presence

in large numbers may point out that the used model failed to capture impor-

tant aspects of the data. Analyzing residuals can reveal important information

for assessing model assumptions. Although maximum-likelihood is still ap-

proximately valid in all but small samples, highly non-normal residuals, in

particular, skewed and bimodal ones, are to be distrusted.

Strength of the instruments. In applying classical IV estimation it is recommended to report the $R^2$ or $F$-statistic from the regression of the endogenous regressors on the instrumental variables (Bound, Jaeger and Baker, 1995). In section 4.4 we saw that when the instruments explain only a small part of the variation of the endogenous regressors, the instruments are weak and using the IV results in this case is not recommendable. Instruments can be computed as a byproduct of the LIV results by computing a posteriori category membership using Bayes' theorem. In addition to the tests proposed in section 4.3, these estimates can be 'treated' as observed instruments and used to compute the $R^2$ as a diagnostic. This $R^2$ can, if observed instruments are available, be compared to the $R^2$ from classical IV.
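A minimal sketch of this diagnostic is given below. It is a simplification and not the procedure used for the reported results: the posterior probabilities are computed from the marginal mixture of $x$ only (no other regressors, common variance $\sigma_\nu^2$), whereas the full LIV posterior would use the bivariate density of $(y, x)$. Each observation is then assigned to its most probable category and the $R^2$ of regressing $x$ on the category dummies is reported. The function names and the parameter values in the usage example are hypothetical.

```python
import math


def posteriors(x, pis, lams, s2_nu):
    """Bayes' theorem: p_ij proportional to lambda_j * N(x_i; pi_j, s2_nu).

    The normalising constant of the normal density cancels because the
    variance is common to all categories.
    """
    out = []
    for xi in x:
        w = [lam * math.exp(-0.5 * (xi - pi) ** 2 / s2_nu) for pi, lam in zip(pis, lams)]
        total = sum(w)
        out.append([wj / total for wj in w])
    return out


def r2_map_groups(x, post):
    """R^2 of regressing x on dummies for the most probable category of each observation."""
    groups = {}
    for xi, p in zip(x, post):
        j = max(range(len(p)), key=p.__getitem__)   # a posteriori category membership
        groups.setdefault(j, []).append(xi)
    mean = sum(x) / len(x)
    ss_total = sum((xi - mean) ** 2 for xi in x)
    ss_within = sum(sum((xi - sum(g) / len(g)) ** 2 for xi in g) for g in groups.values())
    return 1.0 - ss_within / ss_total
```

For example, `r2_map_groups(x, posteriors(x, (-1.0, 1.0), (0.5, 0.5), 0.25))` with `x = [-1.1, -0.9, 0.9, 1.1]` assigns the first two observations to the first category and the last two to the second. Because the groups are formed from $x$ itself in this simplified version, the resulting $R^2$ should be read as a rough indication of group separation only.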


Analyzing residuals. In most empirical work it is reasonable to assume bivariate normality of the residuals. If some other distribution generated the data, the MLE based on joint normality might lose its appeal. The model residuals from (4.1), however, can be examined to investigate the normality assumption of the disturbances. Furthermore, they can be used to detect potential outliers and to examine heteroscedasticity. Defining the LIV residuals, however, is not a straightforward matter, since the complete model is a bivariate mixture model (see also Wang et al., 1996). We look at two types of residuals: the conditional residuals, and IV-type residuals.

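Conditional residuals. One way to examine residuals is to look at the conditional distribution of $y$ given $x$ in the LIV model (for the sake of notational simplicity we omit other exogenous regressors here). Conditional on category $j$ and assuming normality, we have$^5$: $(y \mid x, j) \sim N(\mu_{y|x,j}, \sigma^2_{y|x,j})$, where the conditional mean of $y \mid x, j$ is

\begin{align}
E(y_i \mid x_i, j) &= \Big(\beta_0 - \frac{\sigma_{\varepsilon\nu}}{\sigma_\nu^2}\pi_j\Big) + \Big(\beta_1 + \frac{\sigma_{\varepsilon\nu}}{\sigma_\nu^2}\Big)x_i \tag{4.13}\\
&= \beta_0 + \beta_1 x_i + \frac{\sigma_{\varepsilon\nu}}{\sigma_\nu^2}(x_i - \pi_j), \tag{4.14}
\end{align}

with $x_i - \pi_j = \nu_i$, and $\mathrm{var}(y \mid x, j) = \sigma_\varepsilon^2 - \sigma_{\varepsilon\nu}^2/\sigma_\nu^2$.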
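For completeness, the conditional mean and variance can be derived from the standard result for the bivariate normal distribution. The sketch below assumes, as in the text, that other exogenous regressors are omitted, so that conditional on category $j$ we have $x_i = \pi_j + \nu_i$ and $y_i = \beta_0 + \beta_1\pi_j + \beta_1\nu_i + \varepsilon_i$:

```latex
\begin{align*}
\operatorname{var}(x_i \mid j) &= \sigma_\nu^2, \qquad
\operatorname{cov}(y_i, x_i \mid j) = \beta_1\sigma_\nu^2 + \sigma_{\varepsilon\nu}, \qquad
\operatorname{var}(y_i \mid j) = \beta_1^2\sigma_\nu^2 + \sigma_\varepsilon^2 + 2\beta_1\sigma_{\varepsilon\nu},\\[4pt]
E(y_i \mid x_i, j) &= \beta_0 + \beta_1\pi_j
  + \frac{\beta_1\sigma_\nu^2 + \sigma_{\varepsilon\nu}}{\sigma_\nu^2}\,(x_i - \pi_j)
  = \beta_0 + \beta_1 x_i + \frac{\sigma_{\varepsilon\nu}}{\sigma_\nu^2}\,(x_i - \pi_j),\\[4pt]
\operatorname{var}(y_i \mid x_i, j) &= \operatorname{var}(y_i \mid j)
  - \frac{(\beta_1\sigma_\nu^2 + \sigma_{\varepsilon\nu})^2}{\sigma_\nu^2}
  = \sigma_\varepsilon^2 - \frac{\sigma_{\varepsilon\nu}^2}{\sigma_\nu^2}.
\end{align*}
```

The moments in the first line coincide with the entries that appear in (4.6)-(4.8), which is why the same quantities determine both the identification argument and the conditional residuals.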

This conditional distribution yields two valuable insights. Firstly, the residual $\hat{e}_i = y_i - \hat{y}_i$, with $\hat{y}_i = \hat{\mu}_{y_i|x_i,j}$, is equal to $\hat{e}_i = (\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)x_i - (\hat{\sigma}_{\varepsilon\nu}/\hat{\sigma}_\nu^2)\hat{\nu}_i + \varepsilon_i$, with $\hat{\nu}_i = x_i - \hat{x}_i = (\pi_j - \hat{\pi}_j) + \nu_i$. Because of the presence of $\hat{\nu}_i$, normality of $\varepsilon_i$ cannot be examined via this type of residual. Secondly, the above conditional mean has an interesting interpretation. If we consider the OLS model $y = \alpha_0 + \alpha_1 x + \varepsilon$, where $\sigma_{x\varepsilon} \neq 0$, the mean of $x$ is $\mu_x$ and the variance is $\sigma_x^2$, then the probability limit of the OLS estimator for $\alpha_0$ is $\alpha_0 - (\sigma_{x\varepsilon}/\sigma_x^2)\mu_x$ and for $\alpha_1$ it is $\alpha_1 + \sigma_{x\varepsilon}/\sigma_x^2$, which resembles (4.13) in form. This shows that treating $x$ as given and estimating the mean in (4.13) with least squares yields inconsistent results for $\beta_0$ and $\beta_1$, unless $\sigma_{x\varepsilon} = 0$. From the simulation studies we know that the LIV model yields consistent results in both cases (similar observations can be made when considering the unconditional mean (with respect to $j$) $E(y_i \mid x_i)$). Since a priori it is not known to which category/group individual $i$ belongs, we use the a posteriori group probabilities $p_{ij}$, and compute the residuals as $\hat{e}_i = y_i - \hat{y}_i$, where $\hat{y}_i = \sum_{j=1}^{m} p_{ij}\,\hat{\mu}_{y_i|x_i,j}$.

$^5$ Use the result: $(v, w) \sim N_2(\mu_v, \mu_w, \sigma_v^2, \sigma_w^2, \rho)$, with $\sigma_{vw} = \rho\sigma_v\sigma_w$. Then, $f_v(v) = N(\mu_v, \sigma_v^2)$, $f_w(w) = N(\mu_w, \sigma_w^2)$, and $f(v \mid w) = N(\alpha_0 + \alpha_1 w, \sigma_v^2(1 - \rho^2))$, with $\alpha_0 = \mu_v - \alpha_1\mu_w$, $\alpha_1 = \sigma_{vw}/\sigma_w^2$.

IV-type residuals. In classical IV estimation with observed instruments, the residual is estimated as $y - X\hat{\beta}$ and not $y - \hat{X}\hat{\beta}$, see e.g. Greene (2000) or Pagan (1984). Using the latter residuals results in an incorrect estimate of the standard errors. Applying this result to LIV gives $\hat{e}_i = y_i - \hat{y}_i = (\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)x_i + \varepsilon_i$, where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the estimated values are the maximum likelihood estimates from LIV. Note that there is no 'direct' effect of $\nu_i$. Furthermore, there is no need to use the a posteriori group memberships. Unfortunately, we found this type of residual to be misleading in detecting heteroscedasticity$^6$.

In table 4.4 we examine the robustness of LIV against misspecifying the error distribution (see also Honore and Hu, 2004). Furthermore, we investigate whether the residuals proposed above can effectively be used to detect departures from normality. We used three different specifications for the distribution of $\varepsilon_i$: (1) a normal distribution, (2) a $\chi^2_1$ distribution, and (3) a $t_3$ distribution$^{7}$, all normalized to have mean 0 and variance 1. The error term for the endogenous regressor $x$ was computed as $\nu_i = a\varepsilon_i + bu_i$, where $u_i$ is from a normal distribution with mean 0 and variance 1, and $a$ and $b$ are chosen$^{8}$ such that the variance of $\nu_i$ is 1 and the correlation between $x$ and $\varepsilon_i$ is 0, 0.1, 0.2, 0.3, 0.4 and 0.5. We assumed the presence of two other exogenous regressors. The error term $\varepsilon_i$ accounts for approximately 25% of the variance in $y$. We take a moderately sized sample ($n = 1000$) and use 250 Monte Carlo simulations. The results in table 4.4 are the mean biases (i.e. true value minus estimated value) and standard deviations of the biases across the simulations for $\beta_1$, $\sigma^2_\varepsilon$, and $\sigma_{\varepsilon\nu}$. It can be seen that the LIV results for the regression parameter are fairly insensitive$^{9}$ to the misspecification of the error distribution for $\varepsilon_i$. The variance components $\sigma^2_\varepsilon$ and $\sigma_{\varepsilon\nu}$ are also estimated unbiasedly, but at a cost of lower efficiency. We presented the results for the OLS estimator for comparison and it can be seen that, in particular for high degrees of regressor-error dependencies, the standard deviations for the OLS estimator are much larger than in the normal case. In table 4.5 we illustrate the effect of misspecification of the error distribution on the Hausman-LIV test at the 5% level. It can be seen that under $H_0$ the test rejects somewhat too often (0.09 against the nominal 0.05) for the fat-tailed $t$-distribution and the skewed $\chi^2_1$ distribution, but the power to detect a nonzero $\rho_{x,\varepsilon}$ is not much lower than for the normal distribution. In all cases we investigated the skewness, the kurtosis, and QQ plots of the ‘IV-type’ residuals, which clearly indicated that the disturbances were fat-tailed or skewed. This investigation can be accompanied by the test suggested in Greene (2000).

Table 4.4: Effect of misspecifying the regressor-error distribution on biases and standard errors (standard deviations in parentheses).

                        ρ_xε = 0                        ρ_xε = 0.5
Errors  Method    β1       σ²_ε     σ_εν        β1       σ²_ε     σ_εν
N       OLS       0.001   -0.005                0.345    0.299
                 (0.020)  (0.042)              (0.013)  (0.030)
        LIV      -0.003   -0.004    0.008       0.002    0.001    0.003
                 (0.044)  (0.043)  (0.092)     (0.025)  (0.064)  (0.011)
χ²      OLS       0.000    0.007                0.346    0.304
                 (0.020)  (0.122)              (0.030)  (0.059)
        LIV       0.000    0.007    0.000      -0.004    0.010    0.006
                 (0.041)  (0.122)  (0.095)     (0.027)  (0.123)  (0.017)
St. t   OLS      -0.002    0.033                0.342    0.306
                 (0.018)  (0.132)              (0.039)  (0.083)
        LIV       0.001    0.032   -0.002       0.001    0.008    0.006
                 (0.046)  (0.132)  (0.106)     (0.025)  (0.177)  (0.020)

6 The problem in using this residual can be described as follows. If, for instance, $x$ is positively correlated with $\varepsilon$, the OLS line will be biased upward (too steep). Since LIV estimates on average the correct value, the estimated LIV line will be flatter. Consequently, in this case, for the larger values of $x$, the difference $y - \hat{y}$ will be more positive, and for the smaller values more negative, i.e. there is a positive correlation observed between $x$ and $e$.

7 In approximately 5% of the generated datasets, the $t_3$ distribution resulted in a very extreme observation which would cause numerical under(over)flows. These observations were discarded.

8 The Cholesky decomposition can be used for that purpose.

9 We did not consider the potential multi-modality of the log-likelihood function due to misspecification of the error distribution and the effect of different starting values for the numerical optimization routine here, but note that this may be an issue in case of heavily skewed, ‘fat’, or bimodal errors.

Table 4.5: Effect of misspecifying the error distribution on power of the Hausman-LIV test at α = 0.05.

            Distribution
ρ_x,ε     N      χ²     t
0        0.05   0.09   0.09
0.1      0.46   0.51   0.43
0.2      0.99   0.96   0.92
0.3      1.00   1.00   1.00
0.4      1.00   1.00   1.00
0.5      1.00   1.00   1.00
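The error-generating step of the simulation design can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the tables: the function names are hypothetical, and the standard deviation of $x$ (`sd_x`) is treated as a known input, since $\mathrm{corr}(x, \varepsilon) = a/\mathrm{sd}(x)$ when the instrument part of $x$ is independent of $\varepsilon$ and $\mathrm{var}(\varepsilon) = 1$. For two errors, choosing $a$ and $b$ this way is equivalent to using the Cholesky factor of their covariance matrix.

```python
import numpy as np


def draw_epsilon(rng, n, dist):
    """Draw epsilon with mean 0 and variance 1 from one of three families."""
    if dist == "normal":
        return rng.standard_normal(n)
    if dist == "chi2":
        # chi-square(1) has mean 1 and variance 2
        return (rng.chisquare(1, n) - 1.0) / np.sqrt(2.0)
    if dist == "t3":
        # Student t with 3 degrees of freedom has variance 3
        return rng.standard_t(3, n) / np.sqrt(3.0)
    raise ValueError(f"unknown distribution: {dist}")


def draw_errors(rng, n, rho_x_eps, sd_x, dist="normal"):
    """Return (epsilon, nu) with nu = a*epsilon + b*u and var(nu) = 1.

    a is set so that corr(x, epsilon) = rho_x_eps for a regressor x with
    standard deviation sd_x (an assumed, user-supplied value).
    """
    a = rho_x_eps * sd_x
    if abs(a) >= 1.0:
        raise ValueError("rho_x_eps * sd_x must be smaller than 1 in absolute value")
    b = np.sqrt(1.0 - a**2)
    eps = draw_epsilon(rng, n, dist)
    u = rng.standard_normal(n)
    return eps, a * eps + b * u
```

A call such as `draw_errors(np.random.default_rng(0), 1000, 0.3, 1.5, "t3")` would give one replication's worth of errors; the values 0.3 and 1.5 are purely illustrative.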

In summary, this simulation study shows that the LIV method is fairly robust against misspecification of the error term, at least in large samples. In such a case, the maximum likelihood LIV estimator may no longer be fully efficient, and alternative, more efficient estimators may exist (see e.g. the discussion in Honore and Hu, 2004). One possible caveat of the LIV model in the presence of a severely misspecified error distribution, such as the $\chi^2_1$ distribution, is that the log-likelihood may be multimodal. In such a case, the starting values for numerically optimizing the log-likelihood become important, since the LIV model may wrongfully mix on the skewed error distribution instead of on the latent instrument. Here we do not attempt to solve that question, and leave it for further research, since lack of normality can be effectively examined by using the IV-type residuals to compute kurtosis and skewness. Furthermore, heteroscedasticity can be detected by using the conditional residuals from (4.13), i.e. by examining scatterplots of the residuals versus explanatory variables and the predicted values. Presence of heteroscedasticity does not affect consistency or unbiasedness of the regression parameters, but leads to a loss in efficiency. To correct the estimated standard errors for heteroscedasticity, White’s (1980) method can be used. A more detailed discussion of potential strategies to remedy heteroscedasticity or non-normality can be found in Fox (1991). Finally, outliers can be identified by examining standardized versions of the above residuals, or using the methods presented in the following.

Analyzing influential observations and outliers. Since standard closed form expressions of outlier diagnostics available for OLS cannot be generalized for LIV, we propose to approximate the jackknife LIV estimate $\hat{\theta}_{(i)}$ by a few numerical optimization steps with the maximum likelihood estimate of the complete sample as starting value (Cook and Weisberg, 1982, Belsley, Kuh and Welsch, 1980, or Fahrmeir and Tutz, 1994). Once these estimates are available, we propose to use the following measures to determine the influence of observation $i$ on:

1. the likelihood, measured by the likelihood distance $LD(i) = 2[LL(\hat{\theta}) - LL(\hat{\theta}_{(i)})]$, where $LL(\theta)$ denotes the value of the log-likelihood for the complete sample in point $\theta$.

2. the estimated parameters, measured by Cook’s distance $CD(i) = (\hat{\theta} - \hat{\theta}_{(i)})' H(\hat{\theta}) (\hat{\theta} - \hat{\theta}_{(i)})$, where $H(\hat{\theta})$ is the Hessian evaluated at $\hat{\theta}$.

3. the estimated covariance matrix, which is measured by $\mathrm{COVRATIO1}(i) = \det[V(\hat{\theta}_{(i)})]/\det[V(\hat{\theta})]$, where $V(\hat{\theta})$ denotes the estimated variance-covariance matrix for $\hat{\theta}$.

4. the estimated covariance matrix of $(\varepsilon, \nu)$, given by $\mathrm{COVRATIO2}(i) = \det[\hat{\Sigma}_{(i)}]/\det[\hat{\Sigma}]$, where $\Sigma$ is given in (3.4). Because $\Sigma$ is essential in correcting for endogeneity, this measure may point towards observations having a large effect on the relation between $\varepsilon$ and $\nu$.

Our experience based on simulation studies$^{10}$ is that the four measures mentioned above, together with an examination of the residuals, can be fruitfully applied to detect outliers and influential observations. We propose examining the ranking of the largest values of $LD(i)$ or $CD(i)$, and of $|\mathrm{COVRATIO1}(i) - 1|$ and $|\mathrm{COVRATIO2}(i) - 1|$, where large jumps between subsequent observations indicate potential influential or outlying observations.
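Given full-sample and leave-one-out estimates, the four measures and the ranking step can be sketched as below. All names are hypothetical, and the leave-one-out quantities are assumed to come from an LIV optimizer that is not shown. One further assumption is labeled in the code: the matrix passed to `cooks_distance` is taken to be the Hessian of the negative log-likelihood, so that the distance is non-negative.

```python
import numpy as np


def likelihood_distance(ll_at_full, ll_at_loo):
    """LD(i) = 2 [LL(theta_hat) - LL(theta_hat_(i))], both on the complete sample."""
    return 2.0 * (ll_at_full - ll_at_loo)


def cooks_distance(theta_full, theta_loo, hessian):
    """CD(i) = d' H d with d = theta_hat - theta_hat_(i).

    Assumption: `hessian` is the Hessian of the NEGATIVE log-likelihood.
    """
    d = np.asarray(theta_full, dtype=float) - np.asarray(theta_loo, dtype=float)
    return float(d @ np.asarray(hessian, dtype=float) @ d)


def cov_ratio(cov_loo, cov_full):
    """det[M_(i)] / det[M]; usable for both COVRATIO1 and COVRATIO2."""
    return float(np.linalg.det(cov_loo) / np.linalg.det(cov_full))


def rank_with_gaps(values):
    """Sort a diagnostic in descending order and return the observation
    order together with the jumps between subsequent sorted values."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(-values)
    gaps = -np.diff(values[order])
    return order, gaps
```

A large first entry in `gaps` relative to the remaining ones would flag the top-ranked observation for closer inspection.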

The next section introduces another test for regressor-error correlations that may give more stable results than the earlier proposed Hausman-LIV test.

4.6 The Hausman-LIV test revised

The Hausman-LIV test proposed in the previous chapter was shown to have reasonable power across a wide range of regressor-error correlations and for several distributions of the endogenous regressor. The Hausman-LIV test compares the LIV estimate with the OLS estimate, and their difference is used to construct a test statistic which has a $\chi^2_1$ distribution under the null hypothesis. When the difference is substantial enough to be important, the probability that the test rejects should be large; otherwise the test does not provide much information. This test uses the complete vector of estimated regression coefficients $\beta = (\beta_0, \beta_1, \beta_2')'$ for OLS and LIV and the corresponding estimated variance-covariance matrices. However, a simpler test that focusses only on the covariance between $\varepsilon_i$ and $\nu_i$ can be constructed. A test of no regressor-error correlation is equivalent to testing $H_0: \sigma_{\varepsilon\nu} = 0$. This hypothesis is a linear restriction on the parameter vector and can be tested by using the same Wald test$^{11}$ as in section 4.3. The potential advantage of this test is that it relies on fewer parameters.
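For a single restriction such as $H_0: \sigma_{\varepsilon\nu} = 0$, a Wald statistic reduces to the squared ratio of the estimate to its standard error, compared with a $\chi^2_1$ distribution. The sketch below illustrates that special case only; the function name is hypothetical, the estimate and standard error are assumed to come from a fitted LIV model, and the exact form of the test in section 4.3 is not reproduced here.

```python
import math


def wald_test_zero_covariance(sigma_eps_nu_hat, se_sigma_eps_nu):
    """Wald test of H0: sigma_eps_nu = 0 for one restriction.

    Returns the statistic W = (estimate / standard error)^2 and its
    p-value under chi-square(1), using P(chi2_1 > w) = erfc(sqrt(w / 2)).
    """
    w = (sigma_eps_nu_hat / se_sigma_eps_nu) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value
```

Exogeneity of $x$ would be rejected at level $\alpha$ when the returned p-value falls below $\alpha$.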

10 Not reported here.

11 In fact, the Hausman test is also based on a Wald statistic, see e.g. Greene (2000).


We found that the Hausman-LIV statistic may give negative values when other exogenous regressors $x_2$ are present, because one or more eigenvalues of the difference matrix $\Sigma_{LIV}$ in (3.26) were negative. In such a case, it is falsely concluded that LIV is ‘more efficient’ than OLS. We did not find, however, that the estimated standard deviation of $\beta_1$ is smaller (i.e. more efficient) for LIV than for OLS, but the estimated covariances between $\beta_0$, $\beta_1$, and $\beta_2$ may be larger, and the estimated standard deviations of $\beta_0$ and $\beta_2$ may be smaller using LIV. In these cases, the difference matrix $\Sigma_{LIV}$ is indefinite, and the resulting Hausman-LIV statistic is not necessarily larger than or equal to zero$^{12}$. It is not uncommon to find a non-positive definite matrix when applying the Hausman test within a classical IV framework. In fact, unless the set of regressors and the set of instrumental variables have no variables in common, the ordinary inverse of the estimated asymptotic covariance matrix in the Hausman test will not exist. In other situations, this singularity may be a finite sample problem (cf. Greene, 2000). In the classical IV framework one normally uses $\Sigma^a_{IV} = [(X'P_ZX)^{-1} - (X'X)^{-1}]^{-1}/s^2$ to compute the Hausman statistic, where $s^2$ is the common estimator for $\sigma^2_\varepsilon$ based either on IV or OLS. Another approach is to subtract the estimated variance-covariance matrices for IV and OLS, i.e. $\Sigma^b_{IV} = [s^2_{IV}(X'P_ZX)^{-1} - s^2_{OLS}(X'X)^{-1}]^{-1}$, which is, according to Dhrymes (2003), a ‘naive’ application of the Hausman test. However, the Hausman-LIV test uses the latter approach (see section 3.4), which may explain, given Dhrymes’ conclusion, part of the problems we observe. Besides, the explicit presence of the exogenous regressors $x_2$ in the equation for $x$ may induce a higher estimated covariance between the estimated values for $\beta_1$ and $\beta_2$. The fact that some of the estimated standard deviations are smaller than the corresponding OLS estimates is because the LIV model includes a model for the endogenous regressor, which may imply an efficiency advantage for some of the estimated regression parameters of the exogenous regressors.
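The sign problem described above can be made concrete with a small sketch of the ‘naive’ form of the statistic, which subtracts the two estimated variance-covariance matrices directly. The function name and the numbers in the usage note are hypothetical; the sketch only shows how an indefinite difference matrix can produce a negative statistic and how its eigenvalues reveal this.

```python
import numpy as np


def hausman_statistic(b_consistent, b_efficient, v_consistent, v_efficient):
    """Hausman-type statistic d' [V_consistent - V_efficient]^{-1} d.

    Also returns the eigenvalues of the difference matrix, so that an
    indefinite difference (some negative eigenvalues) can be detected.
    """
    d = np.asarray(b_consistent, dtype=float) - np.asarray(b_efficient, dtype=float)
    diff = np.asarray(v_consistent, dtype=float) - np.asarray(v_efficient, dtype=float)
    diff = 0.5 * (diff + diff.T)  # remove numerical asymmetry
    eigenvalues = np.linalg.eigvalsh(diff)
    statistic = float(d @ np.linalg.solve(diff, d))
    return statistic, eigenvalues
```

With `v_consistent = diag(2, 0.5)` and `v_efficient = diag(1, 1)`, the second parameter has a smaller estimated variance under the consistent estimator, the difference matrix is indefinite, and a coefficient difference of `(0, 1)` gives a statistic of -2.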

In table 4.6 we compare the performance of the two tests. We use the same

skewed, bimodal, and unimodal distribution as in the previous chapter, where

12 E.g. Greene (2000), pp. 46-49.


Table 4.6: Power of the Hausman test (H) and Wald test (W) for exogeneity.

                    α = 0.5        α = 0.05       α = 0.01
Dist.   ρ_x,ε      H      W       H      W       H      W
bim8    0         0.45   0.45    0.05   0.05    0.02   0.02
        0.1       0.91   0.92    0.51   0.54    0.31   0.32
        0.2       1.00   1.00    0.98   0.98    0.92   0.92
        0.3       1.00   1.00    1.00   1.00    1.00   1.00
        0.4       1.00   1.00    1.00   1.00    1.00   1.00
        0.5       1.00   1.00    1.00   1.00    1.00   1.00
skew8   0         0.49   0.49    0.04   0.04    0.01   0.01
        0.1       0.98   0.98    0.68   0.69    0.47   0.51
        0.2       1.00   1.00    1.00   1.00    1.00   1.00
        0.3       1.00   1.00    1.00   1.00    1.00   1.00
        0.4       1.00   1.00    1.00   1.00    1.00   1.00
        0.5       1.00   1.00    1.00   1.00    1.00   1.00
unim8   0         0.55   0.55    0.06   0.08    0.03   0.04
        0.1       0.64   0.65    0.18   0.21    0.09   0.11
        0.2       0.93   0.93    0.55   0.57    0.30   0.38
        0.3       1.00   1.00    0.90   0.92    0.78   0.82
        0.4       1.00   1.00    0.99   1.00    0.96   0.99
        0.5       1.00   1.00    1.00   1.00    1.00   1.00

the true number of categories of the instrument is eight. We include two exogenous regressors, where one has a moderate effect on the endogenous regressor and the other has no effect. The mean of $x$ is zero, the total variance of $x$ is equal to 3, and the regressor-error correlations are taken to be 0, 0.1, 0.2, 0.3, 0.4, 0.5. The LIV model is estimated with two categories. We used $n = 1000$ and 250 Monte Carlo simulations. In table 4.6 we present the fraction of rejections of the null hypothesis across the simulation runs. It can be seen that both tests perform about equally well, but the Wald based test has a slightly larger power to detect endogeneity of $x$. Both tests are close to their true sizes under the null hypothesis and are more conservative for the unimodal distribution. The same observation was also made in the previous chapter and can be explained by the efficiency loss due to misspecifying the true number of categories of the latent instrument and the less well separated mixture components in the unimodal distribution. The power of both tests is higher than the results reported in table 3.1 for bim8, skew8, and unim8, because of the presence of two exogenous regressors to explain the variance in $y$. We found that the Hausman-LIV test performs better when the estimated variance-covariance matrix of $\beta$ is based on the Hessian matrix. Furthermore, for lower regressor-error correlations, the difference matrix $\Sigma_{LIV}$ is more often non-positive definite. In those cases the Hausman-LIV test statistic is not necessarily negative, and the results suggest that the Hausman-LIV test has nevertheless a reasonable power and a size close to its true value. However, the proposed alternative test, based on Wald’s principle, gives results that are at least as good as the results for the Hausman-LIV test, but does not have the problem of potentially ambiguous results. These results suggest using the Wald test statistic in empirical applications, but we recommend computing both and comparing their results$^{13}$.

4.7 Conclusions

The simple LIV model proposed in chapter 3 adequately solves for regressor-

error correlations in linear models without requiring observed instrumental

variables. This is a great advantage over existing methods that rely on observed

instrumental variables, in particular in empirical applications where good in-

struments are not available. In this chapter we propose several methods to

extend the simple LIV model.

Firstly, we include several exogenous regressors and observed instrumental variables in the model. This is an important generalization since in most empirical applications additional regressors are part of the model. We extend the identifiability proof in subsection 3.3.1 and we prove that all model parameters are identified for the more general LIV model. Furthermore, observed instrumental variables can be included in the LIV model, and, importantly, Wald-based test results are easily obtained to investigate their validity. To the best of our knowledge, this is not straightforwardly possible in a classical IV framework because of identifiability problems. The proposed Wald-based tests have a reasonable power in detecting weak and/or endogenous instruments. Our results suggest using a conservative significance level for the Wald test that tests for a zero effect of the instruments on the endogenous regressor. Instruments that are found to be ‘valid’ can be used either stand-alone, to compute a 2SLS or LIML estimate, or they can be included in the LIV model and used jointly with the latent discrete instrument to obtain an LIV-IV estimate. If the observed instruments are ‘invalid’, the classical instrumental variables estimates have to be distrusted and the LIV estimates are the most appropriate to use.

13 In subsection 8.2.2 we elaborate upon another test (the Lagrange multiplier test) to test for endogeneity without requiring observed instruments, which can potentially be computed straightforwardly using standard computing packages, because it operates under the restricted model ($\sigma_{\varepsilon\nu} = 0$). This test is asymptotically equivalent to the tests presented here (Greene, 2000).

In section 4.5 we propose several diagnostics that can be used to complete an LIV analysis. In empirical applications, the number of categories of the unobserved instrument has to be chosen. The AIC3, BIC or ICL criteria can be used for this purpose. The ICL criterion is equal to BIC, with an additional penalty for fuzzy clustering. As such, it tends to be more conservative, which avoids overfitting and adds tractability to our specification. We emphasize that several specifications of $m$ should be examined, where substantially different conclusions based on AIC3, BIC, and ICL are to be distrusted. The residuals can be analyzed to investigate the normality assumption of the errors. When the errors have a non-normal distribution, the LIV estimates may not be fully efficient. The results of the simulation study in subsection 4.5.2 illustrate that the LIV model is fairly robust against fat-tailed and/or skewed error distributions, which is a desirable property. However, further research is required to investigate this in more detail, in particular the presence of severely skewed, ‘fat’, or multimodally distributed errors and the possible multimodality of the LIV log-likelihood surface in such cases. Heteroscedasticity and outliers or influential observations can be dealt with in a similar way, and the methods we propose were found to be effective in identifying heteroscedasticity and outlying or influential observations. Finally, in section 4.6 we discuss another test for regressor-error correlations, because the Hausman-LIV test was found to give possibly ambiguous results in the presence of several exogenous regressors. This new test is a Wald test and focusses on the covariance between the error terms. We recommend computing both tests in empirical applications; similar conclusions should inspire confidence in the results found.

In the next chapter we apply the LIV method to estimate the return to edu-

cation for three empirical applications. We show that reasonable estimates are

obtained without using observed instruments, that are to be preferred over stan-

dard OLS and classical IV estimates for the return to education. The size and

magnitude of the bias found across the three applications in the OLS estimate,

as indicated by the IV estimate, depends on the type of instruments used. Our

results are more consistent, they are in accordance with the traditional ability

bias argument, and they resemble more closely recent results on twin studies.

Furthermore, the LIV diagnostics proposed in this chapter indicate that the

LIV model assumptions are reasonable.


Appendix 4A 1st and 2nd order derivatives log-likelihood of the general LIV model

The 1st and 2nd order derivatives of the simple LIV model presented in subsection 3.3.2 can be extended easily to a more general situation where $l_1$ exogenous regressors $x_{2i}$ and $l_2 - l_1$ observed instrumental variables $x_{3i}$ are available.

Let $z_{2i} = (x_{2i}', x_{3i}')'$. The expected value of $(y_i, x_i)$ in (3B.2) now becomes

$$
\begin{aligned}
\mu^{y}_{i|j} &= \beta_0 + \beta_1\pi_j + x_{2i}'\beta_2 + z_{2i}'\gamma_2\beta_1 \\
\mu^{x}_{i|j} &= \pi_j + z_{2i}'\gamma_2,
\end{aligned}
\tag{4A.1}
$$

where $\beta_2$ is an $l_1 \times 1$ vector, $\gamma_2$ is an $l_2 \times 1$ vector, $x_{2i}$ is an $l_1 \times 1$ vector, and $z_{2i}$ is an $l_2 \times 1$ vector. The structure of the first- and second-order derivatives does not change by the inclusion of $x_{2i}$ and $x_{3i}$, but (4A.1) should be used to compute the gradient and Hessian, instead of (3B.2). In addition, the results in appendix 3B need to be augmented with the derivatives of the log-likelihood with respect to the elements of $\beta_2$ and $\gamma_2$, which both have a similar structure as the elements in $\theta_1$. Furthermore, the derivative of $q_{i|j}$ with respect to $\beta_1$ changes due to the presence of the product of $z_{2i}'\gamma_2$ and $\beta_1$ in $\mu^{y}_{i|j}$. Letting $\theta_1 = (\beta_0, \beta_1, \beta_2', \gamma_2')'$, we obtain

$$
\begin{aligned}
\frac{\partial q_{i|j}}{\partial \beta_1} ={}& -2\sigma^2_\nu (y_i - \mu^{y}_{i|j})(\pi_j + z_{2i}'\gamma_2) + 2(x_i - \mu^{x}_{i|j})^2(\sigma^2_\nu\beta_1 + \sigma_{\varepsilon\nu}) \\
& - 2(x_i - \mu^{x}_{i|j})\left\{-(\beta_1\sigma^2_\nu + \sigma_{\varepsilon\nu})(\pi_j + z_{2i}'\gamma_2) + (y_i - \mu^{y}_{i|j})\sigma^2_\nu\right\}
\end{aligned}
\tag{4A.2}
$$

$$
\frac{\partial q_{i|j}}{\partial \beta_{2l}} = x_{2il}\,\frac{\partial q_{i|j}}{\partial \beta_0}
\tag{4A.3}
$$

$$
\frac{\partial q_{i|j}}{\partial \gamma_{2k}} = z_{2ik}\,\frac{\partial q_{i|j}}{\partial \pi_j},
\tag{4A.4}
$$

for $l = 1, ..., l_1$ and $k = 1, ..., l_2$. Similar calculations can be done for the second order derivatives.
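The derivative in (4A.2) can be checked numerically against a finite difference. The sketch below rests on an assumption that is not confirmed by this appendix: $q_{i|j}$ is taken to be the quadratic form of $(y_i - \mu^y_{i|j},\, x_i - \mu^x_{i|j})$ multiplied by the determinant of the implied covariance matrix of $(y_i, x_i)$, which is the form under which (4A.2) follows term by term. All names and parameter values are hypothetical.

```python
import numpy as np


def means(p, x2, z2):
    """Component means (4A.1) for one observation and one category j."""
    zg = z2 @ p["gamma2"]
    mu_x = p["pi_j"] + zg
    mu_y = p["beta0"] + p["beta1"] * p["pi_j"] + x2 @ p["beta2"] + zg * p["beta1"]
    return mu_y, mu_x


def q(p, y, x, x2, z2):
    """ASSUMED q_{i|j}: det(Omega) times the quadratic form in (y, x)."""
    mu_y, mu_x = means(p, x2, z2)
    ry, rx = y - mu_y, x - mu_x
    c = p["beta1"] * p["s2_nu"] + p["s_eps_nu"]
    omega_yy = p["beta1"] ** 2 * p["s2_nu"] + 2 * p["beta1"] * p["s_eps_nu"] + p["s2_eps"]
    return p["s2_nu"] * ry**2 - 2 * c * ry * rx + omega_yy * rx**2


def dq_dbeta1(p, y, x, x2, z2):
    """Analytic derivative with respect to beta1, as written in (4A.2)."""
    mu_y, mu_x = means(p, x2, z2)
    ry, rx = y - mu_y, x - mu_x
    g = p["pi_j"] + z2 @ p["gamma2"]
    c = p["beta1"] * p["s2_nu"] + p["s_eps_nu"]
    return (-2 * p["s2_nu"] * ry * g
            + 2 * rx**2 * c
            - 2 * rx * (-c * g + ry * p["s2_nu"]))


def numeric_dq_dbeta1(p, y, x, x2, z2, h=1e-6):
    """Central finite difference of q with respect to beta1."""
    up = dict(p, beta1=p["beta1"] + h)
    down = dict(p, beta1=p["beta1"] - h)
    return (q(up, y, x, x2, z2) - q(down, y, x, x2, z2)) / (2 * h)


# Hypothetical values: l1 = 2 exogenous regressors, one observed instrument.
x2 = np.array([0.5, -1.0])
x3 = np.array([2.0])
z2 = np.concatenate([x2, x3])
params = dict(beta0=1.0, beta1=0.7, beta2=np.array([0.3, -0.2]),
              gamma2=np.array([0.4, 0.1, 0.6]), pi_j=1.5,
              s2_eps=1.0, s2_nu=2.0, s_eps_nu=0.5)
analytic = dq_dbeta1(params, 3.0, 2.5, x2, z2)
numeric = numeric_dq_dbeta1(params, 3.0, 2.5, x2, z2)
```

Agreement between `analytic` and `numeric` up to rounding supports the reconstruction of (4A.2) under the assumed form of $q_{i|j}$; the same pattern could be reused for (4A.3) and (4A.4).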


Appendix 4B Simulation results for the exogenous regressor

Table 4B.1: Means and standard deviations of the bias in the estimates for $\beta_2$ (standard deviations in parentheses; columns indexed by γ_{2z2}, rows by ρ_zu).

                                     γ_{2z2}
Method  ρ_zu      0        0.1      0.2      0.3      0.4      0.5
LIV     0         0.000   -0.002   -0.002   -0.001    0.000    0.002
                 (0.036)  (0.033)  (0.035)  (0.031)  (0.032)  (0.035)
        0.05     -0.003    0.003    0.001    0.003    0.001    0.003
                 (0.035)  (0.034)  (0.032)  (0.032)  (0.033)  (0.032)
        0.1      -0.001   -0.003    0.000   -0.001    0.000    0.000
                 (0.033)  (0.031)  (0.032)  (0.033)  (0.035)  (0.033)
        0.15      0.000   -0.004    0.000   -0.002   -0.003    0.002
                 (0.033)  (0.035)  (0.032)  (0.035)  (0.034)  (0.031)
        0.2       0.001    0.001    0.000    0.000   -0.001   -0.004
                 (0.031)  (0.030)  (0.031)  (0.031)  (0.032)  (0.031)
OLS     0         0.077    0.072    0.074    0.073    0.074    0.076
                 (0.030)  (0.028)  (0.028)  (0.027)  (0.027)  (0.028)
        0.05      0.070   -0.076    0.074    0.076    0.074    0.076
                 (0.030)  (0.028)  (0.029)  (0.028)  (0.029)  (0.027)
        0.1       0.073    0.069    0.073    0.073    0.074    0.074
                 (0.027)  (0.026)  (0.027)  (0.027)  (0.030)  (0.028)
        0.15      0.074    0.072    0.072    0.072    0.071    0.074
                 (0.028)  (0.028)  (0.028)  (0.030)  (0.029)  (0.028)
        0.2       0.074    0.075    0.074    0.073    0.072    0.071
                 (0.027)  (0.026)  (0.027)  (0.028)  (0.028)  (0.025)
IV      0         0.069   -0.005   -0.008   -0.003    0.000    0.000
                 (0.223)  (0.119)  (0.062)  (0.045)  (0.039)  (0.039)
        0.05      0.024    0.130    0.063    0.042    0.033    0.027
                 (0.571)  (0.131)  (0.048)  (0.039)  (0.035)  (0.033)
        0.1      -0.013    0.230    0.129    0.085    0.061    0.050
                 (1.490)  (0.177)  (0.051)  (0.037)  (0.035)  (0.032)
        0.15     -0.167    0.357    0.202    0.124    0.090    0.076
                 (1.974)  (0.241)  (0.072)  (0.040)  (0.032)  (0.030)
        0.2       0.326    0.473    0.263    0.172    0.123    0.098
                 (2.500)  (0.348)  (0.091)  (0.050)  (0.037)  (0.030)

Chapter 5

Estimating the return to education using LIV

5.1 Introduction

We apply the methods developed in the previous chapters to three empirical datasets to examine the return to education on income. Education is an important topic in public debates. Over the past decades, much research has been conducted to estimate the causal effect of education on earnings, see for instance Griliches (1977), Card (1999, 2001), or Uusitalo (1999). Most of the studies in question have focused on estimating a version of the following linear regression equation:

$$y_i = \beta_0 + \beta_1 S_i + X_i\beta_2 + \varepsilon_i, \tag{5.1}$$

where $y_i$ is the logarithm of a measure of earnings, $S_i$ is a measure of education and $X_i$ is a collection of other explanatory variables assumed to influence $y_i$. $\beta_1$ measures the effect of education on income and is expected to be positive. The disturbances $\varepsilon_i$ represent all other influences not explicitly accounted for. If the disturbances are distributed independently of the explanatory variables $S_i$ and $X_i$, the simple OLS estimator can be used to estimate $\beta_1$. However, the independence assumption may not be realistic and in this case it can be shown that the OLS estimator is biased (e.g. Greene, 2000). Four major potential sources of bias (ability bias, measurement error bias, heterogeneity bias, and optimizing behavior bias) have been identified in the literature on the relationship between education and income, each of which is discussed in section 5.2. As will become clear from this discussion, there is little agreement on the direction and magnitude of the potential bias in the OLS estimator of the return to education effect. This situation is not surprising in view of the many sources of potential regressor-error dependencies, with each of them having their own specific impact on the direction and magnitude of the bias in OLS. A further complicating factor is that these causes may offset or reinforce each other.

One way to circumvent problems of endogeneity is to find instruments and apply two-stage least squares or limited information maximum likelihood estimation techniques (see e.g. Bowden and Turkington, 1984, Verbeek, 2000, or Greene,

2000). Instruments are variables that mimic the troublesome regressors as well

as possible but are uncorrelated with the error term. Hence, instrumental vari-

ables cannot have a direct effect on the dependent variable. In practice it is not

obvious how or where to find valid instruments. Furthermore, instruments are

often weak, i.e. they only explain a small part of the variance of the endoge-

nous regressor. This may result in estimates that are even more biased than

the OLS estimates (Staiger and Stock, 1997, Bound, Jaeger and Baker, 1995),

see also sections 4.3 and 4.4. In section 5.3 we discuss the problems of IV

estimation for the model given in (5.1), and it is shown that instrumental vari-

able estimation in estimating the return to education is not a straightforward

exercise.

Card (1999, 2001) surveys several empirical studies on the return to education and finds regression estimates ranging from about 0.03 to 0.14. Quite often, the OLS estimates were not found to be statistically different from the instrumental variable estimates. As suggested above and discussed in more detail in section 5.3, instrumental variables estimates for these kinds of studies are potentially biased as well, because the instruments used are possibly weak and endogenous. We find empirical evidence for this in section 5.4. Recent evidence from twin studies suggests an upward bias in the OLS estimator of about 10%-15% (cf. Card, 1999). The major advantage of using data on twins is that no observed instruments are required as the within-family estimator can be used (see also subsection 5.3.3).

The latent instrumental variable (LIV) method proposed in the previous chapters provides approximately unbiased estimates of the model parameters without relying on observed instruments. It is based on the assumption that a discrete latent variable splits the endogenous regressor $x$ into an exogenous component and an endogenous component that is correlated with the error term. In

section 5.4 we estimate the return to education for three datasets using LIV. We

show, by using the previously proposed diagnostics, that the LIV model is not

particularly sensitive to outliers and fits the data sets fairly well. The instru-

ments that are estimated by LIV from the data are shown to be much stronger

than the available observed instruments that are typically used in these ap-

plications. Furthermore, we investigate the validity of the available observed

instrumental variables in these datasets using the methods proposed in section

4.3. We find considerable evidence that in two of the three applications the

observed instruments are weak and/or endogenous. Overall, the LIV approach

yields results that are more consistent than the classical IV results. We find a

moderate upward bias in OLS of approximately 7%, which is close to recent results from twin studies, and supports the ability bias hypothesis. On the other hand, the bias in OLS, as indicated by the classical IV estimates, ranges from −80% to +30% for the three applications, which illustrates that opposite answers may

be obtained, if one uses different sets of instrumental variables to address the

same substantive research question. Section 5.5 presents a summary of our

findings.

5.2 Sources of bias in the OLS estimate of the return to education

5.2.1 Ability bias

Much work has focused on the issue whether the presence of a so-called ‘ability bias’ overstates the true causal effect of education on earnings (e.g. Angrist and Krueger, 1991, Harmon and Walker, 1995, Verbeek, 2000). ‘Ability’ can be seen as an omitted variable that enables (certain) individuals to obtain more income. In this case, the true model is $y_i = \beta_0 + \beta_1 S_i + \rho A_i + \varepsilon_i$, where $A_i$ denotes ‘ability’ and $\rho$ is the effect of ‘ability’ on income which is expected to be positive. We assume for the moment that there are no other explanatory variables present. The probability limit of the OLS estimator for $\beta_1$, while omitting ‘ability’, is$^{1}$

$$\operatorname{plim} \hat{\beta}^{OLS}_1 = \operatorname{plim} \frac{\sum_{i=1}^{n}(S_i - \bar{S})(y_i - \bar{y})}{\sum_{i=1}^{n}(S_i - \bar{S})^2} = \beta_1 + \rho\,\frac{\sigma_{SA}}{\sigma^2_S},$$

where $\sigma_{SA}$ denotes the covariance between education and ability, and $\sigma^2_S$ is the variance of $S$. When individuals with higher ability have chosen to obtain more education ($\sigma_{SA} > 0$), the effect of education on income is overstated, as $\rho > 0$, since the effect of unobserved ability is falsely attributed to it. As such, exogenous shocks in education levels will have less effect on individual wages than what is predicted by the OLS regression model, and education seems more valuable than it actually is.
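The omitted-variable result can be illustrated with a small simulation. Everything in the sketch is hypothetical (parameter values, variable names, sample size); it only checks that the OLS slope from a regression of $y$ on $S$ alone settles near $\beta_1 + \rho\,\sigma_{SA}/\sigma^2_S$ when ability is left out.

```python
import numpy as np


def ols_slope(s, y):
    """Slope of a simple regression of y on s (with intercept)."""
    s_c = s - s.mean()
    return float(s_c @ (y - y.mean()) / (s_c @ s_c))


def simulate_ability_bias(n=200_000, seed=1):
    """Return the OLS slope omitting ability and its theoretical limit."""
    rng = np.random.default_rng(seed)
    beta0, beta1, rho = 1.0, 0.08, 0.05       # hypothetical true values
    ability = rng.standard_normal(n)           # var(A) = 1
    schooling = 12.0 + 1.5 * ability + 2.0 * rng.standard_normal(n)
    y = beta0 + beta1 * schooling + rho * ability + 0.3 * rng.standard_normal(n)
    sigma_sa = 1.5                             # cov(S, A) = 1.5 * var(A)
    sigma_s2 = 1.5**2 + 2.0**2                 # var(S)
    plim = beta1 + rho * sigma_sa / sigma_s2
    return ols_slope(schooling, y), plim
```

With these illustrative values the limit equals $0.08 + 0.05 \times 1.5/6.25 = 0.092$, an overstatement of the true return of 0.08.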

5.2.2 Measurement error bias

Although ‘ability’ bias may induce a positive bias in OLS, error in the measurement of the education variable $S_i$ may result in downward biases. Often, the only data available to measure education is ‘years of schooling’. However, it can be questioned whether ‘years of schooling’ adequately measures ‘total education’. Griliches (1977) shows that if the measures for education are imperfect, OLS estimates can have a large downward bias. This bias is magnified (even if the error of measurement is small) when more variables are included in the model. This can be seen as follows (Griliches, 1977). Let the true wage-education equation be

$$y_i = \beta_1 S^*_i + X_i\beta_2 + \varepsilon_i,$$

1 Substitute the expression for $y_i$ in the expression for $\hat{\beta}^{OLS}_1$, expand the terms, and use the law of large numbers and Slutsky’s theorems (see Ferguson, 1996).


with $\beta_1 = 0.1$ and $\beta_2 = 0.01$, say. $S^*_i$ is the true but unobserved level of education. The observed level of schooling $S_i$ is measured with error, such that $S_i = S^*_i + u_i$, with $u_i$ a random term independent of $S^*_i$ and $\varepsilon_i$. Let $\lambda = \sigma^2_u/\sigma^2_S$ be the fraction of the observed variance in schooling that is due to measurement error and assume that $X_i$ (e.g. ability or other explanatory variables) is measured without error. Regressing $y$ on $S$, while ignoring $X$, gives

$$\hat{\beta}^{OLS}_1 = \frac{\sum_{i=1}^{n}(S_i - \bar{S})(y_i - \bar{y})}{\sum_{i=1}^{n}(S_i - \bar{S})^2} = \beta_1 - \beta_1\frac{\sum_{i=1}^{n}(S_i - \bar{S})(u_i - \bar{u})}{\sum_{i=1}^{n}(S_i - \bar{S})^2} + \beta_2\frac{\sum_{i=1}^{n}(S_i - \bar{S})(X_i - \bar{X})}{\sum_{i=1}^{n}(S_i - \bar{S})^2},$$

which has probability limit

$$\operatorname{plim} \hat{\beta}^{OLS}_1 = \beta_1 - \beta_1\frac{\sigma_{Su}}{\sigma^2_S} + \beta_2\frac{\sigma_{SX}}{\sigma^2_S},$$

where the covariance $\sigma_{Su} = \sigma^2_u$, since $S^*$ is independent of $u$, and $\sigma_{SX}$ is the covariance between $S$ and $X$. Suppose that 10% of the observed variance in schooling is due to measurement error, i.e. $\lambda = 0.1$, that $\sigma_S = 3$, $\sigma_X = 15$, and that the correlation between $X$ and $S$ is $\rho_{XS} = 0.5$. Then the inconsistency of $\hat{\beta}^{OLS}_1$ follows from

$$\operatorname{plim} \hat{\beta}^{OLS}_1 = 0.1 - 0.1 \times 0.1 + 0.01 \times \frac{3 \times 15 \times 0.5}{9} = 0.115,$$

from which it can be seen that the simple OLS estimator is biased upward by 15%. If the additional explanatory variable $X$ is added, the probability limit for $\hat{\beta}^{OLS}_1$ becomes (see Judge et al., 1985, p.708)

$$\operatorname{plim} \hat{\beta}^{OLS}_1 = \beta_1 - \lambda\frac{\beta_1}{1 - R^2_{SX}} = 0.1 - 0.1 \times 0.1/0.75 = 0.087,$$

where $R_{SX}$ is the multiple correlation coefficient between $S$ and $X$, and OLS exhibits a downward bias of 13%. It can be seen that: (1) measurement error in $S$ induces a negative bias in the OLS estimator for $\beta_1$, (2) this negative bias can be offset by an upward bias due to the omission of $X$ and, hence, the total bias in $\hat{\beta}^{OLS}_1$ may be positive, (3) adding explanatory variables $X$ correlated with the systematic components of schooling increases the measurement error bias in $\hat{\beta}^{OLS}_1$, even when the additional variables do not explain much of the variance in the observed (log) wages (Griliches, 1977).
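The two numerical results in this example can be reproduced with a few lines of arithmetic. The function names are hypothetical; the inputs are the values stated in the text, with $R^2_{SX} = 0.5^2 = 0.25$ because $X$ is a single variable here.

```python
def plim_omitting_x(beta1, beta2, lam, sd_s, sd_x, corr_xs):
    """plim of the OLS slope when y is regressed on S only.

    Uses sigma_Su / sigma_S^2 = lam and sigma_SX = corr_xs * sd_s * sd_x.
    """
    cov_sx = corr_xs * sd_s * sd_x
    return beta1 - beta1 * lam + beta2 * cov_sx / sd_s**2


def plim_including_x(beta1, lam, r2_sx):
    """plim of the OLS slope on S once X is added to the regression."""
    return beta1 - lam * beta1 / (1.0 - r2_sx)


without_x = plim_omitting_x(beta1=0.1, beta2=0.01, lam=0.1,
                            sd_s=3.0, sd_x=15.0, corr_xs=0.5)
with_x = plim_including_x(beta1=0.1, lam=0.1, r2_sx=0.5**2)

# Relative bias with respect to the true beta1 = 0.1
relative_bias_without_x = (without_x - 0.1) / 0.1
relative_bias_with_x = (with_x - 0.1) / 0.1
```

The relative biases come out at roughly +15% and -13%, matching the upward and downward biases quoted above.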

5.2.3 Heterogeneity bias

Heterogeneity in the regression coefficients of (5.1) is a third source of potential bias in the OLS estimates. People differ with respect to their marginal return to education, their marginal cost for education, and their tastes or beliefs. Now the return to education is not a single parameter but a random variable that potentially differs with background characteristics of individuals as well. Unobserved heterogeneity might induce a dependency between $S$ and $\varepsilon$. This can be seen as follows, where we omit other explanatory variables and make some simplifying assumptions on the distribution of $S_i$ to present a constructive example, see also Card (1999, 2001). Let

$$y_i = \beta_{0i} + \beta_{1i}S_i + \varepsilon_i, \tag{5.2}$$

with $\beta_{0i} = \beta_0 + u_{0i}$ and $\beta_{1i} = \beta_1 + u_{1i}$, such that $E(\beta_{0i}) = \beta_0$ and $E(\beta_{1i}) = \beta_1$. Biases in the OLS estimator for schooling arise when the unobserved heterogeneity $(u_{0i}, u_{1i})$ is correlated with schooling $S_i$. Following Card (1999, 2001), let

$$
\begin{aligned}
u_{0i} &= \beta_{0i} - \beta_0 = \lambda(S_i - \mu_S) + \nu_{0i} \\
u_{1i} &= \beta_{1i} - \beta_1 = \psi(S_i - \mu_S) + \nu_{1i},
\end{aligned}
$$

wherev0i andv1i are independent of each other and ofεi . We also assume that

E (ν0i |Si ) = E (ν1i |Si ) = 0, andµS = E (Si ). Now

λ = cov(β0i , Si )

var(Si )and ψ = cov(β1i , Si )

var(Si ),


andyi = β0+(β1+λ−ψµS)Si+ψS2i +ν0i+ν1i Si+εi , whereβ0 = β0−λµS.

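It can be shown that (see also Card, 1999, 2001)²

$$\begin{aligned}
\hat{\beta}_1^{OLS} &= \frac{\sum_{i=1}^n (S_i - \bar{S})(y_i - \bar{y})}{\sum_{i=1}^n (S_i - \bar{S})^2} = \beta_1 + \lambda - \psi\mu_S \\
&\quad + \psi\,\frac{\sum_{i=1}^n S_i^3 - n\bar{S}\,(S \circ S)}{\sum_{i=1}^n (S_i - \bar{S})^2} + \frac{\sum_{i=1}^n \nu_{1i} S_i^2 - n\bar{S}\,(S \circ \nu_1)}{\sum_{i=1}^n (S_i - \bar{S})^2} \\
&\quad + \frac{\sum_{i=1}^n S_i(\nu_{0i} + \varepsilon_i) - n\bar{S}\,(\bar{\nu}_0 + \bar{\varepsilon})}{\sum_{i=1}^n (S_i - \bar{S})^2},
\end{aligned}$$

and by using the weak law of large numbers (multiply numerators and denominators by $1/n$)³ and Slutsky's theorem (Ferguson, 1996)

$$\operatorname{plim}\,\hat{\beta}_1^{OLS} = \beta_1 + \lambda - \psi\mu_S + \psi\,\operatorname{plim}\,\frac{\frac{1}{n}\sum_{i=1}^n S_i^3 - \bar{S}\,(S \circ S)}{\frac{1}{n}\sum_{i=1}^n (S_i - \bar{S})^2}.$$

The latter fraction is equal to the regression coefficient of $S_i^2$ on $S_i$, which has probability limit $2\mu_S$ (it is assumed that the distribution of $S_i$ is symmetric)⁴. Hence,

$$\operatorname{plim}\,\hat{\beta}_1^{OLS} = \beta_1 + \lambda + \psi\mu_S. \tag{5.3}$$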
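Relation (5.3) can be checked by simulation. The sketch below generates data from model (5.2) with a normally distributed (hence symmetric) schooling variable and compares the OLS slope with $\beta_1 + \lambda + \psi\mu_S$. All parameter values, the sample size and the seed are hypothetical choices made for illustration only.

```python
import random

random.seed(12345)

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1 = 1.0, 0.10
lam, psi = 0.02, 0.005
mu_s, sd_s = 12.0, 2.0   # S is normal, hence symmetric around mu_s
n = 100_000

s_vals, y_vals = [], []
for _ in range(n):
    s = random.gauss(mu_s, sd_s)
    nu0 = random.gauss(0.0, 0.05)
    nu1 = random.gauss(0.0, 0.01)
    eps = random.gauss(0.0, 0.30)
    b0 = beta0 + lam * (s - mu_s) + nu0   # individual intercept beta_0i
    b1 = beta1 + psi * (s - mu_s) + nu1   # individual slope beta_1i
    s_vals.append(s)
    y_vals.append(b0 + b1 * s + eps)

# OLS slope of y on S (simple regression with intercept).
s_bar = sum(s_vals) / n
y_bar = sum(y_vals) / n
sxy = sum((s - s_bar) * (y - y_bar) for s, y in zip(s_vals, y_vals))
sxx = sum((s - s_bar) ** 2 for s in s_vals)
ols_slope = sxy / sxx

predicted = beta1 + lam + psi * mu_s      # right-hand side of (5.3)
print(f"OLS slope: {ols_slope:.4f}, predicted by (5.3): {predicted:.4f}")
```

With these values the average return is 0.10, while the OLS slope settles near 0.18, so both the intercept heterogeneity ($\lambda$) and the slope heterogeneity ($\psi\mu_S$) contribute to the inconsistency.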

This relation generalizes the conventional analysis of ability bias (Griliches, 1977). If $\beta_{1i} = \beta_1$ for all $i$, i.e. there is no heterogeneity in the schooling coefficient, then $\psi = 0$ and the inconsistency of the OLS estimate for $\beta_1$ is equal to $\lambda$, which can be interpreted as the conventional ability bias. If both intercept and slope vary across individuals, i.e. $\lambda \neq 0$ and $\psi \neq 0$, the OLS estimator for $\beta_1$ may be biased in another way. According to Card (1999), people with higher returns to education tend to acquire more schooling ($\psi > 0$), and hence a cross-sectional regression of earnings on schooling yields an upward-biased estimate of the average marginal return to schooling $\beta_1$. Verbeek (2000), on the other hand, argues that the individual-specific returns to schooling are potentially higher for individuals with low levels of schooling ($\psi < 0$), and hence a downward bias in the OLS estimate for $\beta_1$ is to be expected. For now, we do not favor one interpretation over the other, but emphasize that in both situations the OLS estimator is potentially biased, and that the bias may be either upward or downward⁵.

² We write $X \circ Y = (1/n)\sum_{i=1}^n x_i y_i$, i.e. $X \circ Y$ is the sample mean of the products $x_i y_i$.

³ $E(\nu_{1i} \mid S_i) = 0$ implies that $E(\nu_{1i} S_i^2) = E[E(\nu_{1i} S_i^2 \mid S_i)] = 0$.

⁴ It can be shown that $\operatorname{cov}(X, X^2) = EX^3 - \mu_X\sigma_X^2 - \mu_X^3$ and $E(X - \mu_X)^3 = EX^3 - 3\mu_X EX^2 + 2\mu_X^3 = EX^3 - \mu_X^3 - 3\mu_X\sigma_X^2$, as $EX^2 = \sigma_X^2 + \mu_X^2$. If $X$ is symmetric, $E(X - \mu_X)^3 = 0$, and it follows that $\operatorname{cov}(X, X^2) = 2\mu_X\sigma_X^2$.
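The identity in footnote 4, which underlies the probability limit $2\mu_S$ used to obtain (5.3), can be verified exactly for a small symmetric distribution. The three-point distribution below is a hypothetical example; exact rational arithmetic avoids rounding error.

```python
from fractions import Fraction

# Hypothetical symmetric distribution: support points and probabilities.
support = [10, 12, 14]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

def expect(power):
    """Return E[X**power] under the discrete distribution above."""
    return sum(p * x ** power for x, p in zip(support, probs))

mu = expect(1)
var = expect(2) - mu ** 2
cov_x_x2 = expect(3) - mu * expect(2)   # cov(X, X^2) = E X^3 - mu * E X^2

print(cov_x_x2, 2 * mu * var)   # both sides of the identity in footnote 4
print(cov_x_x2 / var, 2 * mu)   # slope of X^2 on X versus 2 * mu
```

Moving probability mass so that the distribution becomes skewed breaks the equality, which is why the symmetry assumption is needed.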

5.2.4 Optimizing behavior bias

Finally, a fourth source of possible bias of the OLS estimator in the schooling

equation is due to the optimizing behavior of individuals. This is discussed

to some extent in Griliches (1977) and Card (1999, 2001). Schooling can be

regarded as the result of optimizing behavior of individuals or households.

Individuals try to reach an optimal schooling decision by maximizing ‘wealth’

or ‘utility’ based on anticipated earnings, which depend on schooling, ability,

other unknown factors, and certain (opportunity) costs of schooling (depending

on for instance interest rates and tuition fees). Garen (1984) views this as a

self-selection problem with a continuous choice variable. Uusitalo (1999) and

Harmon and Walker (1995) argue that this behavior could induce a positive

bias in the OLS estimator. Griliches (1977) shows that it might as well lead

to a downward bias. A more extensive discussion is beyond the scope of this

manuscript and we refer to the aforementioned works.

5.3 IV estimation of the returns to education

Given the divergent and a priori unknown sources of potential regressor-error dependencies in estimating the return to education, it is not an easy task to find appropriate instruments that alleviate regressor-error dependencies in model (5.1). Card (1999, 2001) gives an overview of recent studies that use instrumental variables to estimate the return to schooling. He distinguishes two sets of instrumental variables that are commonly used: (1) those that are based on institutional features of the school system and (2) those that are based on family background characteristics, both of which are discussed next.

⁵ Card (1999, 2001) shows for model (5.2) that presence of measurement error in $S_i$ biases the OLS estimate for $\beta_1$ towards zero, implying that a relatively small amount of measurement error may (partly) offset a modest upward ability bias, depending on the magnitudes of $\lambda$ and $\psi\mu_S$. This can also be seen from the example in subsection 5.2.2.

5.3.1 Institutional features of the schooling system

When instrumental variables based on institutional features are used, the re-

sulting IV estimates are approximately 30% higher than the corresponding

OLS results. This finding does not agree with current beliefs in the litera-

ture about the traditional ability bias. Card (1999, 2001) provides four expla-

nations. Firstly, instruments based on institutional features of the schooling

system may not be truly exogenous, since a direct effect of the instruments on

earnings may exist. For instance, ‘college proximity’ is sometimes used as an

instrument (see e.g. section 5.4), but it may have a direct effect on earnings,

since families that place a strong emphasis on education may choose to live

near a college, while their children may have higher abilities and/or motiva-

tion to achieve labor market success (cf. Verbeek, 2000). Bound and Jaeger

(1996) argue that the quarter-of-birth dummy instruments used in Angrist and

Krueger’s (1991) study may have a direct association with the dependent vari-

able, as their results suggest that the association between quarter-of-birth and

earnings is too strong to be fully explained by school attendance laws. Card

(1999)⁶ argues that IV estimates based on schooling reforms (treatment), such as changes in compulsory school attendance laws, are biased further upward compared to OLS because of unobserved differences between the characteristics of the treated and non-treated groups, since these reform treatments are often not random. Bound, Jaeger and Baker (1995) argue that Angrist and

Krueger use a large number of weak instruments and show that, in finite sam-

ples, IV estimates based on weak instruments are biased in the same direction

as OLS.

⁶ Par. 3.4 (pp. 1819-1822) and p. 1841.


Secondly, the downward bias in OLS can be a result of error in the measure-

ment of education. However, the strength of this effect is doubtful in view of

Card (1999) who argues that it is unlikely that measurement error alone can

account for the large positive gap between the IV and OLS estimates.

Thirdly, factors like compulsory schooling or schooling availability are most

likely to affect individuals who otherwise would have had relatively low school-

ing. If, because of potential heterogeneity, these individuals have higher than

average marginal returns to schooling, then instruments based on these vari-

ables tend to recover the returns to education for a subset of individuals with

relatively high returns to education, resulting in estimates higher than OLS.

Uusitalo (1999) notes in this respect that presence of heterogeneity in the coefficient of the returns to education yields an additional error term $\nu_{1i} S_i$. Since the instrument $Z_i$ is correlated with $S_i$, it cannot be uncorrelated with the error term of the wage equation.

5.3.2 Family background

The second type of instrumental variables commonly used are instruments based on family background characteristics, for instance measures of the education levels of family members. The use of these variables as instruments is motivated by the fact that children's education tends to exhibit a high correlation with parents' education. However, Card (1999) shows that when the OLS estimator is biased upward because of unobserved ability, the bias in the IV estimator is at least as large, and potentially larger, depending on the strength of the instruments and their possible direct effect on the dependent variable. Hence,

if the OLS estimator is biased upward, one would expect that an IV estimator

based on family background is biased upward even more. For a more detailed

discussion we refer to Card (1999).

5.3.3 Alternative, non-IV approaches

Although using instrumental variables is a common way to solve for omitted ‘ability’ bias, some other methods are available (see e.g. Harmon and Walker, 1995). If good measures for ‘ability’ are available, they can be included as proxy variables in the wage equation (5.1). Then, if no other sources

of regressor-error dependencies or measurement error problems remain, stan-

dard OLS techniques can be applied to estimate the return-to-schooling effect.

This approach, however, is doubtful. Empirical results using this approach

suggest an upward ‘ability’ bias in least-squares estimates (see Blackburn and

Neumark, 1993, Wooldridge, 2002). Griliches (1977), however, argues that it

is difficult to find a good measure for ability. Often proxies for unobserved

ability are measures of IQ. If, however, unobserved ability has no relation with

IQ, but is instead related to, say, ‘motivation’, any proxy for ability based on

IQ induces large measurement error biases in the OLS estimator, while unob-

served ability may still not be accounted for (see also the discussion in section

5.2.2).

A particularly powerful approach to address regressor-error dependencies in schooling models is to use data on twins (or siblings) (Card, 1999). This approach

attempts to eliminate possible omitted variable biases by assuming that some

of the unobserved factors (e.g. ability or motivation) are identical within fam-

ilies (or twin/sibling pairs). In this case, differences of levels of schooling and

education for the twins or siblings can be exploited to estimate the effect of

education on wage. Card (1999) gives an overview of several studies that use

twin-data. He concludes that under the assumption that identical twins have

identical abilities, the within-family estimator gives a consistent estimate for

the average marginal returns to schooling. Furthermore, this estimator can be

corrected for measurement error. Card (1999) concludes from his survey that the OLS estimator exhibits a slight upward bias of the order of 10%-15%.
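The logic of the within-family estimator can be illustrated with simulated twin pairs. The sketch below assumes, hypothetically, that ability is identical within a pair, raises schooling, and has a direct effect on log wages; it is not an analysis of any of the twin datasets surveyed by Card (1999), and all numerical values are invented for illustration.

```python
import random

random.seed(2004)

true_return = 0.08      # hypothetical return to schooling
ability_effect = 0.05   # hypothetical direct effect of ability on log wage
n_pairs = 20_000

pooled_s, pooled_y = [], []
diff_s, diff_y = [], []
for _pair in range(n_pairs):
    ability = random.gauss(0.0, 1.0)   # shared by both twins
    pair = []
    for _twin in range(2):
        s = 12.0 + ability + random.gauss(0.0, 1.0)
        y = (1.0 + true_return * s + ability_effect * ability
             + random.gauss(0.0, 0.3))
        pair.append((s, y))
        pooled_s.append(s)
        pooled_y.append(y)
    # Within-pair differences remove the shared ability term.
    diff_s.append(pair[0][0] - pair[1][0])
    diff_y.append(pair[0][1] - pair[1][1])

def ols_slope(x, y):
    """Slope of a simple regression of y on x with an intercept."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

pooled = ols_slope(pooled_s, pooled_y)
# Regression through the origin on the within-pair differences.
within = (sum(ds * dy for ds, dy in zip(diff_s, diff_y))
          / sum(ds ** 2 for ds in diff_s))

print(f"pooled OLS: {pooled:.3f}, within-pair estimate: {within:.3f}")
```

In this design the pooled OLS slope converges to 0.105, an upward ability bias, while the within-pair estimate recovers the assumed return of 0.08. If ability differed within pairs, the differencing step would no longer remove it.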

A drawback of these methods is the (possible) lack of generalization to non-

twins and the potential failure of the identical abilities assumptions for iden-

tical twins and siblings. If the assumption does not hold, twin studies might

still overestimate to some extent the effect of education on earnings. In a re-

cent study, Hertz (2003) also finds that the OLS results are biased upward. His

results are based on various measurement–error corrected, within–family esti-

mators for South–African households.


The review of the literature on estimating the return to education in this and

the previous sections demonstrates that IV estimation has produced a less than

satisfactory solution to the endogeneity problem of the schooling effect. In the

next section we present the LIV estimates for model (5.1) for three empirical

applications and compare these estimates with classical IV (2SLS). We show

that the LIV results are more stable across the three datasets and are more in

line with recent evidence from twin studies. In addition, the LIV model fits the

data fairly well, based on the diagnostics in section 4.5. We argue that the LIV

estimates are to be preferred to the classical IV results in these applications.

5.4 Empirical results

In this section we present the results of three applications to examine the effect of education on income. Each of these three applications is based on previously published data. First we briefly describe the three datasets, where

a more detailed description can be found in appendix 5A. We then estimate

model (5.1) with latent instrumental variables and compare these results with

the traditional IV and OLS estimates. Furthermore, we investigate the avail-

able observed instruments thoroughly, and conclude that the LIV results are to

be preferred over IV and OLS.

5.4.1 Data description

NLSY data

The first dataset is a sample of 3010 men taken from the US National Lon-

gitudinal Survey of Young Men (NLSY) from 1976. This dataset is analyzed

in Card (1995) and Verbeek (2000). The dataset contains several exogenous

variables and one dummy instrumental variable measuring the presence of a

nearby college, i.e. an instrument based on institutional features of the school system.


Brabant data

The second dataset was originally sampled in 1952 from the Dutch province

‘Noord-Brabant’. Thirty years later the same individuals were contacted to

collect data on, among other things, educational level, income, and social back-

ground statistics. The labor market information used here is from 1983, and

the dataset used contains observations on 833 men who had reached a stable

labor market position. As with the NLSY dataset, several exogenous explana-

tory variables are available. We have two instrumental variables: measures on

the educational level of the respondents’ father and mother, i.e. family back-

ground characteristics (see also Hartog, 1988, for a more detailed description

of the data).

PSID data

The third dataset contains data on 424 working, married white women be-

tween the ages 30 and 60 in 1975, and comes from the University of Michigan

Panel Study of Income Dynamics (PSID), analyzed in Wooldridge (2002) and

Mroz (1987). The labor market information is from 1975. This dataset has

several exogenous variables. The available instruments are family background variables: the respondents' fathers' and mothers' levels of education and the husbands' level of education. For more details on the datasets and the regressors and instruments used, we refer to appendix 5A.

The three datasets differ on various key aspects (sample sizes, region, sex of respondents, year of labor market information), which makes a direct comparison of the estimated regression coefficients not meaningful. However, we compute the relative bias in OLS with respect to the LIV and IV estimates, which, as will become clear, allows for a straightforward comparison of the results across the three applications. The application of LIV with its assumption of discrete levels of the latent variable may well correspond to the existence of discrete levels of schooling, underlying the measured education variables, that are free of measurement error and that represent the levels of education that one would obtain regardless of ability, but is not predicated on that. Alternatively, as one reviewer of this study pointed out, LIV can be interpreted as identifying ‘latent twins’ and using a conceptual analogue of the twin estimator.

5.4.2 LIV results for schooling

We analyze all datasets with the LIV model and methods presented in chapters 3 and 4. We estimate the LIV model for m = 2, ..., 5 and with the inclusion of extra exogenous variables. We emphasize that the LIV model does not require the availability of instrumental variables, and the results in this subsection are obtained without using the available observed instruments mentioned in the previous subsection. In subsection 5.4.3 these are included in the model in

order to examine their validity, using the methods in section 4.3. We also

present here the results for the standard OLS estimator, the IV estimator, and

LIV model fit diagnostics, but postpone a detailed discussion of the IV results

until subsection 5.4.3.

Estimated coefficients

Table 5.1: Results of OLS, IV and LIV for the schooling coefficient for the three datasets. LIVx means that the LIV model is estimated with m = x categories.

βS        OLS       IV        LIV2      LIV3      LIV4      LIV5
NLSY      0.074     0.133     0.050     0.065     0.068     0.069
          (0.0035)  (0.0518)  (0.0099)  (0.0041)  (0.0040)  (0.0040)
Brabant   0.043     0.056     0.040     0.042     0.040     –
          (0.0044)  (0.0075)  (0.0051)  (0.0049)  (0.0049)
PSID      0.102     0.073     0.134     0.099     0.099     0.096
          (0.0139)  (0.0321)  (0.0282)  (0.0160)  (0.0153)  (0.0142)

In table 5.1 we present the results for the estimated schooling coefficients for the datasets using OLS, IV, and LIV. It can be seen that for all but one of the specifications for m in the LIV model (denoted by LIVx), the resulting estimate for the schooling coefficient is below the OLS estimate, indicating a small upward bias in the OLS estimate. On the other hand, the direction of the bias for the IV results using the observed instruments is not unanimous, and we discuss this in more detail in subsection 5.4.3. The only downward bias found by LIV is for the PSID data when m = 2. This can be expected if a dummy variable exists which is identical or nearly identical to the unobserved instrument. In this case, there is a situation of (almost) perfect multicollinearity in the second stage of the LIV model and the parameters are only nearly identified. This also explains why the results for m = 2 have larger standard deviations than what could have been expected and why relatively large improvements in model fit occur for m > 2. In these applications several dummy regressors are present (see appendix 5A). In addition, the PSID dataset is the smallest we have and the likelihood surface may be less smooth in this case. For the Brabant data the maximized value of the likelihood is degenerate at m = 5 and no estimate for LIV5 is given in table 5.1. We also found degenerate solutions for m > 5 for the PSID data. Here the LIV method indicates that the number of instruments (number of categories) should not be too large. Overall it can be seen that the LIV results are fairly stable for different choices of m and we consider choosing the ‘best’ m next.

Choosing the number of categories of the latent instrument

As argued in the previous chapter, we choose among the different values for m by looking at the ICL criterion, and for comparison and validity we also present AIC3 and BIC in table 5.2. For the NLSY data the ICL statistic yields a minimum at m = 4 and AIC3 and BIC at m = 5. For the Brabant dataset ICL yields m = 2 and AIC3/BIC m = 4. All three measures give m = 5 for the PSID data. In accordance with recent evidence on the performance of the ‘classical’ selection criteria, we also find in two of the three cases that AIC-based statistics point to a larger number of categories for the discrete instrument. Importantly, it can be seen from table 5.1 and from results in appendix 5B that the estimated regression coefficients and the estimates for the schooling equation are not very different for the optimal values of m as indicated by ICL and AIC3/BIC. As we will show, this result also holds for testing for (absence of) endogeneity. In the following we will only consider the LIV results for the optimal number of categories for the latent instrument, i.e. m = 4 for NLSY, m = 2 for Brabant, and m = 5 for PSID.

Table 5.2: Computed values for BIC, AIC3 and ICL. A dagger (†) marks the minimum value (row-wise), and ∗ denotes a degenerate solution.

                  m = 2       m = 3       m = 4       m = 5
NLSY      BIC     5832.06     5404.04     5309.55     5291.73†
          AIC3    5751.91     5313.86     5209.36     5181.52†
          ICL     6942.75     5703.59     5611.37†    5995.09
Brabant   BIC     1867.02     1837.67     1835.73†    1849.18∗
          AIC3    1799.967    1763.174    1753.774†   1759.774∗
          ICL     1931.07†    1974.67     1990.93     2004.44∗
PSID      BIC     1164.49     1023.99     1005.42     905.26†
          AIC3    1103.498    956.894     932.227     825.97†
          ICL     1199.23     1042.04     1020.93     914.55†
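The selection rule amounts to taking, per dataset and criterion, the value of m with the smallest criterion value among the non-degenerate solutions. A minimal sketch using the values of table 5.2 follows; the data structure and the function name `best_m` are ours.

```python
# Criterion values copied from table 5.2, indexed by dataset, criterion and m.
criteria = {
    "NLSY": {
        "BIC":  {2: 5832.06, 3: 5404.04, 4: 5309.55, 5: 5291.73},
        "AIC3": {2: 5751.91, 3: 5313.86, 4: 5209.36, 5: 5181.52},
        "ICL":  {2: 6942.75, 3: 5703.59, 4: 5611.37, 5: 5995.09},
    },
    "Brabant": {
        "BIC":  {2: 1867.02, 3: 1837.67, 4: 1835.73, 5: 1849.18},
        "AIC3": {2: 1799.967, 3: 1763.174, 4: 1753.774, 5: 1759.774},
        "ICL":  {2: 1931.07, 3: 1974.67, 4: 1990.93, 5: 2004.44},
    },
    "PSID": {
        "BIC":  {2: 1164.49, 3: 1023.99, 4: 1005.42, 5: 905.26},
        "AIC3": {2: 1103.498, 3: 956.894, 4: 932.227, 5: 825.97},
        "ICL":  {2: 1199.23, 3: 1042.04, 4: 1020.93, 5: 914.55},
    },
}
# Solutions flagged as degenerate in table 5.2 are excluded from the search.
degenerate = {("Brabant", 5)}

def best_m(dataset, criterion):
    """Return the number of categories m that minimizes the criterion."""
    values = criteria[dataset][criterion]
    admissible = {m: v for m, v in values.items()
                  if (dataset, m) not in degenerate}
    return min(admissible, key=admissible.get)

for dataset in criteria:
    choice = {crit: best_m(dataset, crit) for crit in ("ICL", "AIC3", "BIC")}
    print(dataset, choice)
```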

Testing for endogeneity

Table 5.3 shows the results for the relative bias⁷ in the estimated regression coefficient for schooling with respect to OLS for the IV and optimal LIV results. Furthermore, the test results for testing for absence of endogeneity are presented. We present the results for IV (2SLS) as well, but discuss the IV estimates and the instruments used in more detail later on. The test statistics for LIV are computed without using the observed instrumental variables. The Hausman test is based on comparing the complete vectors $\hat{\beta}^{OLS}$ and $\hat{\beta}^{LIV}$ (and $\hat{\beta}^{IV}$) (see appendix 5B for the estimates of the complete vector of regression coefficients). The Wald test examines the covariance between the error terms of the main regression equation and the equation for the endogenous schooling variable (see section 4.6). This test cannot be computed for 2SLS.

Overall, we see that the differences between LIV and OLS are not large, which is also indicated by the Hausman and Wald tests (presented in the last two columns of table 5.3)⁸. Both tests give similar conclusions. The optimal LIV solutions for the NLSY data and the PSID data indicate a significant upward bias in OLS, but for the Brabant data the estimated value for $\beta_1$ by LIV (for m = 2) is not significantly different from OLS. Here, the classical IV estimator indicates a significant downward bias in OLS.

Table 5.3: Relative biases with respect to OLS and results for Hausman and Wald tests to test for endogeneity (based on Hessian).

Data      Estimator         %Δ       H-test    W-test
NLSY      IV                -79.9    1.31      –
          Opt. LIV m = 4    7.9      9.28      9.31
          Opt. LIV m = 5    6.5      7.18      7.19
Brabant   IV                -30.1    4.35      –
          Opt. LIV m = 2    7.0      0.97      0.97
          Opt. LIV m = 4    7.0      2.63      2.63
PSID      IV                27.8     0.95      –
          Opt. LIV m = 5    5.5      4.20      4.53

Before discussing the classical IV results in more detail, we first examine various diagnostics for the LIV estimates presented above, where we only report the results for the LIV model indicated by the (preferred) ICL criterion and report, in case of the NLSY data and the Brabant data, the results for the model selected by AIC3 only when these are substantially different. We note that for the Brabant data the $R^2$ measure for the strength of the LIV instrument, as discussed in subsection 4.5.2, is substantially better for m = 4 than for m = 2, which is discussed in more detail in subsection 5.4.3.

⁷ We computed this percentage as $100 \times (1 - \hat{\beta}_1^{LIV}/\hat{\beta}_1^{OLS})$ and $100 \times (1 - \hat{\beta}_1^{IV}/\hat{\beta}_1^{OLS})$.

⁸ The critical 5%-value of a $\chi^2_1$ distribution is 3.84.
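To give a feel for the magnitude of the Hausman statistics, the sketch below computes a scalar version for the schooling coefficient alone, using the rounded estimates and standard deviations of table 5.1. This is a simplification we introduce for illustration: the statistics in table 5.3 compare the complete coefficient vectors, so the numbers need not coincide exactly.

```python
def scalar_hausman(b_alt, se_alt, b_ols, se_ols):
    """Scalar Hausman-type statistic for a single coefficient.

    (b_alt - b_ols)^2 / (se_alt^2 - se_ols^2), where OLS is the estimator
    that is efficient under the null hypothesis of no endogeneity.
    """
    return (b_alt - b_ols) ** 2 / (se_alt ** 2 - se_ols ** 2)

# Rounded schooling coefficients and standard deviations from table 5.1.
h_nlsy_iv = scalar_hausman(0.133, 0.0518, 0.074, 0.0035)
h_nlsy_liv4 = scalar_hausman(0.068, 0.0040, 0.074, 0.0035)

critical_5pct = 3.84   # 5% critical value of chi-square(1), see footnote 8
print(f"NLSY, IV:   {h_nlsy_iv:.2f}")
print(f"NLSY, LIV4: {h_nlsy_liv4:.2f}")
print("5% critical value:", critical_5pct)
```

With these inputs the IV statistic stays well below the critical value although the IV estimate is far from OLS, because its standard deviation is large, whereas the small LIV-OLS difference is significant because the LIV estimate is precise.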


Diagnostics: outliers, influential observations, normality and heteroscedasticity

We examined the various diagnostics presented in section 4.5 to investigate the fit of the (optimal) LIV model and to identify potential outliers

and influential observations. Residual plots for the three datasets are given in

figure 5.1.

For the NLSY data (n = 3010), residual checks did not reveal heteroscedastic-

ity, and residuals had a skewness of -0.28 and a kurtosis of 3.5. All standard-

ized residuals were smaller than (in absolute value) 4.5. Examining the outliers

and influential observations diagnostics did not identify highly unusual data.

For the PSID data (n = 424) there is evidence of weak heteroscedasticity for the variable ‘experience’, but this effect is rather small. The residuals are slightly skewed (-0.26) and are leptokurtic⁹ (kurtosis is 5.1). One observation was identified as an influential observation, but no outliers are present. When this observation is removed, results and conclusions do not change, and all standardized residuals are smaller (in absolute value) than 4.

As for the PSID data, the results for the Brabant data (n = 833) indicate slight evidence of weak heteroscedasticity, here for the dummy variable indicating whether the father is self-employed at the age of 12. For this dataset, the residuals are more skewed (skewness is -1.25) and more leptokurtic (kurtosis is 12.7). However, examination for potential outliers and influential data identifies three observations that clearly do not ‘fit’ the rest of the data. We re-estimated the model without these three observations, and found that the estimates and test statistics are not strongly affected. The Hausman and Wald statistics for the m = 4 solution, however, now become 3.47 and 3.48, respectively, which are both significant at $\alpha = 0.10$. After omission of these outlying data, the residuals are less skewed and leptokurtic. All but four of the absolute values of the standardized residuals are smaller than 4.5, with a maximum of 5.9. We note that the measures for skewness and kurtosis are based on higher order moments, which are known to be sensitive to outliers.

⁹ A distribution with a high peak is called leptokurtic (Weisstein, 2004b).

Figure 5.1: Residuals.
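The sensitivity of moment-based skewness and kurtosis to a single outlying residual can be illustrated as follows. The residual values are hypothetical and are not taken from any of the three datasets; the measures are the usual standardized third and fourth central moments.

```python
def skewness_kurtosis(values):
    """Return moment-based skewness and kurtosis (normal distribution: 0 and 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Hypothetical standardized residuals, symmetric around zero.
residuals = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
skew_clean, kurt_clean = skewness_kurtosis(residuals)

# The same residuals with one large negative outlier added.
skew_out, kurt_out = skewness_kurtosis(residuals + [-6.0])

print(f"without outlier: skewness {skew_clean:.2f}, kurtosis {kurt_clean:.2f}")
print(f"with outlier:    skewness {skew_out:.2f}, kurtosis {kurt_out:.2f}")
```

A single extreme observation is enough to produce a clearly negative skewness and a higher kurtosis, which is consistent with the improvement observed after removing the three outlying Brabant observations.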

5.4.3 Relative biases and comparison with classical IV

From the relative percentage bias in OLS with respect to the optimal LIV and IV estimates in table 5.3 (the column indicated by %Δ), it can be seen that the LIV method reveals an upward bias of OLS ranging from 5.5% to 8%. When traditional IV is used, the conclusions are very different for the three studies, ranging from an ≈ 80% downward bias to an ≈ 30% upward bias in OLS.

For the NLSY data, the IV estimate for the return to schooling, based on a dummy for college proximity, is about 80% higher than OLS and equal to (approximately) 0.13 (0.052). For the Brabant data, we find that the IV estimate is 0.056 (0.008), which is also substantially higher (≈ 30%) than OLS. Here the instruments are the levels of education of the respondents' parents. Using similar instrumental variables, we find for the PSID data an upward bias of ≈ 30% in the OLS estimate. It can be seen that in all cases, the IV estimate has a standard deviation that is substantially higher than OLS. The instability of the 2SLS results and the high standard deviations may be a result of weak and/or endogenous instruments, which is investigated in the next subsection.
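The relative biases can be recomputed from the rounded coefficients in table 5.1 with the formula of footnote 7. Because the inputs are rounded to three decimals, the output differs slightly from the %Δ column of table 5.3, which was presumably computed from unrounded estimates. The dictionary layout and function name are ours.

```python
def relative_bias(b_alt, b_ols):
    """Percentage bias of OLS relative to an alternative estimate (footnote 7).

    Positive values indicate an upward bias in OLS, negative values a
    downward bias.
    """
    return 100.0 * (1.0 - b_alt / b_ols)

# dataset: (OLS, IV, optimal LIV), rounded values from table 5.1.
table_5_1 = {
    "NLSY": (0.074, 0.133, 0.068),      # LIV with m = 4
    "Brabant": (0.043, 0.056, 0.040),   # LIV with m = 2
    "PSID": (0.102, 0.073, 0.096),      # LIV with m = 5
}

for name, (ols, iv, liv) in table_5_1.items():
    print(f"{name:8s} IV: {relative_bias(iv, ols):6.1f}%   "
          f"LIV: {relative_bias(liv, ols):5.1f}%")
```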

For the Brabant data and the PSID data we had more instruments available than necessary for identification, allowing us to examine the sensitivity of IV to different choices for the set of instruments¹⁰. When only mothers' education is used as IV in the Brabant data, i.e. the number of instruments is decreased, the estimated coefficient for the return to education becomes 0.059 (0.009), which is slightly higher than the estimate obtained from using the full set of instruments. Similarly, in the PSID data we also have husband's education as an additional covariate. When this variable is included in the set of instruments, i.e., the number of instruments is increased, the IV estimate becomes 0.065 (0.023). This value is about 11% lower than the IV results for the smaller set of instruments. These results indicate that, in particular for the PSID data, the 2SLS estimates may be sensitive to different choices for the set of instruments. In the following we examine the strength of the available observed instruments.

¹⁰ This can be seen as changing the number of categories of the latent instrument in the LIV model.
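Why weak instruments can make IV estimates unstable is easy to see in a small Monte Carlo experiment. The sketch below uses a hypothetical data-generating process with one endogenous regressor and one instrument, and compares the interquartile range of the simple IV estimator for a strong and a weak first stage. It is not a replication of the simulations in chapter 4; all parameter values are invented.

```python
import random
import statistics

random.seed(7)

def iv_estimate(n, first_stage, true_beta=0.1):
    """Simple IV estimate cov(z, y) / cov(z, x) for one simulated sample."""
    z, x, y = [], [], []
    for _ in range(n):
        zi = random.gauss(0.0, 1.0)
        u = random.gauss(0.0, 1.0)             # first-stage error
        e = 0.5 * u + random.gauss(0.0, 1.0)   # correlated with u: endogeneity
        xi = first_stage * zi + u
        z.append(zi)
        x.append(xi)
        y.append(true_beta * xi + e)
    z_bar, x_bar, y_bar = (sum(v) / n for v in (z, x, y))
    szy = sum((a - z_bar) * (b - y_bar) for a, b in zip(z, y))
    szx = sum((a - z_bar) * (b - x_bar) for a, b in zip(z, x))
    return szy / szx

def spread(first_stage, reps=200, n=500):
    """Interquartile range of the IV estimates over repeated samples."""
    draws = [iv_estimate(n, first_stage) for _ in range(reps)]
    quartiles = statistics.quantiles(draws, n=4)
    return quartiles[2] - quartiles[0]

spread_strong = spread(first_stage=1.0)
spread_weak = spread(first_stage=0.05)
print(f"IQR with strong instrument: {spread_strong:.3f}")
print(f"IQR with weak instrument:   {spread_weak:.3f}")
```

The interquartile range is used instead of the standard deviation because the IV estimator with a very weak instrument has heavy tails, so that sample standard deviations are dominated by a few extreme draws.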

Strength of the available observed instruments

Table 5.4: Results strength of observed versus predicted LIV instruments. Instruments NLSY: ‘Nearc’, Brabant: ‘FatherEd’ and ‘MotherEd’, respectively, and PSID: ‘husbanded’, ‘mothered’, and ‘fathered’, respectively (based on Hessian).

Data      Method     ΔR²      γ2z21          γ2z22          γ2z23          Test
NLSY      Obs IV     0.0029   0.34 (0.11)                                  –
          LIV4 IV    0.7503   0.15 (0.06)                                  6.83
          LIV5 IV    0.7976   0.16 (0.06)                                  7.85
Brabant   Obs IV     0.0922   1.08 (0.18)    1.27 (0.22)                   –
          LIV2 IV    0.3906   0.89 (0.16)    1.02 (0.19)                   104.41
          LIV4 IV    0.5247   0.61 (0.14)    0.93 (0.21)                   48.48
PSID      Obs IV     0.3225   0.34 (0.03)    0.12 (0.03)    0.10 (0.03)    –
          LIV5 IV    0.8312   0.04 (0.01)    -0.01 (0.01)   0.02 (0.01)    14.87

Table 5.4 shows the results of the diagnostics proposed in chapter 4 that examine the strength of the observed instruments. Investigating the strength of the observed instruments is important when using classical IV (2SLS) estimation. We examine the $R^2$ of the first-stage regression for the observed and optimal LIV instruments (see subsection 4.5.2), and discuss the results of the test proposed in subsection 4.3.1. Here we used all available observed instruments¹¹ and present conclusions only for the optimal m given in table 5.2.

The third column of table 5.4 reports the difference between the $R^2$ of the regression of schooling on the explanatory variables and the available observed instruments, or, in case of LIV, the optimal LIV instruments, and the $R^2$ of the regression of schooling on the exogenous explanatory variables only. Hence, a large increase in $R^2$ indicates that the instruments explain a substantial amount of the variance in the endogenous schooling variable. It can be seen that in particular for the NLSY data the observed instrument ‘Nearc’ appears to be weak. The family background instruments (Brabant and PSID data) explain a larger part of the variance in schooling, in particular for the PSID dataset. However, the increase in $R^2$ is in all cases substantially larger when using the optimal LIV instruments. It follows that the optimal LIV instruments do a much better job in explaining the variance of $x$ than the available observed instruments. These findings explain the loss of efficiency in the 2SLS estimates for the regression coefficients in table 5.1, where the IV estimated standard deviations are (0.052), (0.008), and (0.032), which are, respectively, 14.8, 1.7, and 2.3 times the OLS standard deviations. Not surprisingly, the estimated standard deviations for (the optimal) LIV estimates are only 1.14, 1.16, and 1.02 times the OLS estimated standard deviations.

The results for the Wald test of a zero effect of the observed instruments on schooling are given in the last column of table 5.4, and the reported coefficients in the columns γ2z2 are the estimated direct effects of the observed instruments on schooling. For instance, for the NLSY data it can be seen that individuals who lived near a college have slightly more education, and from the Brabant and PSID data it follows that parents' education is positively related to the years of schooling of their children. Under the null hypothesis $H_0$: γ2z2 = 0, the test statistic has approximately a $\chi^2$-distribution with degrees of freedom 1, 2, and 3, for, respectively, the NLSY, Brabant, and PSID data. It can be seen that in all cases the null hypothesis is rejected, indicating that the observed instruments have a non-zero direct effect on schooling¹². For the NLSY data, however, the P-values are 0.009 (m = 4) and 0.005 (m = 5), which provides evidence that the instrument used is considerably weak, given the remarks in subsection 4.3.1 and the substantial sample size of this dataset. For the Brabant and PSID data we also tested $H_0$ for each instrument separately. Only for the instrument ‘mothered’ in the PSID data was the null hypothesis of zero effect on schooling not rejected. Although in all cases the null hypothesis of a (joint) zero effect is rejected using $\alpha = 0.01$, the incremental $R^2$'s were not large, in particular for the NLSY data. Here the available instrument seems to be weak. However, in all cases the optimal LIV instruments were found to be substantially stronger than the available observed instruments. Their exogeneity is considered next.

¹¹ In fact, we estimate model (4.12). These estimation results are also used in the next subsection to examine the exogeneity of the observed instruments. Including the observed instruments yields the following relative biases (%Δ): for the NLSY data 7.9% (m = 4) and 6.6% (m = 5), for the Brabant data 6.6% (m = 2) and 9.4% (m = 4), and for the PSID data 9.0% (m = 5).
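The P-values quoted for the NLSY data follow from the Wald statistics in table 5.4 and the $\chi^2$-distribution with one degree of freedom. A minimal check using only the standard library is given below; the helper name `chi2_sf_df1` is ours, and it relies on the fact that a $\chi^2_1$ variable is the square of a standard normal variable.

```python
import math

def chi2_sf_df1(x):
    """Upper tail probability P(chi-square with 1 degree of freedom > x)."""
    # P(Z^2 > x) = 2 * (1 - Phi(sqrt(x))) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x / 2.0))

# Wald statistics for the NLSY instrument 'Nearc' from table 5.4.
for m, wald in [(4, 6.83), (5, 7.85)]:
    print(f"m = {m}: Wald = {wald}, P-value = {chi2_sf_df1(wald):.3f}")
```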

Examining exogeneity of available observed instruments

Table 5.5: Results endogeneity test of available observed instruments. Instruments NLSY: ‘Nearc’, Brabant: ‘FatherEd’ and ‘MotherEd’, respectively, and PSID: ‘husbanded’, ‘mothered’, and ‘fathered’, respectively (based on Hessian).

Data      Method            β2z21           β2z22           β2z23          Test
NLSY      Opt. LIV m = 4    0.022 (0.02)                                   1.810
          Opt. LIV m = 5    0.021 (0.02)                                   1.760
Brabant   Opt. LIV m = 2    0.015 (0.02)    0.055 (0.03)                   6.075
          Opt. LIV m = 4    0.017 (0.02)    0.058 (0.03)                   6.920
PSID      Opt. LIV m = 5    -0.019 (0.01)   -0.006 (0.01)   0.000 (0.01)   2.731

In table 5.5 we present the results of the Wald test for testing whether the coefficient of the direct effect of the observed instruments on the dependent variable (wage) is zero. A nonzero effect would violate the exogeneity assumption of the instrument. We emphasize that the test in the previous subsection (table 5.4) tested whether the effect of the instruments on the endogenous regressor was zero.

¹² The 2SLS model is not identified under the null hypothesis.

The estimated coefficients are reported in the columns indicated by $\beta_{2z_2}$ and the values of the test-statistic are given in the last column. The estimated coefficients $\beta_{2z_2}$ are the direct effects of the instruments on the dependent variable wage. The degrees of freedom of the $\chi^2$ null-distribution are, as before, 1, 2, and 3, for, respectively, the NLSY, Brabant, and PSID data. For the Brabant data the null hypothesis of no (joint) direct effect of the observed instruments on the dependent variable is rejected for $\alpha = 0.05$, suggesting that these instrumental variables are not exogenous, i.e. parental education levels have a significant positive effect on the respondent’s wage. Performance of the 2SLS estimator critically relies on the exogeneity of the used instruments and these results suggest that the 2SLS estimates for the Brabant data are to be distrusted. For the other two datasets there was no evidence of significant non-zero effects of the observed instruments on the dependent variable.
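As a rough check on these conclusions, the Wald statistics of table 5.5 can be converted into $P$-values under the stated $\chi^2$ null-distributions. The sketch below is illustrative only: the helper name `chi2_sf` and the dictionary layout are assumptions made here, and the closed-form tail probabilities cover only the 1, 2, and 3 degrees of freedom needed for the three datasets.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x), closed forms for df = 1, 2, 3."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        tail = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
        return math.erfc(math.sqrt(x / 2.0)) + tail
    raise ValueError("this sketch only implements df = 1, 2, 3")


# Wald statistics copied from table 5.5: (dataset, m) -> (statistic, df).
wald = {
    ("NLSY", 4): (1.810, 1),
    ("NLSY", 5): (1.760, 1),
    ("Brabant", 2): (6.075, 2),
    ("Brabant", 4): (6.920, 2),
    ("PSID", 5): (2.731, 3),
}

# P-value of each test under its chi-square null-distribution.
p_values = {key: chi2_sf(stat, df) for key, (stat, df) in wald.items()}

for (data, m), p in p_values.items():
    verdict = "reject" if p < 0.05 else "do not reject"
    print(f"{data:8s} m = {m}: P = {p:.3f} -> {verdict} at alpha = 0.05")
```

Only the two Brabant statistics fall below $\alpha = 0.05$ in this calculation, in line with the discussion above.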

5.4.4 Wrap-up

Our results illustrate the difficulties associated with IV estimation in these applications. The conclusions for the three datasets with respect to the magnitude and sign of the bias in the estimated OLS coefficient for schooling differ widely, even with a similar set of instruments. The results on the validity of the available observed instruments in the previous subsections may explain part of this variability.

First of all, the instrument ‘Nearc’ for the NLSY data was found to be the weakest available instrumental variable, inducing only a minor increase in the $R^2$ of the first stage regression. The simulation results in subsection 4.4.1 showed that presence of weak instruments results in large swings in the 2SLS estimates13, which may account for the large downward bias in the OLS estimate found for the NLSY data (see also Card, 1999, 2001). Accordingly, the 2SLS estimator suffers from large standard deviations in these cases. Secondly, the family background instruments used for the Brabant data were found to have a direct effect on the dependent variable, in which case the bias in the 2SLS estimates may be of the same sign and potentially larger in magnitude than the bias in OLS, which explains the downward bias in OLS found in the Brabant application when using 2SLS (see also Bound, Jaeger, and Baker, 1995). Finally, our results suggest that the instruments for the PSID data are the ‘best’ among the three applications, although evidence suggests that they are somewhat weak. The 2SLS results for the PSID data do indicate an upward bias in OLS, but the magnitude is much larger than for the LIV model.

Card (1999) argues that instruments based on family background characteristics are likely to be endogenous in the presence of omitted ability, which was supported by our findings for the Brabant data. We did not find evidence for this hypothesis for the PSID data, where the levels of education of the respondent’s husband, father, and mother are the available observed instruments. One explanation is that the power of the test is lower for the PSID data because of the smaller sample size. Furthermore, the PSID data contain labor market information on women obtained in 1976, while in that period education, income, and other labor market issues may have been less of an issue for women than for their male siblings. Hence, it can be expected that family education has a lower correlation with the respondent’s data.

These results are contrary to the optimal LIV solutions, which are more efficient than standard IV since the LIV method optimally estimates instruments from the available data. Furthermore, the best available evidence from the latest studies on identical twins suggests a small upward bias on the order of 10–15% in the OLS estimator (cf. Card, 1999), which is not supported by the standard IV estimates from the three datasets analyzed here. Our estimates have the same order of magnitude found in the twin studies but do not fully recover the 10% difference. A reason for this result might be that estimating the model by simple OLS yields in general only a modest fit (the OLS results presented in table 5.1 have R-squares of respectively 29%, 23%, and 17%), i.e. the regressors do not explain a large part of the variance in wage. The fact that LIV finds a smaller positive bias might also indicate that a part of the positive ability bias is offset by negative biases due to e.g. measurement error or heterogeneity, which is expected to be less in the twin studies14. Further, in the twin studies there may still be a limited amount of unobserved ability if the abilities of twins and siblings differ.

13 In fact, the results in table 4.1 for the weak instrument cases report median bias and IQR instead of mean bias and standard deviation because of the presence of too many outliers.

5.5 Conclusions

The studies of Card (1999, 2001) clearly indicate the difficulties associated with applying standard IV methods to estimate the returns to education. The results are often found to be counterintuitive and different across studies. Furthermore, in many instances it can be questioned whether the instruments used were ‘valid’. Unfortunately, classical instrumental variable methods do not allow for straightforward testing of the validity of a specific instrument. We show that the LIV method can be successfully applied to solve these problems. The OLS estimates are found to be biased upward by about 7%. Equally important, the available observed instruments that have been used seem to be mostly inadequate, and produce results that are both more biased than the OLS results and have much lower efficiency, see table 5.6.

The advantage of the LIV approach is that no observable instruments are needed. Furthermore, once estimates have been obtained, endogeneity can be tested for in a straightforward manner. We showed that for the different specifications of m and across three different datasets the estimates are consistent. For the NLSY and the PSID dataset we find significant evidence of an ‘ability’ bias15. Furthermore, the standard errors of the estimates are much smaller than the standard errors for standard IV, and not much larger than OLS. Because of the relatively large number of observations in the NLSY data, it is to be expected that the power of the Hausman- and Wald-test is larger. In using LIV to test for endogeneity it is recommended to use datasets of substantial size to ensure a reasonable power. We proposed several diagnostics that may complete an analysis using LIV. We do not find any evidence that the LIV models used for the three applications here do not fit the data well. Especially in view of the large sample sizes for the three datasets, the small deviations from the assumptions that were found may not pose a problem in making inferences.

14 For instance, let the true schooling effect be $\beta_S = 10$. With a +4 ability bias and a −2 measurement error bias, the estimated schooling effect by OLS is 12, resulting in an upward bias of ≈ 16% in OLS. With less measurement error, e.g. −1, the OLS estimate is 13, which yields an upward bias of ≈ 23%.

Table 5.6: Summary main conclusions LIV results effect of education on earnings.

                          NLSY        Brabant      PSID
Sample size               3010        833          424
Rel. bias OLS by LIV      7.9%        7%           5.5%
Rel. bias OLS by IV       -79.9%      -30.1%       27.8%
Test for endogeneity      +           -            +
Instrument strength       weak        moderate     moderate
Instrument exogeneity     exogenous   endogenous   exogenous
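The relative biases quoted in this chapter can be reproduced approximately from the rounded schooling coefficients in appendix 5B. The sketch below assumes that the relative bias is measured as a fraction of the OLS estimate, which is the convention implied by the arithmetic in footnote 14 (12 versus 10 gives about 16%); the function name `relative_bias` is introduced here for illustration only. Because the inputs are rounded to three decimals, the output differs slightly from the figures reported in table 5.6.

```python
def relative_bias(ols, reference):
    """Bias of the OLS estimate, in percent of the OLS estimate itself.

    Assumed convention: 100 * (OLS - reference) / OLS, where `reference`
    is the estimate taken to be free of endogeneity bias (LIV or IV).
    """
    return 100.0 * (ols - reference) / ols


# Arithmetic of footnote 14: true effect 10, OLS estimates 12 and 13.
print(f"footnote 14: {relative_bias(12.0, 10.0):.1f}% and "
      f"{relative_bias(13.0, 10.0):.1f}%")

# Rounded schooling coefficients (beta_1) from tables 5B.1, 5B.3, and 5B.5.
schooling = {
    "NLSY (m = 4)": {"ols": 0.074, "iv": 0.133, "liv": 0.068},
    "Brabant (m = 2)": {"ols": 0.043, "iv": 0.056, "liv": 0.040},
    "PSID (m = 5)": {"ols": 0.102, "iv": 0.073, "liv": 0.096},
}

for data, b in schooling.items():
    by_liv = relative_bias(b["ols"], b["liv"])
    by_iv = relative_bias(b["ols"], b["iv"])
    print(f"{data:16s} rel. bias by LIV: {by_liv:6.1f}%   by IV: {by_iv:6.1f}%")
```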

The relative size and magnitude of the bias in the OLS estimator that was found is somewhat smaller, but still close to the numbers reported in Card (1999) for the twin studies: 6–8% for all three datasets. This shows considerable convergent validity and it lends additional credibility to the LIV approach. In the next chapter we consider estimation of multilevel models in presence of endogenous regressors.

15 After deleting three outliers, the tests used also point out a significant ($\alpha = 0.10$) upward bias in OLS for the Brabant data (subsection 5.4.2).


Appendix 5A Descriptive statistics datasets used

Figure 5A.1: Histogram of ‘Schooling’.


5A.1 NLSY data

The total sample consists of 3010 men taken from the National Longitudinal Survey of Young Men (see Verbeek, 2000, and Card, 1995)16. In this survey, a group of individuals aged 14–24 years has been followed since 1966. The labour market information used is from 1976. In this year, the individuals had on average a little more than 13 years of schooling, with a maximum of 18 years (see figure 5A.1). The average working experience was about 8.86 years (in 1976 these men were aged 24–34) with an average hourly wage rate of $5.77. The variables used can be found in table 5A.1. We used the values centered around the mean for schooling in estimation.

5A.2 Brabant data

The initial dataset17 used in this paper consisted of 839 observations, but we deleted 5 observations with very low wages (log hourly wages < 0). Another observation with an extremely large reported wage was also removed (> 9 × IQR from median). These data were collected in 1983 in the Netherlands’ southern province of Noord-Brabant. At that time the average age of the men in the sample was about 43. This cohort was confronted with compulsory schooling until 12 years of age. The schooling measure used is the number of post-compulsory years of schooling; on average 4.35 years (see figure 5A.1). The average hourly wage was Dfl. 16.72 and the individuals had, on average, 25 years of work experience at the time of the survey. See table 5A.2 for more information on the variables used. As before, we centered the schooling variable around the mean.

5A.3 PSID data

As for the Brabant dataset, we removed a few observations prior to data analysis: four observations had an obviously lower (log) wage (<< −1) than the rest and were not used for estimation. These data come from the University of Michigan Panel Study of Income Dynamics (PSID)18, obtained in 1976 (also used in Mroz, 1987). The sample consists of working married white women, who were aged between 30 and 60 in 1975. They earned on average $4.18 per hour. The women reported an average of 12.7 years of schooling (see figure 5A.1) and a little over 13 years of labor market experience. For a detailed description of the variables used, see table 5A.3; for estimation, the schooling variable was mean-centered.

16 http://www.econ.kuleuven.ac.be/GME/.

17 We thank Hans van Ophem (University of Amsterdam) for making this dataset available to us.

18 http://mitpress.mit.edu/Wooldridge-EconAnalysis.


Table 5A.1: Descriptive statistics NLSY dataset (n = 3010).

Variable                  Description                                       Mean    Std.

Regressors
constant ($\beta_0$)      Model constant                                    -       -
schooling ($\beta_1$)     Years of schooling in 1976                        13.26   2.68
experience ($\beta_2$)    Potential experience                              8.86    4.14
black ($\beta_3$)         Equals 1 if black                                 0.23    0.42
smsa ($\beta_4$)          Equals 1 if lived in metropolitan area in 1976    0.71    0.45
south ($\beta_5$)         Equals 1 if lived in south in 1976                0.40    0.49

Dependent
log wage                  Logarithm of hourly wage                          6.26    0.44

Instruments
Nearc                     Grew up near a 4 year college                     0.68    0.47

Table 5A.2: Descriptive statistics Brabant dataset (n = 833).

Variable                   Description                                              Mean    Std.

Regressors
constant ($\beta_0$)       Model constant                                           -       -
schooling ($\beta_1$)      Years of schooling after age 12                          4.35    4.00
experience ($\beta_2$)     Potential experience                                     25.52   4.19
nr. children ($\beta_3$)   Number of children present at age 12                     4.91    2.68
av. mark ($\beta_4$)       Average school mark in final year of primary education   5.62    1.42
anti-social ($\beta_5$)    Equals 1 if from antisocial background                   0.10    0.29
fself ($\beta_6$)          Equals 1 if father is self employed at age 12            0.31    0.46

Instruments
Father Ed.                 Education level father                                   2.35    0.70
Mother Ed.                 Education level mother                                   2.22    0.54
                           (levels: 1–6, higher categories = higher education)


Table 5A.3: Descriptive statistics PSID dataset (n = 424).

Variable                  Description                                   Mean     Std.

Regressors
constant ($\beta_0$)      Model constant
schooling ($\beta_1$)     Years of schooling                            12.66    2.29
experience ($\beta_2$)    Actual labor market experience                13.09    8.05
kidslt6 ($\beta_3$)       Number of children younger than 6             0.14     0.39
kidsgr6 ($\beta_4$)       Number of children older than 6               1.34     1.32
unempl ($\beta_5$)        Unemployment rate in county of residence      8.54     3.04
city ($\beta_6$)          Equals 1 if lives in SMSA                     0.64     0.48
nwincome ($\beta_7$)      Family income less total income wife / 1000   18.992   10.62

Instruments
father ed.                Years of schooling father                     8.80     3.57
mother ed.                Years of schooling mother                     9.24     3.37
husband ed.               Years of schooling husband                    12.50    3.02

Appendix 5B Results optimal LIV model for the three datasets

Table 5B.1: NLSY data. Results for OLS, IV, and optimal LIV. Here $\beta_0$ is the constant, $\beta_1$ is ‘schooling’, $\beta_2$ is ‘experience’, $\beta_3$ is ‘black’, $\beta_4$ is ‘smsa’, and $\beta_5$ is ‘south’.

             $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$   $\beta_4$   $\beta_5$   $\sigma^2_\varepsilon$
OLS          6.262       0.074       0.039       -0.188      0.165       -0.129      0.142
             (0.007)     (0.004)     (0.002)     (0.018)     (0.016)     (0.015)
IV           6.262       0.133       0.040       -0.103      0.109       -0.100      0.164
             (0.007)     (0.052)     (0.002)     (0.077)     (0.051)     (0.030)
Opt. LIV     6.262       0.068       0.039       -0.197      0.170       -0.132      0.142
m = 4        (0.007)     (0.004)     (0.002)     (0.018)     (0.016)     (0.015)
Opt. LIV     6.262       0.069       0.039       -0.196      0.169       -0.132      0.142
m = 5        (0.007)     (0.004)     (0.002)     (0.018)     (0.016)     (0.015)


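Table 5B.2: NLSY data. Optimal LIV results for the estimated group probabilities $\pi_j$, group sizes $\lambda_j$ (in italics), and $\gamma_2$. Here $\gamma_{21}$ is ‘age’, $\gamma_{22}$ is ‘black’, $\gamma_{23}$ is ‘smsa’, and $\gamma_{24}$ is ‘south’. In each block the first row gives the estimates, the second row the values in parentheses, and the third row the group sizes $\lambda_j$.

         $\pi_1$   $\pi_2$   $\pi_3$   $\pi_4$   $\pi_5$   $\gamma_1$   $\gamma_2$   $\gamma_3$   $\gamma_4$   $\sigma^2_\nu$   $\rho_{\varepsilon\nu}$
m = 4    -8.50     -4.70     -1.03     2.96                0.01         -0.60        0.29         -0.14        1.10             0.091
         (0.34)    (0.12)    (0.03)    (0.04)              (0.01)       (0.07)       (0.06)       (0.06)
         0.01      0.07      0.59      0.34
m = 5    -8.25     -4.53     -1.07     2.31      3.90      -0.01        -0.61        0.28         -0.09        0.84             0.091
         (0.29)    (0.10)    (0.03)    (0.08)    (0.13)    (0.01)       (0.07)       (0.06)       (0.06)
         0.01      0.07      0.56      0.24      0.12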


Table 5B.3: Brabant data. Results for OLS, IV, and optimal LIV. Here $\beta_0$ is the constant, $\beta_1$ is ‘schooling’, $\beta_2$ is ‘experience’, $\beta_3$ is ‘nr. children’, $\beta_4$ is ‘av. school mark’, $\beta_5$ is ‘anti-social’, and $\beta_6$ is ‘fself’.

             $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$   $\beta_4$   $\beta_5$   $\beta_6$   $\sigma^2_\varepsilon$
OLS          2.701       0.043       0.004       0.004       0.033       -0.141      0.008       0.133
             (0.013)     (0.004)     (0.004)     (0.005)     (0.010)     (0.029)     (0.045)
IV           2.701       0.056       -0.004      0.005       0.011       -0.160      0.051       0.137
             (0.013)     (0.008)     (0.006)     (0.005)     (0.015)     (0.031)     (0.050)
Opt. LIV     2.701       0.040       0.005       0.003       0.038       -0.137      -0.001      0.132
m = 2        (0.013)     (0.005)     (0.004)     (0.005)     (0.011)     (0.029)     (0.046)
Opt. LIV     2.701       0.040       0.006       0.003       0.039       -0.136      -0.004      0.132
m = 4        (0.013)     (0.005)     (0.004)     (0.005)     (0.011)     (0.029)     (0.045)

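Table 5B.4: Brabant data. Optimal LIV results for the estimated group probabilities $\pi_j$, group sizes $\lambda_j$ (in italics), and $\gamma_2$. Here $\gamma_{21}$ is ‘age’, $\gamma_{22}$ is ‘nr. children’, $\gamma_{23}$ is ‘av. school mark’, $\gamma_{24}$ is ‘anti-social’, and $\gamma_{25}$ is ‘fself’. In each block the first row gives the estimates, the second row the values in parentheses, and the third row the group sizes $\lambda_j$.

         $\pi_1$   $\pi_2$   $\pi_3$   $\pi_4$   $\gamma_1$   $\gamma_2$   $\gamma_3$   $\gamma_4$   $\gamma_5$   $\sigma^2_\nu$   $\rho_{\varepsilon\nu}$
m = 2    -1.00     6.38                          0.32         -0.02        0.71         0.82         -1.66        4.39             0.057
         (0.10)    (0.30)                        (0.03)       (0.03)       (0.07)       (0.19)       (0.28)
         0.86      0.14
m = 4    -1.62     2.63      6.92      11.17     0.36         0.00         0.61         0.59         -1.48        2.35             0.105
         (0.10)    (0.32)    (0.45)    (0.81)    (0.03)       (0.03)       (0.06)       (0.17)       (0.23)
         0.72      0.19      0.08      0.01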


Table 5B.5: PSID data. Results for OLS, IV, and optimal LIV. Here $\beta_0$ is the constant, $\beta_1$ is ‘schooling’, $\beta_2$ is ‘experience’, $\beta_3$ is ‘kidslt6’, $\beta_4$ is ‘kidsgr6’, $\beta_5$ is ‘unempl’, $\beta_6$ is ‘city’, and $\beta_7$ is ‘nwincome’.

             $\beta_0$   $\beta_1$   $\beta_2$   $\beta_3$   $\beta_4$   $\beta_5$   $\beta_6$   $\beta_7$   $\sigma^2_\varepsilon$
OLS          1.218       0.102       0.014       0.004       -0.007      -0.002      0.029       0.004       0.376
             (0.030)     (0.014)     (0.004)     (0.079)     (0.025)     (0.010)     (0.065)     (0.003)
IV           1.218       0.073       0.014       0.030       -0.012      0.000       0.038       0.005       0.380
             (0.030)     (0.032)     (0.004)     (0.083)     (0.025)     (0.010)     (0.066)     (0.004)
Opt. LIV     1.218       0.096       0.014       0.010       -0.008      -0.001      0.031       0.004       0.369
m = 5        (0.029)     (0.014)     (0.004)     (0.078)     (0.024)     (0.010)     (0.065)     (0.003)

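Table 5B.6: PSID data. Optimal LIV results for the estimated group probabilities $\pi_j$, group sizes $\lambda_j$ (in italics), and $\gamma_2$. Here $\gamma_{21}$ is ‘experience’, $\gamma_{22}$ is ‘kidslt6’, $\gamma_{23}$ is ‘kidsgr6’, $\gamma_{24}$ is ‘unempl’, $\gamma_{25}$ is ‘city’, and $\gamma_{26}$ is ‘nwincome’. The first row gives the estimates, the second row the values in parentheses, and the third row the group sizes $\lambda_j$.

         $\pi_1$   $\pi_2$   $\pi_3$   $\pi_4$   $\pi_5$   $\gamma_1$   $\gamma_2$   $\gamma_3$   $\gamma_4$   $\gamma_5$   $\gamma_6$   $\sigma^2_\nu$   $\rho_{\varepsilon\nu}$
m = 5    -5.25     -2.97     -0.64     1.39      3.76      0.00         0.11         0.00         0.00         0.05         0.01         0.22             0.092
         (0.11)    (0.09)    (0.03)    (0.10)    (0.06)    (0.00)       (0.07)       (0.02)       (0.01)       (0.06)       (0.00)
         0.04      0.07      0.60      0.09      0.19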

Chapter 6

Regressor and random-effects dependencies in multilevel models1

6.1 Introduction

In many situations data have a hierarchical structure. For example, when it is investigated how workplace characteristics affect worker productivity, both workers and firms are units in the analysis. Similarly, hierarchical data arise in the context of panel research, when multiple observations are available on the ‘objects’ under study. Typically, these types of data are analyzed with multilevel or hierarchical linear models. The model we consider is given by

$y_{ij} = X_{ij}'\beta + Z_i'\gamma + \alpha_i + \eta_{ij}$, (6.1)

where $y_{ij}$ is the dependent variable, $X_{ij} \in \mathbb{R}^{k \times 1}$ are level-one or individual specific regressors, $Z_i \in \mathbb{R}^{l \times 1}$ contains level-two or group specific regressors2, $\eta_{ij}$ is a random (error) component with $E(\eta_{ij}) = 0$ and $\mathrm{var}(\eta_{ij}) = \sigma^2_\eta$, and where $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$. Throughout this chapter, matrices are printed in capitals and scalars as lowercase. Greek symbols denote unobserved parameters that are to be estimated. The unit–specific intercept $\alpha_i$ may be specified to be random (with $E(\alpha_i) = 0$ and $\mathrm{var}(\alpha_i) = \sigma^2_\alpha$) or fixed depending on the context of the study and the types of inferences that can be drawn (Verbeek, 2000, Judge et al., 1985, Wooldridge, 2002, Bryk and Raudenbush, 2002, Snijders and Bosker, 1999).

1 This chapter is published as Ebbes, Bockenholt and Wedel (2004).

2 Previously we denoted the matrix of instrumental variables by $Z$. Here we indicate the matrix of instruments by $V$, see e.g. section 6.5.

In the modeling of hierarchical data structures it is frequently assumed that the explanatory variables $X$ and $Z$ are independent of the random (error) components. If independence holds, the regressors are said to be ‘exogenous’ (or determined outside the model). However, in many applications it is unrealistic to assume that regressors and random components are independent. For the model given in (6.1) we consider two types of independence:

1. Level-2 independence or $X\alpha$– and $Z\alpha$–independence, and

2. level-1 independence or $X\eta$– and $Z\eta$–independence.

This chapter shows that even in the presence of modest dependencies, regression effects can be biased substantially. Different approaches for testing the independence assumption are presented and illustrated with the help of simulation studies.

Importantly, the independence assumptions can be violated easily. Examples include (1) relevant omitted variables (Card, 1999, 2001, Uusitalo, 1999, or Spencer and Fielding, 1998a, 1998b), (2) measurement error in the regressors (Plat, 1988, Bagozzi, Yi and Nassen, 1999, Wansbeek and Meijer, 2000, or Carroll, Ruppert and Stefanski, 1995), (3) self-selection3 (Hamilton and Nickerson, 2003), (4) simultaneity (White, 2001, or Greene, 2000), and (5) serially correlated errors in the presence of lagged dependent variables (White, 2001, or Ruud, 2000). In the standard (single level) regression model, the ordinary least squares (OLS) estimator can be written as $\hat\beta^{OLS}_n = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon$, where $E(\varepsilon) = 0$. If the assumption of independence of regressors and errors does not hold (i.e. when $E(\varepsilon|X) \neq 0$), it follows immediately that the OLS estimator is biased. This bias can be reduced, at least in large samples, by using instrumental variables estimation techniques (Bowden and Turkington, 1984, White, 2001, or Wooldridge, 2002). Instrumental variables (IVs) should be uncorrelated with the error term $\varepsilon$, and should explain part of the variability in the endogenous regressors $X$. Once instruments are available, consistent estimates for the regression parameters can be obtained. Furthermore, Hausman-like tests (Hausman, 1978) can be used to test for regressor–error dependencies in this standard linear regression model. The general idea of this approach is to compare two estimators, one that is consistent under both the null hypothesis of regressor–error independence and the alternative hypothesis, and one that is only consistent under the null hypothesis. The null hypothesis is rejected once a significant difference between these two estimators is found (cf. Verbeek, 2000)4.

3 The problem of self-selection arises when individuals tend to select themselves in a certain state – such as union vs. non-union member (e.g. Vella and Verbeek, 1998) and treated vs. not treated (e.g. Angrist, Imbens and Rubin, 1996) – on basis of economic or other, usually unknown, arguments.
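The bias of OLS under regressor–error dependence, and its removal by an instrument, can be illustrated with a small simulation. The following is a minimal sketch under assumptions chosen here for illustration (a single regressor, one instrument $z$, unit variances, and an error correlation of 0.5); none of these settings come from the studies cited above.

```python
import math
import random


def cov(a, b):
    """Sample covariance (divisor n) of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)


def simulate_ols_iv(n=20000, beta=2.0, rho=0.5, seed=0):
    """Return (OLS slope, IV slope) for y = 1 + beta * x + e with corr(x, e) > 0.

    Assumed design: x = z + v, where the instrument z is independent of e
    and v has unit variance and correlation rho with e.
    """
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        v = rho * e + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xi = zi + v
        z.append(zi)
        x.append(xi)
        y.append(1.0 + beta * xi + e)
    b_ols = cov(x, y) / cov(x, x)   # inconsistent: picks up cov(x, e) / var(x)
    b_iv = cov(z, y) / cov(z, x)    # simple IV estimator with one instrument
    return b_ols, b_iv


b_ols, b_iv = simulate_ols_iv()
print(f"OLS slope: {b_ols:.3f}   IV slope: {b_iv:.3f}   true slope: 2.000")
```

In this design the probability limit of the OLS slope is $\beta + \mathrm{cov}(x, \varepsilon)/\mathrm{var}(x) = 2 + 0.5/2 = 2.25$, whereas the IV slope converges to the true value.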

In multilevel models additional random components reflect the nesting structure in the data. Hence, an investigation of independence of explanatory variables and random terms becomes even more important. Because of the potentially severe consequences when these independence assumptions are violated, they need to be tested for explicitly in any application of multilevel models. The literature suggests performing the following diagnostic steps when endogeneity is suspected, which serves as a roadmap to the remainder of this paper. First, a diagnostic check to examine $X\alpha$–independence is readily available for multilevel models based on the work by Hausman and Taylor (1981). Fixed–effects (FE) estimation gives an unbiased estimate for $\beta$ in model (6.1) regardless of violation of $X\alpha$–independence, whereas random–effects (RE) estimation yields biased estimates (see section 6.2). If the test proposed by Hausman and Taylor (1981), which is based on the Hausman test (Hausman, 1978) and which we denote by the $H_\alpha$-test, does not reject the independence hypothesis, both fixed– and random–effects estimation for $\beta$ can be used. Once rejected, only fixed–effects estimation yields consistent results, provided the regressors are independent of level-1 random components. We show how the inclusion of group means can be used to examine $X\alpha$–dependencies (Mundlak, 1978). We present the Hausman-Taylor (HT) estimator as an alternative to fixed–effects estimation, which is potentially more efficient and which, in contrast to the fixed–effects estimator, can be used to estimate level-2 effects. The $H_\alpha$-test, Mundlak’s $\pi$ approach, and the Hausman-Taylor estimator are discussed in section 6.3.

4 See Appendix 6A for a more detailed explanation of instrumental variables techniques and the Hausman test.

However, the above steps should be considered with caution. As will be shown in section 6.4, the performance of the $H_\alpha$-test relies on the independence of regressors and level-1 random components. Unfortunately, endogeneity at this level can often not be ruled out a priori. Although this type of endogeneity is often ignored, its absence is a crucial assumption in using standard multilevel estimators. As a first diagnostic check for it, one should carefully consider whether or not, based on theoretical grounds, level-1 independence can be assumed. If not, IV estimation techniques can be adopted to estimate regression parameters in model (6.1) (Bowden and Turkington, 1984, Wooldridge, 2002). Several different multilevel IV estimators can be derived to estimate the regression parameters in model (6.1), depending on the specific assumptions about the exact form of the endogeneity problem (see appendix 6B). This approach is illustrated and discussed in section 6.5. To test for level-1 independence, another test based on the general approach of Hausman (1978) can be constructed. We will refer to this test as the $H_\eta$-test and illustrate its usefulness in section 6.5.

The diagnostic steps for investigating independence assumptions in two-level multilevel models are presented in table 6.1. For the sake of simplicity, no distinction is made between level-1 and level-2 regressors. Three types (cases (ii) – (iv)) of violations of regressor–error independence can now be distinguished. The table specifies the sections in which tests and estimators for each case are discussed in more detail.

Table 6.1: Overview of diagnostic tests to determine independence between regressors and random effects in a linear two-level regression model, where ‘yes’ (‘no’) means that the specific independence assumption is (not) satisfied.

Case    $E(X\alpha) = 0$   $E(X\eta) = 0$   Section         Table                Test                                            Estimators
(i)     Yes                Yes              6.2, 6.5        6.2, 6.7             $H_\alpha$ or Mundlak’s $\pi$, and $H_\eta$     FE or RE
(ii)    No                 Yes              6.2, 6.3        6.2, 6.3, 6.4        $H_\alpha$ or Mundlak’s $\pi$, and $H_\eta$     FE or HT
(iii)   Yes                No               6.2, 6.4, 6.5   6.2, 6.5, 6.6, 6.7   $H_\eta$                                        External IV
(iv)    No                 No               6.2, 6.4, 6.5   6.2, 6.5, 6.7        $H_\eta$                                        External IV

6.2 Biases caused by level-2 ($X\alpha$)– and level-1 ($X\eta$)–dependencies

The parameters $\beta$ in the multilevel model given in (6.1) can be estimated by fixed– or random–effects methods (Verbeek, 2000, Baltagi, 2001, Goldstein, 1995, Longford, 1993). We do not discuss the estimators here, but details can be found in appendix 6B. To illustrate the effects of $X\alpha$– and $X\eta$–dependencies under fixed–effects and random–effects estimation, consider table 6.2 which summarizes the simulation results for the model:

$y_{ij} = \beta_0 + \beta_1 x_{ij} + \alpha_i + \eta_{ij}$, (6.2)

where $i = 1, \ldots, 150$, $j = 1, \ldots, 10$, $\alpha_i \sim N(0, \sigma^2_\alpha)$ and $\eta_{ij} \sim N(0, \sigma^2_\eta)$, and the following four cases are specified: (i) $\rho(x, \alpha) = \rho(x, \eta) = 0$, (ii) $\rho(x, \alpha) = 0.3$ and $\rho(x, \eta) = 0$, (iii) $\rho(x, \alpha) = 0$ and $\rho(x, \eta) = 0.3$, and (iv) $\rho(x, \alpha) = \rho(x, \eta) = 0.3$. The table presents means and standard deviations computed across 250 replications.

Table 6.2: Results of simulation study to examine bias of the fixed–effects (FE) and random–effects (RE) estimator for level-1 and level-2 endogeneity. True values: $\beta_0 = 10$, $\beta_1 = 2$, $\sigma^2_\eta = 1$, and $\sigma^2_\alpha = 1$.

                          Case (i)       Case (ii)     Case (iii)    Case (iv)
FE   $\beta_0$            -              -             -             -
     $\beta_1$            1.99 (0.04)    2.00 (0.04)   2.43 (0.04)   2.42 (0.04)
RE   $\beta_0$            10.04 (0.21)   8.87 (0.29)   7.88 (0.20)   5.79 (0.30)
     $\beta_1$            1.99 (0.04)    2.23 (0.05)   2.42 (0.04)   2.84 (0.06)
     $\sigma^2_\alpha$    1.01 (0.14)    0.10 (0.02)   0.99 (0.13)   0.00 (0.00)
     $\sigma^2_\eta$      1.00 (0.04)    1.00 (0.04)   0.90 (0.03)   0.91 (0.04)

As expected, both the fixed–effects and random–effects estimator yield unbiased results for $\beta_1$ and unbiased estimates for the variances when the regressor is truly exogenous (case (i)). The fixed–effects estimator cannot estimate the constant $\beta_0$ (nor the effects of other level-2 variables). Unbiased results for these parameters are obtained with the random–effects estimator. If $\rho(x, \alpha) = 0.3$ and $\rho(x, \eta) = 0$ (case (ii)), the random–effects estimator is biased upward and $\sigma^2_\alpha$ exhibits a severe downward bias, but fixed–effects estimation is possible for $\beta_1$ and an unbiased estimate for $\sigma^2_\eta$ can be obtained. If $\rho(x, \alpha) = 0$ and $\rho(x, \eta) = 0.3$ (case (iii)), both the fixed–effects and the random–effects estimator yield biased results for the regression parameters and similar conclusions hold for $\sigma^2_\eta$. However, $\sigma^2_\alpha$ can be estimated consistently in this case. Finally, if all independence assumptions are violated (case (iv)), it can be seen that both fixed–effects and random–effects estimation yield biased results for the regression parameters. The bias in the fixed–effects estimator for case (iv) is independent of the presence of $X\alpha$–dependency. It can be seen that random–effects estimation yields an even larger bias in this case. In all replications, the estimate of $\sigma^2_\alpha$ was negative, and therefore set to 0. The bias in the estimate of $\sigma^2_\eta$ is approximately equal to its bias for case (iii).
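The qualitative pattern of this simulation can be reproduced with a short sketch of model (6.2). Several choices below are assumptions made for illustration and are not taken from the study: the regressor is generated as a unit-variance mixture of $\alpha_i$, $\eta_{ij}$ and independent noise, a single replication is drawn per case, and the random–effects estimator is computed by quasi-demeaning with the true variance components rather than estimated ones. The numbers will therefore not match table 6.2; in particular, the random–effects bias in case (ii) is smaller here because $\sigma^2_\alpha$ is not underestimated.

```python
import math
import random


def simulate_panel(rho_alpha, rho_eta, n=150, t=10, beta0=10.0, beta1=2.0, seed=1):
    """Draw one balanced panel from model (6.2) with unit variance components.

    Assumed design: x = rho_alpha * alpha + rho_eta * eta + c * noise with
    unit variance, so corr(x, alpha) = rho_alpha and corr(x, eta) = rho_eta.
    """
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho_alpha ** 2 - rho_eta ** 2)
    groups = []
    for _ in range(n):
        alpha = rng.gauss(0.0, 1.0)
        xs, ys = [], []
        for _ in range(t):
            eta = rng.gauss(0.0, 1.0)
            x = rho_alpha * alpha + rho_eta * eta + c * rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(beta0 + beta1 * x + alpha + eta)
        groups.append((xs, ys))
    return groups


def slope_after_demeaning(groups, theta):
    """OLS slope after subtracting theta times the group mean from x and y.

    theta = 1 gives the within (fixed-effects) estimator; 0 < theta < 1
    gives the quasi-demeaned (random-effects GLS) estimator.
    """
    xs, ys = [], []
    for gx, gy in groups:
        mx, my = sum(gx) / len(gx), sum(gy) / len(gy)
        xs.extend(x - theta * mx for x in gx)
        ys.extend(y - theta * my for y in gy)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


T = 10
# Quasi-demeaning weight for a balanced panel with the true variances (both 1).
theta_re = 1.0 - math.sqrt(1.0 / (1.0 + T * 1.0))

cases = {"i": (0.0, 0.0), "ii": (0.3, 0.0), "iii": (0.0, 0.3), "iv": (0.3, 0.3)}
results = {}
for case, (rho_a, rho_e) in cases.items():
    panel = simulate_panel(rho_a, rho_e, t=T)
    fe = slope_after_demeaning(panel, 1.0)
    re_ = slope_after_demeaning(panel, theta_re)
    results[case] = (fe, re_)
    print(f"case ({case}): FE beta1 = {fe:.3f}, RE beta1 = {re_:.3f}")
```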


These results indicate clearly that one should consider carefully whether to use random–effects estimation if there are reasons to assume that independence may not hold. Even a moderate (positive) correlation between $x$ and $\alpha$ in model (6.2) induces in this case an (upward) bias of approximately 10% in the random–effects estimator for $\beta_1$ and an approximately 90% downward bias in $\sigma^2_\alpha$. Dependencies between the regressors and $\alpha$ can be accommodated by using fixed–effects estimation. However, failure to correct for dependencies between the regressor and $\eta$ leads to biases in both the random–effects and the fixed–effects estimator. A moderate positive correlation between the regressor and $\eta$ induces a significant upward bias in both the fixed– and random–effects estimate for $\beta_1$. Finally, if the regressor is correlated with both $\alpha_i$ and $\eta_{ij}$, the bias in the random–effects estimator for the regression parameters is even larger and under case (iv) it would be concluded incorrectly that no random–effects are present in the data. The following section focusses on the case when only $X\alpha$– but no $X\eta$–dependencies are present.

6.3 The case of level-2 (Xα) dependencies only

6.3.1 Testing for $X\alpha$–dependencies

In this section we first discuss two test statistics to examine $X\alpha$–dependencies. It is assumed that no $X\eta$–dependency is present. In case this type of dependency cannot be rejected, we present and illustrate alternative estimators.

Hausman and Taylor (1981) show that the multilevel structure of the data and the presence of a consistent estimator regardless of the correlation between regressors and $\alpha_i$ (but with $X$ and $\eta$ independent), facilitate tests for this type of endogeneity in model (6.1) using the general idea of a Hausman test (Hausman, 1978). This Hausman test statistic can be computed as follows:

$H_\alpha = (\hat\beta_{FE} - \hat\beta_{RE})' \hat\Sigma^{-1} (\hat\beta_{FE} - \hat\beta_{RE})$, (6.3)

where $\hat\Sigma$ is an estimate of the covariance matrix of $\hat\beta_{FE} - \hat\beta_{RE}$ and computed as $\mathrm{cov}(\hat\beta_{FE}) - \mathrm{cov}(\hat\beta_{RE})$. The resulting test statistic $H_\alpha$ can be shown to have a chi-square distribution under the null hypothesis of independence of $X$, $Z$ and $\alpha_i$. If the null hypothesis is rejected, the fixed–effects estimator should be used. A great advantage of multilevel over single level applications is the possibility to test for regressor–error dependencies of this type. This is not possible in single–level applications, as there is no estimator that is consistent under both the null hypothesis and alternative hypothesis when IVs are not available.
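For a single regression coefficient, the statistic in (6.3) reduces to a squared difference divided by a difference of variances. The sketch below shows this scalar case only; the function name `hausman_scalar` and the numerical inputs are hypothetical values chosen for illustration, not estimates from this chapter.

```python
import math


def hausman_scalar(b_fe, b_re, var_fe, var_re):
    """Scalar version of (6.3) with its chi-square(1) P-value.

    var_fe and var_re are the estimated variances of the fixed-effects and
    random-effects estimates; their difference must be positive for the
    statistic to be defined.
    """
    diff_var = var_fe - var_re
    if diff_var <= 0:
        raise ValueError("var_fe - var_re must be positive")
    h = (b_fe - b_re) ** 2 / diff_var
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value


# Hypothetical estimates and variances, for illustration only.
h_alpha, p_alpha = hausman_scalar(b_fe=2.00, b_re=2.10, var_fe=0.0016, var_re=0.0012)
print(f"H_alpha = {h_alpha:.2f}, P-value = {p_alpha:.2g}")
```

With more than one coefficient the quadratic form in (6.3) requires inverting $\hat\Sigma$, and the degrees of freedom equal the number of coefficients compared.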

6.3.2 Mundlak’s approach for Xα–dependencies

One approach for investigating potential correlations between $X$ and the random–effects $\alpha_i$ is to model the dependence between $\alpha_i$ and the regressors explicitly. Mundlak (1978) suggests the inclusion of group means by estimating $\alpha_i = \bar{X}_{i.}\pi + \xi_i$. Snijders and Bosker (1999) argue that the inclusion of group means as explanatory variables in multilevel models can yield interesting substantive results. It can be shown that the test proposed by Hausman and Taylor (1981) and Mundlak’s approach are closely related and, in fact, yield numerically identical results (Baltagi, 2001, p.65-72). Modeling this dependence explicitly allows for unbiased random–effects estimation for $\beta$, regardless of whether $X$ and $\alpha$ are independent or not. This approach is attractive when fixed–effects estimation is undesirable, but $X_i$ and $\alpha_i$ cannot be assumed independent. However, this procedure does not yield unbiased estimates for level-2 effects/parameters ($\gamma$ and $\sigma^2_\alpha$).

These methods are illustrated in table 6.3 where we present the results for the $H_\alpha$-test and Mundlak’s approach. The data were simulated according to the same design as in the previous simulation study (model (6.2)). It can be seen that when there is no regressor–error dependency (case (i)), the proportion of replications in which the $H_\alpha$-test rejects the null hypothesis is very close to the nominal significance level of 5%. With a correlation between $x$ and $\alpha$ of 0.3, the null hypothesis of no level-2 ($X\alpha$) dependency is rejected in all replications. The same conclusions follow from Mundlak’s $\pi$, which is significantly different from zero for $\rho_{x,\alpha} = 0.3$ but not for $\rho_{x,\alpha} = 0$. Furthermore, random–effects estimation in Mundlak’s model allows for unbiased estimates of the level-1 predictor, but the constant (and other potential level-2 predictors) cannot be estimated unbiasedly. The same holds for the variance $\sigma^2_\alpha$, but $\sigma^2_\eta$ can be estimated unbiasedly. In the next section we present a more satisfying solution to the problem when $X\alpha$–dependencies, but no $X\eta$–dependencies, are present5.

Table 6.3: Results $H_\alpha$-test and Mundlak’s approach ($\alpha_i = \pi \bar{x}_{i.} + \xi_i$). True values: $\beta_0 = 10$, $\beta_1 = 2$, $\sigma^2_\eta = 1$, and $\sigma^2_\alpha = 1$.

                               Case (i)      Case (ii)
$H_\alpha$-test                4 %           100 %
Mundlak   $\beta_0$            9.90 (1.85)   -11.17 (0.86)
          $\beta_1$            1.99 (0.04)   2.00 (0.04)
          $\pi$                0.03 (0.38)   4.23 (0.18)
          $\sigma^2_\alpha$    1.01 (0.14)   0.10 (0.02)
          $\sigma^2_\eta$      1.00 (0.04)   1.00 (0.04)

⁵ Manchanda, Rossi and Chintagunta (2004) apply a ‘generalization’ of models developed by Chamberlain (1980, 1984) that are related to Mundlak’s approach. Their method applies to situations where the levels of marketing mix variables are chosen with potential knowledge of the sales response parameters, and the regressors are potentially correlated with the random effects of all the market response model parameters. They obtain results for the effect of sales calls made to physicians on their prescription behavior for specific drugs.

6.3.3 The Hausman-Taylor estimator under Xα–dependencies

Although Mundlak’s approach allows for random–effects estimation, no unbiased results can be obtained for the level-2 (group-specific) variables. As a solution, Hausman and Taylor (1981) suggested an estimator that consistently and efficiently estimates both level-1 and level-2 parameters. It requires a priori knowledge about which of the level-1 and level-2 regressors are uncorrelated with the random components. Let Xij = [X1ij : X2ij] and Zi = [Z1i : Z2i], where the variables in sets X1 and Z1 are assumed to be uncorrelated with αi and all regressors are assumed to be independent of ηij. The idea is that X1ij and Z1i serve as their own instruments; X2ij − X̄2i can be used as instruments for X2ij (as in the fixed–effects approach), and the group means X̄1i serve as instruments for Z2i. To identify all the regression parameters, the number of variables contained in set X1 needs to be at least as large as the number of variables in set Z2. An attractive feature of the Hausman-Taylor estimator is that no external instruments (i.e. variables that are not included in the main regression equation) are needed, as this estimator constructs instruments from available data (‘internal’ instruments). More recent studies suggest modifications of the Hausman-Taylor estimator to improve efficiency; see Arellano and Bover (1995).

Table 6.4: Results Hausman-Taylor (HT) estimator for ρx2,α = 0.3 and ρz2,α = 0.3, but no level-1 dependencies (case (ii)). True values: β0 = 10, β1 = β2 = γ1 = γ2 = 2, σ²η = 1, and σ²α = 1.

          FE             RE             HT
β0        –              9.25 (0.34)    10.06 (0.97)
β1        2.00 (0.05)    1.59 (0.08)    2.00 (0.05)
β2        2.00 (0.04)    2.38 (0.07)    2.00 (0.04)
γ1        –              1.22 (0.17)    2.02 (0.44)
γ2        –              2.40 (0.11)    1.97 (0.44)
σ²η       1.00 (0.04)    1.00 (0.04)    1.00 (0.04)
σ²α       –              0.01 (0.01)    1.13 (0.36)
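The way the ‘internal’ instruments are assembled can be sketched as follows. This is a deliberately simplified, just-identified instrumental variables version of the idea: it uses the Hausman-Taylor instrument set but omits the GLS transformation that gives the actual Hausman-Taylor estimator its efficiency. The data-generating process (in particular the group component `g` that links the group means of x1 to z2) and all variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 500, 10                                         # groups, observations per group

alpha = rng.normal(size=n)
g = rng.normal(size=n)                                 # group-level component of x1 (assumed)
x1 = g[:, None] + rng.normal(size=(n, m))              # level-1, exogenous
x2 = 0.3 * alpha[:, None] + rng.normal(size=(n, m))    # level-1, correlated with alpha
z1 = rng.normal(size=n)                                # level-2, exogenous
z2 = g + 0.5 * alpha + rng.normal(size=n)              # level-2, correlated with alpha
y = (10 + 2 * x1 + 2 * x2 + 2 * z1[:, None] + 2 * z2[:, None]
     + alpha[:, None] + rng.normal(size=(n, m)))

def expand(v):
    """Repeat a group-level vector for every observation in the group."""
    return np.repeat(v, m)

ones = np.ones(n * m)
W = np.column_stack([ones, x1.ravel(), x2.ravel(), expand(z1), expand(z2)])

# Instruments: x1 and z1 for themselves, within-deviations of x2 for x2,
# and the group means of x1 for z2
x1_bar, x2_bar = x1.mean(axis=1), x2.mean(axis=1)
V = np.column_stack([ones, x1.ravel(), (x2 - x2_bar[:, None]).ravel(),
                     expand(z1), expand(x1_bar)])

# Just-identified IV estimator: solve V'(y - W delta) = 0
delta_iv = np.linalg.solve(V.T @ W, V.T @ y.ravel())
delta_ols = np.linalg.lstsq(W, y.ravel(), rcond=None)[0]

names = ["beta0", "beta1", "beta2", "gamma1", "gamma2"]
for name, iv, ols in zip(names, delta_iv, delta_ols):
    print(f"{name:7s} IV={iv:6.3f}  OLS={ols:6.3f}")
```

Because the number of instruments equals the number of parameters here, the sketch also illustrates the identification condition: with no variable in X1, nothing would be available to instrument z2.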

Table 6.4 illustrates the Hausman-Taylor estimator. The previously considered model to generate the data is extended as follows:

$$y_{ij} = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \gamma_1 z_{1i} + \gamma_2 z_{2i} + \alpha_i + \eta_{ij} \qquad (6.4)$$

for i = 1, ..., 150 and j = 1, ..., 10. We specify x1 and z1 to be independent of the random components; x2 and z2 are related to α (ρx2,α = ρz2,α = 0.3), but independent of ηij (i.e. case (ii)). Table 6.4 contains the means and standard deviations of the estimated parameters computed across 250 simulation replications. As can be seen, the fixed–effects (FE) estimator yields consistent results for level-1 effects, but no estimates of level-2 effects can be obtained. The random–effects (RE) estimator yields biased results for all regression parameters and σ²α, which is in agreement with the results in table 6.2. The Hausman-Taylor estimator uses the additional information that x1 and z1 are exogenous. These ‘internal’ instruments can be used to estimate all regression parameters consistently. Furthermore, an approximately unbiased estimate for σ²α can be obtained. In all cases, σ²η can be estimated unbiasedly.

The Hausman-Taylor estimator is very powerful as it does not require external

instruments. We agree with Verbeek (2000) that despite this obvious advan-

tage, the method has played a surprisingly minor role in empirical work. In

practice one does not know which X’s and Z’s are independent of α, but it is

possible to test for this assumption (Hausman and Taylor, 1981).

In this section we assumed independence of regressors and ηij. Unfortunately,
the methods presented in this section become unreliable and yield incorrect
conclusions in the presence of Xη–dependencies. Similar observations were

made for the fixed– and random–effects estimators in section 6.2. This is illus-

trated and discussed in the following section.

6.4 Limitations in the presence of level-1 (Xη)–dependencies

This section considers two problems in using the methods discussed so far.

First, as noted when discussing the results in table 6.2, both random–effects and fixed–effects estimation fail when endogeneity arises from level-1 dependencies (cases (iii) and (iv)). Second, we will show that the Hα-test, Mundlak’s approach, and the Hausman-Taylor estimator, although successful in testing and solving for Xα–dependencies, also break down in this case.

In section 6.2 we illustrated the consequences of using the fixed–effects and the random–effects estimator when regressors are correlated with the lowest level error term ηij. It was illustrated that even a small correlation between x and η in model (6.2) induced biases in both the fixed– and random–effects estimators. Similar limitations apply to the Hα-test and Mundlak’s approach discussed in subsections 6.3.1 and 6.3.2. Based on model (6.2), the simulation results in table 6.5 illustrate this situation. First, it can be seen that a situation with endogeneity at the first level (Xη–dependency) but no Xα–dependency cannot be detected by the Hα-test and Mundlak’s approach (case (iii)). This is not surprising, as the test is not designed for investigating this hypothesis. However, the estimates for both β1 and σ²η are still significantly biased due to Xη–dependencies. Researchers who are not aware of potential endogeneity problems at the first level may incorrectly conclude from these tests that either fixed– or random–effects estimation can be used, although, in fact, both methods yield biased results. When both Xα–dependencies and Xη–dependencies are present (case (iv)), the Hα-test and Mundlak’s π diagnose the Xα–dependency. Given that an Xα–dependency is detected, one would now use fixed–effects estimation (or the Hausman-Taylor estimator). However, it was seen in table 6.2 that in the presence of Xη–dependencies fixed–effects estimates for β are biased. The researcher in this case correctly concludes that Xα–dependencies are present, but misses the Xη–dependencies and, hence, still uses biased estimates.

Table 6.5: Results Hα-test and Mundlak’s approach (αi = π x̄i + ξi). True values: β0 = 10, β1 = 2, σ²η = 1, and σ²α = 1.

                    Case (iii)      Case (iv)
Hα-test             6 %             100 %
Mundlak   β0        8.00 (1.95)     -13.34 (0.20)
          β1        2.42 (0.04)     2.42 (0.04)
          π         -0.03 (0.39)    4.25 (0.05)
          σ²α       0.99 (0.13)     0.00 (0.00)
          σ²η       0.90 (0.03)     0.91 (0.04)

The same fallacious conclusion follows from the Hausman-Taylor estimator based on internal instrumental variables, as can be seen from table 6.6. These results are based on model (6.4), where x1 and z1 are specified to be independent of all random components, but x2 and z2 are correlated with ηij (but not with αi). The table shows that both the fixed–effects and the random–effects estimators yield biased results. The Hα-test does not diagnose this type of endogeneity and rejects the null hypothesis in 15% of the cases, when the nominal rate is 5%. Thus, importantly, this test indicates too often that there is an Xα–dependency while in fact there is none, as the dependency is caused by correlation between X and η. The Hausman-Taylor estimator is also biased in general, but because x1 is truly exogenous it is a valid instrument for z2, and the Hausman-Taylor estimate for γ2 is unbiased. The bias in the estimate for σ²η is small, as is the one observed for σ²α.

Table 6.6: Results Hausman-Taylor (HT) estimator for ρx2,η = 0.3 and ρz2,η = 0.3. True values: β0 = 10, β1 = β2 = γ1 = γ2 = 2, σ²η = 1, and σ²α = 1.

          FE             RE             HT
β0        –              8.48 (0.54)    9.64 (1.27)
β1        1.58 (0.06)    1.57 (0.06)    1.58 (0.06)
β2        2.21 (0.03)    2.21 (0.02)    2.21 (0.03)
γ1        –              1.59 (0.13)    1.78 (0.23)
γ2        –              2.19 (0.08)    1.99 (0.21)
σ²η       0.95 (0.04)    0.95 (0.04)    0.95 (0.04)
σ²α       –              0.95 (0.13)    1.06 (0.19)
Hα-test   –              15%            –

We conclude that when endogenous regressors are present at the lowest level of

the hierarchical model, caused by correlations between X and η, all available

tests and estimators presented in section 6.3 yield invalid inferences. In the

next section we discuss possible solutions to this problem.


6.5 Testing and solving for Xη–dependencies

6.5.1 External Instruments

We consider potential remedies to the situation where Xη–dependencies are

present in the form of ‘classical’ IV methods. These methods are similar to

the Hausman-Taylor estimator, but require the availability of ‘external’ instru-

ments.

External instrumental variables are desirable for unbiased and consistent estimation when Xη–dependencies are present in the data (Bowden and Turkington, 1984, Wooldridge, 2002). The main ideas behind these estimators are similar to the ones of classical IV estimators developed for cross–sectional situations, with an additional step to account for nonspherical disturbances due to the hierarchical structure (Wooldridge, 2002). Two multilevel IV estimators that yield unbiased estimation of the parameters in model (6.1) in the presence of Xη–dependencies are the (multilevel) two– and three–stage least squares (SLS) estimators (see appendix 6B), where the latter estimator takes the random error component structure into account, yielding a potentially more efficient estimator (Im et al., 1999, Wooldridge, 2002, Bowden and Turkington, 1984).

In the following, we will use the multilevel 2SLS estimator to illustrate the usefulness of external IV estimators when the Xη–independence assumption is violated. We also show how this estimate can be used to construct another Hausman-based test (Hη-test) to test for Xη–independence. The results of this test can be used to decide whether fixed–effects, random–effects or the Hausman-Taylor estimator, or multilevel (external) IV estimators should be used.

Using model (6.2) we illustrate the multilevel IV estimator with one level-1 instrument. The endogenous regressor is now simulated as xij = c + vij + φij, where ρφ,η = 0.3, c is a constant, and vij is the instrument, generated independently of all error terms. In addition, a Hausman–based test is computed that compares the multilevel IV estimate for (β0, β1) with the random–effects estimate for (β0, β1) (or the fixed–effects estimate for β1). The results are presented in table 6.7. Note that in table 6.7 we estimate σ²ε, which is the variance of εij = αi + ηij, from the residuals computed from the IV regression. The table shows that once valid external instruments are available, we obtain approximately unbiased estimates for the model parameters. Furthermore, these estimates are unbiased regardless of Xα–independence (case (iii) vs. (iv)). The Hη-test based on these estimates detects both case (iii) and case (iv) endogeneity, indicating that the multilevel (external) IV estimators should be used. A disadvantage of this method is that it is less efficient than fixed– and random–effects estimators. Furthermore, valid instruments that have no direct effect on y and explain a substantial part of the variance in x have to be available, which is often a limitation in empirical work.

Table 6.7: Results multilevel IV for case (iii) and case (iv) violations. True values: β0 = 10, β1 = 2, σ²η = 1, and σ²α = 1.

                        Case (i)        Case (iii)      Case (iv)
Hη-test                 3.2 %           96.4 %          100 %
Multilevel IV   β0      10.00 (0.21)    9.98 (0.14)     9.98 (0.20)
                β1      1.99 (0.07)     2.00 (0.07)     2.01 (0.07)
                σ²ε     1.99 (0.12)     1.98 (0.14)     1.99 (0.15)
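The multilevel 2SLS estimator with one external level-1 instrument can be sketched as follows for a case (iii) design. The sample size, the value of the constant c, and the helper name `two_sls` are assumptions for this example (the sample is larger than in the simulation study to keep the illustration stable); the estimator itself follows the formula (W′P_V W)⁻¹W′P_V y given in appendix 6B.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 500, 10
N = n * m

alpha = np.repeat(rng.normal(size=n), m)      # random intercepts, independent of x
eta = rng.normal(size=N)
# phi is correlated with eta (correlation 0.3), which makes x endogenous
phi = 0.3 * eta + np.sqrt(1 - 0.3**2) * rng.normal(size=N)
v = rng.normal(size=N)                        # external instrument
x = 1.0 + v + phi                             # c = 1.0 is an assumed constant
y = 10 + 2 * x + alpha + eta

W = np.column_stack([np.ones(N), x])          # regressors
V = np.column_stack([np.ones(N), v])          # instruments

def two_sls(W, V, y):
    """2SLS: (W' P_V W)^{-1} W' P_V y, without forming the N x N projection."""
    W_hat = V @ np.linalg.lstsq(V, W, rcond=None)[0]   # P_V W
    return np.linalg.solve(W_hat.T @ W, W_hat.T @ y)

d_iv = two_sls(W, V, y)
d_ols = np.linalg.lstsq(W, y, rcond=None)[0]

# Variance of the composite error eps = alpha + eta from the IV residuals
resid = y - W @ d_iv
sig2_eps = resid @ resid / (N - W.shape[1])

print("2SLS:", d_iv, " sigma2_eps:", sig2_eps)
print("OLS :", d_ols)
```

In this design the OLS slope is pulled away from the true value of 2 by the correlation between φ and η, whereas the 2SLS slope is not; the residual variance estimates σ²α + σ²η = 2, as in table 6.7.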

Although external IVs can be useful for dealing with Xη–dependencies, it

should be noted that IV estimators can be seriously biased in small samples

and may exhibit poor asymptotic properties when weak instruments are ap-

plied. An instrument is said to be ‘weak’ when it explains none or only a small

part of the variance in the endogenous regressor (i.e. it is only weakly corre-

lated). There is a considerable literature that investigates the potential pitfalls

in IV estimation when weak instruments are used and several recommenda-

tions to deal with these problems are made (Staiger and Stock, 1997, Bound,

Jaeger and Baker, 1995, Nelson and Startz, 1990, and Kleibergen and Zivot,

2003).


To address the problem of weak instruments, Hahn and Hausman (2002) re-

cently developed a test for the validity of instruments. Their approach is also

based on the general Hausman specification test approach (Hausman, 1978).

The test statistic is fairly simple to compute and is shown to have a t distribution under the null hypothesis. Rejection of the null hypothesis might indicate a failure of the orthogonality assumption of the instruments or that the instruments are weak. Hahn and Hausman (2002) suggest a two step approach,

based on this test, to decide which IV estimator, or none, should be used. This

approach may provide a helpful guide in guarding against weak instruments.

Furthermore, it is relatively straightforward to use and it could prevent the re-

searcher from relying on results obtained with weak instruments.

However, although IV methods are attractive in theory, they can be difficult to

apply in practice because it may prove difficult to locate ‘good’ IVs as indicated by the Hahn and Hausman test. As a possible solution, we next consider Lewbel’s (1997) method for computing instrumental variables from the data at hand and demonstrate that this method could potentially be extended to multilevel models with general Xη–dependencies.

6.5.2 Internal instruments: Lewbel’s approach

Lewbel (1997) provides a method for constructing internal instruments when Xη–dependencies exist. This approach was originally proposed in the context of measurement error models, but we argue that it is also useful in the context of general regressor-error correlation (see also appendix 6C). To the best of our knowledge, the issue of constructing internal instruments from available data in multilevel models where Xη–dependencies are present has not been addressed before. Lewbel’s (1997) idea is based on the observation that when the endogenous regressor in model (6.2) has a skewed distribution, the following transformations of the available data may yield valid instruments:

$$v_{1ij} = (y_{ij} - \bar{y})(x_{ij} - \bar{x})$$
$$v_{2ij} = (y_{ij} - \bar{y})^2$$
$$v_{3ij} = (x_{ij} - \bar{x})^2 \qquad (6.5)$$

The results in table 6.8 illustrate the internal instrumental variable approach for model (6.2) and compare it with the external instrumental variable approach in the previous subsection. The same simulation data as in table 6.7 were used, where the endogenous regressor was generated as xij = c + vij + φij, with ρφ,η = 0.3 and vij the exogenous instrument. We compare Lewbel’s approach to the benchmark, where we assume that the vij are observed instruments. Thus the results from the multilevel IV estimator in table 6.8 are the same as in table 6.7, and were obtained by using vij as ‘observed’ instruments, whereas the Lewbel approach uses the constructed instruments in (6.5) instead.

Table 6.8 indicates that the Lewbel IVs may yield approximately unbiased results. Using these IVs is less efficient than using the true observed IVs, which is not surprising as the former uses less information. Nevertheless, the Lewbel approach appears to be quite promising since it provides a method to construct instruments from the available data. These instruments can either be used alone, or to augment a set of existing instruments in order to improve efficiency.
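The construction of the instruments in (6.5) and their use in a 2SLS regression can be sketched as follows. The simulation design is an assumption made for this example: the exogenous part of x is drawn from an exponential distribution to make x skewed, the errors are normal, and the sample is considerably larger than in table 6.8 so that a single run is reasonably stable. The sketch is meant to show the mechanics only, not to reproduce the reported results.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2000, 10
N = n * m

alpha = np.repeat(rng.normal(size=n), m)
eta = rng.normal(size=N)
nu = 0.3 * eta + np.sqrt(1 - 0.3**2) * rng.normal(size=N)   # correlated with eta
theta = rng.exponential(scale=1.0, size=N)                  # skewed exogenous part (assumed)
x = theta + nu                                              # endogenous regressor
y = 10 + 2 * x + alpha + eta

# Lewbel-type instruments built from the observed data, as in (6.5)
yc, xc = y - y.mean(), x - x.mean()
v1, v2, v3 = yc * xc, yc**2, xc**2

W = np.column_stack([np.ones(N), x])
V = np.column_stack([np.ones(N), v1, v2, v3])

# 2SLS with the constructed instruments
W_hat = V @ np.linalg.lstsq(V, W, rcond=None)[0]
d_lewbel = np.linalg.solve(W_hat.T @ W, W_hat.T @ y)
d_ols = np.linalg.lstsq(W, y, rcond=None)[0]

print("Lewbel IV:", d_lewbel)
print("OLS      :", d_ols)
```

If `theta` were drawn from a symmetric distribution instead, the constructed instruments would be only weakly related to x and the estimates would become unstable.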

6.6 Discussion and future research

Although the previous discussion may suggest that dependencies between regressors and random components can be adequately addressed in multilevel models, much care is required in using these methods in actual applications. First, the estimation methods and test procedures to solve and test for Xα–dependencies rely critically on the independence of X and η. Second, methods that rely on IVs are known to be biased in small samples, and standard asymptotic results break down when instruments are weak (i.e. they are poorly correlated with X). This holds in particular for the IV-based methods to solve for Xη–dependencies and for the Hausman-Taylor estimator to solve for Xα–dependencies.

Table 6.8: Results multilevel and Lewbel’s internal IVs for cases (iii) and (iv) compared with (i). True values: β0 = 10, β1 = 2, and σ²α = σ²η = 1.

                        Case (i)        Case (iii)      Case (iv)
Multilevel IV   β0      10.00 (0.21)    9.98 (0.14)     9.98 (0.20)
                β1      1.99 (0.07)     2.00 (0.07)     2.01 (0.07)
                σ²ε     1.99 (0.12)     1.98 (0.14)     1.99 (0.15)
Lewbel IV       β0      10.05 (0.75)    9.86 (0.61)     9.81 (0.58)
                β1      1.98 (0.28)     2.05 (0.23)     2.07 (0.22)
                σ²ε     2.04 (0.16)     2.00 (0.17)     1.97 (0.21)

Although the issues about the validity and the number of instrumental variables have primarily been investigated in cross-sectional applications, it is clear that they are relevant for multilevel applications as well. For instance, when for the simulation study in table 6.4 the instrument x̄1i is only weakly correlated with the endogenous regressor z2, generated as z2i = 0.01 × x̄1i + 0.01 × z1i + ζi, where ζi is a random component correlated with αi, and with all other input parameters unchanged, the Hausman-Taylor estimator yields γ2 = 3.45 (21.72) and σ²α = 238.01 (2611.22). Similar observations can be made for the ‘external’ multilevel IV estimates concerning table 6.7. To deal with these problems, Bound, Jaeger and Baker (1995) suggest that the R² or the F statistic of the regression of the endogenous regressors on the instruments serve as rough guides to the quality of the instruments and should routinely be reported. The Hahn and Hausman (2002) test or the method suggested by Donald and Newey (2001) to choose the number of instruments could potentially be extended to serve as a guide for identifying and selecting ‘valid’ instruments for the Hausman-Taylor estimator or multilevel IV estimators.
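The first-stage diagnostics mentioned above are straightforward to compute. The following sketch regresses an endogenous regressor on the instruments (plus a constant) and reports the R² and the F statistic for the joint significance of the instruments. The function name `first_stage_diagnostics` and the two simulated regressors are hypothetical and serve only to contrast a strong with a weak instrument.

```python
import numpy as np

def first_stage_diagnostics(x, V):
    """R^2 and F statistic of the regression of x on the instruments V.

    A constant is added to V; the F statistic tests the joint significance
    of the instruments (all slope coefficients equal to zero).
    """
    N = x.shape[0]
    V1 = np.column_stack([np.ones(N), V])
    fitted = V1 @ np.linalg.lstsq(V1, x, rcond=None)[0]
    ss_res = np.sum((x - fitted) ** 2)
    ss_tot = np.sum((x - x.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    q = V1.shape[1] - 1                       # number of instruments
    f = (r2 / q) / ((1 - r2) / (N - q - 1))
    return r2, f

rng = np.random.default_rng(5)
N = 1500
v = rng.normal(size=N)
noise = rng.normal(size=N)

r2_strong, f_strong = first_stage_diagnostics(1.0 * v + noise, v)    # strong instrument
r2_weak, f_weak = first_stage_diagnostics(0.01 * v + noise, v)       # weak instrument

print(f"strong: R2={r2_strong:.3f}, F={f_strong:.1f}")
print(f"weak  : R2={r2_weak:.4f}, F={f_weak:.2f}")
```

The weak case mirrors the coefficient of 0.01 used in the example above, where the instrument explains almost none of the variance of the endogenous regressor.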

Further, it is often suggested in cross-sectional applications to use the ‘limited

information maximum likelihood’ (LIML) estimator instead of least squares


estimators, since it is found to be less sensitive to weak instruments (e.g.

Davidson and MacKinnon, 1993, Staiger and Stock, 1997). To the best of

our knowledge, this issue has not been addressed for multilevel models, but

perhaps it should be because it may lead to improved results for the Hausman-

Taylor estimator or the multilevel IV estimators discussed in section 6.5.

Finally, the Lewbel approach has been shown to yield consistent results for

a simple multilevel model with Xη–dependency. This method deserves more

attention and could potentially be powerful in situations where no or weak in-

struments are available. The performance of this method depends critically on

its underlying assumptions as is shown in Lewbel (1997) and Wansbeek and

Meijer (2000). Most importantly, the method may be sensitive to outliers as

it relies on third order moments. Furthermore, the constructed instruments are

weak when the distribution of the endogenous regressor is not strongly skewed.

It is well known that in this case IV estimators can be seriously biased (Staiger

and Stock, 1997, Bound, Jaeger and Baker, 1995, Wansbeek and Meijer, 2000).

As a result, additional work is needed to determine the exact conditions under

which this approach can be used effectively in multilevel applications.

In some applications where endogeneity arises, however, the nature of the data

generating process itself suggests suitable instruments. This holds in partic-

ular for measurement error models, autoregressive models, and simultaneous

equation models. Possible approaches for measurement error models are dis-

cussed by Wansbeek and Meijer (2000), Carroll, Ruppert and Stefanski (1995),

or Bowden and Turkington (1984). These models can be estimated using IV

techniques, for instance by using other (potentially) mismeasured variables

(see White, 2001). Another method is based on Wald (1940), which assumes

that the observations can be divided into groups. This classification should

be independent of the error terms and discriminate between high and low val-

ues of the unobservable true construct (see also Madansky, 1959). Lewbel’s

(1997) idea presented in subsection 6.5.2 was originally proposed to solve for

measurement error problems. We showed, however, that this approach can be

fruitfully applied in the analysis of the general IV problems as well and de-


serves more attention. In autoregressive models one can often use lagged de-

pendent or independent variables as instruments (see for instance White, 2001,

or Wooldridge, 2002). Similarly, in simultaneous equations models instru-

ments for each equation can be obtained from the set of excluded exogenous

variables for that equation (e.g. Greene, 2000).

Our discussion of various methods did not address estimation methods in (gen-

eral) random coefficient and non-linear models (like probit- or logit models)

having endogenous regressors. In both cases, however, a similar reasoning ap-

plies as for linear (random intercept) models. Bowden and Turkington (1984)

discuss IV approaches for nonlinear models with additive disturbances (i.e.

y = g(θ, x)+ ε). Techniques developed for linear models, in particular (gen-

eralized) method of moments (GMM) techniques, can be used to estimate these

models. Blundell and Powell (2001a,b) investigate endogeneity issues in sev-

eral generalizations of the linear model. These authors discuss the extent to

which commonly used methods in linear models can be applied to the general-

ized models and show that the methods’ applicability depends on the structural

form.

Random coefficient models assume that differences between the level-2 objects are not only reflected by different intercepts as in model (6.1) but also by different slope coefficients. These models can be written as $y_i = X_i\beta_i + \eta_i$, where $\beta_i = \beta + \mu_{\beta,i}$, with $\mu_{\beta,i}$ a random component having mean zero, $E(\mu_{\beta,i}\mu_{\beta,i}') = \Delta$ and $E(\mu_{\beta,i}\mu_{\beta,j}') = 0$ for $i = 1, ..., n$ and $j \neq i$. As for random intercept models, the question whether to use a fixed–effects approach (in fact a seemingly unrelated regression framework), or a random–effects approach (a random–effects framework), depends on potential correlation between the random coefficients and the explanatory variables. If dependencies are present, which is sometimes referred to as ‘heterogeneity bias’, the random–effects estimator of β is biased and a fixed–effects approach should be used. Pudney (1978) provides a test for the null hypothesis that the explanatory variables are not correlated with the random coefficients. This test is based on the sample covariance between the (standard) least squares estimators for βi and the means of the explanatory variables for each individual (see also Chamberlain, 1982).

In general, we conclude that much needs to be done before problems of en-

dogeneity in multilevel models can be adequately addressed. We showed

that even small violations of the independence structure result in biased es-

timates for parameters of interest. In table 6.1 we distinguished four cases of

(in)dependence relations among the level-1 regressors and the random com-

ponents. No distinction was made between level-1 and level-2 regressors. If

this distinction is introduced, 15 instead of three possible cases of violations

of the independence assumptions emerge. Each of these combinations could

lead to different biases in the estimators discussed in this chapter. Although it

is possible to apply the methods presented here to address the various cases,

detailed studies are necessary to assess their performance in practice. Clearly,

endogeneity problems require much more attention than they receive in current

applications of multilevel models.

In the next chapter we introduce a nonparametric Bayesian latent instrumental

variables approach to estimate models with regressor-error dependencies at

various stages of the model. The nonparametric Bayes model can be applied

to multilevel models and does not impose restrictions on the distribution of

the latent instrument. Hence, it generalizes the LIV model introduced in

chapters 3 and 4.


Appendix 6A Classical instrumental variables (IV) estimation

The (single level) standard linear regression model for n observations is given by

$$y = X\beta + \varepsilon, \qquad (6A.1)$$

where $X \in \mathbb{R}^{n \times k}$ are the regressors, $\varepsilon = (\varepsilon_1, ..., \varepsilon_n)'$ are the (unobserved) independently and identically distributed errors with mean 0 and covariance matrix $\sigma^2_\varepsilon I_n$, and $y$ is an $n \times 1$ vector of observations on the dependent variable. The ordinary least squares (OLS) estimator is BLUE and is given by $\hat\beta^{OLS}_n = (X'X)^{-1}X'y$. If $E(\varepsilon|X) = 0$, $\hat\beta^{OLS}_n$ is unbiased (e.g. White, 2001).

In large samples, instrumental variables techniques can be used when this assumption is not met. Instrumental variables (IVs), collected in matrix $V \in \mathbb{R}^{n \times m}$, should be uncorrelated with the error term ε, i.e. $E(\varepsilon|V) = 0$, meaning that the instruments cannot have a direct effect on y (external instruments). Furthermore, the instruments should explain part of the variability in the endogenous regressors. Once instruments are available (and m ≥ k), two-stage least squares techniques, for example, can be used to obtain better estimates of β. The ‘classical’ IV estimator for model (6A.1) is computed as $\hat\beta^{IV}_n = (X'P_V X)^{-1}X'P_V y$, where $P_V = V(V'V)^{-1}V'$ (Bowden and Turkington, 1984, White, 2001, or Greene, 2000).

When ‘valid’ instruments are available, a Hausman test (Hausman, 1978) can be used to test for regressor-error dependencies in model (6A.1). Under $H_0: E(\varepsilon|X) = 0$, the Hausman test statistic, computed as

$$H = (\hat\beta^{IV}_n - \hat\beta^{OLS}_n)'\,\Sigma^{-1}(\hat\beta^{IV}_n - \hat\beta^{OLS}_n), \qquad \Sigma = \mathrm{Var}(\hat\beta^{IV}_n) - \mathrm{Var}(\hat\beta^{OLS}_n),$$

has a $\chi^2$ distribution. For a more detailed discussion on how to obtain an estimate for $\Sigma$ and how to determine the degrees of freedom (d.f.), see Greene (2000).

Appendix 6B Estimation for the hierarchical linear model

The parameters β in the multilevel model given in (6.1) can be estimated by either fixed–effects (assume the αi to be fixed parameters for i = 1, ..., n) or random–effects (assume the αi to be drawn from a distribution) methods. The fixed–effects estimator for β, also known as the within-groups or covariance estimator, can be computed as a simple regression on the transformed equation (6B.1), which is obtained by averaging (6.1) across j for every i and subtracting the result from (6.1), resulting in

$$y_{ij} - \bar{y}_i = (X_{ij} - \bar{X}_i)\beta + (\eta_{ij} - \bar{\eta}_i), \qquad (6B.1)$$

where $\bar{y}_i = (1/n_j)\sum_j y_{ij}$ and similarly for $\bar{X}_i$ and $\bar{\eta}_i$. Now $\alpha_i$ and $Z_i\gamma$ drop out, and thus γ is not identifiable from (6B.1). An alternative would be to replace all group variables by dummy variables and apply OLS to the equation $y_{ij} = \sum_{g=1}^{n} \alpha_g d_{g,ij} + X_{ij}\beta + \eta_{ij}$, where $d_{g,ij} = 1$ if $g = i$ and 0 otherwise. The resulting estimator for β is known as the least squares dummy variable (LSDV) estimator and is numerically identical to the fixed–effects estimator for β from (6B.1). For consistent and unbiased estimation in (6B.1) the ordinary least squares (OLS) estimator can be used if the constructed regressors $X_{ij} - \bar{X}_i$ are independent of the constructed error $\eta_{ij} - \bar{\eta}_i$. This implies that $E(X_{ij}\eta_{il}) = 0$ for all i, j, l.
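The equivalence between the within estimator from (6B.1) and the LSDV estimator can be checked numerically. The sketch below uses a small simulated data set with one level-1 regressor and no level-2 regressors; the design and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 30, 5                                            # groups, observations per group
alpha = rng.normal(size=n)
x = 0.5 * alpha[:, None] + rng.normal(size=(n, m))      # x correlated with alpha
y = 2 * x + alpha[:, None] + rng.normal(size=(n, m))

# Within estimator: OLS on deviations from the group means, as in (6B.1)
xw = x - x.mean(axis=1, keepdims=True)
yw = y - y.mean(axis=1, keepdims=True)
b_within = np.sum(xw * yw) / np.sum(xw**2)

# LSDV estimator: OLS on group dummies plus x
D = np.kron(np.eye(n), np.ones((m, 1)))                 # (n*m) x n dummy matrix
X_lsdv = np.column_stack([D, x.ravel()])
coef = np.linalg.lstsq(X_lsdv, y.ravel(), rcond=None)[0]
alpha_hat, b_lsdv = coef[:n], coef[-1]

print("within slope:", b_within)
print("LSDV slope  :", b_lsdv)
```

The dummy coefficients returned by LSDV equal the group means of y minus the slope times the group means of x, which is how the group effects are recovered after a within regression.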

The random–effects estimator provides an important alternative under the assumption that the αi’s are i.i.d. random variables. Now $\varepsilon_{ij} = \alpha_i + \eta_{ij}$ in (6.1) is the composite (random) error term. The OLS estimator for β and γ is consistent and unbiased, but not fully efficient. Combining all observations we can rewrite model (6.1) as

$$y = X\beta + Z\gamma + \varepsilon = W\delta + \varepsilon, \qquad (6B.2)$$

where $W = [X : Z]$ and $\delta = (\beta', \gamma')'$, and the other symbols are defined accordingly by stacking. For known $\Omega$, where $\Omega = \mathrm{Var}(\varepsilon_i)$ and $\varepsilon_i = (\varepsilon_{i1}, ..., \varepsilon_{in_j})'$, the generalized least-squares (GLS) estimator for β and γ, given by $\hat\delta^{GLS}_n = (W'(I_n \otimes \Omega^{-1})W)^{-1}W'(I_n \otimes \Omega^{-1})y$, is efficient. However, when $\Omega$ is not known, it needs to be estimated, yielding a feasible GLS estimator. A feasible GLS estimator can be obtained in several ways. We use the method explained in Verbeek (2000, p. 317). The GLS estimator can be shown to be equal to a weighted average of the fixed–effects estimator computed from (6B.1) and the so-called between estimator, which is the OLS estimator in the model

$$\bar{y}_i = \bar{X}_i\beta + Z_i\gamma + \alpha_i + \bar{\eta}_i \qquad (6B.3)$$

for i = 1, ..., n. The latter estimator ignores the within–group information and exploits only differences between groups. For more details on the computation of the weighting matrix, see Verbeek (2000), Hsiao (1986), or Baltagi (2001). Several other random–effects estimation procedures for model (6.1) are available, which include the iterative GLS (IGLS) approach, (restricted) maximum likelihood (REML), or Bayesian procedures (see Goldstein, 1995, or Longford, 1993).

From standard OLS results, it follows that the between estimator for β and γ from (6B.3) is consistent and unbiased when the constructed regressors $\bar{X}_i$ and $Z_i$ are independent of $\alpha_i$ and $\bar{\eta}_i$. The fixed–effects estimator from (6B.1) is consistent and unbiased when $E(X_{ij}\eta_{il}) = 0$ for all i, j, l. If both conditions hold, the random–effects estimator for β and γ is consistent and unbiased.

In the simulation studies, the variance of $\eta_{ij}$ was estimated as the sum of the squared residuals from model (6B.1) divided by $n(m-1) - (k+l)$, where $n_j = m$ for all j in our case. The variance of $\alpha_i$ is estimated as $\hat\sigma^2_\alpha = \hat\sigma^2_B - \frac{1}{m}\hat\sigma^2_\eta$, where $\hat\sigma^2_B$ is estimated as the sum of the squared residuals from (6B.3) divided by n.

Multilevel instrumental variables estimators

Two IV estimators that yield unbiased estimation of the parameters in model (6B.2) in the presence of Xη–dependencies are the multilevel 2SLS estimator, given by

$$\hat\delta^{2SLS}_n = (W'P_V W)^{-1}W'P_V y, \qquad (6B.4)$$

where $P_V = V(V'V)^{-1}V'$, and the multilevel 3SLS estimator, given by

$$\hat\delta^{3SLS}_n = (W'\tilde{P}_V W)^{-1}W'\tilde{P}_V y, \qquad (6B.5)$$

with $\tilde{P}_V = V(V'(I_n \otimes \Omega)V)^{-1}V'$, where $\Omega$ can be estimated from the residuals from a 2SLS estimation. As in appendix 6A, V is a set of (external) instruments. For more details, see Wooldridge (2002), Im et al. (1999), or Bowden and Turkington (1984).

Appendix 6C Lewbel’s instruments in a simple multilevel model

In this appendix we extend Lewbel’s method (Lewbel, 1997) to a non-measurement error, multilevel application. We argue that, under certain conditions, instruments can be constructed from the observed data that can be used with, for instance, two-stage least squares to obtain a consistent estimate for the regression parameters. We apply the results derived here in subsection 6.5.2. Consider the following multilevel model:

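$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \alpha_i + \eta_{ij} \tag{6C.1}$$

$$x_{ij} = \theta_{ij} + \nu_{ij}, \tag{6C.2}$$

where $i = 1, \ldots, n$, $j = 1, \ldots, m$, $\alpha_i \sim N(0, \sigma^2_\alpha)$, $\eta_{ij} \sim N(0, \sigma^2_\eta)$, and $\nu_{ij} \sim N(0, \sigma^2_\nu)$. The unobserved component $\theta_{ij}$ has mean $\mu_\theta$, variance $\sigma^2_\theta$, and is independent of the $\alpha_i$, $\eta_{ij}$, and $\nu_{ij}$. Furthermore, we assume that $E\,\eta_{ij}\alpha_i = 0$. The regressor $x_{ij}$ cannot be assumed independent of the random components, i.e. $E\,\nu_{ij}\eta_{ij} = \sigma_{\nu\eta} \neq 0$ and/or $E\,\nu_{ij}\alpha_i = \sigma_{\nu\alpha} \neq 0$, $\forall i, j$. Let

$$\begin{aligned}
\tilde y_{ij} &= y_{ij} - E\,y_{ij} \\
\tilde x_{ij} &= x_{ij} - E\,x_{ij} \\
\tilde\theta_{ij} &= \theta_{ij} - E\,\theta_{ij},
\end{aligned} \tag{6C.3}$$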


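and consider the following transformation of the model given in (6C.1) and (6C.2) (i.e. subtract $E\,y_{ij}$ and $E\,x_{ij}$ from (6C.1) and (6C.2)):

$$\tilde y_{ij} = \beta_1 \tilde x_{ij} + \alpha_i + \eta_{ij} \tag{6C.4}$$

$$\tilde x_{ij} = \tilde\theta_{ij} + \nu_{ij}. \tag{6C.5}$$

The reduced form is given by

$$\tilde y_{ij} = \beta_1\tilde\theta_{ij} + \beta_1\nu_{ij} + \alpha_i + \eta_{ij} \tag{6C.6}$$

$$\tilde x_{ij} = \tilde\theta_{ij} + \nu_{ij}. \tag{6C.7}$$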

OLS (or GLS) estimation in (6C.4) fails as $\tilde x$ is not independent of the random components. Lewbel proposes the following instrumental variables:

$$z_{1ij} = \tilde x_{ij}\,\tilde y_{ij} \tag{6C.8}$$

$$z_{2ij} = \tilde y^2_{ij} \tag{6C.9}$$

$$z_{3ij} = \tilde x^2_{ij} \tag{6C.10}$$

These IVs should satisfy the following conditions:

1. $E\,z_{sij}\alpha_i = 0$;

2. $E\,z_{sij}\eta_{ij} = 0$;

3. $E\,z_{sij}\tilde x_{ij} \neq 0$,

with $s = 1, 2, 3$. In the following we check these conditions for $s = 1$.

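Proof: $E\,z_{1ij}\alpha_i = 0$

We have

$$\begin{aligned}
E\{z_{1ij}\alpha_i\} = E\{\tilde x_{ij}\tilde y_{ij}\alpha_i\} = E\{&\beta_1\tilde\theta^2_{ij}\alpha_i + \beta_1\tilde\theta_{ij}\nu_{ij}\alpha_i + \tilde\theta_{ij}\alpha^2_i + \tilde\theta_{ij}\eta_{ij}\alpha_i + \beta_1\nu_{ij}\tilde\theta_{ij}\alpha_i \\
&+ \beta_1\nu^2_{ij}\alpha_i + \nu_{ij}\alpha^2_i + \nu_{ij}\eta_{ij}\alpha_i\}.
\end{aligned} \tag{6C.11}$$

Since $\theta_{ij}$ is independent of $\alpha_i$, $\eta_{ij}$, and $\nu_{ij}$, so are $\tilde\theta_{ij}$ and $\tilde\theta^2_{ij}$. Hence, the expected value of $\tilde\theta_{ij}$ and $\tilde\theta^2_{ij}$ multiplied with $\alpha_i$, $\eta_{ij}$, or $\nu_{ij}$ is zero. Thus we arrive at

$$E\{z_{1ij}\alpha_i\} = E\{\beta_1\nu^2_{ij}\alpha_i + \nu_{ij}\alpha^2_i + \nu_{ij}\eta_{ij}\alpha_i\}. \tag{6C.12}$$

Using the law of iterated expectations, $E(x) = E\{E(x|y)\}$, and the standard result for bivariate normally distributed variables that $L(y|x) = N(\gamma_0 + \gamma_1 x, \sigma^2_y(1-\rho^2))$ with $\gamma_0 = \mu_y - \gamma_1\mu_x$, $\gamma_1 = \sigma_{xy}/\sigma^2_x$, it follows that the expectation in (6C.12) is zero. For instance, $E\{\beta_1\nu^2_{ij}\alpha_i\} = E\{E(\beta_1\nu^2_{ij}\alpha_i|\nu_{ij})\} = E\{\beta_1\nu^2_{ij}E(\alpha_i|\nu_{ij})\} = E\{\beta_1\nu^2_{ij}(\sigma_{\alpha\nu}/\sigma^2_\nu)\nu_{ij}\} = E\{c\,\nu^3_{ij}\} = 0$, since the third moment of a zero-mean normally distributed random variable is zero. For $E\{\nu_{ij}\eta_{ij}\alpha_i\}$ use e.g. $E\{E(\nu_{ij}\eta_{ij}\alpha_i|\nu_{ij}, \eta_{ij})\} = E\{\nu_{ij}\eta_{ij}E(\alpha_i|\nu_{ij})\}$, since $\alpha_i$ and $\eta_{ij}$ are independent, etc.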

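Proof: $E\,z_{1ij}\eta_{ij} = 0$

Similar arguments as in the previous case yield

$$\begin{aligned}
E\{z_{1ij}\eta_{ij}\} = E\{\tilde x_{ij}\tilde y_{ij}\eta_{ij}\} = E\{&\beta_1\tilde\theta^2_{ij}\eta_{ij} + \beta_1\tilde\theta_{ij}\nu_{ij}\eta_{ij} + \tilde\theta_{ij}\eta_{ij}\alpha_i + \tilde\theta_{ij}\eta^2_{ij} + \beta_1\nu_{ij}\tilde\theta_{ij}\eta_{ij} \\
&+ \beta_1\nu^2_{ij}\eta_{ij} + \nu_{ij}\alpha_i\eta_{ij} + \nu_{ij}\eta^2_{ij}\} = 0.
\end{aligned} \tag{6C.13}$$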

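Proof: $E\,z_{1ij}\tilde x_{ij} \neq 0$

Finally, for the third condition we get

$$\begin{aligned}
E\{z_{1ij}\tilde x_{ij}\} = E\{\tilde y_{ij}\tilde x^2_{ij}\} = E\{&\beta_1\tilde\theta^3_{ij} + \beta_1\nu_{ij}\tilde\theta^2_{ij} + \alpha_i\tilde\theta^2_{ij} + \eta_{ij}\tilde\theta^2_{ij} + 2\beta_1\tilde\theta^2_{ij}\nu_{ij} + 2\beta_1\nu^2_{ij}\tilde\theta_{ij} \\
&+ 2\tilde\theta_{ij}\nu_{ij}\alpha_i + 2\tilde\theta_{ij}\nu_{ij}\eta_{ij} + \beta_1\nu^2_{ij}\tilde\theta_{ij} + \beta_1\nu^3_{ij} + \nu^2_{ij}\alpha_i + \eta_{ij}\nu^2_{ij}\} \\
= \beta_1 E\,\tilde\theta^3_{ij}&,
\end{aligned} \tag{6C.14}$$

which is unequal to zero as long as the third-order (central) moment of $\theta$ does not vanish and $\beta_1 \neq 0$.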

In order to estimate $\beta_1$ (say $\hat\beta^{Lewbel}_{1n}$), one has to obtain a consistent estimate for $E\,y_{ij}$ and $E\,x_{ij}$ ($= E\,\theta_{ij}$) first, to make the $z_{1ij}$’s operational. These estimates can be easily obtained from the sample means of $y$ and $x$. An estimate for $\beta_0$ can be computed using $E\,y_{ij} = \beta_0 + \beta_1 E\,x_{ij}$, from which we get $\hat\beta^{Lewbel}_{0n} = \bar y - \hat\beta^{Lewbel}_{1n}\bar x$.

Note that for the instruments $z_{2ij} = \tilde y^2_{ij}$ and $z_{3ij} = \tilde x^2_{ij}$ similar calculations can be done.
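The construction in (6C.8) can be tried out on simulated data. The sketch below is only an illustration under assumptions chosen here: an exponential (hence skewed) $\theta_{ij}$, a correlation of 0.5 between $\nu_{ij}$ and $\eta_{ij}$, and a simple just-identified IV estimator $\sum z_{1ij}\tilde y_{ij} / \sum z_{1ij}\tilde x_{ij}$ in place of a full 2SLS routine. None of these choices are taken from subsection 6.5.2.

```python
import math
import random

random.seed(1)
n_groups, m = 20_000, 5
beta0, beta1 = 1.0, 2.0
rho = 0.5                       # corr(nu, eta): the source of endogeneity

xs, ys = [], []
for _ in range(n_groups):
    alpha = random.gauss(0.0, 0.5)                   # group effect alpha_i
    for _ in range(m):
        theta = random.expovariate(0.5)              # skewed, third central moment 16
        nu = random.gauss(0.0, 1.0)
        eta = rho * nu + math.sqrt(1 - rho ** 2) * random.gauss(0.0, 1.0)
        x = theta + nu
        xs.append(x)
        ys.append(beta0 + beta1 * x + alpha + eta)

N = len(xs)
x_bar, y_bar = sum(xs) / N, sum(ys) / N
xt = [x - x_bar for x in xs]                         # x tilde
yt = [y - y_bar for y in ys]                         # y tilde
z1 = [a * b for a, b in zip(xt, yt)]                 # instrument (6C.8)

# Simple IV estimator with the single constructed instrument z1.
beta1_iv = sum(z * y for z, y in zip(z1, yt)) / sum(z * x for z, x in zip(z1, xt))
beta0_iv = y_bar - beta1_iv * x_bar
# OLS slope for comparison; it ignores the correlation between x and the errors.
beta1_ols = sum(a * b for a, b in zip(xt, yt)) / sum(a * a for a in xt)
print(beta1_iv, beta0_iv, beta1_ols)
```

The instrument is only informative because the simulated $\theta_{ij}$ is skewed; with a symmetric $\theta_{ij}$ the denominator would have expectation zero, in line with (6C.14).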

Chapter 7

A Nonparametric Bayesian LIV approach

7.1 Introduction

In this chapter we introduce two important extensions of the standard LIV model presented in chapters 3 and 4. Firstly, we generalize the multinomial distribution of the latent instrument to a general distribution $G$. Secondly, we consider endogeneity in two commonly used multilevel models and we suggest how the method and results from this chapter may improve on the Hausman-Taylor approach discussed in the previous chapter. Besides, the method proposed here can be applied to a situation with more than one endogenous variable. We present a Bayesian framework that can be applied to a wide variety of models and allows for exact finite-sample inference. The methodological results and the simulation studies that we present are promising, yet further research is required and we suggest several steps for that.

The latent instrumental variable (LIV) model introduced in the previous chapters can be used to estimate linear regression models where one regressor is correlated with the error term. The LIV approach assumes the existence of a discrete instrument with unobserved category membership. Hence, the latent instrument has a multinomial distribution and the unconditional probability distribution of the observations $(y, x)$ is a mixture of $m$ bivariate normal distribution functions, assuming normally distributed error terms. Hence, the LIV method does not require the availability of observed instrumental variables to estimate the regression parameters if regressor-error correlations are suspected. This is an advantage over classical instrumental variables estimation since in many applications instruments are not available. Besides, available instruments may be of bad quality (weak or endogenous) and, hence, substantive conclusions for the same phenomenon may be different when other instruments are used (see e.g. chapters 2 and 5).

In this chapter we relax the assumption that the unobserved instrument has a multinomial distribution with $m$ categories. Instead, we let the data determine the ‘best’ distribution $G$ of the unobserved instrument. As such, this approach is potentially more efficient than the standard LIV model because the distribution of the instrument is fully estimated from the data and is not limited to an assumed multinomial distribution with $m$ groups. Besides, the number of categories of the unobserved instrument is an unknown parameter that is determined by the data. So, no tests for the number of groups are required. Furthermore, we consider endogeneity issues in more general multilevel models. Ebbes, Bockenholt and Wedel (2004) put forward that endogeneity in multilevel models is more complex because of the presence of random components at various levels. Traditional methods (fixed-effects estimation, random-effects estimation, Mundlak’s approach, or the Hausman-Taylor estimator) are shown to be limited in various ways and they present a list of open problems, some of which will be addressed here (see also chapter 6).

We propose a nonparametric Bayesian approach to estimate the regression parameters and the distribution of the unobserved instrument simultaneously. Nonparametric Bayes models have originally been proposed to alleviate the parametric assumptions often made in standard hierarchical linear models (Escobar, 1994, Escobar and West, 1998, Ibrahim and Kleinman, 1998). At the heart of hierarchical models are assumptions on the distributions of various model parameters, which can often be questioned, and results are found to be sensitive to assumed forms of the distributions. Nonparametric Bayesian models provide a way to alleviate these parametric assumptions by using Dirichlet processes. A Dirichlet process is used as a prior distribution on the family of distributions, i.e. a prior distribution is specified on the space of all possible distribution functions. Before Escobar’s (1994) results on nonparametric Bayes models, applications of these methods were limited because of computational difficulties. However, he solved this by developing an MCMC sampling algorithm that is fairly easy to implement. Since then several studies have used nonparametric Bayes techniques in, for instance, density estimation or hierarchical modeling. Dey, Muller and Sinha (1998) give an overview of recent developments in this area.

One advantage of using a Bayesian nonparametric approach to model the distribution of the unobserved instrument is that it bypasses the need to determine the correct number of mixing components post hoc, while retaining the ability to recover a variety of distributions in a unified modeling framework (cf. Kim, Menzefricke and Feinberg, 2004). I.e., whatever the true distribution of the instruments is, the Bayes estimate converges to it (Ferguson, 1973). Antoniak (1974) shows that clustering is inherent to the Dirichlet process and there is a positive probability that the number of support points found is smaller than the number of sample points.

In section 7.2 we briefly introduce the nonparametric Bayes approach for a simple multilevel model with a latent instrument to solve for level-1 dependencies. We discuss model specification, the Dirichlet process prior, and we present an estimation scheme. Next, we extend this idea to a general hierarchical model where individual level covariates may be endogenous (i.e. level-2 endogeneity). We show that an unobserved instrument with a Dirichlet process can be incorporated in a similar way as for the simple multilevel model. We present simulation results in section 7.4 and the conclusions and a discussion of the results found are presented in section 7.5. Besides, we propose steps for further research.


7.2 A simple multilevel model with a general latent instrument

We consider the following simple multilevel model

$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}, \tag{7.1}$$

where $i = 1, \ldots, n$, $j = 1, \ldots, m_i$, and $n_t = \sum_i m_i$ is the total number of observations. The error term $\varepsilon_{ij}$ has mean zero and variance $\sigma^2_\varepsilon$. We do not assume that $E(x_{ij}\varepsilon_{ij})$ equals zero. For the sake of simplicity, we omit further regressors and random terms, but these can be easily included since the Dirichlet process for the unobserved instrument is unaffected and MCMC estimation is conditional on all other parameters and observations. In the terminology of chapter 6, we consider here level-1 dependencies. Compared with level-2 dependencies, level-1 dependencies are more difficult to address, since potential remedies require the availability of external instrumental variables that may not be available or may be of bad quality (subsection 6.5.1).

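The endogeneity of $x_{ij}$ is modeled as follows

$$\begin{aligned}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij} \\
x_{ij} &= \theta_{ij} + \nu_{ij},
\end{aligned} \tag{7.2}$$

where $\theta_{ij}$ is the unobserved instrument and a ‘nuisance’ parameter. $x_{ij}$ is endogenous when $E(x_{ij}\varepsilon_{ij}) \neq 0$. We assume that the endogenous regressor $x_{ij}$ can be split up in an exogenous part $\theta_{ij}$ and an endogenous part $\nu_{ij}$ where the latter is correlated with $\varepsilon_{ij}$. Furthermore, we assume that $\varepsilon_{ij}$ and $\nu_{ij}$ have mean zero and variance-covariance matrix

$$\Sigma = \begin{bmatrix} \sigma^2_\varepsilon & \sigma_{\varepsilon\nu} \\ \sigma_{\varepsilon\nu} & \sigma^2_\nu \end{bmatrix}. \tag{7.3}$$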

Instead of assuming a discrete distribution with $k$ categories for $\theta_{ij}$, we make a very general assumption and assume a Dirichlet process for the distribution of $\theta_{ij}$ (Ferguson, 1973, Antoniak, 1974). To be more specific, the unobserved instruments are independently and identically distributed as $G$, where we do not assume a specific parametric form for $G$. Instead, it has a Dirichlet process prior, denoted by $DP(\alpha, G_0)$, where $\alpha > 0$ is a concentration parameter and $G_0$ is the baseline prior distribution with density $g_0$. A Dirichlet prior places a distribution (probability measure) on the space of all distribution functions for $G$. For large values of $\alpha$, $G$ is very likely to be close to $G_0$, whereas for small values of $\alpha$, the mass of $G$ is likely to be concentrated on a few atoms. In fact, the support of the Dirichlet process is the class of all distribution functions, and is, hence, very large. The nonparametric model allows the data to adopt a $G$ that is skewed, has ‘shoulders’, is multimodal, or has any general shape different from $G_0$ (cf. MacEachern, 1998). We present the definition and some technical details on the Dirichlet process in appendix 7A. For more details on recent results, estimation, and applications, see Dey, Muller and Sinha (1998).

The distribution of $\theta_{ij}$ is specified as follows

$$\begin{aligned}
\theta_{ij}|G &\sim G \\
G|\alpha, \lambda &\sim DP(\alpha, G_0(\cdot|\lambda)),
\end{aligned} \tag{7.4}$$

where the $\theta_{ij}$’s are independent and $G_0$ is the baseline prior defined by the parameter $\lambda$. As stated above, whatever the true distribution $G$ is, its nonparametric Bayes estimate converges to it, and the choice for the distribution $G_0$ is not critical. We take for $G_0$ a normal distribution with mean $\mu_g$ and variance $\sigma^2_g$. The Dirichlet process prior for $G$ is a probability distribution on the space of all possible distributions for the unobserved instrument, and $G_0$ can be seen as the location parameter (see also appendix 7A). The parameter $\alpha$ acts as a precision parameter (Escobar, 1994, Escobar and West, 1998): when $\alpha$ is very large, the distribution $G$ for $\theta_{ij}$ is very close to the baseline prior $G_0$, and when $\alpha$ is small, $G$ is not and is likely to concentrate on a few distinct atoms. The expected number of distinct values of $\theta_{ij}$ is approximately equal to $\alpha\log[(n+\alpha)/\alpha]$ (Antoniak, 1974, or Escobar, 1994). This number, however, is unknown and estimated from the data.
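To get a feeling for the role of $\alpha$, the approximation $\alpha\log[(n+\alpha)/\alpha]$ can be compared with the exact prior expectation $\sum_{i=1}^{n}\alpha/(\alpha+i-1)$ of the number of distinct values among $n$ draws from a Dirichlet process. The sketch below is an illustration only; the function names and the chosen values of $\alpha$ and the sample size are assumptions made here.

```python
import math

def expected_clusters_exact(alpha, n_obs):
    """Exact prior expectation: sum_{i=1}^{n} alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i) for i in range(n_obs))

def expected_clusters_approx(alpha, n_obs):
    """Approximation alpha * log((n + alpha) / alpha) quoted in the text."""
    return alpha * math.log((n_obs + alpha) / alpha)

for alpha in (0.1, 1.0, 5.0):
    exact = expected_clusters_exact(alpha, 1000)
    approx = expected_clusters_approx(alpha, 1000)
    print(alpha, round(exact, 2), round(approx, 2))
```

Both expressions grow only logarithmically in the sample size, which is why a moderate $\alpha$ implies a small number of clusters a priori even for large samples.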


Throughout the following we assume that the errors are normally distributed. Furthermore, let $z_{ij} = (y_{ij}, x_{ij})'$, $\mu^s_{z,ij} = (\mathbf{x}_{ij}\beta, \theta_{ij})'$, where $\mathbf{x}_{ij} = (1, x_{ij})$, $\beta = (\beta_0, \beta_1)'$, and $s$ stands for ‘structural’. The density function (likelihood) for the structural model given in (7.2) is given by

$$p(z_{ij}|\theta_{ij}, \beta, \Sigma) = (2\pi)^{-1}|\Sigma|^{-1/2}\exp\left(-\tfrac{1}{2}(z_{ij} - \mu^s_{z,ij})'\Sigma^{-1}(z_{ij} - \mu^s_{z,ij})\right), \tag{7.5}$$

which can also be written in the reduced form as

$$p(z_{ij}|\theta_{ij}, \beta, \Sigma) = (2\pi)^{-1}|\Omega|^{-1/2}\exp\left(-\tfrac{1}{2}(z_{ij} - \mu^r_{z,ij})'\Omega^{-1}(z_{ij} - \mu^r_{z,ij})\right), \tag{7.6}$$

where $\mu^r_{z,ij} = (\beta_0 + \beta_1\theta_{ij}, \theta_{ij})'$ and $\Omega = B\Sigma B'$ with

$$B = \begin{bmatrix} 1 & \beta_1 \\ 0 & 1 \end{bmatrix}. \tag{7.7}$$
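That (7.5) and (7.6) describe the same density follows from $z_{ij} - \mu^r_{z,ij} = B(z_{ij} - \mu^s_{z,ij})$ and $|B| = 1$. A small numerical check, with arbitrary parameter values and helper names chosen here for illustration, could look as follows.

```python
import math

def log_density(resid, S):
    """Log of a bivariate normal density with covariance S, evaluated at residual resid."""
    (a, b), (_, d) = S
    det = a * d - b * b
    # resid' S^{-1} resid for a symmetric 2 x 2 matrix S = [[a, b], [b, d]]
    quad = (d * resid[0] ** 2 - 2 * b * resid[0] * resid[1] + a * resid[1] ** 2) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

beta0, beta1 = 1.0, 2.0
s_ee, s_en, s_nn = 1.0, 0.4, 0.8            # elements of Sigma in (7.3)
Sigma = [[s_ee, s_en], [s_en, s_nn]]
# Omega = B Sigma B' with B = [[1, beta1], [0, 1]], written out element by element
Omega = [[s_ee + 2 * beta1 * s_en + beta1 ** 2 * s_nn, s_en + beta1 * s_nn],
         [s_en + beta1 * s_nn, s_nn]]

y, x, theta = 3.7, 1.5, 0.9                 # one arbitrary observation and instrument value
structural = log_density((y - beta0 - beta1 * x, x - theta), Sigma)      # (7.5)
reduced = log_density((y - beta0 - beta1 * theta, x - theta), Omega)     # (7.6)
print(structural, reduced)
```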

In the next subsection we discuss the Dirichlet process prior for $\theta_{ij}$ in more detail.

7.2.1 The Dirichlet process prior for $\theta_{ij}$

Conditional on $\beta$, $\Sigma$, $G_0$, $\alpha$, and $z_{ij} = (y_{ij}, x_{ij})'$, the Dirichlet process prior for $\theta_{ij}$ implies the following conditional posterior distribution (see e.g. Escobar, 1994, Escobar and West, 1998)

$$p(\theta_{ij}|\theta_{-ij}, \beta, \Sigma, \alpha, G_0, z_{ij}) \sim q_0(ij)\,h(\theta_{ij}|\beta, \Sigma, G_0, z_{ij}) + \sum_{l=1}^{n}\sum_{\substack{k=1 \\ lk \neq ij}}^{m_l} q_{lk}(ij)\,\delta_{\theta_{lk}}(\theta_{ij}), \tag{7.8}$$

where $\theta_{-ij}$ denotes the $(n_t - 1) \times 1$ vector obtained from $\theta = (\theta_{11}, \ldots, \theta_{n m_n})'$ with the $ij$-th element deleted, and $\delta_{\theta_{lk}}(\theta_{ij}) = 1$ if $\theta_{ij} = \theta_{lk}$ and zero otherwise. Furthermore,

• $h(\theta_{ij}|\beta, \Sigma, G_0, z_{ij}) \propto p(z_{ij}|\theta_{ij}, \beta, \Sigma)\,g_0(\theta_{ij}|\mu_g, \sigma^2_g)$ is the density of the ‘baseline’ posterior distribution for $\theta_{ij}$;

• $q_0(ij) \propto \alpha\int p(z_{ij}|\theta_{ij}, \beta, \Sigma)\,g_0(\theta_{ij}|\mu_g, \sigma^2_g)\,d\theta_{ij}$, i.e. $q_0(ij)$ is proportional to $\alpha$ times the marginal distribution of $z_{ij}$ where $\theta_{ij}$ is integrated out under the baseline prior $G_0$;

• $q_{lk}(ij) \propto p(z_{ij}|\beta, \Sigma, \theta_{lk})$, i.e. $q_{lk}(ij)$ is proportional to the density of $z_{ij}$ conditional on $\theta_{ij} = \theta_{lk}$;

• and $q_0(ij) + \sum_{l=1}^{n}\sum_{k=1,\, lk \neq ij}^{m_l} q_{lk}(ij) = 1$, i.e. $q_0(ij)$ and the $q_{lk}(ij)$ are normalized to sum to 1.

The intuition behind the above scheme is as follows. If observation $ij$ has a relatively large (small) residual using observation $lk$’s value $\theta_{lk}$, then it is relatively less (more) likely that observation $ij$’s value for the latent instrument is chosen as $\theta_{lk}$. Furthermore, the smaller the residual for observation $ij$, while assuming its value for the latent instrument is $\mu_g$, the greater the probability that observation $ij$ gets a new value for its latent instrument. In fact, at this point $G$ has been integrated out and it is conceptually easy to sample from the above distribution using the following scheme:

$$\theta_{ij}|\theta_{-ij}, \beta, \Sigma, \alpha, G_0, z_{ij}\;\begin{cases} = \theta_{lk} & \text{with probability } q_{lk}(ij) \\ \sim h(\theta_{ij}|\beta, \Sigma, G_0, z_{ij}) & \text{with probability } q_0(ij). \end{cases} \tag{7.9}$$

The ease of implementation, however, depends on whether the likelihood is easy to evaluate, whether the density $h(\theta_{ij}|\beta, \Sigma, G_0, z_{ij})$ is of manageable form, and whether $q_0(ij)$ can be easily computed. When $G_0$ is a conjugate prior, which is often not a strong assumption, the marginal distribution of $z_{ij}$ is known analytically for the computation of $q_0(ij)$. Otherwise it may be possible to compute it numerically, or to apply methods that circumvent the computation of $q_0$, see Escobar and West (1998).


Antoniak (1974) shows that clustering of the $\theta_{ij}$’s is inherent to the Dirichlet process. I.e., the values $\theta_{ij}$ typically reduce to $n^{*} < n_t$ distinct values or clusters. These values are drawn from $G_0$, which can be seen using the Polya urn$^{1}$ representation of the Dirichlet process (Escobar, 1994 or MacEachern, 1998). It is this description of the vector $\theta$ that leads to the view of the mixture of Dirichlet process model as a mixture model (cf. MacEachern, 1998). We denote these $n^{*}$ distinct values of $\theta_{ij}$ by $\theta_s$, $s = 1, \ldots, n^{*}$. In the following, the superscript “$-$” denotes that observation $ij$ is left out of the conditioning, i.e. $n^-_s \leq n_t - 1$ is the number of observations in ‘cluster’ $s$ with common value $\theta_s$ without observation $ij$, and $n^-$ is the number of distinct clusters when observation $ij$ is removed (see also MacEachern, 1998). Now the conditional posterior distribution of $\theta_{ij}$ is given by

$$p(\theta_{ij}|\theta_{-ij}, \beta, \Sigma, \alpha, G_0, z_{ij}) \sim q_0(ij)\,h(\theta_{ij}|\beta, \Sigma, G_0, z_{ij}) + \sum_{l=1}^{n^-} n^-_l\, q_l(ij)\,\delta_{\theta_l}(\theta_{ij}), \tag{7.10}$$

where $q_l(ij)$ is proportional to the likelihood $p(z_{ij}|\beta, \Sigma, \theta_l)$ given that observation $ij$ is in cluster $l$ and $\delta_{\theta_l}(\theta_{ij}) = 1$ if $\theta_{ij} = \theta_l$ and 0 otherwise. The other elements are defined as before. From (7.10) it can be seen that the full conditional posterior distribution of $\theta_{ij}$ is a mixture of a continuous distribution and a discrete distribution with weights on the distinct values for the latent instrument (excluding $\theta_{ij}$).

We assume that the baseline distribution $G_0$ of the Dirichlet process is a univariate normal distribution with unknown mean $\mu_g$ and variance $\sigma^2_g$. We take conjugate priors for both parameters, i.e. a normal distribution for $\mu_g$ and an inverse gamma distribution for $\sigma^2_g$. The parameter $\alpha$ (the dispersion parameter for the Dirichlet process) is an important parameter since it determines how ‘close’ the unknown distribution $G$ gets to $G_0$. If this value is unknown, as for most applications, a prior distribution $p(\alpha)$ can be specified and the data is used to learn about this parameter, see Escobar and West (1995), West (1992), Escobar (1994). We follow West (1995) and specify a gamma distribution as prior for $\alpha$. Finally, for $\beta$ and $\Sigma$ we take a normal distribution and an inverted two-dimensional Wishart distribution, respectively, as priors. The parameters of the prior distributions are known and are specified such that they reflect ‘vague’ or little information.

$^{1}$ Simply stated, the Polya urn problem concerns an urn that contains (say) $c_i$ balls of color $i$. Each time, a ball is randomly taken from the urn and the probability that it has color $i$ is proportional to the number of balls of that color in the urn. Then it is replaced along with another ball of the same color, etc.

The complete nonparametric Bayes LIV model is rather complex and it is not possible to derive closed form expressions for the joint and marginal posterior distributions of the parameters $\beta$ and $\Sigma$, and the other parameters $\mu_g$, $\sigma^2_g$, $\theta$, and $\alpha$. However, Escobar’s (1994) MCMC results can be used to approximate the nonparametric Bayes model, where the full conditional distribution of the $\theta_{ij}$’s is a mixture of a discrete distribution with weights on the other $\theta_{lk}$’s, $lk \neq ij$, and $G_0$, see (7.8) and (7.10). His work has been modified and extended in several ways and the resulting chains are relatively easy to implement and have been empirically shown to move quickly through the parameter space (Dey, Muller and Sinha, 1998). The full conditional distributions of the other parameters can be derived straightforwardly when the priors are conjugate. We discuss this in more detail next.

7.2.2 MCMC estimation

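Depending on the form of the full conditional we use either (7.5) or (7.6) for the likelihood. The likelihood of the complete sample is computed as $\prod_{i=1}^{n}\prod_{j=1}^{m_i} p(z_{ij}|\theta_{ij}, \beta, \Sigma)$. We use the following expression for $\Sigma^{-1}$:

$$\Sigma^{-1} = \begin{bmatrix} \dfrac{\sigma^2_\nu}{|\Sigma|} & \dfrac{-\sigma_{\varepsilon\nu}}{|\Sigma|} \\[2mm] \dfrac{-\sigma_{\varepsilon\nu}}{|\Sigma|} & \dfrac{\sigma^2_\varepsilon}{|\Sigma|} \end{bmatrix} = \begin{bmatrix} \sigma^{(11)} & \sigma^{(12)} \\ \sigma^{(21)} & \sigma^{(22)} \end{bmatrix}, \tag{7.11}$$

where $\sigma^{(12)} = \sigma^{(21)}$, and $|\Sigma| = \sigma^2_\varepsilon\sigma^2_\nu - (\sigma_{\varepsilon\nu})^2$. The joint posterior distribution for the parameters $(\beta, \Sigma, \mu_g, \sigma^2_g, \alpha)$ is proportional to the likelihood times the priors, i.e.

$$\begin{aligned}
p(\beta, \Sigma, \mu_g, \sigma^2_g, \alpha) \propto{} & \prod_{i=1}^{n}\prod_{j=1}^{m_i} p(z_{ij}|\theta_{ij}, \beta, \Sigma) \times p(\theta_{ij}|G) \times p(G|G_0, \alpha) \\
& \times p(\alpha) \times p(\mu_0) \times p(\Sigma_0) \times p(\beta) \times p(\Sigma),
\end{aligned} \tag{7.12}$$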

where $p(z_{ij}|\theta_{ij}, \beta, \Sigma)$ is given in (7.5) or (7.6). This joint posterior density is intractable analytically but Markov Chain Monte Carlo (MCMC) methods can be used to generate random draws indirectly without having to calculate the joint posterior explicitly. The MCMC chain is implemented via the following full conditional distributions of the joint posterior distribution:

1. $p(\Sigma|\beta, \alpha, \mu_g, \sigma^2_g, \theta, z)$,

2. $p(\beta|\Sigma, \alpha, \mu_g, \sigma^2_g, \theta, z)$,

3. $p(\mu_g|\beta, \Sigma, \alpha, \sigma^2_g, \theta, z)$,

4. $p(\sigma^2_g|\beta, \Sigma, \alpha, \mu_g, \theta, z)$, and

5. $p(\theta_{ij}|\theta_{-ij}, \beta, \Sigma, \alpha, \mu_g, z)$ for each $i = 1, \ldots, n$, $j = 1, \ldots, m_i$,

where $\theta$ and $z$ collect the $n_t$ elements $\theta_{ij}$ and $z_{ij}$. When the Markov chain stabilizes on a (relatively) small number $n^{*}$ of distinct values $\theta_s$, $s = 1, \ldots, n^{*}$, it is unlikely that new values for $\theta$ are generated and, hence, the chain gets ‘stuck’ and has undesirable mixing properties. In order to prevent the chain from getting stuck on a few nodes, West, Muller and Escobar (1994) propose to ‘remix’ $\theta_s$, $s = 1, \ldots, n^{*}$, after each iteration of the MCMC algorithm (see also Escobar and West, 1998). Let $S = (S_{11}, \ldots, S_{n m_n})$ denote the cluster structure, that is $S_{ij} = s$ if $\theta_{ij} = \theta_s$ for $i = 1, \ldots, n$, $j = 1, \ldots, m_i$ and $s = 1, \ldots, n^{*}$. Given this configuration, the full conditional distribution $p(\theta_s|S, n^{*}, \beta, \Sigma, \mu_g, \sigma^2_g, z)$, $s = 1, \ldots, n^{*}$, can be used to generate a new set of values $\theta$ to provide more movement in the chain, which facilitates convergence. As we show later on, the remixing step typically involves drawing $n^{*}$ values for $\theta$ from a distribution that has a density somewhat similar to $h$ in (7.10).


In the following we provide the specification of the full conditional distributions above. More detailed information is given in appendix 7B.

Full conditional distribution of $\Sigma$. The full conditional distribution for $\Sigma$ reduces to

$$p(\Sigma|\beta, \theta, z) \propto \left\{\prod_{i=1}^{n}\prod_{j=1}^{m_i} p(z_{ij}|\theta_{ij}, \beta, \Sigma)\right\} \times p(\Sigma|\omega, \Psi), \tag{7.13}$$

from which it follows that $\Sigma$ is sampled from a two-dimensional inverse Wishart distribution with parameters $\omega + n_t$ and $\sum_i\sum_j (z_{ij} - \mu^s_{z,ij})(z_{ij} - \mu^s_{z,ij})' + \Psi$.

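Full conditional distribution of $\beta$. Similarly, the full conditional distribution of the regression parameters $\beta$ is obtained from combining the likelihood and a normal prior distribution, i.e.

$$p(\beta|\Sigma, \theta, z) \propto \left\{\prod_{i=1}^{n}\prod_{j=1}^{m_i} p(z_{ij}|\theta_{ij}, \beta, \Sigma)\right\} \times p(\beta|\mu_\beta, \Sigma_\beta). \tag{7.14}$$

Hence, $\beta$ is sampled from a normal distribution with mean

$$C^{-1}\left\{\sum_{i,j}\mathbf{x}'_{ij}\left(\sigma^{(11)}y_{ij} + \sigma^{(12)}(x_{ij} - \theta_{ij})\right) + \Sigma^{-1}_\beta\mu_\beta\right\},$$

and variance-covariance $C^{-1}$, where $C = \sigma^{(11)}\sum_{i,j}\mathbf{x}'_{ij}\mathbf{x}_{ij} + \Sigma^{-1}_\beta$ and $\mathbf{x}_{ij} = (1, x_{ij})$.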
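The draw of $\beta$ described above can be sketched in a few lines of Python. The function below only mirrors the mean and variance expressions given for the full conditional; the function name, the simulated data, the three-point distribution for $\theta_{ij}$, and the vague prior $N(0, 100\,I)$ are assumptions made here for illustration, and $\theta$ and $\Sigma$ are treated as known, as they would be within one MCMC iteration.

```python
import numpy as np

def draw_beta(y, x, theta, Sigma, mu_beta, Sigma_beta, rng):
    """One draw from the normal full conditional of beta = (beta0, beta1)'."""
    Sigma_inv = np.linalg.inv(Sigma)
    s11, s12 = Sigma_inv[0, 0], Sigma_inv[0, 1]          # sigma^(11), sigma^(12)
    X = np.column_stack([np.ones_like(x), x])            # rows (1, x_ij)
    prior_precision = np.linalg.inv(Sigma_beta)
    C = s11 * (X.T @ X) + prior_precision
    rhs = X.T @ (s11 * y + s12 * (x - theta)) + prior_precision @ mu_beta
    mean = np.linalg.solve(C, rhs)
    cov = np.linalg.inv(C)
    cov = 0.5 * (cov + cov.T)                            # guard against round-off asymmetry
    return rng.multivariate_normal(mean, cov), mean

# Hypothetical data: true beta = (1, 2), cov(eps, nu) = 0.5.
rng = np.random.default_rng(0)
n_t = 5_000
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
errors = rng.multivariate_normal(np.zeros(2), Sigma, size=n_t)   # columns: eps, nu
theta = rng.choice([-2.0, 0.0, 3.0], size=n_t)                   # latent instrument values
x = theta + errors[:, 1]
y = 1.0 + 2.0 * x + errors[:, 0]

beta_draw, beta_mean = draw_beta(y, x, theta, Sigma, np.zeros(2), 100.0 * np.eye(2), rng)
print(beta_mean, beta_draw)
```

The term $\sigma^{(12)}(x_{ij} - \theta_{ij})$ acts as a correction for the part of the error that is predictable from $\nu_{ij}$, which is why the conditional mean is not distorted by the regressor-error correlation.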

Full conditional of $\theta_{ij}$. A sample for each $\theta_{ij}$ can be obtained using (7.10). Hence,

1. sample a proposed ‘cluster’ value $c_{ij}$ from the integers $\{0, 1, \ldots, n^-\}$, with probabilities proportional to $\{q_0(ij), n^-_1 q_1(ij), \ldots, n^-_{n^-} q_{n^-}(ij)\}$.

2. If $c_{ij} \in \{1, \ldots, n^-\}$, then set $\theta_{ij} = \theta_{c_{ij}}$, and if $c_{ij} = 0$, then draw a new value $\theta_{ij}$ from $h(\theta_{ij}|\beta, \Sigma, G_0, z_{ij})$, which is the density of a univariate normal distribution with mean $C^{-1}(\sigma^{(22)}x_{ij} + \sigma^{(12)}(y_{ij} - \mathbf{x}_{ij}\beta) + \mu_g/\sigma^2_g)$ and variance $C^{-1} = (\sigma^{(22)} + 1/\sigma^2_g)^{-1}$.

We provide more detail on the form of the probabilities $q_0(ij), q_1(ij), \ldots, q_{n^-}(ij)$ in appendix 7B.
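The two steps above can be sketched as follows. This is a minimal illustration and not the implementation used for the simulations: the function names are hypothetical, the marginal density of $z_{ij}$ under $G_0$ needed for $q_0(ij)$ is obtained here from the identity $m(z) = p(z|t)\,g_0(t)/h(t)$, which holds for any value $t$, and weights are handled on the log scale for numerical stability.

```python
import math
import random

def log_lik(y, x, theta, beta, Sigma):
    """Structural log-density (7.5) of z = (y, x) given theta."""
    (s_ee, s_en), (_, s_nn) = Sigma
    det = s_ee * s_nn - s_en ** 2
    e, v = y - beta[0] - beta[1] * x, x - theta
    quad = (s_nn * e * e - 2 * s_en * e * v + s_ee * v * v) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def baseline_posterior(y, x, beta, Sigma, mu_g, var_g):
    """Mean and variance of the normal density h(theta | beta, Sigma, G0, z)."""
    (s_ee, s_en), (_, s_nn) = Sigma
    det = s_ee * s_nn - s_en ** 2
    s22, s12 = s_ee / det, -s_en / det                  # sigma^(22), sigma^(12) from (7.11)
    var = 1.0 / (s22 + 1.0 / var_g)
    mean = var * (s22 * x + s12 * (y - beta[0] - beta[1] * x) + mu_g / var_g)
    return mean, var

def log_normal_pdf(t, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (t - mean) ** 2 / var

def draw_theta(y, x, clusters, beta, Sigma, mu_g, var_g, alpha, rng):
    """clusters: list of (theta_l, n_l_minus), with observation ij already removed."""
    mean, var = baseline_posterior(y, x, beta, Sigma, mu_g, var_g)
    # log marginal of z under G0, evaluated via m(z) = p(z|t) g0(t) / h(t) at t = mean
    log_m = (log_lik(y, x, mean, beta, Sigma) + log_normal_pdf(mean, mu_g, var_g)
             - log_normal_pdf(mean, mean, var))
    log_w = [math.log(alpha) + log_m]                   # weight of a new value (c = 0)
    log_w += [math.log(n_l) + log_lik(y, x, th, beta, Sigma) for th, n_l in clusters]
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    c = rng.choices(range(len(weights)), weights=weights)[0]
    if c == 0:
        return rng.gauss(mean, math.sqrt(var)), c       # step 2: new value from h
    return clusters[c - 1][0], c                        # step 2: join existing cluster c

rng = random.Random(0)
Sigma = [[1.0, 0.4], [0.4, 0.8]]
beta = (1.0, 2.0)
# With a tiny alpha and a well-fitting cluster at theta = 1, that cluster is chosen.
value_joined, c_joined = draw_theta(3.0, 1.0, [(1.0, 50), (8.0, 50)], beta, Sigma, 0.0, 4.0, 1e-8, rng)
# Without existing clusters a new value is always drawn from h.
value_new, c_new = draw_theta(3.0, 1.0, [], beta, Sigma, 0.0, 4.0, 1.0, rng)
print(c_joined, value_joined, c_new, value_new)
```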

Remix $\theta$. Let $J_l$ denote the set of observations for which $\theta_{ij} = \theta_l$, $l = 1, \ldots, n^{*}$. The full conditional of $\theta_l$ is proportional to

$$\left\{\prod_{i,j \in J_l} p(z_{ij}|\theta_l, \beta, \Sigma)\right\} g_0(\theta_l|\mu_g, \sigma^2_g),$$

for $l = 1, \ldots, n^{*}$, see e.g. West, Muller and Escobar (1994), and Escobar and West (1998). The derivation of this distribution is more or less similar to the derivation of $h(\theta_{ij}|\beta, \Sigma, G_0, z_{ij})$ above.

The $\theta_l$’s are updated by replacing the current values by new values that are drawn from a univariate normal distribution with mean $C^{-1}\left(\sigma^{(22)}\sum_{i,j \in J_l} x_{ij} + \sigma^{(12)}\sum_{i,j \in J_l}(y_{ij} - \mathbf{x}_{ij}\beta) + \mu_g/\sigma^2_g\right)$ and variance $C^{-1}$, where $C = n_l\sigma^{(22)} + 1/\sigma^2_g$.
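A sketch of the remixing draw for a single cluster is given below. It simply evaluates the mean and variance expressions stated above for the observations in $J_l$; the function name, the simulated cluster of 400 observations, and the prior values are assumptions made here for illustration.

```python
import math
import random

def remix_cluster(ys, xs, beta, Sigma, mu_g, var_g, rng):
    """Draw a new common value theta_l for the observations (ys, xs) in cluster l."""
    (s_ee, s_en), (_, s_nn) = Sigma
    det = s_ee * s_nn - s_en ** 2
    s22, s12 = s_ee / det, -s_en / det                  # sigma^(22), sigma^(12) from (7.11)
    C = len(xs) * s22 + 1.0 / var_g
    resid_sum = sum(y - beta[0] - beta[1] * x for y, x in zip(ys, xs))
    mean = (s22 * sum(xs) + s12 * resid_sum + mu_g / var_g) / C
    return rng.gauss(mean, math.sqrt(1.0 / C)), mean, 1.0 / C

# Hypothetical cluster: 400 observations sharing theta_l = 3, cov(eps, nu) = 0.5.
rng = random.Random(3)
Sigma = [[1.0, 0.5], [0.5, 1.0]]
beta = (1.0, 2.0)
theta_true = 3.0
ys, xs = [], []
for _ in range(400):
    nu = rng.gauss(0.0, 1.0)
    eps = 0.5 * nu + math.sqrt(0.75) * rng.gauss(0.0, 1.0)
    x = theta_true + nu
    xs.append(x)
    ys.append(beta[0] + beta[1] * x + eps)

theta_new, post_mean, post_var = remix_cluster(ys, xs, beta, Sigma, 0.0, 25.0, rng)
print(theta_new, post_mean, post_var)
```

For a cluster with a single observation the expressions reduce to the mean and variance of $h$, which is the sense in which the two densities are similar.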

Full conditional of $\mu_g$. The parameters $\theta_l$ are independent and identically distributed from $G_0(\cdot|\mu_g, \sigma^2_g)$ and $\mu_g$ enters the model only through $G_0$. Hence, the full conditional distribution for $\mu_g$ is given by (West, Muller and Escobar, 1994)

$$p(\mu_g|\sigma^2_g, \theta, n^{*}, z) \propto \left\{\prod_{l=1}^{n^{*}} g_0(\theta_l|\mu_g, \sigma^2_g)\right\} p(\mu_g|\mu_0, \sigma^2_0), \tag{7.15}$$

where both densities are normal densities. Hence, a new value for $\mu_g$ is drawn from a normal distribution with mean $C^{-1}\left(\sum_l \theta_l/\sigma^2_g + \mu_0/\sigma^2_0\right)$ and variance $C^{-1}$, with $C = 1/\sigma^2_0 + n^{*}/\sigma^2_g$.


Full conditional of $\sigma^2_g$. Similarly, the full conditional of $\sigma^2_g$ reduces to

$$p(\sigma^2_g|\mu_g, \theta, n^{*}, z) \propto \left\{\prod_{l=1}^{n^{*}} g_0(\theta_l|\mu_g, \sigma^2_g)\right\} p(\sigma^2_g|c, d), \tag{7.16}$$

where the prior for $\sigma^2_g$ is an inverted gamma distribution. It follows that $\sigma^2_g$ is sampled from an inverted gamma distribution with parameters $c + n^{*}/2$ and $\frac{1}{2}\sum_l(\theta_l - \mu_g)^2 + d$.
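The two conjugate updates for the baseline parameters can be sketched as below. The sketch assumes that the inverted gamma distribution is parameterized by a shape $c$ and a scale $d$, i.e. with density proportional to $(\sigma^2_g)^{-(c+1)}\exp(-d/\sigma^2_g)$; this parameterization, the function names, and the simulated distinct values are assumptions made here, not statements about appendix 7B.

```python
import math
import random

def draw_mu_g(theta_star, var_g, mu_0, var_0, rng):
    """Normal full conditional of mu_g given the distinct values theta_star."""
    C = 1.0 / var_0 + len(theta_star) / var_g
    mean = (sum(theta_star) / var_g + mu_0 / var_0) / C
    return rng.gauss(mean, math.sqrt(1.0 / C))

def draw_var_g(theta_star, mu_g, c, d, rng):
    """Inverted gamma full conditional of sigma_g^2 (shape c, scale d prior assumed)."""
    shape = c + len(theta_star) / 2.0
    scale = d + 0.5 * sum((t - mu_g) ** 2 for t in theta_star)
    # If X ~ Gamma(shape, rate = scale), then 1 / X is inverted gamma(shape, scale).
    # random.gammavariate takes (shape, scale of the gamma), hence 1 / scale below.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)

# Hypothetical distinct values drawn from G0 = N(2, 1.5^2), vague priors.
rng = random.Random(7)
theta_star = [rng.gauss(2.0, 1.5) for _ in range(2000)]
mu_g_draw = draw_mu_g(theta_star, 2.25, 0.0, 1000.0, rng)
var_g_draw = draw_var_g(theta_star, mu_g_draw, 0.01, 0.01, rng)
print(mu_g_draw, var_g_draw)
```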

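Full conditional for $\alpha$. The full conditional distribution of $\alpha$, the ‘dispersion’ of the Dirichlet process, reduces to $p(\alpha|n^{*}, n_t)$. In fact, when the prior $p(\alpha)$ is a gamma density with parameters $(\tau_\alpha, \gamma_\alpha)$, it is possible to obtain an exact expression for the full conditional of $\alpha$. Escobar and West (1995), or West (1992) show that a new value for $\alpha$ can be obtained in two steps:

1. sample an auxiliary value $\eta$ from $p(\eta|\alpha, n^{*}, n_t) \sim \text{Beta}(\alpha + 1, n_t)$, i.e. a beta distribution with mean $\frac{\alpha + 1}{\alpha + n_t + 1}$.

2. Then, sample $\alpha$ from the following mixture of gammas: $p(\alpha|\eta, n^{*}, n_t) \sim \pi_n\,\text{Gamma}[\tau_\alpha + n^{*}, \gamma_\alpha - \log(\eta)] + (1 - \pi_n)\,\text{Gamma}[\tau_\alpha + n^{*} - 1, \gamma_\alpha - \log(\eta)]$, where $\frac{\pi_n}{1 - \pi_n} = \frac{\tau_\alpha + n^{*} - 1}{n_t(\gamma_\alpha - \log(\eta))}$.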
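The two-step update for $\alpha$ is short enough to write out. The sketch below assumes that Gamma$[a, b]$ denotes a gamma distribution with shape $a$ and rate $b$ (mean $a/b$), as in Escobar and West (1995); the function name, the prior values, and the fixed configuration of 10 clusters among 500 observations are chosen here purely for illustration.

```python
import math
import random

def draw_alpha(alpha, n_star, n_t, tau_a, gamma_a, rng):
    """One auxiliary-variable update of the Dirichlet process parameter alpha."""
    eta = rng.betavariate(alpha + 1.0, n_t)              # step 1
    rate = gamma_a - math.log(eta)
    odds = (tau_a + n_star - 1.0) / (n_t * rate)         # pi_n / (1 - pi_n)
    pi_n = odds / (1.0 + odds)
    shape = tau_a + n_star if rng.random() < pi_n else tau_a + n_star - 1.0
    # random.gammavariate takes (shape, scale); the scale is the inverse of the rate.
    return rng.gammavariate(shape, 1.0 / rate)           # step 2

rng = random.Random(11)
alpha = 1.0
draws = []
for _ in range(5000):
    alpha = draw_alpha(alpha, n_star=10, n_t=500, tau_a=2.0, gamma_a=2.0, rng=rng)
    draws.append(alpha)
alpha_mean = sum(draws) / len(draws)
print(alpha_mean)
```

In a full sampler $n^{*}$ changes from iteration to iteration, so the call would use the current number of distinct values rather than a fixed one.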

This completes the specification of the MCMC chain. Escobar (1994) and Escobar and West (1995) prove convergence theorems for MCMC chains that use a Dirichlet process prior. Using suitable starting values, the above scheme can be iterated many times to obtain a sample of any size from the true posterior distribution. An important question is to determine how often this scheme has to be repeated to ensure convergence of the chain, see for instance Cowles and Carlin (1996), or Brooks and Roberts (1998). We present simulation results for this model in section 7.4. In the following we consider a more general hierarchical regression model.


7.3 Endogenous subject-level covariates and random coefficients

In this section we consider general random coefficients models (e.g. Lenk et al., 1996) with possibly endogenous subject-level covariates, in which case parameter estimates using standard estimation techniques are no longer guaranteed to be unbiased. This may occur, for instance, when relevant covariates that are correlated with included covariates are omitted, or when some of the covariates are measured with error. For instance, the self-reported measure for knowledge about the microcomputer market to explain part of the variability of the random regression coefficients, used by Lenk et al. (1996), is possibly measured with error because individuals may find it difficult to adequately express their knowledge by a few statements. Besides, if the measures used are not, or only partly, related to the constructs that are actually searched for, the observed data that is used in estimation contains measurement error. As far as we know there are no other studies that consider regressor-error dependencies at this stage of the model, but as will become clear the estimated model parameters may be biased in the presence of such endogenous covariates when standard estimation techniques are used. To be more specific, here we investigate a more general form of level-2 dependencies as in section 6.1 (see also the discussion about random coefficient models in section 6.6).

The model we consider is a standard linear two-level model with random coefficients. We assume that a set of individual level covariates is available to explain part of the variance of the random coefficients. The model is given by

$$\begin{aligned}
y_{ij} &= x'_{ij}\beta_i + \varepsilon_{ij} \\
\beta_i &= \gamma_c + \gamma z_i + \eta_i,
\end{aligned} \tag{7.17}$$

with $i = 1, \ldots, n$ and $j = 1, \ldots, m$, or $m_i$. $x_{ij}$ is a set of explanatory variables (e.g. a design matrix in conjoint analysis), which is, like $\beta_i$, a $k \times 1$ vector. The individual-level covariates are given by $z_i = (z'_{1i}, z'_{2i})'$, where $z_{1i}$ is an $l_1 \times 1$ vector of potentially endogenous covariates that are correlated with $\eta_i$ (but not with $\varepsilon_{ij}$), and $z_{2i}$ is an $l_2 \times 1$ vector containing the exogenous covariates. $\gamma$ is a $k \times (l_1 + l_2)$ matrix, and the constant $\gamma_c$ is a $k \times 1$ vector. We also write $\gamma_0 = (\gamma', \gamma'_c)'$, i.e. the matrix $\gamma_0$ represents the effect of the covariates $z_i$ on the regression coefficients $\beta_i$. The model for $\beta_i$ is a (latent) multivariate regression model, where $E(\eta_i) = 0$ and $\text{Var}(\eta_i) = \Sigma_{\beta\beta}$. We assume that all the $\varepsilon_{ij}$’s are independent across and within subjects, with mean 0 and variance $\sigma^2$. We are not restricted to within-subject independence and the results presented here can be generalized to a situation where $\text{Var}(\varepsilon_i) = \Sigma$, or $\Sigma_i$.

The nonparametric Bayes LIV approach presented in the previous section can be used in the model for $\beta_i$ to solve for possible biases in the presence of endogenous covariates $z_1$. Here the idea of latent instruments is particularly useful since obtaining valid observed instruments at this stage of the model is highly problematic and ambiguous. The latent instruments are included as follows

$$z_{1i} = \theta_i + \alpha z_{2i} + \xi_i, \tag{7.18}$$

where $\theta_i$ is an $l_1 \times 1$ vector of unobserved instruments, $\alpha$ an $l_1 \times l_2$ matrix that contains the effects of the exogenous covariates on the endogenous covariates$^{2}$, and $\xi_i$ is an $l_1 \times 1$ vector of errors, which has expectation zero. The dependency between the covariates $z_{1i}$ and $\eta_i$, i.e. the endogeneity, is caused by a nonzero covariance between $\eta_i$ and $\xi_i$. The variance-covariance matrix of $(\eta_i, \xi_i)$ is the $(k + l_1) \times (k + l_1)$ matrix $\Lambda$, which is assumed to be positive definite, and contains the block matrices $\Sigma_{\beta\beta}$, $\Sigma_{\beta z_1}$, and $\Sigma_{z_1 z_1}$.

As in the previous section, the ‘latent’ instrument $\theta_i$ has an unknown distribution $G$, which has a Dirichlet process prior with parameters $\rho$ and $G_0$. $G_0$ is the normal density with mean $\mu_\theta$, which is an $(l_1 \times 1)$-vector, and an $(l_1 \times l_1)$ variance-covariance matrix $V_\theta$. In the following we give the full conditionals for the MCMC scheme. We use standard conjugate priors for the parameters, which are specified such that they represent no or vague prior knowledge. More details on the derivation of the full conditionals are given in appendix 7C.

$^{2}$ This parameter $\alpha$ is not to be confused with the ‘dispersion’ parameter $\alpha$ of the Dirichlet process in the previous section. For the model in this section we use $\rho$ for that purpose.

7.3.1 Estimating the hierarchical model with general latent instruments

The unknown parameters of the model in (7.17) and (7.18) are: $\beta_i$, $\sigma^2$, $\gamma_0$, $\theta_i$, $\alpha$, $\Lambda$, $\rho$, $\mu_\theta$ and $V_\theta$. The joint posterior distribution, conditionally on the data $X$, $Y$, and $Z$, is formed as a product of the likelihood function, obtained from the first equation of model (7.17), and the prior densities. Since no closed-form expressions for the joint posterior density and marginal posterior densities are available, we use MCMC sampling to approximate a sample from the true marginal posterior densities. Assuming normal distributions for the error terms, the full conditional distributions for the parameters with conjugate priors can be obtained similarly to section 7.2. Here we generally use multivariate distributions to accommodate a vector $\beta_i$ and possibly several endogenous variables $z_i$. See Escobar (1994) and Dey, Muller and Sinha (1998) for more details on general multivariate nonparametric Bayesian estimation.

The MCMC scheme is completed by iterating the following conditional distri-

butions:

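1. $p(\sigma^2 \mid \beta_i, i = 1, \ldots, n; Y, X)$;

2. $p(\beta_i \mid \sigma^2, \gamma_0, \Lambda, \theta_i, \alpha; Y, X, Z)$, for $i = 1, \ldots, n$;

3. $p(\Lambda \mid \beta_i, \theta_i, i = 1, \ldots, n, \gamma_0, \alpha; Z)$;

4. $p(\gamma_0 \mid \beta_i, \theta_i, i = 1, \ldots, n, \Lambda, \alpha; Z)$;

5. $p(\alpha \mid \beta_i, \theta_i, i = 1, \ldots, n, \Lambda, \gamma_0; Z)$;

6. $p(\theta_i \mid \theta_{-i}, \beta_i, \Lambda, \gamma_0, \alpha; Z)$, where $\theta_{-i} = (\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_n)$, for $i = 1, \ldots, n$;

7. $p(\mu_\theta \mid \theta_l, l = 1, \ldots, \tilde{n}, V_\theta)$, where $\theta_l$ are the $\tilde{n} \le n$ distinct values of $\theta_i$ that generally arise from the clustering structure in the Dirichlet process;

8. $p(V_\theta \mid \theta_l, l = 1, \ldots, \tilde{n}, \mu_\theta)$, and

9. $p(\rho \mid \tilde{n}, n)$,

and, as before, a remixing step for the distinct values $\theta_l$, $l = 1, \ldots, \tilde{n}$. In the following we give the specific distributions; see appendix 7C for more details.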

Full conditional distribution of $\sigma^2$. The full conditional for $\sigma^2$ is proportional to the likelihood times the inverse gamma prior distribution for $\sigma^2$ with parameters $\tau_0$ and $\eta_0$. Hence, a new value for $\sigma^2$ is drawn from an inverse gamma distribution with parameters $\tau_0 + nm/2$ and $\frac{1}{2}\sum_{i,j} (y_{ij} - x_{ij}'\beta_i)^2 + \eta_0$.
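The draw for $\sigma^2$ described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `draw_sigma2`, the array layout, and the reading of the two inverse gamma parameters as shape and scale are assumptions.

```python
import numpy as np

def draw_sigma2(y, X, beta, tau0, eta0, rng):
    """Draw sigma^2 from its inverse gamma full conditional.

    Assumed (hypothetical) layout: y is (n, m), X is (n, m, k), beta is (n, k).
    The parameters are read as shape = tau0 + n*m/2 and
    scale = 0.5 * sum of squared residuals + eta0.
    """
    n, m = y.shape
    # Residuals y_ij - x_ij' beta_i for every individual i and observation j.
    resid = y - np.einsum("ijk,ik->ij", X, beta)
    shape = tau0 + n * m / 2.0
    scale = 0.5 * np.sum(resid ** 2) + eta0
    # If g ~ Gamma(shape, rate=scale), then 1/g ~ InvGamma(shape, scale).
    return 1.0 / rng.gamma(shape, 1.0 / scale)
```

With many observations the draw concentrates around the residual variance, which is a quick sanity check on the parameterization.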

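Full conditional distribution of $\beta_i$. The full conditional for $\beta_i$ can be obtained from

$$p(\beta_i \mid \text{rest}, \text{data}) \propto \Big[ \prod_{j=1}^{m} p_1(y_{ij} \mid \beta_i, \sigma^2) \Big] \, p_k(\beta_i \mid z_{1i}, z_{2i}, \gamma_0, \Lambda, \theta, \alpha) \qquad (7.19)$$

where $p_x$ denotes the $x$-variate normal density. The latter conditional distribution in (7.19) is obtained from the joint distribution $p_{k+l_1}(h_i \mid z_{2i}, \gamma, \Lambda, \theta, \alpha)$, where $h_i = (\beta_i', z_{1i}')'$ is a $(k + l_1) \times 1$ vector. Hence, $p_k(\beta_i \mid z_{1i}, z_{2i}, \gamma_0, \Lambda, \theta, \alpha)$ is a $k$-variate normal distribution with mean

$$\mu_{\beta.z_{1i}} = \gamma_c + \gamma z_i + \Sigma_{\beta z_1} \Sigma_{z_1 z_1}^{-1} \big( z_{1i} - (\theta_i + \alpha z_{2i}) \big), \qquad (7.20)$$

and variance-covariance

$$\Sigma_{\beta\beta.z_1} = \Sigma_{\beta\beta} - \Sigma_{\beta z_1} \Sigma_{z_1 z_1}^{-1} \Sigma_{\beta z_1}' \qquad (7.21)$$

(e.g. theorem 3.6, Greene, 2000). Let $C = \frac{1}{\sigma^2} \sum_j x_{ij} x_{ij}' + \Sigma_{\beta\beta.z_1}^{-1}$. It follows that the full conditional for $\beta_i$ is a $k$-variate normal density with mean $C^{-1}\big( \frac{1}{\sigma^2} \sum_j x_{ij} y_{ij} + \Sigma_{\beta\beta.z_1}^{-1} \mu_{\beta.z_{1i}} \big)$ and variance-covariance $C^{-1}$.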
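The two stages of this update, conditioning $\beta_i$ on $z_{1i}$ as in (7.20) and (7.21) and then combining with the likelihood, can be sketched as follows. The function name `draw_beta_i` and all argument names are hypothetical, and the sketch assumes the blocks of $\Lambda$ and the residual $z_{1i} - (\theta_i + \alpha z_{2i})$ have already been computed.

```python
import numpy as np

def draw_beta_i(y_i, X_i, sigma2, mu_prior, S_bb, S_bz, S_zz, z1_resid, rng):
    """Draw beta_i from its k-variate normal full conditional.

    y_i: (m,), X_i: (m, k); mu_prior = gamma_c + gamma z_i, shape (k,);
    z1_resid = z_1i - (theta_i + alpha z_2i), shape (l1,).
    S_bb, S_bz, S_zz are the blocks Sigma_bb, Sigma_bz1, Sigma_z1z1.
    """
    A = S_bz @ np.linalg.inv(S_zz)
    mu_cond = mu_prior + A @ z1_resid        # conditional mean, as in (7.20)
    S_cond = S_bb - A @ S_bz.T               # conditional covariance, as in (7.21)
    P_cond = np.linalg.inv(S_cond)
    C = X_i.T @ X_i / sigma2 + P_cond        # posterior precision
    C_inv = np.linalg.inv(C)
    C_inv = 0.5 * (C_inv + C_inv.T)          # guard against round-off asymmetry
    mean = C_inv @ (X_i.T @ y_i / sigma2 + P_cond @ mu_cond)
    return rng.multivariate_normal(mean, C_inv), mean, C_inv
```

When the prior covariance is made very diffuse and the cross-covariance block is zero, the returned mean reduces to the individual-level OLS estimate, which gives a simple check of the algebra.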

Full conditional distribution of $\Lambda$. As before, let $h_i = (\beta_i', z_{1i}')'$ and define $\mu_{h_i} = \big( (\gamma_c + \gamma z_i)', (\theta_i + \alpha z_{2i})' \big)'$. The full conditional for $\Lambda$ is obtained as the product of the joint density of $h_i$ across $i = 1, \ldots, n$ and the density of the inverted Wishart prior distribution with parameters $(c, D)$. Hence, $\Lambda$ is sampled from an inverted Wishart$_{k+l_1}$-distribution with parameters $n + c$ and

$$\sum_{i=1}^{n} (h_i - \mu_{h_i})(h_i - \mu_{h_i})' + D.$$
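A draw of $\Lambda$ along these lines can be sketched with NumPy alone. The sketch assumes that the first parameter acts as integer degrees of freedom and the second as the scale matrix of the inverted Wishart; the convention used in the text may differ, and the helper names `draw_inv_wishart` and `draw_lambda` are hypothetical.

```python
import numpy as np

def draw_inv_wishart(df, scale, rng):
    """Draw from an inverted Wishart with integer df and scale matrix `scale`.

    Uses the fact that if W ~ Wishart(df, scale^{-1}), then W^{-1} is
    inverted Wishart with df degrees of freedom and scale matrix `scale`.
    """
    p = scale.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(scale))
    G = rng.normal(size=(df, p)) @ L.T   # rows are N(0, scale^{-1}) draws
    W = G.T @ G                          # Wishart(df, scale^{-1})
    return np.linalg.inv(W)

def draw_lambda(h, mu_h, c, D, rng):
    """h and mu_h are (n, k + l1) arrays holding h_i and mu_{h_i} row-wise."""
    resid = h - mu_h
    S = resid.T @ resid + D              # sum of outer products plus D
    return draw_inv_wishart(int(h.shape[0] + c), S, rng)
```

For non-integer degrees of freedom a Bartlett-type construction would be needed instead of the sum of outer products used here.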

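Full conditional for $\gamma_0$. The full conditional distribution for $\gamma_0$ can be obtained by vectorizing the model for $(\beta_i, z_{1i})$. I.e. we stack the rows of $\beta_i$ and $z_{1i}$ as follows (see e.g. Lenk, 2001)

$$\beta_i = \mathrm{vec}(\beta_i) = \mathrm{vec}(\gamma_c) + \mathrm{vec}(\gamma z_i) + \mathrm{vec}(\eta_i) = (z_{0i}' \otimes I_k)\gamma_0 + \eta_i, \qquad (7.22)$$

with $z_{0i} = (z_i', 1)'$, an $(l + 1) \times 1$-vector, and

$$z_{1i} = \mathrm{vec}(z_{1i}) = \mathrm{vec}(\theta_i) + \mathrm{vec}(\alpha z_{2i}) + \mathrm{vec}(\xi_i) = \theta_i + (z_{2i}' \otimes I_{l_1})\alpha + \xi_i. \qquad (7.23)$$

We write $\mathbf{z}_{0i} = z_{0i}' \otimes I_k$ and $\mathbf{z}_{2i} = z_{2i}' \otimes I_{l_1}$. Furthermore, let $z_{1i}^0 = z_{1i} - \theta_i - \mathbf{z}_{2i}\alpha$, $\Lambda = \mathrm{var}(\eta_i, \xi_i)$ and³

$$\Lambda^{-1} = \begin{bmatrix} \Lambda^{(11)} & \Lambda^{(12)} \\ \Lambda^{(21)} & \Lambda^{(22)} \end{bmatrix}, \qquad (7.24)$$

where $\Lambda^{(12)} = \Lambda^{(21)\prime}$, $\Lambda^{(11)}$ is $k \times k$, $\Lambda^{(12)}$ is $k \times l_1$, and $\Lambda^{(22)}$ is an $l_1 \times l_1$ matrix. The vectorized system for $\beta_i$ and $z_{1i}$ has a ‘standard’ multivariate normal form. It follows from appendix 7C that the values for $\gamma_0$ are drawn from a $(l_1 + l_2 + 1)$-variate normal distribution with mean

$$C^{-1} \Big[ \sum_i \mathbf{z}_{0i}' \big( \Lambda^{(12)} z_{1i}^0 + \Lambda^{(11)} \beta_i \big) + V_\gamma^{-1} m_\gamma \Big], \qquad (7.25)$$

and variance-covariance $C^{-1}$, where $C = \sum_i \mathbf{z}_{0i}' \Lambda^{(11)} \mathbf{z}_{0i} + V_\gamma^{-1}$.

3 See Greene (2000), formula (2-74), for a general expression of the inverse of a $2 \times 2$ partitioned matrix.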
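The vectorization step in (7.22) rests on the identity $\mathrm{vec}(Az) = (z' \otimes I)\,\mathrm{vec}(A)$. The following small numerical check illustrates it. The dimensions, the random values, and the ordering of $\gamma_0$ as the column-stacked matrix $[\gamma, \gamma_c]$ are assumptions chosen to match $z_{0i} = (z_i', 1)'$.

```python
import numpy as np

rng = np.random.default_rng(0)
k, l = 2, 3                                   # illustrative dimensions only

gamma = rng.normal(size=(k, l))               # regression coefficients
gamma_c = rng.normal(size=k)                  # constant term
z_i = rng.normal(size=l)

z0_i = np.append(z_i, 1.0)                    # z_0i = (z_i', 1)'
G = np.column_stack([gamma, gamma_c])         # k x (l + 1)
gamma0 = G.flatten(order="F")                 # column-stacking vec operator

Z0_i = np.kron(z0_i[None, :], np.eye(k))      # z_0i' (x) I_k, shape k x k(l + 1)

lhs = Z0_i @ gamma0                           # Kronecker form, as in (7.22)
rhs = gamma_c + gamma @ z_i                   # direct form gamma_c + gamma z_i
print(np.allclose(lhs, rhs))
```

Under this ordering the Kronecker design matrix has $k(l + 1)$ columns, one for each element of the stacked coefficient vector.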

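Full conditional for $\alpha$. Using similar arguments as for the matrix of regression coefficients $\gamma_0$, the full conditional distribution for $\alpha$ can be easily obtained after vectorization. I.e., let

$$\beta_i^0 = \eta_i, \qquad z_{1i}^0 = \mathbf{z}_{2i}\alpha + \xi_i, \qquad (7.26)$$

where $\beta_i^0 = \mathrm{vec}(\beta_i - \gamma_c - \gamma z_i)$, $z_{1i}^0 = \mathrm{vec}(z_{1i} - \theta_i)$, and $\mathbf{z}_{2i} = z_{2i}' \otimes I_{l_1}$. It follows that the full conditional density for $\alpha$ is a multivariate normal distribution with mean (let $C = \sum_i \mathbf{z}_{2i}' \Lambda^{(22)} \mathbf{z}_{2i} + V_\alpha^{-1}$)

$$C^{-1} \Big[ \sum_i \mathbf{z}_{2i}' \big( \Lambda^{(21)} \beta_i^0 + \Lambda^{(22)} z_{1i}^0 \big) + V_\alpha^{-1} m_\alpha \Big], \qquad (7.27)$$

and variance $C^{-1}$.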

Full conditional for $\theta_i$. The structure of the full conditional distributions for each of the unobserved instruments $\theta_i$, $i = 1, \ldots, n$, is derived in a similar way as above for the simple linear multilevel model. The full conditional distribution for $\theta_i$ has the following form, with $\theta_{-i} = (\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_n)$,

$$[\theta_i \mid \theta_{-i}, \text{rest}, \text{data}] \sim q_0(i)\, h(\theta_i \mid \text{rest}, \text{data}) + \sum_{l=1}^{n^-} n_l^- \, q_l(i)\, \delta_{\theta_l}(\theta_i), \qquad (7.28)$$

where $n^-$ is the number of different $\theta_j$'s, $j = 1, \ldots, n$, $j \neq i$, and $n_l^-$ is the number of observations in cluster $l$ when the $i$-th observation is removed. Hence,

1. sample a proposed ‘cluster’ value $c_i$ from the integers $\{0, 1, \ldots, n^-\}$ with probabilities proportional to $\{q_0(i), n_1^- q_1(i), \ldots, n_{n^-}^- q_{n^-}(i)\}$.

2. If $c_i \in \{1, \ldots, n^-\}$, set $\theta_i = \theta_{c_i}$, and if $c_i = 0$, then draw a new value $\theta_i$ from $h(\theta_i \mid \beta_i, \alpha, \gamma_0, \Lambda, \mu_\theta, V_\theta; Z)$, which is a multivariate normal distribution with mean $C^{-1}(\Lambda^{(21)} \beta_i^0 + \Lambda^{(22)} z_{1i}^0 + V_\theta^{-1} \mu_\theta)$ and variance $C^{-1} = (\Lambda^{(22)} + V_\theta^{-1})^{-1}$, where $\beta_i^0 = \beta_i - \gamma_c - \gamma z_i$ and $z_{1i}^0 = z_{1i} - \alpha z_{2i}$.

In the appendix we give more details on how to compute the probabilities $\{q_0(i), n_1^- q_1(i), \ldots, n_{n^-}^- q_{n^-}(i)\}$.
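The two-step update for a single $\theta_i$ can be sketched as below. The weights $q_0(i)$ and $q_l(i)$ are taken as given inputs, since their computation is deferred to the appendix; the function name `draw_theta_i` and the argument names are hypothetical.

```python
import numpy as np

def draw_theta_i(q0, q, counts, theta_star, h_mean, h_cov, rng):
    """One Polya-urn style update for theta_i.

    q0: weight q_0(i) for a fresh draw; q: (n_minus,) weights q_l(i);
    counts: (n_minus,) cluster sizes n_l^- with observation i removed;
    theta_star: (n_minus, l1) distinct values among the other theta_j;
    h_mean, h_cov: mean and covariance of the normal density h(theta_i | ...).
    """
    # Step 1: pick c_i in {0, 1, ..., n_minus} with the stated probabilities.
    w = np.concatenate(([q0], counts * q))
    c = rng.choice(len(w), p=w / w.sum())
    # Step 2: either draw a new value from h or join an existing cluster.
    if c == 0:
        return rng.multivariate_normal(h_mean, h_cov), c
    return theta_star[c - 1].copy(), c
```

Returning the sampled label alongside the value makes it easy to update the cluster counts before moving on to the next observation.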

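Remix $\theta$. The remixing density for $\theta_l$, $l = 1, \ldots, \tilde{n}$, where $\tilde{n}$ is the number of distinct values, is proportional to

$$\prod_{i \in J_l} p(h_i \mid \theta_l, \alpha, \gamma_0, \Lambda; Z) \; g_0(\theta_l \mid \mu_\theta, V_\theta),$$

where $J_l$ is the set of indices of the observations belonging to group $l$. This distribution can be obtained in a similar manner as $h(\theta_i \mid \text{rest}, \text{data})$ in the previous subsection. Hence, the remixing density for $\theta_l$ is a multivariate normal distribution with mean

$$C^{-1} \Big( \Lambda^{(21)} \sum_{i \in J_l} \beta_i^0 + \Lambda^{(22)} \sum_{i \in J_l} z_{1i}^0 + V_\theta^{-1} \mu_\theta \Big),$$

and variance $C^{-1}$, with $C = n_l \Lambda^{(22)} + V_\theta^{-1}$, and where $\beta_i^0$ and $z_{1i}^0$ are defined as before.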

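Full conditional for $\mu_\theta$. The full conditional distribution for $\mu_\theta$ is proportional to

$$\Big\{ \prod_{l=1}^{\tilde{n}} g_0(\theta_l \mid \mu_\theta, V_\theta) \Big\} \, p(\mu_\theta \mid m_\mu, V_\mu), \qquad (7.29)$$

where $\tilde{n}$ is the number of distinct values $\theta_l$ and both factors are densities of a multivariate normal distribution. Let $C = \tilde{n} V_\theta^{-1} + V_\mu^{-1}$. Hence, $\mu_\theta$ is sampled from a multivariate normal distribution with mean

$$C^{-1} \Big( V_\theta^{-1} \sum_{l=1}^{\tilde{n}} \theta_l + V_\mu^{-1} m_\mu \Big),$$

and variance $C^{-1}$.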

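Full conditional distribution for $V_\theta$. The prior distribution for $V_\theta$ is an inverted Wishart with parameters $(\tau_v, \Upsilon_v)$. Its full conditional distribution is obtained in a similar way as for $\mu_\theta$ and can be shown to be equal to an inverted Wishart distribution of dimension $l_1$ with parameters $\tau_v + \tilde{n}$ and

$$\sum_{l=1}^{\tilde{n}} (\theta_l - \mu_\theta)(\theta_l - \mu_\theta)' + \Upsilon_v,$$

where $\tilde{n}$ is the number of distinct values $\theta_l$.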

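Full conditional distribution for $\rho$. The full conditional for $\rho$ is derived in a similar way as the dispersion parameter in the simple multilevel model in the previous section. Assuming a gamma prior distribution with parameters $(\tau_\rho, \gamma_\rho)$, and writing $\tilde{n}$ for the number of distinct values $\theta_l$, we obtain the following scheme for generating a new value for $\rho$:

1. Sample an auxiliary value $\eta$ from $p(\eta \mid \rho, \tilde{n}, n) \sim \mathrm{Beta}(\rho + 1, n)$, i.e. a beta distribution with mean $\frac{\rho + 1}{\rho + n + 1}$.

2. Then, sample $\rho$ from the following mixture of gammas: $p(\rho \mid \eta, \tilde{n}, n) \sim \pi_n \mathrm{Gamma}[\tau_\rho + \tilde{n}, \gamma_\rho - \log(\eta)] + (1 - \pi_n) \mathrm{Gamma}[\tau_\rho + \tilde{n} - 1, \gamma_\rho - \log(\eta)]$, where $\frac{\pi_n}{1 - \pi_n} = \frac{\tau_\rho + \tilde{n} - 1}{n(\gamma_\rho - \log(\eta))}$.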
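The two-step scheme for $\rho$ can be sketched as follows. The sketch assumes that the second gamma parameter is a rate (so that the mean is shape divided by rate); the function name `draw_rho` and the argument names `n_clusters` (number of distinct values of $\theta$) and `n_obs` (sample size $n$) are hypothetical.

```python
import numpy as np

def draw_rho(rho, n_clusters, n_obs, tau_rho, gamma_rho, rng):
    """One update of the Dirichlet process parameter rho via an auxiliary eta."""
    # Step 1: auxiliary variable eta ~ Beta(rho + 1, n).
    eta = rng.beta(rho + 1.0, n_obs)
    rate = gamma_rho - np.log(eta)
    # Mixture weight pi, obtained from the stated odds pi / (1 - pi).
    odds = (tau_rho + n_clusters - 1.0) / (n_obs * rate)
    pi = odds / (1.0 + odds)
    # Step 2: draw rho from the two-component gamma mixture.
    if rng.random() < pi:
        shape = tau_rho + n_clusters
    else:
        shape = tau_rho + n_clusters - 1.0
    return rng.gamma(shape, 1.0 / rate)   # NumPy's gamma takes scale = 1 / rate
```

Because $\log(\eta) < 0$, the rate is always larger than $\gamma_\rho$, so the update is well defined for any positive prior parameters.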

This completes the specification of the MCMC scheme. Similar arguments

as for the convergence of the MCMC algorithm for the simple nonparametric

Bayes LIV model in subsection 7.2.2 apply. In the following section we il-

lustrate the performance of the proposed two models and estimation schemes

using synthetic data.

7.4 A simulation study

In this section we discuss the results of two simulation studies to investigate the

performance of the models and estimation algorithms proposed in the previous

two sections. We first discuss the simulation results for the simple nonparamet-

ric Bayes model presented in section 7.2. Here we investigate the performance


of the model for three different choices of the distribution of the latent instru-

ment. Then we present the results for the random coefficients model in section

7.3 and consider a situation with one and two endogenous covariates.

7.4.1 Simulation results for the simple multilevel model

We specified three different distributions for the latent instrument in model (7.2): (1) a discrete distribution with two categories, similar to the bimodal ($m = 2$) case in section 3.5, (2) a continuous gamma distribution, and (3) a $t$ distribution with six degrees of freedom. When the true distribution for $\theta_i$ is discrete, the standard LIV model in chapter 3 is correctly specified (conditionally on knowing the true number of categories of the unobserved instrument) and we expect the standard LIV model to outperform the nonparametric Bayesian LIV approach. When the unobserved instrument has a $t$ distribution, the model is weakly identified, because it is not identified in the case of an exact normal distribution.⁴

In all cases the variance of $\theta_{ij}$ is equal to 1.5, and we normalized its mean to zero. We took an initial sample size of $n = 1000$ and assumed we had only one observation per individual, i.e. $m_i = 1$ for $i = 1, \ldots, n$. Furthermore, we took $\beta_0 = 1$, $\beta_1 = 2$, $\sigma_\varepsilon^2 = \sigma_\nu^2 = 1$, and $\sigma_{\varepsilon\nu}$ was set to 0, 0.36, and 0.79, representing situations with no, moderate and severe endogeneity, respectively. In total, we generated 15 datasets. We discarded the first 5000 iterations of the MCMC chain and saved the final 20000. To reduce the autocorrelation in the MCMC draws, we only used every 10th draw. Convergence was monitored based on iteration plots. We first discuss the results for the main parameters for the bimodal and gamma distributions and compare the results obtained from the nonparametric Bayes LIV method with the classical OLS estimates and the standard LIV estimates. Subsequently, we present our findings when the latent instrument has a $t$ distribution.
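The burn-in and thinning rule described above (discard 5000 iterations, save 20000, keep every 10th) leaves 2000 draws per chain. A minimal sketch of the index arithmetic, assuming that thinning starts at the first saved iteration and using hypothetical variable names:

```python
import numpy as np

n_burn, n_save, thin = 5000, 20000, 10

# Stand-in for a stored chain: one entry per MCMC iteration.
chain = np.arange(n_burn + n_save)

# Drop the burn-in, then keep every 10th of the remaining draws.
kept = chain[n_burn::thin]
print(len(kept))
```

The same slice can be applied to an array of draws for any parameter, with iterations along the first axis.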

4 We do not have a formal proof of this conjecture. If the distribution of the unobserved instrument is exactly normal, it is identical to the specification of $G_0$ from which the unobserved instruments are drawn. We can expect to end up with either $\tilde{n} = 1$ or $\tilde{n} = n_t$, where $\tilde{n}$ is the number of distinct values of the unobserved instrument. Both situations are not identifiable. We found support for this using simulated data.


Results main parameters for bimodal and gamma distribution

Table 7.1: Results main parameters for bimodal distribution for the three situations: σεν = 0 (A), σεν = 0.36 (B), and σεν = 0.79 (C).

     Method      β1             σ²ε            σεν             k
A    OLS         1.99 (0.020)   1.00 (0.027)
     Bayes LIV   2.00 (0.043)   1.01 (0.029)   -0.02 (0.127)   79 (42.99)
     LIV2        2.00 (0.037)   1.00 (0.028)   -0.02 (0.115)   2
B    OLS         2.15 (0.012)   0.95 (0.046)
     Bayes LIV   2.01 (0.030)   1.00 (0.050)    0.35 (0.074)   43 (24.59)
     LIV2        2.00 (0.029)   1.00 (0.050)    0.36 (0.073)   2
C    OLS         2.31 (0.015)   0.76 (0.033)
     Bayes LIV   2.00 (0.028)   1.01 (0.049)    0.78 (0.030)   8 (2.26)
     LIV2        2.00 (0.029)   1.00 (0.051)    0.78 (0.031)   2

The results for the bimodal distribution are presented in table 7.1. We present the means and standard deviations of the estimated parameters computed across the 15 simulated datasets. For the nonparametric Bayes model we computed the posterior means for $\beta_1$, $\sigma_\varepsilon^2$, $\sigma_{\varepsilon\nu}$, and $k$ across the 2000 saved MCMC iterations and, subsequently, we computed the average and standard deviation across these 15 posterior means. We do not report the results for $\beta_0$ because it was estimated consistently by OLS, since $x_{ij}$ has mean zero in all cases.

It follows from table 7.1 that the simple nonparametric Bayes model gives approximately unbiased results in all cases. As can be seen, the number of clusters (i.e. different values of $\theta_{ij}$), as indicated by $k$, is a parameter in the nonparametric Bayes model and is also estimated, as opposed to the standard LIV model where $k$ has to be chosen a priori. The high standard deviations of the estimated values for $k$ do not mean that it is estimated imprecisely; rather, throughout the MCMC iterations a few high values of $k$ appear that affect its mean and standard deviation, see for instance figure 7.1. In particular for $\sigma_{\varepsilon\nu} > 0$, we find that the number of components estimated by the nonparametric Bayes LIV model was closer to the true number of components, which was two.

It can be seen that the OLS results are biased when the regressor is correlated with the error term, i.e. when $\sigma_{\varepsilon\nu} \neq 0$. When $x$ is truly exogenous, OLS is the best alternative and the classical LIV model, which was specified with $k = 2$, is slightly more efficient than the nonparametric Bayes model. For nonzero values of $\sigma_{\varepsilon\nu}$, the classical and Bayesian LIV methods give approximately similar results. We note that the classical LIV model is correctly specified in all cases since the true number of groups is two. However, the performance of the nonparametric Bayesian LIV procedure is very encouraging.
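The size of the OLS bias reported in the tables can be checked against the large-sample limits of the OLS slope and residual variance. The sketch below assumes that model (7.2) has the form $y = \beta_0 + \beta_1 x + \varepsilon$ with $x = \theta + \nu$, so that $\mathrm{var}(x) = 1.5 + 1 = 2.5$; this form is an assumption, as model (7.2) is not restated in this section, and the variable names are hypothetical.

```python
import numpy as np

beta1, var_theta, var_nu, var_eps = 2.0, 1.5, 1.0, 1.0
var_x = var_theta + var_nu           # variance of x = theta + nu (assumed form)

results = {}
for s in (0.0, 0.36, 0.79):          # sigma_{eps,nu} in situations A, B, C
    slope = beta1 + s / var_x        # probability limit of the OLS slope
    resid_var = var_eps - s ** 2 / var_x   # limit of the OLS residual variance
    results[s] = (slope, resid_var)
    print(s, round(slope, 3), round(resid_var, 3))
```

Under these assumptions the limits are about 2.14 and 0.95 for moderate endogeneity and about 2.32 and 0.75 for severe endogeneity, in line with the OLS rows of the tables.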

Figure 7.1: Iteration plot of $k$ for synthetic dataset no. 15 when $\sigma_{\varepsilon\nu} = 0$.

As for the classical LIV model, in the nonparametric Bayesian LIV model we can easily test whether $\sigma_{\varepsilon\nu} = 0$ (i.e. a test for endogeneity) by computing the fractions of the MCMC sample in which $\sigma_{\varepsilon\nu} > 0$ and $\sigma_{\varepsilon\nu} < 0$. We found for $\sigma_{\varepsilon\nu} = 0$ that these fractions were close to 0.5, as they should be, and for both $\sigma_{\varepsilon\nu} = 0.36$ and 0.79 that they were equal to 1 and 0, respectively. Hence, a test for endogeneity can be computed straightforwardly as a byproduct of the MCMC output, using the above procedure based on posterior $P$-values (for a discussion see Meng, 1994, or Sellke, Bayarri and Berger, 2001).
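The fractions used in this test are simple summaries of the saved draws. A minimal sketch follows; the function name `posterior_sign_fractions` is hypothetical, and the draws used for illustration are synthetic normal samples, not output from the models in this chapter.

```python
import numpy as np

def posterior_sign_fractions(draws):
    """Return the fractions of MCMC draws that are > 0 and < 0."""
    draws = np.asarray(draws, dtype=float)
    return float(np.mean(draws > 0)), float(np.mean(draws < 0))

# Illustration with synthetic draws standing in for sigma_{eps,nu}.
rng = np.random.default_rng(0)
centered = rng.normal(0.0, 0.10, size=2000)   # mimics the no-endogeneity case
shifted = rng.normal(0.36, 0.07, size=2000)   # mimics moderate endogeneity

frac_centered = posterior_sign_fractions(centered)
frac_shifted = posterior_sign_fractions(shifted)
```

Fractions near 0.5 give no indication of endogeneity, whereas a fraction near 1 or 0 indicates that $\sigma_{\varepsilon\nu}$ is away from zero.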

Table 7.2: Results main parameters for a skewed gamma distribution for the three situations: σεν = 0 (A), σεν = 0.36 (B), and σεν = 0.79 (C).

     Method      β1             σ²ε            σεν             k
A    OLS         2.00 (0.023)   1.01 (0.053)
     Bayes LIV   2.00 (0.049)   1.02 (0.053)   -0.02 (0.087)   13 (3.05)
     LIV2        2.00 (0.043)   1.01 (0.053)   -0.02 (0.072)   2
     LIV3        1.99 (0.031)   1.01 (0.088)   -0.02 (0.082)   3
B    OLS         2.14 (0.018)   0.95 (0.056)
     Bayes LIV   2.00 (0.021)   1.00 (0.066)    0.37 (0.057)   11 (2.45)
     LIV2        2.00 (0.025)   1.00 (0.066)    0.37 (0.074)   2
     LIV3        1.99 (0.024)   1.00 (0.071)    0.38 (0.071)   3
C    OLS         2.32 (0.028)   0.74 (0.043)
     Bayes LIV   1.99 (0.026)   1.01 (0.082)    0.81 (0.075)   14 (2.31)
     LIV2        1.99 (0.031)   1.01 (0.088)    0.81 (0.096)   2
     LIV3        1.99 (0.029)   1.00 (0.086)    0.80 (0.082)   3

In table 7.2 we present the results for a situation where the unobserved instrument has a continuous skewed gamma distribution with scale parameter 0.5 and shape 0.577 (i.e. its variance is 1.5). It can be seen that the proposed nonparametric Bayesian LIV model gives unbiased results in all cases. Its results are preferred to the classical LIV results for $\sigma_{\varepsilon\nu} > 0$. Although the classical LIV model with three categories is slightly more efficient than with two categories, it is still less efficient than the nonparametric Bayes LIV model. For the classical LIV model with $k > 3$ we found degenerate solutions in several runs, which indicates that $k = 3$ is more or less the “best” choice. It can be seen that the nonparametric Bayes model can adapt more easily to a situation where the true distribution of the instrument is not discrete but continuous. As before, when $\sigma_{\varepsilon\nu} = 0$ the OLS estimate is best and the standard LIV model gives more efficient results than the nonparametric Bayes model. Surprisingly, for $\sigma_{\varepsilon\nu} = 0$ and 0.36 the estimated number of clusters $k$ is much lower for the skewed gamma distribution than for the discrete distribution with two categories in table 7.1. This can be expected if the two components of the discrete distribution are not far apart, in which case the sampled distribution of $\theta$ resembles a symmetric, unimodal distribution with a flat top, which is approximated by a large number of support points drawn from normal distributions. Apparently, a skewed gamma distribution is approximated by a mixture of normals with fewer support points. The proposed test for endogeneity was found to give satisfactory results (posterior $P$-values of 0.45, 1, and 1 for $\sigma_{\varepsilon\nu} = 0$, 0.36, and 0.79).

Table 7.3: Results main parameters for a t distribution with six degrees of freedom for the three situations: σεν = 0 (A), σεν = 0.36 (B), and σεν = 0.79 (C).

     Method      β1             σ²ε            σεν            k
A    OLS         2.00 (0.006)   1.00 (0.016)
     Bayes LIV   1.99 (0.027)   1.00 (0.016)   0.03 (0.066)   29 (42.7)
B    OLS         2.14 (0.005)   0.94 (0.018)
     Bayes LIV   1.99 (0.028)   1.00 (0.024)   0.38 (0.069)   15 (8.94)
C    OLS         2.32 (0.004)   0.74 (0.015)
     Bayes LIV   2.00 (0.032)   1.00 (0.058)   0.80 (0.078)   19 (6.75)

In table 7.3 we present the results for the simulated $t_6$ distribution of the instruments. We found that a sample size of $n_t = 1000$ was not sufficient to estimate the model. In general, the MCMC chain did not converge, indicating non- or under-identification, which was immediately clear from convergence plots for (e.g.) $\sigma_{\varepsilon\nu}$. This can be expected because the $t$ distribution is close to a normal distribution, which is not identified, and a relatively large sample is needed to have a good representation of the tails of that distribution. The results presented in table 7.3 are for a total sample size of $n_t = 10000$ (obtained as, for instance, $n = 10000$ and $m = 1$, or $n = 1000$ and $m = 10$). As before, we used 15 simulated datasets. We also estimated the synthetic data with the standard LIV model, but in all cases we found that the estimated Hessian matrix had several eigenvalues equal to zero, indicating that the model is not identified. This reveals that the standard LIV model is more sensitive to the distribution of the instruments when it is close to normal. From the results in table 7.3 it becomes clear that the nonparametric Bayes LIV model yields approximately unbiased results, but the relatively large standard deviations indicate that the model is weakly identified. Examination of the iteration plots suggests that the chain has converged. Contrary to the previous results, the nonparametric Bayes estimates exhibit more variability for larger amounts of endogeneity, although part of this variability is expected to decrease when the number of simulated datasets is increased.

The simulation studies presented here illustrate that the nonparametric Bayes approach with a general distribution for the unobserved instrument is powerful in estimating linear models in the presence of regressor-error correlations. We compared the new method with classical OLS and the LIV method proposed in the previous chapters and we examined two extreme cases, one case where the classical LIV method is correctly specified and one case which represents near identifiability. When the distribution of the instrument is truly discrete, the classical LIV approach performs best, but the nonparametric approach gives approximately similar results. When the true distribution of the instrument is continuous the nonparametric Bayes approach performs better, which illustrates its flexibility in adapting to the distribution of $\theta$. The standard LIV model could not be estimated when the unobserved instrument has a $t$ distribution. The nonparametric Bayesian LIV method, however, does give approximately unbiased results, but the estimated standard deviations may be rather large. Besides, the results critically rely on the sample size used, which may present a problem for cross-sectional applications. However, this may be less of an issue in multilevel studies, where typically several observations on a subject are available.

In the following we present the results for the multilevel model described in section 7.3. Here we consider a situation with one endogenous regressor and a situation with two endogenous regressors. Given the problems found with the $t$ distribution for the simple model, we only present results for an unobserved instrument that has a discrete distribution and a continuous skewed distribution. This model cannot be estimated by the standard LIV model developed in the previous chapters.

7.4.2 Simulation results for the hierarchical model

We considered two situations for model (7.17): one in which we have one endogenous regressor $z_1$, i.e. $l_1 = 1$, and one with two, $l_1 = 2$. In both cases we assumed the presence of one exogenous covariate $z_2$ ($l_2 = 1$), with a small effect ($\alpha$) on the endogenous covariates, one regressor $x$, and a constant, such that $\beta_i$ is of dimension two, for $i = 1, \ldots, n$. We took $n = 500$ and $m = 15$ in all cases. As before, we simulated 15 datasets and we considered three situations: no, moderate and severe endogeneity (the corresponding elements of $\Sigma_{\beta z_1}$ are 0, 0.36 and 0.79, respectively). We first present the results for one endogenous regressor, where the true distribution of $\theta_i$ is a (univariate) discrete distribution with two categories; subsequently we present the results for two endogenous covariates, where the distribution of the unobserved instrument is a (bivariate) skewed gamma distribution. The estimates are compared with results from a standard hierarchical Bayes model as in Lenk et al. (1996).

Results for one endogenous regressor

The results are presented in table 7.4. We only present the results for the regression parameter $\beta_1$, the regression parameters $\gamma$ that correspond to the endogenous covariate $z_1$, and the estimated covariances $\Sigma_{\beta z_1}$. We found that for two of the 15 simulated datasets the estimated value for $k$ equaled 1 in all MCMC draws, in which case the model is not identified. These situations can be identified easily from examining iteration plots of $k$ or $\gamma$, see the figure in appendix 7D. The estimated values for the nonparametric Bayes model for $\gamma$ and $\Sigma_{\beta z_1}$ in table 7.4 are obtained after excluding the non-converged cases.


Table 7.4: Results main parameters for hierarchical model with one possible endogenous regressor for three situations: Σβz1 = 0 × ι2 (A), Σβz1 = 0.36 × ι2 (B), and Σβz1 = 0.79 × ι2 (C). The other true values are: β1 = 2 and γ11 = γ21 = 1.

                   A                    B                    C
             NPB LIV   Std HB     NPB LIV   Std HB     NPB LIV   Std HB
β1            2.01      2.01       1.99      1.99       2.02      2.02
             (0.06)    (0.06)     (0.07)    (0.07)     (0.08)    (0.08)
γ11           1.03      1.00       1.00      1.14       0.99      1.28
             (0.06)    (0.03)     (0.06)    (0.03)     (0.04)    (0.02)
γ21           0.99      1.00       1.00      1.15       1.02      1.29
             (0.06)    (0.03)     (0.08)    (0.03)     (0.02)    (0.02)
Σ(11)βz1     -0.05                 0.34                 0.70
             (0.13)               (0.15)               (0.10)
Σ(21)βz1      0.02                 0.37                 0.68
             (0.13)               (0.20)               (0.08)
k             13                   11                   5
             (3.70)               (2.74)               (1.30)

Two results are immediately clear. Firstly, the results for the main regression parameter $\beta_1$ are almost equal for the nonparametric Bayes model and the standard hierarchical Bayes model, regardless of whether a covariate $z$ is endogenous or not. Hence, the results for the main regression parameters $\beta$ are unaffected by the presence of an endogenous covariate $z$ (in this example, however, the covariates $z$ were generated independently from the regressors $x$, i.e. they are not collinear, which may potentially explain the result found). Secondly, as expected, when endogeneity is present, the estimated regression parameters $\gamma$ obtained from the standard hierarchical Bayes model are biased. It can be seen that the nonparametric Bayes model corrects for this and allows for unbiased estimation.

We note that the effective sample size used to estimate the distribution of the latent instrument in this simulation study is 500, whereas in the previous simulation studies the sample size was at least 1000. Furthermore, the regression parameters $\beta_i$ are unobserved parameters and, hence, less information is available to estimate the distribution of the unobserved instrument. Hence, we typically observe larger standard deviations than for the results in table 7.1.

In all cases $\sigma_\varepsilon^2$, the variance of the main regression equation for $y$, was estimated without bias by both the nonparametric Bayes LIV model and the standard hierarchical Bayes model. Furthermore, although not given in table 7.4, the nonparametric Bayesian LIV model estimates for the heterogeneity variances of the regression parameters $\Sigma_{\beta\beta}$ are approximately unbiased, regardless of the presence of endogenous covariates. In contrast, the heterogeneity variances are severely underestimated by the standard hierarchical Bayes model in the presence of an endogenous covariate. This was also found in chapter 6 for the variance of the random intercept (for instance table 6.2).

It can be seen from the above results that the bias due to level-2 endogeneity in the standard hierarchical Bayes estimates for $\gamma$ and $\Sigma_{\beta\beta}$ may be quite large. In the following subsection we present the results for a situation with two endogenous covariates.

Results for two endogenous regressors

Since there are a large number of parameters in this model, we only focus on the main parameters. As for the results in the previous subsection with one endogenous covariate, the results for the main regression equation are not affected by the presence of endogeneity of some of the $z$'s (while assuming that the $z$'s and the $x$'s are not collinear), and both the nonparametric Bayes LIV model and the standard hierarchical Bayes model give approximately similar results. We therefore choose to report the results only for the elements of $\gamma$ ($\gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}$) corresponding to the endogenous covariates and the elements of $\Sigma_{\beta z_1}$. Furthermore, we report the results for the estimated number of clusters $k$.

The results are presented in table 7.5. It can be seen that the nonparametric Bayes LIV model gives unbiased results for the regression parameters $\gamma$ and the covariances between ($\eta_i$, $\xi_i$) in (7.17) and (7.18), which induce the dependency between $\beta_i$ and the elements of $z_{1i}$. Furthermore, the estimated values for $\gamma$ using the standard hierarchical Bayes model are biased when the covariates $z_1$ are not exogenous. For instance, for the case with severe endogeneity (C), this bias amounts to approximately 25%, which is quite severe.

Table 7.5: Results main parameters for hierarchical model with two possible endogenous regressors for three situations: Σβz1 = 0 × I2 (A), Σβz1 = 0.36 × I2 (B), and Σβz1 = 0.79 × I2 (C). The other true values are: γ11 = γ21 = 1 and γ12 = γ22 = −1.

                   A                    B                    C
             NPB LIV   Std HB     NPB LIV   Std HB     NPB LIV   Std HB
γ11           0.98      1.00       0.98      1.12       1.00      1.24
             (0.05)    (0.03)     (0.05)    (0.02)     (0.04)    (0.04)
γ12          -1.00     -1.00      -1.00     -0.88      -0.98     -0.74
             (0.05)    (0.03)     (0.05)    (0.02)     (0.05)    (0.04)
γ21           0.99      0.99       0.99      1.12       0.99      1.24
             (0.07)    (0.03)     (0.05)    (0.02)     (0.03)    (0.03)
γ22          -0.99     -0.98      -1.02     -0.89      -1.01     -0.75
             (0.07)    (0.03)     (0.04)    (0.02)     (0.03)    (0.03)
Σ(11)βz1      0.03                 0.40                 0.70
             (0.13)               (0.12)               (0.09)
Σ(12)βz1      0.03                 0.37                 0.69
             (0.09)               (0.11)               (0.10)
Σ(21)βz1      0.01                 0.38                 0.71
             (0.17)               (0.08)               (0.08)
Σ(22)βz1      0.01                 0.40                 0.73
             (0.17)               (0.11)               (0.06)
k             23                   18                   21
             (12.71)              (5.64)               (3.58)

We found that the estimates for the heterogeneity variance components⁵ $\Sigma_{\beta\beta}$ are approximately unbiased for the nonparametric Bayes LIV model, but the standard hierarchical Bayes model underestimates the variances by about 40% when there is severe endogeneity.

5 Not reported here.

7.5 Discussion nonparametric Bayesian LIV approach

In this chapter we presented preliminary findings on a very general approach

to model endogeneity in single or multilevel models. We proposed a non-

parametric Bayes approach to model the distribution of the latent instrument

using a Dirichlet prior process. We considered two general multilevel models

that may suffer from regressor-error dependencies. One advantage of using a

Bayesian approach is that it can handle complex model structures in a straight-

forward manner through MCMC estimation. We illustrated that the nonpara-

metric Bayes model for the unobserved instrument proposed in section 7.2

could be adapted to a more general setting in section 7.3 without too much

effort. Although the technicalities surrounding Dirichlet prior processes may

be demanding, it presents a flexible approach that can be extended and adapted

easily to other situations with potential regressor-error dependencies.

The simulation results showed that the nonparametric Bayes LIV model gives unbiased results for a variety of settings. We compared the approach proposed in this chapter to the LIV model in chapter 3, and found that the nonparametric Bayes approach outperforms the classical LIV model for non-discrete distributions, because the Dirichlet process prior allows for full estimation of the distribution of the unobserved instrument. We saw that the model yields approximately unbiased results, even for the extreme case when the unobserved instrument has a $t$ distribution, although a large dataset needs to be available. Implementation of the nonparametric Bayes LIV model does not require a priori specification of the number of clusters; this number is estimated as a by-product of the estimation. Similarly, we found for the random coefficients model in section 7.3 that the nonparametric Bayesian LIV approach can be successfully used to estimate the model parameters in the presence of endogenous covariates. Importantly, the estimates for $\gamma$ obtained from the standard hierarchical Bayes model are strongly biased. In addition, the standard hierarchical Bayes model substantially underestimates the amount of heterogeneity in the regression coefficients. We showed that the nonparametric Bayesian LIV approach can handle situations with more than one endogenous regressor.

Although the results are promising, future research is needed to obtain more insight into using a Dirichlet prior process for the distribution of the unobserved instrument. The simulation studies presented here are informative, but limited to only three kinds of distributions for the unobserved instrument: a discrete distribution with two categories, a heavily skewed gamma distribution and a $t$ distribution with fat tails. We plan to investigate the properties of the nonparametric Bayes LIV model for a broader range of distributions. The results above suggest that when the distribution of the latent instrument is close to a normal distribution the results should be interpreted with caution, which was revealed by examining iteration plots of key parameters. We suggest investigating convergence issues in detail in empirical applications, since simply relying on iteration plots may be too limiting (Cowles and Carlin, 1996, Brooks and Roberts, 1998). Besides, we plan to investigate whether the mixing of the MCMC chain can be improved and whether this depends on the amount of autocorrelation between subsequent MCMC draws. We found that the computational burden for the nonparametric Bayes LIV model is much larger than for the standard LIV model.

One interesting application for the nonparametric Bayes LIV approach is to

combine the Dirichlet process prior with the Hausman-Taylor approach pre-

sented in subsection 6.3.3. The Hausman-Taylor approach can be applied

to general random intercept models that suffer from level-two dependencies.

Hausman and Taylor (1981) show that the multilevel structure of the data and

prior knowledge on the exogeneity of part of the available regressors can be

used to construct instrumental variables to estimate the regression parameters.

This method has the advantage that no external instrumental variables are re-

quired. Furthermore, their approach allows for estimation of level-two (group-

specific) variables as opposed to e.g. Mundlak’s approach or fixed-effects es-

timation. The Hausman-Taylor estimator, however, was shown to be limited


in its use when level-one regressor-error dependencies are present (see section

6.4). Importantly, in constructing the ‘internal’ instruments as proposed by

Hausman and Taylor focus is only on whether or not certain regressors can be

assumed independent of the random intercepts. Although this may be a valid

assumption in many applications, their method does not address the strength

of the obtained ‘internal’ instruments, and the method seems to be ad-hoc at

this point. We illustrated this important aspect in section 6.6. Incorporation

of a generally distributed unobserved instrument in the Hausman-Taylor model

can be done using the results in section 7.2, and may yield improved results

when the proposed ‘internal’ instruments are weak or when possible level-one

endogeneity is present.

Furthermore, our nonparametric Bayesian LIV approach can possibly handle situations where both level-1 and level-2 dependencies are present. Loosely speaking, the errors in (7.2) can be regarded as the result of two terms: (1) a level-2 specific error, and (2) a level-1 specific error, marked with a tilde. I.e., the ‘total’ errors are $\varepsilon_{ij} = \alpha_i + \tilde{\varepsilon}_{ij}$ and $\nu_{ij} = \tau_i + \tilde{\nu}_{ij}$. The variance-covariance matrix $\Sigma$ in (7.3) can be changed accordingly. Level-2 endogeneity arises when $E\,\alpha_i \tau_i \neq 0$, and level-1 endogeneity when $E\,\tilde{\varepsilon}_{ij} \tilde{\nu}_{ij} \neq 0$. In both cases, $E\,\varepsilon_{ij} \nu_{ij} \neq 0$, which is similar in form to the problem considered in section 7.2. It is interesting to investigate this extension, in particular given the conclusions in the previous chapter.

The standard LIV model in chapter 3 is identified when the unobserved in-

strument has at least two categories. When the unobserved instrument has one

mean (m = 1), the parameters are not identified. In a Dirichlet process there

is a positive probability that the number of different values for the latent instrument

$\theta$ is less than the sample size $n_t$ (i.e. the $\theta$'s are clustered). Our simulation

studies suggest that identification problems occur when the true distribution of

the unobserved instrument gets close to normal. We found clear indications

of lack of convergence of the MCMC output in such cases, suggesting that

the nonparametric Bayes model ‘automatically’ points out a non-identified solution,

and the results should be discarded. Nevertheless, further research is

required to investigate this conjecture. Observations made by Lewbel (1997)


or Carroll, Roeder and Wasserman (1999) may prove helpful. Related to this,

we suspect that, if the distribution of the error $\nu_{ij}$ in (7.2) is non-normal, a normal

distribution for the latent instrument may be possible.


Appendix 7A The Dirichlet process

Ferguson (1973) introduces the Dirichlet process priors as a class of prior distributions for a set of probability distributions on a given sample space. The Dirichlet process is based on the Dirichlet distribution, which is given in definition 7A.1 (Ferguson, 1973).

Definition 7A.1 Let $Z_1, Z_2, \ldots, Z_k$ be independent random variables with $Z_i \in \mathrm{Gamma}(\alpha_i, 1)$, where $\alpha_i \geq 0$ for all $i$, and $\alpha_i > 0$ for some $i$. Let $Y_i = Z_i / \sum_j Z_j$. Then the distribution of $(Y_1, \ldots, Y_k)$ is a Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_k)$.

The probability density of a Dirichlet distribution is defined by

$$f(y_1, \ldots, y_k \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \left( \prod_{i=1}^{k} y_i^{\alpha_i - 1} \right), \tag{7A.1}$$

where $\alpha_1, \ldots, \alpha_k > 0$, $y_1, \ldots, y_k \geq 0$ and $\sum_i y_i = 1$. Let $\alpha = \sum_i \alpha_i$. The mean of a Dirichlet distribution is $E(y_i) = \alpha_i / \alpha$ and the variance is $\mathrm{var}(y_i) = \alpha_i (\alpha - \alpha_i) / (\alpha^2 (\alpha + 1))$. The Dirichlet distribution is an extension of the Beta distribution ($k = 2$). The Dirichlet distribution can be used as a prior for the (discrete) probabilities (group sizes) in e.g. mixture models (see also property 3, Ferguson, 1973).
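Definition 7A.1 and the moment formulas above can be illustrated with a small simulation. The following is only a sketch using the Python standard library; the function names and the choice of parameter values are ours and purely illustrative. Note that `random.gammavariate` requires strictly positive shape parameters, so the boundary case $\alpha_i = 0$ allowed in the definition is not covered.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw (Y_1, ..., Y_k) by normalising independent Gamma(alpha_i, 1) draws."""
    z = [rng.gammavariate(a, 1.0) for a in alphas]  # needs every alpha_i > 0
    total = sum(z)
    return [zi / total for zi in z]


def dirichlet_mean(alphas):
    """E(y_i) = alpha_i / alpha, with alpha the sum of the alpha_i."""
    a = sum(alphas)
    return [ai / a for ai in alphas]


def dirichlet_var(alphas):
    """var(y_i) = alpha_i (alpha - alpha_i) / (alpha^2 (alpha + 1))."""
    a = sum(alphas)
    return [ai * (a - ai) / (a ** 2 * (a + 1.0)) for ai in alphas]


rng = random.Random(0)
alphas = [2.0, 3.0, 5.0]  # illustrative values
draws = [sample_dirichlet(alphas, rng) for _ in range(20000)]
mc_mean = [sum(d[i] for d in draws) / len(draws) for i in range(len(alphas))]
print("Monte Carlo mean:", mc_mean)
print("Theoretical mean:", dirichlet_mean(alphas))
print("Theoretical var: ", dirichlet_var(alphas))
```

The Monte Carlo averages should be close to the theoretical means, up to simulation error.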

Ferguson’s Dirichlet process is extensively discussed in Ferguson (1973) and Antoniak (1974). Here we present definition 1 from Antoniak:

Definition 7A.2 Let $\Theta$ be a set and $\mathcal{A}$ be a $\sigma$-field of subsets of $\Theta$. Let $\nu$ be a finite, non-null, non-negative, finitely additive measure on $(\Theta, \mathcal{A})$. Now, a random probability measure $P$ on $(\Theta, \mathcal{A})$ is a Dirichlet process on $(\Theta, \mathcal{A})$ with parameter $\nu$, if for every $k = 1, 2, \ldots$ and measurable partition $B_1, \ldots, B_k$ of $\Theta$, the joint distribution of the random probabilities $(P(B_1), \ldots, P(B_k))$ is a Dirichlet distribution with parameters $(\nu(B_1), \ldots, \nu(B_k))$.

We write: $P \in D(\nu)$. Ferguson obtains the following properties of the Dirichlet process:

1. If $P \in D(\nu)$, and $A \in \mathcal{A}$, then $E(P(A)) = \nu(A)/\nu(\Theta)$;

2. If $P \in D(\nu)$, and, conditional on $P$, $\theta_1, \theta_2, \ldots, \theta_n$ are an i.i.d. sample from $P$, then $P \mid \theta_1, \theta_2, \ldots, \theta_n \in D(\nu + \sum_{i=1}^{n} \delta_{\theta_i})$, where $\delta_x$ is a measure giving mass one to the point $x$;

3. If $P \in D(\nu)$, then $P$ is almost surely discrete.

Furthermore, Ferguson (1973) shows that for a sample $X$ of size 1 from $P \in D(\nu)$, $P(X \in A) = \nu(A)/\nu(\Theta)$, for $A \in \mathcal{A}$.

In nonparametric Bayesian applications it is common to specify $\nu$ as $\nu = \alpha G_0$, where $G_0$ is a distribution, and $\alpha > 0$. The posterior distribution, which is conditional on the data, that arises from a structure with a Dirichlet process prior, is known to be a mixture of Dirichlet processes (Antoniak, 1974; Escobar, 1994). Escobar’s (1994) results on MCMC estimation, however, show that the full conditional distributions are fairly simple to use in parameter estimation.
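Combining property 2 with the result for a sample of size 1 gives a sequential sampling rule when $\nu = \alpha G_0$: given $\theta_1, \ldots, \theta_n$, the next value is a fresh draw from $G_0$ with probability $\alpha/(\alpha + n)$ and a copy of one of the earlier values otherwise. The sketch below illustrates the resulting clustering (property 3). It is not part of the estimation code of this thesis; the function name `polya_urn`, the standard normal choice for $G_0$ and the value $\alpha = 1$ are assumptions made for the illustration.

```python
import random


def polya_urn(n, alpha, base_draw, rng):
    """Sequentially draw theta_1, ..., theta_n from P with P in D(alpha * G0)."""
    thetas = []
    for i in range(n):
        # With probability alpha / (alpha + i): new value from G0 (always for i = 0).
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_draw(rng))
        else:
            # Otherwise copy one of the i earlier values, each with equal probability.
            thetas.append(rng.choice(thetas))
    return thetas


rng = random.Random(1)
# Hypothetical base distribution G0: standard normal.
thetas = polya_urn(200, alpha=1.0, base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng)
print(len(set(thetas)), "distinct values among", len(thetas), "draws")
```

The number of distinct values is typically far smaller than the number of draws, which is the clustering of the $\theta$'s referred to in the discussion of identification.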

Appendix 7B Full conditionals: the simple multilevel model with general LIV

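Full conditional for $\Sigma$. For $\Sigma$ we take the inverted Wishart prior with parameters $\omega$ and $\Psi$, implying

$$\begin{aligned}
p(\Sigma \mid \text{rest}, \text{data}) &\propto (\det(\Sigma))^{-n/2} \exp\left\{ -\frac{1}{2} \sum_i \sum_j (z_{ij} - \mu^s_{z,ij})' \Sigma^{-1} (z_{ij} - \mu^s_{z,ij}) \right\} \times \\
&\quad \times |\Sigma|^{-(\omega + 2 + 1)/2} \exp\left[ -\frac{1}{2} \mathrm{tr}\left( \Psi \Sigma^{-1} \right) \right] \qquad (7B.1) \\
&\propto |\Sigma|^{-(\omega + n + 2 + 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \left( \sum_i \sum_j (z_{ij} - \mu^s_{z,ij})(z_{ij} - \mu^s_{z,ij})' + \Psi \right) \Sigma^{-1} \right] \right\},
\end{aligned}$$

i.e. the full conditional of $\Sigma$ is an inverted Wishart with parameters $\omega + n$ and $\left( \sum_i \sum_j (z_{ij} - \mu^s_{z,ij})(z_{ij} - \mu^s_{z,ij})' + \Psi \right)$.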
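The update in (7B.1) only requires the degrees of freedom and the scale matrix of the inverted Wishart distribution. A minimal sketch of this bookkeeping is given below, assuming the residuals $z_{ij} - \mu^s_{z,ij}$ have already been computed as pairs; the function name and the example numbers are hypothetical, and drawing from the inverted Wishart itself is left to a numerical library.

```python
def inv_wishart_posterior(residuals, omega, psi):
    """Return (degrees of freedom, 2x2 scale matrix) of the full conditional of Sigma.

    residuals: list of pairs (e_y, e_x), one per observation z_ij - mu^s_{z,ij}
    omega, psi: prior degrees of freedom and 2x2 prior scale matrix
    """
    scale = [[psi[0][0], psi[0][1]], [psi[1][0], psi[1][1]]]
    for e in residuals:
        for a in range(2):
            for b in range(2):
                scale[a][b] += e[a] * e[b]  # accumulate the outer products e e'
    return omega + len(residuals), scale


# Illustrative numbers only.
df, scale = inv_wishart_posterior(
    residuals=[(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)],
    omega=3,
    psi=[[1.0, 0.0], [0.0, 1.0]],
)
print(df, scale)
```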

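Full conditional for $\beta$. The prior for $\beta$ is a bivariate normal distribution with mean $\mu_\beta$ and variance $\Sigma_\beta$. Because both likelihood and prior are normal distributions, we focus on the ‘quadratic’ term (kernel). For the sake of notational convenience we suppress the subscript $ij$ in the following, and let $\eta = (x - \theta)$ and $\mathbf{x} = (1, x)$:

$$\begin{aligned}
(z - \mu^s_z)' \Sigma^{-1} (z - \mu^s_z)
&= \left( \begin{pmatrix} y \\ x \end{pmatrix} - \begin{pmatrix} \mathbf{x}\beta \\ \theta \end{pmatrix} \right)' \begin{bmatrix} \sigma^{(11)} & \sigma^{(12)} \\ \sigma^{(12)} & \sigma^{(22)} \end{bmatrix} \left( \begin{pmatrix} y \\ x \end{pmatrix} - \begin{pmatrix} \mathbf{x}\beta \\ \theta \end{pmatrix} \right) \\
&= \sigma^{(11)} (y - \mathbf{x}\beta)'(y - \mathbf{x}\beta) + \sigma^{(12)} \eta (y - \mathbf{x}\beta) + \sigma^{(12)} \eta (y - \mathbf{x}\beta)' + \eta^2 \sigma^{(22)} \\
&\propto \sigma^{(11)} (-\beta' \mathbf{x}' y - y \mathbf{x} \beta + \beta' \mathbf{x}' \mathbf{x} \beta) - \sigma^{(12)} \eta \mathbf{x} \beta - \sigma^{(12)} \beta' \mathbf{x}' \eta \qquad (7B.2) \\
&= \sigma^{(11)} \beta' \mathbf{x}' \mathbf{x} \beta - \beta' \mathbf{x}' (\sigma^{(11)} y + \sigma^{(12)} \eta) - (\sigma^{(11)} y + \sigma^{(12)} \eta) \mathbf{x} \beta.
\end{aligned}$$

By adding the subscripts and the kernel of the prior we get

$$\beta' \left( \sigma^{(11)} \sum_{i,j} \mathbf{x}_{ij}' \mathbf{x}_{ij} + \Sigma_\beta^{-1} \right) \beta - \beta' \left( \sum_{i,j} \mathbf{x}_{ij}' (\sigma^{(11)} y_{ij} + \sigma^{(12)} \eta_{ij}) + \Sigma_\beta^{-1} \mu_\beta \right) - \left( \mu_\beta' \Sigma_\beta^{-1} + \sum_{i,j} (\sigma^{(11)} y_{ij} + \sigma^{(12)} \eta_{ij}) \mathbf{x}_{ij} \right) \beta.$$

Let $C = \sigma^{(11)} \sum_{i,j} \mathbf{x}_{ij}' \mathbf{x}_{ij} + \Sigma_\beta^{-1}$. Now it follows that the full conditional for $\beta$, given the other parameters and the data, is bivariate normal with mean

$$C^{-1} \left( \sum_{i,j} \mathbf{x}_{ij}' (\sigma^{(11)} y_{ij} + \sigma^{(12)} \eta_{ij}) + \Sigma_\beta^{-1} \mu_\beta \right),$$

and variance-covariance $C^{-1}$. Note that if $\Sigma_\beta^{-1} = 0$ the mean of the full conditional for $\beta$ is similar to an ordinary regression with a correction term for the endogeneity. If there is no endogeneity, i.e. $\sigma^{(12)} = 0$, the ‘standard’ regression model is obtained.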
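The moments of the full conditional for $\beta$ can be computed directly from the expressions for $C$ and the mean. The sketch below does this for the bivariate case with plain Python lists; all function and argument names are illustrative assumptions, and the arguments `s11` and `s12` stand for the elements $\sigma^{(11)}$ and $\sigma^{(12)}$ of $\Sigma^{-1}$. The example at the end checks the remark that a flat prior without endogeneity reproduces ordinary least squares.

```python
def solve2(c, b):
    """Solve the 2x2 linear system c * v = b by Cramer's rule."""
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    return [(c[1][1] * b[0] - c[0][1] * b[1]) / det,
            (c[0][0] * b[1] - c[1][0] * b[0]) / det]


def beta_conditional(y, x, theta, s11, s12, prior_prec, prior_mean):
    """Mean and precision C of the bivariate normal full conditional of beta."""
    c = [row[:] for row in prior_prec]  # start from the prior precision
    b = [prior_prec[0][0] * prior_mean[0] + prior_prec[0][1] * prior_mean[1],
         prior_prec[1][0] * prior_mean[0] + prior_prec[1][1] * prior_mean[1]]
    for yi, xi, ti in zip(y, x, theta):
        xt = (1.0, xi)                      # regressor vector (1, x_ij)
        w = s11 * yi + s12 * (xi - ti)      # sigma^(11) y + sigma^(12) eta
        for a in range(2):
            b[a] += xt[a] * w
            for d in range(2):
                c[a][d] += s11 * xt[a] * xt[d]
    return solve2(c, b), c  # the variance-covariance is the inverse of c


# Flat prior and no endogeneity: the mean equals the OLS fit of y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
mean, prec = beta_conditional(y, x, theta=[0.0] * 4, s11=1.0, s12=0.0,
                              prior_prec=[[0.0, 0.0], [0.0, 0.0]],
                              prior_mean=[0.0, 0.0])
print(mean)
```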

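Derivation of full conditional distribution of $\theta_{ij}$. In the following we derive the expressions for the components of (7.10).

Derivation of $h(\theta_{ij} \mid \text{rest}, \text{data})$. $h(\theta_{ij} \mid \text{rest}, \text{data}) \propto p(z_{ij} \mid \theta_{ij}, \beta, \Sigma)\, g_0(\theta_{ij} \mid \mu_g, \sigma_g^2)$ can be computed in a similar way as the full conditional of $\beta$. The kernel of the (structural) likelihood, where $\varepsilon = y - \mathbf{x}\beta$, is

$$\begin{aligned}
(z - \mu^s_z)' \Sigma^{-1} (z - \mu^s_z) &= \sigma^{(11)} \varepsilon' \varepsilon + \sigma^{(12)} (x - \theta) \varepsilon + \sigma^{(12)} (x - \theta) \varepsilon' + (x - \theta)^2 \sigma^{(22)} \\
&\propto -2 \sigma^{(12)} \theta \varepsilon + \sigma^{(22)} \theta^2 - 2 \sigma^{(22)} x \theta. \qquad (7B.3)
\end{aligned}$$

Adding the subscript $ij$ and the kernel of $g_0$ yields

$$\left( \sigma^{(22)} + \frac{1}{\sigma_g^2} \right) \theta_{ij}^2 - 2 \left( \sigma^{(22)} x_{ij} + \sigma^{(12)} \varepsilon_{ij} + \frac{\mu_g}{\sigma_g^2} \right) \theta_{ij}, \tag{7B.4}$$

from which it follows that $h(\theta_{ij} \mid \text{rest}, \text{data})$ is a normal density with mean

$$C^{-1} \left( \sigma^{(22)} x_{ij} + \sigma^{(12)} \varepsilon_{ij} + \frac{\mu_g}{\sigma_g^2} \right),$$

and variance $C^{-1}$, with $C = \sigma^{(22)} + 1/\sigma_g^2$. Note: when $\sigma_g^2 \to \infty$, the location of the distribution $h(\theta_{ij})$ is estimated by $x_{ij}$ and an ‘endogeneity’ correction $(\sigma^{(12)}/\sigma^{(22)}) \varepsilon_{ij}$ (since $\sigma^{(12)} = -\sigma_{\varepsilon\nu}/|\Sigma|$ and $\sigma^{(22)} = \sigma_\varepsilon^2/|\Sigma|$).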
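The limiting case in the note can be written out as follows. This is a short worked step, assuming that $\Sigma$ is the $2 \times 2$ variance-covariance matrix of $(\varepsilon, \nu)$ with elements $\sigma_\varepsilon^2$, $\sigma_{\varepsilon\nu}$ and $\sigma_\nu^2$, so that $\sigma^{(12)}$ and $\sigma^{(22)}$ are elements of its inverse.

```latex
\Sigma^{-1}
= \frac{1}{|\Sigma|}
\begin{pmatrix}
\sigma_\nu^2 & -\sigma_{\varepsilon\nu} \\
-\sigma_{\varepsilon\nu} & \sigma_\varepsilon^2
\end{pmatrix}
\quad\Longrightarrow\quad
\frac{\sigma^{(12)}}{\sigma^{(22)}} = -\frac{\sigma_{\varepsilon\nu}}{\sigma_\varepsilon^2},
\qquad
\lim_{\sigma_g^2 \to \infty}
C^{-1}\left( \sigma^{(22)} x_{ij} + \sigma^{(12)} \varepsilon_{ij} + \frac{\mu_g}{\sigma_g^2} \right)
= x_{ij} - \frac{\sigma_{\varepsilon\nu}}{\sigma_\varepsilon^2}\, \varepsilon_{ij}.
```

Under joint normality, $(\sigma_{\varepsilon\nu}/\sigma_\varepsilon^2)\,\varepsilon_{ij}$ is the conditional expectation of $\nu_{ij}$ given $\varepsilon_{ij}$, so the limit can be read as $x_{ij}$ minus the predicted first-stage error.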


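Derivation of $q_0$. $q_0$ is proportional to $\alpha \int p(z_{ij} \mid \theta_{ij}, \beta, \Sigma)\, dG_0(\theta_{ij} \mid \mu_g, \sigma_g^2)$. In the following we present the steps necessary to integrate $\theta_{ij}$ out of the likelihood function. We focus on the quadratic term of the reduced form likelihood (7.6). This derivation is conditional on all parameters (except $\theta_{ij}$) and the data. We drop the $ij$-subscript for notational convenience and use $z = (y - \beta_0, x)'$ and $\mu^r_z = \theta\nu$, where $\nu = (\beta_1, 1)'$. Then,

$$(z - \mu^r_z)' \Omega^{-1} (z - \mu^r_z) + \frac{(\theta - \mu_g)^2}{\sigma_g^2} = z' \Omega^{-1} z - \theta \nu' \Omega^{-1} z - z' \Omega^{-1} \nu \theta + \theta \nu' \Omega^{-1} \nu \theta + \frac{\theta^2}{\sigma_g^2} - 2 \frac{\mu_g \theta}{\sigma_g^2} + \frac{\mu_g^2}{\sigma_g^2}.$$

By letting $\kappa_1 = \nu' \Omega^{-1} z$, $\kappa_2 = \nu' \Omega^{-1} \nu + 1/\sigma_g^2$, and rearranging terms (the terms not involving $\theta$ or $z$ are dropped in the factor of proportionality), we get

$$\kappa_2 \theta^2 - 2 \left( \frac{\mu_g}{\sigma_g^2} + \kappa_1 \right) \theta + z' \Omega^{-1} z, \tag{7B.5}$$

which is equal to (take $\kappa_3 = \kappa_1 + \mu_g/\sigma_g^2$)

$$\kappa_2 \left( \theta - \frac{\kappa_3}{\kappa_2} \right)^2 - \frac{\kappa_3^2}{\kappa_2} + z' \Omega^{-1} z. \tag{7B.6}$$

The first term integrates to one. The remaining terms involve matrices and vectors and we have to be careful with taking squares, transposing and taking inverses. Let $A = \kappa_2^{-1}$, now (7B.6) is equal to

$$\begin{aligned}
z' \Omega^{-1} z - \kappa_3' A \kappa_3 &= z' \Omega^{-1} z - \left( z' \Omega^{-1} \nu + \frac{\mu_g}{\sigma_g^2} \right) A \left( \nu' \Omega^{-1} z + \frac{\mu_g}{\sigma_g^2} \right) \\
&\propto -z' \Omega^{-1} \nu A \frac{\mu_g}{\sigma_g^2} - \frac{\mu_g}{\sigma_g^2} A \nu' \Omega^{-1} z - z' \Omega^{-1} \nu A \nu' \Omega^{-1} z + z' \Omega^{-1} z \\
&= z' \left( \Omega^{-1} - \Omega^{-1} \nu A \nu' \Omega^{-1} \right) z - z' \Omega^{-1} \nu A \frac{\mu_g}{\sigma_g^2} - \frac{\mu_g}{\sigma_g^2} A \nu' \Omega^{-1} z. \qquad (7B.7)
\end{aligned}$$

Let $B = \Omega^{-1} - \Omega^{-1} \nu A \nu' \Omega^{-1} = [\Omega + \nu \sigma_g^2 \nu']^{-1}$, where we used expression (2-66b) from Greene (2000). The expression in (7B.7) is the kernel of a multivariate normal density$^6$ with variance-covariance matrix $B^{-1} = \Omega + \nu \sigma_g^2 \nu'$ and mean $B^{-1} \Omega^{-1} \nu A \mu_g/\sigma_g^2$.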

$^6$ The general form of a multivariate normal distribution with mean $V^{-1}\mu$ and variance-covariance $V^{-1}$ is: $(x - V^{-1}\mu)' V (x - V^{-1}\mu) = x' V x - \mu' x - x' \mu + \mu' V^{-1} \mu$.


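This latter expression can be simplified to $\nu\mu_g$. This is not immediately clear, but substituting $A = \sigma_g^2 - \sigma_g^2 \nu' [\Omega + \nu \sigma_g^2 \nu']^{-1} \nu \sigma_g^2$ and rearranging terms gives the desired result. Hence, $q_0$ is proportional to $\alpha N_{ij}$, where $N_{ij}$ is the density$^7$ of a bivariate normal distribution at $z_{ij}$ with mean $\nu\mu_g$ and variance-covariance $\Omega + \sigma_g^2 \nu\nu'$, which is equal to

$$B \Sigma B' + \sigma_g^2 \begin{bmatrix} \beta_1^2 & \beta_1 \\ \beta_1 & 1 \end{bmatrix} = \begin{bmatrix} \beta_1^2 (\sigma_\nu^2 + \sigma_g^2) + 2 \beta_1 \sigma_{\varepsilon\nu} + \sigma_\varepsilon^2 & \beta_1 (\sigma_\nu^2 + \sigma_g^2) + \sigma_{\varepsilon\nu} \\ \beta_1 (\sigma_\nu^2 + \sigma_g^2) + \sigma_{\varepsilon\nu} & \sigma_\nu^2 + \sigma_g^2 \end{bmatrix}. \tag{7B.8}$$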
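The two simplifications used for $q_0$, namely that the matrix in (7B.7) inverts to $\Omega + \sigma_g^2 \nu\nu'$ and that the mean reduces to $\nu\mu_g$, can be checked numerically. The sketch below does so for one arbitrary set of parameter values; the function names and the numbers are hypothetical, and $\Omega$ is built under the assumption that it equals $B \Sigma B'$ with $B = \begin{bmatrix} 1 & \beta_1 \\ 0 & 1 \end{bmatrix}$, which reproduces the right-hand side of (7B.8).

```python
def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matvec(m, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def marginal_moments(beta1, var_eps, cov_ev, var_nu, mu_g, var_g):
    """Closed-form mean and covariance of z after integrating theta out, as in (7B.8)."""
    nu = [beta1, 1.0]
    omega = [[var_eps + 2 * beta1 * cov_ev + beta1 ** 2 * var_nu, cov_ev + beta1 * var_nu],
             [cov_ev + beta1 * var_nu, var_nu]]
    cov = [[omega[a][b] + var_g * nu[a] * nu[b] for b in range(2)] for a in range(2)]
    mean = [nu[0] * mu_g, nu[1] * mu_g]
    return mean, cov, omega, nu


def moments_via_7b7(omega, nu, mu_g, var_g):
    """Same moments, computed literally from the expressions below (7B.7)."""
    oi = inv2(omega)
    oi_nu = matvec(oi, nu)                                   # Omega^{-1} nu
    kappa2 = nu[0] * oi_nu[0] + nu[1] * oi_nu[1] + 1.0 / var_g
    a = 1.0 / kappa2                                         # A = kappa_2^{-1}
    b = [[oi[r][c] - a * oi_nu[r] * oi_nu[c] for c in range(2)] for r in range(2)]
    b_inv = inv2(b)                                          # variance-covariance B^{-1}
    v = matvec(b_inv, oi_nu)
    return [vi * a * mu_g / var_g for vi in v], b_inv


mean, cov, omega, nu = marginal_moments(2.0, 1.0, 0.5, 1.0, 1.5, 3.0)
mean_check, cov_check = moments_via_7b7(omega, nu, 1.5, 3.0)
print(mean, mean_check)
print(cov, cov_check)
```

Both routes should agree up to rounding error.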

Derivation of $q_j$. This quantity is proportional to $n_j^-$ times $p(z_{ij} \mid \theta_j, \beta, \Sigma)$, $j = 1, \ldots, n^-$, where $n^-$ is the number of different $\theta$'s when $\theta_{ij}$ is removed.

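Remixing $\theta$. Using the expression for the posterior distribution of $\theta$, it follows that the derivation of the remixing distribution for $\theta$ is more or less similar to the derivation of $h(\theta_{ij})$ above. We have

$$\sum_{i,j \in J_l} \begin{pmatrix} \varepsilon \\ x - \theta \end{pmatrix}' \begin{bmatrix} \sigma^{(11)} & \sigma^{(12)} \\ \sigma^{(12)} & \sigma^{(22)} \end{bmatrix} \begin{pmatrix} \varepsilon \\ x - \theta \end{pmatrix} \propto \sum_{i,j \in J_l} \left\{ -2 \sigma^{(12)} \theta \varepsilon + \sigma^{(22)} \theta^2 - 2 \sigma^{(22)} x \theta \right\}, \tag{7B.9}$$

and by adding the kernel of $g_0$ and the subscripts $ij$ we obtain

$$\left( n_l \sigma^{(22)} + \frac{1}{\sigma_g^2} \right) \theta_l^2 - 2 \left( \sigma^{(22)} \sum_{i,j \in J_l} x_{ij} + \sigma^{(12)} \sum_{i,j \in J_l} \varepsilon_{ij} + \frac{\mu_g}{\sigma_g^2} \right) \theta_l. \tag{7B.10}$$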

It follows that the (full conditional) posterior remixing distribution for $\theta_l$ is a normal distribution with mean $C^{-1} \left( \sigma^{(22)} \sum_{i,j \in J_l} x_{ij} + \sigma^{(12)} \sum_{i,j \in J_l} \varepsilon_{ij} + \mu_g/\sigma_g^2 \right)$ and variance $C^{-1}$, where $C = n_l \sigma^{(22)} + 1/\sigma_g^2$ and $J_l$ is defined in subsection 7.2.2. When $\sigma_g^2 \to \infty$ and there is no endogeneity ($\sigma^{(12)} = 0$), then the mean of the full conditional distribution for $\theta_l$ is equal to the sample mean of $x_{ij}$ of the observations in cluster $l$. The variance is in this case equal to $\sigma_\nu^2 / n_l$.

$^7$ This result looks very much like the result of a simpler case: $\int f(y \mid \mu, \sigma^2)\, g(\mu \mid \mu_0, \sigma_0^2)\, d\mu$, where $y$ has a normal distribution with mean $\mu$ and variance $\sigma^2$ and $\mu$ has a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$, of which we know that this integral yields $y \mid \sigma^2, \mu_0, \sigma_0^2 \sim N(\mu_0, \sigma^2 + \sigma_0^2)$.
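The remixing step for a single cluster can be sketched in a few lines. The code below computes the mean and variance of the normal full conditional of $\theta_l$ from (7B.10) and checks the limiting case mentioned in the text; the function names and example values are ours, and the helper `precision_elements` assumes that $\Sigma$ is the $2 \times 2$ variance-covariance matrix of $(\varepsilon, \nu)$. With a single observation in the cluster the same function gives the moments of $h(\theta_{ij})$.

```python
def precision_elements(var_eps, cov_ev, var_nu):
    """Elements sigma^(12) and sigma^(22) of the inverse of Sigma."""
    det = var_eps * var_nu - cov_ev ** 2
    return -cov_ev / det, var_eps / det


def remix_conditional(xs, eps, s12, s22, mu_g, var_g):
    """Mean and variance of the full conditional of theta_l for one cluster J_l."""
    n_l = len(xs)
    c = n_l * s22 + 1.0 / var_g
    mean = (s22 * sum(xs) + s12 * sum(eps) + mu_g / var_g) / c
    return mean, 1.0 / c


# Limiting case: very diffuse G0 and no endogeneity (illustrative numbers).
s12, s22 = precision_elements(var_eps=1.0, cov_ev=0.0, var_nu=2.0)
mean, var = remix_conditional(xs=[1.0, 2.0, 3.0, 6.0], eps=[0.3, -0.1, 0.2, 0.0],
                              s12=s12, s22=s22, mu_g=0.0, var_g=1e12)
print(mean, var)  # close to the cluster mean of x and to var_nu / n_l
```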

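Full conditional for $\mu_g$. From (7.15) we obtain

$$\sum_l \frac{(\theta_l - \mu_g)^2}{\sigma_g^2} + \frac{(\mu_g - \mu_0)^2}{\sigma_0^2} \propto \mu_g^2 \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_g^2} \right) - 2 \left( \frac{\sum_l \theta_l}{\sigma_g^2} + \frac{\mu_0}{\sigma_0^2} \right) \mu_g, \tag{7B.11}$$

from which it follows that the full conditional for $\mu_g$ is a normal distribution with mean $C^{-1} \left( \sum_l \theta_l/\sigma_g^2 + \mu_0/\sigma_0^2 \right)$ and variance $C^{-1}$, with $C = (1/\sigma_0^2 + n/\sigma_g^2)$. Note: if $\sigma_0^2 \to \infty$ then the mean of the full conditional distribution is computed as $\sum_l \theta_l / n$ and its variance as $\sigma_g^2 / n$.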

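Full conditional for $\sigma_g^2$. The derivation is similar to the derivation of the full conditional distribution for $\mu_g$. We use (7.16), hence

$$\frac{1}{\left( \sqrt{\sigma_g^2} \right)^n} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{n} \frac{(\theta_j - \mu_g)^2}{\sigma_g^2} \right\} \frac{1}{(\sigma_g^2)^{(c+1)}} \exp\left( -\frac{d}{\sigma_g^2} \right) = \frac{1}{(\sigma_g^2)^{(c + n/2 + 1)}} \exp\left( -\frac{\frac{1}{2} \sum_j (\theta_j - \mu_g)^2 + d}{\sigma_g^2} \right), \tag{7B.12}$$

from which it follows that the full conditional distribution for $\sigma_g^2$ is an inverted gamma distribution with parameters $c + n/2$ and $\frac{1}{2} \sum_j (\theta_j - \mu_g)^2 + d$.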
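The two updates for the parameters of $G_0$ can be combined into one Gibbs step. The following is a minimal sketch with the Python standard library, taking the $n$ current distinct values $\theta_l$ as given; the function names, the hyperparameter values and the example data are illustrative assumptions, not the settings used in the simulation studies.

```python
import random


def mu_g_conditional(thetas, var_g, mu0, var0):
    """Mean and variance of the normal full conditional of mu_g, see (7B.11)."""
    n = len(thetas)
    c = 1.0 / var0 + n / var_g
    return (sum(thetas) / var_g + mu0 / var0) / c, 1.0 / c


def var_g_conditional(thetas, mu_g, c_prior, d_prior):
    """Parameters of the inverted gamma full conditional of sigma_g^2, see (7B.12)."""
    shape = c_prior + len(thetas) / 2.0
    scale = 0.5 * sum((t - mu_g) ** 2 for t in thetas) + d_prior
    return shape, scale


def draw_inverse_gamma(shape, scale, rng):
    """If X ~ Gamma(shape, rate=scale), then 1/X is inverted gamma(shape, scale).

    random.gammavariate expects a scale (not a rate), hence the 1/scale.
    """
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(42)
thetas = [1.0, 2.0, 3.0]          # current distinct values theta_l (illustrative)
m, v = mu_g_conditional(thetas, var_g=1.0, mu0=0.0, var0=1e12)
mu_g = rng.gauss(m, v ** 0.5)     # draw mu_g given sigma_g^2
shape, scale = var_g_conditional(thetas, mu_g=2.0, c_prior=1.0, d_prior=1.0)
var_g = draw_inverse_gamma(shape, scale, rng)  # draw sigma_g^2 given mu_g
print(m, v, shape, scale, mu_g, var_g)
```

With a very diffuse prior on $\mu_g$ the mean and variance reduce to $\sum_l \theta_l / n$ and $\sigma_g^2 / n$, as stated in the note to (7B.11).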

Full conditional for $\alpha$. The full conditional distribution for the dispersion parameter $\alpha$ of the Dirichlet process can be obtained using the results in e.g. West (1992) or Escobar and West (1995).

Appendix 7C Full conditionals: the hierarchical model with general LIV

Full conditional for $\sigma^2$ and $\beta_i$. Assuming conjugate prior densities, the full conditional distributions for $\sigma^2$ and $\beta_i$ can be derived easily using standard results from combining a normal likelihood with an inverted gamma distribution, and with a multivariate normal distribution with mean (7.20) and variance-covariance (7.21), respectively.


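Full conditional distribution for $\Lambda$. The prior distribution for $\Lambda$ is an uninformative $(k + l_1)$-dimensional inverted Wishart distribution with parameters $c$ and $D$. Let $h_i = (\beta_i', z_{1i}')'$ and $\mu_{h_i} = ((\gamma_c + \gamma z_i)', (\theta_i + \alpha z_{2i})')'$. Now

$$p(\Lambda \mid \text{rest}, \text{data}) \propto \left\{ \prod_{i=1}^{n} \det(\Lambda)^{-\frac{1}{2}} \exp\left[ -\frac{1}{2} (h_i - \mu_{h_i})' \Lambda^{-1} (h_i - \mu_{h_i}) \right] \right\} \times \det(\Lambda)^{-\frac{c + k + l_1 + 1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left( D \Lambda^{-1} \right) \right\}, \tag{7C.1}$$

from which it follows that the full conditional for $\Lambda$ is also an inverted Wishart $(k + l_1)$-distribution with parameters $n + c$ and

$$\sum_{i=1}^{n} (h_i - \mu_{h_i})(h_i - \mu_{h_i})' + D.$$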

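Full conditional for the matrix $\gamma_0$. We use the vectorized system in (7.22) and (7.23). The quadratic term, under normality of the errors $(\eta_i', \xi_i')'$, of this system is given by (we drop the subscript $i$ for the moment)

$$\begin{pmatrix} \beta - z_0 \gamma_0 \\ z_{10} \end{pmatrix}' \begin{bmatrix} \Lambda^{(11)} & \Lambda^{(12)} \\ \Lambda^{(21)} & \Lambda^{(22)} \end{bmatrix} \begin{pmatrix} \beta - z_0 \gamma_0 \\ z_{10} \end{pmatrix} \propto \gamma_0' z_0' \Lambda^{(11)} z_0 \gamma_0 - \gamma_0' z_0' (\Lambda^{(12)} z_{10} + \Lambda^{(11)} \beta) - (\beta' \Lambda^{(11)} + z_{10}' \Lambda^{(21)}) z_0 \gamma_0. \tag{7C.2}$$

The prior for $\gamma_0$ is a $(l_1 + l_2 + 1)$-variate normal distribution with mean $m_\gamma$ and variance $V_\gamma$. Adding the subscript $i$ and the prior kernel we obtain

$$\begin{aligned}
&\gamma_0' \left( \sum_i z_{0i}' \Lambda^{(11)} z_{0i} \right) \gamma_0 - \gamma_0' \left( \sum_i z_{0i}' \left( \Lambda^{(12)} z_{10i} + \Lambda^{(11)} \beta_i \right) \right) \\
&\quad - \left( \sum_i \left( \beta_i' \Lambda^{(11)} + z_{10i}' \Lambda^{(21)} \right) z_{0i} \right) \gamma_0 + \gamma_0' V_\gamma^{-1} \gamma_0 - m_\gamma' V_\gamma^{-1} \gamma_0 - \gamma_0' V_\gamma^{-1} m_\gamma \\
&= \gamma_0' \left[ \sum_i z_{0i}' \Lambda^{(11)} z_{0i} + V_\gamma^{-1} \right] \gamma_0 - \gamma_0' \left[ \sum_i z_{0i}' \left( \Lambda^{(12)} z_{10i} + \Lambda^{(11)} \beta_i \right) + V_\gamma^{-1} m_\gamma \right] \\
&\quad - \left[ \sum_i \left( \beta_i' \Lambda^{(11)} + z_{10i}' \Lambda^{(21)} \right) z_{0i} + m_\gamma' V_\gamma^{-1} \right] \gamma_0, \qquad (7C.3)
\end{aligned}$$

from which it follows that the full conditional distribution for $\gamma_0$ is a $(l_1 + l_2 + 1)$-variate normal distribution with mean

$$C^{-1} \left[ \sum_i z_{0i}' \left( \Lambda^{(12)} z_{10i} + \Lambda^{(11)} \beta_i \right) + V_\gamma^{-1} m_\gamma \right], \tag{7C.4}$$

and variance-covariance $C^{-1}$, where $C = \sum_i z_{0i}' \Lambda^{(11)} z_{0i} + V_\gamma^{-1}$.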

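Full conditional for the matrix $\alpha$. Similarly, the quadratic term of the multivariate normal distribution of the system in (7.26) is given by

$$\begin{pmatrix} \beta_0 \\ z_{10} - z_2 \alpha \end{pmatrix}' \begin{bmatrix} \Lambda^{(11)} & \Lambda^{(12)} \\ \Lambda^{(21)} & \Lambda^{(22)} \end{bmatrix} \begin{pmatrix} \beta_0 \\ z_{10} - z_2 \alpha \end{pmatrix} \propto \alpha' z_2' \Lambda^{(22)} z_2 \alpha - \alpha' z_2' \left( \Lambda^{(21)} \beta_0 + \Lambda^{(22)} z_{10} \right) - \left( \beta_0' \Lambda^{(12)} + z_{10}' \Lambda^{(22)} \right) z_2 \alpha. \tag{7C.5}$$

Adding the kernel of the prior for $\alpha$ (a $l_1 l_2$-variate normal distribution) and the subscripts $i$, we find that (similar as for $\lambda$) the full conditional density function for $\alpha$ is a multivariate normal distribution with mean (let $C = \sum_i z_{2i}' \Lambda^{(22)} z_{2i} + V_\alpha^{-1}$),

$$C^{-1} \left[ \sum_i z_{2i}' \left( \Lambda^{(21)} \beta_{0i} + \Lambda^{(22)} z_{10i} \right) + V_\alpha^{-1} m_\alpha \right] \tag{7C.6}$$

and variance $C^{-1}$.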

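Full conditional distribution for $\theta_i$. In the following we derive the probabilities $q_0(i), q_1(i), \ldots, q_{n^-}(i)$ and the density $h(\theta_i \mid \text{data}, \text{rest})$.

Derivation of $q_0$. We proceed in the same way as in the derivation of $q_0$ for the nonparametric Bayes LIV model in section 7.2. The reduced form is equal to (we drop subscript $i$ for the moment)

$$\begin{aligned}
\beta &= \gamma_c + \gamma_1 \theta + (\gamma_1 \alpha + \gamma_2) z_2 + \gamma_1 \xi + \eta, \\
z_1 &= \theta + \alpha z_2 + \xi. \qquad (7C.7)
\end{aligned}$$

Define $u = \gamma_1 \xi + \eta$, so that

$$\begin{pmatrix} u \\ \xi \end{pmatrix} = \begin{pmatrix} \gamma_1 \xi + \eta \\ \xi \end{pmatrix} = B \begin{bmatrix} \eta \\ \xi \end{bmatrix}, \tag{7C.8}$$

where

$$B = \begin{bmatrix} I_k & \gamma_1 \\ 0 & I_{l_1} \end{bmatrix}.$$

Define $\tilde{\Lambda} = \mathrm{var}\{(u', \xi')'\} = B \Lambda B'$. The likelihoods for structural and reduced form are equal, since the Jacobian of transformation is 1. The following is conditional on all the other parameters. Let $\beta_0 = \beta - \gamma_c - (\gamma_1 \alpha + \gamma_2) z_2$ and $z_{10} = z_1 - \alpha z_2$, and define $h = (\beta_0, z_{10})$, a $(k + l_1) \times 1$ vector. Furthermore, let $\nu = (\gamma_1', I_{l_1})'$, a $(k + l_1) \times l_1$-matrix, and $\mu_h = \nu\theta$. $q_0 \propto A(h) = \rho \int p(h \mid \theta, \gamma_0, \alpha, \Lambda, \text{data})\, dG_0(\theta \mid \mu_\theta, V_\theta)$, i.e. $q_0$ is proportional to the density that results when $\theta$ is integrated out. We examine the kernel of the product of these two multivariate normal densities:

$$\begin{aligned}
(h - \mu_h)' \tilde{\Lambda}^{-1} (h - \mu_h) + (\theta - \mu_\theta)' V_\theta^{-1} (\theta - \mu_\theta) &\propto h' \tilde{\Lambda}^{-1} h - h' \tilde{\Lambda}^{-1} \nu\theta - (\nu\theta)' \tilde{\Lambda}^{-1} h + (\nu\theta)' \tilde{\Lambda}^{-1} \nu\theta + \theta' V_\theta^{-1} \theta \qquad (7C.9) \\
&\quad - \mu_\theta' V_\theta^{-1} \theta - \theta' V_\theta^{-1} \mu_\theta.
\end{aligned}$$

Define

$$\begin{aligned}
\kappa_1 &= \nu' \tilde{\Lambda}^{-1} h, \\
\kappa_2 &= \nu' \tilde{\Lambda}^{-1} \nu + V_\theta^{-1}, \qquad (7C.10) \\
\kappa_3 &= \kappa_1 + V_\theta^{-1} \mu_\theta.
\end{aligned}$$

Then it can be shown that (7C.9) is equal to

$$(\theta - \kappa_2^{-1} \kappa_3)' \kappa_2 (\theta - \kappa_2^{-1} \kappa_3) - \kappa_3' \kappa_2^{-1} \kappa_3 + h' \tilde{\Lambda}^{-1} h. \tag{7C.11}$$

The first term is the kernel of a multivariate normal distribution and integrates to 1, and we only need to focus on the latter two terms. In letting $A = \kappa_2^{-1}$ and $B = \tilde{\Lambda}^{-1} - \tilde{\Lambda}^{-1} \nu A \nu' \tilde{\Lambda}^{-1} = (\tilde{\Lambda} + \nu V_\theta \nu')^{-1}$ we find that

$$h' \tilde{\Lambda}^{-1} h - \kappa_3' \kappa_2^{-1} \kappa_3 = h' B h - h' \tilde{\Lambda}^{-1} \nu A V_\theta^{-1} \mu_\theta - \mu_\theta' V_\theta^{-1} A \nu' \tilde{\Lambda}^{-1} h, \tag{7C.12}$$

from which it can be seen that $A(h)$ is proportional to a $(k + l_1)$-variate normal distribution with variance $B^{-1}$ and mean $B^{-1} \tilde{\Lambda}^{-1} \nu A V_\theta^{-1} \mu_\theta = \nu\mu_\theta$ (this last equality is not immediately clear but follows after completing all products).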
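The last equality can be verified in two lines. The following worked step is a sketch in which $\tilde{\Lambda}$ denotes the reduced-form variance-covariance matrix $B \Lambda B'$ of $(u', \xi')'$ and $Q = \nu' \tilde{\Lambda}^{-1} \nu$ is shorthand introduced here for the illustration; it uses only $A = (Q + V_\theta^{-1})^{-1}$ and the variance $\tilde{\Lambda} + \nu V_\theta \nu'$ of the marginal density.

```latex
\begin{aligned}
(\tilde{\Lambda} + \nu V_\theta \nu')\, \tilde{\Lambda}^{-1} \nu
  &= \nu + \nu V_\theta Q
   = \nu V_\theta \left( V_\theta^{-1} + Q \right),
   \qquad Q = \nu' \tilde{\Lambda}^{-1} \nu, \\
(\tilde{\Lambda} + \nu V_\theta \nu')\, \tilde{\Lambda}^{-1} \nu\, A\, V_\theta^{-1} \mu_\theta
  &= \nu V_\theta \left( V_\theta^{-1} + Q \right) \left( Q + V_\theta^{-1} \right)^{-1} V_\theta^{-1} \mu_\theta
   = \nu \mu_\theta .
\end{aligned}
```

The scalar result $\nu\mu_g$ in appendix 7B is the special case $V_\theta = \sigma_g^2$ and $\mu_\theta = \mu_g$.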

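Derivation of $q_j$. $q_j$ is proportional to $n_j^-$ times the density $p(h_i \mid \theta_j, \alpha, \gamma_0, \Lambda, \text{data})$, $j = 1, \ldots, n^-$, which is a $(k + l_1)$-dimensional multivariate normal density with mean $((\gamma_c + \gamma_1 \theta_j + (\gamma_1 \alpha + \gamma_2) z_{2i})', (\theta_j + \alpha z_{2i})')'$ and variance $\tilde{\Lambda} = B \Lambda B'$.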

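Derivation of $h(\theta_i \mid \text{rest}, \text{data})$. $h(\theta_i \mid \text{rest}, \text{data})$ is proportional to

$$f(h_i \mid \theta_i, \alpha, \gamma_0, \Lambda, \text{data})\, g_0(\theta_i \mid \mu_\theta, V_\theta),$$

where $g_0$ is the probability density function of a multivariate normal distribution. We first consider the kernel of $f$ (drop the subscript $i$, let $\beta_0 = \beta - \gamma_c - \gamma z$, and $z_{10} = z_1 - \alpha z_2$). We have

$$\begin{aligned}
\begin{pmatrix} \beta_0 \\ z_{10} - \theta \end{pmatrix}' \begin{bmatrix} \Lambda^{(11)} & \Lambda^{(12)} \\ \Lambda^{(21)} & \Lambda^{(22)} \end{bmatrix} \begin{pmatrix} \beta_0 \\ z_{10} - \theta \end{pmatrix} &\propto \qquad (7C.13) \\
&\propto \theta' \Lambda^{(22)} \theta - \theta' (\Lambda^{(21)} \beta_0 + \Lambda^{(22)} z_{10}) - (\beta_0' \Lambda^{(12)} + z_{10}' \Lambda^{(22)}) \theta,
\end{aligned}$$

and adding the kernel of the distribution $g_0$ gives

$$\begin{aligned}
&\theta' \Lambda^{(22)} \theta - \theta' (\Lambda^{(21)} \beta_0 + \Lambda^{(22)} z_{10}) - (\beta_0' \Lambda^{(12)} + z_{10}' \Lambda^{(22)}) \theta \qquad (7C.14) \\
&\quad + \theta' V_\theta^{-1} \theta - \mu_\theta' V_\theta^{-1} \theta - \theta' V_\theta^{-1} \mu_\theta \\
&\propto \theta' (\Lambda^{(22)} + V_\theta^{-1}) \theta - \theta' (\Lambda^{(21)} \beta_0 + \Lambda^{(22)} z_{10} + V_\theta^{-1} \mu_\theta) \qquad (7C.15) \\
&\quad - (\beta_0' \Lambda^{(12)} + z_{10}' \Lambda^{(22)} + \mu_\theta' V_\theta^{-1}) \theta,
\end{aligned}$$

from which it can be seen that (write $C = \Lambda^{(22)} + V_\theta^{-1}$) $h(\theta_i \mid \text{rest}, \text{data})$ is a normal density with mean

$$C^{-1} \left( \Lambda^{(21)} \beta_{0i} + \Lambda^{(22)} z_{10i} + V_\theta^{-1} \mu_\theta \right)$$

and variance-covariance $C^{-1}$.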

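Full conditional for $\mu_\theta$. We assume a multivariate normal prior density for $\mu_\theta$ with mean $m_\mu$ and variance $V_\mu$. Using the normal kernel of the prior and $G_0$, and the cluster structure of the Dirichlet process, we have

$$\begin{aligned}
&\sum_{l=1}^{n} (\theta_l - \mu_\theta)' V_\theta^{-1} (\theta_l - \mu_\theta) + (\mu_\theta - m_\mu)' V_\mu^{-1} (\mu_\theta - m_\mu) \propto \\
&\quad \mu_\theta' \left( n V_\theta^{-1} + V_\mu^{-1} \right) \mu_\theta - \mu_\theta' \left( V_\theta^{-1} \sum_{l=1}^{n} \theta_l + V_\mu^{-1} m_\mu \right) - \left( \sum_{l=1}^{n} \theta_l' V_\theta^{-1} + m_\mu' V_\mu^{-1} \right) \mu_\theta,
\end{aligned}$$

from which it can be seen that the full conditional distribution for $\mu_\theta$ is a multivariate normal distribution with mean $C^{-1} \left( V_\theta^{-1} \sum_{l=1}^{n} \theta_l + V_\mu^{-1} m_\mu \right)$ and variance-covariance matrix $C^{-1}$, where $C = n V_\theta^{-1} + V_\mu^{-1}$.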

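Full conditional distribution for $V_\theta$. Assuming an inverted Wishart density function with parameters $(\tau_V, \Upsilon_V)$ as prior, and using similar arguments as for $\mu_\theta$, we obtain the following full conditional distribution

$$\begin{aligned}
p(V_\theta \mid \text{data}, \text{rest}) &\propto \prod_{l=1}^{n} |V_\theta|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} (\theta_l - \mu_\theta)' V_\theta^{-1} (\theta_l - \mu_\theta) \right\} \times |V_\theta|^{-\frac{\tau_V + l_1 + 1}{2}} \exp\left[ -\frac{1}{2} \mathrm{tr}\left( \Upsilon_V V_\theta^{-1} \right) \right] \qquad (7C.16) \\
&\propto |V_\theta|^{-\frac{\tau_V + n + l_1 + 1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \left( \sum_{l=1}^{n} (\theta_l - \mu_\theta)(\theta_l - \mu_\theta)' + \Upsilon_V \right) V_\theta^{-1} \right] \right\},
\end{aligned}$$

from which it follows that the full conditional for $V_\theta$ is an inverted Wishart distribution of dimension $l_1$ with parameters $\tau_V + n$ and $\sum_{l=1}^{n} (\theta_l - \mu_\theta)(\theta_l - \mu_\theta)' + \Upsilon_V$.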

Full conditional for $\rho$. As before, we refer to West (1992) or Escobar and West (1995) to construct the full conditional distribution for the ‘dispersion’ parameter $\rho$.


Appendix 7D Iteration plots

Figure 7D.1: Iteration plots $\gamma_{11}$ and $k$ for dataset no. 1 that failed to converge (with table 7.4).

Chapter 8

Discussion

The primary objective of this thesis is to develop a new method, the latent in-

strumental variables (LIV) method, to solve and test for regressor-error depen-

dencies in linear models. The traditional instrumental variables (IV) method

is limited in its use because it requires the availability of instruments of de-

cent quality. In many situations such instruments are not available. Besides,

in applications where instruments are available, the performance of inferential

procedures critically depends on the quality of such variables, and results have

to be interpreted with caution. The proposed LIV method allows for consis-

tent estimation in the presence of regressor-error dependencies and testing for

such dependencies without having observed instrumental variables at hand. In

this chapter we present the conclusions of our findings. Table 8.1 gives an

overview of the main topics and findings of the chapters. Furthermore, we

provide a discussion of the LIV model and suggest steps for further research.

8.1 Summary and conclusions

An important assumption in the linear regression model is independence of

the regressors and the error term. In chapter 2 we presented five situations

in which this assumption is questionable: (i) relevant omitted variables, (ii)

measurement error, (iii) self-selection, (iv) simultaneous equation models, and

(v) lagged dependent variables and autocorrelation. In many empirical ap-

plications one or more of these situations may apply and standard estimation

procedures for the linear regression model are known to give biased and inconsistent results. Important examples are, for instance, estimating the effect of marketing mix variables in sales response models and estimating the return to education on income. Studies in marketing and industrial economics (Berry, Levinsohn and Pakes, 1995, or Besanko, Gupta and Jain, 1998) find that the estimated price response parameter in choice models is biased towards zero when endogeneity of prices is ignored. Managerial decisions based on price response measures that are not corrected for endogeneity are likely to have underestimated the effect of a price change on sales or market share. Similarly, policy makers that rely on the OLS estimates for the return to education (Card, 1999) find themselves over-ambitious because the true effect of education on wages can be expected to be lower. Hence, ignoring endogeneity leads to false conclusions and erroneous decision making.

| Ch. | Subject | Model | Main findings |
|---|---|---|---|
| 2 | Literature review instrumental variables (IV) method | – | • Bias OLS in presence of Xε-dependency • Possible caveats with classical IV estimation |
| 3 | Simple LIV model and tests for Xε-dependency | Linear model, one endogenous x, an unobserved discrete instrument with m > 1 categories | • Sim. studies for wide range of settings, regr. par. estimated consistently, proposed tests powerful detecting Xε-dependency • Results insensitive for misspecification of m • Identification proof |
| 4 | Tests for instrument weakness and endogeneity, and implementation issues | Extension of model Ch. 3, add exogenous regressors and observed IVs | • Sim. studies: proposed tests powerful to detect bad quality IVs • LIV model robust against misspecifying likelihood • Diagnostics to choose m and examine outliers/infl. observations |
| 5 | Estimating the return to education | Application of model Ch. 4 | • Results for three empirical datasets • OLS estimate for schooling biased upward (≈ 7%) • Tests indicate bad quality of available observed IVs |
| 6 | Multilevel models and random-effects (RE) regressor-dependencies | Several random intercept models | • Tests for RE regressor-dependencies are reviewed • Sim. studies: bad performance tests and estimation methods in certain situations • Recommendations are made to target these situations |
| 7 | Hierarchical models and endogeneity | Hierarchical models with Dirichlet prior process for unobserved instrument | • Alleviating discreteness assumption latent instrument • Simulation studies promising • Work in progress |

Table 8.1: Overview thesis-chapters and main findings.

The ‘classical’ instrumental variables (IV) method can be used to estimate

models where regressor-error dependencies may be present. This method as-

sumes that an additional set of instruments is available that can be used to

separate the endogenous regressors into an exogenous part and an endogenous

part. If the instruments are of good quality, then the IV estimates are known to

be consistent. However, the literature review given in chapter 2 points out two

problems with classical IV estimation: (i) instruments need to be available, and

(ii) performance of the IV method critically relies on the quality of the instru-

ments used. Despite (ii), these variables are often chosen on basis of ad-hoc

arguments or convenience, as in many empirical applications instruments are

not readily available. Several studies in econometrics have proposed solutions

to the problem of weak instruments (Stock, Wright and Yogo, 2002, and Hahn

and Hausman, 2003). The results from these studies present a toolbox with

methods and tests to improve on classical IV inference in presence of weak

instruments. Most of these studies, however, do not address instrument endo-

geneity and are conditional on the availability of a set of instrumental variables.

For empirical problems the question how and where to find instruments is still

open. The latent instrumental variables (LIV) method presented in chapter 3


addresses this issue right at the heart. We propose a new method that does not

require the availability of observed instrumental variables. We prove that the

LIV model parameters can be identified through the likelihood and we illus-

trate the method on synthetic data. The simulation studies show that the LIV

model gives consistent results for the regression parameters and the proposed

test to test for regressor-error dependencies has a reasonable power across a

wide variety of settings. These results are obtained without having observed

instrumental variables at hand. In addition, the LIV model gives identical re-

sults to classical IV estimation for a measurement error application where a

laboratory dummy instrumental variable is available. Furthermore, we show

that the LIV results are rather insensitive to misspecification of the true number

of categories of the discrete instrument. These results are important for em-

pirical researchers because our ‘instrument-free’ approach does not require the

necessity of first finding good quality instrumental variables when regressor-

error dependencies are suspected.
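The bias that motivates the LIV approach is easy to reproduce on synthetic data. The sketch below is a simplified illustration, not one of the simulation designs of chapter 3: it assumes a latent instrument with $m = 2$ equally likely categories ($\theta \in \{-1, 1\}$), unit error variances and a covariance of 0.5 between $\varepsilon$ and $\nu$, and all names and values are chosen by us. Under these assumptions the OLS slope converges to $\beta_1 + \sigma_{\varepsilon\nu}/\mathrm{var}(x) = \beta_1 + 0.25$ rather than to $\beta_1$.

```python
import random


def simulate(n, beta0, beta1, cov_ev, rng):
    """Generate (x, y) with x = theta + nu and y = beta0 + beta1 * x + eps."""
    xs, ys = [], []
    for _ in range(n):
        theta = rng.choice([-1.0, 1.0])   # unobserved discrete instrument, m = 2
        eps = rng.gauss(0.0, 1.0)
        # nu has unit variance and covariance cov_ev with eps (source of endogeneity)
        nu = cov_ev * eps + (1.0 - cov_ev ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        x = theta + nu
        xs.append(x)
        ys.append(beta0 + beta1 * x + eps)
    return xs, ys


def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(2004)
xs, ys = simulate(20000, beta0=1.0, beta1=2.0, cov_ev=0.5, rng=rng)
bias = ols_slope(xs, ys) - 2.0
print("OLS bias in the slope:", bias)
```

Setting `cov_ev=0.0` in the same code removes the dependency between regressor and error, and the OLS slope is then close to the true value.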

In chapter 4 we extend the simple LIV model by allowing for additional exoge-

nous regressors and possible available instruments. Furthermore, we discuss

several implementation issues that complete an LIV analysis. The results of the

identification proof for the more general LIV model suggest two procedures to

investigate the validity of instrumental variables: (i) a test for a zero effect

of the instrument on the endogenous variable (i.e. whether the instrument is

‘weak’), (ii) a test for a direct effect of the instrument on the dependent vari-

able (i.e. whether the instrument is exogenous). Our synthetic data results

show that the proposed procedures have a reasonable power to detect ‘bad’

quality instruments. Furthermore, our results indicate that the LIV estimates

for the regression parameters are rather insensitive to misspecification of the

true distribution of the error terms. This can be expected, since the LIV model

belongs to the class of mixture models, that are known to be flexible in adapt-

ing to a broad range of distributions.

The literature review in chapter 5 illustrates the difficulties in estimating the

return to education on income due to potential ability bias and the lack of good


quality instrumental variables. The LIV results for three empirical datasets in-

dicate an upward ability bias of approximately 7%. This number is close to

recent results from twin studies (Card, 1999). On the contrary, the classical

IV results are highly unstable, inconsistent with the traditional ability bias crit-

icism, and suffer from large standard deviations. We investigate the quality

of the available instrumental variables in the three datasets and compare them

with the ‘optimal’ LIV instruments. We find in two of the three applications

that the available instruments are weak and/or endogenous. In all cases the optimal

LIV instruments are found to be much stronger and, hence, the LIV results

are more efficient than the classical IV results. The results that we find are con-

vergent and lend credibility to the usefulness of the LIV method in empirical

settings.

Chapters 6 and 7 consider endogeneity problems in multilevel models. In many

applications data has a hierarchical structure, which introduces additional error

terms and possible endogeneity relations in the model. The model we consider

in chapter 6 has two levels, and endogeneity may arise at the individual-

specific level (level-one) or at the group level (level-two). In this chapter we

review previous literature on estimating random intercept models in presence

of regressor-error dependencies. Traditional methods (fixed-effects estima-

tion, the Hausman-Taylor approach, Mundlak’s approach) to solve for level-

two dependencies are shown to be limited in their use in presence of level-one

dependencies. Our results reveal that even small violations of level-one in-

dependence may lead to fallacious conclusions in applying these traditional

methods. Besides, we provide evidence that the problem of weak instruments

also applies to multilevel applications, in particular to multilevel methods that

solve for level-one dependencies, but also to the Hausman-Taylor approach to

address level-two dependencies. We argue that much work needs to be done

before problems of endogeneity in multilevel models can be adequately ad-

dressed and we present a list of open problems.

In chapter 7 we address two issues. Firstly, we present a solution for two

multilevel models discussed in chapter 6 that may suffer from regressor-error

dependencies: the standard (random-intercept) multilevel model and the random

coefficient regression model with individual-level covariates to explain

part of the heterogeneity-variance. Furthermore, we suggest how our results

may improve on the standard Hausman-Taylor approach. Secondly, we pro-

pose a nonparametric Bayesian method to alleviate the discreteness assump-

tion of the unobserved instrument. The model can be estimated using Markov

Chain Monte Carlo methods. The advantage of a Bayesian approach is that it

provides a general framework that can be extended easily to incorporate more

general models (e.g. choice models or models with several endogenous vari-

ables). Besides, a Bayesian analysis facilitates exact small sample inference.

By assuming that the unobserved instrument has a Dirichlet process prior, the

unobserved distribution of the instrument can adapt to any distribution. As op-

posed to the LIV model, it is not necessary to specify the number of support

points of the mixture distribution since the model estimates the distribution

from the available data. We present several simulation studies and show that

the results are promising, yet several issues are still open for future research.

8.2 Limitations and future research

There are several issues concerning the LIV method that we did not address in

this thesis. We will discuss the following issues in more detail below:

• Methodological (technical) issues

– Large sample results

– Identification in more general settings

– Testing for a discrete instrument

– Relation with classical IV estimation

• Substantive issues

– Extensions to more than one endogenous variable

– Choice models and more general GLM

– Self-selection problems

– Comparison to Lewbel’s approach and heterogeneous LIV

– Straightforward testing for endogeneity

– Generalizing the unobserved instrument

These issues mostly apply to the standard LIV model introduced in chapters 3

and 4. A discussion on and steps for further research for the Bayesian approach

in chapter 7 was given in section 7.5.

8.2.1 Methodological (technical) issues

Large sample results. Two technical issues that we did not address in this thesis are the consistency of the LIV estimator and its asymptotic distribution, which approximates the finite sample distribution. The simulation studies presented in this thesis indicate that the LIV estimates are consistent, but we have not yet proven this.

The LIV estimates are maximum-likelihood (ML) estimates and consistency

can be examined using basic results from maximum likelihood theory (e.g.

Ferguson, 1996). Redner and Walker (1984), and Titterington, Smith and

Makov (1985) summarize large sample results for ML estimation in mixture

models, the class to which the LIV model belongs. They find that asymptotic

theory for mixtures is not always straightforward because of possible singu-

larities in the likelihood surface. Besides, the likelihood may be unbounded.

However, Titterington, Smith and Makov (1985) state that the regularity con-

ditions for consistency and asymptotic normality are satisfied in many well

known and commonly occurring cases.

It may be more interesting, however, to investigate whether the regression parameter β, which is not a mixing parameter in the LIV model, can be estimated consistently by maximum likelihood when the fitted model has fewer components than the actual model. In other words, can consistency be proven for m = 2, regardless of whether the true value for m is larger than two? The simulation studies in section 3.5 suggest a positive answer to this question.

Besides, if one has a set of strong instruments at hand, then adding a few ad-

ditional instruments does not change the asymptotic results in a classical IV

framework.


Two recent articles (Cheng and Liu, 2001, and Zhu and Zhang, 2004) estab-

lish asymptotic theory for comparing nested mixture models in which case the

distribution is represented by a subset in the parameter space. Their results

suggest that under certain regularity conditions the ML estimator converges

to an arbitrary point in this subset, and quantities of interest such as means

or variances may be estimated consistently even though the distribution is not

uniquely represented. These results are supported by our simulation results in

section 3.5. Andrews (1999) considers asymptotic theory for extremum estimators (e.g. ML) when a parameter is on the boundary. His results are interesting because he establishes conditions under which the asymptotic distribution of a subvector of the parameter is not affected by the true values of another subvector being on a boundary of the parameter space. For instance, he shows that for a random coefficient model, the quasi-ML estimators for the regression coefficients are asymptotically normal whether or not some of the random coefficient variances are zero. His theory appears to be very general and may be

applicable to the LIV model. The conditions he establishes, however, may be

difficult to verify.

The LIV model in reduced form is quite similar to measurement error models,

although standard measurement error models assume zero covariance between

the errors. As mentioned in section 2.4, the grouping results of Wald (1940)

and Madansky (1959) are similar in thought to the grouping idea of the LIV

model. Wald and Madansky assume that a grouping of the data into two groups

exists, or can be constructed. Once a ‘valid’ grouping is available, a line can

be drawn, because it is determined by two points. This line is estimated consistently under certain conditions (e.g. Neyman and Scott, 1951). Madansky also considers another grouping method from an ANOVA point of view, where $k_i$ observations for $X_i$, i = 1, ..., k, are available. He shows that the within mean square error and the between mean square error can be used to obtain a consistent estimate for β when the grouping is independent of the model error (see also his discussion of the Housner-Brennan estimate, pp. 189-191); hence, consistency is independent of k. The LIV model does not assume prior existence of such a grouping and uses mixture methodologies to classify the sample into groups. The results that we found using synthetic data also suggest that consistency of the LIV estimate for β does not depend on the number of categories chosen for the discrete instrument.
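The grouping idea of Wald (1940) can be illustrated with a small simulation. The following is only a sketch under assumed values (two equally likely groups, group means 0 and 3, unit error variances, error covariance 0.5, slope 2); none of these numbers come from the thesis, and the grouping z is treated as observed, which is exactly what the LIV model does not require.

```python
import numpy as np

def simulate(n, beta0=1.0, beta1=2.0, pi=(0.0, 3.0), cov_eps_nu=0.5, seed=0):
    """Draw (y, x, z) from a linear model with a binary grouping variable z.

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                  # group label, independent of the errors
    cov = np.array([[1.0, cov_eps_nu], [cov_eps_nu, 1.0]])
    eps, nu = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    x = np.asarray(pi)[z] + nu                      # regressor: group mean plus noise nu
    y = beta0 + beta1 * x + eps                     # eps is correlated with nu, so x is endogenous
    return y, x, z

def ols_slope(y, x):
    """Ordinary least squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def wald_slope(y, x, z):
    """Slope of the line through the two group centroids."""
    dy = y[z == 1].mean() - y[z == 0].mean()
    dx = x[z == 1].mean() - x[z == 0].mean()
    return dy / dx

y, x, z = simulate(20_000)
print("OLS slope :", ols_slope(y, x))
print("Wald slope:", wald_slope(y, x, z))
```

Because the grouping is independent of the errors, the line through the two group centroids recovers the slope, whereas OLS absorbs the covariance between x and ε.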

Another model closely related to the LIV model is a measurement error model

considered by Kiefer and Wolfowitz (1956), who prove the consistency of the

ML estimator in the presence of infinitely many incidental parameters. The

model considered is

\[
\begin{aligned}
X_{i1} &= \alpha_i + u_i, \\
X_{i2} &= \theta_1^0 + \theta_2^0 \alpha_i + \nu_i,
\end{aligned}
\tag{8.1}
\]

where $(\nu_i, u_i)$ have a bivariate normal distribution with mean zero and a covariance matrix consisting of the elements $\{d_{11}, d_{12}, d_{22}\}$. They find that the maximum-likelihood estimates for $(\theta_1, \theta_2)$ are strongly consistent, given that $d_{11}$, $d_{22}$, and $d_{11}d_{22} - d_{12}^2$ are bounded away from zero. Reiersøl (1950) proves for normally distributed errors that $\theta_1$ and $\theta_2$ are nonidentifiable if and only if $X_1$, $X_2$ are constants or normally distributed (cf. Madansky, 1959, p.

180). Something similar was observed in chapter 7 using the nonparametric

Bayesian LIV model. Furthermore, the mixture approach for measurement

error models advocated by Carroll, Roeder and Wasserman (1999), and their

discussion, may be applicable to our framework as well.

Although we have not proven consistency of the maximum-likelihood esti-

mates for the LIV model introduced in chapters 3 and 4, the simulation studies

presented in this thesis suggest that they are. Furthermore, the articles cited

above consider similar models, and provide intuition for the simulation results

found, and a possible starting point to formally prove consistency and asymptotic normality of $\beta_n^{LIV}$.

Identification in more general settings. Identification of all LIV model parameters was proven in chapters 3 and 4 assuming a bivariate normal distribution for the error terms (ε, ν). Although a mixture of normals can adapt to a broad class of distributions (Kim, Menzefricke and Feinberg, 2004), it is desirable to generalize the LIV model to allow for non-normally distributed error terms. In some applications, for instance, the normality assumption may be too restrictive and a more robust or general specification (e.g. t, gamma, logistic, or Gumbel distributions) may be desirable. We found in subsection 4.5.2 that the LIV model appears to be fairly robust against misspecified errors, although severe misspecification of the error distribution of the regression equation may present a problem. In such a case, a more robust distribution for the errors may circumvent this.

The existence of a discrete instrument. Identification of the LIV model requires the existence of a discrete instrument with at least two categories. Subsequently, a likelihood framework can be used to estimate the regression parameters. Two important questions that were not considered in this thesis are: (i) is it possible to test for the existence of a discrete instrument, and (ii) what happens if the category means π in (3.1) for k = 2 are not very distinct, i.e. $\|\pi_2 - \pi_1\|$ is small?

Recent studies (Cheng and Liu, 2001, and Zhu and Zhang, 2004) have developed tests of a simpler mixture model against a full mixture model, i.e. tests of the form $H_0: \lambda(1-\lambda)\|\pi_1 - \pi_2\| = 0$ versus $H_1: \lambda(1-\lambda)\|\pi_1 - \pi_2\| \neq 0$. These tests may be applicable to the LIV model to investigate the assumption of the existence of a discrete instrument. However, given that mixture models are often used to approximate continuous distributions, we feel that the discreteness assumption, which does not imply that x is discrete, is not limiting in most empirical applications. Besides, many classical IV studies rely on discrete instruments.

The second question is an important issue in the mixture model literature and

is closely related to the information matrix and the Mahalanobis distance be-

tween mixture components. It is known that if the mixture components do not

separate well, large sample sizes may be required to obtain precise maximum-likelihood estimates (e.g. Redner and Walker, 1984, or Titterington, Smith, and Makov, 1985). Something similar was observed in subsection 3.5.3, where we found for synthetic data that using m > 2 in the LIV model increases the occurrence of degenerate solutions. This is not much of an issue in most applications since the latent category instrument is a ‘nuisance’ parameter rather than of theoretical interest. However, estimation may be problematic if the true distribution of the unobserved instrument consists of only two groups that are not well separated. In this case the model is weakly identified and this issue is related to (i). The distribution of the latent instrument is then very close to a normal distribution, or a constant. Deriving the actual information matrix may give some insight into these issues. Furthermore, increasing the sample size and EM-algorithm estimation may improve estimation results in such situations.

Relation with classical IV estimation. The basic LIV model does not as-

sume the existence of observed instrumental variables, and identification is

established through the likelihood. The classical IV approach assumes the ex-

istence of good quality instrumental variables and the model parameters can

be identified via the first two moments or via the likelihood. We argued and showed in both synthetic and real data examples that the LIV model results are rather insensitive to the choice of m, to different shapes of the distribution of x, and to a modest misspecification of the likelihood. Nevertheless, researchers who have been using the traditional instrumental variables approach (i.e. identification via theory and observed data) may be skeptical about adopting the latent instrumental variables approach. In this research we have not explicitly pursued the relation with classical instrumental variables, because the

main goal is to formulate a new method that does not require such instruments

(an exception is the study in section 3.6). However, in order to introduce the

LIV method to more traditional IV users, we feel that future research should

emphasize the relation between LIV and classical IV. This can be done in one

or more of the following four ways.

Firstly, as was shown before, the LIV estimates can be used to obtain an a posteriori clustering of the data using Bayes’ rule, which gives the ‘optimal’ LIV instrument Z, an n × m matrix. This instrument matrix can be used to compute a 2SLS estimate for the regression parameters. In a simulation study the following questions can be investigated: (1) are the 2SLS estimates for β using Z similar to the LIV estimates, (2) is the optimal LIV instrument Z uncorrelated with ε, (3) what is the R² of a regression of x on Z compared to the R² of a regression of x on the true (discrete) Z, and (4) what is the relation between x and $\hat{x} = Z\pi^{LIV}$. For the simulation results presented in section 3.5 we find that the 2SLS estimate, based on LIV instruments, yields approximately similar results (means and standard deviations) to the maximum likelihood (LIV) estimate of β (in most cases the values are exactly identical, but for the unimodal case with eight instruments there are small differences). We also examined the correlation between $\hat{x}$ and ε, and the correlation between $\hat{x}$ and x. We found, on average, that the correlation between $\hat{x}$ and the true errors is approximately zero, while the correlation between $\hat{x}$ and x was found to be much larger than zero. Although these preliminary findings suggest that the LIV predicted instruments are possibly ‘optimal’, because they are not correlated with ε and are of considerable strength, future research is needed to give more conclusive results.
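The mechanics of turning posterior classifications into an instrument matrix and feeding it to 2SLS can be sketched as follows. This is not the LIV estimator itself: as a stand-in, a two-component normal mixture with common variance is fitted to x alone by EM, whereas the thesis obtains the posterior probabilities from the full LIV likelihood. Because the stand-in posterior is a function of x only, some dependence between Z and ε remains when the groups overlap. All names (`em_two_groups`, `two_sls`, `simulate`) and all parameter values are hypothetical.

```python
import numpy as np

def em_two_groups(x, n_iter=200):
    """EM for a two-component normal mixture with a common variance (fitted to x only)."""
    pi = np.quantile(x, [0.25, 0.75])            # crude starting values for the group means
    lam = np.array([0.5, 0.5])                   # starting group sizes
    s2 = np.var(x)
    for _ in range(n_iter):
        # E-step: posterior membership probabilities via Bayes' rule
        logw = np.log(lam) - 0.5 * (x[:, None] - pi) ** 2 / s2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: update group sizes, group means and the common variance
        lam = w.mean(axis=0)
        pi = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        s2 = (w * (x[:, None] - pi) ** 2).sum() / len(x)
    return w, pi, lam, s2

def two_sls(y, x, Z):
    """2SLS slope of y on x, using the columns of Z as instruments."""
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]          # first-stage fitted values
    X_hat = np.column_stack([np.ones_like(x_hat), x_hat])
    return np.linalg.lstsq(X_hat, y, rcond=None)[0][1]        # second-stage slope

def simulate(n, seed=1):
    """Assumed design: two groups with means 0 and 4, slope 2, error covariance 0.5."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)
    eps, nu = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n).T
    x = np.array([0.0, 4.0])[z] + nu
    y = 1.0 + 2.0 * x + eps
    return y, x, eps

y, x, eps = simulate(20_000)
Z, pi_hat, lam_hat, s2_hat = em_two_groups(x)    # Z is the n x 2 matrix of posterior probabilities
print("OLS slope :", np.polyfit(x, y, 1)[0])
print("2SLS slope:", two_sls(y, x, Z))
print("corr(Z[:, 1], eps):", np.corrcoef(Z[:, 1], eps)[0, 1])
```

In this assumed design the 2SLS slope lies closer to the true value than the OLS slope. The printed correlation between the predicted instrument and the true errors gives a direct check on question (2) for the stand-in posterior.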

Secondly, in empirical applications the LIV instruments Z can be profiled using (additional) observed data. The results in section 3.6, for instance, illustrate

that the predicted LIV instrument is identical to the laboratory temperature ef-

fect. We have not yet found interpretations for the predicted instruments for

‘schooling’ in chapter 5. However, if an instrument can be given a sensible

interpretation, it may inspire confidence in the results found, or even point out

new theories that can be used in subsequent studies to obtain instrumental vari-

ables.

Thirdly, another empirical validation of the LIV model for schooling applica-

tions (chapter 5) can be obtained using twin or sibling data. In twin or sibling

studies the schooling parameter is estimated using a fixed-effects estimator be-

cause unobserved ‘ability’ cancels out within families (see also section 5.3.3

and chapter 6). Ideally, both methods should give similar results. In addition,


the predictive validity of the estimated LIV model can be examined using the transformed ‘within-family’ data, since differences in years of schooling of twins or siblings are exogenous, because the effect of omitted ability is eliminated. However, to assess predictive validity, the schooling variable has to be measured without error, which is questionable; see recent results on twin studies (e.g. Bonjour et al., 2003, Hertz, 2003, Isacsson, 2004).

Finally, it is interesting to investigate in what situations the LIV model can

be used to improve efficiency in standard IV models if ‘valid’ observed in-

struments are available. Since IV estimates often suffer from large standard

deviations, addition of an unobserved discrete instrument may improve on ef-

ficiency. Furthermore, the more traditional IV users are now still identifying

the model through a priori formed theories or reasoning. The simulation study

in section 4.4 indirectly addresses this issue and we found that combining ob-

served instruments with a latent discrete instrument may be beneficial.

8.2.2 Substantive issues

Extensions to more than one endogenous variable. Although one right-hand side endogenous variable is the most commonly occurring situation (cf. Hahn and Hausman, 2003), applications may suffer from two or more endogenous

regressors. For instance, marketing managers not only set prices based on

unobserved information, but also other marketing mix variables like advertis-

ing or shelf-space location (Chintagunta, Kadiyali, and Vilcassim, 2003, Man-

chanda, Rossi, and Chintagunta, 2004). Furthermore, in estimating the return

to schooling it is common to include measures for experience and squared

experience that are constructed from ‘years of schooling’, and hence also en-

dogenous (Verbeek, 2000).

The nonparametric Bayes approach in chapter 7 is applicable to problems with

more than one endogenous variable. The standard LIV model in (3.1) can be extended to (say) l endogenous variables by taking for $x_i$ an (l × 1)-vector and extending the variance-covariance matrix Σ to an (l + 1) × (l + 1) matrix. Hence, the more general LIV model is a mixture of (l + 1)-dimensional multivariate normal distributions. The identification proof has to be modified and we suspect that a discrete instrument with at least two categories has to exist for each endogenous variable. Consequently, the resulting mixture LIV model has m ≥ 2l categories. Simulation studies and theoretical results need to be

obtained prior to applying the outlined approach to empirical applications.

Choice models and generalized linear models (GLM). The models considered in this thesis are simple linear models. However, for many applications the linearity assumption is too restrictive whereas endogeneity may be present.

For instance, most studies cited in subsection 2.1.4 and section 2.3 (methods

that model demand, cost, and competition) are choice models. An interesting

and important extension of the simple LIV model is a generalization to this

class of models.

Observed choices can be modeled using a random utility framework. It is assumed that the alternative with the highest utility is chosen. Let $y_j$ denote the (unobserved) utility derived from choosing alternative j = 1, ..., m, and let c be the observed choice. Then c = j if $y_j = \max_{l=1,...,m} y_l$. The model for the unobserved utility is just a standard linear model. If the errors are assumed to

have a normal distribution and one of the explanatory variables is endogenous,

then model (3.1) can be augmented with the maximum utility framework to

obtain a ‘LIV-probit’ model. Furthermore, the LIV approach can be applied

to the type of problems and the linearization of choice models introduced by

Berry (1994) and Berry, Levinsohn, and Pakes (1995), that has recently gener-

ated a stream of subsequent research.
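The random utility rule described above, combined with an endogenous attribute, can be simulated in a few lines. The set-up below is a hypothetical illustration of the data-generating process behind a ‘LIV-probit’ type model, not an estimator: the attribute x (think of price) has a mean that depends on a latent two-group instrument, and its noise is correlated with the normal utility error through an assumed parameter `rho`. Alternatives are indexed 0, ..., m−1 in the code rather than 1, ..., m.

```python
import numpy as np

def simulate_choices(n=1_000, m=3, beta=-1.0, rho=0.5, seed=0):
    """Simulate probit-type choices with an endogenous attribute (illustrative values)."""
    rng = np.random.default_rng(seed)
    pi = np.array([0.0, 2.0])                       # assumed group means of the latent instrument
    z = rng.integers(0, 2, size=(n, m))             # latent group per observation and alternative
    nu = rng.normal(size=(n, m))
    x = pi[z] + nu                                  # endogenous attribute
    # Normal utility error with unit variance and correlation rho with nu
    eps = rho * nu + np.sqrt(1.0 - rho**2) * rng.normal(size=(n, m))
    utility = beta * x + eps                        # latent utility y_j: a standard linear model
    choice = utility.argmax(axis=1)                 # c = j when y_j is the largest utility
    return x, utility, choice

x, utility, choice = simulate_choices()
print("choice shares:", np.bincount(choice, minlength=3) / len(choice))
```

Only `x` and `choice` would be observed in practice; the utilities and the group labels remain latent.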

However, extending endogeneity issues to general nonlinear models is not

straightforward. Dube and Chintagunta (2003) argue that “Characterizing [en-

dogeneity] bias is not straightforward in the context of non-linear models [...]

it is unclear how strong the correlation between prices and [the errors] must

be to generate statistical bias. [...] It is also unclear how the endogeneity bias

will manifest itself in the estimates”. Cramer (2004) considers omitted vari-

ables bias in discrete models. He observes that “Even if the omitted variable


is orthogonal to the other regressors, its effect shows up in the variance of

the disturbance. Since the slope coefficients of discrete models are scaled by

the standard deviation, [...] the remaining coefficients are depressed towards

zero”. Furthermore, he finds that the omitted variables bias may be larger be-

cause of a misspecification of the disturbances. Mullahy (1997) considers de-

pendence of covariates and unobservables in count data models. He observes

that the standard assumption of separable additivity of the unobservables from

the parametric structural model does generally not hold. Hence, even certain

nonlinear IV estimators (e.g. Bowden and Turkington, 1984) may not be con-

sistent. He proposes an alternative approach based on transforming the basic

model that may be more appropriate to use. Foster (1997) also notes that tra-

ditional instrumental variables estimation does not simply extend to non-linear

models. He proposes a non-linear two stage least squares estimator for a logit

model, but the comments made by Mullahy (1997) may still apply. See also

Blundell and Powell (2001a,b) for a more detailed discussion. From this discussion it becomes clear that extending the LIV approach to general nonlinear models is of great importance, yet nontrivial because researchers do not agree on how to model endogeneity in such models.
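The attenuation effect noted by Cramer (2004) can be made concrete for a probit model. The following is a sketch under assumptions that are not in the text above: a latent-variable probit with unit error variance, and an omitted variable $z_i$ that is normally distributed and independent of both $x_i$ and $u_i$. The symbols γ and $\sigma_z$ are introduced here for illustration only.

```latex
\begin{align*}
y_i^* &= x_i'\beta + \gamma z_i + u_i, \qquad
  u_i \sim N(0,1), \quad z_i \sim N(0,\sigma_z^2), \quad y_i = 1 \text{ if } y_i^* > 0, \\
P(y_i = 1 \mid x_i) &= P\left(\gamma z_i + u_i > -x_i'\beta\right)
  = \Phi\!\left(\frac{x_i'\beta}{\sqrt{1 + \gamma^2\sigma_z^2}}\right).
\end{align*}
```

Under these assumptions a probit fitted on $x_i$ alone recovers $\beta/\sqrt{1+\gamma^2\sigma_z^2}$ rather than β, so the coefficients shrink towards zero even though the omitted variable is orthogonal to the regressors.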

Self-selection problems. As discussed before, self-selection issues arise when an individual tends to select itself into a certain state (treated vs. non-treated, internet user vs. non-user) in a non-random way. A simple self-selection model is given by $y_i = \beta_0 + \beta_1 d_i + \varepsilon_i$, where $d_i$ is zero or one, depending on the ‘state’ of individual i. This model is similar to (3.1) with a single discrete endogenous regressor. The LIV model, however, assumes that the endogenous regressor is a continuous variable. But, using similar arguments as above for choice models, the LIV approach can possibly be extended by incorporating a probit model for $x_i$ to handle self-selection problems.

Comparison to Lewbel’s approach and heterogenous LIV. As mentioned in section 2.3, Lewbel’s approach (Lewbel, 1997, Erickson and Whited, 2002) is similar in spirit to the LIV approach in the sense that it also does not require the availability of observed instrumental variables. Instead, Lewbel proposes to construct instruments from the available data based on higher-order moment restrictions. Subsequently, 2SLS or GMM estimates can be computed to estimate the regression parameters. The identifying conditions for Lewbel’s approach differ from the conditions for the LIV model (see also appendix 6C). Hence, it is interesting to compare the performance of the LIV and Lewbel estimates for β under the different identifying conditions using synthetic data.

For instance, identification for the Lewbel estimator requires that the distribution of the unobserved instrument is non-symmetric. The LIV model, however, is not restricted to non-symmetric distributions, as was shown in (e.g.) section 3.5. Secondly, as opposed to the LIV model, Lewbel’s approach requires $\beta_1$ in (3.1) to be nonzero, and situations where it is close to zero are weakly identified. On the other hand, the LIV model assumes the existence of a discrete unobserved instrument. If, for instance, the true distribution of the instrument is a skewed gamma distribution, Lewbel’s method can be used, whereas the LIV model is ‘technically’ not identified, because all observations belong to the same group (m = 1). However, as stated before, mixture models are generally used to approximate continuous distributions, a property that also extends to the LIV model. This was illustrated for the nonparametric Bayes model in chapter 7, and the standard LIV model was estimated for a situation where the latent instrument had a skewed gamma distribution. The LIV model in chapters 3 and 4 assumes that the mixture components for x have equal variances. This assumption may be too restrictive to approximate general continuous bivariate densities of (y, x). Hence, an interesting development is to extend the LIV model to the class of heterogenous mixture models where the variance $\sigma_\nu^2$ in (3.2) can be different for each group j = 1, ..., m. This model may be very robust in adapting to any distribution.

We emphasize that Lewbel’s method for measurement error models has not

yet been extended to models with general regressor-error dependencies. The

results presented in appendix 6C for a general multilevel model have, to the

best of our knowledge, not appeared in the literature before.


Straightforward testing for endogeneity without having instruments. We proposed two tests for endogeneity in standard linear models without having observed instruments at hand: a Hausman test (section 3.4) and a Wald test (section 4.6). Both were shown to have reasonable power to detect an endogenous regressor. Another asymptotically equivalent test is a Lagrange-multiplier test (e.g. Greene, 2000). The potential advantage of this test is that it operates under the restricted model, i.e. when $\sigma_{\varepsilon\nu} = 0$ (x is exogenous). As such, the model parameters β and $\sigma_\varepsilon^2$ can be estimated by OLS in a standard statistical package, and estimates for the group means π, the group sizes λ, and $\sigma_\nu^2$ can be obtained using standard software for mixture models. Subsequently, the estimated values can be substituted in the gradient vector (evaluated at the restricted parameter vector), which should give a vector of zeros, at least within the range of sample variability, if the restrictions are valid.

The only complicated step is to evaluate the score vector, which is based on the

first-order derivatives in appendix 3B. However, once these derivatives are

programmed, this test is potentially easy to apply, because it does not require

the availability of observed instrumental variables, and may serve as a standard

diagnostic tool to investigate endogeneity in linear regression estimation.
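The restricted-model recipe above can be sketched for one element of the score vector. The code assumes a model of the form y = β0 + β1 x + ε, x = π_z + ν with a latent two-group z and bivariate normal (ε, ν); this form is an assumption here, since (3.1) and appendix 3B are not reproduced in this section. Under that assumption, the derivative of the log-likelihood with respect to $\sigma_{\varepsilon\nu}$, evaluated at $\sigma_{\varepsilon\nu} = 0$, is $e_i (x_i - \sum_j w_{ij}\pi_j)/(\sigma_\varepsilon^2\sigma_\nu^2)$ for observation i, where $e_i$ is the regression residual and $w_{ij}$ the posterior group probability. A complete Lagrange-multiplier statistic would also need the information matrix, which is not attempted. The function names and simulation values are hypothetical.

```python
import numpy as np

def em_two_groups(x, n_iter=200):
    """EM for a two-component normal mixture with a common variance."""
    pi = np.quantile(x, [0.25, 0.75])            # crude starting values for the group means
    lam = np.array([0.5, 0.5])
    s2 = np.var(x)
    for _ in range(n_iter):
        logw = np.log(lam) - 0.5 * (x[:, None] - pi) ** 2 / s2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)        # posterior group probabilities
        lam = w.mean(axis=0)
        pi = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        s2 = (w * (x[:, None] - pi) ** 2).sum() / len(x)
    return w, pi, lam, s2

def score_wrt_cov(y, x):
    """Average score with respect to sigma_eps_nu at the restricted estimates."""
    # Restricted estimates, part 1: OLS for the regression equation
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2_eps = e @ e / len(y)
    # Restricted estimates, part 2: mixture model for x
    w, pi, lam, s2_nu = em_two_groups(x)
    nu_hat = x - w @ pi                          # x minus the posterior mean of its group mean
    return np.mean(e * nu_hat) / (s2_eps * s2_nu)

def simulate(n, cov_eps_nu, seed):
    """Assumed design: two groups with means 0 and 4, slope 2, unit error variances."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)
    cov = [[1.0, cov_eps_nu], [cov_eps_nu, 1.0]]
    eps, nu = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    x = np.array([0.0, 4.0])[z] + nu
    return 1.0 + 2.0 * x + eps, x

print("score, exogenous x :", score_wrt_cov(*simulate(20_000, 0.0, seed=0)))
print("score, endogenous x:", score_wrt_cov(*simulate(20_000, 0.5, seed=0)))
```

With an exogenous regressor the average score should be close to zero, while an endogenous regressor pushes it away from zero.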

Generalizing the unobserved instrument. Finally, an interesting empirical question is whether the exogenous part (i.e. the unobserved discrete instrument) of the endogenous regressor can be profiled and given an interpretation. We elaborated on this before and suggested examining the posterior classifications. Alternatively, one can investigate this formally by using a concomitant mixture model (Wedel and Kamakura, 2000), in which case the prior group sizes λ are made dependent on individual level covariates, i.e.

\[
\lambda_{j|i} = \frac{\exp(\gamma_{0j} + v_i'\gamma_j)}{\sum_{l=1}^{k} \exp(\gamma_{0l} + v_i'\gamma_l)}, \tag{8.2}
\]

for j = 1, ..., k. The parameter $\gamma_j$ represents the effect of the concomitant variables $v_i$ on the prior probabilities $\lambda_j$. As such, each observation has its own prior probability $\lambda_{j|i}$ of belonging to the j-th group of the discrete instrument. This generalizes the standard LIV model where the observations have the same prior probabilities $\lambda_j$. An important question is under which conditions inclusion of concomitant variables yields improved results. For instance, if the $v_i$ are observed instrumental variables, this approach may

give more efficient results than classical IV estimation and simple LIV estima-

tion.
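The observation-specific prior probabilities in (8.2) are a multinomial logit (softmax) transformation of the concomitant variables. A minimal sketch of that computation is given below; the function name `concomitant_priors` and the array layout are assumptions made for illustration.

```python
import numpy as np

def concomitant_priors(V, gamma0, Gamma):
    """Prior membership probabilities lambda_{j|i} as in (8.2).

    V      : (n, p) array of concomitant variables v_i
    gamma0 : (k,)   array of intercepts gamma_{0j}
    Gamma  : (p, k) array of slopes; column j holds gamma_j
    """
    eta = gamma0 + V @ Gamma                      # linear predictors, shape (n, k)
    eta = eta - eta.max(axis=1, keepdims=True)    # stabilise the exponentials; ratios are unchanged
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 2))                       # five observations, two concomitant variables
lam = concomitant_priors(V, gamma0=np.zeros(3), Gamma=rng.normal(size=(2, 3)))
print(lam)                                        # one row of k = 3 probabilities per observation
```

Setting all slopes to zero gives every observation the same prior probabilities, which is the standard LIV case. In estimation, the parameters of one group are usually fixed (for example at zero), because adding a constant to all linear predictors leaves the probabilities unchanged.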

Furthermore, it is interesting to investigate whether a generalization of the prior

distribution of the latent instrument can identify patterns of endogeneity. For instance, Dube and Chintagunta (2003) observe for the results obtained by Yang,

Chen and Allenby (2003), that the pattern of endogeneity is most pronounced

at the lower price levels. In other applications similar observations can possi-

bly be made and the pattern of endogeneity may depend on certain covariates.

In summary, we believe that the LIV method is a powerful approach to address endogeneity issues that is simple to implement, and that it presents an avenue for further research and future applications that can shed light on the issues raised in this discussion.

Bibliography

Anderson, T. W. and Rubin, H. (1949). Estimators on the parameters of a single equation in a complete set of stochastic equations. Annals of Mathematical Statistics, 21:570–582.

Andrews, D. W. K. (1999). Estimation when a parameter is on the boundary. Econometrica, 67:1341–1383.

Andrews, R. L. and Currim, A. S. (2003). Retention of latent segments in regression-based marketing models. International Journal of Research in Marketing, 20:315–321.

Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from social security administrative records. The American Economic Review, 80:313–336.

Angrist, J. D., Imbens, G. W., and Krueger, A. B. (1999). Jackknife instrumental variables estimation. Journal of Applied Econometrics, 14:57–67.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 56:979–1014.

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2:1152–1174.

Apostol, T. M. (1969). Calculus, 2nd Ed., Vol. 1: One-Variable Calculus, with an Introduction to Linear Algebra. Blaisdell, Waltham (MA).

Aptech (2000). GAUSS Language Reference. Aptech Systems, Inc., Maple Valley.

Arellano, M. (2002). Sargan’s instrumental variables estimation and the generalized method of moments. Journal of Business & Economic Statistics, 20:450–459.

Arellano, M. and Bover, O. (1995). Another look at the instrumental variables estimation of error-components models. Journal of Econometrics, 68:29–51.

Asher, H. B. (1983). Causal modelling (2nd edition). In Quantitative Applications in the Social Sciences, No. 07-003. Sage Publications, Newbury Park (CA).

Bagozzi, R. P., Yi, Y., and Nassen, K. D. (1999). Representation of measurement error in marketing variables: Review of approaches and extension to three-faced designs. Journal of Econometrics, 89:393–421.

Baltagi, B. H. (2001). Econometric Analysis of Panel Data. John Wiley & Sons, Ltd, Chichester.

Bekker, P. A. (1994). Alternative approximations to the distributions of instrumental variable estimators. Econometrica, 62:657–681.

Bekker, P. A. and Kleibergen, F. (2003). Finite-sample instrumental variables inference using an asymptotic pivotal statistic. Econometric Theory, 19:744–753.

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, Inc., New York.

Berry, S. (2003). Comment: Bayesian analysis of simultaneous demand and supply. Quantitative Marketing and Economics, 1:251–275.

Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica, 63:841–890.

Berry, S. T. (1994). Estimating discrete-choice models of product differentiation. The RAND Journal of Economics, 25:242–262.

Besanko, D., Dube, J.-P., and Gupta, S. (2000). Heterogeneity and target marketing using aggregate retail data: A structural approach. Cornell University.

Besanko, D., Gupta, S., and Jain, D. (1998). Logit demand estimation under competitive pricing behavior: An equilibrium framework. Management Science, 44:1533–1547.

Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:719–725.

Blackburn, M. L. and Neumark, D. (1993). Omitted-ability bias and the increase in the return to schooling. Journal of Labor Economics, 11:521–544.

Blundell, R. and Powell, J. L. (2001a). Endogeneity in nonparametric and semiparametric regression models. Working paper, University College London.

Blundell, R. and Powell, J. L. (2001b). Endogeneity in semiparametric binary response models. CEMMAP working paper, CWP05/01.

Bonjour, D., Cherkas, L. F., Haskel, J. E., Hawkes, D. D., and Spector, T. D. (2003). Returns to education: Evidence from U.K. twins. The American Economic Review, 93:1799–1812.

Bound, J. and Jaeger, D. A. (1996). On the validity of season of birth as an instrument in wage equations: A comment on Angrist and Krueger’s “does compulsory school attendance affect schooling and earnings?”. Technical Report 5835, NBER.

Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90:443–450.

Bowden, R. J. and Turkington, D. A. (1984). Instrumental Variables. Cambridge University Press, New York.

Bronnenberg, B. J. and Mahajan, V. (2001). Unobserved retailer behavior in multimarket data: Joint spatial dependence in market shares and promotion variables. Marketing Science, 20:284–299.

Brooks, S. P. and Roberts, G. O. (1998). Convergence assessment techniques for Markov Chain Monte Carlo. Statistics and Computing, 8:319–335.

Bryk, A. S. and Raudenbush, S. W. (1992). Hierarchical Linear Models, Applications and Data Analysis Methods. Sage Publications, Newbury Park, CA.

Buse, A. (1992). The bias of instrumental variables estimators. Econometrica, 60:173–180.

Card, D. (1995). Using geographical variation in college proximity to estimate the return to schooling. In Christofides, L. N., Grant, E., and Swidinsky, R., editors, Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, pages 201–222. University of Toronto Press, Toronto.

Card, D. (1999). The causal effect of education on earnings. In Ashenfelter, O. C. and Card, D., editors, Handbook of Labor Economics, volume 3A, pages 1801–1863. Elsevier Science B.V., North-Holland.

Card, D. (2001). Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69:1127–1160.

Carroll, R. J., Roeder, K., and Wasserman, L. (1999). Flexible parametric measurement error models. Biometrics, 55:44–54.

Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995). Measurement Error in Nonlinear Models. Chapman & Hall, London.

Chamberlain, G. (1980). Analysis of covariance with qualitative data. The Review of Economic Studies, 47:225–238.

Chamberlain, G. (1982). Multivariate regression models for panel data. Journal of Econometrics, 18:5–46.

Chamberlain, G. (1984). Panel data. In Griliches, S. and Intriligator, M. D., editors, Handbook of Econometrics, Volume II, pages 1247–1318. Elsevier, Amsterdam: North Holland.

Cheng, R. C. H. and Liu, W. B. (2001). The consistency of estimators in finite mixture models. The Scandinavian Journal of Statistics, 28:603–616.

Chintagunta, P. K. (2001). Endogeneity and heterogeneity in a probit demand model: Estimation using aggregate data. Marketing Science, 20:442–456.

Chintagunta, P. K., Kadiyali, V., and Vilcassim, N. J. (2003). Endogeneity and simultaneity in competitive pricing and advertising: A logit demand analysis. Working paper, University of Chicago.

Chintagunta, P. K., Dube, J.-P., and Goh, K. Y. (2004). Beyond the endogeneity bias: The effect of unmeasured brand characteristics on household-level brand choice models. Working paper, University of Chicago.

Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall, New York.

Bibliography 241

Cowles, M. K. and Carlin, B. P. (1996). Markov Chain Monte Carlo con-vergence diagnostics: A comparative review.Journal of the AmericanStatistical Association, 91:883–904.

Cramer, M. . (2004). Omitted variable bias in discrete models.Working paper,Tinbergen institute.

Davidson, R. and MacKinnon, J. G. (1993).Estimation and Inference inEconometrics. Oxford University Press, New York.

Dey, D., Muller, P., and Sinha, D. (1998).Practical Nonparametric and Semi-parametric Bayesian Statistics. Springer-Verlag, New York.

Dhrymes, P. J. (2003). Tests for endogeneity and instrument suitability.Work-ing paper, Columbia University.

Dijk, van, A., Heerde, van, H. J., Leeflang, P. S. H., and Wittink, D. R. (2004).Similarity-based spatial methods for estimating shelf space elasticitiesfrom correlational data.Quantitative Marketing and Economics, 2:257–277.

Donald, S. G. and Newey, W. K. (2001). Choosing the number of instruments.Econometrica, 69:1161–1191.

Draganska, M. and Jain, D. (2004). A likelihood approach to estimating marketequilibrium models.Management Science, 50:605–616.

Dube, J.-P. and Chintagunta, P. K. (2003). Comment: Bayesian analysis of si-multaneous demand and supply.Quantitative Marketing and Economics,1:293–298.

Ebbes, P., Bockenholt, U., and Wedel, M. (2004). Regressor and random-effects dependencies in multilevel models.Statistica Neerlandica,58:161–178.

Erickson, T. (2001). Constructing instruments for regressions with measure-ment error when no additional data are available: Comment.Economet-rica, 69:221–222.

Erickson, T. and Whited, T. M. (2002). Two-step GMM estimation of theerrors-in-variables model using high-order moments.Econometric The-ory, 18:776–799.

Escobar, M. D. (1994). Estimating normal means with a dirichlet process prior.Journal of the American Statistical Association, 89:268–277.

242 Bibliography

Escobar, M. D. and West, M. (1995). Bayesian density estimation and in-ference using mixtures.Journal of the American Statistical Association,90:577–588.

Escobar, M. D. and West, M. (1998). Computing bayesian nonparametric hi-erarchical models. In Dey, D., Muller, P., and Sinha, D., editors,Practi-cal Nonparametric and Semiparametric Bayesian Statistics, pages 1–22.Springer-Verlag, New York.

Fahrmeir, L. and Tutz, G. (1994).Multivariate Statistical Modelling Based onGeneralized Linear Models. Springer-Verlag, New York.

Ferguson, T. S. (1973). A bayesian analysis of some nonparametric problems.The Annals of Statistics, 1:209–230.

Ferguson, T. S. (1996).A Course in Large Sample Theory. Chapman & Hall,New York.

Foster, E. M. (1997). Instrumental variables for logistic regression: An illus-tration. Social Science Research, 26:487–504.

Fox, J. (1991).Regression Diagnostics. Sage Publications, inc., London.

Fuller, W. (1977). Some properties of a modification of the limited informationestimator.Econometrica, 45:939–953.

Garen, J. (1984). The returns to schooling: A selectivity bias approach with acontinuous choice variable.Econometrica, 52:1199–1218.

Gasmi, F., Laffont, J. J., and Vuong, Q. (1992). Econometric analysis of collu-sive behavior in a soft-drink market.Journal of Economics and Manage-ment Strategy, 1:277–311.

Goldstein, H. (1995).Multilevel Statistical Models. John Wiley & Sons Ltd.,New York.

Gonul, F. F., Kim, B.-D., and Shi, M. (2000). Mailing smarter to catalogcustomers.Journal of Interactive Marketing, 14:2–16.

Greene, W. H. (2000).Econometric Analysis. Prentice-Hall, Inc., Upper Sad-dle River, New Jersey.

Griliches, Z. (1977). Estimating the returns to schooling: Some econometricproblems.Econometrica, 45:1–22.

Bibliography 243

Hahn, J. (2002). Optimal inference with many instruments.Econometric The-ory, 18:140–168.

Hahn, J. and Hausman, J. (2002). A new specification test for the validity ofinstrumental variables.Econometrica, 70:163–189.

Hahn, J. and Hausman, J. (2003). Weak instrumens: Diagnosis and cures inempirical econometrics.Recent Advances in Econometric Methodology,93:118–125.

Hamilton, B. H. and Nickerson, J. A. (2003). Correcting for endogeneity instrategic management research.Strategic Organization, 1:51–78.

Harmon, C. and Walker, I. (1995). Estimates of the economic return to school-ing for the united kingdom.American Economic Review, 85:1278–1286.

Hartog, J. (1988). An ordered response model for allocation and earnings.Kyklos, 41:113–141.

Hausman, J. A. (1978). Specification tests for econometrics.Econometrica,46:1251–1271.

Hausman, J. A. and Taylor, W. E. (1981). Panel data and unobservable indi-vidual effects.Econometrica, 49:1377–1398.

Hennig, C. (2000). Identifiability of models for clusterwise linear regression.Journal of Classification, 17:273–296.

Hertz, T. (2003). Upward bias in the estimated returns to education: Evidencefrom south africa.The American Economic Review, 93:1354–1368.

Honore, B. O. and Hu, L. (2004). On the performance of some robust instru-mental variables estimators.Journal of Business & Economic Statistics,22:30–39.

Hsiao, C. (1986).Analysis of Panel Data. Cambridge University Press, NewYork.

Ibrahim, J. G. and Kleinman, K. P. (1998). Semiparametric bayesian methodsfor random effects models. In Dey, D., Muller, P., and Sinha, D., editors,Practical Nonparametric and Semiparametric Bayesian Statistics, pages89–114. Springer-Verlag, New York.

Im, K. S., Ahn, S. C., Schmidt, P., and Wooldridge, J. M. (1999). Efficientestimation of panel data models with strictly exogenous explanatory vari-ables.Journal of Econometrics, 93:177–201.

244 Bibliography

Isacsson, G. (2004). Estimating the economic return to educational levels usingdata on twins.Journal of Applied Econometrics, 19:99–119.

Judge, G. G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., and Lee, T.-C. (1985).The Theory and Practice of Econometrics. John Wiley & Sons Inc., NewYork.

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihoodestimator in the presence of infinitely many incidental parameters.Annalsof Mathematical Statistics, 27:887–906.

Kim, J. G., Menzefricke, U., and Feinberg, F. M. (2004). Assessing hetero-geneity in discrete choice models using a dirichlet process prior.Reviewof Marketing Science, 2:1–39.

Kleibergen, F. (2002). Pivotal statistics for testing structural parameters ininstrumental variables regression.Econometrica, 70:1781–1803.

Kleibergen, F. and Zivot, E. (2003). Bayesian and classical approaches toinstrumental variables regression.Journal of Econometrics, 114:29–72.

Leeflang, P. S. H. (1994).Probleemgebied Marketing: De Marktinstrumenten.Stenfert Kroese, Houten.

Lenk, P. J. (2001). Bayesian inference and Markov Chain Monte Carlo.Notes,University of Michigan.

Lenk, P. J., DeSarbo, W. S., Green, P. E., and Young, M. R. (1996). Hierar-chical bayes conjoint analysis: Recovery of partworth heterogeneity fromreduced experimental designs.Marketing Science, 15:173–191.

Lewbel, A. (1997). Constructing instruments for regressions with measure-ment error when no additional data are available, with an application topatents and R&D.Econometrica, 65:1201–1213.

Longford, N. T. (1993).Random Coefficient Models. Oxford University Press,New York.

MacEachern, S. N. (1998). Computational methods for mixture of dirichletprocess models. In Dey, D., Muller, P., and Sinha, D., editors,Practi-cal Nonparametric and Semiparametric Bayesian Statistics, pages 23–43.Springer-Verlag, New York.

Madansky, A. (1959). The fitting of straight lines when both variables aresubject to error.Journal of the American Statistical Association, 54:173–205.

Bibliography 245

Manchanda, P., Rossi, P. E., and Chintagunta, P. K. (2004). Response mod-eling with non-random marketing mix variables.Journal of MarketingResearch, forthcoming.

Meng, X.-L. (1994). Posterior predictive P-values.The Annals of Statistics,22:1142–1160.

Mroz, T. A. (1987). The sensitivity of an empirical model of married women’shours of work to economic and statistical assumptions.Econometrica,55:765–799.

Mullahy, J. (1997). Instrumental-variable estimation of count data models:Applications to models of cigarette smoking behavior.The Review ofEconomics and Statistics, pages 586–593.

Mundlak, Y. (1978). On the pooling of time-series and cross section data.Econometrica, 46:69–85.

Naik, P. A., Shi, P., and Tsai, C.-L. (2003). Extending Akaike informationcriterion to mixture regression models.Working paper.

Nelson, C. R. and Startz, R. (1990). Some further results on the exact smallsample properties of the instrumental variable estimator.Econometrica,58:967–976.

Nevo, A. (2000). A practitioner’s guide to estimation of random-coefficientslogit models of demand.Journal of Economics & Management Strategy,9:513–548.

Nevo, A. (2001). Measuring market power in the ready-to-eat cereal industry.Econometrica, 69:307–342.

Neyman, J. and Scott, E. L. (1951). On certain methods of estimating the linearstructural relation.The Annals of Mathematical Statistics, 22:352–361.

Pagan, A. (1984). Econometric issues in the analysis of regressions with gen-erated regressors.International Economic Review, 25:221–247.

Petrin, A. and Train, K. (2000). Omitted product attributes in discrete choicemodels.Working paper, University of Berkeley.

Plat, F. W. (1988).Modelling for Markets: Applications of Advanced Modelsand Methods for Data Analysis. PhD thesis, Rijksuniversiteit Groningen.

Ploeg, van der, J. (1997).Instrumental Variable Estimation and Group-Asymptotics. PhD thesis, SOM, University of Groningen.

246 Bibliography

Pudney, S. E. (1978). The estimation and testing of some error componentsmodels. Technical report, London school of economics.

Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likeli-hood and the EM algorithm.SIAM Review, 26:195–239.

Reiersøl, O. (1950). Identifiability of a linear relation between variables whichare subject to error.Econometrica, 18:375–389.

Ruud, P. A. (2000).An Introduction to Classical Econometric Theory. OxfordUniversity Press, New York.

Sargan, J. D. (1958). The estimation of economic relationships using instru-mental variables.Econometrica, 26:393–415.

Sargan, J. D. (1959). The estimation of relationships with autocorrelated resid-uals by the use of instrumental variables.Journal of the Royal StatisticalSociety, Series B, 21:91–105.

Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). Calibration of P-values fortesting precise null hypotheses.The American Statistician, 55:62–71.

Shugan, S. M. (2004). Endogeneity in marketing decision models.MarketingScience, 23:1–3.

Snijders, T. A. B. and Bosker, R. J. (1999).Multilevel Analysis. SAGE Publi-cations, London.

Spencer, N. H. and Fielding, A. (1998a). A comparison of modelling strategiesfor value-added analyses of educational data.Working paper, Universityof Hertfordshire.

Spencer, N. H. and Fielding, A. (1998b). An instrumental variable consistentestimation procedure to overcome the problem of endogenous variablesin multilevel models.Working paper, University of Hertfordshire.

Staiger, D. and Stock, J. H. (1997). Instrumental variables regression withweak instruments.Econometrica, 65:557–586.

Stern, S. (2004). Do scientist pay to be scientist?Management Science,50:835–853.

Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instrumentsand weak identification in generalized method of moments.Journal ofBusiness & Economic Statistics, 20:518–529.

Bibliography 247

Sudhir, K. (2001). Competitive pricing behavior in the auto market: A struc-tural analysis.Marketing Science, 20:42–60.

Teicher, H. (1963). Identifiability of finite mixtures.The Annals of Mathemat-ical Statistics, 34:1265–1269.

Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985).StatisticalAnalysis of Finite Mixture Distributions. John Wiley & Sons Ltd., Chich-ester.

Uusitalo, R. (1999).Essays in Economics of Education. PhD thesis, Universityof Helsinki.

Vella, F. (1998). Estimating models with sample selection bias: A survey.TheJournal of Human Resources, 33:127–169.

Vella, F. and Verbeek, M. (1998). Whose wages do unions raise? a dynamicmodel of unionism and wage rate determination for young men.Journalof Applied Econometrics, 13:163–183.

Verbeek, M. (2000).A Guide to Modern Econometrics. John Wiley & SonsLtd., Chichester.

Vilcassim, N. J. and Chintagunta, P. K. (1995). Investigating retailer productcategory pricing from household scanner panel data.Journal of Retailing,71:103–128.

Villas-Boas, J. M. and Winer, R. S. (1999). Endogeneity in brand choice mod-els. Management Science, 45:1324–1338.

Wald, A. (1940). The fitting of straight lines if both variables are subject toerror. The Annals of Mathematical Statistics, 11:284–300.

Wang, P., Puterman, M. L., Cockburn, I., and Le, N. (1996). Mixed poissonregression models with covariate dependent rates.Biometrics, 52:381–400.

Wansbeek, T. and Meijer, E. (2000).Measurement Error and Latent Variablesin Econometrics. Elsevier, Amsterdam.

Wedel, M. and Kamakura, W. A. (2000).Market Segmentation. Kluwer Aca-demic Publishers, Boston.

Weisstein, E. W. (2004a). Multinomial distribution.From Mathworld – AWolfram Web Resource.

248 Bibliography

Weisstein, E. W. (2004b). Leptokurtic.From Mathworld – A Wolfram WebResource.

Weisstein, E. W. (2004c). Power.FromMathworld– A Wolfram Web Resource.

West, M. (1992). Hyperparameter estimation in dirichlet process mixture mod-els. ISDS Discussion Paper, no. 92-A02, Duke University.

West, M., Muller, P., and Escobar, M. D. (1994). Hierarchical priors and mix-ture models, with applications in regression and density estimation. InFreeman, P. R. and Smith, A. F. M., editors,Aspects of Uncertainty, aTribute to D. V. Lindley, pages 363–386. John Wiley & Sons Ltd., Chich-ester.

White, H. (1980). A heteroscedasticity-consistent covariance matrix estimatorand a direct test for heteroscedasticity.Econometrica, 48:817–838.

White, H. (1982). Maximum likelihood estimation of misspecified models.Econometrica, 50:1–25.

White, H. (2001).Asymptotic Theory for Econometricians. Academic Press,New York.

Wooldridge, J. M. (2002).Econometric Analysis of Cross Section and PanelData. Massachusetts Institute of Technology, Cambridge.

Yakowitz, S. J. and Spragins, J. D. (1968). On the identifiability of finitemixtures.The Annals of Mathematical Statistics, 39:209–214.

Yang, S., Chen, Y., and Allenby, G. M. (2003). Bayesian analysis of simultane-ous demand and supply.Quantitative Marketing and Economics, 1:251–275.

Zhu, H.-T. and Zhang, H. (2004). Hypothesis testing in mixture regressionmodels.Journal of the Royal Statistical Society, Series B, 66:3–16.

Summary (translated from the Dutch "Samenvatting")

In this thesis we introduce and develop a new method, the latent instrumental variables (LIV) method, which can be used, for example, to solve the following problem. An ice cream vendor is interested in the price sensitivity of her sales. That is, she wants to know the relationship between the price of her ice creams and her sales. To that end, she changes the ice cream price regularly over a long period and records the realized sales. These two variables are available for analysis. However, she does not tell the researcher that she charges a relatively higher price on warm days: a higher temperature leads to an increase in the number of beach visitors, who, because of the heat, will have a high willingness to buy an ice cream. To keep the example simple, we assume that the relationship between the ice cream price and sales, given the data, can be determined with the widely used linear regression model. That is, $S_t = \alpha + \beta P_t + \varepsilon_t$, where $S_t$ is the number of ice creams sold (sales) on day $t$ and $P_t$ is the price charged at time $t$. The unknown parameter $\beta$ represents the price sensitivity and is expected to be negative. Because this model is a simplified representation of reality, a (small) error $\varepsilon_t$ is always made. For instance, the temperature effect is 'part' of this disturbance term. Once the data on sales $S$ and prices $P$ are available, the obvious step is to estimate $\beta$ with the standard linear regression technique, the ordinary least squares (OLS) method. This technique rests on a number of assumptions, including the assumption that the price $P$ and the disturbance term $\varepsilon$ are independent of each other. If that is not the case, the estimate of the price sensitivity $\beta$ computed with the OLS method is certain to be incorrect. In this example the independence assumption does not hold. Because the vendor uses information about the temperature to set her price, the price depends on the temperature and the variables $P$ and $\varepsilon$ are dependent. The standard OLS method therefore yields a wrong price sensitivity, because part of the temperature effect is wrongly attributed to the price effect. Similar problems occur regularly in more realistic applications. In this thesis we propose a new method that can give the right answer in such situations.
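The ice cream example can be illustrated with a small simulation. The sketch below is only an illustration under assumed numbers: the coefficients, the sample size, the function names `simulate_ice_cream` and `ols_slope`, and the use of NumPy are all hypothetical choices and do not come from the thesis.

```python
import numpy as np


def simulate_ice_cream(n=20000, beta=-2.0, seed=0):
    """Simulate prices and sales when the vendor prices on unobserved temperature."""
    rng = np.random.default_rng(seed)
    temp = rng.normal(0.0, 1.0, n)                      # standardized temperature, never recorded
    price = 2.0 + 0.5 * temp + rng.normal(0.0, 0.5, n)  # assumed rule: higher price on warm days
    eps = 3.0 * temp + rng.normal(0.0, 1.0, n)          # temperature sits in the disturbance term
    sales = 50.0 + beta * price + eps
    return price, sales


def ols_slope(x, y):
    """Slope of the least squares line of y on x (intercept included)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


TRUE_BETA = -2.0  # assumed true price sensitivity
price, sales = simulate_ice_cream(beta=TRUE_BETA)
beta_ols = ols_slope(price, sales)
print(f"true beta = {TRUE_BETA}, OLS estimate = {beta_ols:.2f}")
```

Under these assumed numbers, Cov(P, ε) = 0.5 · 3 = 1.5 and Var(P) = 0.25 + 0.25 = 0.5, so the OLS slope converges to β + 1.5/0.5 = β + 3. The estimate is therefore positive even though the true price sensitivity is negative, which is the kind of misattribution of the temperature effect described above.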

The model just used, the linear regression model, together with the technique mentioned for estimating its unknown parameters, the OLS method, is, in more general form, widely used in practice and in science to determine the relationship between a dependent variable, for example sales, and independent variables, for example price. In this thesis we consider situations in which an independent variable is correlated with the disturbance term. In that case the variable is not exogenous but endogenous, and there is an endogeneity problem. Such dependencies can arise in various situations: when explanatory variables are wrongly left out of the model, such as temperature in the example above; when the dependent variable also influences the independent variables (simultaneity); or when the independent variables contain measurement error. The standard inference procedures then lead to wrong conclusions. For example, the OLS estimators of the regression parameters are biased and inconsistent, and the true effects of the regressors on the dependent variable are systematically under- or overestimated. A manager or policy maker who uses such results to make a decision can make a serious mistake.

There is an established technique that can be used instead of OLS when the regressors correlate with the disturbance term: the classical instrumental variables (IV) method. This method has a long history in the econometric literature and works roughly as follows. One collects additional variables, the so-called instruments, which must meet the following requirements: (1) the instruments explain part of the variance in the endogenous regressors, and (2) the instruments are independent of the disturbance term. For the ice cream problem above, one must therefore find one or more variables that explain the ice cream price but are independent of the temperature. When such instruments are available, the regression parameters can be estimated consistently with, for example, two-stage regression techniques. However, such additional variables are often not available or do not satisfy the two conditions above. In the first case the IV estimator cannot be computed, and in the second case the method can produce very unreliable results. These results are sometimes worse than simply ignoring the endogeneity problem and using the OLS method in the knowledge that OLS gives a wrong answer. Despite these criticisms, instruments are in practice often defined on an ad hoc basis, or simply on the basis of availability. The problems with the classical IV method and the possible inconsistency of the popular standard OLS estimator form the starting point of this research. We introduce a new method, the latent instrumental variables (LIV) method, which can solve the endogeneity problem without using additional data. Our method estimates the 'perfect' instruments from the available data, so that the regression parameters can be estimated consistently regardless of possible dependencies between the regressors and the disturbance term.
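The two-stage idea behind the classical IV method can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: the instrument `cost` (an ingredient cost that shifts the price but is unrelated to temperature), all coefficients, and the helper names `simulate`, `fit_line`, and `two_stage_least_squares` are assumptions made for this example.

```python
import numpy as np


def simulate(n=20000, beta=-2.0, seed=1):
    """Ice cream data with a hypothetical observed instrument (ingredient cost)."""
    rng = np.random.default_rng(seed)
    temp = rng.normal(size=n)   # unobserved, drives both price and sales
    cost = rng.normal(size=n)   # assumed instrument: moves price, independent of temp
    price = 2.0 + 0.5 * temp + 0.8 * cost + rng.normal(0.0, 0.5, n)
    sales = 50.0 + beta * price + 3.0 * temp + rng.normal(size=n)
    return cost, price, sales


def fit_line(x, y):
    """Least squares fit of y on [1, x]; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[0]), float(coef[1])


def two_stage_least_squares(z, x, y):
    """Stage 1: regress x on z. Stage 2: regress y on the fitted values of x."""
    a, b = fit_line(z, x)
    x_hat = a + b * z           # the part of the price explained by the instrument
    _, slope = fit_line(x_hat, y)
    return slope


TRUE_BETA = -2.0  # assumed true price sensitivity
cost, price, sales = simulate(beta=TRUE_BETA)
_, beta_ols = fit_line(price, sales)
beta_2sls = two_stage_least_squares(cost, price, sales)
print(f"OLS: {beta_ols:.2f}, 2SLS: {beta_2sls:.2f}, true: {TRUE_BETA}")
```

In this simulated setting the instrument satisfies both requirements by construction, so the two-stage estimate lands close to the true value while OLS does not. With real data neither requirement is guaranteed, which is the difficulty the paragraph above describes. The second-stage standard errors from a plain regression on fitted values would not be valid, so this sketch reports point estimates only.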


The classical instrumental variables (IV) method mentioned above and its potential problems are discussed in detail in chapter 2. We present a literature review that covers a large part of the studies on the classical IV method, in particular the situation in which only weak instruments are available, as well as a large number of empirical studies with endogeneity problems. We also point to a number of alternative results that can be used to solve an endogeneity problem. This review shows that endogeneity problems occur in many situations and cannot always be solved easily with the available standard methods. In the conclusion of this chapter we motivate the development of the LIV method in detail.

In chapter 3 we introduce the latent instrumental variables (LIV) method. This method assumes that there exists an instrument that is discrete and latent (unobserved). As in the classical IV model, it is assumed that the endogenous regressor can be split into an exogenous part and an endogenous part. The assumption that the instrument is discrete means that the sample can be divided into groups, and the assumption that the instrument is latent means that this division is not observed. We propose to use mixture models to estimate this division. Our approach also makes it possible to test for the absence of endogeneity without requiring that observed instruments are available. The simulation studies show that the LIV method estimates the regression parameters consistently, even though no additional data are used. The results also show that our method is superior to the standard OLS method when the regressors are not exogenous. We show that the proposed test has reasonably high power to detect endogeneity. Furthermore, we prove that the model parameters are identified. We apply the LIV method to a measurement error problem for which a discrete instrument from an experimental setting is available. The estimated LIV instrument turns out to be identical to the observed instrument. This is an important empirical result, because in most studies in economics and marketing experimental data are not available. These results show that the LIV method can be applied successfully in a situation where an independent variable is correlated with the error term of the model. In contrast to the classical IV method, the results do not depend on the availability and quality of observed instruments.
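The structure described in this paragraph can be sketched in formulas. The summary itself gives no equations, so the notation below is an assumption made for illustration: $y_i$ is the dependent variable, $x_i$ the endogenous regressor, $\tilde{z}_i$ a latent indicator of one of $m$ groups (with $e_k$ the $k$-th unit vector), $\pi$ the vector of group means, and $\lambda_k$ the group probabilities. The joint normality of the error terms is likewise an assumed distributional choice.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \\
x_i &= \pi^{\top} \tilde{z}_i + v_i, \qquad
       \tilde{z}_i \in \{e_1, \dots, e_m\}, \quad
       \Pr(\tilde{z}_i = e_k) = \lambda_k, \\
\begin{pmatrix} \varepsilon_i \\ v_i \end{pmatrix}
    &\sim N\!\left(
       \begin{pmatrix} 0 \\ 0 \end{pmatrix},
       \begin{pmatrix}
         \sigma_{\varepsilon}^{2} & \sigma_{\varepsilon v} \\
         \sigma_{\varepsilon v}   & \sigma_{v}^{2}
       \end{pmatrix}
     \right), \qquad
     \tilde{z}_i \perp (\varepsilon_i, v_i).
\end{align*}
```

Under these assumptions $\pi^{\top}\tilde{z}_i$ is the exogenous part of the regressor and $v_i$ the endogenous part, the pair $(y_i, x_i)$ follows a mixture of $m$ bivariate normal distributions, and $\sigma_{\varepsilon v} = 0$ corresponds to the absence of endogeneity.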


In chapter 4 we extend the LIV model further. We include several exogenous regressors in the model. In addition, we consider a situation in which observed instruments are available, and we show how these variables can be included in the LIV model. We use techniques similar to those in chapter 3 to show that all model parameters are identifiable. The identification results suggest a new method to study the validity of the observed instruments: we can check whether the available instruments satisfy the two requirements mentioned above. To the best of our knowledge, such methods have not been developed before, and they are important for empirical researchers, given that the results of the classical IV method depend strongly on the quality of the instruments used. We illustrate the proposed method on simulated data and show that the approach is successful in identifying instruments of poor quality. In addition, we propose a number of diagnostics to check whether the LIV model describes the data well. Furthermore, we show that the LIV estimates are reasonably robust against model misspecification (distributional assumptions) and outliers in the data.

In chapter 5 we investigate the relation between 'education' and 'income', using the techniques developed in the preceding chapters. The variable 'education' is possibly endogenous because a good measure of 'ability' is usually missing from the available data. Someone who is naturally successful will probably stay in education longer and will also be able to generate more income, regardless of his or her education. The standard OLS method for determining the effect of education can therefore not be trusted. We present an overview of the education literature, which shows that a large number of studies have examined the effect of 'education' on 'income'. However, it also appears that no satisfactory answer has yet been found, because suitable instruments are usually lacking. In addition, many results obtained with the classical IV method contradict each other regarding the size and the sign of the bias in the OLS estimator of the education effect. These results are attributed to the use of poor instruments. The results of our LIV method do not depend on such observed, and possibly poor, instruments. For three datasets we find that the bias in the OLS estimator is approximately 7%. That is, education appears to be worth more according to the standard OLS method than it really is. Our results agree closely with recent conclusions of studies that use twins. We find that the LIV model describes the data sufficiently well. In addition, for two of the three datasets we find that the available instruments do not satisfy the assumptions mentioned above and are therefore of poor quality. We conclude that our results, based on the LIV model, are to be preferred over those of the standard OLS method and the classical IV method.

Rijksuniversiteit Groningen

Latent Instrumental Variables
A New Approach to Solve for Endogeneity

Doctoral thesis (Proefschrift)

to obtain the doctorate in Economics (Economische Wetenschappen)

at the Rijksuniversiteit Groningen, on the authority of the

Rector Magnificus, dr. F. Zwarts, to be defended in public on

Thursday 23 December 2004 at 11.00 hours

by

Peter Ebbes

born on 28 May 1976 in Smilde

In chapter 6 we consider endogeneity in multilevel models. Such models are used when the data have a hierarchical structure. As for the simple regression model, these models assume that the regressors are independent of the stochastic components ("random effects") in the model. If one has, for example, panel data or data on twins or siblings, multilevel models can in certain cases be used to solve the endogeneity problem. Several widely used estimation techniques are available for such situations. In this chapter we show that these techniques must be applied with care. Our simulation studies show in which cases these methods give the right answer and in which cases they do not work. We conclude that the current methods, although widely applied, are limited in their use, and we suggest steps for further research. It appears that dependence between regressors and random effects has been investigated in several econometric studies, but this important problem has received hardly any attention in the social and behavioral science literature.

The results of chapter 6 are extended further in chapter 7. We introduce nonparametric Bayesian methods to solve for correlations between regressors and random effects at different levels of the model. This method can answer a number of the open questions from chapter 6. In addition, this approach generalizes the LIV model of chapters 3 and 4, because a general distribution is now assumed for the unobserved instrument instead of a (discrete) multinomial distribution. An additional advantage is that it is no longer necessary to specify the number of categories of the unobserved instrument in advance. Because this model is formulated entirely in a Bayesian framework, it can be extended fairly easily to more general situations. In addition, we obtain better insight into the small-sample properties of the estimators. The research in this chapter has not yet been fully completed, but the first results are promising. In the discussion we give suggestions for next steps.

We conclude that the new LIV method introduced in this thesis is an interesting and valuable technique for solving endogeneity problems. In addition, the LIV method is relatively simple to use. In the last chapter of this thesis we summarize the most important conclusions and present a discussion of the results. We indicate the main limitations and suggest how future research can answer the open questions.

Propositions (Stellingen)

accompanying the thesis


by

Peter Ebbes

23 December 2004

1. Similar to the LIV approach, the instruments defined by Lewbel (1997) do not necessarily correspond to a 'physical' measure or an economic theory, and can be considered as 'nuisance' (chapter 2).

2. A drawback of most of the methods proposed to solve for price endogeneity in market response models is that these methods are based on rather strong assumptions on the nature of endogeneity, and are, therefore, expected to be limited in use (chapter 2).

3. All the parameters in the simple LIV model are identifiable as long as the endogenous regressor has a non-normal distribution (chapter 3).

4. The LIV estimates for the regression parameters are consistent, regardless of the choice for m (chapters 3, 4, and 8).

5. The LIV approach can be used to investigate whether available observed instrumental variables are valid (chapter 4).

6. The LIV estimate of the return to education is to be preferred over the OLS and the classical IV estimates (chapter 5).

7. Family background instrumental variables in schooling applications are most likely endogenous (chapter 5).

8. The Hausman-Taylor approach, which is used to solve for endogeneity problems at level two, should be used with caution because its results are seriously biased in the presence of endogeneity at level one (chapter 6).

9. The nonparametric Bayesian approach to model the distribution of the latent instrument illustrates that the exact choice of m is not very important, but that estimating its value may yield efficiency advantages in certain situations (chapter 7).

10. Omitted product attributes in a conjoint analysis study lead to biased estimates for the part worth utilities, but correcting for this with instrumental variables methods is not straightforward.

11. It is incorrect to label the Latent Instrumental Variables (LIV) approachas ‘latent’.

12. Increasing the number of items in a choice set does not necessarily add to the happiness of a consumer. A nice example is ordering your x-th beer in an American bar, where x > 2.

13. Writing down a formulated theorem ('stelling') right away is a clever strategy.

14. It is possible in the USA to make a living out of suing others.

15. The duration of a game of 'Settlers of Catan' is usually too short for the law of large numbers to apply. But labeling the winner as a 'geluksvogel' is typically not accepted.
