Yatchew, A. Semiparametric Regression for the Applied Econometrician. Cambridge University Press, 2003. ISBN 0521812836.
Semiparametric Regression for the Applied Econometrician

This book provides an accessible collection of techniques for analyzing nonparametric and semiparametric regression models. Worked examples include estimation of Engel curves and equivalence scales; scale economies; semiparametric Cobb-Douglas, translog, and CES cost functions; household gasoline consumption; hedonic housing prices; and option prices and state price density estimation. The book should be of interest to a broad range of economists, including those working in industrial organization, labor, development, urban, energy, and financial economics. A variety of testing procedures are covered, such as simple goodness-of-fit tests and residual regression tests. These procedures can be used to test hypotheses such as parametric and semiparametric specifications, significance, monotonicity, and additive separability. Other topics include endogeneity of parametric and nonparametric effects, as well as heteroskedasticity and autocorrelation in the residuals. Bootstrap procedures are provided.

Adonis Yatchew teaches economics at the University of Toronto. His principal areas of research are theoretical and applied econometrics. In addition, he has a strong interest in regulatory and energy economics and is Joint Editor of the Energy Journal. He has received the social science undergraduate teaching award at the University of Toronto and has taught at the University of Chicago.

Further Praise for Semiparametric Regression for the Applied Econometrician

"This fluent book is an excellent source for learning, or updating one's knowledge of semi- and nonparametric methods and their applications. It is a valuable addition to the existent books on these topics." -- Rosa Matzkin, Northwestern University

"Yatchew's book is an excellent account of semiparametric regression. The material is nicely integrated by using a simple set of ideas which exploit the impact of differencing and weighting operations on the data. The empirical applications are attractive and will be extremely helpful for those encountering this material for the first time." -- Adrian Pagan, Australian National University

"At the University of Toronto Adonis Yatchew is known for excellence in teaching. The key to this excellence is the succinct transparency of his exposition. At its best such exposition transcends the medium of presentation (either lecture or text). This monograph reflects the clarity of the author's thinking on the rapidly expanding fields of semiparametric and nonparametric analysis. Both students and researchers will appreciate the mix of theory and empirical application." -- Dale Poirier, University of California, Irvine

Themes in Modern Econometrics

Managing editor: Peter C.B. Phillips, Yale University
Series editors: Richard J. Smith, University of Warwick; Eric Ghysels, University of North Carolina, Chapel Hill

Themes in Modern Econometrics is designed to service the large and growing need for explicit teaching tools in econometrics. It will provide an organized sequence of textbooks in econometrics aimed squarely at the student population and will be the first series in the discipline to have this as its express aim. Written at a level accessible to students with an introductory course in econometrics behind them, each book will address topics or themes that students and researchers encounter daily.
Although each book will be designed to stand alone as an authoritative survey in its own right, the distinct emphasis throughout will be on pedagogic excellence.

Titles in the series:
Statistics and Econometric Models: Volumes 1 and 2, Christian Gourieroux and Alain Monfort (translated by Quang Vuong)
Time Series and Dynamic Models, Christian Gourieroux and Alain Monfort (translated and edited by Giampiero Gallo)
Unit Roots, Cointegration, and Structural Change, G.S. Maddala and In-Moo Kim
Generalized Method of Moments Estimation, edited by László Mátyás
Nonparametric Econometrics, Adrian Pagan and Aman Ullah
Econometrics of Qualitative Dependent Variables, Christian Gourieroux (translated by Paul B. Klassen)
The Econometric Analysis of Seasonal Time Series, Eric Ghysels and Denise R. Osborn

SEMIPARAMETRIC REGRESSION FOR THE APPLIED ECONOMETRICIAN
Adonis Yatchew, University of Toronto

Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
The Edinburgh Building, Cambridge CB2 2RU, United Kingdom
Information on this title: www.cambridge.org/9780521812832
© Adonis Yatchew 2003
This book is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published in print format 2003
ISBN-13 978-0-521-81283-2 hardback (ISBN-10 0-521-81283-6)
ISBN-13 978-0-521-01226-3 paperback (ISBN-10 0-521-01226-0)
ISBN-13 978-0-511-07313-7 eBook (EBL) (ISBN-10 0-511-07313-5)
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this book, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Published in the United States of America by Cambridge University Press, New York (www.cambridge.org)

To Marta, Tamara and Mark.
Your smiles are sunlight,
your laughter, the twinkling of stars.

Contents

List of Figures and Tables
Preface
1 Introduction to Differencing
1.1 A Simple Idea
1.2 Estimation of the Residual Variance
1.3 The Partial Linear Model
1.4 Specification Test
1.5 Test of Equality of Regression Functions
1.6 Empirical Application: Scale Economies in Electricity Distribution
1.7 Why Differencing?
1.8 Empirical Applications
1.9 Notational Conventions
1.10 Exercises
2 Background and Overview
2.1 Categorization of Models
2.2 The Curse of Dimensionality and the Need for Large Data Sets
2.2.1 Dimension Matters
2.2.2 Restrictions That Mitigate the Curse
2.3 Local Averaging Versus Optimization
2.3.1 Local Averaging
2.3.2 Bias-Variance Trade-Off
2.3.3 Naive Optimization
2.4 A Bird's-Eye View of Important Theoretical Results
2.4.1 Computability of Estimators
2.4.2 Consistency
2.4.3 Rate of Convergence
2.4.4 Bias-Variance Trade-Off
2.4.5 Asymptotic Distributions of Estimators
2.4.6 How Much to Smooth
2.4.7 Testing Procedures
3 Introduction to Smoothing
3.1 A Simple Smoother
3.1.1 The Moving Average Smoother
3.1.2 A Basic Approximation
3.1.3 Consistency and Rate of Convergence
3.1.4 Asymptotic Normality and Confidence Intervals
3.1.5 Smoothing Matrix
3.1.6 Empirical Application: Engel Curve Estimation
3.2 Kernel Smoothers
3.2.1 Estimator
3.2.2 Asymptotic Normality
3.2.3 Comparison to Moving Average Smoother
3.2.4 Confidence Intervals
3.2.5 Uniform Confidence Bands
3.2.6 Empirical Application: Engel Curve Estimation
3.3 Nonparametric Least-Squares and Spline Smoothers
3.3.1 Estimation
3.3.2 Properties
3.3.3 Spline Smoothers
3.4 Local Polynomial Smoothers
3.4.1 Local Linear Regression
3.4.2 Properties
3.4.3 Empirical Application: Engel Curve Estimation
3.5 Selection of Smoothing Parameter
3.5.1 Kernel Estimation
3.5.2 Nonparametric Least Squares
3.5.3 Implementation
3.6 Partial Linear Model
3.6.1 Kernel Estimation
3.6.2 Nonparametric Least Squares
3.6.3 The General Case
3.6.4 Heteroskedasticity
3.6.5 Heteroskedasticity and Autocorrelation
3.7 Derivative Estimation
3.7.1 Point Estimates
3.7.2 Average Derivative Estimation
3.8 Exercises
4 Higher-Order Differencing Procedures
4.1 Differencing Matrices
4.1.1 Definitions
4.1.2 Basic Properties of Differencing and Related Matrices
4.2 Variance Estimation
4.2.1 The mth-Order Differencing Estimator
4.2.2 Properties
4.2.3 Optimal Differencing Coefficients
4.2.4 Moving Average Differencing Coefficients
4.2.5 Asymptotic Normality
4.3 Specification Test
4.3.1 A Simple Statistic
4.3.2 Heteroskedasticity
4.3.3 Empirical Application: Log-Linearity of Engel Curves
4.4 Test of Equality of Regression Functions
4.4.1 A Simplified Test Procedure
4.4.2 The Differencing Estimator Applied to the Pooled Data
4.4.3 Properties
4.4.4 Empirical Application: Testing Equality of Engel Curves
4.5 Partial Linear Model
4.5.1 Estimator
4.5.2 Heteroskedasticity
4.6 Empirical Applications
4.6.1 Household Gasoline Demand in Canada
4.6.2 Scale Economies in Electricity Distribution
4.6.3 Weather and Electricity Demand
4.7 Partial Parametric Model
4.7.1 Estimator
4.7.2 Empirical Application: CES Cost Function
4.8 Endogenous Parametric Variables in the Partial Linear Model
4.8.1 Instrumental Variables
4.8.2 Hausman Test
4.9 Endogenous Nonparametric Variable
4.9.1 Estimation
4.9.2 Empirical Application: Household Gasoline Demand and Price Endogeneity
4.10 Alternative Differencing Coefficients
4.11 The Relationship of Differencing to Smoothing
4.12 Combining Differencing and Smoothing
4.12.1 Modular Approach to Analysis of the Partial Linear Model
4.12.2 Combining Differencing Procedures in Sequence
4.12.3 Combining Differencing and Smoothing
4.12.4 Reprise
4.13 Exercises
5 Nonparametric Functions of Several Variables
5.1 Smoothing
5.1.1 Introduction
5.1.2 Kernel Estimation of Functions of Several Variables
5.1.3 Loess
5.1.4 Nonparametric Least Squares
5.2 Additive Separability
5.2.1 Backfitting
5.2.2 Additively Separable Nonparametric Least Squares
5.3 Differencing
5.3.1 Two Dimensions
5.3.2 Higher Dimensions and the Curse of Dimensionality
5.4 Empirical Applications
5.4.1 Hedonic Pricing of Housing Attributes
5.4.2 Household Gasoline Demand in Canada
5.5 Exercises
6 Constrained Estimation and Hypothesis Testing
6.1 The Framework
6.2 Goodness-of-Fit Tests
6.2.1 Parametric Goodness-of-Fit Tests
6.2.2 Rapid Convergence under the Null
6.3 Residual Regression Tests
6.3.1 Overview
6.3.2 U-statistic Test -- Scalar x's, Moving Average Smoother
6.3.3 U-statistic Test -- Vector x's, Kernel Smoother
6.4 Specification Tests
6.4.1 Bierens (1990)
6.4.2 Härdle and Mammen (1993)
6.4.3 Hong and White (1995)
6.4.4 Li (1994) and Zheng (1996)
6.5 Significance Tests
6.6 Monotonicity, Concavity, and Other Restrictions
6.6.1 Isotonic Regression
6.6.2 Why Monotonicity Does Not Enhance the Rate of Convergence
6.6.3 Kernel-Based Algorithms for Estimating Monotone Regression Functions
6.6.4 Nonparametric Least Squares Subject to Monotonicity Constraints
6.6.5 Residual Regression and Goodness-of-Fit Tests of Restrictions
6.6.6 Empirical Application: Estimation of Option Prices
6.7 Conclusions
6.8 Exercises
7 Index Models and Other Semiparametric Specifications
7.1 Index Models
7.1.1 Introduction
7.1.2 Estimation
7.1.3 Properties
7.1.4 Identification
7.1.5 Empirical Application: Engel's Method for Estimation of Equivalence Scales
7.1.6 Empirical Application: Engel's Method for Multiple Family Types
7.2 Partial Linear Index Models
7.2.1 Introduction
7.2.2 Estimation
7.2.3 Covariance Matrix
7.2.4 Base-Independent Equivalence Scales
7.2.5 Testing Base-Independence and Other Hypotheses
7.3 Exercises
8 Bootstrap Procedures
8.1 Background
8.1.1 Introduction
8.1.2 Location Scale Models
8.1.3 Regression Models
8.1.4 Validity of the Bootstrap
8.1.5 Benefits of the Bootstrap
8.1.6 Limitations of the Bootstrap
8.1.7 Summary of Bootstrap Choices
8.1.8 Further Reading
8.2 Bootstrap Confidence Intervals for Kernel Smoothers
8.3 Bootstrap Goodness-of-Fit and Residual Regression Tests
8.3.1 Goodness-of-Fit Tests
8.3.2 Residual Regression Tests
8.4 Bootstrap Inference in Partial Linear and Index Models
8.4.1 Partial Linear Models
8.4.2 Index Models
8.5 Exercises
Appendixes
Appendix A Mathematical Preliminaries
Appendix B Proofs
Appendix C Optimal Differencing Weights
Appendix D Nonparametric Least Squares
Appendix E Variable Definitions
References
Index

List of Figures and Tables

Figure 1.1 Testing equality of regression functions.
Figure 1.2 Partial linear model -- log-linear cost function: Scale economies in electricity distribution.
Figure 2.1 Categorization of regression functions.
Figure 2.2 Naive local averaging.
Figure 2.3 Bias-variance trade-off.
Figure 2.4 Naive nonparametric least squares.
Figure 3.1 Engel curve estimation using moving average smoother.
Figure 3.2 Alternative kernel functions.
Figure 3.3 Engel curve estimation using kernel estimator.
Figure 3.4 Engel curve estimation using kernel, spline, and lowess estimators.
Figure 3.5 Selection of smoothing parameters.
Figure 3.6 Cross-validation of bandwidth for Engel curve estimation.
Figure 4.1 Testing linearity of Engel curves.
Figure 4.2 Testing equality of Engel curves.
Figure 4.3 Household demand for gasoline.
Figure 4.4 Household demand for gasoline: Monthly effects.
Figure 4.5 Scale economies in electricity distribution.
Figure 4.6 Scale economies in electricity distribution: PUC and non-PUC analysis.
Figure 4.7 Weather and electricity demand.
Figure 5.1 Hedonic prices of housing attributes.
Figure 5.2 Household gasoline demand in Canada.
Figure 6.1 Constrained and unconstrained estimation and testing.
Figure 6.2A Data and estimated call function.
Figure 6.2B Estimated first derivative.
Figure 6.2C Estimated SPDs.
Figure 6.3 Constrained estimation: simulated expected mean-squared error.
Figure 7.1 Engel's method for estimating equivalence scales.
Figure 7.2 Parsimonious version of Engel's method.
Figure 8.1 Percentile bootstrap confidence intervals for Engel curves.
Figure 8.2 Equivalence scale estimation for singles versus couples: Asymptotic versus bootstrap methods.
Table 3.1 Asymptotic confidence intervals for kernel estimators -- implementation.
Table 4.1 Optimal differencing weights.
Table 4.2 Values of δ for alternate differencing coefficients.
Table 4.3 Mixed estimation of PUC/non-PUC effects: Scale economies in electricity distribution.
Table 4.4 Scale economies in electricity distribution: CES cost function.
Table 4.5 Symmetric optimal differencing weights.
Table 4.6 Relative efficiency of alternative differencing sequences.
Table 5.1 The backfitting algorithm.
Table 6.1 Bierens (1990) specification test -- implementation.
Table 6.2 Härdle and Mammen (1993) specification test -- implementation.
Table 6.3 Hong and White (1995) specification test -- implementation.
Table 6.4 Li (1994), Zheng (1996) residual regression test of specification -- implementation.
Table 6.5 Residual regression test of significance -- implementation.
Table 7.1 Distribution of family composition.
Table 7.2 Parsimonious model estimates.
Table 8.1 Wild bootstrap.
Table 8.2 Bootstrap confidence intervals at $f(x_o)$.
Table 8.3 Bootstrap goodness-of-fit tests.
Table 8.4 Bootstrap residual regression tests.
Table 8.5 Percentile-t bootstrap confidence intervals for β in the partial linear model.
Table 8.6 Asymptotic versus bootstrap confidence intervals: Scale economies in electricity distribution.
Table 8.7 Confidence intervals for δ in the index model: Percentile method.

Preface

This book has been largely motivated by pedagogical interests. Nonparametric and semiparametric regression models are widely studied by theoretical econometricians but are much underused by applied economists. In comparison with the linear regression model $y = z\beta + \varepsilon$, semiparametric techniques are theoretically sophisticated and often require substantial programming experience. Two natural extensions to the linear model that allow greater flexibility are the partial linear model $y = z\beta + f(x) + \varepsilon$, which adds a nonparametric function, and the index model $y = f(z\delta) + \varepsilon$, which applies a nonparametric function to the linear index $z\delta$. Together, these models and their variants comprise the most commonly used semiparametric specifications in the applied econometrics literature.
A particularly appealing feature for economists is that these models permit the inclusion of multiple explanatory variables without succumbing to the curse of dimensionality.

We begin by describing the idea of differencing, which provides a simple way to analyze the partial linear model because it allows one to remove the nonparametric effect $f(x)$ and to analyze the parametric portion of the model $z\beta$ as if the nonparametric portion were not there to begin with. Thus, one can draw not only on the reservoir of parametric human capital, but one can also make use of existing software. By the end of the first chapter, the reader will be able to estimate the partial linear model and apply it to a real data set (the empirical example analyzes scale economies in electricity distribution using a semiparametric Cobb-Douglas specification).

Chapter 2 describes the broad contours of nonparametric and semiparametric regression modeling, the categorization of models, the curse of dimensionality, and basic theoretical results.

Chapters 3 and 4 are devoted to smoothing and differencing, respectively. The techniques are reinforced by empirical examples on Engel curves, gasoline demand, the effect of weather on electricity demand, and semiparametric translog and CES cost function models. Methods that incorporate heteroskedasticity, autocorrelation, and endogeneity of right-hand-side variables are included.

Chapter 5 focuses on nonparametric functions of several variables. The example on hedonic pricing of housing attributes illustrates the benefits of nonparametric modeling of location effects.

Economic theory rarely prescribes a specific functional form. Typically, the implications of theory involve constraints such as monotonicity, concavity, homotheticity, separability, and so on. Chapter 6 begins by outlining two broad classes of tests of these and other properties: goodness-of-fit tests that compare restricted and unrestricted estimates of the residual variance, and residual regression tests that regress residuals from a restricted regression on all the explanatory variables to see whether there is anything left to be explained. Both of these tests have close relatives in the parametric world. The chapter then proceeds to constrained estimation, which is illustrated by an option pricing example.

Chapter 7 addresses the index model with an application to equivalence scale estimation using South African household survey data. Chapter 8 describes bootstrap techniques for various procedures described in earlier chapters.

A cornerstone of the pedagogical philosophy underlying this book is that the second best way to learn econometric techniques is to actually apply them. (The best way is to teach them.[1]) To this purpose, data and sample programs are available for the various examples and exercises at www.chass.utoronto.ca/yatchew/. With the exception of constrained estimation of option prices, all code is in S-Plus.[2] The reader should be able to translate the code into other programs such as Stata easily enough.

By working through the examples and exercises,[3] the reader should be able to:
- estimate nonparametric regression, partial linear, and index models;
- test various properties using large sample results and bootstrap techniques;
- estimate nonparametric models subject to constraints such as monotonicity and concavity.

Well-known references in the nonparametrics and semiparametrics literature include Härdle (1990), Stoker (1991), Bickel et al.
(1993), Horowitz (1998), and Pagan and Ullah (1999).[4] It is hoped that this book is worthy of being squeezed onto a nearby bookshelf by providing an applied approach with numerical examples and adaptable code. It is intended for the applied economist and econometrician working with cross-sectional or possibly panel data.[5] It is expected that the reader has had a good basic course in econometrics and is thoroughly familiar with estimation and testing of the linear model and associated ideas such as heteroskedasticity and endogeneity. Some knowledge of nonlinear regression modeling and inference is desirable but not essential. Given the presence of empirical examples, the book could be used as a text in an advanced undergraduate course and certainly at the graduate level.

I owe a great intellectual debt to too many to name them individually, and regrettably not all of them appear in the references. Several anonymous reviewers provided extensive and valuable comments for which I am grateful. Thanks are also due to Scott Parris at Cambridge University Press for his unflagging efforts in this endeavor. My sister Oenone kindly contributed countless hours of proofreading time. Finally, it is indeed a special privilege to thank Peter Phillips, whose intellectual guidance shaped several aspects of this book. It was Peter who from the start insisted on reproducible empirical exercises. Those who are acquainted with both of us surely know to whom the errors belong.

[1] Each year I tell my students the apocryphal story of a junior faculty member complaining to a senior colleague of his inability to get through to his students. After repeating the same lecture to his class on three different occasions, he exclaims in exasperation: "I am so disappointed. Today I thought I had finally gotten through to them. This time even I understood the material, and they still did not understand."
[2] Krause and Olson (1997) have provided a particularly pleasant introduction to S-Plus. See also Venables and Ripley (1994).
[3] Many of the examples and portions of the text draw upon previously published work, in particular, Yatchew (1997, 1998, 1999, 2000), Yatchew and Bos (1997), Yatchew and No (2001), and Yatchew, Sun, and Deri (2001). The permission for use of these materials is gratefully acknowledged.
[4] There are also several surveys: Delgado and Robinson (1992), Härdle and Linton (1994), Powell (1994), Linton (1995a), and Yatchew (1998). See also DiNardo and Tobias (2001).
[5] With the exception of correlation in the residuals, time-dependent data issues have not been covered here.

1 Introduction to Differencing

1.1 A Simple Idea

Consider the nonparametric regression model

$$y = f(x) + \varepsilon \qquad (1.1.1)$$

for which little is assumed about the function $f$ except that it is smooth. In its simplest incarnation, the residuals are independently and identically distributed with mean zero and constant variance $\sigma_\varepsilon^2$, and the $x$'s are generated by a process that ensures they will eventually be dense in the domain. Closeness of the $x$'s combined with smoothness of $f$ provides a basis for estimation of the regression function. By averaging or smoothing observations on $y$ for which the corresponding $x$'s are close to a given point, say $x_o$, one obtains a reasonable estimate of the regression effect $f(x_o)$.

This premise -- that $x$'s that are close will have corresponding values of the regression function that are close -- may also be used to remove the regression effect.
It is this removal or differencing that provides a simple exploratory tool. To illustrate the idea we present four applications:

1. Estimation of the residual variance $\sigma_\varepsilon^2$,
2. Estimation and inference in the partial linear model $y = z\beta + f(x) + \varepsilon$,
3. A specification test on the regression function $f$, and
4. A test of equality of nonparametric regression functions.[1]

[1] The first-order differencing estimator of the residual variance in a nonparametric setting appears in Rice (1984). Although unaware of his result at the time, I presented the identical estimator at a conference held at the IC2 Institute at the University of Texas at Austin in May 1984. Differencing subsequently appeared in a series of nonparametric and semiparametric settings, including Powell (1987), Yatchew (1988), Hall, Kay, and Titterington (1990), Yatchew (1997, 1998, 1999, 2000), Lewbel (2000), Fan and Huang (2001), and Horowitz and Spokoiny (2001).

1.2 Estimation of the Residual Variance

Suppose one has data $(y_1, x_1), \ldots, (y_n, x_n)$ on the pure nonparametric regression model (1.1.1), where $x$ is a bounded scalar lying, say, in the unit interval, $\varepsilon$ is i.i.d. with $E(\varepsilon \mid x) = 0$ and $\mathrm{Var}(\varepsilon \mid x) = \sigma_\varepsilon^2$, and all that is known about $f$ is that its first derivative is bounded. Most important, the data have been rearranged so that $x_1 \le \cdots \le x_n$. Consider the following estimator of $\sigma_\varepsilon^2$:

$$s_{\mathrm{diff}}^2 = \frac{1}{2n}\sum_{i=2}^{n}(y_i - y_{i-1})^2. \qquad (1.2.1)$$

The estimator is consistent because, as the $x$'s become close, differencing tends to remove the nonparametric effect: $y_i - y_{i-1} = f(x_i) - f(x_{i-1}) + \varepsilon_i - \varepsilon_{i-1} \cong \varepsilon_i - \varepsilon_{i-1}$, so that[2]

$$s_{\mathrm{diff}}^2 \cong \frac{1}{2n}\sum_{i=2}^{n}(\varepsilon_i - \varepsilon_{i-1})^2 \cong \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 - \frac{1}{n}\sum_{i=2}^{n}\varepsilon_i\varepsilon_{i-1}. \qquad (1.2.2)$$

An obvious advantage of $s_{\mathrm{diff}}^2$ is that no initial estimate of the regression function $f$ needs to be calculated. Indeed, no consistent estimate of $f$ is implicit in (1.2.1). Nevertheless, the terms in $s_{\mathrm{diff}}^2$ that involve $f$ converge to zero sufficiently quickly so that the asymptotic distribution of the estimator can be derived directly from the approximation in (1.2.2). In particular,

$$n^{1/2}\left(s_{\mathrm{diff}}^2 - \sigma_\varepsilon^2\right) \xrightarrow{D} N\!\left(0, E(\varepsilon^4)\right). \qquad (1.2.3)$$

Moreover, derivation of this result is facilitated by the assumption that the $\varepsilon_i$ are independent, so that reordering of the data does not affect the distribution of the right-hand side in (1.2.2).

[2] To see why this approximation works, suppose that the $x_i$ are equally spaced on the unit interval and that $|f'| \le L$. By the mean value theorem, for some $x_i^* \in [x_{i-1}, x_i]$ we have $f(x_i) - f(x_{i-1}) = f'(x_i^*)(x_i - x_{i-1}) \le L/n$. Thus, $y_i - y_{i-1} = \varepsilon_i - \varepsilon_{i-1} + O(1/n)$. For detailed development of the argument, see Exercise 1. If the $x_i$ have a density function bounded away from zero on the support, then $x_i - x_{i-1} = O_P(1/n)$ and $y_i - y_{i-1} = \varepsilon_i - \varepsilon_{i-1} + O_P(1/n)$. See Appendix B, Lemma B.2, for a related result.
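The estimator in (1.2.1) is a one-liner in most packages. The following minimal sketch in R (S-Plus style, the language of the book's sample programs) simulates data from a smooth regression function -- the function and parameter values here are illustrative assumptions, not taken from the text -- and computes $s_{\mathrm{diff}}^2$:

```r
## Differencing estimator of the residual variance, equation (1.2.1) -- a sketch.
set.seed(1)
n <- 1000
x <- sort(runif(n))                    # x's rearranged into increasing order
f <- function(x) x * cos(4 * pi * x)   # hypothetical smooth regression function
y <- f(x) + rnorm(n, sd = 0.3)         # true residual variance is 0.09

s2.diff <- sum(diff(y)^2) / (2 * n)    # no estimate of f is required
s2.diff                                # close to 0.09 in large samples
```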

1.3 The Partial Linear Model

Consider now the partial linear model $y = z\beta + f(x) + \varepsilon$, where for simplicity all variables are assumed to be scalars. We assume that $E(\varepsilon \mid z, x) = 0$ and that $\mathrm{Var}(\varepsilon \mid z, x) = \sigma_\varepsilon^2$.[3] As before, the $x$'s have bounded support, say the unit interval, and have been rearranged so that $x_1 \le \cdots \le x_n$. Suppose that the conditional mean of $z$ is a smooth function of $x$, say $E(z \mid x) = g(x)$, where $g'$ is bounded and $\mathrm{Var}(z \mid x) = \sigma_u^2$. Then we may rewrite $z = g(x) + u$. Differencing yields

$$y_i - y_{i-1} = (z_i - z_{i-1})\beta + (f(x_i) - f(x_{i-1})) + \varepsilon_i - \varepsilon_{i-1}$$
$$= (g(x_i) - g(x_{i-1}) + u_i - u_{i-1})\beta + (f(x_i) - f(x_{i-1})) + \varepsilon_i - \varepsilon_{i-1}$$
$$\cong (u_i - u_{i-1})\beta + \varepsilon_i - \varepsilon_{i-1}. \qquad (1.3.1)$$

Thus, the direct effect $f(x)$ of the nonparametric variable $x$ and the indirect effect $g(x)$ that occurs through $z$ are removed. Suppose we apply the OLS estimator of $\beta$ to the differenced data, that is,

$$\hat\beta_{\mathrm{diff}} = \frac{\sum (y_i - y_{i-1})(z_i - z_{i-1})}{\sum (z_i - z_{i-1})^2}. \qquad (1.3.2)$$

Then, substituting the approximations $z_i - z_{i-1} \cong u_i - u_{i-1}$ and $y_i - y_{i-1} \cong (u_i - u_{i-1})\beta + \varepsilon_i - \varepsilon_{i-1}$ into (1.3.2) and rearranging, we have

$$n^{1/2}\left(\hat\beta_{\mathrm{diff}} - \beta\right) \cong n^{1/2}\,\frac{\frac{1}{n}\sum(\varepsilon_i - \varepsilon_{i-1})(u_i - u_{i-1})}{\frac{1}{n}\sum(u_i - u_{i-1})^2}. \qquad (1.3.3)$$

The denominator converges to $2\sigma_u^2$, and the numerator has mean zero and variance $6\sigma_\varepsilon^2\sigma_u^2$. Thus, the ratio has mean zero and variance $6\sigma_\varepsilon^2\sigma_u^2/(2\sigma_u^2)^2 = 1.5\,\sigma_\varepsilon^2/\sigma_u^2$. Furthermore, the ratio may be shown to be approximately normal (using a finitely dependent central limit theorem). Thus, we have

$$n^{1/2}\left(\hat\beta_{\mathrm{diff}} - \beta\right) \xrightarrow{D} N\!\left(0, \frac{1.5\,\sigma_\varepsilon^2}{\sigma_u^2}\right). \qquad (1.3.4)$$

For the most efficient estimator, the corresponding variance in (1.3.4) would be $\sigma_\varepsilon^2/\sigma_u^2$, so the proposed estimator based on first differences has relative efficiency $2/3 = 1/1.5$. In Chapters 3 and 4 we will produce efficient estimators. Now, in order to use (1.3.4) to perform inference, we will need consistent estimators of $\sigma_\varepsilon^2$ and $\sigma_u^2$. These may be obtained using

$$s_\varepsilon^2 = \frac{1}{2n}\sum_{i=2}^{n}\left((y_i - y_{i-1}) - (z_i - z_{i-1})\hat\beta_{\mathrm{diff}}\right)^2 \cong \frac{1}{2n}\sum_{i=2}^{n}(\varepsilon_i - \varepsilon_{i-1})^2 \xrightarrow{P} \sigma_\varepsilon^2 \qquad (1.3.5)$$

and

$$s_u^2 = \frac{1}{2n}\sum_{i=2}^{n}(z_i - z_{i-1})^2 \cong \frac{1}{2n}\sum_{i=2}^{n}(u_i - u_{i-1})^2 \xrightarrow{P} \sigma_u^2. \qquad (1.3.6)$$

The preceding procedure generalizes straightforwardly to models with multiple parametric explanatory variables.

[3] For extensions to the heteroskedastic and autocorrelated cases, see Sections 3.6 and 4.5.
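With the data already sorted by $x$, the whole procedure reduces to a few lines. A minimal R/S-Plus-style sketch (the variable names and the use of a scalar $z$ are illustrative assumptions):

```r
## Differencing estimator of beta in the partial linear model (Section 1.3).
dy <- diff(y)                                      # y_i - y_{i-1}, data sorted by x
dz <- diff(z)
beta.diff <- sum(dy * dz) / sum(dz^2)              # equation (1.3.2)

s2.eps <- sum((dy - dz * beta.diff)^2) / (2 * n)   # equation (1.3.5)
s2.u   <- sum(dz^2) / (2 * n)                      # equation (1.3.6)
se.beta <- sqrt(1.5 * s2.eps / (s2.u * n))         # standard error implied by (1.3.4)
```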

1.4 Specification Test

Suppose, for example, one wants to test the null hypothesis that $f$ is a linear function. Let $s_{\mathrm{res}}^2$ be the usual estimate of the residual variance obtained from a linear regression of $y$ on $x$. If the linear model is correct, then $s_{\mathrm{res}}^2$ will be approximately equal to the average of the true squared residuals:

$$s_{\mathrm{res}}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat\gamma_1 - \hat\gamma_2 x_i\right)^2 \cong \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2. \qquad (1.4.1)$$

If the linear specification is incorrect, then $s_{\mathrm{res}}^2$ will overestimate the residual variance, while $s_{\mathrm{diff}}^2$ in (1.2.1) will remain a consistent estimator, thus forming the basis of a test. Consider the test statistic

$$V = \frac{n^{1/2}\left(s_{\mathrm{res}}^2 - s_{\mathrm{diff}}^2\right)}{s_{\mathrm{diff}}^2}. \qquad (1.4.2)$$

Equations (1.2.2) and (1.4.1) imply that the numerator of $V$ is approximately equal to

$$n^{1/2}\,\frac{1}{n}\sum \varepsilon_i\varepsilon_{i-1} \xrightarrow{D} N\!\left(0, \sigma_\varepsilon^4\right). \qquad (1.4.3)$$

Since $s_{\mathrm{diff}}^2$, the denominator of $V$, is a consistent estimator of $\sigma_\varepsilon^2$, $V$ is asymptotically N(0,1) under $H_0$. (Note that this is a one-sided test, and one rejects for large values of the statistic.)

As we will see later, this test procedure may be used to test a variety of null hypotheses, such as general parametric and semiparametric specifications, monotonicity, concavity, additive separability, and other constraints. One simply inserts the restricted estimator of the variance in (1.4.2). We refer to test statistics that compare restricted and unrestricted estimates of the residual variance as goodness-of-fit tests.
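Using the pieces already computed above, the test is immediate. A sketch, again in R/S-Plus style (the linear null is fit by ordinary least squares; names are illustrative):

```r
## Differencing specification test of a linear null (Section 1.4) -- a sketch.
s2.res  <- mean(residuals(lm(y ~ x))^2)        # restricted estimate, equation (1.4.1)
s2.diff <- sum(diff(y)^2) / (2 * n)            # unrestricted differencing estimate (1.2.1)
V <- sqrt(n) * (s2.res - s2.diff) / s2.diff    # equation (1.4.2)
1 - pnorm(V)                                   # one-sided p-value; reject for large V
```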

1.5 Test of Equality of Regression Functions

Suppose we are given data $(y_{A1}, x_{A1}), \ldots, (y_{An}, x_{An})$ and $(y_{B1}, x_{B1}), \ldots, (y_{Bn}, x_{Bn})$ from two possibly different regression models A and B. Assume $x$ is a scalar and that each data set has been reordered so that the $x$'s are in increasing order. The basic models are

$$y_{Ai} = f_A(x_{Ai}) + \varepsilon_{Ai} \qquad y_{Bi} = f_B(x_{Bi}) + \varepsilon_{Bi} \qquad (1.5.1)$$

where, given the $x$'s, the $\varepsilon$'s have mean 0, variance $\sigma_\varepsilon^2$, and are independent within and between populations; $f_A$ and $f_B$ have first derivatives bounded. Using (1.2.1), define consistent "within" differencing estimators of the variance

$$s_A^2 = \frac{1}{2n}\sum_i (y_{Ai} - y_{Ai-1})^2 \qquad s_B^2 = \frac{1}{2n}\sum_i (y_{Bi} - y_{Bi-1})^2. \qquad (1.5.2)$$

As we will do frequently, we have dropped the subscript "diff". Now pool all the data and reorder so that the pooled $x$'s are in increasing order: $(y_1, x_1), \ldots, (y_{2n}, x_{2n})$. (Note that the pooled data have only one subscript.) Applying the differencing estimator once again, we have

$$s_P^2 = \frac{1}{4n}\sum_j^{2n}\left(y_j - y_{j-1}\right)^2. \qquad (1.5.3)$$

The basic idea behind the test procedure is to compare the pooled estimator with the average of the within estimators. If $f_A = f_B$, then the within and pooled estimators are consistent and should yield similar estimates. If $f_A \ne f_B$, then the within estimators remain consistent, whereas the pooled estimator overestimates the residual variance, as may be seen in Figure 1.1. To formalize this idea, define the test statistic

$$\Upsilon = (2n)^{1/2}\left(s_P^2 - \tfrac{1}{2}\left(s_A^2 + s_B^2\right)\right). \qquad (1.5.4)$$

If $f_A = f_B$, then differencing removes the regression effect sufficiently quickly in both the within and the pooled estimators, so that

$$(2n)^{1/2}\left(s_P^2 - \tfrac{1}{2}\left(s_A^2 + s_B^2\right)\right) \cong \frac{(2n)^{1/2}}{4n}\left[\sum_j^{2n}(\varepsilon_j - \varepsilon_{j-1})^2 - \sum_i^{n}(\varepsilon_{Ai} - \varepsilon_{Ai-1})^2 - \sum_i^{n}(\varepsilon_{Bi} - \varepsilon_{Bi-1})^2\right]$$
$$\cong \frac{(2n)^{1/2}}{2n}\left[\sum_j^{2n}\varepsilon_j^2 - \sum_j \varepsilon_j\varepsilon_{j-1} - \sum_i^{n}\varepsilon_{Ai}^2 + \sum_i \varepsilon_{Ai}\varepsilon_{Ai-1} - \sum_i^{n}\varepsilon_{Bi}^2 + \sum_i \varepsilon_{Bi}\varepsilon_{Bi-1}\right]$$
$$= \frac{1}{(2n)^{1/2}}\left[\sum_i \varepsilon_{Ai}\varepsilon_{Ai-1} + \sum_i \varepsilon_{Bi}\varepsilon_{Bi-1}\right] - \frac{1}{(2n)^{1/2}}\left[\sum_j \varepsilon_j\varepsilon_{j-1}\right]. \qquad (1.5.5)$$

Consider the two terms in the last line. In large samples, each is approximately $N(0, \sigma_\varepsilon^4)$. If observations that are consecutive in the individual data sets tend to be consecutive after pooling and reordering, then the covariance between the two terms will be large. In particular, the covariance is approximately $\sigma_\varepsilon^4(1 - \pi)$, where $\pi$ equals the probability that consecutive observations in the pooled reordered data set come from different populations.

[Figure 1.1. Testing equality of regression functions: within estimators of the residual variance for samples A and B (top panel); pooled estimator of the residual variance (bottom panel).]

It follows that under $H_0: f_A = f_B$,

$$\Upsilon \xrightarrow{D} N\!\left(0, 2\pi\sigma_\varepsilon^4\right). \qquad (1.5.6)$$

For example, if reordering the pooled data is equivalent to stacking data sets A and B -- because the two sets of $x$'s, $x_A$ and $x_B$, do not intersect -- then $\pi = 0$ and indeed the statistic $\Upsilon$ becomes degenerate. This is not surprising, since observing nonparametric functions over different domains cannot provide a basis for testing whether they are the same. If the pooled data involve a simple interleaving of data sets A and B, then $\pi = 1$ and $\Upsilon \sim N(0, 2\sigma_\varepsilon^4)$. If $x_A$ and $x_B$ are independent of each other but have the same distribution, then for the pooled reordered data the probability that consecutive observations come from different populations is 1/2 and $\Upsilon \sim N(0, \sigma_\varepsilon^4)$.[4] To implement the test, one may obtain a consistent estimate $\hat\pi$ by taking the proportion of observations in the pooled reordered data that are preceded by an observation from a different population.

[4] For example, distribute n men and n women randomly along a stretch of beach facing the sunset. Then, for any individual, the probability that the person to the left is of the opposite sex is 1/2. More generally, if $x_A$ and $x_B$ are independent of each other and have different distributions, then $\pi$ depends on the relative density of observations from each of the two populations.
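The within and pooled estimators, the statistic $\Upsilon$, and the estimate $\hat\pi$ require only elementary operations. A sketch under the assumptions of this section (equal subsample sizes, each sample sorted by $x$; names are illustrative):

```r
## Differencing test of equality of regression functions (Section 1.5) -- a sketch.
s2.A <- sum(diff(yA)^2) / (2 * n)        # within estimators, equation (1.5.2)
s2.B <- sum(diff(yB)^2) / (2 * n)

ord <- order(c(xA, xB))                  # pool and reorder by x
yP  <- c(yA, yB)[ord]
pop <- c(rep("A", n), rep("B", n))[ord]
s2.P <- sum(diff(yP)^2) / (2 * (2 * n))  # pooled estimator, equation (1.5.3)

pi.hat  <- mean(pop[-1] != pop[-(2 * n)])          # consecutive obs from different populations
Upsilon <- sqrt(2 * n) * (s2.P - 0.5 * (s2.A + s2.B))
Upsilon / sqrt(2 * pi.hat * s2.P^2)                # approximately N(0,1) under H0, by (1.5.6)
```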
1.6 Empirical Application: Scale Economies in Electricity Distribution[5]

To illustrate these ideas, consider a simple variant of the Cobb-Douglas model for the costs of distributing electricity

$$tc = f(cust) + \beta_1 wage + \beta_2 pcap + \beta_3 PUC + \beta_4 kwh + \beta_5 life + \beta_6 lf + \beta_7 kmwire + \varepsilon \qquad (1.6.1)$$

where tc is the log of total cost per customer, cust is the log of the number of customers, wage is the log wage rate, pcap is the log price of capital, PUC is a dummy variable for public utility commissions that deliver additional services and therefore may benefit from economies of scope, kwh is the log of kilowatt hour sales per customer, life is the log of the remaining life of distribution assets, lf is the log of the load factor (this measures capacity utilization relative to peak usage), and kmwire is the log of kilometers of distribution wire per customer. The data consist of 81 municipal distributors in Ontario, Canada, during 1993. (For more details, see Yatchew, 2000.)

Because the data have been reordered so that the nonparametric variable cust is in increasing order, first differencing (1.6.1) tends to remove the nonparametric effect $f$. We also divide by $\sqrt{2}$ so that the residuals in the differenced equation (1.6.2) have the same variance as those in (1.6.1). Thus, we have

$$[tc_i - tc_{i-1}]/\sqrt{2} = \beta_1[wage_i - wage_{i-1}]/\sqrt{2} + \beta_2[pcap_i - pcap_{i-1}]/\sqrt{2} + \beta_3[PUC_i - PUC_{i-1}]/\sqrt{2} + \beta_4[kwh_i - kwh_{i-1}]/\sqrt{2} + \beta_5[life_i - life_{i-1}]/\sqrt{2} + \beta_6[lf_i - lf_{i-1}]/\sqrt{2} + \beta_7[kmwire_i - kmwire_{i-1}]/\sqrt{2} + [\varepsilon_i - \varepsilon_{i-1}]/\sqrt{2}. \qquad (1.6.2)$$

Figure 1.2 summarizes our estimates of the parametric effects using the differenced equation. It also contains estimates of a pure parametric specification in which the scale effect $f$ is modeled with a quadratic. Applying the specification test (1.4.2), where $s_{\mathrm{diff}}^2$ is replaced with (1.3.5), yields a value of 1.50, indicating that the quadratic model may be adequate.

Thus far our results suggest that by differencing we can perform inference on $\beta$ as if there were no nonparametric component $f$ in the model to begin with. But, having estimated $\beta$, we can then proceed to apply a variety of nonparametric techniques to analyze $f$ as if $\beta$ were known. Such a modular approach simplifies implementation because it permits the use of existing software designed for pure nonparametric models.

More precisely, suppose we assemble the ordered pairs $(y_i - z_i\hat\beta_{\mathrm{diff}}, x_i)$; then, we have

$$y_i - z_i\hat\beta_{\mathrm{diff}} = z_i(\beta - \hat\beta_{\mathrm{diff}}) + f(x_i) + \varepsilon_i \cong f(x_i) + \varepsilon_i. \qquad (1.6.3)$$

If we apply conventional smoothing methods to these ordered pairs -- such as kernel estimation (see Section 3.2) -- then consistency, optimal rate of convergence results, and the construction of confidence intervals for $f$ remain valid, because $\hat\beta_{\mathrm{diff}}$ converges sufficiently quickly to $\beta$ that the approximation in the last part of (1.6.3) leaves asymptotic arguments unaffected. (This is indeed why we could apply the specification test after removing the estimated parametric effect.) Thus, in Figure 1.2 we have also plotted a nonparametric (kernel) estimate of $f$ that can be compared with the quadratic estimate. In subsequent sections, we will elaborate this example further and provide additional ones.

Figure 1.2. Partial linear model -- log-linear cost function: Scale economies in electricity distribution.

                Quadratic model       Partial linear model*
  Variable      Coef       SE         Coef       SE
  cust          0.833      0.175      --         --
  cust^2        0.040      0.009      --         --
  wage          0.833      0.325      0.448      0.367
  pcap          0.562      0.075      0.459      0.076
  PUC           0.071      0.039      0.086      0.043
  kwh           0.017      0.089      0.011      0.087
  life          0.603      0.119      0.506      0.131
  lf            1.244      0.434      1.252      0.457
  kmwire        0.445      0.086      0.352      0.094
  s^2           .021                  .018
  R^2           .618                  .675

[The accompanying panel plots the estimated scale effect -- log total cost per year against log customers -- with the kernel and quadratic estimates overlaid.]

* Test of quadratic versus nonparametric specification of scale effect: $V = n^{1/2}(s_{\mathrm{res}}^2 - s_{\mathrm{diff}}^2)/s_{\mathrm{diff}}^2 = 81^{1/2}(.021 - .018)/.018 = 1.5$, where $V$ is N(0,1), Section 1.4.

[5] Variable definitions for empirical examples are contained in Appendix E.
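The entire workflow of this section -- reorder by the nonparametric variable, difference, run OLS, remove the estimated parametric effect, and smooth -- fits in a few lines. A sketch with hypothetical variable names matching (1.6.1)-(1.6.3) (the bandwidth is an arbitrary assumption):

```r
## Modular analysis of the partial linear cost model (Section 1.6) -- a sketch.
ord <- order(cust)                          # sort by the nonparametric variable
tc <- tc[ord]; cust <- cust[ord]
Z <- cbind(wage, pcap, PUC, kwh, life, lf, kmwire)[ord, ]

dtc <- diff(tc) / sqrt(2)                   # differenced and rescaled, as in (1.6.2)
dZ  <- apply(Z, 2, diff) / sqrt(2)
fit <- lm(dtc ~ dZ - 1)                     # no intercept: differencing removes it
beta.diff <- coef(fit)                      # inflate reported SEs by sqrt(1.5), per (1.3.4)

y.np <- tc - drop(Z %*% beta.diff)          # ordered pairs of (1.6.3)
fhat <- ksmooth(cust, y.np, kernel = "normal", bandwidth = 1)  # kernel estimate of f
```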
1.7 Why Differencing?

An important advantage of differencing procedures is their simplicity. Consider once again the partial linear model $y = z\beta + f(x) + \varepsilon$. Conventional estimators, such as the one proposed by Robinson (1988) (see Section 3.6), require one to estimate $E(y \mid x)$ and $E(z \mid x)$ using nonparametric regressions. The estimated residuals from each of these regressions (hence the term "double residual" method) are then used to estimate the parametric regression

$$y - E(y \mid x) = (z - E(z \mid x))\beta + \varepsilon. \qquad (1.7.1)$$

If $z$ is a vector, then a separate nonparametric regression is run for each component of $z$, where the independent variable is the nonparametric variable $x$. In contrast, differencing eliminates these first-stage regressions, so that estimation of $\beta$ can be performed regardless of its dimension -- even if nonparametric regression procedures are not available within the software being used. Similarly, tests of parametric specifications against nonparametric alternatives, and tests of equality of regression functions across two or more (sub-)samples, can be carried out without performing a nonparametric regression.

As should be evident from the empirical example of the last section, differencing may easily be combined with other procedures. In that example, we used differencing to estimate the parametric component of a partial linear model. We then removed the estimated parametric effect and applied conventional nonparametric procedures to analyze the nonparametric component. Such modular analysis does require theoretical justification, which we will provide in Section 4.12.

As we have seen, the partial linear model permits a simple semiparametric generalization of the Cobb-Douglas model. Translog and other linear-in-parameters models may be generalized similarly. If we allow the parametric portion of the model to be nonlinear -- so that we have a partial parametric model -- then we may also obtain simple semiparametric generalizations of models such as the constant elasticity of substitution (CES) cost function. These, too, may be estimated straightforwardly using differencing (see Section 4.7). The key requirement is that the parametric and nonparametric portions of the model be additively separable.

Other procedures commonly used by the econometrician may be imported into the differencing setting with relative ease. If some of the parametric variables are potentially correlated with the residuals, instrumental variable techniques can be applied, with suitable modification, as can the Hausman endogeneity test (see Section 4.8). If the residuals are potentially not homoskedastic, then well-known techniques such as White's heteroskedasticity-consistent standard errors can be adapted (see Section 4.5). The reader will no doubt find other procedures that can be readily transplanted.

Earlier we pointed out that the first-order differencing estimator of $\beta$ in the partial linear model is inefficient when compared with the most efficient estimator (see Section 1.3). The same is true for the first-order differencing estimator of the residual variance (see Section 1.2). This problem can be corrected using higher-order differencing, as demonstrated in Chapter 4.

Most important, however, the simplicity of differencing provides a useful pedagogical device. Applied econometricians can begin using nonparametric techniques quickly and with conventional econometric software. Indeed, all the procedures in the example of Section 1.6 can be executed within packages such as E-Views, SAS, Shazam, Stata, or TSP. Furthermore, because the partial linear model can easily accommodate multiple parametric variables, one can immediately apply these techniques to data that are of practical interest.

Simplicity and versatility, however, have a price. One of the criticisms of differencing is that it can result in greater bias in moderately sized samples than other estimators.[6] A second criticism is that differencing, as proposed here, works only if the dimension of the model's nonparametric component does not exceed 3 (see Section 5.3). Indeed, in most of what follows we will apply differencing to models in which the nonparametric variable is a scalar. More general techniques based on smoothing will usually be prescribed when the nonparametric variable is a vector. However, we would argue that, even if differencing techniques were limited to one (nonparametric) dimension, they have the potential of significant market share.
The reason is that high-dimensional nonparametric regression models, unless they rely on additional structure (such as additive separability), suffer from the "curse of dimensionality," which severely limits one's ability to estimate the regression relationship with any degree of precision. It is not surprising, therefore, that the majority of applied papers using nonparametric regression limit the nonparametric component to one or two dimensions.

[6] Seifert, Gasser, and Wolf (1993) have studied this issue for differencing estimators of the residual variance.

1.8 Empirical Applications

The target audience for this book consists of applied econometricians and economists. Thus, the following empirical applications will be introduced and carried through various chapters:
- Engel curve estimation (South African data)
- Scale economies in electricity distribution (data from Ontario, Canada)
- Household gasoline consumption (Canadian data)
- Housing prices (data from Ottawa, Canada)
- Option prices and state price densities (simulated data)
- Weather and electricity demand (data from Ontario, Canada)
- Equivalence scale estimation (South African data).

Empirical results presented in tables and in figures are worked through in exercises at the end of each chapter, along with additional empirical and theoretical problems. The reader is especially urged to do the applied exercises, for this is by far the best way to gain a proper understanding of the techniques, their range, and their limitations.

For convenience, variable definitions are collected in Appendix E. Other data sets may be obtained easily. For example, household survey data for various developing countries are available at the World Bank Web site www.worldbank.org/lsms. These are the Living Standard Measurement Study household surveys from which our South African data were extracted.

1.9 Notational Conventions

With mild abuse of notation, symbols such as $y$ and $x$ will be used to denote both the variable in question and the corresponding column vector of observations on the variable. The context should make it clear which applies. If $x$ is a vector, then $f(x)$ will denote the vector consisting of $f$ evaluated at the components of $x$. If $X$ is a matrix and $\beta$ is a conformable parameter vector, then $f(X\beta)$ is also a vector.

We will frequently use subscripts to denote components of vectors or matrices, for example, $\varepsilon_i$, $A_{ij}$, or $[AB]_{ij}$. For any two matrices $A$, $B$ of identical dimension, we will on a few occasions use the elementwise-product notation $[A \odot B]_{ij} = A_{ij}B_{ij}$.

When differencing procedures are applied, the first few observations may be treated differently or lost. For example, to calculate the differencing estimator of the residual variance $s_{\mathrm{diff}}^2 = \sum_i^n (y_i - y_{i-1})^2/2n$, we begin the summation at $i = 2$. For the mathematical arguments that follow, such effects are negligible. Thus, we will use the symbol $\doteq$ to denote "equal except for end effects." As must be evident by now, we will also use the symbol $\cong$ to denote approximate equality, $\xrightarrow{P}$ for convergence in probability, and $\xrightarrow{D}$ for convergence in distribution. The abbreviation "i.i.d." will denote "independently and identically distributed."

Because differencing will be one of the themes in what follows, several estimators will merit the subscript "diff", as in the preceding paragraph or in (1.3.2). For simplicity, we will regularly suppress this annotation.

To denote low-order derivatives we will use the conventional notation $f', f'', f'''$. Occasionally we will need higher-order derivatives, which we will denote by bracketed superscripts; for example, $f^{(m)}$.

1.10 Exercises[7]

[7] Data and sample programs for empirical exercises are available on the Web. See the Preface for details.

1. Suppose $y = f(x) + \varepsilon$, $|f'| \le L$, for which we have data $(y_i, x_i)$, $i = 1, \ldots, n$, where the $x_i$ are equally spaced on the unit interval. We will derive the distribution of $s_{\mathrm{diff}}^2 = \sum_i^n (y_i - y_{i-1})^2/2n$, which we may rewrite as

$$s_{\mathrm{diff}}^2 = \frac{1}{2n}\sum_{i=2}^{n}(\varepsilon_i - \varepsilon_{i-1})^2 + \frac{1}{2n}\sum_{i=2}^{n}(f(x_i) - f(x_{i-1}))^2 + \frac{1}{n}\sum_{i=2}^{n}(f(x_i) - f(x_{i-1}))(\varepsilon_i - \varepsilon_{i-1}).$$

(a) Show that the first term on the right-hand side satisfies
$$n^{1/2}\left(\frac{1}{2n}\sum_{i=2}^{n}(\varepsilon_i - \varepsilon_{i-1})^2 - \sigma_\varepsilon^2\right) \xrightarrow{D} N\!\left(0, E(\varepsilon^4)\right).$$
(b) Show that the second term is $O(1/n^2)$.
(c) Show that the variance of the third term is $O(1/n^3)$, so that the third term is $O_P(1/n^{3/2})$. Thus,
$$n^{1/2}\left(s_{\mathrm{diff}}^2 - \sigma_\varepsilon^2\right) = n^{1/2}\left(\frac{1}{2n}\sum_{i=2}^{n}(\varepsilon_i - \varepsilon_{i-1})^2 - \sigma_\varepsilon^2\right) + O\!\left(\frac{1}{n^{3/2}}\right) + O_P\!\left(\frac{1}{n}\right).$$

2. Consider the restricted estimator of the residual variance (1.4.1) used in the differencing specification test. Show that
$$s_{\mathrm{res}}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat\gamma_1 - \hat\gamma_2 x_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\varepsilon_i + (\gamma_1 - \hat\gamma_1) + (\gamma_2 - \hat\gamma_2)x_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 + O_P\!\left(\frac{1}{n}\right).$$
Combine this with (1.2.2) and the results of the previous exercise to derive the distribution of $V$ in Section 1.4.

3. Derive the covariance between the two terms in the last line of (1.5.5). Use this to obtain the approximate distribution of the differencing test of equality of regression functions (1.5.6). How would the test statistic change if the two subpopulations were of unequal size?

4. Scale Economies in Electricity Distribution
(a) Verify that the data have been reordered so that the nonparametric variable cust, which is the log of the number of customers, is in increasing order.
(b) Fit the quadratic model in Figure 1.2, where all variables are parametric. Estimate the residual variance, the variance of the dependent variable tc, and calculate $R^2 = 1 - s^2/s_{tc}^2$.
(c) Transform the data by first differencing as in (1.6.2) and apply ordinary least squares to obtain estimates of the parametric effects in the partial linear model. To obtain the standard errors, rescale the standard errors provided by the OLS procedure by $\sqrt{1.5}$, as indicated in (1.3.4).
(d) Remove the estimated parametric effects using (1.6.3) and produce a scatterplot of the ordered pairs $(y_i - z_i\hat\beta_{\mathrm{diff}}, x_i)$, where the $x$ variable is the log of the number of customers.
(e) Apply a smoothing or nonparametric regression procedure (such as ksmooth in S-Plus) to the ordered pairs in (d) to produce a nonparametric estimate of the scale effect.
(f) Apply the specification test in (1.4.2) to the ordered pairs in (d) to test the quadratic specification against the nonparametric alternative.

2 Background and Overview

2.1 Categorization of Models

We now turn to a description of the range of models addressed in this book. Consider first the pure nonparametric model $y = f(x) + \varepsilon$, where $\varepsilon$ is i.i.d. with mean 0 and constant variance $\sigma_\varepsilon^2$. If $f$ is only known to lie in a family of smooth functions $\mathfrak{F}$, then the model is nonparametric and incorporates weak constraints on its structure. We will soon see that such models are actually difficult to estimate with precision if $x$ is a vector of dimension exceeding two or three. If $f$ satisfies some additional properties (such as monotonicity, concavity, homogeneity, or symmetry) and hence lies in $\bar{\mathfrak{F}} \subset \mathfrak{F}$, we will say that the model is constrained nonparametric. Figure 2.1 depicts a parametric and a pure nonparametric model at opposite corners.

Given the difficulty in estimating pure nonparametric models with multiple explanatory variables, researchers have sought parsimonious hybrids. One such example is the partial linear model introduced in Chapter 1. One can see in Figure 2.1 that for any fixed value of $x$, the function is linear in $z$. Partial parametric models are an obvious generalization, where $y = g(z; \beta) + f(x) + \varepsilon$ and $g$ is a known function. For the partial parametric surface in Figure 2.1, $g$ is quadratic in $z$ -- a shape that is replicated for any fixed value of $x$.

Index models constitute another hybrid. In this case $y = f(x\delta) + \varepsilon$. For any fixed value of the index $x\delta$, the function $f(x\delta)$ is constant. The index model depicted in Figure 2.1 is given by $f(x\delta) = \cos(x_1 + x_2)$; thus, the function is flat along lines where $x_1 + x_2$ is constant. Partial linear index models are yet another generalization, where $y = f(x\delta) + z\beta + \varepsilon$.

Finally, if we can partition $x$ into two subsets $x_a$ and $x_b$ such that $f$ is of the form $f_a(x_a) + f_b(x_b)$, where $f_a$ and $f_b$ are both nonparametric, then the model is called additively separable.
(Of course, partial linear and partial parametric models are also additively separable, but in these cases one component is parametric and the other nonparametric.)

[Figure 2.1. Categorization of regression functions. $\mathfrak{F}$ is a family of smooth functions. $\bar{\mathfrak{F}}$ is a smooth family with additional constraints such as monotonicity, concavity, symmetry, or other constraints.]

2.2 The Curse of Dimensionality and the Need for Large Data Sets

2.2.1 Dimension Matters

In comparison with parametric estimation, nonparametric procedures can impose enormous data requirements. To gain an appreciation of the problem, as well as remedies for it, we begin with a deterministic framework. Suppose the objective is to approximate a function $f$. If it is known to be linear in one variable, two observations are sufficient to determine the entire function precisely; three are sufficient if $f$ is linear in two variables. If $f$ is of the form $g(x; \beta)$, where $g$ is known and $\beta$ is an unknown $k$-dimensional vector, then $k$ judiciously selected points are usually sufficient to solve for $\beta$. No further observations on the function are necessary.

Let us turn to the pure nonparametric case. Suppose $f$, defined on the unit interval, is known only to have a first derivative bounded by $L$ (i.e., $\sup_{x \in [0,1]} |f'| \le L$). If we sample $f$ at $n$ equidistant points and approximate $f$ at any point by the closest point at which we have an evaluation, then our approximation error cannot exceed $\tfrac{1}{2}L/n$. Increasing the density of points reduces approximation error at a rate $O(1/n)$.

Now suppose $f$ is a function on the unit square and that it has derivatives bounded in all directions by $L$. To approximate the function, we need to sample throughout its domain. If we distribute $n$ points uniformly on the unit square, each will "occupy" an area $1/n$, and the typical distance between points will be $1/n^{1/2}$, so that the approximation error is now $O(1/n^{1/2})$. If we repeat this argument for functions of $k$ variables, the typical distance between points becomes $1/n^{1/k}$ and the approximation error is $O(1/n^{1/k})$. In general, this method of approximation yields errors proportional to the distance to the nearest observation. Thus, for $n = 100$, the potential approximation error is 10 times larger in 2 dimensions than in 1, and 40 times larger in 5 dimensions. One begins to see the virtues of parametric modeling to avoid this curse of dimensionality.[1]

[1] For an exposition of the curse of dimensionality in the case of density estimation, see Silverman (1986) and Scott (1992).
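The 10-fold and 40-fold figures follow directly from the $n^{-1/d}$ spacing formula; a two-line check in R:

```r
## Typical spacing of n points spread uniformly over d dimensions (Section 2.2.1).
n <- 100
sapply(c(1, 2, 5), function(d) n^(-1/d))
# 0.010  0.100  0.398  -- i.e., 10 and roughly 40 times the one-dimensional spacing
```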
2.2.2 Restrictions That Mitigate the Curse

We will consider four types of restrictions that substantially reduce approximation error: a partial linear structure, the index model specification, additive separability, and smoothness assumptions.

Suppose a regression function defined on the unit square has the partial linear form $z\beta + f(x)$ (the function $f$ is unknown except for a derivative bound). In this case, we need two evaluations along the $z$-axis to completely determine $\beta$ (see the partial linear surface in Figure 2.1). Furthermore, $n$ equidistant evaluations along the $x$-axis will ensure that $f$ can be approximated with error $O(1/n)$, so that the approximation error for the regression function as a whole is also $O(1/n)$ -- the same as if it were a nonparametric function of one variable.

Now consider the index model. If $\delta$ were known, then we would have a nonparametric function of one variable; thus, to obtain a good approximation of $f$, we need to take $n$ distinct and, say, equidistant values of the index $x\delta$. How do we obtain $\delta$? Suppose for simplicity that the model is $f(x_1 + \delta x_2)$. (The coefficient of the first variable has been normalized to 1.) Beginning at a point $(x_{1a}, x_{2a})$, travel in a direction along which $f$ is constant to a nearby point $(x_{1b}, x_{2b})$. Because $f(x_{1a} + \delta x_{2a}) = f(x_{1b} + \delta x_{2b})$, and hence $x_{1a} + \delta x_{2a} = x_{1b} + \delta x_{2b}$, we may solve for $\delta$. Thus, just as for the partial linear model, the approximation error for the regression function as a whole is $O(1/n)$, the same as if it were a nonparametric function of one variable.

Next, consider an additively separable function on the unit square: $f(x_a, x_b) = f_a(x_a) + f_b(x_b)$, where the functions $f_a$ and $f_b$ satisfy a derivative bound ($f_b(0) = 0$ is imposed as an identification condition). If we take $2n$ observations, $n$ along each axis, then $f_a$ and $f_b$ can be approximated with error $O(1/n)$, so the approximation error for $f$ is also $O(1/n)$, once again the same as if $f$ were a nonparametric function of one variable.

The following proposition should now be plausible: For partially linear, index, or additively separable models, the approximation error depends on the maximum dimension of the nonparametric components of the model.

Smoothness can also reduce approximation error. Suppose $f$ is twice differentiable on the unit interval with $f'$ and $f''$ bounded by $L$, and we evaluate $f$ at $n$ equidistant values of $x$. Consider approximation of $f$ at $x_o \in [x_i, x_{i+1}]$. Using a Taylor expansion, we have

$$f(x_o) = f(x_i) + f'(x_i)(x_o - x_i) + \tfrac{1}{2} f''(\bar x)(x_o - x_i)^2, \qquad \bar x \in [x_i, x_o]. \qquad (2.2.1)$$

If we approximate $f(x_o)$ using $f(x_i) + f'(x_i)(x_o - x_i)$, the error is $O(x_o - x_i)^2 = O(1/n^2)$. Of course we do not observe $f'(x_i)$. However, the bound on the second derivative implies that $f'(x_i) - [f(x_{i+1}) - f(x_i)]/[x_{i+1} - x_i]$ is $O(1/n)$, and thus

$$f(x_o) = f(x_i) + \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}(x_o - x_i) + O\!\left(\frac{1}{n^2}\right). \qquad (2.2.2)$$

This local linear approximation involves nothing more than joining the observed points with straight lines. If third-order ($k$th-order) derivatives are bounded, then local quadratic ($k-1$ order polynomial) approximations will reduce the error further.

In this section, we have used the elementary idea that if a function is smooth, its value at a given point can be approximated reasonably well by using evaluations of the function at neighboring points. This idea is fundamental to nonparametric estimation where, of course, $f$ is combined with noise to yield the observed data. All results illustrated in this section have analogues in the nonparametric setting. Data requirements grow very rapidly as the dimension of the nonparametric component increases. The rate of convergence (i.e., the rate at which we learn about the unknown regression function) can be improved using semiparametric structure, additive separability, and smoothness assumptions. Finally, the curse of dimensionality underscores the paramount importance of procedures that validate models with faster rates of convergence. Among these are specification tests of a parametric null against a nonparametric alternative, and significance tests that can reduce the number of explanatory variables in the model.
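Equation (2.2.2) is easy to verify numerically: join the sampled points with straight lines and watch the maximum error fall at roughly the rate $1/n^2$. A sketch (the test function is an arbitrary assumption):

```r
## Join-the-dots approximation error is O(1/n^2) under a second-derivative bound (2.2.2).
f <- function(x) sin(2 * pi * x)                 # hypothetical smooth function
xo <- seq(0, 1, length = 1000)                   # evaluation grid
for (n in c(10, 20, 40)) {
  xi <- seq(0, 1, length = n)                    # n equidistant design points
  err <- max(abs(approx(xi, f(xi), xout = xo)$y - f(xo)))   # linear interpolation
  cat("n =", n, " max error =", signif(err, 3), "\n")       # error roughly quarters as n doubles
}
```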
2.3 Local Averaging Versus Optimization

2.3.1 Local Averaging

In Chapter 1 we introduced the idea of differencing, a device that allowed us to remove the nonparametric effect. Suppose the object of interest is now the nonparametric function itself. A convenient way of estimating the function at a given point is by averaging or smoothing neighboring observations. Suppose we are given data $(y_1, x_1), \ldots, (y_n, x_n)$ on the model $y = f(x) + \varepsilon$, where $x$ is a scalar. Local averaging estimators are extensions of conventional estimators of location to a nonparametric regression setting. If one divides the scatterplot into neighborhoods, then one can compute local means as approximations to the regression function. A more appealing alternative is to have the neighborhood move along the $x$-axis and to compute a moving average along the way. The wider the neighborhood, the smoother the estimate, as may be seen in Figure 2.2. (If one were in a vessel, the "sea" represented by the solid line in the bottom panel would be the most placid.)

2.3.2 Bias-Variance Trade-Off

Suppose then we define the estimator to be

$$\hat f(x_o) = \frac{1}{n_o}\sum_{N(x_o)} y_i = f(x_o) + \frac{1}{n_o}\sum_{N(x_o)}\left(f(x_i) - f(x_o)\right) + \frac{1}{n_o}\sum_{N(x_o)}\varepsilon_i \qquad (2.3.1)$$

where summations are taken over observations in the neighborhood $N(x_o)$ around $x_o$, and $n_o$ is the number of elements in $N(x_o)$. Conditional on the $x$'s, the bias of the estimator consists of the second term, and the variance is determined by the third term.

[Figure 2.2. Naive local averaging. Data-generating mechanism: $y_i = x_i\cos(4\pi x_i) + \varepsilon_i$, $\varepsilon_i \sim N(0, .09)$, $x_i \in [0, 1]$, $n = 100$. Observations are averaged over neighborhoods of the indicated width.]

N(xo)f (xi) f (xo)_22no. (2.3.2)Mean-squared error can be minimized by widening the neighborhoodN(xo)until the increase in bias squared is offset by the reduction in variance. (Thelatter declines because no increases as the neighborhood widens.) This trade-off between bias and variance is illustrated in Figure 2.3, which continues theData-generating mechanism yi = xi cos(4xi) ii N(0, .09) xi [0, 1], n = 100.Figure 2.3. Bias-variance trade-off.22 Semiparametric Regression for the Applied Econometricianexample of Figure 2.2. In the rst panel, local averaging is taking place usingjust 10 percent of the data at each point (of course, fewer observations are usedas one approaches the boundaries of the domain). The solid line isE[ f (x)]and the estimator exhibits little bias; it coincides almost perfectly with the trueregression function (the dotted line). The broken lines on either side correspondto two times the standard errors of the estimator at each point: 2(Var[ f (x)])1/2.In the second panel the neighborhood is substantially broader; we are nowaveraging about 30 percent of the data at each point. The standard error curvesare tighter, but some bias has been introduced. The E[ f (x)] no longer coincidesperfectly with the true regression curve. In the third panel, averaging is takingplace over 80 percent of the data. The standard error curves are even tighter,but now there is substantial bias particularly at the peaks and valleys of thetrue regression function. The expectation of the estimator E[ f (x)] is fairly at,while the true regression function undulates around it.A more general formulation of local averaging estimators modies (2.3.1)as follows:f (xo) =n
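This trade-off is easy to reproduce by simulation. The sketch below (an editorial illustration, not from the original text; it assumes numpy and borrows the data-generating mechanism reported in the Figure 2.2 caption) estimates the average squared bias and average variance of a naive moving-average estimator for three neighborhood widths. Widening the neighborhood should lower the variance and raise the squared bias.

    import numpy as np
    rng = np.random.default_rng(0)

    def local_average(x, y, x0, width):
        # Average the y's whose x lies within width/2 of x0 (a moving mean).
        mask = np.abs(x - x0) <= width / 2
        return y[mask].mean()

    n, reps = 100, 500
    x = np.linspace(0, 1, n)
    f = x * np.cos(4 * np.pi * x)                 # true regression function
    grid = np.linspace(0.05, 0.95, 19)
    fgrid = grid * np.cos(4 * np.pi * grid)

    for width in [0.1, 0.3, 0.8]:                 # neighborhood widths, as in the text
        fits = np.empty((reps, grid.size))
        for r in range(reps):
            y = f + rng.normal(0, 0.3, n)         # sigma_eps^2 = .09
            fits[r] = [local_average(x, y, x0, width) for x0 in grid]
        bias2 = ((fits.mean(0) - fgrid) ** 2).mean()
        var = fits.var(0).mean()
        print(f"width={width}: avg bias^2={bias2:.4f}, avg variance={var:.4f}")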

A more general formulation of local averaging estimators modifies (2.3.1) as follows:

\[
\hat f(x_0) = \sum_{i=1}^{n} w_i(x_0)\, y_i. \tag{2.3.3}
\]

The estimate of the regression function at $x_0$ is a weighted sum of the $y_i$, where the weights $w_i(x_0)$ depend on $x_0$. (Various local averaging estimators can be put in this form, including kernel and nearest-neighbor estimators.) Because one would expect that observations close to $x_0$ would have conditional means similar to $f(x_0)$, it is natural to assign higher weights to these observations and lower weights to those that are farther away. Local averaging estimators have the advantage that, as long as the weights are known or can be easily calculated, $\hat f$ is also easy to calculate. The disadvantage of such estimators is that it is often difficult to impose additional structure on the estimating function $\hat f$.

2.3.3 Naive Optimization

Optimization estimators, on the other hand, are more amenable to incorporating additional structure. As a prelude to our later discussion, consider the following naive estimator. Given data $(y_1, x_1), \ldots, (y_n, x_n)$ on $y_i = f(x_i) + \varepsilon_i$, where $x_i \in [0, 1]$ and $|f'| \le L$, suppose one solves

\[
\min_{\hat y_1, \ldots, \hat y_n} \frac{1}{n}\sum_i (y_i - \hat y_i)^2
\quad \text{s.t.} \quad |\hat y_i - \hat y_j| \le L\,|x_i - x_j|, \quad i, j = 1, \ldots, n. \tag{2.3.4}
\]

Here $\hat y_i$ is the estimate of $f$ at $x_i$, and $\hat f$ is a piecewise linear function joining the $\hat y_i$ with slope not exceeding the derivative bound $L$. Under general conditions this estimator will be consistent. Furthermore, adding monotonicity or concavity constraints, at least at the points where we have data, is straightforward. As additional structure is imposed, the estimator becomes smoother, and its fit to the true regression function improves (see Figure 2.4).

[Figure 2.4. Naive nonparametric least squares. Data-generating mechanism: $y_i = x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, .04)$, $x_i \in [0, 1]$. Simulations performed using GAMS, the General Algebraic Modeling System (Brooke, Kendrick, and Meeraus 1992).]
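Problem (2.3.4) is a quadratic program: a quadratic objective with linear inequality constraints. The sketch below (an editorial illustration, assuming the cvxpy package is available; the data-generating process is arbitrary) computes the estimator for simulated data. With $x$ sorted, imposing the Lipschitz constraint only between adjacent points implies all remaining pairwise constraints by the triangle inequality.

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(1)
    n, L = 100, 2.0                      # sample size and derivative bound
    x = np.sort(rng.uniform(0, 1, n))
    y = x + rng.normal(0, 0.2, n)        # illustrative DGP with f(x) = x

    yhat = cp.Variable(n)
    # Adjacent Lipschitz constraints imply |yhat_i - yhat_j| <= L|x_i - x_j|
    # for all pairs, since the x's are sorted.
    cons = [cp.abs(cp.diff(yhat)) <= L * np.diff(x)]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - yhat) / n), cons)
    prob.solve()
    fhat = yhat.value                    # estimates of f at the observed x's

Additional structure is just additional constraints: monotonicity, for instance, could be imposed by appending cp.diff(yhat) >= 0 to the constraint list.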
2.4 A Bird's-Eye View of Important Theoretical Results

The non- and semiparametric literatures contain many theoretical results. Here we summarize, in crude form, the main categories of results that are of particular interest to the applied researcher.

2.4.1 Computability of Estimators

Our preliminary exposition of local averaging estimators suggests that their computation is generally straightforward. The naive optimization estimator considered in Section 2.3.3 can also be calculated easily, even with additional constraints on the regression function. What is more surprising is that estimators minimizing the sum of squared residuals over (fairly general) infinite dimensional classes of smooth functions can be obtained by solving finite dimensional (often quadratic) optimization problems (see Sections 3.1 to 3.4).

2.4.2 Consistency

In nonparametric regression, smoothness conditions (in particular, the existence of bounded derivatives) play a central role in ensuring consistency of the estimator. They are also critical in determining the rate of convergence as well as certain distributional results. (For example, in proving these results for minimization estimators, smoothness is used to ensure that uniform, over classes of functions, laws of large numbers and uniform central limit theorems apply; see Dudley 1984, Pollard 1984, and Andrews 1994a,b.) With sufficient smoothness, derivatives of the regression function can be estimated consistently, sometimes by differentiating the estimator of the function itself (see Sections 3.1 to 3.4 and 3.7).

2.4.3 Rate of Convergence

How quickly does one discover the true regression function? In a parametric setting, the rate at which the variance of estimators goes to zero is typically $1/n$; it does not depend on the number of explanatory variables. (In the location model $y = \mu + \varepsilon$, $\mathrm{Var}(\bar y) = \sigma_y^2/n$; hence $\bar y - \mu = O_P(n^{-1/2})$ and $(\bar y - \mu)^2 = O_P(1/n)$. For the linear model $y = \alpha + \beta x + \varepsilon$, where the ordered pairs $(y, x)$ are, say, i.i.d., we have $\int (\hat\alpha + \hat\beta x - \alpha - \beta x)^2\,dx = (\hat\alpha - \alpha)^2\!\int dx + (\hat\beta - \beta)^2\!\int x^2\,dx + 2(\hat\alpha - \alpha)(\hat\beta - \beta)\!\int x\,dx = O_P(1/n)$, because $\hat\alpha, \hat\beta$ are unbiased and $\mathrm{Var}(\hat\alpha)$, $\mathrm{Var}(\hat\beta)$, and $\mathrm{Cov}(\hat\alpha, \hat\beta)$ converge to 0 at $1/n$. The same rate of convergence usually applies to general parametric forms of the regression function.) For nonparametric estimators, convergence slows dramatically as the number of explanatory variables increases (recall our earlier discussion of the curse of dimensionality), but this is ameliorated somewhat if the function is differentiable. The optimal rate at which a nonparametric estimator can converge to the true regression function is given by (see Stone 1980, 1982)

\[
\int \bigl[\hat f(x) - f(x)\bigr]^2\,dx = O_P\!\left(\frac{1}{n^{2m/(2m+d)}}\right), \tag{2.4.1}
\]

where $m$ is the degree of differentiability of $f$ and $d$ is the dimension of $x$. For a twice differentiable function of one variable, (2.4.1) implies an optimal rate of convergence of $O_P(n^{-4/5})$ (a case that will recur repeatedly). For a function of two variables, it is $O_P(n^{-2/3})$.

Local averaging and nonparametric least-squares estimators can be constructed that achieve the optimal rate of convergence (see Sections 3.1 through 3.3). Rate of convergence also plays an important role in test procedures.

If the model is additively separable or partially linear, then the rate of convergence of the optimal estimator depends on the nonparametric component of the model with the highest dimension (Stone 1985, 1986). For example, for the additively separable model $y = f_a(x_a) + f_b(x_b) + \varepsilon$, where $x_a, x_b$ are scalars, the convergence rate is the same as if the regression function were a nonparametric function of one variable. The same is true for the partial linear model $y = z\beta + f(x) + \varepsilon$, where $x$ and $z$ are scalars.

Estimators of $\beta$ in the partial linear model can be constructed that are $n^{1/2}$-consistent (i.e., for which the variance shrinks at the parametric rate $1/n$) and asymptotically normal. In Section 1.3, we have already seen a simple differencing estimator with this property (see Sections 3.6 and 4.5 for further discussion). Also, estimators of $\delta$ in the index model $y = f(x\delta) + \varepsilon$ can be constructed that are $n^{1/2}$-consistent and asymptotically normal (see Chapter 7).

For the hybrid regression function $f(z, x_a, x_b, x_c) = z\beta + f_a(x_a) + f_b(x_b) + f_c(x_c)$, where $x_a, x_b, x_c$ are of dimension $d_a, d_b, d_c$, respectively, the optimal rate of convergence for the regression as a whole is the same as for a nonparametric regression model with number of variables equal to $\max\{d_a, d_b, d_c\}$.

Constraints such as monotonicity or concavity do not enhance the (large sample) rate of convergence if enough smoothness is imposed on the model (see Section 6.6). They can improve performance of the estimator (such as its mean-squared error) if strong smoothness assumptions are not made or if the data set is of moderate size (recall Figure 2.4).

2.4.4 Bias-Variance Trade-Off

By increasing the number of observations over which averaging takes place, one can reduce the variance of a local averaging estimator. But as progressively less similar observations are introduced, the estimator generally becomes more biased. The objective is to minimize the mean-squared error (variance plus bias squared). For nonparametric estimators that achieve optimal rates of convergence, the square of the bias and the variance converge to zero at the same rate (see Sections 3.1 and 3.2). (In parametric settings the former converges to zero much more quickly than the latter.) Unfortunately, this property can complicate the construction of confidence intervals and test procedures.

2.4.5 Asymptotic Distributions of Estimators

For a wide variety of nonparametric estimators, the estimate of the regression function at a point is approximately normally distributed, and the joint distribution at a collection of points is joint normally distributed. Various functionals, such as the average sum of squared residuals, are also normally distributed (see Sections 3.1 through 3.3).
In many cases, the bootstrap may be used to construct confidence intervals and critical values that are more accurate than those obtained using asymptotic methods (see Chapter 8).

2.4.6 How Much to Smooth

Smoothing parameters, such as the size of the neighborhood over which averaging is performed, can be selected optimally by choosing the value that minimizes out-of-sample prediction error. The technique, known as cross-validation, will be discussed in Section 3.5.

2.4.7 Testing Procedures

A variety of specification tests of parametric or semiparametric null hypotheses against nonparametric or semiparametric alternatives are available. Nonparametric tests of significance are also available, as are tests of additive separability, monotonicity, homogeneity, concavity, and maximization hypotheses. A fairly unified testing theory can be constructed using either goodness-of-fit type tests or residual regression tests (see Chapters 6 and 8).

3 Introduction to Smoothing

3.1 A Simple Smoother

3.1.1 The Moving Average Smoother

A wide variety of smoothing methods have been proposed. We will begin with a very simple moving average or running mean smoother. (The estimator is also sometimes called the symmetric nearest-neighbor smoother.) Suppose we are given data $(y_1, x_1), \ldots, (y_n, x_n)$ on the model $y = f(x) + \varepsilon$. We continue to assume that $x$ is scalar and that the data have been reordered so that $x_1 \le \cdots \le x_n$. For the time being, we will further assume that the $x_i$ are equally spaced on the unit interval. Define the estimator of $f$ at $x_i$ to be the average of $k$ consecutive observations centered at $x_i$. (To avoid ambiguity, it is convenient to choose $k$ odd.) Formally, we define

\[
\hat f(x_i) = \frac{1}{k}\sum_{j=\underline i}^{\bar i} y_j, \tag{3.1.1}
\]

where $\underline i = i - (k-1)/2$ and $\bar i = i + (k-1)/2$ denote the lower and upper limits of summation. The estimator is of course equal to

\[
\hat f(x_i) = \frac{1}{k}\sum_{j=\underline i}^{\bar i} f(x_j) + \frac{1}{k}\sum_{j=\underline i}^{\bar i} \varepsilon_j. \tag{3.1.2}
\]

If $k$, the number of neighbors being averaged, increases with $n$, then by conventional central limit theorems the second term on the right-hand side will be approximately normal with mean 0 and variance $\sigma_\varepsilon^2/k$. If these neighbors cluster closer and closer to $x_i$, the point at which we are estimating the function, then the first term will converge to $f(x_i)$. Furthermore, if this convergence is fast enough, we will have

\[
k^{1/2}\bigl(\hat f(x_i) - f(x_i)\bigr) \xrightarrow{D} N\bigl(0, \sigma_\varepsilon^2\bigr). \tag{3.1.3}
\]

A 95 percent confidence interval for $f(x_i)$ is immediate:

\[
\hat f(x_i) \pm 1.96\,\frac{\sigma_\varepsilon}{k^{1/2}}, \tag{3.1.4}
\]

and indeed quite familiar from the conventional estimation of a mean ($\sigma_\varepsilon$ may be replaced by a consistent estimator). It is this simple kind of reasoning that we will now make more precise.
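A direct implementation of (3.1.1), together with the pointwise interval (3.1.4), might look as follows (an editorial sketch, assuming numpy; the data-generating process is again the one used in the Chapter 2 figures, and the residual-based variance estimate is deliberately crude).

    import numpy as np

    def moving_average_smoother(y, k):
        # Symmetric k-term moving average, k odd; interior points only,
        # mirroring the estimator's behavior at the boundaries.
        return np.convolve(y, np.ones(k) / k, mode="valid")

    rng = np.random.default_rng(2)
    n, k = 1000, 51
    x = np.linspace(0, 1, n)
    y = x * np.cos(4 * np.pi * x) + rng.normal(0, 0.3, n)

    fhat = moving_average_smoother(y, k)
    h = (k - 1) // 2
    sigma = np.std(y[h:n - h] - fhat)        # crude estimate of sigma_eps
    lower = fhat - 1.96 * sigma / np.sqrt(k) # pointwise 95% band, as in (3.1.4)
    upper = fhat + 1.96 * sigma / np.sqrt(k)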

3.1.2 A Basic Approximation

Let us rewrite (3.1.2) as follows:

\[
\begin{aligned}
\hat f(x_i) &= \frac{1}{k}\sum_{j=\underline i}^{\bar i} y_j \\
&= \frac{1}{k}\sum_{j=\underline i}^{\bar i} f(x_j) + \frac{1}{k}\sum_{j=\underline i}^{\bar i} \varepsilon_j \\
&= f(x_i) + \frac{f'(x_i)}{k}\sum_{j=\underline i}^{\bar i}(x_j - x_i)
+ \frac{f''(x_i)}{2k}\sum_{j=\underline i}^{\bar i}(x_j - x_i)^2
+ \frac{1}{k}\sum_{j=\underline i}^{\bar i} \varepsilon_j \\
&= f(x_i) + \tfrac{1}{2} f''(x_i)\,\frac{1}{k}\sum_{j=\underline i}^{\bar i}(x_j - x_i)^2
+ \frac{1}{k}\sum_{j=\underline i}^{\bar i} \varepsilon_j.
\end{aligned}
\tag{3.1.5}
\]

In the third and fourth lines, we have applied a second-order Taylor series (in particular, $f(x_j) = f(x_i) + f'(x_i)(x_j - x_i) + \tfrac12 f''(x_i)(x_j - x_i)^2 + o(x_j - x_i)^2$; we are obviously assuming second-order derivatives exist). Note that, with the $x_j$ symmetric around $x_i$, the second term in the third line is zero. So, we may rewrite (3.1.5) as

\[
\hat f(x_i) \cong f(x_i) + \frac{1}{24}\left(\frac{k}{n}\right)^{\!2} f''(x_i) + \frac{1}{k}\sum_{j=\underline i}^{\bar i} \varepsilon_j, \tag{3.1.6}
\]

where we have used the result $\tfrac12\,\frac{1}{k}\sum_{j=\underline i}^{\bar i}(x_j - x_i)^2 = \frac{k^2 - 1}{24 n^2} \cong \frac{1}{24}\bigl(\frac{k}{n}\bigr)^2$ (see the exercises for further details).

The last term is an average of $k$ independent and identical random variables, so its variance is $\sigma_\varepsilon^2/k$, and we have

\[
\hat f(x_i) = f(x_i) + O\!\left(\frac{k}{n}\right)^{\!2} + O_P\!\left(\frac{1}{k^{1/2}}\right). \tag{3.1.7}
\]

The bias $E(\hat f(x_i) - f(x_i))$ is approximated by the second term of (3.1.6), and $\mathrm{Var}(\hat f(x_i))$ is approximately $\sigma_\varepsilon^2/k$; thus, the mean-squared error (the sum of the bias squared and the variance) at a point $x_i$ is

\[
E\bigl[\hat f(x_i) - f(x_i)\bigr]^2 = O\!\left(\frac{k}{n}\right)^{\!4} + O\!\left(\frac{1}{k}\right). \tag{3.1.8}
\]

3.1.3 Consistency and Rate of Convergence

The approximation embodied in (3.1.6) yields dividends immediately. As long as $k/n \to 0$ and $k \to \infty$, the second and third terms go to zero and we have a consistent estimator.

The rate at which $\hat f(x_i) - f(x_i) \to 0$ depends on which of the second or third terms in (3.1.6) converges to zero more slowly. Optimality is achieved when the bias squared and the variance shrink to zero at the same rate. Using (3.1.7), one can see that this occurs if $O(k^2/n^2) = O_P(1/k^{1/2})$, which implies that optimality can be achieved by choosing $k = O(n^{4/5})$. In this case,

\[
\hat f(x_i) = f(x_i) + O\!\left(\frac{1}{n^{2/5}}\right) + O_P\!\left(\frac{1}{n^{2/5}}\right). \tag{3.1.9}
\]

Equivalently, we could have solved for the optimal rate using (3.1.8). Setting $O(k^4/n^4) = O(1/k)$ and solving, we again obtain $k = O(n^{4/5})$. Substituting into (3.1.8) yields a rate of convergence of $E[\hat f(x_i) - f(x_i)]^2 = O(n^{-4/5})$ for the mean-squared error at a point $x_i$. This, in turn, underpins the following:

\[
\int \bigl[\hat f(x) - f(x)\bigr]^2\,dx = O_P\!\left(\frac{1}{n^{4/5}}\right), \tag{3.1.10}
\]

which is a rather pleasant result in that it satisfies Stone's optimal rate of convergence, (2.4.1), with order of differentiability $m = 2$ and dimension $d = 1$.

3.1.4 Asymptotic Normality and Confidence Intervals

Applying a central limit theorem to the last term of (3.1.6), we have

\[
k^{1/2}\left( \hat f(x_i) - f(x_i) - \frac{1}{24}\left(\frac{k}{n}\right)^{\!2} f''(x_i) \right) \xrightarrow{D} N\bigl(0, \sigma_\varepsilon^2\bigr). \tag{3.1.11}
\]

If we select $k$ optimally, say, $k = n^{4/5}$, then $k^{1/2}(k/n)^2 = 1$, and the construction of a confidence interval for $f(x_i)$ is complicated by the presence of the term involving $f''(x_i)$, which would need to be estimated. However, if we require $k$ to grow more slowly than $n^{4/5}$ (e.g., $k = n^{3/4}$), then $k^{1/2}(k/n)^2 \to 0$ and (3.1.11) becomes $k^{1/2}(\hat f(x_i) - f(x_i)) \xrightarrow{D} N(0, \sigma_\varepsilon^2)$. Intuitively, we are adding observations sufficiently slowly that they are rapidly clustering around the point of estimation. As a consequence, the bias is small relative to the variance (see (3.1.7)). In this case, a 95 percent confidence interval for $f(x_i)$ is approximately $\hat f(x_i) \pm 1.96\,\sigma_\varepsilon/k^{1/2}$. These are of course exactly the results we began with in (3.1.3) and (3.1.4).

Let us pause for a moment. In these last sections, we have illustrated three essential results for a simple moving average estimator: that it is consistent; that by allowing the number of terms in the average to grow at an appropriate rate, the optimal rate of convergence can be achieved; and that it is asymptotically normal.

3.1.5 Smoothing Matrix

It is often convenient to write moving average (and other) smoothers in matrix notation. Let $S$ be the $(n-k+1)\times n$ smoother matrix defined by

\[
S = \frac{1}{k}
\begin{pmatrix}
1 & \cdots & 1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & \cdots & 1 & 0 & \cdots & 0 \\
\vdots & & \ddots & & \ddots & & \vdots \\
0 & \cdots & 0 & 1 & \cdots & 1 & 0 \\
0 & \cdots & \cdots & 0 & 1 & \cdots & 1
\end{pmatrix}, \tag{3.1.12}
\]

where each row contains $k$ consecutive nonzero entries. Then we may rewrite (3.1.1) in vector-matrix form as

\[
\hat y = \hat f(x) = S y, \tag{3.1.13}
\]

where $x$, $y$, $\hat y$, and $\hat f(x)$ are vectors.
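A minimal construction of $S$ (an editorial sketch, assuming numpy) makes the equivalence with the $k$-term moving average explicit:

    import numpy as np

    def smoother_matrix(n, k):
        # (n-k+1) x n moving-average smoother matrix: each row holds
        # k consecutive weights equal to 1/k, as in (3.1.12).
        S = np.zeros((n - k + 1, n))
        for i in range(n - k + 1):
            S[i, i:i + k] = 1.0 / k
        return S

    n, k = 9, 3
    S = smoother_matrix(n, k)
    y = np.arange(n, dtype=float)
    print(S @ y)    # identical to the k-term moving average of y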
3.1.6 Empirical Application: Engel Curve Estimation

A common problem in a variety of areas of economics is the estimation of Engel curves. Using South African household survey data (see Appendix E), we select the subset consisting of single individuals and plot the food share of total expenditure as a function of the log of total expenditure in Figure 3.1. The subset contains 1,109 observations.

We apply the moving average smoother with $k = 51$ to obtain the solid irregular line in the upper panel. The lack of smoothness is a feature of moving average smoothers. Note that the estimator does not quite extend to the boundaries of the data because it drops observations at either end. This shortcoming will be remedied shortly, but boundary behavior is an important feature distinguishing nonparametric estimators.

The lower panel uses (3.1.3) to produce 95 percent pointwise confidence intervals. At median expenditure (log(total expenditure) = 6.54), the 95 percent confidence interval for food share is 38 to 46 percent.

[Figure 3.1. Engel curve estimation using moving average smoother. Model: $y = f(x) + \varepsilon$, where $x$ is the log of total expenditure and $y$ is the food share of expenditure. Data: a sample of 1,109 single individuals (Singles) from South Africa.]

3.2 Kernel Smoothers

3.2.1 Estimator

Let us return now to the more general formulation of a nonparametric estimator we proposed in Chapter 2:

\[
\hat f(x_0) = \sum_{i=1}^{n} w_i(x_0)\, y_i. \tag{3.2.1}
\]

Here we are estimating the regression function at the point $x_0$ as a weighted sum of the $y_i$, where the weights $w_i(x_0)$ depend on $x_0$. A conceptually convenient way to construct local averaging weights is to use a unimodal function centered at zero that declines in either direction at a rate controlled by a scale parameter. Natural candidates for such functions, which are commonly known as kernels, are probability density functions. Let $K$ be a bounded function that integrates to 1 and is symmetric around 0. Define the weights to be

\[
w_i(x_0) = \frac{\dfrac{1}{n\lambda} K\!\left(\dfrac{x_i - x_0}{\lambda}\right)}
{\dfrac{1}{n\lambda} \displaystyle\sum_{j=1}^{n} K\!\left(\dfrac{x_j - x_0}{\lambda}\right)}. \tag{3.2.2}
\]

The shape of the weights (which, by construction, sum to 1) is determined by $K$, and their magnitude is controlled by $\lambda$, which is known as the bandwidth. A large value of $\lambda$ results in greater weight being put on observations that are far from $x_0$. Using (3.2.1), the nonparametric regression function estimator, first suggested by Nadaraya (1964) and Watson (1964), becomes

\[
\hat f(x_0) = \frac{\dfrac{1}{n\lambda} \displaystyle\sum_{i=1}^{n} K\!\left(\dfrac{x_i - x_0}{\lambda}\right) y_i}
{\dfrac{1}{n\lambda} \displaystyle\sum_{j=1}^{n} K\!\left(\dfrac{x_j - x_0}{\lambda}\right)}. \tag{3.2.3}
\]

A variety of kernels are available (see Figure 3.2). Generally, selection of the kernel is less important than selection of the bandwidth over which observations are averaged. The simplest is the uniform kernel (also known as the rectangular or box kernel), which takes the value 1/2 on $[-1, 1]$ and 0 elsewhere. But the normal and other kernels are also widely used (see Wand and Jones 1995 for an extensive treatment of kernel smoothing).

[Figure 3.2. Alternative kernel functions.]

Much of the intuition developed using the moving average smoother applies in the current setting. Indeed, with equally spaced $x$'s on the unit interval and the uniform kernel, the essential difference is the definition of the smoothing parameter. The uniform kernel simply averages observations that lie in the interval $x_0 \pm \lambda$. With $n$ data points in the unit interval, the proportion of observations falling in an interval of width $2\lambda$ will be $2\lambda$, and the number of observations will be $2\lambda n$. Thus, if one uses the substitution $k = 2\lambda n$ in the arguments of Section 3.1, analogous results will be obtained for the uniform kernel estimator, which in this case is virtually identical to the moving average smoother.
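Before turning to the asymptotics, here is a compact sketch of the Nadaraya-Watson estimator (3.2.3) (an editorial illustration, assuming numpy; the uniform kernel is the default, and the bandwidth value is arbitrary):

    import numpy as np

    def nadaraya_watson(x, y, x0, lam, kernel=lambda u: 0.5 * (np.abs(u) <= 1)):
        # Kernel-weighted average of y at each evaluation point in x0;
        # the 1/(n*lam) factors cancel between numerator and denominator.
        u = (x[None, :] - x0[:, None]) / lam
        K = kernel(u)
        return (K * y).sum(axis=1) / K.sum(axis=1)

    rng = np.random.default_rng(3)
    n = 500
    x = np.sort(rng.uniform(0, 1, n))
    y = x * np.cos(4 * np.pi * x) + rng.normal(0, 0.3, n)
    grid = np.linspace(0.05, 0.95, 50)
    fhat = nadaraya_watson(x, y, grid, lam=0.05)

Swapping in another kernel is a one-line change, e.g., the triangular kernel lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1).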

In particular, (3.1.6) and (3.1.7) become

\[
\hat f(x_i) \cong f(x_i) + \frac{1}{24}(2\lambda)^2 f''(x_i) + \frac{1}{2\lambda n}\sum_j \varepsilon_j \tag{3.2.4}
\]

and

\[
\hat f(x_i) = f(x_i) + O(\lambda^2) + O_P\!\left(\frac{1}{\lambda^{1/2} n^{1/2}}\right). \tag{3.2.4a}
\]

Analogously to the conditions on $k$, we impose two conditions on $\lambda$. The first is $\lambda \to 0$, which ensures that averaging takes place over a shrinking neighborhood, thus eventually eliminating bias. The second is $\lambda n \to \infty$, which ensures that the number of observations being averaged grows and the variance of the estimator declines to 0.

3.2.2 Asymptotic Normality

Suppose now that the $x$'s are randomly distributed (say on the unit interval) with probability density $p(x)$. For a general kernel, the Nadaraya-Watson kernel estimator (3.2.3) is consistent: the numerator converges to $f(x_0)p(x_0)$ and the denominator converges to $p(x_0)$. The rate of convergence is optimized if $\lambda = O(n^{-1/5})$, in which case the integrated squared error converges at the optimal rate $O_P(n^{-4/5})$, as in (3.1.10). Confidence intervals may be constructed using

\[
\lambda^{1/2} n^{1/2}\left( \hat f(x_0) - f(x_0) - \lambda^2 a_K \left( \tfrac12 f''(x_0) + \frac{f'(x_0)\,p'(x_0)}{p(x_0)} \right) \right)
\xrightarrow{D} N\!\left(0, \frac{b_K\,\sigma_\varepsilon^2}{p(x_0)}\right), \tag{3.2.5}
\]

where $p(\cdot)$ is the density of $x$ and

\[
a_K = \int u^2 K(u)\,du, \qquad b_K = \int K^2(u)\,du. \tag{3.2.6}
\]

Wand and Jones (1995, p. 176) provide the values of $a_K$ and $b_K$ for various kernels.

3.2.3 Comparison to Moving Average Smoother

Equation (3.2.5) requires estimation of the first and second derivatives of the regression function. However, if $\lambda$ shrinks to zero faster than at the optimal rate, then the bias term disappears. Under such conditions, and assuming a uniform kernel for which $b_K = 1/2$, we may rewrite (3.2.5) as

\[
\lambda^{1/2} n^{1/2}\bigl(\hat f(x_0) - f(x_0)\bigr) \xrightarrow{D} N\!\left(0, \frac{\sigma_\varepsilon^2}{2\,p(x_0)}\right). \tag{3.2.7}
\]

What is the probability that an observation will fall in the interval $x_0 \pm \lambda$? It is roughly the height of the density times twice the bandwidth, or $2\lambda p(x_0)$. Now consider the variance of $\hat f(x_0)$ implied by (3.2.7): $\sigma_\varepsilon^2/(2\lambda p(x_0) n)$. The denominator is approximately the number of observations one can expect to be averaging when calculating the estimate of $f$ at $x_0$. Compare this to the variance of the moving average estimator in Section 3.1, which is $\sigma_\varepsilon^2/k$.

3.2.4 Confidence Intervals

Again let us assume that the bias term is made to disappear asymptotically by permitting the bandwidth to shrink at a rate that is faster than the optimal rate. Applying (3.2.5), define the standard error of the estimated regression function at a point to be

\[
s_{\hat f}(x_0) = \left( \frac{b_K\,\hat\sigma_\varepsilon^2}{\hat p(x_0)\,\lambda n} \right)^{1/2}, \tag{3.2.8}
\]

where

\[
\hat p(x_0) = \frac{1}{n\lambda}\sum_{i=1}^{n} K\!\left(\frac{x_i - x_0}{\lambda}\right) \tag{3.2.9}
\]

is the denominator of (3.2.3). (See the footnote to Table 3.1 for values of $b_K$.) Then a 95 percent pointwise confidence interval can be constructed using

\[
\hat f(x_0) \pm 1.96\, s_{\hat f}. \tag{3.2.10}
\]

Table 3.1 provides implementation details. For confidence intervals when the residuals are heteroskedastic, see the bootstrap procedures in Chapter 8, Table 8.2.

Table 3.1. Asymptotic confidence intervals for kernel estimators: implementation.

1. Select $\lambda$ so that $\lambda n^{1/5} \to 0$, for example, $\lambda = O(n^{-1/4})$. This ensures that the bias term does not appear in (3.2.5).
2. Select a kernel $K$ and obtain $b_K = \int K^2(u)\,du$. For the uniform kernel on $[-1,1]$, $b_K = 1/2$.*
3. Estimate $f$ using the Nadaraya-Watson estimator (3.2.3).
4. Calculate $\hat\sigma_\varepsilon^2 = \frac{1}{n}\sum_i (y_i - \hat f(x_i))^2$.
5. Estimate $\hat p(x_0)$ using (3.2.9). If the uniform kernel is used, $\hat p(x_0)$ equals the proportion of $x_i$ in the interval $x_0 \pm \lambda$ divided by the width of the interval, $2\lambda$.
6. Calculate the confidence interval at $\hat f(x_0)$ using $\hat f(x_0) \pm 1.96\,\bigl(b_K\hat\sigma_\varepsilon^2/(\hat p(x_0)\lambda n)\bigr)^{1/2}$.
7. Repeat at other points if desired.

* For other kernels, the values of $b_K$ are as follows: triangular, 2/3; quartic or biweight, 5/7; Epanechnikov, 3/5; triweight, 350/429; normal, $1/(2\pi^{1/2})$.

3.2.5 Uniform Confidence Bands

A potentially more interesting graphic for nonparametric estimation is a confidence band or ribbon around the estimated function. Its interpretation is that, in repeated samples, 95 percent of the estimated confidence bands will contain the entire true regression function $f$. The plausibility of an alternative specification (such as a parametric estimate, or a monotone or concave estimate) can then be assessed by superimposing this specification on the graph to see whether it falls within the band. Without loss of generality, assume that the domain of the nonparametric regression function is the unit interval. Returning to the assumption that $\lambda \to 0$ at a rate faster than optimal (but slowly enough to ensure consistency), a uniform 95 percent confidence band or ribbon can be constructed around the function $\hat f$ using

\[
\hat f(x) \pm \left( \frac{c}{d} + d + \frac{1}{2d}\,\ln\!\left( \frac{\int (K'(u))^2\,du}{4\pi^2 \int K^2(u)\,du} \right) \right) s_{\hat f}, \tag{3.2.11}
\]

where $d = (2\ln(1/\lambda))^{1/2}$, $c$ satisfies $\exp[-2\exp(-c)] = .95$, and $s_{\hat f}$ is the estimated standard error of the estimated regression function defined in (3.2.8). (See Härdle and Linton 1994, p. 2317; see also Eubank and Speckman 1993 for an alternative approach to constructing uniform confidence bands for the case where the $x$'s are equally spaced.)

3.2.6 Empirical Application: Engel Curve Estimation

We now apply kernel estimation to the South African data set on single individuals considered earlier. The upper panel of Figure 3.3 illustrates a kernel estimate (using a triangular kernel). It is considerably smoother than the simple moving average estimator in Figure 3.1. The lower panel of Figure 3.3 displays 95 percent pointwise confidence intervals as well as a 95 percent uniform confidence band around the estimate. Note that the uniform band, because it is designed to capture the entire function with 95 percent probability, is wider than the pointwise intervals.

[Figure 3.3. Engel curve estimation using kernel estimator. Model: $y = f(x) + \varepsilon$, where $x$ is the log of total expenditure and $y$ is the food share of expenditure. Data: a sample of 1,109 single individuals (Singles) from South Africa.]
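The recipe of Table 3.1 is short enough to implement directly. The sketch below (an editorial illustration, assuming numpy) uses the uniform kernel, so $b_K = 1/2$ and $\hat p(x_0)$ is the proportion of observations in $x_0 \pm \lambda$ divided by $2\lambda$; forming residuals by interpolating the fit back to the observed points is a simplification of step 4.

    import numpy as np

    def kernel_ci(x, y, grid, lam):
        # Pointwise 95% intervals per Table 3.1, uniform kernel,
        # with an undersmoothed bandwidth so the bias term is negligible.
        n = len(x)
        fhat = np.empty(grid.size)
        phat = np.empty(grid.size)
        for j, x0 in enumerate(grid):
            mask = np.abs(x - x0) <= lam
            fhat[j] = y[mask].mean()           # NW estimate, uniform kernel
            phat[j] = mask.mean() / (2 * lam)  # density estimate (3.2.9)
        resid = y - np.interp(x, grid, fhat)   # step 4 (simplified)
        sig2 = np.mean(resid ** 2)
        half = 1.96 * np.sqrt(0.5 * sig2 / (phat * lam * n))
        return fhat, fhat - half, fhat + half

    rng = np.random.default_rng(4)
    n = 1000
    x = rng.uniform(0, 1, n)
    y = x * np.cos(4 * np.pi * x) + rng.normal(0, 0.3, n)
    lam = n ** (-0.25)                         # step 1: lambda = O(n^(-1/4))
    grid = np.linspace(0.1, 0.9, 41)
    fhat, lo, hi = kernel_ci(x, y, grid, lam)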
3.3 Nonparametric Least-Squares and Spline Smoothers

3.3.1 Estimation

In Section 2.3.3, we introduced a primitive nonparametric least-squares estimator that imposed smoothness by bounding the slope of the estimating function. We will need a more tractable way to impose constraints on various order derivatives. Let $C^m$ be the set of functions that have continuous derivatives up to order $m$. (For purposes of exposition we restrict these functions to the unit interval.) A measure of smoothness that is particularly convenient is given by the Sobolev norm

\[
\|f\|_{\mathrm{Sob}} = \left( \int \Bigl( f^2 + (f')^2 + (f'')^2 + \cdots + \bigl(f^{(m)}\bigr)^2 \Bigr)\,dx \right)^{1/2}, \tag{3.3.1}
\]

where $(m)$ denotes the $m$th-order derivative. A small value of the norm implies that neither the function nor any of its derivatives up to order $m$ can be too large over a significant portion of the domain. Indeed, bounding this norm implies that all lower-order derivatives are bounded in sup norm. Recall from Section 2.3.3 and Figure 2.4 that even bounding the first derivative produces a consistent nonparametric least-squares estimator.

Suppose we take our estimating set to be the set of functions in $C^m$ for which the square of the Sobolev norm is bounded by, say, $L$, that is, $\mathcal{F} = \{ f \in C^m : \|f\|_{\mathrm{Sob}}^2 \le L \}$. The task of finding the function in $\mathcal{F}$ that best fits the data appears to be daunting. After all, $\mathcal{F}$ is an infinite dimensional family. What is remarkable is that the solution $\hat f$ satisfying

\[
s^2 = \min_{f \in \mathcal{F}} \frac{1}{n}\sum_i \bigl[y_i - f(x_i)\bigr]^2 \quad \text{s.t.} \quad \|f\|_{\mathrm{Sob}}^2 \le L \tag{3.3.2}
\]

can be obtained by minimizing a quadratic objective function subject to a quadratic constraint. The solution is of the form $\hat f = \sum_{i=1}^{n} \hat c_i r_{x_i}$, where $r_{x_1}, \ldots, r_{x_n}$ are functions computable from $x_1, \ldots, x_n$ and $\hat c = (\hat c_1, \ldots, \hat c_n)$ is obtained by solving

\[
\min_{c} \frac{1}{n}\,[y - Rc]'[y - Rc] \quad \text{s.t.} \quad c'Rc \le L. \tag{3.3.3}
\]

Here $y$ is the $n \times 1$ vector of observations on the dependent variable, and $R$ is an $n \times n$ matrix computable from $x_1, \ldots, x_n$. Note that even though one is estimating $n$ parameters to fit $n$ observations, the parameters are constrained; thus, there is no immediate reason to expect perfect fit.

The $r_{x_i}$ are called representor functions, and $R$, the matrix of inner products of the $r_{x_i}$, the representor matrix (see Wahba 1990, Yatchew and Bos 1997). Details of these computations are contained in Appendix D. An efficient algorithm for solving (3.3.3) may be found in Golub and Van Loan (1989, p. 564). Furthermore, if $x$ is a vector, the Sobolev norm (3.3.1) generalizes to include various order partial derivatives. The optimization problem has the same quadratic structure as in the one-dimensional case above, and the functions $r_{x_1}, \ldots, r_{x_n}$ as well as the matrix $R$ are directly computable from the data $x_1, \ldots, x_n$. Further results may be found in Chapters 5 and 6 and Appendix D.
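Although the genuine representor matrix $R$ requires the Sobolev inner-product calculations of Appendix D, the quadratic structure of (3.3.3) can be conveyed with a stand-in. The sketch below is purely illustrative: the Gaussian Gram matrix, its bandwidth 0.1, and the penalty value are all assumptions of ours, not the book's construction. It solves the penalized (Lagrangian) form of (3.3.3), whose first-order condition reduces to $(R + n\lambda I)c = y$ when $R$ is symmetric positive definite.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    x = np.sort(rng.uniform(0, 1, n))
    y = x * np.cos(4 * np.pi * x) + rng.normal(0, 0.3, n)

    # Stand-in for the representor matrix: a Gaussian Gram matrix.
    R = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)

    lam = 1e-3  # penalty playing the role of the multiplier on c'Rc <= L
    # FOC of (1/n)||y - Rc||^2 + lam * c'Rc  =>  (R + n*lam*I) c = y
    c = np.linalg.solve(R + n * lam * np.eye(n), y)
    fhat = R @ c    # fitted values: fhat(x_i) = sum_j c_j r_{x_j}(x_i)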

3.3.2 Properties

The main statistical properties of the procedure are these: $\hat f$ is a consistent estimator of $f$; indeed, low-order derivatives of $\hat f$ consistently estimate the corresponding derivatives of $f$. The rate at which $\hat f$ converges to $f$ satisfies the optimal rates given by Stone, (2.4.1). The optimal convergence result is useful in producing consistent tests of a broad range of hypotheses. (These results are proved using empirical processes theory, as discussed in Dudley 1984 and Pollard 1984.)

The average minimum sum of squared residuals $s^2$ is a consistent estimator of the residual variance $\sigma_\varepsilon^2$. Furthermore, in large samples, $s^2$ is indistinguishable from the true average sum of squared residuals in the sense that

\[
n^{1/2}\left( s^2 - \frac{1}{n}\sum \varepsilon_i^2 \right) \xrightarrow{P} 0. \tag{3.3.4}
\]

Next, since $n^{1/2}\bigl(\frac{1}{n}\sum \varepsilon_i^2 - \sigma_\varepsilon^2\bigr) \xrightarrow{D} N(0, \mathrm{Var}(\varepsilon^2))$ (just apply an ordinary central limit theorem), (3.3.4) implies that

\[
n^{1/2}\bigl( s^2 - \sigma_\varepsilon^2 \bigr) \xrightarrow{D} N\bigl(0, \mathrm{Var}(\varepsilon^2)\bigr). \tag{3.3.5}
\]

As explained in Section 3.6.2, this result lies at the heart of demonstrating that nonparametric least squares can be used to produce $n^{1/2}$-consistent, asymptotically normal estimators in the partial linear model.

3.3.3 Spline Smoothers

The nonparametric least-squares estimator is closely related to spline estimation. Assume for the moment that $\lambda > 0$ is a given constant (in practice it is selected using cross-validation, a procedure we will discuss shortly), and consider the penalized least-squares problem

\[
\min_{f} \frac{1}{n}\sum_i \bigl[y_i - f(x_i)\bigr]^2 + \lambda\,\|f\|_{\mathrm{Sob}}^2. \tag{3.3.6}
\]

The criterion function trades off fidelity to the data against smoothness of the function $f$. There is a penalty for selecting functions that fit the data extremely well but, as a consequence, are very rough (recall that the Sobolev norm measures the smoothness of a function and its derivatives). A larger $\lambda$ results in a smoother function being selected. If one solves (3.3.2), our nonparametric least-squares problem, takes the Lagrangian multiplier, say $\bar\lambda$, associated with the smoothness constraint, and then uses it in solving (3.3.6), the resulting $\hat f$ will be identical.

In their simplest incarnation, spline estimators use $\int (f'')^2$ as the measure of smoothness (see Eubank 1988, Wahba 1990). Equation (3.3.6) becomes

\[
\min_{f} \frac{1}{n}\sum_i \bigl[y_i - f(x_i)\bigr]^2 + \lambda \int (f'')^2. \tag{3.3.7}
\]

As $\lambda$ increases, the estimate becomes progressively smoother. In the limit, $f''$ is forced to zero, producing a linear fit. At the other extreme, as $\lambda$ goes to zero, the estimator produces a function that interpolates the data points perfectly.
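A discrete analogue of (3.3.7) conveys the role of $\lambda$: with equally spaced $x$'s, replace $\int (f'')^2$ by the sum of squared second differences of the fitted values. The sketch below (an editorial illustration, assuming numpy; it is not the spline estimator itself) solves the resulting ridge-type problem for several values of $\lambda$.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200
    x = np.linspace(0, 1, n)
    y = x * np.cos(4 * np.pi * x) + rng.normal(0, 0.3, n)

    # D is the (n-2) x n second-difference matrix, so D @ f approximates
    # f'' (up to the grid-spacing scale) at interior points.
    D = np.diff(np.eye(n), n=2, axis=0)

    for lam in [1e-4, 1e-2, 1.0]:
        # Minimize ||y - f||^2 + lam ||D f||^2  =>  (I + lam D'D) f = y
        fhat = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
        # Larger lam => smoother fit; as lam grows the fit approaches a line.
        print(lam, float(np.sum((D @ fhat) ** 2)))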

3.4 Local Polynomial Smoothers

3.4.1 Local Linear Regression

A natural extension of local averaging is the idea of local regression. Suppose one runs a linear regression using only observations that lie in a neighborhood of $x_0$, which we will denote by $N(x_0)$. If included observations were given equal weight, one would solve

\[
\min_{a,b} \sum_{x_i \in N(x_0)} \bigl[y_i - a(x_0) - b(x_0)\,x_i\bigr]^2, \tag{3.4.1}
\]

where the dependence of the regression coefficients on $x_0$ is emphasized by the notation. The estimate of $f$ at $x_0$ would be given by

\[
\hat f(x_0) = \hat a(x_0) + \hat b(x_0)\,x_0. \tag{3.4.2}
\]

Repeating this procedure at a series of points in the domain, one obtains a nonparametric estimator of the regression function $f$.

Alternatively, one could perform a weighted regression, assigning higher weights to closer observations and lower ones to those that are farther away. (In the preceding procedure, one assigns a weight of 1 to observations in $N(x_0)$ and 0 to others.) A natural way to implement this is to let the weights be determined by a kernel function and controlled by the bandwidth parameter $\lambda$. The optimization problem may then conveniently be written as

\[
\min_{a,b} \sum_i \bigl[y_i - a(x_0) - b(x_0)\,x_i\bigr]^2\, K\!\left(\frac{x_i - x_0}{\lambda}\right). \tag{3.4.3}
\]

Solutions are once again plugged into (3.4.2). This procedure is sometimes referred to as kernel regression because it applies kernel weights to a local regression. By replacing the linear function in (3.4.3) with a polynomial, the procedure generalizes to local polynomial regression.

Key references in this literature include Cleveland (1979), Cleveland and Devlin (1988), and Fan and Gijbels (1996). The latter is a monograph devoted to the subject and contains an extensive bibliography.

3.4.2 Properties

Under general conditions, local polynomial regression procedures are consistent and achieve optimal rates of convergence.
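A bare-bones version of (3.4.1) through (3.4.3) (an editorial sketch, assuming numpy; uniform kernel weights and an arbitrary bandwidth) solves the weighted normal equations at each evaluation point:

    import numpy as np

    def local_linear(x, y, x0, lam):
        # Kernel-weighted linear regression around x0, as in (3.4.3);
        # uniform kernel weights for simplicity.
        w = 0.5 * (np.abs((x - x0) / lam) <= 1)
        A = np.column_stack([np.ones_like(x), x])
        AtW = A.T * w                      # apply weights column-wise
        a, b = np.linalg.solve(AtW @ A, AtW @ y)
        return a + b * x0                  # fitted value, as in (3.4.2)

    rng = np.random.default_rng(7)
    n = 500
    x = np.sort(rng.uniform(0, 1, n))
    y = x * np.cos(4 * np.pi * x) + rng.normal(0, 0.3, n)
    grid = np.linspace(0.05, 0.95, 50)
    fhat = np.array([local_linear(x, y, x0, lam=0.08) for x0 in grid])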