
LECTURE NOTES: STATISTICS 550

SPRING 2004

Robert J. Boik

Department of Mathematical Sciences

Montana State University — Bozeman

April 30, 2004


Contents

1 RESOURCES
  1.1 SYLLABUS
  1.2 INTERMEDIATE LEVEL REFERENCE BOOKS
  1.3 ADVANCED LEVEL REFERENCE BOOKS
  1.4 SPECIAL TOPICS BOOKS
  1.5 SELECTED JOURNAL ARTICLES & BOOK CHAPTERS

2 ORDER OF CONVERGENCE
  2.1 ORDER
    2.1.1 Conventions
    2.1.2 Definitions
    2.1.3 Properties
    2.1.4 Examples
  2.2 ORDER IN PROBABILITY
    2.2.1 Definitions
    2.2.2 Properties

3 DERIVATIVES & OPTIMIZATION
  3.1 MATRIX DERIVATIVES
    3.1.1 Derivative of a Matrix with Respect to a Matrix
    3.1.2 Derivative of a Product
    3.1.3 Derivative of a Kronecker Product
    3.1.4 Chain Rule for Matrices
    3.1.5 Derivatives of Inner Products and Quadratic Forms
    3.1.6 Derivatives of Traces and Determinants
    3.1.7 Derivatives of Inverses
    3.1.8 Derivatives of Exponential Functions
    3.1.9 Second-Order and Higher-Order Derivatives
      3.1.9.1 Second-Order Derivatives
      3.1.9.2 Higher-Order Derivatives
      3.1.9.3 Special Case: Derivatives of a Scalar with Respect to Vectors
    3.1.10 Miscellaneous Derivatives
    3.1.11 Vec and Vech
      3.1.11.1 Vec Operator
      3.1.11.2 Vech Operator
      3.1.11.3 Vec Permutation Matrix (Commutation Matrix)
      3.1.11.4 Miscellaneous Vec Results
    3.1.12 Some Cautions About Notation
  3.2 EXAMPLE: DERIVATIVES OF LIKELIHOOD FUNCTIONS
    3.2.1 First-Order Derivatives
    3.2.2 Second-Order Derivatives
    3.2.3 Higher-Order Derivatives
    3.2.4 Example: Linear Model
  3.3 MULTIVARIATE TAYLOR SERIES EXPANSIONS
    3.3.1 Arrangement of Derivatives
    3.3.2 Taylor Series with Remainder—Scalar-Valued Function
    3.3.3 Taylor Series with Remainder—Vector-Valued Function
  3.4 CUMULANT GENERATING FUNCTIONS
  3.5 JACOBIANS
  3.6 CHANGE OF VARIABLE FORMULAE
    3.6.1 Review of Scalar Results
    3.6.2 Multivariate Results
    3.6.3 Some Multivariate Jacobians
  3.7 SECOND DERIVATIVE TEST
  3.8 LAGRANGE MULTIPLIERS
  3.9 NEWTON-RAPHSON ALGORITHM
  3.10 FISHER SCORING ALGORITHM
    3.10.1 Examples
      3.10.1.1 Beta MLEs
      3.10.1.2 Bernoulli MLEs
  3.11 GAUSS-NEWTON ALGORITHM
  3.12 EM ALGORITHM
    3.12.1 References
    3.12.2 Missing Data Mechanisms
    3.12.3 Jensen's Inequality
      3.12.3.1 Definitions
      3.12.3.2 Jensen's Theorem
      3.12.3.3 Examples
    3.12.4 Theory
      3.12.4.1 E Step
      3.12.4.2 M Step
    3.12.5 Application to MVN Regression Set-Up
      3.12.5.1 Conditional Distribution of Ymiss
      3.12.5.2 The Log Likelihood Function
        3.12.5.2.1 E Step
        3.12.5.2.2 M Step
      3.12.5.3 Summary
    3.12.6 Example

4 PRINCIPLES OF MATHEMATICAL STATISTICS
  4.1 SUFFICIENCY
    4.1.1 Distribution Constant Statistic
    4.1.2 Sufficient Statistics
  4.2 EXPONENTIAL FAMILY
  4.3 INVARIANCE
    4.3.1 Dimension Reduction by Invariance
    4.3.2 Equivariance
    4.3.3 Maximal Invariants
      4.3.3.1 Examples of Maximal Invariants
    4.3.4 Illustration: Uniformly Most Powerful Invariant Test
  4.4 CONDITIONING ON ANCILLARIES
    4.4.1 Conditionality Principle
    4.4.2 Distribution of MLEs in Location-Scale Families
  4.5 LIKELIHOOD PRINCIPLE
  4.6 COMPLETENESS
    4.6.1 Completeness in Exponential Families
    4.6.2 UMVUE

5 LIKELIHOOD BASED INFERENCE
  5.1 MULTIVARIATE CENTRAL LIMIT THEOREM
  5.2 SLUTSKY'S THEOREM AND THE DELTA METHOD
  5.3 DISTRIBUTION OF MLES
    5.3.1 Assumptions
    5.3.2 Consistency of MLEs
    5.3.3 Distribution of Score Function
    5.3.4 Asymptotic Distribution of MLEs
    5.3.5 Distribution of Maximized Log Likelihood
    5.3.6 Distributions Under Reparameterization
    5.3.7 Large Sample Tests
      5.3.7.1 Example
    5.3.8 Asymptotic Distribution Under Violations
    5.3.9 Likelihood Ratio Test Under Local Alternatives
    5.3.10 Wald's Test
    5.3.11 Rao's Score Test
  5.4 AKAIKE'S AND BAYESIAN INFORMATION CRITERIA
    5.4.1 References
    5.4.2 Kullback-Leibler Information
    5.4.3 Asymptotic Distribution Results
    5.4.4 AIC and TIC
    5.4.5 BIC

6 EDGEWORTH EXPANSIONS
  6.1 CHARACTERISTIC FUNCTIONS
  6.2 HERMITE POLYNOMIALS
  6.3 EDGEWORTH EXPANSION FOR Y
    6.3.1 Cornish-Fisher Expansions
  6.4 GENERAL EDGEWORTH EXPANSIONS
    6.4.1 Expansion of Approximate Pivotal Quantities in Regression Analysis
    6.4.2 Maximum Likelihood Estimators
    6.4.3 Example: MLE of Gamma Shape Parameter

7 SADDLEPOINT EXPANSIONS
  7.1 TILTED EXPONENTIAL PDF
  7.2 THE SADDLEPOINT SOLUTION
    7.2.1 Example: Noncentral χ2
  7.3 SADDLEPOINT APPROXIMATIONS TO MLES IN EXPONENTIAL FAMILIES
    7.3.1 Example: MLE of Gamma Shape Parameter

8 GENERALIZED ESTIMATING FUNCTIONS
  8.1 LEAST SQUARES
  8.2 GENERALIZED ESTIMATING FUNCTIONS IN CLASS G2
  8.3 OPTIMAL CHOICE OF Fi
  8.4 EXAMPLES
  8.5 OPTIMAL ESTIMATING FUNCTIONS
  8.6 ASYMPTOTIC DISTRIBUTIONS OF ESTIMATORS
  8.7 REFERENCES

Chapter 1

RESOURCES

1.1 SYLLABUS

• Required Texts:

– Pace, L. & Salvan, A. (1997). Principles of Statistical Inference from a Neo-Fisherian Perspective, Singapore: World Scientific.

– Journal articles and chapters from various books also will be used.

• Instructor

– Robert J. Boik, 2–260 Wilson, 994-5339, [email protected].

– Office Hours: Monday 11:00–11:50; Tuesday 2:10–3:00; Wednesday 11:00–11:50; Thursday 2:10–3:00.

• Course home page: <http://www.math.montana.edu/∼rjboik/classes/550/stat.550.html>

• Holidays & Other “No Class” Days: Monday Jan 19 (Martin Luther King), Monday January 26: U. Co. talk, Monday Feb 16 (Presidents Day), Wednesday March 3 (Exam Exchange Day), Monday–Friday Mar 15–19 (Spring Break), Friday April 9 (University Day).

• HW: Discussion about HW problems with colleagues is allowed, but written work must be done independently. Late HW will not be accepted without prior arrangements.

• Grading: A Midterm exam will be given on Thursday March 4 at 6:00–8:00 PM (20%). A Final exam will be given on Thursday May 6 at 4:00–5:50 PM (40%). The remaining 40% is from HW.

Syllabus

1. Referee Reports

2. Mathematical Tools

(a) Big O, Little o Notation

(b) Matrix Derivatives

(c) Multivariate Taylor Series

(d) Big Op, Little op Notation

3. Numerical Methods

(a) Newton Raphson Algorithm

(b) Fisher Scoring Algorithm

(c) EM Algorithm

4. Overview of Selected Concepts

(a) Sufficiency


(b) Completeness

(c) UMVUE

(d) Exponential Families

(e) Cumulants

(f) Ancillary Statistics

(g) Invariance & Equivariance

(h) Transformation Family

(i) Maximum Likelihood Estimation

(j) Testing Hypotheses

i. Neyman-Pearson Lemma

ii. Extended Neyman-Pearson Lemma

iii. Invariant Tests

iv. Locally Most Powerful Tests

v. Unbiased Tests

(k) Construction of Confidence Regions

i. Pivotal Quantities

ii. CDF Method

iii. Inverting Tests

5. Large Sample Approximations

(a) Multivariate Central Limit Theorem

(b) Delta Method

(c) Edgeworth Expansions

i. Characteristic Functions & Inversion Theorems

ii. Hermite’s Polynomials

(d) Saddlepoint Approximations

6. First Order Likelihood Based Inference

(a) Asymptotic Distribution of Maximum Likelihood Estimators

(b) Generalized Likelihood Ratio Tests

(c) Other Large Sample Tests (Score, Wald, Lagrange multiplier)

(d) Confidence Regions

(e) Profile Likelihood

(f) Model Selection: Akaike & Bayesian Information Criteria

7. Estimating Functions

(a) Optimal Estimating Functions

(b) Asymptotic Distributions

8. Higher Order Asymptotics

(a) Saddlepoint Approximations

(b) Barndorff-Nielsen’s p∗ Formula

(c) Bartlett Corrections

9. Empirical Likelihood

10. Other Topics as Time Permits

(a) Signed Likelihood Ratio Statistic


(b) Inference when Nuisance Parameters are Present

i. Conditional Likelihood

ii. Marginal Likelihood

iii. Integrated Likelihood

(c) Modified Profile Likelihood Function

1.2 INTERMEDIATE LEVEL REFERENCE BOOKS

1. Bickel, P. J., & Doksum, K. A. (1977). Mathematical Statistics: Basic Ideas and Selected Topics, Oakland, CA: Holden-Day Inc.

2. Casella, G., & Berger, R. L. (2002). Statistical Inference, 2nd edition, Pacific Grove, CA: Duxbury Press.

3. Dudewicz, E. J., & Mishra, S. N. (1988). Modern Mathematical Statistics, New York: John Wiley & Sons.

4. Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the Theory of Statistics, 3rd edition, New York: McGraw-Hill.

5. Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood, Oxford: Oxford University Press.

6. Pestman, W.R. (1998). Mathematical Statistics: An Introduction, Berlin: Walter de Gruyter.

1.3 ADVANCED LEVEL REFERENCE BOOKS

1. Cox, D. R., & Hinkley, D. V. (1974). Theoretical Statistics, London: Chapman & Hall.

2. Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach, New York: Academic Press.

3. Knight, K. (2000). Mathematical Statistics, Boca Raton, FL: Chapman & Hall/CRC.

4. Lehmann, E. L., & Casella, G. (1988). Theory of Point Estimation, 2nd edition, New York: Springer-Verlag.

5. Lehmann, E. L. (1986). Testing Statistical Hypotheses, Second Edition, New York: John Wiley & Sons.

6. Lindsey, J. K. (1996). Parametric Statistical Inference, Oxford: Clarendon Press.

7. Severini, T. A. (2000). Likelihood Methods in Statistics, Oxford: Oxford University Press.

8. Stuart, A., & Ord, K. (1987). Kendall’s Advanced Theory of Statistics, Volume 1: Distribution Theory, (5th ed), New York: Oxford University Press.

1.4 SPECIAL TOPICS BOOKS

1. Barndorff-Nielsen, O. E., & Cox, D. R. (1989). Asymptotic Techniques for use in Statistics, London: Chapman & Hall.

2. Barndorff-Nielsen, O. E., & Cox, D. R. (1994). Inference and Asymptotics, London: Chapman & Hall.

3. Basawa, I. V., Godambe, V. P., & Taylor, R. L. (Eds.). (1997). Selected Proceedings of the Symposium on Estimating Functions, Hayward, CA: IMS.

4. Ferguson, T. S. (1996). A Course in Large Sample Theory, London: Chapman & Hall.

5. Field, C., & Ronchetti, E. (1990). Small Sample Asymptotics, Hayward, CA: Institute of Mathematical Statistics.

6. Ghosh, J. (1994). Higher Order Asymptotics, Hayward, CA: Institute of Mathematical Statistics.

7. Jensen, J. L. (1995). Saddlepoint Approximations, New York: Oxford University Press.


8. Lehmann, E. L. (1999). Elements of Large-Sample Theory, New York: Springer.

9. McCullagh, P. (1987). Tensor Methods in Statistics, London: Chapman & Hall.

10. Owen, A.B. (2001). Empirical Likelihood, Boca Raton, FL: Chapman & Hall/CRC.

11. Sen, P. K., & Singer, J. M. (1996). Large Sample Methods in Statistics, London: Chapman & Hall.

12. van der Vaart, A.W. (1998). Asymptotic Statistics, Cambridge: Cambridge University Press.

1.5 SELECTED JOURNAL ARTICLES & BOOK CHAPTERS

1. Andrews, D. F., & Stafford, J. E. (1993). Tools for the symbolic computation of asymptotic expansions. Journal of the Royal Statistical Society, B 55, 613–617.

2. Barndorff-Nielsen, O., & Cox, D. R. (1979). Edgeworth and saddle-point approximations with statistical applications. Journal of the Royal Statistical Society, B 41, 279–312.

3. Berger, J. O., Liseo, B., & Wolpert, R. L. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statistical Science, 14, 1–28.

4. Efron, B. (1998). R. A. Fisher in the 21st century. Statistical Science, 13, 95–122.

5. Fraser, D. A. S., Wong, A., & Wu, J. (1999). Regression analysis, nonlinear or nonnormal: Simple and accurate p values from likelihood analysis. Journal of the American Statistical Association, 94, 1286–1295.

6. Godambe, V. P., & Kale, B. K. (1991). Estimating functions: An overview. In Estimating Functions, V. P. Godambe (Ed.), 3–20. Oxford: Clarendon Press.

7. Goutis, C., & Casella, G. (1999). Explaining the saddlepoint approximation. The American Statistician, 53,216–224.

8. Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–401.

9. Huzurbazar, S. (1999). Practical saddlepoint approximations. The American Statistician, 53, 225–232.

10. Liang, K-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

11. Liang, K-Y., & Zeger, S. L. (1995). Inference based on estimating functions in the presence of nuisance parameters (with discussion). Statistical Science, 10, 158–173.

12. Prentice, R. L., & Zhao, L. P. (1991). Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics, 47, 825–839.

13. Reid, N. (1995). The roles of conditioning in inference (with discussion). Statistical Science, 10, 138–157.

14. Reid, N. (1988). Saddlepoint methods and statistical inference (with discussion). Statistical Science, 3,213–238.

15. Schwarz, G. (1976). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

16. Stafford, J. E., & Andrews, D. F. (1993). A symbolic algorithm for studying adjustments to the profile likelihood. Biometrika, 80, 715–730.

17. Strawderman, R. L., Casella, G., & Wells, M. T. (1996). Practical small-sample asymptotics for regression problems. Journal of the American Statistical Association, 91, 643–654. Correction, Journal of the American Statistical Association, 92, 1657.

18. Zeger, S. L., Liang, K-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44, 1049–1060.

Chapter 2

ORDER OF CONVERGENCE

Much of the material in this part is taken from chapter 14 in

Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice, Cambridge, MA: MIT Press.

2.1 ORDER

2.1.1 Conventions

In this section, we examine the relative behavior of two sequences an and bn as n → ∞. The following list illustrates how to verbalize the notation.

1. an = o(bn) =⇒ an is little o of bn

2. an = O(bn) =⇒ an is big O of bn

3. an = op(bn) =⇒ an is little o p of bn

4. an = Op(bn) =⇒ an is big O p of bn

The equal sign is read as “is” rather than “equals.” It signifies a relation between the left- and right-hand sides. It does not signify that the two sides are identical. Note that an = o(bn) does not imply o(bn) = an, just as [statistics always is fun] does not imply [fun always is statistics].

2.1.2 Definitions

Let {an}_{n=1}^∞ and {bn}_{n=1}^∞ be two sequences of real numbers and let x be a continuous variable.

1. Definition: an = O(bn) if the ratio |an/bn| is bounded for large n. That is,

∃ a finite number K and an integer n(K) such that n > n(K) =⇒ |an| < K|bn|.

2. Definition: an = o(bn) if the ratio |an/bn| converges to zero. That is,

for any ε > 0, ∃ an integer n(ε) such that n > n(ε) =⇒ |an| < ε|bn|.

3. Definition: a(x) = O[b(x)] as x → L if a(xn) = O[b(xn)] for any sequence xn satisfying xn → L. For example, let g(x) be a scalar-valued function having second-order derivatives in a neighborhood of a. Then, by Taylor's theorem,

g(x) = g(a) + [d g(x)/d x]|_{x=a} (x − a) + O(|x − a|²).

4. Definition: a(x) = o[b(x)] as x → L means that a(xn) = o[b(xn)] for any sequence xn satisfying xn → L. For example, let g(x) be a scalar-valued function having second-order derivatives in a neighborhood of a. Then, by Taylor's theorem,

g(x) = g(a) + [d g(x)/d x]|_{x=a} (x − a) + o(|x − a|).


2.1.3 Properties

1. an = O(an).

2. an = o(1) =⇒ lim_{n→∞} an = 0.

3. an = O(1) =⇒ |an| < K for n > n(K).

4. ano(1) = o(an).

5. anO(1) = O(an).

6. O(an)O(bn) = O(anbn).

7. O(an)o(bn) = o(anbn).

8. o(an)o(bn) = o(anbn).

9. O[o(an)] = o(an).

10. o[o(an)] = o(an).

11. o[O(an)] = o(an).

12. O(an) +O(bn) = max [O(an), O(bn)] = O(max(|an|, |bn|)).

13. O(an) + o(bn) =
      O(an)  if bn = o(an) or bn = O(an),
      O(bn)  if an = O(bn),
      o(bn)  if an = o(bn).

    Note, if bn = o(1), then O(bn) = o(1).

14. o(an) + o(bn) = o(max(|an|, |bn|)).

2.1.4 Examples

1. Geometric Series.

1/(1 + x/n) = 1 − x/n + (x/n)² − (x/n)³ + (x/n)⁴ − · · ·
            = 1 − x/n + (x/n)² [1 − x/n + (x/n)² − (x/n)³ + (x/n)⁴ − · · ·]
            = 1 − x/n + (x/n)² [1/(1 + x/n)]   if |x/n| < 1
            = 1 − x/n + O(n^{-2}),

because

(1/n^{-2}) (x/n)² [1/(1 + x/n)] = x²/(1 + x/n)   and
|x²/(1 + x/n)| ≤ x²/(1 − |x|/n) = n x²/(n − |x|) < (1 + |x|) x²   if n > 1 + |x|.

Accordingly,

1/(1 + x/n) = 1 − x/n + (x/n)² − (x/n)³ + (x/n)⁴ − · · ·
            = 1 + O(n^{-1}) = 1 + o(n^{-1/2})
            = 1 − x/n + O(n^{-2}) = 1 − x/n + o(n^{-3/2}).


2. Expansion of ln(1 + ε).

ln(1 + x/n) = x/n − (1/2)(x/n)² + (1/3)(x/n)³ − (1/4)(x/n)⁴ + · · ·
            = x/n − (1/2)(x/n)² + (1/3)(x/n)³ Σ_{i=0}^∞ [3/(i + 3)] (−1)^i (x/n)^i
            = x/n − (1/2)(x/n)² + O(n^{-3}),

because

|(1/n^{-3}) (1/3)(x/n)³ Σ_{i=0}^∞ [3/(i + 3)] (−1)^i (x/n)^i| < (|x|³/3) Σ_{i=0}^∞ (|x|/n)^i   and
(|x|³/3) Σ_{i=0}^∞ (|x|/n)^i = |x|³/[3(1 − |x|/n)] = n|x|³/[3(n − |x|)] < |x|³(1 + |x|)/3   if n > 1 + |x|.

Accordingly,

ln(1 + x/n) = x/n − (1/2)(x/n)² + (1/3)(x/n)³ − (1/4)(x/n)⁴ + · · ·
            = O(n^{-1}) = o(n^{-1/2})
            = x/n + O(n^{-2}) = x/n + o(n^{-3/2}).

3. General Series.

Theorem 1 Suppose that

fn(x) = a0 + a1[(x − x0)/n] + a2[(x − x0)/n]² + a3[(x − x0)/n]³ + · · ·,

where lim_{i→∞} |a_{i+1}/a_i| is finite. Then,

fn(x) = a0 + a1[(x − x0)/n] + a2[(x − x0)/n]² + a3[(x − x0)/n]³ + O(n^{-4})
      = a0 + a1[(x − x0)/n] + a2[(x − x0)/n]² + O(n^{-3})
      = a0 + a1[(x − x0)/n] + O(n^{-2})
      = a0 + O(n^{-1})
      = O(1).

Proof: The above result can be verified by using the ratio test for convergence of infinite series. For example,

fn(x) = a0 + a1[(x − x0)/n] + a2[(x − x0)/n]² [1 + Σ_{i=1}^∞ (a_{i+2}/a2) ((x − x0)/n)^i]
      = a0 + a1[(x − x0)/n] + O(n^{-2}),

because

|(a2/n^{-2}) [(x − x0)/n]² [1 + Σ_{i=1}^∞ (a_{i+2}/a2)((x − x0)/n)^i]| ≤ |a2|(x − x0)² [1 + Σ_{i=1}^∞ u_i],   where u_i = |a_{i+2}/a2| (|x − x0|/n)^i.

The proof is completed by showing that the series Σ_{i=1}^∞ u_i is finite. The ratio test verifies that the series is finite because

lim_{i→∞} |u_{i+1}/u_i| = lim_{i→∞} |a_{i+3}/a_{i+2}| |x − x0|/n < 1   if n > |x − x0| lim_{i→∞} |a_{i+1}/a_i|.

The above Theorem can be stated in a slightly different manner.

Theorem 2 Suppose that

fn(x) = O(1) + O(n^{-1}) + O(n^{-2}) + O(n^{-3}) + · · ·.

Then,

fn(x) = O(1) + O(n^{-1}) + O(n^{-2}) + O(n^{-3}) + O(n^{-4})
      = O(1) + O(n^{-1}) + O(n^{-2}) + O(n^{-3})
      = O(1) + O(n^{-1}) + O(n^{-2})
      = O(1) + O(n^{-1})
      = O(1).

Proof: The above result can be verified by using the ratio test for convergence of infinite series. For example,

fn(x) = O(1) + O(n^{-1}) + O(n^{-2}) [O(1) + Σ_{i=1}^∞ O(n^{-i})]
      = O(1) + O(n^{-1}) + O(n^{-2}),

because

|(1/n^{-2}) O(n^{-2}) Σ_{i=0}^∞ O(n^{-i})| = O(1) Σ_{i=0}^∞ O(n^{-i}).

The proof is completed by showing that the series Σ_{i=0}^∞ O(n^{-i}) is finite. The ratio test verifies that the series is finite because

lim_{i→∞} |O(n^{-(i+1)})/O(n^{-i})| = O(n^{-1}) < 1   if n is sufficiently large.

2.2 ORDER IN PROBABILITY

2.2.1 Definitions

Let {bn}_{n=1}^∞ be a sequence of real numbers and let {Xn}_{n=1}^∞ be a sequence of random variables.

1. Definition: Xn = op(1) if Xn → 0 in probability. That is,

for every ε > 0, lim_{n→∞} Pr(|Xn| < ε) = 1

or, equivalently, for every ε > 0 and for every η > 0, ∃ an integer n(ε, η) such that if n > n(ε, η) then

Pr(|Xn| < ε) ≥ 1 − η.

One can say, informally, that Xn = op(1) if Xn = o(1) with arbitrarily high probability.

2. Definition: Xn = op(bn) if Xn/bn = op(1).


3. Convergence in probability, such as Xn → 0 in probability, is a special case of convergence in measure, the measure being the distribution function.

4. Definition: Xn = Op(1) if for every η > 0, ∃ a number K(η) and an integer n(η) such that if n > n(η), then

Pr [|Xn| ≤ K(η)] ≥ 1 − η.

One can say, informally, that Xn = Op(1) if Xn = O(1) with arbitrarily high probability.

5. Definition: Xn = Op(bn) if Xn/bn = Op(1).

2.2.2 Properties

1. Xn = op(1) =⇒ Xn → 0 in probability.

2. Xn = Op(1) =⇒ |Xn| < K with arbitrarily high probability for n > n(K).

3. anop(1) = op(an).

4. anOp(1) = Op(an).

5. Op(an)Op(bn) = Op(anbn).

6. Op(an)op(bn) = op(anbn).

7. op(an)op(bn) = op(anbn).

8. Op[o(an)] = op(an).

9. op[o(an)] = op(an).

10. op[O(an)] = op(an).

11. Op(an) +Op(bn) = max [Op(an), Op(bn)] = Op(max(|an|, |bn|)).

12. Op(an) + op(bn) =

Op(an) if bn = o(an) or bn = O(an)

Op(bn) if an = O(bn)

op(bn) if an = o(bn).

Note, if bn = o(1), then Op(bn) = op(1).

13. op(an) + op(bn) = op(max(|an|, |bn|)).

14. o(an)Op(bn) = op(anbn)

15. O(an)op(bn) = op(anbn)

16. You are only as big as your standard deviation. That is, if Xn is a stochastic sequence with E(Xn) = µn and Var(Xn) = σn², then Xn − µn = Op(σn).

For proof, use Tchebychev's inequality.

17. If Xn → X in distribution, where X is a random variable, then Xn = Op(1).

18. Let X be a random variable. The notation Xn =_{a.d.} X means that Xn and X are asymptotically equal in distribution. The relation =_{a.d.} is defined as

Xn =_{a.d.} X if Xn − X = op(1).
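A small simulation sketch of property 16 (my own illustration, not from the notes): for the sample mean of iid N(µ, σ²) data, σn = σ/√n, so √n(X̄n − µ) stays bounded in probability; the simulated quantiles below stabilize rather than grow with n.

    import numpy as np

    # Property 16: Xbar_n - mu = Op(sigma/sqrt(n)), i.e. sqrt(n)(Xbar_n - mu)
    # remains bounded in probability as n increases.
    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 3.0
    for n in [10**2, 10**3, 10**4]:
        xbar = rng.normal(mu, sigma, size=(2000, n)).mean(axis=1)
        print(n, np.quantile(np.sqrt(n) * (xbar - mu), [0.05, 0.95]))
        # the 5% and 95% quantiles settle near -1.645*sigma and +1.645*sigma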


Chapter 3

DERIVATIVES & OPTIMIZATION

3.1 MATRIX DERIVATIVES

3.1.1 Derivative of a Matrix with Respect to a Matrix

Let Y : p × q be a matrix of variables that are functions of the elements of X : m × n. That is, Y is a matrix function of X. Then the derivative of Y with respect to X is the pm × qn matrix

∂Y/∂X = (∂/∂X) ⊗ Y =
[ ∂Y/∂x_11  ∂Y/∂x_12  · · ·  ∂Y/∂x_1n ]
[ ∂Y/∂x_21  ∂Y/∂x_22  · · ·  ∂Y/∂x_2n ]
[    ...        ...    · · ·     ...   ]
[ ∂Y/∂x_m1  ∂Y/∂x_m2  · · ·  ∂Y/∂x_mn ],

where

∂Y/∂x_ij =
[ ∂y_11/∂x_ij  ∂y_12/∂x_ij  · · ·  ∂y_1q/∂x_ij ]
[ ∂y_21/∂x_ij  ∂y_22/∂x_ij  · · ·  ∂y_2q/∂x_ij ]
[     ...           ...      · · ·      ...     ]
[ ∂y_p1/∂x_ij  ∂y_p2/∂x_ij  · · ·  ∂y_pq/∂x_ij ].

The Kronecker product notation serves as an instruction as to how to arrange the derivatives. If Y is a scalar-valued function and X is a vector, then the matrix derivative simplifies. For example, suppose that g(x) is a scalar-valued function of the m × 1 vector x. Then,

∂g(x)/∂x = (∂/∂x) ⊗ g(x) = (∂g(x)/∂x_1, ∂g(x)/∂x_2, ..., ∂g(x)/∂x_m)′,  and

∂g(x)/∂x′ = (∂/∂x′) ⊗ g(x) = (∂g(x)/∂x_1, ∂g(x)/∂x_2, ..., ∂g(x)/∂x_m).

Another way to represent the matrix of derivatives is

∂Y/∂X = Σ_{i=1}^m Σ_{j=1}^n (e_i^m e_j^{n′} ⊗ ∂Y/∂x_ij).

In many applications, it is more convenient to work with

∂vec(Y)/∂vec(X) = Σ_{i=1}^{mn} (e_i^{mn} ⊗ ∂y/∂x_i)

or with

∂vec(Y)/∂vec(X)′ = Σ_{i=1}^{mn} (e_i^{mn′} ⊗ ∂y/∂x_i),

where y = vec(Y) and x = vec(X).
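The arrangement can be checked numerically. The sketch below (my own Python/numpy illustration; the function F and the 2 × 2 example are arbitrary choices) builds ∂Y/∂X by finite differences, placing the p × q block ∂Y/∂x_ij in position (i, j) exactly as the Kronecker notation instructs; for Y = XX, the (i, j) block equals E_ij X + X E_ij.

    import numpy as np

    def matrix_derivative(F, X, h=1e-6):
        """Finite-difference approximation of dY/dX, block (i,j) = dY/dx_ij."""
        Y = F(X)
        p, q = Y.shape
        m, n = X.shape
        D = np.zeros((p * m, q * n))
        for i in range(m):
            for j in range(n):
                Xp = X.copy()
                Xp[i, j] += h
                D[i * p:(i + 1) * p, j * q:(j + 1) * q] = (F(Xp) - Y) / h
        return D

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    D = matrix_derivative(lambda M: M @ M, X)
    print(D.shape)                                   # (4, 4) = (pm, qn)

    i, j = 0, 1
    E = np.zeros((2, 2)); E[i, j] = 1.0              # dX/dx_ij
    print(np.allclose(D[i*2:(i+1)*2, j*2:(j+1)*2], E @ X + X @ E, atol=1e-4))  # True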

Theorem 3 (Transpose of Derivatives) Let Y : p × q be a matrix of variables depending on the elements of X : m × n. Then,

[∂Y/∂X]′ = ∂Y′/∂X′.

Proof: HW.

3.1.2 Derivative of a Product

Let X : m × n and Y : n × r be matrices of variables which depend on the elements of Z : p × q. Then the derivative of XY with respect to Z is the mp × rq matrix

∂(XY)/∂Z = (∂X/∂Z)(I_q ⊗ Y) + (I_p ⊗ X)(∂Y/∂Z),

where I_q and I_p are identity matrices of order q and p.

3.1.3 Derivative of a Kronecker Product

Let X : m × n and Y : q × r be matrices of variables which depend on the elements of Z : s × t. Then the derivative of X ⊗ Y with respect to Z is the mqs × nrt matrix

∂(X ⊗ Y)/∂Z = (∂X/∂Z ⊗ Y) + (I_s ⊗ I_(q,m))(∂Y/∂Z ⊗ X)(I_t ⊗ I_(n,r)),

where I_s and I_t are identity matrices of order s and t, and I_(q,m) and I_(n,r) are vec permutation matrices.

3.1.4 Chain Rule for Matrices

Let Z be a p × q matrix whose entries are a matrix function of the elements of Y : s × t, where Y is a function of X : m × n. That is,

Z = H1(Y), where Y = H2(X).

Note, H1 : R^{s×t} → R^{p×q} and H2 : R^{m×n} → R^{s×t}. Accordingly, Z = F(X), where F is the composite function F(X) = H1(H2(X)). The matrix of derivatives of Z with respect to X is given by

∂Z/∂X = {∂[vec(Y)]/∂X ⊗ I_p}{I_n ⊗ ∂Z/∂vec(Y)}.

Consider the special case where z = h1(y) and y = h2(x). That is, z is a scalar, y is a vector, and x is a vector. Then

∂z/∂x = (∂y′/∂x ⊗ I_1)(I_1 ⊗ ∂z/∂y) = (∂y′/∂x)(∂z/∂y).


3.1.5 Derivatives of Inner Products and Quadratic Forms

Let c : n × 1 and A : n × n be fixed constants. Let x : n × 1 be a vector of variables. Then, the following results can be established.

1. ∂ x/∂ x = vec(In).

2. ∂ x′/∂ x′ = vec′(In).

3. ∂ x/∂ x′ = ∂ x′/∂ x = In.

4. ∂ (x′c)/∂ x = ∂ (c′x)/∂ x = c.

5. ∂ (Ax)/∂ x = vec(A).

6. ∂ (Ax)/∂ x′ = A.

7. ∂ (x′A)/∂ x′ = vec′(A′).

8. ∂ (x′A)/∂ x = A.

9. ∂ (x′Ax)/∂ x = Ax + A′x which simplifies to 2Ax if A is symmetric.
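As a numerical sanity check of result 9 (a Python/numpy sketch of my own, not part of the notes), a central-difference gradient of x′Ax matches Ax + A′x:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    A = rng.normal(size=(n, n))          # deliberately not symmetric
    x = rng.normal(size=n)
    f = lambda v: v @ A @ v              # quadratic form x'Ax

    h = 1e-6
    grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)])
    print(np.allclose(grad_fd, A @ x + A.T @ x, atol=1e-5))   # True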

3.1.6 Derivatives of Traces and Determinants

1. ∂ tr(XA)/∂X = A′ if the elements of X are functionally independent.

2. ∂ tr(XA)/∂X = A + A′ − diag(aii) if X is symmetric.

3. ∂ tr(XY)/∂ z = tr[(∂X/∂ z)Y] + tr[X(∂Y/∂ z)].

4. ∂ ln |X|/∂ y = tr[X−1(∂X/∂ y)] for symmetric and non-symmetric X.

5. ∂ ln |X|/∂X = 2X−1 − Diag(X−1) if X is symmetric. This result follows from (4) above.

6. ∂ ln |X|/∂X = X−1 if X is not symmetric. This result follows from (4) above.

3.1.7 Derivatives of Inverses

1. Let X be an n× n nonsingular matrix whose entries are functions of a scalar y. That is, X = F(y). Then,

∂X−1/∂ y = −X−1(∂X/∂ y)X−1.

2. Let X be an n × n nonsingular matrix whose entries are functions of Y, a p × q matrix. That is, X = F(Y). Then,

∂X^{-1}/∂Y = −(I_p ⊗ X^{-1}) (∂X/∂Y) (I_q ⊗ X^{-1}).
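A quick finite-difference check of result 1 (my own sketch; the parameterization X(y) = X0 + yB is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 3
    X0 = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned base matrix
    B = rng.normal(size=(n, n))                    # dX/dy when X(y) = X0 + y*B
    inv = np.linalg.inv

    h = 1e-6
    fd = (inv(X0 + h * B) - inv(X0 - h * B)) / (2 * h)
    print(np.allclose(fd, -inv(X0) @ B @ inv(X0), atol=1e-5))   # True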

3.1.8 Derivatives of Exponential Functions

Let c : n × 1 and A : n × n be fixed constants. Let x be an n × 1 vector of variables and let g(x) be a scalar function of x. Then, by the chain rule,

∂ exp{g(x)}/∂x = [∂g(x)/∂x] exp{g(x)}.

Examples:

1. ∂ exp{x′c}/∂x = c exp{x′c}

2. ∂ exp{x′Ax}/∂x = (Ax + A′x) exp{x′Ax}


3.1.9 Second-Order and Higher-Order Derivatives

3.1.9.1 Second-Order Derivatives

Let Y be a p × q matrix of variables whose elements are functions of the matrices X : m × n and Z : s × t. Then

∂²Y/(∂X ⊗ ∂Z) = (∂/∂X) ⊗ (∂Y/∂Z) = (∂/∂X) ⊗ (∂/∂Z) ⊗ Y
             = Σ_{i=1}^m Σ_{j=1}^n Σ_{k=1}^s Σ_{ℓ=1}^t (e_i^m e_j^{n′} ⊗ e_k^s e_ℓ^{t′} ⊗ ∂²Y/(∂x_ij ∂z_kℓ)).

Note that the matrix of second derivatives has dimension pms × qnt.

3.1.9.2 Higher-Order Derivatives

Higher-order derivatives can be defined recursively. Suppose that Y is a p × q matrix whose elements are functions of the matrices X_1, X_2, ..., X_g, where X_r has dimension m_r × n_r. Denote the ijth element of X_r by (X_r)_ij. Then

∂^r Y/(∂X_r ⊗ ∂X_{r−1} ⊗ · · · ⊗ ∂X_1) = (∂/∂X_r) ⊗ Y^{(r−1)}
   = Σ_{i=1}^{m_r} Σ_{j=1}^{n_r} (e_i^{m_r} e_j^{n_r′} ⊗ ∂Y^{(r−1)}/∂(X_r)_ij),

where

Y^{(r−1)} = ∂^{r−1}Y/(∂X_{r−1} ⊗ ∂X_{r−2} ⊗ · · · ⊗ ∂X_1).

Note that the matrix of rth-order derivatives has dimension p ∏_{i=1}^r m_i × q ∏_{i=1}^r n_i.

It is allowed that X_i = X_j for any or all pairs. For example, if X_i = X for all i and X is m × n, then

∂⁴Y/(∂X ⊗ ∂X ⊗ ∂X ⊗ ∂X) = Σ_{i=1}^m Σ_{j=1}^n (e_i^m e_j^{n′} ⊗ ∂Y^{(3)}/∂x_ij),  where  Y^{(3)} = ∂³Y/(∂X ⊗ ∂X ⊗ ∂X).

In this case, the matrix of 4th-order derivatives has dimension pm⁴ × qn⁴.

3.1.9.3 Special Case: Derivatives of a Scalar with Respect to Vectors

Suppose that Q is a scalar-valued function of the vectors x_1, x_2, ..., x_g, where x_r has dimension n_r × 1. Denote the ith element of x_r by x_ri. Then,

∂Q/∂x_r = Σ_{i=1}^{n_r} (e_i^{n_r} ⊗ ∂Q/∂x_ri)   is n_r × 1,

∂²Q/(∂x_r ⊗ ∂x_s) = Σ_{i=1}^{n_r} Σ_{j=1}^{n_s} (e_i^{n_r} ⊗ e_j^{n_s} ⊗ ∂²Q/(∂x_ri ∂x_sj))   is n_r n_s × 1,

∂²Q/(∂x_r ⊗ ∂x_s′) = Σ_{i=1}^{n_r} Σ_{j=1}^{n_s} (e_i^{n_r} ⊗ e_j^{n_s′} ⊗ ∂²Q/(∂x_ri ∂x_sj))   is n_r × n_s,

∂²Q/(∂x_r ⊗ ∂x_s′) = ∂²Q/(∂x_s′ ⊗ ∂x_r),

∂³Q/(∂x_r ⊗ ∂x_s′ ⊗ ∂x_t′) = Σ_{i=1}^{n_r} Σ_{j=1}^{n_s} Σ_{k=1}^{n_t} (e_i^{n_r} ⊗ e_j^{n_s′} ⊗ e_k^{n_t′} ⊗ ∂³Q/(∂x_ri ∂x_sj ∂x_tk))   is n_r × n_s n_t,  and

∂³Q/(∂x_r ⊗ ∂x_t′ ⊗ ∂x_s′) = Σ_{i=1}^{n_r} Σ_{k=1}^{n_t} Σ_{j=1}^{n_s} (e_i^{n_r} ⊗ e_k^{n_t′} ⊗ e_j^{n_s′} ⊗ ∂³Q/(∂x_ri ∂x_sj ∂x_tk))
   = [∂³Q/(∂x_r ⊗ ∂x_s′ ⊗ ∂x_t′)] I_(n_t, n_s).

3.1.10 Miscellaneous Derivatives

Let X be an n × p matrix. Then

1. ∂X/∂vec X = I_p ⊗ vec I_n

2. ∂X′/∂vec X = (I_p ⊗ I_(p,n))(vec I_p ⊗ I_n) = I_(np,p)(I_n ⊗ vec I_p)

3. ∂X′/∂(vec X)′ = I_p ⊗ (vec I_n)

4. ∂[vec(X′X)]/∂vec X = (I_p ⊗ X)(I_{p²} + I_(p,p))

5. ∂vec(XX′)/∂vec X = vec[X ⊗ (vec I_n)′] + vec(X ⊗ I_n)

6. ∂X′X/∂(vec X)′ = [I_p ⊗ (vec X′)′] + [(vec I_p)′ ⊗ X′](I_p ⊗ I_(n,p))

7. ∂vec[(X′X)^{-1}]/∂vec X = −[I_{np} ⊗ (X′X)^{-1} ⊗ I_p][∂X′X/∂vec X ⊗ I_p] vec[(X′X)^{-1}]
   = −vec{(X′X)^{-1} [∂X′X/∂(vec X)′] [I_{np} ⊗ (X′X)^{-1}]},

   provided that X has full column rank.

3.1.11 Vec and Vech

3.1.11.1 Vec Operator

Let A be a p × q matrix: A = (a_1  a_2  · · ·  a_q). The vec of A (vec for vector) is obtained by successively stacking the columns of A into a single vector:

vec(A) = (a_1′, a_2′, ..., a_q′)′.


Theorem 4 Some useful results are the following.

1. vec(ABC) = (C′ ⊗ A) vec(B).

2. tr(AB) = vec′(A′) vec(B).

3. ‖A‖² = tr(A′A) = vec′(A) vec(A).

Proof: In class or HW (maybe).

For other results on the vec operator see MacRae (1974).
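Results 1 and 2 of Theorem 4 are easy to confirm numerically. In the Python/numpy sketch below (my own illustration), vec is column-stacking, i.e. reshaping in Fortran order:

    import numpy as np

    vec = lambda M: M.reshape(-1, order='F')     # stack the columns of M

    rng = np.random.default_rng(3)
    A = rng.normal(size=(2, 3))
    B = rng.normal(size=(3, 4))
    C = rng.normal(size=(4, 5))
    print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))   # result 1: True

    M, N = rng.normal(size=(3, 4)), rng.normal(size=(4, 3))
    print(np.isclose(np.trace(M @ N), vec(M.T) @ vec(N)))          # result 2: True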

3.1.11.2 Vech Operator

For symmetric A : n × n, the vech of A (vech for vector half) is the n(n+1)/2 × 1 vector obtained by stacking the unique elements in the columns of A into a single vector. For example, if A is a 4 × 4 symmetric matrix, then

A =
[ a11 a12 a13 a14 ]
[ a12 a22 a23 a24 ]
[ a13 a23 a33 a34 ]
[ a14 a24 a34 a44 ]

and vech(A) = (a11, a12, a13, a14, a22, a23, a24, a33, a34, a44)′.

Some useful results are the following.

1. If A is an n × n symmetric matrix, then vec(A) = D_n vech(A), where D_n : n² × n(n+1)/2 is a unique matrix that depends on A only through its dimensions. The matrix D_n is called the duplication matrix because it duplicates the off-diagonal elements in vech(A) to create vec(A) (Magnus & Neudecker, 1999).

2. If A is an n × n symmetric matrix, then vech(A) = H vec(A), where H : n(n+1)/2 × n² is not unique and depends on A only through its dimensions. One form of H is (D_n′D_n)^{-1}D_n′.

3. D_n(D_n′D_n)^{-1}D_n′ = (1/2)(I_{n²} + I_(n,n)), and this is the ppo that projects onto the space of symmetric matrices. That is, (1/2)(I_{n²} + I_(n,n)) vec(A) = vec(A) iff A is symmetric.

4. I_{n²} − D_n(D_n′D_n)^{-1}D_n′ = (1/2)(I_{n²} − I_(n,n)), and this is the ppo that projects onto the space of skew-symmetric matrices. That is, (1/2)(I_{n²} − I_(n,n)) vec(A) = vec(A) iff A is skew-symmetric.

5. HD_n = I for D_n in (1) and any H in (2).

6. |H(A ⊗ A)D_n| = |A|^{n+1}.

For other results, see Henderson & Searle (1979)
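The duplication matrix is easy to construct column by column: the column associated with position (i, j), i ≥ j, is the vec of the matrix with ones in positions (i, j) and (j, i) and zeros elsewhere. The Python sketch below (my construction, not from the notes) builds D_n this way and checks results 1, 2, and 5:

    import numpy as np

    vec = lambda M: M.reshape(-1, order='F')

    def vech(A):
        n = A.shape[0]
        return np.concatenate([A[j:, j] for j in range(n)])   # lower triangle by columns

    def duplication(n):
        cols = []
        for j in range(n):
            for i in range(j, n):
                U = np.zeros((n, n))
                U[i, j] = U[j, i] = 1.0
                cols.append(vec(U))
        return np.column_stack(cols)

    n = 4
    rng = np.random.default_rng(4)
    S = rng.normal(size=(n, n)); S = S + S.T                   # symmetric test matrix
    D = duplication(n)
    H = np.linalg.solve(D.T @ D, D.T)                          # H = (D'D)^{-1} D'
    print(np.allclose(vec(S), D @ vech(S)))                    # result 1: True
    print(np.allclose(vech(S), H @ vec(S)))                    # result 2: True
    print(np.allclose(H @ D, np.eye(D.shape[1])))              # result 5: True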

3.1.11.3 Vec Permutation Matrix (Commutation Matrix)

Let A be an r × c matrix. The entries in vec(A′) can be obtained by permuting the entries of vec(A). In particular,

vec(A′) = I_(c,r) vec(A),

where I_(c,r) is a permutation matrix. That is, I_(c,r) is an rc × rc identity matrix in which the rows have been permuted. The permutation matrix I_(c,r) can be computed as follows:

I_(c,r) = Σ_{i=1}^c (e_i′ ⊗ I_r ⊗ e_i)   or   I_(c,r) = Σ_{j=1}^r (e_j ⊗ I_c ⊗ e_j′),

where e_i is the ith column of I_c and e_j is the jth column of I_r. The following properties are readily established.


1. I(c,1) = Ic and I(1,r) = Ir.

2. I(r,c) = I′(c,r).

3. I(c,r)I(r,c) = Irc.

4. (A ⊗ B) = I(p,r)(B ⊗ A)I(c,q), where A is r × c and B is p × q.
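The summation formula for I_(c,r) and properties 1–3 can be verified directly. A small Python/numpy sketch (mine; the dimensions are arbitrary):

    import numpy as np

    vec = lambda M: M.reshape(-1, order='F')

    def vec_permutation(c, r):
        """I_(c,r) = sum_i (e_i' kron I_r kron e_i), e_i the ith column of I_c."""
        Ic, Ir = np.eye(c), np.eye(r)
        return sum(np.kron(np.kron(Ic[:, [i]].T, Ir), Ic[:, [i]]) for i in range(c))

    rng = np.random.default_rng(5)
    r, c = 3, 5
    A = rng.normal(size=(r, c))
    K = vec_permutation(c, r)
    print(np.allclose(vec(A.T), K @ vec(A)))                        # vec(A') = I_(c,r) vec(A)
    print(np.allclose(K @ vec_permutation(r, c), np.eye(r * c)))    # property 3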

3.1.11.4 Miscellaneous Vec Results

The following results have been useful in my own research. I’m listing them here so that I don’t lose them.

1. vec(ABC) = [I_s ⊗ vec B′ ⊗ I_p]′ [vec(C) ⊗ vec(A)], where A is p × q, B is q × r, and C is r × s.

2. vec(A ⊗ b) = [vec(A)] ⊗ b, where A is a matrix and b is a column vector.

3. b ⊗ [vec(A)] = vec(b′ ⊗ A), where A is a matrix and b is a column vector.

4. AB = {I_r ⊗ [vec(B′)]′}{vec(A′) ⊗ I_p}, where A is r × c and B is c × p.

5. AB = {[vec(B)]′ ⊗ I_r}{I_p ⊗ vec(A)}, where A is r × c and B is c × p.

6. A = [(vec I_b)′ ⊗ I_a](I_b ⊗ I_(a,b))(vec A ⊗ I_b) = [I_a ⊗ (vec I_b)′](I_(b,a) ⊗ I_b)(vec A ⊗ I_b), where A is any a × b matrix.

7. (ABC ⊗ D)E = {A ⊗ [vec(C′)]′ ⊗ D}[vec(B′) ⊗ E].

8. E(ABC ⊗ D) = [(vec B)′ ⊗ E](C ⊗ vec A ⊗ D).

9. (A ⊗ BCD)E = [A ⊗ (vec D)′ ⊗ B](E ⊗ vec C).

10. E(A ⊗ BCD) = {E ⊗ [vec(C′)]′}[A ⊗ vec(B′) ⊗ D].

11. [A ⊗ (vec B)′ ⊗ C] = [A ⊗ (vec I_s)′ ⊗ (vec I_r)′ ⊗ C]{[I_(s,q) ⊗ I_(r,s)] I_(rs,qs) [vec(B′) ⊗ I_(q,s)] ⊗ I_ru}, where A is p × q, B is r × s, and C is t × u.

12. (A ⊗ B)C = I_(r,p)[I_r ⊗ (vec I_s)′ ⊗ A][vec(B′) ⊗ I_(q,s)C], where A is p × q and B is r × s. This is a special case of result 7.

13. vec(A ⊗ B) = (I_q ⊗ I_(p,s) ⊗ I_r)(vec A ⊗ vec B), where A is p × q and B is r × s.

3.1.12 Some Cautions About Notation

Some authors use a different arrangement for matrix derivatives. Let Y : p × q be a matrix of variables depending on the elements of X : m × n. MacRae (1974) and Balestra (1976), for example, define the derivative of Y with respect to X as the pm × qn matrix

∂Y/∂X = Y ⊗ (∂/∂X) = Σ_{i=1}^m Σ_{j=1}^n (∂Y/∂x_ij ⊗ e_i^m e_j^{n′})

instead of the definition in Section 3.1.1, which is

∂Y/∂X = (∂/∂X) ⊗ Y = Σ_{i=1}^m Σ_{j=1}^n (e_i^m e_j^{n′} ⊗ ∂Y/∂x_ij).

Unfortunately, there does not appear to be a dominant convention. We will use only the definition in Section 3.1.1. Magnus and Neudecker (1999, pp. 171–173) criticize both of the above definitions because they do not yield multivariate Jacobians in a direct manner. Their argument, however, is not substantive. They prefer to evaluate matrix derivatives as the pq × mn matrix

∂vec(Y)/∂vec(X)′,

which can be obtained by either of the definitions

∂vec(Y)/∂vec(X)′ = Σ_{i=1}^{mn} (e_i^{mn′} ⊗ ∂y/∂x_i) = Σ_{i=1}^{mn} (∂y/∂x_i ⊗ e_i^{mn′}),

where y = vec(Y) and x = vec(X).

3.2 EXAMPLE: DERIVATIVES OF LIKELIHOOD FUNCTIONS

Let y be an n × 1 random vector with joint distribution f_Y(y; θ), where θ is an m × 1 vector of unknown parameters. The likelihood and log likelihood functions are

L(θ; y) ∝ f_Y(y; θ)   and   ℓ(θ; y) = ln[L(θ; y)],

respectively.

3.2.1 First-Order Derivatives

The first-order derivative of ℓ is

ℓ_1(θ; y) = ∂ℓ(θ; y)/∂θ = Σ_{i=1}^m e_i^m ∂ℓ(θ; y)/∂θ_i,

where e_i^m is the ith column of I_m. Often it is convenient to partition the parameter vector as

θ = (θ_1′, θ_2′, ..., θ_k′)′,

where θ_i is an m_i × 1 vector. Denote the length of θ by m·. That is,

m· = Σ_{i=1}^k m_i = 1_k′ m,  where  m = (m_1, m_2, ..., m_k)′

and 1_k is a k-vector of ones. The vector ℓ_1 can be written as

ℓ_1(θ; y) = Σ_{i=1}^k E_{i,m} ∂ℓ(θ; y)/∂θ_i,  where  E_{i,m} =
[ 0_{a_i × m_i} ]
[ I_{m_i}       ]
[ 0_{b_i × m_i} ],

a_i = −m_i + Σ_{j=1}^i m_j, and b_i = m· − m_i − a_i.

Note that E_{i,m} is m· × m_i.

3.2.2 Second-Order Derivatives

If the vector of parameters has not been partitioned, then the matrix of second-order derivatives of the log likelihood function is

ℓ_2(θ; y) = ∂²ℓ(θ; y)/(∂θ′ ⊗ ∂θ) = (∂/∂θ′)(∂ℓ(θ; y)/∂θ)

 = Σ_{i=1}^m Σ_{j=1}^m e_i^{m′} ⊗ e_j^m ⊗ ∂²ℓ(θ; y)/(∂θ_i ∂θ_j)

 = Σ_{i=1}^m Σ_{j=1}^m e_j^m ⊗ e_i^{m′} ⊗ ∂²ℓ(θ; y)/(∂θ_i ∂θ_j)   [premultiplying by I_m = I_(m,1) ⊗ 1 and postmultiplying by I_m = I_(m,1) ⊗ 1]

 = Σ_{i=1}^m Σ_{j=1}^m e_i^m ⊗ e_j^{m′} ⊗ ∂²ℓ(θ; y)/(∂θ_i ∂θ_j)   [changing the order of summation]

 = ∂²ℓ(θ; y)/(∂θ ⊗ ∂θ′)   [by definition]

 = Σ_{i=1}^m Σ_{j=1}^m e_i^m [∂²ℓ(θ; y)/(∂θ_i ∂θ_j)] e_j^{m′}.

If the vector of parameters has been partitioned, then the matrix of second derivatives of the log likelihood function is

ℓ_2(θ; y) = ∂²ℓ(θ; y)/(∂θ′ ⊗ ∂θ) = (∂/∂θ′)(∂ℓ(θ; y)/∂θ) = Σ_{i=1}^k Σ_{j=1}^k E_{i,m} [∂²ℓ(θ; y)/(∂θ_i ⊗ ∂θ_j′)] E_{j,m}′.

3.2.3 Higher-Order Derivatives

If the vector of parameters has not been partitioned, then the matrix of third-order derivatives of the log likelihood function is

ℓ_3(θ; y) = ∂³ℓ(θ; y)/(∂θ′ ⊗ ∂θ′ ⊗ ∂θ) = (∂/∂θ′)[∂²ℓ(θ; y)/(∂θ′ ⊗ ∂θ)]

 = Σ_{i=1}^m Σ_{j=1}^m Σ_{k=1}^m e_i^{m′} ⊗ e_j^{m′} ⊗ e_k^m ⊗ ∂³ℓ(θ; y)/(∂θ_i ∂θ_j ∂θ_k)

 = Σ_{i=1}^m Σ_{j=1}^m Σ_{k=1}^m e_k^m ⊗ e_i^{m′} ⊗ e_j^{m′} ⊗ ∂³ℓ(θ; y)/(∂θ_i ∂θ_j ∂θ_k)

 = ∂³ℓ(θ; y)/(∂θ ⊗ ∂θ′ ⊗ ∂θ′)

 = Σ_{i=1}^m Σ_{j=1}^m Σ_{k=1}^m e_i^m [∂³ℓ(θ; y)/(∂θ_i ∂θ_j ∂θ_k)] (e_j^{m′} ⊗ e_k^{m′}).

If the vector of parameters has been partitioned, then the matrix of third-order derivatives of the log likelihood function is

ℓ_3(θ; y) = ∂³ℓ(θ; y)/(∂θ′ ⊗ ∂θ′ ⊗ ∂θ) = (∂/∂θ′)[∂²ℓ(θ; y)/(∂θ′ ⊗ ∂θ)]

 = ∂³ℓ(θ; y)/(∂θ ⊗ ∂θ′ ⊗ ∂θ′) = (∂/∂θ)[∂²ℓ(θ; y)/(∂θ′ ⊗ ∂θ′)]

 = Σ_{s=1}^k Σ_{t=1}^k Σ_{u=1}^k E_{u,m} [∂³ℓ(θ; y)/(∂θ_s′ ⊗ ∂θ_t′ ⊗ ∂θ_u)] (E_{s,m}′ ⊗ E_{t,m}′)

 = Σ_{s=1}^k Σ_{t=1}^k Σ_{u=1}^k E_{s,m} [∂³ℓ(θ; y)/(∂θ_s ⊗ ∂θ_t′ ⊗ ∂θ_u′)] (E_{t,m}′ ⊗ E_{u,m}′).

Fourth and higher order derivatives follow the same pattern.

3.2.4 Example: Linear Model

Consider the linear model y ∼ N(Xβ, σ²I_n), where X is n × p. The log likelihood function is

ℓ(θ; y) = −(1/(2σ²))(y − Xβ)′(y − Xβ) − (n/2) ln(σ²),  where  θ = (β′, σ²)′.

It is readily shown that the first three derivatives of the log likelihood function are

ℓ_1(θ; y) = Σ_{s=1}^2 E_{s,m} ∂ℓ(θ; y)/∂θ_s =
[ X′(y − Xβ)/σ²                                  ]
[ (y − Xβ)′(y − Xβ)/(2σ⁴) − n/(2σ²)              ],

ℓ_2(θ; y) = Σ_{s=1}^2 Σ_{t=1}^2 E_{s,m} [∂²ℓ(θ; y)/(∂θ_s ⊗ ∂θ_t′)] E_{t,m}′ =
[ −X′X/σ²              −X′(y − Xβ)/σ⁴                             ]
[ −(y − Xβ)′X/σ⁴       −(y − Xβ)′(y − Xβ)/σ⁶ + n/(2σ⁴)            ],

ℓ_3(θ; y) = Σ_{s=1}^2 Σ_{t=1}^2 Σ_{u=1}^2 E_{s,m} [∂³ℓ(θ; y)/(∂θ_s ⊗ ∂θ_t′ ⊗ ∂θ_u′)] (E_{t,m}′ ⊗ E_{u,m}′)

 = E_{1,m} (X′X/σ⁴)(E_{1,m}′ ⊗ E_{2,m}′) + E_{1,m} (X′X/σ⁴)(E_{2,m}′ ⊗ E_{1,m}′)
 + E_{1,m} [2X′(y − Xβ)/σ⁶](E_{2,m}′ ⊗ E_{2,m}′) + E_{2,m} {[vec(X′X)]′/σ⁴}(E_{1,m}′ ⊗ E_{1,m}′)
 + E_{2,m} [2(y − Xβ)′X/σ⁶](E_{1,m}′ ⊗ E_{2,m}′) + E_{2,m} [2(y − Xβ)′X/σ⁶](E_{2,m}′ ⊗ E_{1,m}′)
 + E_{2,m} [3(y − Xβ)′(y − Xβ)/σ⁸ − n/σ⁶](E_{2,m}′ ⊗ E_{2,m}′),

where m = (p, 1)′.
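A numerical check of ℓ_1 for this model (a Python/numpy sketch of my own; the design matrix and parameter values are arbitrary) compares the analytic score with a finite-difference gradient of the log likelihood:

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    beta, s2 = np.array([1.0, -2.0, 0.5]), 1.7
    y = X @ beta + rng.normal(scale=np.sqrt(s2), size=n)

    def loglik(theta):
        b, v = theta[:p], theta[p]
        r = y - X @ b
        return -r @ r / (2 * v) - n / 2 * np.log(v)

    theta = np.concatenate([beta, [s2]])
    r = y - X @ beta
    analytic = np.concatenate([X.T @ r / s2,
                               [r @ r / (2 * s2**2) - n / (2 * s2)]])
    h = 1e-6
    fd = np.array([(loglik(theta + h * e) - loglik(theta - h * e)) / (2 * h)
                   for e in np.eye(p + 1)])
    print(np.allclose(analytic, fd, atol=1e-4))   # True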


3.3 MULTIVARIATE TAYLOR SERIES EXPANSIONS

3.3.1 Arrangement of Derivatives

Let Y be a p × q matrix of variables that are functions of the elements of the m × n matrix X. That is, y_ij = y_ij(X). The derivative of Y with respect to X is

∂Y/∂X = (∂/∂X) ⊗ Y =
[ ∂Y/∂x_11  ∂Y/∂x_12  · · ·  ∂Y/∂x_1n ]
[ ∂Y/∂x_21  ∂Y/∂x_22  · · ·  ∂Y/∂x_2n ]
[    ...        ...    · · ·     ...   ]
[ ∂Y/∂x_m1  ∂Y/∂x_m2  · · ·  ∂Y/∂x_mn ],

where

∂Y/∂x_ij =
[ ∂y_11/∂x_ij  · · ·  ∂y_1q/∂x_ij ]
[     ...       · · ·      ...     ]
[ ∂y_p1/∂x_ij  · · ·  ∂y_pq/∂x_ij ].

The Kronecker product notation serves as an instruction as to how to arrange the derivatives. Note that the transpose of the derivative is equal to the derivative of the transpose:

(∂Y/∂X)′ =
[ ∂Y′/∂x_11  ∂Y′/∂x_21  · · ·  ∂Y′/∂x_m1 ]
[ ∂Y′/∂x_12  ∂Y′/∂x_22  · · ·  ∂Y′/∂x_m2 ]
[     ...         ...    · · ·      ...    ]
[ ∂Y′/∂x_1n  ∂Y′/∂x_2n  · · ·  ∂Y′/∂x_mn ]
= ∂Y′/∂X′ = (∂/∂X′) ⊗ Y′.

If Y is a scalar-valued function and X is a vector, then the matrix derivative simplifies. For example, suppose that g(x) is a scalar-valued function of the m × 1 vector x. Then,

∂g(x)/∂x = (∂/∂x) ⊗ g(x) = (∂g(x)/∂x_1, ∂g(x)/∂x_2, ..., ∂g(x)/∂x_m)′,  and

∂g(x)/∂x′ = (∂/∂x′) ⊗ g(x) = (∂g(x)/∂x_1, ∂g(x)/∂x_2, ..., ∂g(x)/∂x_m).

Higher-order derivatives are defined in an analogous manner. For example, suppose that g(x) is a scalar-valued function of the m × 1 vector x. Then

∂²g(x)/(∂x ⊗ ∂x) = (∂/∂x)(∂g(x)/∂x) = (∂/∂x) ⊗ (∂/∂x) ⊗ g(x)
 = (∂²g/∂x_1∂x_1, ∂²g/∂x_1∂x_2, ..., ∂²g/∂x_1∂x_m, ∂²g/∂x_2∂x_1, ..., ∂²g/∂x_m∂x_m)′,

which is m² × 1. Similarly,

∂²g(x)/(∂x ⊗ ∂x′) = (∂/∂x)(∂g(x)/∂x′) = (∂/∂x) ⊗ (∂/∂x′) ⊗ g(x) =
[ ∂²g/∂x_1∂x_1  ∂²g/∂x_1∂x_2  · · ·  ∂²g/∂x_1∂x_m ]
[ ∂²g/∂x_2∂x_1  ∂²g/∂x_2∂x_2  · · ·  ∂²g/∂x_2∂x_m ]
[      ...            ...       · · ·       ...     ]
[ ∂²g/∂x_m∂x_1  ∂²g/∂x_m∂x_2  · · ·  ∂²g/∂x_m∂x_m ].

Note that the matrix of second derivatives is symmetric. Accordingly,

∂²g(x)/(∂x ⊗ ∂x′) = ∂²g(x)/(∂x′ ⊗ ∂x).

3.3.2 Taylor Series with Remainder—Scalar-Valued Function

Suppose that g(x) is a scalar-valued function of the m × 1 vector x. Denote the ith derivative of g(x) evaluated at x = x_0 by g_{x_0}^{(i)}. As shown in the previous section, there are many ways to order higher-order derivatives. We will use the following arrangements:

g_{x_0}^{(1)} = ∂g(x)/∂x |_{x=x_0},  which is m × 1;

g_{x_0}^{(2)} = ∂²g(x)/(∂x′ ⊗ ∂x) |_{x=x_0},  which is m × m;

g_{x_0}^{(3)} = ∂³g(x)/(∂x′ ⊗ ∂x′ ⊗ ∂x) |_{x=x_0},  which is m × m²;

g_{x_0}^{(4)} = ∂⁴g(x)/(∂x′ ⊗ ∂x′ ⊗ ∂x′ ⊗ ∂x) |_{x=x_0},  which is m × m³;

etcetera. If i ≥ 2, then the dimension of g_{x_0}^{(i)} is m × m^{i−1}. Powers of (x − x_0) will be denoted as follows:

(x − x_0)^{⊗1} = (x − x_0),
(x − x_0)^{⊗2} = (x − x_0) ⊗ (x − x_0),
(x − x_0)^{⊗3} = (x − x_0) ⊗ (x − x_0) ⊗ (x − x_0),

etcetera. Suppose that derivatives of g(x) up to order r exist and are continuous in an open neighborhood of x_0. Taylor's theorem with remainder (Taylor, 1712) states that if x is in this neighborhood, then

g(x) = g(x_0) + (x − x_0)′ g_{x_0}^{(1)} + Σ_{i=2}^{r−1} (1/i!) (x − x_0)′ g_{x_0}^{(i)} (x − x_0)^{⊗(i−1)} + (1/r!) (x − x_0)′ g_{x_*}^{(r)} (x − x_0)^{⊗(r−1)},

where x_* = αx_0 + (1 − α)x for some α ∈ (0, 1). If

(1/r!) (x − x_0)′ g_{x_0}^{(r)} (x − x_0)^{⊗(r−1)} → 0

as r → ∞, then

g(x) = g(x_0) + (x − x_0)′ g_{x_0}^{(1)} + Σ_{i=2}^∞ (1/i!) (x − x_0)′ g_{x_0}^{(i)} (x − x_0)^{⊗(i−1)}.

More generally, if g_{x_*}^{(r)} is bounded for all x_* = αx_0 + (1 − α)x, α ∈ (0, 1), then

g_{x_*}^{(r)} = O(1),   (1/r!) (x − x_0)′ g_{x_*}^{(r)} (x − x_0)^{⊗(r−1)} = O(|x − x_0|^r),

and

g(x) = g(x_0) + (x − x_0)′ g_{x_0}^{(1)} + Σ_{i=2}^{r−1} (1/i!) (x − x_0)′ g_{x_0}^{(i)} (x − x_0)^{⊗(i−1)} + O(|x − x_0|^r).
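A small numerical check (my own sketch, with g(x) = e^{x_1} sin(x_2) chosen arbitrarily) that the remainder after the second-order term is O(|x − x_0|³):

    import numpy as np

    g  = lambda x: np.exp(x[0]) * np.sin(x[1])
    g1 = lambda x: np.array([np.exp(x[0]) * np.sin(x[1]),
                             np.exp(x[0]) * np.cos(x[1])])               # m x 1 gradient
    g2 = lambda x: np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                             [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])  # m x m Hessian

    x0 = np.array([0.3, 0.7])
    for t in [1e-1, 1e-2, 1e-3]:
        x = x0 + t * np.array([1.0, -1.0])
        approx = g(x0) + (x - x0) @ g1(x0) + 0.5 * (x - x0) @ g2(x0) @ (x - x0)
        print(t, abs(g(x) - approx))   # error shrinks roughly like t**3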

3.3.3 Taylor Series with Remainder—Vector-Valued Function

Suppose that g(x) is a vector-valued function of the m × 1 vector x. Specifically, suppose that g(x) is p × 1. Denote the ith derivative of g(x) evaluated at x = x_0 by g_{x_0}^{(i)}. As shown in a previous section, there are many ways to order higher-order derivatives. We will use the following arrangements:

g_{x_0}^{(1)} = ∂g(x)/∂x′ |_{x=x_0},  which is p × m;

g_{x_0}^{(2)} = ∂²g(x)/(∂x′ ⊗ ∂x′) |_{x=x_0},  which is p × m²;

g_{x_0}^{(3)} = ∂³g(x)/(∂x′ ⊗ ∂x′ ⊗ ∂x′) |_{x=x_0},  which is p × m³;

etcetera. In general, the dimension of g_{x_0}^{(i)} is p × m^i. As before, the ith power of (x − x_0) will be denoted by (x − x_0)^{⊗i}.

Suppose that derivatives of g(x) up to order r exist and are continuous in an open neighborhood of x_0. Taylor's theorem with remainder states that if x is in this neighborhood, then

g(x) = g(x_0) + Σ_{i=1}^{r−1} (1/i!) g_{x_0}^{(i)} (x − x_0)^{⊗i} + (1/r!) g_{x_*}^{(r)} (x − x_0)^{⊗r},

where x_* = αx_0 + (1 − α)x for some α ∈ (0, 1). If

(1/r!) g_{x_0}^{(r)} (x − x_0)^{⊗r} → 0

as r → ∞, then

g(x) = g(x_0) + Σ_{i=1}^∞ (1/i!) g_{x_0}^{(i)} (x − x_0)^{⊗i}.

More generally, if g_{x_*}^{(r)} is bounded for all x_* = αx_0 + (1 − α)x, α ∈ (0, 1), then

g_{x_*}^{(r)} = O(1),   (1/r!) g_{x_*}^{(r)} (x − x_0)^{⊗r} = O(|x − x_0|^r),

and

g(x) = g(x_0) + Σ_{i=1}^{r−1} (1/i!) g_{x_0}^{(i)} (x − x_0)^{⊗i} + O(|x − x_0|^r).

3.4 CUMULANT GENERATING FUNCTIONS

Suppose that y is a scalar random variable whose moment generating function exists. Write the MGF as M_y(t) and expand ln[M_y(t)] in a Taylor series around t = 0. The result is called the cumulant generating function:

C_y(t) = ln[M_y(t)] = Σ_{r=1}^∞ (t^r/r!) κ_r,

where κ_r is called the rth cumulant. Cumulants also are known as semi-invariants. The cumulants are polynomial functions of the moments of y. Let µ_r be the rth moment of y and let δ_r be the rth central moment of y. That is, µ_r = E(y^r) and δ_r = E[(y − µ_1)^r]. Then, the relationships among the first four values of κ_r, µ_r, and δ_r are summarized in the following table.

Table 3.1: Relations Among Cumulants and Moments

r   κ_r in terms of raw moments µ_r            κ_r in terms of central moments δ_r   µ_r in terms of cumulants κ_r
0   0                                          0                                     1
1   µ1                                         µ1                                    κ1
2   µ2 − µ1²                                   δ2                                    κ2 + κ1²
3   µ3 − 3µ1µ2 + 2µ1³                          δ3                                    κ3 + 3κ1κ2 + κ1³
4   µ4 − 3µ2² − 4µ1µ3 + 12µ1²µ2 − 6µ1⁴         δ4 − 3δ2²                             κ4 + 3κ2² + 4κ1κ3 + 6κ1²κ2 + κ1⁴

The standardized 4th cumulant

κ4/κ2² = E(y − µ)⁴/[E(y − µ)²]² − 3

is the usual measure of kurtosis. If y is a scalar random variable with distribution y ∼ N(µ, σ²), then M_y(t) = exp{tµ + t²σ²/2} and ln[M_y(t)] = tµ + t²σ²/2. Accordingly, κ1 = µ, κ2 = σ², and all higher-order cumulants are zero.
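The entries of Table 3.1 are easy to check by simulation. The Python sketch below (mine; N(0.5, 1) is an arbitrary choice) estimates the raw moments and plugs them into the r = 1, ..., 4 formulas; for a normal distribution, the third and fourth cumulants should be near zero:

    import numpy as np

    rng = np.random.default_rng(7)
    mu, sigma = 0.5, 1.0
    y = rng.normal(mu, sigma, size=2_000_000)
    m1, m2, m3, m4 = (np.mean(y**r) for r in range(1, 5))   # raw moments

    k1 = m1
    k2 = m2 - m1**2
    k3 = m3 - 3*m1*m2 + 2*m1**3
    k4 = m4 - 3*m2**2 - 4*m1*m3 + 12*m1**2*m2 - 6*m1**4
    print(np.round([k1, k2, k3, k4], 2))    # roughly [0.5, 1.0, 0.0, 0.0]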


3.5 JACOBIANS

3.6 CHANGE OF VARIABLE FORMULAE

3.6.1 Review of Scalar Results

Suppose Y is a univariate random variable with density f(y) for a < y < b. Let X = r(Y) be a one-to-one transformation from Y to X. That is, r(·) has a unique inverse, so that x = r(y) iff y = r^{-1}(x). Under these conditions, the density of X is

g(x) = f(r^{-1}(x)) |dy/dx|

for x in the range of r(y). The quantity J = dy/dx is called the Jacobian of the transformation. It is sometimes easier to obtain J by J = (dx/dy)^{-1}.
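A Monte Carlo sketch of the formula (my own illustration, with Y ∼ Exp(1) and X = Y² chosen arbitrarily): here y = √x, |dy/dx| = 1/(2√x), so g(x) = e^{−√x}/(2√x), and a simple windowed density estimate agrees:

    import numpy as np

    rng = np.random.default_rng(8)
    x_draws = rng.exponential(size=1_000_000) ** 2   # X = Y^2 with Y ~ Exp(1)

    x, h = 1.3, 0.05
    empirical = np.mean(np.abs(x_draws - x) < h) / (2 * h)   # density estimate at x
    formula = np.exp(-np.sqrt(x)) / (2 * np.sqrt(x))
    print(empirical, formula)                                # both close to 0.14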

3.6.2 Multivariate Results

In this section, Y will denote a random matrix or a realization, depending on context. Suppose Y is a random matrix with density f(Y). Let X = R(Y) be a one-to-one transformation from Y to X. Note that R(Y) is a matrix function of a matrix argument. That is, if X is n × d, then

X = R(Y) = {r_ij(Y)} =
[ r_11(Y)  r_12(Y)  · · ·  r_1d(Y) ]
[    ...       ...   · · ·    ...   ]
[ r_n1(Y)  r_n2(Y)  · · ·  r_nd(Y) ].

The density of X is

g(X) = f(R^{-1}(X)) |J|,  where  J = |∂vec(Y)/∂vec′(X)|,

for X in the range of R(Y). Sometimes it is easier to find the Jacobian by

J = |∂vec(X)/∂vec′(Y)|^{-1}.

The preceding multivariate change of variable formula assumes that the elements of Y are functionally independent. That is, if Y is n × d, then Y has an nd-dimensional distribution. If Y is d × d and symmetric, then the distribution is only d(d+1)/2-dimensional and vec(Y) needs to be replaced by vech(Y) = H vec(Y), where H is described on page 24 of the Stat 505 notes. Recall that if Y is symmetric, then vec(Y) = G vech(Y) for G on page 24 of the Stat 505 notes.

If Y is d × d and upper triangular, then Y has a d(d+1)/2-dimensional distribution and vec(Y) must be replaced by vech(Y′). The transpose is used to obtain a lower triangular matrix so that vech(Y′) stacks the non-zero elements in Y. If Y is lower triangular, then replace vec(Y) by vech(Y). Note, however, that if Y is lower triangular rather than symmetric, then vech(Y) ≠ H vec(Y) and vec(Y) ≠ G vech(Y).

The general rule is that if Y has some constraining structure, then replace vec(Y) by a vector obtained by stacking the functionally independent elements of Y. The elements of Y are functionally dependent if one or more of the elements (scalar random variables) can be written as functions of the remaining elements. For example, if Y is symmetric, then the (ij)th element is identical to the (ji)th element, and the elements are not functionally independent.

3.6.3 Some Multivariate Jacobians

Theorem 5 Suppose that Y is an a × b random matrix of functionally independent random variables. Let A and B be a × a and b × b nonsingular matrices. Transform Y to X by X = AYB. Then

J = |∂vec(Y)/∂vec′(X)| = |A|^{-b} |B|^{-a}.


This theorem can be used to establish a well-known multivariate normal result. If y is an n × 1 vector with distribution N(µ, Σ), then Ay ∼ N(Aµ, AΣA′), where A is a nonsingular matrix. We know that the result also holds for singular A, but the nonsingular condition is necessary here to ensure that R(·) is one to one.
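By result 1 of Theorem 4, vec(X) = vec(AYB) = (B′ ⊗ A) vec(Y), so the Jacobian matrix is (B′ ⊗ A)^{-1} and J = |B′ ⊗ A|^{-1} = |A|^{-b}|B|^{-a}. A quick numerical confirmation (my own Python sketch):

    import numpy as np

    rng = np.random.default_rng(9)
    a, b = 3, 4
    A = rng.normal(size=(a, a))
    B = rng.normal(size=(b, b))
    J = 1.0 / np.linalg.det(np.kron(B.T, A))     # |d vec(Y) / d vec'(X)|
    print(np.isclose(J, np.linalg.det(A)**(-b) * np.linalg.det(B)**(-a)))   # True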

Theorem 6 Suppose Y is a random d × d symmetric matrix and A is a d × d nonsingular matrix of constants. Let X = AYA′. Then,

Y = A^{-1}XA′^{-1},  vec(Y) = (A^{-1} ⊗ A^{-1}) vec(X),  and  vech(Y) = H(A^{-1} ⊗ A^{-1})G vech(X).

The last identity is obtained from the relationships vec(Y) = G vech(Y), vech(Y) = H vec(Y), and HG = I. Note that ∂vech(X)/∂vech′(X) is an identity matrix of order d(d+1)/2. The Jacobian is

J = |∂vech(Y)/∂vech′(X)| = |H(A^{-1} ⊗ A^{-1})G|.

Henderson and Searle ("Vec and Vech Operators for Matrices, with Some Uses in Jacobians and Multivariate Statistics," The Canadian Journal of Statistics, 1979, 7, 65–81) showed that

|H(A^{-1} ⊗ A^{-1})G| = |A|^{-(d+1)}.

Theorem 7 Suppose that Y is a d× d upper triangular random matrix. Let X = YC where C is a d× d uppertriangular matrix of constants. Note that X also is upper triangular. The inverse transformation is Y = XC−1 orY′ = C′−1

X′. The functionally independent elements of Y′ are vech(Y′). The Jacobian is

J =

∣∣∣∣∂ vech(Y′)

∂ vech ′(X′)

∣∣∣∣ = |H(Id ⊗ C′−1)G| = |H(Id ⊗ C′−1

)G| =

d∏

i=1

aiii,

where aii is the ith diagonal entry in C′−1. Note, aii also is the ith diagonal entry in C−1.

Theorem 8 Suppose that Y is a d× d upper triangular matrix of random variables. Let X = Y′Y. Note that X issymmetric and thus has only d(d+ 1)/2 functionally independent random variables. This is the same number as inY. If the diagonal elements of Y are positive with probability 1, then the transformation is one to one. TheJacobian is

J =

∣∣∣∣∂ vech(Y)

∂ vech ′(X′)

∣∣∣∣ =∣∣∣∣∂ vech(X)

∂ vech ′(Y′)

∣∣∣∣−1

= 2−dd∏

i=1

yi−(d+1)ii .

Theorem 9 Suppose Y is a random d× d symmetric matrix. Let X = Y−1. Then,

Y = X−1, vec(Y) = vec(X−1), and vech(Y) = G vech(X−1).

The Jacobian is

J =

∣∣∣∣∂ vech(Y)

∂ vech ′(X)

∣∣∣∣ = |X|−(d+1).


3.7 SECOND DERIVATIVE TEST

Assume that f(t) is a twice differentiable scalar-valued function of t. The stationary points of f(t) are the solutions to the equations
\[
\frac{\partial f(t)}{\partial t} = 0.
\]
Suppose that t_0 is a stationary point of f(t). The second-order Taylor expansion of f(t) around t = t_0 is
\[
f(t) = f(t_0) + g'(t - t_0) + \tfrac{1}{2}(t - t_0)'H(t - t_0) + o(\|t - t_0\|^2),
\]
where g is the gradient vector
\[
g = \left.\frac{\partial f(t)}{\partial t}\right|_{t=t_0}
\]
and H is the Hessian matrix
\[
H = \left.\frac{\partial^2 f(t)}{\partial t\otimes\partial t'}\right|_{t=t_0}.
\]
At the stationary point t_0, g = 0, so that
\[
f(t) = f(t_0) + \tfrac{1}{2}(t - t_0)'H(t - t_0) + o(\|t - t_0\|^2).
\]
Accordingly, if H > 0, then f(t_0) is a local minimum, and if H < 0, then f(t_0) is a local maximum.
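A small quadratic toy example (not from the notes) showing the test in code: the stationary point is found by solving the gradient equation, and the eigenvalues of the Hessian classify it; an indefinite Hessian corresponds to a saddle point.

import numpy as np

# Second-derivative test for f(t) = 0.5 t'A t + b't, whose Hessian is A.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([-1.0, 2.0])

t0 = np.linalg.solve(A, -b)          # stationary point: A t + b = 0
eigvals = np.linalg.eigvalsh(A)      # Hessian eigenvalues at t0
if np.all(eigvals > 0):
    kind = "local minimum"
elif np.all(eigvals < 0):
    kind = "local maximum"
else:
    kind = "saddle point"
print(t0, kind)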

3.8 LAGRANGE MULTIPLIERS

The goal is to find the vector t_0 : q × 1 which maximizes (or minimizes) f(t) subject to the constraint that G(t) = g, where G maps vectors from R^q to R^p. Assume that the constraint is a differentiable function of t and that
\[
\left.\frac{\partial G(t)}{\partial t'}\right|_{t=t_0} : p\times q
\]
has rank p. Let λ : p × 1 be a vector of unknown Lagrange multipliers. Note that a multiplier is required for each functionally independent constraint. Define a new function
\[
Q(t,\lambda) = f(t) - \lambda'\left(G(t) - g\right).
\]

Theorem 10 (Lagrange) The vector which maximizes (minimizes) f(t) subject to G(t) = g can be obtained by simultaneously solving two equations:
\[
\frac{\partial Q(t,\lambda)}{\partial t} = 0 \quad\text{and}\quad \frac{\partial Q(t,\lambda)}{\partial\lambda} = 0.
\]
The two equations can be written as
\[
\frac{\partial f(t)}{\partial t} = \left(\frac{\partial [G(t)]'}{\partial t}\right)\lambda \quad\text{and}\quad G(t) = g.
\]
The second equation consists of the constraints to be satisfied.

To motivate the above result, suppose that t_0 is the vector which maximizes f(t) subject to G(t) = g. The tangent plane to f(t) at t = t_0 is given by
\[
\left(\left.\frac{\partial f(t)}{\partial t'}\right|_{t=t_0}\right)(t - t_0) = 0.
\]
Also, the tangent plane to G(t) − g at t = t_0 is
\[
\left(\left.\frac{\partial [G(t) - g]}{\partial t'}\right|_{t=t_0}\right)(t - t_0) = 0.
\]
At the solution, the tangent planes to G and f must coincide. That is,
\[
\left(\left.\frac{\partial [G(t) - g]}{\partial t'}\right|_{t=t_0}\right)(t - t_0) = 0
\;\Longrightarrow\;
\left(\left.\frac{\partial f(t)}{\partial t'}\right|_{t=t_0}\right)(t - t_0) = 0.
\]
The tangent plane to G will be on the tangent plane to f iff
\[
\left.\frac{\partial f(t)}{\partial t'}\right|_{t=t_0} = \lambda'\left(\left.\frac{\partial [G(t) - g]}{\partial t'}\right|_{t=t_0}\right)
\]
for some λ. Hence, ∂Q(t,λ)/∂t = 0 must be satisfied.
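A minimal numerical sketch (not from the notes) of Theorem 10: maximize f(t) = t_1 + t_2 subject to the single constraint G(t) = t_1^2 + t_2^2 = 1 by solving the stationarity equations of Q(t, λ); the starting point and use of scipy's fsolve are choices of this sketch.

import numpy as np
from scipy.optimize import fsolve

# Stationarity of Q(t, lam) = f(t) - lam*(G(t) - 1), with f(t) = t1 + t2.
def equations(v):
    t1, t2, lam = v
    return [1.0 - 2.0 * lam * t1,        # dQ/dt1 = 0
            1.0 - 2.0 * lam * t2,        # dQ/dt2 = 0
            t1**2 + t2**2 - 1.0]         # constraint G(t) = g

t1, t2, lam = fsolve(equations, x0=[0.5, 0.5, 0.5])
print(t1, t2, lam)   # approx (0.7071, 0.7071, 0.7071): the constrained maximizer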

3.9 NEWTON-RAPHSON ALGORITHM

Let Y be a random matrix with joint density f(Y;θ), where θ is an r-vector of unknown parameters. Denote by ℓ(θ;Y) the log likelihood function of θ given an observed Y. The MLE of θ is the vector which maximizes ℓ(θ;Y). One way of computing this maximization is to employ the Newton-Raphson algorithm.

Let θ_0 be an initial guess and let θ_i be the estimate of θ after the ith iteration. Denote the MLE of θ by θ̂. Begin by expanding ℓ(θ;Y) in a Taylor series around θ = θ_i. That is,
\[
\ell(\theta;Y) = \ell(\theta_i;Y) + (\theta-\theta_i)'\left[\left.\frac{\partial\ell(\theta;Y)}{\partial\theta}\right|_{\theta=\theta_i}\right]
+ \frac{1}{2}(\theta-\theta_i)'\left[\left.\frac{\partial^2\ell(\theta;Y)}{\partial\theta\otimes\partial\theta'}\right|_{\theta=\theta_i}\right](\theta-\theta_i) + o(\|\theta-\theta_i\|^2).
\]
This expansion can be written as
\[
\ell(\theta;Y) \approx \ell(\theta_i;Y) + (\theta-\theta_i)'g_{\theta_i} + \frac{1}{2}(\theta-\theta_i)'H_{\theta_i}(\theta-\theta_i), \tag{3.1}
\]
where
\[
g_{\theta_i} = \left.\frac{\partial\ell(\theta;Y)}{\partial\theta}\right|_{\theta=\theta_i}
\quad\text{and}\quad
H_{\theta_i} = \left.\frac{\partial^2\ell(\theta;Y)}{\partial\theta\otimes\partial\theta'}\right|_{\theta=\theta_i}.
\]
To find the vector θ that maximizes ℓ(θ;Y), set the derivative of ℓ(θ;Y), expressed as in (3.1), to zero and solve for θ. The result is
\[
\frac{\partial\ell(\theta;Y)}{\partial\theta} = 0 \;\Longrightarrow\; g_{\theta_i} + H_{\theta_i}(\theta-\theta_i) = 0. \tag{3.2}
\]
Solving (3.2) for θ gives
\[
\theta = \theta_i - H_{\theta_i}^{-1}g_{\theta_i}. \tag{3.3}
\]
The left-hand side of (3.3) becomes the new guess and the procedure is repeated. If the original guess is not too far from the MLE, then the Taylor series expansion is accurate and the updated guess will be even closer to the MLE. Iteration continues until convergence. Note that at the solution, g_{θ̂} = 0, so further iterations have no effect. If the original guess is far from the MLE, the Taylor series expansion is not accurate and the algorithm may not converge to the MLE.
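A minimal scalar-parameter sketch (not from the notes) of the update (3.3), using an Exponential(θ) sample so the closed-form MLE θ̂ = 1/ȳ is available as a check.

import numpy as np

# Newton-Raphson for the MLE of the rate theta of an Exponential(theta) sample:
# l(theta) = n*log(theta) - theta*sum(y), g = n/theta - sum(y), H = -n/theta**2.
rng = np.random.default_rng(2)
y = rng.exponential(scale=1.0 / 2.5, size=500)   # true rate 2.5
n, s = y.size, y.sum()

theta = 1.0                                      # initial guess theta_0
for _ in range(25):
    g = n / theta - s
    H = -n / theta**2
    step = -g / H                                # theta_{i+1} = theta_i - H^{-1} g
    theta += step
    if abs(step) < 1e-10:                        # iterate until convergence
        break

print(theta, n / s)   # Newton-Raphson solution vs closed-form MLE 1/ybar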


3.10 FISHER SCORING ALGORITHM

The Fisher scoring algorithm (named after R. A. Fisher) is identical to the Newton-Raphson algorithm with one exception. The Fisher scoring algorithm replaces H_{θ_i} by its expectation under θ = θ_i. The quantity
\[
\mathcal{I}_\theta = -\mathrm{E}\left[H_\theta\right] = -\mathrm{E}_\theta\left[\frac{\partial^2\ell(\theta)}{\partial\theta\otimes\partial\theta'}\right]
\]
is called Fisher's information (total). The quantity −H_θ is called the observed information (total) matrix. Fisher's scoring algorithm replaces −H_{θ_i} by I_{θ_i}.

3.10.1 Examples

3.10.1.1 Beta MLEs

Suppose y_i ∼ iid Beta(α, β) for i = 1, ..., n. Method of moments estimators provide sensible initial guesses for α and β:
\[
\mathrm{E}(y^r) = \frac{B(\alpha+r,\beta)}{B(\alpha,\beta)} = \frac{\Gamma(\alpha+r)\Gamma(\alpha+\beta)}{\Gamma(\alpha+r+\beta)\Gamma(\alpha)}, \quad\text{for } \alpha+r>0
\]
\[
\Longrightarrow\quad \mathrm{E}(y) = \frac{\alpha}{\alpha+\beta} \quad\text{and}\quad \mathrm{var}(y) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.
\]
Matching moments yields
\[
\alpha_0 = \frac{\bar y^{\,2}(1-\bar y)}{S^2_y} - \bar y \quad\text{and}\quad \beta_0 = \frac{\alpha_0(1-\bar y)}{\bar y}.
\]
The Newton-Raphson and Fisher scoring algorithms are identical. In particular,
\[
\theta = \begin{pmatrix}\alpha\\ \beta\end{pmatrix};\qquad
\frac{\partial\ell(\theta_i)}{\partial\theta_i} =
\begin{pmatrix}
\sum_{i=1}^n\ln(y_i) - n\psi(\alpha_i) + n\psi(\alpha_i+\beta_i)\\[2pt]
\sum_{i=1}^n\ln(1-y_i) - n\psi(\beta_i) + n\psi(\alpha_i+\beta_i)
\end{pmatrix}
\quad\text{and}
\]
\[
\mathcal{I}_{\theta_i} = -\frac{\partial^2\ell(\theta_i)}{(\partial\theta_i)\otimes(\partial\theta_i')} =
n\begin{pmatrix}
\psi'(\alpha_i)-\psi'(\alpha_i+\beta_i) & -\psi'(\alpha_i+\beta_i)\\
-\psi'(\alpha_i+\beta_i) & \psi'(\beta_i)-\psi'(\alpha_i+\beta_i)
\end{pmatrix}.
\]
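A Python sketch of the Beta example (not from the notes), started at the method-of-moments guesses and iterated with the score vector and information matrix given above.

import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(3)
y = rng.beta(2.0, 5.0, size=400)
n, ybar, s2 = y.size, y.mean(), y.var(ddof=1)

alpha = ybar**2 * (1.0 - ybar) / s2 - ybar        # alpha_0 (method of moments)
beta = alpha * (1.0 - ybar) / ybar                # beta_0
sly, sl1y = np.log(y).sum(), np.log1p(-y).sum()

for _ in range(50):
    g = np.array([sly - n * digamma(alpha) + n * digamma(alpha + beta),
                  sl1y - n * digamma(beta) + n * digamma(alpha + beta)])
    tri_ab = polygamma(1, alpha + beta)
    info = n * np.array([[polygamma(1, alpha) - tri_ab, -tri_ab],
                         [-tri_ab, polygamma(1, beta) - tri_ab]])
    step = np.linalg.solve(info, g)               # theta_{i+1} = theta_i + I^{-1} g
    alpha, beta = alpha + step[0], beta + step[1]
    if np.max(np.abs(step)) < 1e-10:
        break

print(alpha, beta)   # close to the true values (2, 5) for moderate n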

3.10.1.2 Bernoulli MLEs

Suppose that y_j for j = 1, ..., n are independently distributed as Bernoulli(π_j), where logit(π_j) = x_j′θ. The Newton-Raphson and Fisher scoring algorithms are identical. In particular,
\[
\frac{\partial\ell(\theta_i)}{\partial\theta_i} = \sum_{j=1}^n x_j(y_j - \pi_{ij})
\quad\text{and}\quad
\mathcal{I}_{\theta_i} = -\frac{\partial^2\ell(\theta_i)}{(\partial\theta_i)(\partial\theta_i')} = \sum_{j=1}^n x_j\,\pi_{ij}(1-\pi_{ij})\,x_j',
\]
where
\[
\pi_{ij} = \frac{\exp\{x_j'\theta_i\}}{1+\exp\{x_j'\theta_i\}}.
\]
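A Python sketch of the Bernoulli (logistic regression) example (not from the notes), using simulated covariates purely for illustration; the iteration is exactly the Fisher scoring update built from the score and information displayed above.

import numpy as np

rng = np.random.default_rng(4)
n, q = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
theta_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))

theta = np.zeros(q)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-X @ theta))                 # pi_{ij} at current iterate
    g = X.T @ (y - pi)                                    # score vector
    info = X.T @ (X * (pi * (1.0 - pi))[:, None])         # Fisher information
    step = np.linalg.solve(info, g)
    theta += step
    if np.max(np.abs(step)) < 1e-10:
        break

print(theta)   # close to theta_true for moderate n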

3.11 GAUSS-NEWTON ALGORITHM

The Gauss-Newton algorithm is useful for obtaining least squares estimators of parameters in nonlinear models. Suppose that y_1, ..., y_n are independently distributed random p-vectors with E(y_i) = g_i(θ) and Var(y_i) = Σ_i. For simplicity, it is assumed that Σ_i for i = 1, ..., n are known. The least squares estimator of the q-vector θ is the minimizer of
\[
\mathrm{SSE}(\theta) = \sum_{i=1}^n [y_i - g_i(\theta)]'\Sigma_i^{-1}[y_i - g_i(\theta)].
\]
It is assumed that the nonlinear function g_i is twice differentiable with respect to θ and has continuous derivatives.

Denote the minimizer by θ̂. In the hth iteration of the Gauss-Newton algorithm, the function g_i(θ) is expanded in a Taylor series around the current guess, say θ^{(h)}. The expansion is
\[
g_i(\theta) = g_i(\theta^{(h)}) + G_i(\theta^{(h)})(\theta - \theta^{(h)}) + O(|\theta - \theta^{(h)}|^2),
\quad\text{where}\quad
G_i(\theta^{(h)}) = \left.\frac{\partial g_i(\theta)}{\partial\theta'}\right|_{\theta=\theta^{(h)}}.
\]
Drop the remainder term and substitute the expansion into the SSE criterion to obtain
\[
\mathrm{SSE}(\theta) \approx \sum_{i=1}^n\left[y_i - g_i(\theta^{(h)}) - G_i(\theta-\theta^{(h)})\right]'\Sigma_i^{-1}\left[y_i - g_i(\theta^{(h)}) - G_i(\theta-\theta^{(h)})\right],
\]
where G_i = G_i(θ^{(h)}).

To obtain the update of θ, take the derivative of the approximate SSE function with respect to θ, set it to zero, and solve for θ. The solution, called θ^{(h+1)}, is
\[
\theta^{(h+1)} = \theta^{(h)} + \left[\sum_{i=1}^n G_i'\Sigma_i^{-1}G_i\right]^{-1}\sum_{i=1}^n G_i'\Sigma_i^{-1}\left[y_i - g_i(\theta^{(h)})\right].
\]
A modification is to update θ as
\[
\theta^{(h+1)} = \theta^{(h)} + \alpha_h\left[\sum_{i=1}^n G_i'\Sigma_i^{-1}G_i\right]^{-1}\sum_{i=1}^n G_i'\Sigma_i^{-1}\left[y_i - g_i(\theta^{(h)})\right],
\]
where α_h ∈ (0, 1]. Sometimes it is necessary to use α_h < 1 to ensure that the algorithm reduces SSE on iteration h.
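A Python sketch of the Gauss-Newton update (not from the notes) for scalar responses (p = 1, Σ_i = 1) and the illustrative mean function g_i(θ) = θ_1 exp(−θ_2 x_i); the data, starting value, and fixed step length α_h = 1 are choices of this sketch.

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 4.0, 60)
theta_true = np.array([3.0, 0.7])
y = theta_true[0] * np.exp(-theta_true[1] * x) + rng.normal(scale=0.1, size=x.size)

def g(theta):
    return theta[0] * np.exp(-theta[1] * x)

def G(theta):
    # n x 2 matrix whose ith row is d g_i / d theta'
    return np.column_stack([np.exp(-theta[1] * x),
                            -theta[0] * x * np.exp(-theta[1] * x)])

theta, alpha = np.array([1.0, 0.1]), 1.0
for _ in range(100):
    r = y - g(theta)                              # residuals y_i - g_i(theta^(h))
    Gh = G(theta)
    step = np.linalg.solve(Gh.T @ Gh, Gh.T @ r)   # [sum G'G]^{-1} sum G'(y - g)
    theta = theta + alpha * step
    if np.max(np.abs(step)) < 1e-10:
        break

print(theta)   # close to theta_true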

3.12 EM ALGORITHM

3.12.1 References

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, B39, 1–38.

McLachlan, G. J., and Krishnan, T. (1997). The EM Algorithm and Extensions, New York: Wiley.

Little, R. J. A., and Rubin, D. B. (1987). Statistical Analysis with Missing Data, New York: John Wiley.

3.12.2 Missing Data Mechanisms

Let Y be a random matrix and let the same symbol denote a realization of Y, depending on context. Denote the pdf of Y by f(y|θ), where θ is a vector of unknown parameters and y = vec(Y). The pdf can be considered either as a random variable [i.e., U = f(Y|θ), a function of the random Y] or as a realization of a random variable [i.e., u = f(y|θ)]. Denote the observable random variables by y_obs and the unobservable (i.e., missing) random variables by y_miss. The vector y_obs is observed, while the vector y_miss is not observed.

Complete data consist of Y_obs and Y_miss plus a matrix of indicator variables, R, for the observed data. The indicator matrix has structure
\[
R = \{r_{ij}\},\quad\text{where } r_{ij} = \begin{cases}1 & \text{if } y_{ij}\text{ is observed, and}\\ 0 & \text{if } y_{ij}\text{ is missing.}\end{cases}
\]


If no data are missing, then the joint pdf can be written as
\[
f(y_{\mathrm{obs}}|\theta),
\]
where θ is a vector of unknown parameters. If some data are missing, then the complete data pdf can be written as
\[
f(y_{\mathrm{obs}}, y_{\mathrm{miss}}, r|\theta,\psi),
\]
where r = vec(R) and ψ is a vector of unknown parameters corresponding to the missing data mechanism (i.e., the distribution of r). Assume that the parameter vectors θ and ψ are distinct and that y_obs and y_miss are jointly sufficient for θ.

The pdf of the observed data is
\[
f(y_{\mathrm{obs}}, r|\theta,\psi) = \int f(y_{\mathrm{obs}},y_{\mathrm{miss}},r|\theta,\psi)\,dy_{\mathrm{miss}}
= \int f(y_{\mathrm{obs}},y_{\mathrm{miss}}|\theta)\,f(r|y_{\mathrm{obs}},y_{\mathrm{miss}},\psi)\,dy_{\mathrm{miss}}.
\]
The missing data mechanism can be ignored when computing MLEs of θ if either of the following two conditions is satisfied.

1. The missing data are missing at random (MAR). In this case,
\[
f(r|y_{\mathrm{obs}},y_{\mathrm{miss}},\psi) = f(r|y_{\mathrm{obs}},\psi)
\]
and the pdf of the observed data simplifies to
\[
f(y_{\mathrm{obs}}, r|\theta,\psi) = f(r|y_{\mathrm{obs}},\psi)\int f(y_{\mathrm{obs}},y_{\mathrm{miss}}|\theta)\,dy_{\mathrm{miss}}
\;\propto\; \int f(y_{\mathrm{obs}},y_{\mathrm{miss}}|\theta)\,dy_{\mathrm{miss}}.
\]
Note that in the MAR condition, the values of the missing data do not influence the missing data mechanism.

2. The missing data are missing completely at random (MCAR). In this case,
\[
f(r|y_{\mathrm{obs}},y_{\mathrm{miss}},\psi) = f(r|\psi)
\]
and the pdf of the observed data simplifies to
\[
f(y_{\mathrm{obs}}, r|\theta,\psi) = f(r|\psi)\int f(y_{\mathrm{obs}},y_{\mathrm{miss}}|\theta)\,dy_{\mathrm{miss}}
\;\propto\; \int f(y_{\mathrm{obs}},y_{\mathrm{miss}}|\theta)\,dy_{\mathrm{miss}}.
\]
Note that in the MCAR condition, the values of the missing and observed data do not influence the missing data mechanism.

3.12.3 Jensen’s Inequality

3.12.3.1 Definitions

• Convex Set. Let S be a subset of n-dimensional space. The subset S is called convex if
\[
x\in S \text{ and } y\in S \;\Longrightarrow\; \alpha x + (1-\alpha)y \in S \;\;\forall\,\alpha\in(0,1).
\]

• Convex Function. A scalar-valued function g(x) defined on a convex set S is said to be convex if
\[
g\left[\alpha x + (1-\alpha)y\right] \le \alpha g(x) + (1-\alpha)g(y) \;\;\forall\,\alpha\in(0,1).
\]
For example, g(x) = x′x for x ∈ R^n is convex. If the second derivative exists, then a convex function satisfies
\[
\frac{\partial^2 g(x)}{\partial x\otimes\partial x'} \ge 0 \;\;\forall\, x\in S.
\]


3.12.3.2 Jensen’s Theorem

Theorem 11 (Jensen) Let g(x) be a convex function defined on a nonempty convex subset S of R^n. Let z be an n-vector of random variables for which E(z) exists and Pr(z ∈ S) = 1. Then
\[
g\left[\mathrm{E}(z)\right] \le \mathrm{E}\left[g(z)\right].
\]

Proof: A proof for the special case n = 1 will be given in class. Extension to higher dimensions can be established by induction. See

Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach, New York: Academic Press.

3.12.3.3 Examples

• g(z) = z². Jensen's inequality reveals that E(z²) ≥ [E(z)]², that is, σ²_z ≥ 0.

• g(z) = −ln(z) is convex, so Jensen's inequality reveals that E[ln(z)] ≤ ln[E(z)].

3.12.4 Theory

We will assume MAR and ignore the missing data mechanism. The complete data pdf can be factored as
\[
f(y_{\mathrm{obs}},y_{\mathrm{miss}}|\theta) = f(y_{\mathrm{obs}}|\theta)\times f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta),
\]
which reveals that
\[
f(y_{\mathrm{obs}}|\theta) = \frac{f(y_{\mathrm{obs}},y_{\mathrm{miss}}|\theta)}{f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta)}.
\]
The corresponding factoring of the log likelihood function for the observed value of y_obs is
\[
\ell(\theta|y_{\mathrm{obs}}) = \ell(\theta|y_{\mathrm{obs}},y_{\mathrm{miss}}) - \ln\left[f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta)\right].
\]
We wish to maximize the observed log likelihood with respect to θ for fixed y_obs.

Let θ^{(t)} be the approximation to the mle of θ at the tth iteration. To get rid of the random variables in the likelihood function, take expectations with respect to the density of y_miss conditional on y_obs, assuming θ^{(t)} as the unknown parameter vector. Note that ℓ(θ|y_obs) does not depend on y_miss, so it is not changed by taking the expectation with respect to the conditional distribution of y_miss. That is,
\[
\mathrm{E}\left[\ell(\theta|y_{\mathrm{obs}})\right] = \ell(\theta|y_{\mathrm{obs}}) = Q(\theta|\theta^{(t)}) - H(\theta|\theta^{(t)}),
\]
where
\[
Q(\theta|\theta^{(t)}) = \int \ell(\theta|y_{\mathrm{obs}},y_{\mathrm{miss}})\,f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta^{(t)})\,dy_{\mathrm{miss}}
\]
and
\[
H(\theta|\theta^{(t)}) = \int \ln\left[f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta)\right] f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta^{(t)})\,dy_{\mathrm{miss}}.
\]

Lemma H(θ^{(t)}|θ^{(t)}) − H(θ|θ^{(t)}) ≥ 0 for all θ, θ^{(t)} in the parameter space.

Proof: Let
\[
Z = Z(y_{\mathrm{miss}}) = \frac{f}{f^{(t)}}, \quad\text{where } f^{(t)} = f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta^{(t)}) \text{ and } f = f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta).
\]
Note that −ln(Z) is a convex function of Z. It follows from Jensen's inequality that E[−ln(Z)] ≥ −ln[E(Z)]. Accordingly,
\[
H(\theta^{(t)}|\theta^{(t)}) - H(\theta|\theta^{(t)}) = \int \ln\!\left(\frac{f^{(t)}}{f}\right) f^{(t)}\,dy_{\mathrm{miss}}
= -\int \ln\!\left(\frac{f}{f^{(t)}}\right) f^{(t)}\,dy_{\mathrm{miss}}
= -\mathrm{E}\left[\ln(Z)\,\middle|\,y_{\mathrm{obs}},\theta^{(t)}\right]
\ge -\ln\left[\mathrm{E}\left(Z|y_{\mathrm{obs}},\theta^{(t)}\right)\right] = 0,
\]
because
\[
\mathrm{E}\left(Z|y_{\mathrm{obs}},\theta^{(t)}\right) = \int \frac{f}{f^{(t)}}\,f^{(t)}\,dy_{\mathrm{miss}} = \int f\,dy_{\mathrm{miss}} = 1.
\]

The EM algorithm chooses θ^{(t+1)} to maximize Q(θ|θ^{(t)}) with respect to θ. It can be shown that, under some fairly general conditions, this strategy yields a sequence of estimates that converges to the MLE. We will not prove this result, but we will prove the following theorem.

Theorem 12 The likelihood function is increased at each iteration of the EM algorithm.

Proof: Write the likelihood at the tth iteration as
\[
\ell(\theta^{(t)}|y_{\mathrm{obs}}) = Q(\theta^{(t)}|\theta^{(t)}) - H(\theta^{(t)}|\theta^{(t)}).
\]
Denote the maximizer of Q(θ|θ^{(t)}) with respect to θ by θ^{(t+1)}. Then the likelihood function evaluated at θ = θ^{(t+1)} is
\[
\ell(\theta^{(t+1)}|y_{\mathrm{obs}}) = Q(\theta^{(t+1)}|\theta^{(t)}) - H(\theta^{(t+1)}|\theta^{(t)}).
\]
Consider the difference in the likelihood function when evaluated at θ^{(t+1)} and θ^{(t)}:
\[
\ell(\theta^{(t+1)}|y_{\mathrm{obs}}) - \ell(\theta^{(t)}|y_{\mathrm{obs}})
= \left[Q(\theta^{(t+1)}|\theta^{(t)}) - Q(\theta^{(t)}|\theta^{(t)})\right] + \left[H(\theta^{(t)}|\theta^{(t)}) - H(\theta^{(t+1)}|\theta^{(t)})\right].
\]
Both terms in [ ] are nonnegative; the first because θ^{(t+1)} maximizes Q(θ|θ^{(t)}), and the second because of the lemma.

The algorithm is called EM because each iteration consists of two steps. For the (t+1)st iteration the steps are as follows.

3.12.4.1 E Step

Evaluate the expectation
\[
Q(\theta|\theta^{(t)}) = \mathrm{E}\left[\ell(\theta|y_{\mathrm{obs}},y_{\mathrm{miss}})\right]
= \int \ell(\theta|y_{\mathrm{obs}},y_{\mathrm{miss}})\,f(y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta^{(t)})\,dy_{\mathrm{miss}}.
\]
Note that the E step finds the expectation of the complete data log likelihood function, ℓ(θ|y), conditional on the observed data and the current estimate of θ. To evaluate the expectation, the conditional density of y_miss given y_obs and θ^{(t)} must be known. In exponential families, the E step consists of computing the expectation of the sufficient statistic (based on complete data) conditional on the observed data.

3.12.4.2 M Step

Maximize the expected complete data log likelihood function with respect to θ. This maximization is usually easy because it is identical to the maximization required when there are no missing values.

3.12.5 Application to MVN Regression Set-Up

Let Y be a normally distributed n × d random matrix with mean XB and dispersion Σ ⊗ I_n. We desire MLEs of B and Σ. For complete data, we know the answers:
\[
\widehat{B} = (X'X)^-X'Y \quad\text{and}\quad \widehat{\Sigma} = \frac{Y'(I_n - P)Y}{n},
\]
where P = ppo(X).

Let W be an n × d matrix of indicator variables:
\[
w_{ij} = \begin{cases}1 & \text{if the } ij\text{th entry in } Y \text{ is observable,}\\ 0 & \text{if the } ij\text{th entry in } Y \text{ is unobservable.}\end{cases}
\]
Let w = vec(W) and let D = diag(w). Then
\[
y_{\mathrm{obs}} = Dy \quad\text{and}\quad y_{\mathrm{miss}} = (I_{nd} - D)y.
\]
Note that the nd-vectors y_obs and y_miss have zeros replacing the unobservable and observable random variables, respectively. Denote the total number of missing observations by m. That is, tr(D) = nd − m and tr(I_{nd} − D) = m.

The marginal distribution of y_miss is
\[
y_{\mathrm{miss}} \sim \mathrm{N}\left[(I_{nd}-D)(I_d\otimes X)\beta,\; (I_{nd}-D)(\Sigma\otimes I_n)(I_{nd}-D)\right].
\]
Let Y_obs, Y_miss, M_obs, and M_miss be the n × d matrices satisfying
\[
\mathrm{vec}(Y_{\mathrm{obs}}) = y_{\mathrm{obs}},\quad \mathrm{vec}(Y_{\mathrm{miss}}) = y_{\mathrm{miss}},\quad
\mathrm{vec}(M_{\mathrm{obs}}) = \mathrm{E}(y_{\mathrm{obs}}) = D(I_d\otimes X)\beta,\quad\text{and}\quad
\mathrm{vec}(M_{\mathrm{miss}}) = \mathrm{E}(y_{\mathrm{miss}}) = (I_{nd}-D)(I_d\otimes X)\beta.
\]
Note that
\[
Y = Y_{\mathrm{obs}} + Y_{\mathrm{miss}} \quad\text{and}\quad M = XB = M_{\mathrm{obs}} + M_{\mathrm{miss}}.
\]
Denote the parameter estimates after the tth iteration by β^{(t)} and Σ^{(t)}, where β^{(t)} = vec(B^{(t)}).

3.12.5.1 Conditional Distribution of Ymiss

The nd × nd matrices D and I_{nd} − D are each symmetric and idempotent. Let U_1 be an nd × (nd−m) matrix consisting of the non-zero columns of D. Similarly, let U_2 be an nd × m matrix consisting of the non-zero columns of I_{nd} − D. Then
\[
U = \begin{pmatrix}U_1 & U_2\end{pmatrix} \text{ is an orthogonal matrix,}\quad
D = U_1U_1',\quad I_{nd}-D = U_2U_2',\quad U_1'U_1 = I_{nd-m},\quad\text{and}\quad U_2'U_2 = I_m.
\]
Note that z_1 = U_1′y is an (nd−m)-vector of observable random variables and z_2 = U_2′y is an m-vector of unobservable random variables. Also,
\[
U_1z_1 = y_{\mathrm{obs}},\quad U_2z_2 = y_{\mathrm{miss}},\quad U_1'y_{\mathrm{obs}} = z_1,\quad\text{and}\quad U_2'y_{\mathrm{miss}} = z_2.
\]
The joint density of z_1 and z_2 is
\[
\begin{pmatrix}z_1\\ z_2\end{pmatrix} \sim
\mathrm{N}\left[\begin{pmatrix}U_1'\,\mathrm{vec}(XB)\\ U_2'\,\mathrm{vec}(XB)\end{pmatrix},\;
\begin{pmatrix}U_1'(\Sigma\otimes I_n)U_1 & U_1'(\Sigma\otimes I_n)U_2\\ U_2'(\Sigma\otimes I_n)U_1 & U_2'(\Sigma\otimes I_n)U_2\end{pmatrix}\right].
\]
It follows that the conditional distribution of z_2 given z_1 and θ = θ^{(t)} is normal with
\[
\mathrm{E}\left[z_2|z_1,\theta^{(t)}\right] = U_2'\,\mathrm{vec}(XB^{(t)})
+ U_2'(\Sigma^{(t)}\otimes I_n)U_1\left[U_1'(\Sigma^{(t)}\otimes I_n)U_1\right]^{-1}\left[z_1 - U_1'\,\mathrm{vec}(XB^{(t)})\right]
\]
and
\[
\mathrm{Var}\left[z_2|z_1,\theta^{(t)}\right] = U_2'(\Sigma^{(t)}\otimes I_n)U_2
- U_2'(\Sigma^{(t)}\otimes I_n)U_1\left[U_1'(\Sigma^{(t)}\otimes I_n)U_1\right]^{-1}U_1'(\Sigma^{(t)}\otimes I_n)U_2.
\]
Equivalently, the conditional distribution of y_miss given y_obs and θ = θ^{(t)} is normal with
\[
\mathrm{E}\left[y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta^{(t)}\right] = (I_{nd}-D)\,\mathrm{vec}(XB^{(t)})
+ (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)U_1\left[U_1'(\Sigma^{(t)}\otimes I_n)U_1\right]^{-1}U_1'\,\mathrm{vec}(Y_{\mathrm{obs}} - XB^{(t)})
\]
and
\[
\mathrm{Var}\left[y_{\mathrm{miss}}|y_{\mathrm{obs}},\theta^{(t)}\right] = (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)(I_{nd}-D)
- (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)U_1\left[U_1'(\Sigma^{(t)}\otimes I_n)U_1\right]^{-1}U_1'(\Sigma^{(t)}\otimes I_n)(I_{nd}-D).
\]

To compute the conditional expectation and variance of y_miss, the preceding expressions require computation of an (nd−m) × (nd−m) inverse. By using the orthogonality of U, an alternative expression can be obtained:
\[
U'\left(\Sigma^{(t)}\otimes I_n\right)U = \left[U'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U\right]^{-1}
\]
\[
\Longrightarrow\quad
U_1'\left(\Sigma^{(t)}\otimes I_n\right)U_1 =
\left[U_1'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_1
- U_1'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_2
\left\{U_2'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_2\right\}^{-1}
U_2'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_1\right]^{-1}
\]
\[
\Longrightarrow\quad
\left[U_1'\left(\Sigma^{(t)}\otimes I_n\right)U_1\right]^{-1}
= U_1'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_1
- U_1'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_2
\left\{U_2'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_2\right\}^{-1}
U_2'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_1.
\]
The above expressions require inverting Σ^{(t)}, a d × d matrix, and U_2′([Σ^{(t)}]^{-1} ⊗ I_n)U_2, an m × m matrix.

Denote the conditional expectation and variance of Y_miss by
\[
\mathrm{E}(Y_{\mathrm{miss}}|Y_{\mathrm{obs}},\theta^{(t)}) = M^{(t)}_{\mathrm{miss}},\quad
\mathrm{vec}(M^{(t)}_{\mathrm{miss}}) = \mu^{(t)}_{\mathrm{miss}},\quad\text{and}\quad
\mathrm{Disp}(Y_{\mathrm{miss}}|Y_{\mathrm{obs}},\theta^{(t)}) = \Sigma^{(t)}_{\mathrm{miss}}.
\]
Then the conditional distribution of y_miss given y_obs and θ = θ^{(t)} is normal with
\[
\mu^{(t)}_{\mathrm{miss}} = (I_{nd}-D)\,\mathrm{vec}(XB^{(t)})
+ (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)D\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)(I_{nd}-P_2)D\,\mathrm{vec}(Y_{\mathrm{obs}} - XB^{(t)})
\]
and
\[
\Sigma^{(t)}_{\mathrm{miss}} = \mathrm{Disp}\left[Y_{\mathrm{miss}}|Y_{\mathrm{obs}},\theta^{(t)}\right]
= (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)(I_{nd}-D)
- (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)D\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)(I_{nd}-P_2)D(\Sigma^{(t)}\otimes I_n)(I_{nd}-D),
\]
where P_2 is the projection operator projecting onto R(U_2) along N[U_2′([Σ^{(t)}]^{-1} ⊗ I_n)]. That is,
\[
P_2 = U_2\left[U_2'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)U_2\right]^{-1}U_2'\left([\Sigma^{(t)}]^{-1}\otimes I_n\right).
\]


3.12.5.2 The Log Likelihood Function

The complete data log likelihood is
\[
\ell(\theta|y) = -\left(\frac{n}{2}\right)\ln(|\Sigma|) - \left(\frac{nd}{2}\right)\ln(2\pi)
- \left(\frac{1}{2}\right)\mathrm{tr}\left[(Y-XB)'(Y-XB)\Sigma^{-1}\right].
\]
Writing Y as Y = Y_obs + Y_miss, the tr[ ] term in the likelihood function can be written as
\[
\mathrm{tr}\left[(Y-XB)'(Y-XB)\Sigma^{-1}\right]
= \mathrm{tr}\left[(Y_{\mathrm{obs}}-XB)'(Y_{\mathrm{obs}}-XB)\Sigma^{-1}\right]
+ 2\,\mathrm{tr}\left[Y_{\mathrm{miss}}'(Y_{\mathrm{obs}}-XB)\Sigma^{-1}\right]
+ y_{\mathrm{miss}}'(\Sigma^{-1}\otimes I_n)y_{\mathrm{miss}}.
\]

3.12.5.2.1 E Step  The E step consists of evaluating the conditional expectation of ℓ(θ|Y) given θ = θ^{(t)} and Y_obs. The result is
\[
Q(\theta|\theta^{(t)}) = -\frac{n}{2}\ln(|\Sigma|) - \frac{nd}{2}\ln(2\pi)
- \frac{1}{2}\Big\{\mathrm{tr}\left[(Y_{\mathrm{obs}}-XB)'(Y_{\mathrm{obs}}-XB)\Sigma^{-1}\right]
+ 2\,\mathrm{tr}\left[(Y_{\mathrm{obs}}-XB)'M^{(t)}_{\mathrm{miss}}\Sigma^{-1}\right]
+ \mu^{(t)\prime}_{\mathrm{miss}}(\Sigma^{-1}\otimes I_n)\mu^{(t)}_{\mathrm{miss}}
+ \mathrm{tr}\left[\Sigma^{(t)}_{\mathrm{miss}}(\Sigma^{-1}\otimes I_n)\right]\Big\}.
\]
Simplification yields
\[
Q(\theta|\theta^{(t)}) = -\left(\frac{n}{2}\right)\ln(|\Sigma|) - \left(\frac{nd}{2}\right)\ln(2\pi)
- \left(\frac{1}{2}\right)\mathrm{tr}\left[(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}}-XB)'(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}}-XB)\Sigma^{-1}\right]
- \left(\frac{1}{2}\right)\mathrm{tr}\left[T_d(\Sigma^{(t)}_{\mathrm{miss}})\Sigma^{-1}\right].
\]

3.12.5.2.2 M Step  The M step consists of maximizing the expected log likelihood with respect to β and Σ. Equating the derivative of Q with respect to β to zero yields
\[
\frac{\partial Q(\theta|\theta^{(t)})}{\partial\beta} = 0
\;\Longrightarrow\;
-2(\Sigma^{-1}\otimes X')(y_{\mathrm{obs}} + \mu^{(t)}_{\mathrm{miss}}) + 2(\Sigma^{-1}\otimes X'X)\beta = 0
\;\Longrightarrow\;
B^{(t+1)} = (X'X)^-X'(Y_{\mathrm{obs}} + M^{(t)}_{\mathrm{miss}}).
\]
Substituting B^{(t+1)} for B into the expected log likelihood yields
\[
Q(\theta|\Sigma^{(t)},B^{(t+1)}) = -\left(\frac{n}{2}\right)\ln(|\Sigma|) - \left(\frac{nd}{2}\right)\ln(2\pi)
- \left(\frac{1}{2}\right)\mathrm{tr}\left\{\left[(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}})'(I_n-P)(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}}) + A^{(t)}_{\mathrm{miss}}\right]\Sigma^{-1}\right\},
\]
where
\[
A^{(t)}_{\mathrm{miss}} = T_d(\Sigma^{(t)}_{\mathrm{miss}}).
\]
It follows from complete data MLE results that
\[
\Sigma^{(t+1)} = \frac{(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}})'(I_n-P)(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}}) + A^{(t)}_{\mathrm{miss}}}{n}.
\]
The E and M steps are repeated until the sequence θ^{(t)}, θ^{(t+1)}, θ^{(t+2)}, ... converges.


3.12.5.3 Summary

In summary, the EM algorithm for this problem consists of the following steps.

0. Make Initial Guesses. Initial guesses for B and Σ are required; B^{(0)} = 0 and Σ^{(0)} = I_d ought to work well.

1. E Step. Compute M^{(t)}_miss and Σ^{(t)}_miss:
\[
\mu^{(t)}_{\mathrm{miss}} = (I_{nd}-D)\,\mathrm{vec}(XB^{(t)})
+ (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)D\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)(I_{nd}-P_2)D\,\mathrm{vec}(Y_{\mathrm{obs}} - XB^{(t)})
\]
and
\[
\Sigma^{(t)}_{\mathrm{miss}} = (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)(I_{nd}-D)
- (I_{nd}-D)(\Sigma^{(t)}\otimes I_n)D\left([\Sigma^{(t)}]^{-1}\otimes I_n\right)(I_{nd}-P_2)D(\Sigma^{(t)}\otimes I_n)(I_{nd}-D),
\]
where P_2 projects onto R(U_2) along N[U_2′([Σ^{(t)}]^{-1} ⊗ I_n)].

2. M Step. Compute B^{(t+1)} and Σ^{(t+1)}:
\[
B^{(t+1)} = (X'X)^-X'(Y_{\mathrm{obs}} + M^{(t)}_{\mathrm{miss}})
\]
and
\[
\Sigma^{(t+1)} = \frac{(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}})'(I_n-P)(Y_{\mathrm{obs}}+M^{(t)}_{\mathrm{miss}}) + T_d(\Sigma^{(t)}_{\mathrm{miss}})}{n}.
\]

Iterate on steps 1 and 2 until convergence. Denote the estimates after the final step by B̂ and Σ̂.
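The notes report MATLAB estimates for the example that follows; the Python/NumPy sketch below implements steps 0–2 directly with Kronecker products (practical only for small n and d, since the unreduced (nd−m) × (nd−m) inverse is used). The function names, the zero-filled handling of missing cells, and the reading of T_d(·) as the block-trace operator (the d × d matrix of traces of the n × n blocks, consistent with tr[Σ_miss(Σ^{-1} ⊗ I_n)] = tr[T_d(Σ_miss)Σ^{-1}]) are assumptions of this sketch, not part of the notes.

import numpy as np

def block_trace(S, n, d):
    # Assumed T_d(S): d x d matrix of traces of the n x n blocks of the nd x nd matrix S.
    T = np.empty((d, d))
    for k in range(d):
        for l in range(d):
            T[k, l] = np.trace(S[k*n:(k+1)*n, l*n:(l+1)*n])
    return T

def em_mvn_regression(Y, X, W, iters=50):
    # Y: n x d data (any placeholder where W == 0); X: n x q design; W: n x d, 1 = observed.
    n, d = Y.shape
    Yobs = Y * W                                   # zeros replace the unobserved entries
    w = W.flatten(order="F")                       # w = vec(W)
    U1 = np.eye(n * d)[:, w == 1]                  # non-zero columns of D
    XtXinv = np.linalg.pinv(X.T @ X)
    P = X @ XtXinv @ X.T                           # ppo(X)
    B, Sigma = np.zeros((X.shape[1], d)), np.eye(d)   # step 0
    for _ in range(iters):
        V = np.kron(Sigma, np.eye(n))              # Sigma^(t) kron I_n
        resid = (Yobs - X @ B).flatten(order="F")
        A11inv = np.linalg.inv(U1.T @ V @ U1)      # [U1'(Sigma kron I_n)U1]^{-1}
        # E step: conditional mean and dispersion of y_miss given y_obs
        mu = (X @ B).flatten(order="F") + V @ U1 @ A11inv @ (U1.T @ resid)
        mu[w == 1] = 0.0                           # apply (I_nd - D): keep missing coords
        S_miss = (V - V @ U1 @ A11inv @ U1.T @ V) * np.outer(w == 0, w == 0)
        M_miss = mu.reshape((n, d), order="F")
        # M step: complete-data MLEs with Yobs + M_miss and the block-trace correction
        Yc = Yobs + M_miss
        B = XtXinv @ X.T @ Yc
        Sigma = (Yc.T @ (np.eye(n) - P) @ Yc + block_trace(S_miss, n, d)) / n
    return B, Sigma

Usage for the example below would be em_mvn_regression(Y, X, W) with W set to 0 in the cells marked "·"; this sketch has not been checked against the MATLAB run reported in the notes.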

3.12.6 Example

Problem 3.14 in Morrison gives a multivariate data set having missing observations. The following table contains the observed data.

Group     Subject   15 Min   90 Min
Naive        1         5        8
             2       −17       −5
             3        −7       −1
             4        −3        ·
             5        −7       −8
             6        −9      −12
             7        −6       −4
             8         1       −3
             9        −3      −10
Chronic      1        32        4
             2        36       36
             3        20       12
             4         8        4
             5        32       12
             6        54       22
             7        24        ·
             8        60        ·

The corresponding multivariate model is a one-way classification and can be written as
\[
Y = XB + E, \quad\text{where } X = \begin{bmatrix}1_n & (1_{n_1}\oplus 1_{n_2})\end{bmatrix}
\quad\text{and}\quad
B = \begin{pmatrix}\mu'\\ \theta_1'\\ \theta_2'\end{pmatrix}
= \begin{pmatrix}\mu_1 & \mu_2\\ \theta_{11} & \theta_{12}\\ \theta_{21} & \theta_{22}\end{pmatrix}.
\]
The corresponding cell means model is Y = X_*B_* + E, where
\[
X_* = (1_{n_1}\oplus 1_{n_2}) \quad\text{and}\quad
B_* = \begin{pmatrix}\mu_1'\\ \mu_2'\end{pmatrix} = \begin{pmatrix}\mu_{11} & \mu_{12}\\ \mu_{21} & \mu_{22}\end{pmatrix}.
\]
Using MATLAB, the maximum likelihood estimates, after 14 iterations, are
\[
\widehat{B} = \begin{pmatrix}9.3796 & 4.0305\\ -14.4907 & -8.2839\\ 23.8704 & 12.3144\end{pmatrix},
\qquad
\widehat{\Sigma} = \begin{pmatrix}139.3170 & 64.2440\\ 64.2440 & 79.5195\end{pmatrix},
\]
and
\[
\widehat{B}_* = \begin{pmatrix}-5.1111 & -4.2533\\ 33.2500 & 16.3450\end{pmatrix}.
\]

Chapter 4

PRINCIPLES OF MATHEMATICAL STATISTICS

4.1 SUFFICIENCY

Definition: A Statistical Model, F, consists of three components (Y, P_θ, Θ), where Y is the support set for y, P_θ is a probability measure indexed by θ, and Θ is the parameter space. Associated with Y is a σ-algebra (Borel field) of subsets, B_y. Recall that a collection of subsets is a σ-algebra if

1. ∅ ∈ B_y,

2. B_y is closed under complementation: A ∈ B_y ⟹ A^c ∈ B_y, and

3. B_y is closed under countable unions: A_i ∈ B_y for i = 1, 2, ... ⟹ ⋃_{i=1}^∞ A_i ∈ B_y.

Definition: A Statistic is a measurable function s = s(y) with domain Y and counter-domain or support S. Associated with S is a σ-algebra of subsets, B_s. Recall, a function s that maps Y to S is measurable if B ∈ B_s ⟹ s^{-1}(B) ∈ B_y, where
\[
s^{-1}(B) = \{y;\; y\in\mathcal{Y},\; s(y)\in B\}.
\]
A statistic induces the model (S, P^s_θ, Θ), where
\[
B\in\mathcal{B}_s \;\Longrightarrow\; P^s_\theta(B) = P_\theta\!\left[s^{-1}(B)\right].
\]

A realization of the statistic is denoted by s = s(y), whereas S = S(Y) is a random variable.

4.1.1 Distribution Constant Statistic

If the induced model has only one element, then S is distribution constant with respect to F. That is, S is distribution constant if P^s_θ does not depend on θ.

Example: Suppose that Y_i, i = 1, ..., n are iid from a location-scale family of densities. That is,
\[
f_Y(y;\mu,\sigma) = \frac{1}{\sigma}\,h\!\left(\frac{y-\mu}{\sigma}\right)
\]
for some function h. Then the (n−2)-dimensional statistic
\[
S = \left\{\frac{Y_i - Y_1}{Y_n - Y_1}\right\}_{i=2}^{n-1}
\]
is distribution constant. To verify this result, note that the pdf of Z_i = (Y_i − µ)/σ is g(z), which does not depend on θ. Now write Y_i = µ + σZ_i and show that
\[
S = \left\{\frac{Z_i - Z_1}{Z_n - Z_1}\right\}_{i=2}^{n-1}.
\]


A statistic, C, is a maximal distribution constant statistic if C is distribution constant and is not a function of any other distribution constant statistic. For example, if (X_i, Y_i) are iid N(0, Σ), where
\[
\Sigma = \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix},
\]
then C = (Y_1, Y_2, ..., Y_n)′ is maximal distribution constant. So are C_* = (X_1, X_2, ..., X_n)′ and C_{**} = (X_1, Y_2, ..., Y_n)′, so maximal distribution constant statistics need not be unique.

A kth-order distribution constant statistic is a statistic whose first k moments do not depend on θ. For example, consider the linear model y = Xβ + ε, where ε ∼ (0, σ²I_n). Then C = (I_n − H)Y is first-order distribution constant, where H = ppo(X).

One reason to pay attention to distribution constant statistics is that they can be used to reduce the dimension of the data. Suppose that C is a distribution constant statistic and that (C, S) is a one-to-one function of Y. Then
\[
f_{C,S}(c,s;\theta) = f_{S|C}(s|c,\theta)\,f_C(c)
\]
and inference about θ can be made using the conditional distribution f_{S|C}(s|C = c).

4.1.2 Sufficient Statistics

Definition: A statistic S is sufficient for F if the conditional distribution f_{Y|S}(y|s) is the same for all distributions in F.

For example, suppose that F is the family of all continuous distributions. Let Y be a random sample of size n from any distribution in F. Then the order statistics are sufficient. To verify this result, note that
\[
f_{Y|S}(y|S=s) = \frac{f_{Y,S}(y,s;\theta)}{f_S(s;\theta)} = \frac{f_Y(y;\theta)I_{s^*}(y)}{f_S(s;\theta)},
\quad\text{where } s^* \text{ is the set of all } n! \text{ permutations of } s,
\]
\[
= \frac{\prod_{i=1}^n f_Y(y_i;\theta)I_{s^*}(y)}{n!\prod_{i=1}^n f_Y(y_{(i)};\theta)}
= \begin{cases}\dfrac{1}{n!} & \text{if } y \text{ is a permutation of } s\\[4pt] 0 & \text{otherwise.}\end{cases}
\]
This result is the basis of the Fisher permutation test. For example, suppose that Y_i is a random sample of size n_i from a continuous distribution with cdf F_i for i = 1, 2. To test H_0: F_1 = F_2 against H_a: F_1(y) = F_2(y + δ), condition on the order statistics from the combined data set of size n_1 + n_2. Under the null hypothesis, each of the (n_1 + n_2)!/(n_1!\,n_2!) distinct permutations of the combined data set is equally likely. The null sampling distribution of a statistic such as Ȳ_1 − Ȳ_2 can be obtained by computing the statistic under each permutation. The null is rejected if the observed value of the statistic is unlikely under the null.
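A small Python sketch of the permutation test (not from the notes), with made-up illustrative data; it enumerates every equally likely assignment of group labels given the combined order statistics and uses Ȳ_1 − Ȳ_2 as the test statistic.

import numpy as np
from itertools import combinations

y1 = np.array([5.1, 6.3, 4.8, 7.0, 5.9])
y2 = np.array([4.2, 3.9, 5.0, 4.4])
combined = np.concatenate([y1, y2])
n1, n = y1.size, combined.size

observed = y1.mean() - y2.mean()
stats = []
for idx in combinations(range(n), n1):     # all (n1+n2 choose n1) assignments under H0
    mask = np.zeros(n, dtype=bool)
    mask[list(idx)] = True
    stats.append(combined[mask].mean() - combined[~mask].mean())
stats = np.array(stats)

p_value = np.mean(stats >= observed)       # one-sided permutation p-value
print(observed, p_value)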

As another example, let F be the family of all discrete distributions having common support Y. Then the histogram is sufficient. To verify this result, let H be the histogram, i.e., the set of observed frequencies. Now show that
\[
P(Y=y|H=h) =
\begin{cases}
1\Big/\binom{N}{h_1,h_2,\ldots,h_\infty} & \text{if the observed frequencies are } f_i = h_i,\; i = 1,\ldots,\infty\\[4pt]
0 & \text{otherwise.}
\end{cases}
\]
If F is a parametric family of distributions, (Y, f_Y(y;θ), Θ), then S(Y) is sufficient for F if the distribution of Y|S = s does not depend on θ.

For b ∈ B_s, s^{-1}(b) is an event in Y. Accordingly, cycling through all b ∈ B_s induces a partition of Y. That is,
\[
\mathcal{Y} = \bigcup_{b\in\mathcal{B}_s} s^{-1}(b) \quad\text{and}\quad b_i\neq b_j \;\Longrightarrow\; s^{-1}(b_i)\cap s^{-1}(b_j) = \emptyset.
\]
If S is sufficient, then the distribution of Y within any induced partition does not depend on θ. A statistic, S, is minimal sufficient if (a) it is sufficient and (b) if S_* is any other sufficient statistic, then S is a function of S_*. A minimal sufficient statistic partitions the sample space such that the sizes of the partitions are as large as possible and the number of partitions (which may be ∞) is as small as possible.


Theorem 13 (Neyman Factorization) The statistic S(Y) is sufficient for the parametric family F iff f_Y(y;θ) = g(s;θ)m(y).

Proof: First suppose that S is sufficient. Then
\[
f_Y(y|s,\theta) = \frac{f_{Y,S}(y,s;\theta)}{f_S(s;\theta)} = f_{Y|S}(y|s) = \frac{f_Y(y;\theta)I_s[s(y)]}{f_S(s;\theta)}
\;\Longrightarrow\;
f_Y(y;\theta) = g(s;\theta)m(y), \quad\text{where } g(s;\theta) = f_S(s;\theta) \text{ and } m(y) = f_{Y|S}(y|s)\,I_s[s(y)].
\]
Second, suppose that the density factors as f_Y(y;θ) = g(s;θ)m(y). Perform a one-to-one transformation from Y to T = h(Y) = (S, Z)′, where the nature of Z depends on the family. It follows that Y = h^{-1}(T). The Jacobian of the transformation is
\[
|J(T)| = \left|\frac{\partial h^{-1}(T)}{\partial T'}\right|.
\]
Accordingly,
\[
f_{S,Z}(s,z;\theta) = f_Y\!\left[h^{-1}(t);\theta\right]|J(t)| = g(s;\theta)\,m\!\left[h^{-1}(t)\right]|J(t)| \quad\text{by assumption}
\]
\[
\Longrightarrow\quad
f_{Y|S}(y|s,\theta) = \frac{f_Y(y;\theta)I_s[s(y)]}{f_S(s;\theta)}
= \frac{g(s;\theta)m(y)I_s[s(y)]}{\int f_{S,Z}(s,z;\theta)\,dz}
= \frac{g(s;\theta)m(y)I_s[s(y)]}{\int g(s;\theta)m\!\left[h^{-1}(t)\right]|J(t)|\,dz}
= \frac{m(y)I_s[s(y)]}{\int m\!\left[h^{-1}(t)\right]|J(t)|\,dz},
\]
which does not depend on θ.

Theorem 14 (Lehmann-Scheffé-I) For fixed u ∈ Y, define
\[
D(u) = \left\{y;\; y\in\mathcal{Y},\; \frac{f_Y(y;\theta)}{f_Y(u;\theta)} = h(y,u)\;\forall\,\theta\in\Theta\right\}.
\]
Then the partitioning induced by D is minimal sufficient.

Proof: First show that the partitioning induced by D is sufficient. The conditional distribution of Y given that Y ∈ D(u) is
\[
f_{Y|D}[y|y\in D(u),\theta] = \frac{f_Y(y;\theta)\,I_{D(u)}(y)}{P[Y\in D(u);\theta]}.
\]
Note that y ∈ D(u) ⟹ f_Y(y;θ) = f_Y(u;θ)h(y,u). Accordingly,
\[
f_{Y|D}[y|y\in D(u),\theta]
= \frac{f_Y(u;\theta)h(y,u)\,I_{D(u)}(y)}{\displaystyle\int_{y\in D(u)} f_Y(u;\theta)h(y,u)\,dy}
= \frac{h(y,u)\,I_{D(u)}(y)}{\displaystyle\int_{y\in D(u)} h(y,u)\,dy},
\]
which does not depend on θ.

Second, show that the partitioning is minimal. Suppose that S is any other sufficient statistic. Pick any pair (u, y) in the same partition induced by S. Then
\[
\frac{f_Y(y;\theta)}{f_Y(u;\theta)} = \frac{g(s;\theta)m(y)}{g(s;\theta)m(u)}
\quad\text{by Neyman factorization and because } s(y) = s(u),
\]
\[
= \frac{m(y)}{m(u)} = h(y,u) \;\Longrightarrow\; y \text{ and } u \text{ are in the same partition induced by } D.
\]


Example: Laplace Distribution.  Consider a random sample of size n from the density
\[
f_Y(y;\mu,\sigma) = \frac{1}{2\sigma}\exp\left\{-\frac{1}{\sigma}|y-\mu|\right\},
\]
where σ > 0. Then
\[
\frac{f_Y(y;\theta)}{f_Y(u;\theta)} = h(y,u)\;\forall\,\theta\in\Theta
\;\Longleftrightarrow\;
\sum_{i=1}^n|y_i-\mu| = \sum_{i=1}^n|u_i-\mu| \;\forall\,\mu. \tag{$*$}
\]
First set µ < min_i[min(y_i, u_i)] to conclude that Σ_{i=1}^n y_i = Σ_{i=1}^n u_i. Equation (∗) is satisfied if y_(i) = u_(i) for i = 1, ..., n. To see that equality of the order statistics is necessary, suppose that y_(j) < u_(j) for some j. Set µ to be any value in (y_(j), u_(j)). If (∗) is satisfied, then
\[
-\sum_{i=1}^j y_{(i)} + j\mu + \sum_{i=1}^n y_i - \sum_{i=1}^j y_{(i)} - (n-j)\mu
= -\sum_{i=1}^{j-1} u_{(i)} + (j-1)\mu + \sum_{i=1}^n u_i - \sum_{i=1}^{j-1} u_{(i)} - (n-j+1)\mu
\]
\[
\Longleftrightarrow\quad \sum_{i=1}^j y_{(i)} = \sum_{i=1}^{j-1} u_{(i)} + \mu \quad\forall\,\mu\in(y_{(j)},u_{(j)}),
\]
which cannot be satisfied unless y_(j) = u_(j). Accordingly, the order statistics are minimal sufficient because y and u are in the same partition induced by D if and only if their order statistics are identical.

Example: Cauchy Distribution.  Consider a random sample of size n from the density
\[
f_Y(y;\mu,\sigma) = \frac{1}{\sigma\pi\left[1+\left(\frac{y-\mu}{\sigma}\right)^2\right]},
\]
where σ > 0. Then
\[
\frac{f_Y(y;\theta)}{f_Y(u;\theta)} = h(y,u)\;\forall\,\theta\in\Theta
\;\Longleftrightarrow\;
\frac{\prod_{i=1}^n\left[\sigma^2+(u_i-\mu)^2\right]}{\prod_{i=1}^n\left[\sigma^2+(y_i-\mu)^2\right]} = h(y,u)\;\forall\,\mu \text{ and }\forall\,\sigma>0. \tag{$**$}
\]
First set µ = 0. Then the numerator and denominator of (∗∗) are nth degree polynomials in σ² whose coefficients are symmetric functions of the data. The coefficient on σ^{2n} is 1, so the coefficients for each power of σ² must match in the numerator and denominator. The resulting n equations are satisfied only if |y|_{(i)} = |u|_{(i)} for i = 1, ..., n. If µ ≠ 0, then (∗∗) is satisfied only if |y − µ|_{(i)} = |u − µ|_{(i)} for all µ and for i = 1, ..., n. This condition is satisfied only if y_{(i)} = u_{(i)} for all i. Accordingly, the order statistics are minimal sufficient because y and u are in the same partition induced by D if and only if their order statistics are identical.

Theorem 15 If T is sufficient and a unique mle exists, then the mle is a function of T. If the mle exists, but is not unique, then the mle can be chosen to be a function of T.

Proof: Suppose that T is sufficient and that the mle exists. By the Neyman factorization theorem, the probability function can be written as
\[
f_Y(y;\theta) = g(t;\theta)m(y).
\]
If θ maximizes the likelihood function, then it must maximize g(t;θ) and, therefore, can be chosen to be a function of t. If the mle is unique, then only one maximizer of g(t;θ) exists and it must depend on the data only through t.

4.2 EXPONENTIAL FAMILY

Consider the family F = (Y, f_Y(y;θ), Θ), where y is a d-dimensional random vector and θ is a k-dimensional vector of parameters. The family is said to belong to the exponential class or exponential family if
\[
f_Y(y;\theta) = \exp\left\{a(\theta)'b(y)\right\}c(\theta)d(y),
\]
where a(θ) is an m-dimensional function of θ and b(y) is an m-dimensional function of y. Furthermore, the m functions in a(θ) must be linearly independent. The functions would be linearly dependent if there exists a non-zero vector of constants c such that a(θ)′c = 0 for all θ ∈ Θ. Otherwise, the functions are linearly independent.

If k = m, then the family is said to be full. If m > k, then the family is said to be curved. An example of a curved exponential family is N(σ, σ²).

Theorem 16 Suppose that F belongs to the exponential class and that Y_1, Y_2, ..., Y_n is a random sample from f_Y(y;θ). Then
\[
T = \sum_{i=1}^n b(y_i)
\]
is minimal sufficient.

Proof: By the Lehmann-Scheffé-I theorem, the partitioning induced by
\[
\frac{\exp\left\{a(\theta)'\sum_{i=1}^n b(y_i)\right\}}{\exp\left\{a(\theta)'\sum_{i=1}^n b(u_i)\right\}} = h(y,u)\;\forall\,\theta\in\Theta
\]
is minimal sufficient. Accordingly, y and u are in the same partition if and only if
\[
a(\theta)'\left[\sum_{i=1}^n b(y_i) - \sum_{i=1}^n b(u_i)\right] = 0\;\;\forall\,\theta\in\Theta,
\]
and this condition is satisfied if and only if
\[
\sum_{i=1}^n b(y_i) = \sum_{i=1}^n b(u_i),
\]
because the m functions in a(θ) are linearly independent. Therefore,
\[
T = \sum_{i=1}^n b(y_i)
\]
is minimal sufficient.

4.3 INVARIANCE

Definition: A Group of Transformations, G, is a collection of transformations from Y to Y that is

1. closed under inversion, and

2. closed under composition.

A collection of transformations is closed under inversion if
\[
g\in\mathcal{G} \;\Longrightarrow\; \exists\, g'\in\mathcal{G} \ni g'[g(y)] = y\;\;\forall\,y\in\mathcal{Y}.
\]
A collection of transformations is closed under composition if
\[
g\in\mathcal{G},\; g'\in\mathcal{G} \;\Longrightarrow\; \exists\, g^*\in\mathcal{G} \ni g\circ g'(y) = g^*(y)\;\;\forall\,y\in\mathcal{Y}.
\]

Example: Suppose that Y = (−∞, ∞). Then
\[
\mathcal{G} = \{g;\; g(y) = ay + b,\; a\neq 0,\; |a|+|b|<\infty\}
\]
is a group of transformations. The collection is closed under inversion because
\[
g(y) = ay + b \;\Longrightarrow\; y = \frac{1}{a}g(y) - \frac{b}{a}.
\]

The collection is closed under composition because
\[
g(y) = a_1y + b_1,\; g'(y) = a_2y + b_2 \;\Longrightarrow\; g^*(y) = a_1a_2y + (a_1b_2 + b_1) = g\circ g'(y).
\]

Definition: The model F = [Y, f_Y(y;θ), Θ] is invariant with respect to G if
\[
Y\sim f_Y(y;\theta) \;\Longrightarrow\; Y^* \overset{\mathrm{def}}{=} g(Y) \sim f_Y\!\left(y^*;\bar g(\theta)\right),
\]
where ḡ ∈ Ḡ and Ḡ is a group of transformations from Θ to Θ.

Example. Suppose that Y is a scalar random variable having a density in the location-scale family. That is,
\[
f_Y(y;\mu,\sigma) = \frac{1}{\sigma}h\!\left(\frac{y-\mu}{\sigma}\right)
\]
for some h, where σ > 0. Let
\[
\mathcal{G} = \{g;\; g(y) = ay + b,\; a>0,\; |a|+|b|<\infty\}.
\]
Then F is invariant with respect to G because
\[
Y^* = g(Y) \;\Longrightarrow\; f_{Y^*}(y^*;\mu,\sigma) = \frac{1}{a}f_Y\!\left(\frac{y^*-b}{a};\mu,\sigma\right) = f_Y(y^*;a\mu+b,a\sigma).
\]
In this example,
\[
\bar g(\theta) = \begin{pmatrix}a\mu+b\\ a\sigma\end{pmatrix}.
\]

Definition: A group of transformations from Y to Y is said to be transitive if
\[
y\in\mathcal{Y} \text{ and } y'\in\mathcal{Y} \;\Longrightarrow\; y' = g(y) \text{ for some } g\in\mathcal{G}.
\]

Definition: Suppose that Y is a random variable with support Y and probability function p_0(y), where p_0 does not depend on any unknown parameters. Furthermore, suppose that G is a group of transformations from Y onto Y. Denote the probability function for g(Y) by gp_0. Then the family of distributions
\[
\mathcal{F}_\mathcal{G} = \{gp_0;\; g\in\mathcal{G}\}
\]
is called a group family.

Example: Suppose that Y ∼ N(0, 1). Let G be the group of linear transformations with elements g(Y) = σY + µ, where σ > 0. Then F_G is a group family and is the family of normal distributions N(µ, σ²).

Theorem 17 If F is invariant with respect to G and the associated group of transformations Ḡ, from Θ to Θ, is transitive, then F is a group family.

Proof: To verify that F is a group family, we must verify that
\[
f_Y(y;\theta) = gp_0 \;\Longleftrightarrow\; f_Y(y;\theta)\in\mathcal{F}.
\]
Let p_0(y) = f_Y(y; θ_0), where θ_0 ∈ Θ is chosen arbitrarily. By hypothesis, F is invariant. Therefore,
\[
f_Y(y;\theta) = gp_0 \;\Longrightarrow\; f_Y(y;\theta)\in\mathcal{F}.
\]
Furthermore, transitivity of Ḡ reveals that
\[
\theta\in\Theta \;\Longrightarrow\; \exists\,\bar g\in\bar{\mathcal{G}} \ni \theta = \bar g(\theta_0).
\]
Therefore, f_Y(y;θ) = gp_0.


4.3.1 Dimension Reduction by Invariance

Suppose that F is invariant with respect to G and that ḡ(θ_i) = θ_i for some i. Then inference about θ_i should not depend on whether y is observed or g(y) is observed. This principle can sometimes be used to reduce the dimension of the data.

Example: Suppose that Y_i ∼ iid N(µ, σ²). Let G = {g; g(y) = y + b, |b| < ∞}. Then F is invariant with respect to G and ḡ(θ) = (µ + b, σ²). Note that ḡ(σ²) = σ². The statistic T = (Ȳ, S²) is minimal sufficient and
\[
T[g(Y)] = \begin{pmatrix}\bar Y + b\\ S^2\end{pmatrix}.
\]
Inference should be the same for any value of b. Choose b = −Ȳ to see that inference about σ² should depend on S² alone.

Example 2: Suppose that Y_i ∼ iid Gamma(α, λ). Let G = {g; g(y) = ay, 0 < a < ∞}. Then F is invariant with respect to G and ḡ(θ) = (α, λa). Note that ḡ(α) = α. The statistic
\[
T = \begin{pmatrix}\bar Y\\ \frac{1}{n}\sum_{i=1}^n\ln(Y_i)\end{pmatrix}
\]
is minimal sufficient and
\[
T[g(Y)] = \begin{pmatrix}a\bar Y\\ \ln(a) + \frac{1}{n}\sum_{i=1}^n\ln(Y_i)\end{pmatrix}.
\]
Inference should be the same for any value of a. Choose a = Ȳ^{-1} to see that inference about α should depend on the data solely through the ratio of the geometric mean to the arithmetic mean.

4.3.2 Equivariance

Suppose that F is invariant with respect to G and that T = T(Y) is an estimator of θ. That is, T is a function that maps Y to Θ. Then T is equivariant if
\[
T[g(y)] = \bar g[T(y)]\;\;\forall\,g\in\mathcal{G} \text{ and } y\in\mathcal{Y}.
\]

Example: Suppose that Y_i ∼ iid N(µ, σ²) for i = 1, ..., n. Consider the group of linear transformations with typical element g(y) = ay + b, a ≠ 0. Then ḡ(µ, σ²) = (aµ + b, a²σ²). Let T(y) be the mle of θ,
\[
T = \begin{pmatrix}\bar Y\\ V\end{pmatrix}, \quad\text{where } V = \frac{1}{n}\sum_{i=1}^n(Y_i-\bar Y)^2.
\]
Then
\[
T[g(Y)] = \begin{pmatrix}a\bar Y + b\\ a^2V\end{pmatrix} = \bar g[T(Y)].
\]

Theorem 18 If F is invariant with respect to G and the mle is unique, then the mle of θ is equivariant. If the mle is not unique, then the mle can be chosen to be equivariant.

Proof: Denote the likelihood function given y by L_Y(θ;y) and denote the maximizer by θ̂(y). It follows from the invariance of F with respect to G that the likelihood function given y^* = g(y) satisfies
\[
L_{Y^*}\!\left(\bar g(\theta);y^*\right) \propto L_Y(\theta;y),
\]
where the constant of proportionality is the Jacobian of the transformation. Accordingly, if θ̂(y) maximizes L_Y(θ;y), then it also maximizes L_{Y^*}(ḡ(θ);y^*), and an mle of ḡ(θ) is ḡ[θ̂(y)].

4.3.3 Maximal Invariants

Definition: Let G be a group of transformations and let y ∈ Y be a fixed vector. Trace the value of g(y) as g cycles through all transformations in G. The result is called the orbit of y. The orbits in Y form equivalence classes and represent another way to partition Y.


The statistic T = T(y) is said to be invariant with respect to G if
\[
T(y) = T[g(y)]\;\;\forall\,g\in\mathcal{G}.
\]
For example, the statistic
\[
T(y) = \left\{\frac{y_{(i)} - y_{(1)}}{y_{(n)} - y_{(1)}}\right\}_{i=2}^{n-1}
\]
is invariant with respect to the group having typical element g(y) = ay + b, a > 0.

An invariant statistic provides an index of the orbit of y. All y ∈ Y that are on the same orbit have the same value of the invariant statistic T. If the statistic also has the same value on different orbits, then the statistic does not provide an index of all orbits.

A maximal invariant is

1. an invariant statistic

2. that has a distinct value for each orbit.

Accordingly, if T is a maximal invariant, then T(y) = T[g(y)] ∀ g ∈ G. Also, if y and y^* are points in Y and T(y) = T(y^*), then y and y^* must be on the same orbit. That is, y^* = g(y) for some g ∈ G.

Theorem 19 Suppose that M is a statistic and that T is a maximal invariant. Then M is invariant if and only if M is a function of T.

Proof: First, suppose that M is a function of T. Then
\[
M(y) = h[T(y)] \;\Longrightarrow\; M[g(y)] = h[T\circ g(y)] = h[T(y)] = M(y).
\]
Second, suppose that M is invariant. Then M is constant on orbits. Orbits, however, are indexed by T. Therefore, M is a function of T.

4.3.3.1 Examples of Maximal Invariants

1. If Y = ℝ and G = {g; g(y) = ay + b, a > 0}, then for a sample of size n, a maximal invariant is
\[
T(y) = \left\{\frac{y_{(i)} - y_{(1)}}{y_{(n)} - y_{(1)}}\right\}_{i=2}^{n-1}.
\]
To verify this claim, first show that T is invariant. Second, show that
\[
T(y) = T(y^*) \;\Longrightarrow\; y^* = g(y)
\]
for some g ∈ G. The latter is satisfied because
\[
T(y) = T(y^*) \;\Longrightarrow\;
y^*_i = \left[\frac{y^*_{(n)} - y^*_{(1)}}{y_{(n)} - y_{(1)}}\right]y_i + y^*_{(1)} - \left[\frac{y^*_{(n)} - y^*_{(1)}}{y_{(n)} - y_{(1)}}\right]y_{(1)} = ay_i + b.
\]

2. Let G be the group of all monotonic transformations. That is, G = {g; y > y′ ⟹ g(y) > g(y′)}. Then for a sample of size n, the vector of rank-orders is a maximal invariant.

3. Let G be the group of all permutations. That is, G = {g; g(y) = Py, P is a permutation matrix}. Then for a sample of size n, the vector of order statistics is a maximal invariant.

4. Let G be the group of all orthogonal transformations. That is, G = {g; g(y) = Qy, QQ′ = Q′Q = I_n}. Then for a sample of size n, T(y) = y′y is a maximal invariant. Vinograd's theorem can be used to verify this result.

Theorem 20 (Vinograd) Suppose that Y is an n × p matrix and X is an m × p matrix, where n ≥ m. Then
\[
Y'Y = X'X \;\Longleftrightarrow\; \exists\, H : n\times m \ni Y = HX \text{ and } H'H = I_m.
\]

Proof: First suppose that an n × m matrix H exists such that Y = HX and H′H = I_m. Then Y′Y = X′H′HX = X′X. Second, suppose that Y′Y = X′X. Then R(Y′Y) = R(X′X) and it follows that


R(Y′) = R(X′). Accordingly, there exists a matrix L such that Y′ = X′L′ or, equivalently, Y = LX. Denote by r the rank of X and the rank of Y. Write X in terms of its singular values and vectors:
\[
X = U\Lambda V',
\]
where U : m × m and V : p × p are orthogonal matrices and Λ : m × p is a diagonal matrix appended with zeros. Specifically,
\[
U = \begin{pmatrix}U_1 & U_2\end{pmatrix},\quad V = \begin{pmatrix}V_1 & V_2\end{pmatrix},\quad\text{and}\quad
\Lambda = \begin{pmatrix}\Lambda_1 & 0\\ 0 & 0\end{pmatrix},
\]
where U_1 is m × r, U_2 is m × (m−r), V_1 is p × r, V_2 is p × (p−r), and Λ_1 is r × r. One solution to Y = LX is L = YX^+, where X^+ = V_1Λ_1^{-1}U_1′. Another solution is L = YX^+ + K, where K is any n × m matrix that satisfies KX = 0. Factor I_n − ppo(Y) as I_n − ppo(Y) = FF′, where F is n × (n−r). It follows that F satisfies F′Y = 0 and F′F = I_{n−r}. Let
\[
K = F\begin{pmatrix}U_2'\\ 0\end{pmatrix},
\]
where the matrix of zeros is (n−m) × m. Then Y = LX and
\[
L'L = X^{+\prime}Y'YX^{+} + \begin{pmatrix}U_2 & 0\end{pmatrix}F'F\begin{pmatrix}U_2'\\ 0\end{pmatrix}
= X^{+\prime}X'XX^{+} + U_2U_2' \quad\text{because } F'Y = 0,\; F'F = I_{n-r},\text{ and } Y'Y = X'X;
\]
\[
= U_1U_1' + U_2U_2' = UU' = I_m.
\]

4.3.4 Illustration: Uniformly Most Powerful Invariant Test

Let F = (Y, f_Y(y;θ), Θ) and consider the problem of testing H_0: θ ∈ Θ_0 against H_a: θ ∈ Θ\Θ_0. The test can be denoted by ϕ(y), where
\[
\phi(y) = \begin{cases}1 & \text{reject } H_0,\\ 0 & \text{fail to reject } H_0,\\ \gamma & \text{reject } H_0 \text{ with probability } \gamma.\end{cases}
\]
The testing problem is said to be invariant with respect to G if

1. F is invariant with respect to G,

2. θ ∈ Θ_0 ⟺ ḡ(θ) ∈ Θ_0, and

3. θ ∈ Θ\Θ_0 ⟺ ḡ(θ) ∈ Θ\Θ_0.

If the testing problem is invariant with respect to G, then it is desired that the test be invariant with respect to G. The test is invariant with respect to G only if
\[
\phi(y) = \phi[g(y)]\;\;\forall\,g\in\mathcal{G}.
\]
It follows from Theorem 19 that ϕ(y) is invariant if and only if ϕ(y) is a function of a maximal invariant. Accordingly, to find a uniformly most powerful invariant test, attention can be restricted to functions of a maximal invariant.

Suppose that Y_i ∼ iid N(µ, σ²). Consider the problem of testing H_0: µ = µ_0 against H_a: µ ≠ µ_0. Transform from Y to Z, where Z_i = Y_i − µ_0. Then Z_i ∼ iid N(δ, σ²), where δ = µ − µ_0. The problem is to test H_0: δ = 0 against H_a: δ ≠ 0. A uniformly most powerful test does not exist. If attention is restricted to invariant tests, then a uniformly most powerful invariant test can be found.

First, use sufficiency to reduce the data to
\[
T = \begin{pmatrix}\bar Z\\ S^2\end{pmatrix}, \quad\text{where } S^2 = \frac{1}{n-1}\sum_{i=1}^n(Z_i-\bar Z)^2.
\]


The two components of T are independently distributed as
\[
\bar Z \sim \mathrm{N}(\delta,\sigma^2/n) \quad\text{and}\quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.
\]
Second, consider the group of transformations G_1 = {g; g(z̄, s²) = (±z̄, s²)}. This is the group of orthogonal transformations on z̄. It is readily shown that the testing problem is invariant with respect to G_1. A maximal invariant is
\[
K = K(Z) = \begin{pmatrix}\bar Z^2\\ S^2\end{pmatrix}.
\]
The two components of K are independently distributed as
\[
\frac{n\bar Z^2}{\sigma^2} \sim \chi^2_{1,\lambda} \quad\text{and}\quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1},
\quad\text{where } \lambda = \frac{n\delta^2}{2\sigma^2}.
\]
The parameter vector corresponding to K is
\[
\theta_K = \begin{pmatrix}\lambda\\ \sigma^2\end{pmatrix}.
\]
The hypotheses are now H_0: λ = 0 against H_a: λ > 0. Consider a second group of transformations, G_2 = {g; g(z̄², s²) = (az̄², as²), a > 0}. The statistic K is transformed to
\[
K[g(Z)] = \begin{pmatrix}a\bar Z^2\\ aS^2\end{pmatrix}.
\]
The two components are independently distributed as
\[
\frac{na\bar Z^2}{a\sigma^2} \sim \chi^2_{1,\lambda} \quad\text{and}\quad \frac{(n-1)aS^2}{a\sigma^2} \sim \chi^2_{n-1}.
\]
Accordingly,
\[
\bar g(\theta_K) = \begin{pmatrix}\lambda\\ a\sigma^2\end{pmatrix}.
\]
A maximal invariant with respect to G_2 is
\[
R = \frac{n\bar Z^2}{S^2}, \quad\text{which has distribution } F_{1,n-1,\lambda}.
\]
Accordingly, an invariant test must depend on the data only through R.

Recall that a family of probability functions, {f_R(r;λ), λ ≥ 0}, has a monotone likelihood ratio if
\[
\lambda_2 > \lambda_1 \;\Longrightarrow\; \frac{f_R(r;\lambda_2)}{f_R(r;\lambda_1)} \text{ is nondecreasing in } r\;\;\forall\,r\in\{r;\; f_R(r;\lambda_2)>0 \text{ or } f_R(r;\lambda_1)>0\}.
\]
It can be shown that the distribution of R has a monotone likelihood ratio in λ. Accordingly, from a slight generalization of the Karlin-Rubin theorem, the test that rejects H_0 for large values of R is uniformly most powerful invariant.

4.4 CONDITIONING ON ANCILLARIES

4.4.1 Conditionality Principle

The conditionality principle states that if F = (Y, f_Y(y;θ), Θ) and C is distribution constant with respect to F, then inference about θ should be made conditional on C. Suppose that (C, T) is a one-to-one function of Y. The conditionality principle is motivated by the factorization
\[
f_{T,C}(t,c;\theta) = f_{T|C}(t|c,\theta)\,f_C(c).
\]
All sample information about θ is contained in the probability function for T conditional on C.


Example: Randomly draw a realization n of N, where N has a truncated Poisson distribution with parameter λ. Given N = n, draw a random sample of size n from N(µ, σ²) and make an inference about µ and/or σ². The random variable N is distribution constant with respect to (µ, σ²). Accordingly, inference should be made conditional on N = n.

Example: Suppose that Y_i are iid from Unif(θ − 1/2, θ + 1/2). It can be shown that T = (Y_(1), Y_(n)) is minimal sufficient and that A = Y_(n) − Y_(1) is distribution constant. By the conditionality principle, inference about θ should be made conditional on A.

Definition: Let F be a parametric family of distributions and denote the MLE of θ by θ̂. If T is a minimal sufficient statistic and (θ̂, A) is a one-to-one function of T, then A is called auxiliary (a supplement or aid). If A is distribution constant, then A is called ancillary.

4.4.2 Distribution of MLEs in Location-Scale Families

Theorem 21 (Burridge, 1981) Suppose that Y_i are iid from a location-scale pdf f_Y(y;µ,σ) = (1/σ)h((y−µ)/σ). Let q(z) = −ln[h(z)] and denote the jth derivative of q with respect to z by q^{(j)}(z). If q^{(j)}(z) for j = 1, 2 are continuous and q^{(2)}(z) > 0 ∀ z, then the MLE of (µ, σ) is unique if it exists.

Proof: Reparameterize using θ = (θ_1, θ_2) = (1/σ, −µ/σ). Then the log likelihood function is
\[
\ell(\theta;y) = n\ln(\theta_1) - \sum_{i=1}^n q(z_i), \quad\text{where } z_i = \frac{y_i-\mu}{\sigma} = \theta_1y_i + \theta_2.
\]
The goal is to show that −ℓ is convex. This is a sufficient condition for the MLE to be unique, if it exists. The negative of the second derivative of ℓ is
\[
\mathcal{I}_\theta(y) = -\frac{\partial^2\ell(\theta;y)}{\partial\theta\otimes\partial\theta'} =
\begin{pmatrix}
\dfrac{n}{\theta_1^2} + \sum_{i=1}^n y_i^2q^{(2)}(z_i) & \sum_{i=1}^n y_iq^{(2)}(z_i)\\[6pt]
\sum_{i=1}^n y_iq^{(2)}(z_i) & \sum_{i=1}^n q^{(2)}(z_i)
\end{pmatrix}.
\]
Denote the eigenvalues of the observed information matrix by λ_1 and λ_2. Each of the diagonals of I_θ(y) is positive and therefore λ_1 + λ_2 > 0. The determinant is
\[
\lambda_1\lambda_2 = \left(\frac{n}{\theta_1^2} + \sum_{i=1}^n y_i^2q^{(2)}(z_i)\right)\left(\sum_{i=1}^n q^{(2)}(z_i)\right) - \left(\sum_{i=1}^n y_iq^{(2)}(z_i)\right)^2
\]
\[
= \frac{n}{\theta_1^2}\left(\sum_{i=1}^n q^{(2)}(z_i)\right)
+ \left(\sum_{i=1}^n y_i^2q^{(2)}(z_i)\right)\left(\sum_{i=1}^n q^{(2)}(z_i)\right) - \left(\sum_{i=1}^n y_iq^{(2)}(z_i)\right)^2.
\]
If q^{(2)}(z_i) > 0, then the first term in the above sum is positive by inspection. Let
\[
a = \begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_n\end{pmatrix} \quad\text{and}\quad b = \begin{pmatrix}b_1\\ b_2\\ \vdots\\ b_n\end{pmatrix},
\quad\text{where } a_i = y_i\sqrt{q^{(2)}(z_i)} \text{ and } b_i = \sqrt{q^{(2)}(z_i)}.
\]
By the Cauchy-Schwarz inequality, (a′b)² ≤ (a′a)(b′b). Accordingly, the remaining term is nonnegative and λ_1λ_2 > 0. It follows that λ_1 > 0, λ_2 > 0 and the observed information matrix is positive definite. Therefore, −ℓ is convex and the MLE is unique if it exists.

Recall that the vector of order statistics based on a random sample always is sufficient. In some location-scale families, the vector of order statistics is minimal sufficient. Also, the statistic
\[
A = \left\{\frac{Y_{(i)} - Y_{(1)}}{Y_{(n)} - Y_{(1)}}\right\}_{i=2}^{n-1}
\]
is distribution constant and is a maximal invariant with respect to the group of transformations G = {g; g(y) = ay + b, a > 0}. The induced group of transformations on θ is Ḡ = {ḡ; ḡ(θ) = (aµ + b, aσ), a > 0} and, from Theorem 18, it is known that the mle of θ is equivariant. That is,
\[
Y^* = Ya + 1_nb \;\Longrightarrow\;
\hat\theta(y^*) = \begin{pmatrix}\hat\mu(y^*)\\ \hat\sigma(y^*)\end{pmatrix} = \begin{pmatrix}a\hat\mu(y) + b\\ a\hat\sigma(y)\end{pmatrix}.
\]
This equivariance property will be used below to find the distribution of θ̂ conditional on A.

Theorem 22 Suppose that Y_i for i = 1, ..., n is a random sample from a location-scale family with pdf
\[
f_Y(y;\theta) = \frac{1}{\sigma}h\!\left(\frac{y-\mu}{\sigma}\right).
\]
Let
\[
A = \left\{\frac{Y_{(i)} - Y_{(1)}}{Y_{(n)} - Y_{(1)}}\right\}_{i=2}^{n-1}
\quad\text{and denote the mle of }\theta\text{ by}\quad
\hat\theta = \begin{pmatrix}\hat\mu\\ \hat\sigma\end{pmatrix} = \begin{pmatrix}\hat\mu(y)\\ \hat\sigma(y)\end{pmatrix}.
\]
Then the distribution of θ̂ conditional on A is identical to the distribution of θ̂ conditional on Z, where z_i = (y_i − µ̂)/σ̂, and the pdf can be written as
\[
f_{\hat\mu,\hat\sigma}(m,s|A,\theta) = f_{\hat\mu,\hat\sigma}(m,s|Z,\theta)
= \left(\frac{s^{n-2}}{c\,\sigma^n}\right)\prod_{i=1}^n h\!\left(\frac{sz_i}{\sigma} + \frac{m-\mu}{\sigma}\right),
\]
where c is a constant. Caution: even though z_i is a function of µ̂ and σ̂, the value of Z is fixed in the conditional pdf.

Proof: First transform from Y to the vector of order statistics. Then transform from the order statistics to A, T_1, and T_2, where T_1 = Y_(1) and T_2 = Y_(n) − Y_(1). The inverse transformations are Y_(1) = T_1, Y_(i) = T_2A_i + T_1 for i = 2, ..., n−1, and Y_(n) = T_2 + T_1. Let U = (T_1, A_2, ..., A_{n−1}, T_2)′. Then the Jacobian of the transformation from the order statistics to U is
\[
\left|\frac{\partial Y'}{\partial U}\right| =
\begin{vmatrix}
1 & 1 & 1 & 1 & \cdots & 1 & 1\\
0 & T_2 & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & T_2 & 0 & \cdots & 0 & 0\\
0 & 0 & 0 & T_2 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & 0 & \cdots & T_2 & 0\\
0 & A_2 & A_3 & A_4 & \cdots & A_{n-1} & 1
\end{vmatrix}
= T_2^{n-2}.
\]
To verify the evaluation of the determinant above, use
\[
\begin{vmatrix}W_{11} & W_{12}\\ W_{21} & W_{22}\end{vmatrix} = |W_{22}|\,\left|W_{11} - W_{12}W_{22}^{-1}W_{21}\right|
\]
and equate W_{11} to the scalar with value 1 in the upper left-hand corner of the matrix.

The joint density of T_1, T_2, and A, therefore, is
\[
f_{T_1,T_2,A}(t_1,t_2,a;\theta) = \frac{t_2^{n-2}\,n!}{\sigma^n}\prod_{i=1}^n h\!\left(\frac{t_1 + t_2a_i - \mu}{\sigma}\right),
\quad\text{where } a_1 \overset{\mathrm{def}}{=} 0 \text{ and } a_n \overset{\mathrm{def}}{=} 1.
\]
It follows that the joint density of T_1 and T_2, conditional on A, is
\[
f_{T_1,T_2|A}(t_1,t_2|a,\theta) = \frac{f_{T_1,T_2,A}(t_1,t_2,a;\theta)}{f_A(a)}
= \frac{t_2^{n-2}}{k\,\sigma^n}\prod_{i=1}^n h\!\left(\frac{t_1 + t_2a_i - \mu}{\sigma}\right),
\quad\text{where}\quad
k = \int_0^\infty\!\!\int_{-\infty}^\infty \frac{t_2^{n-2}}{\sigma^n}\prod_{i=1}^n h\!\left(\frac{t_1 + t_2a_i - \mu}{\sigma}\right)dt_1\,dt_2.
\]
Note that if A_1 and A_n are defined as A_1 := 0 and A_n := 1, then Y_(i) = A_iT_2 + T_1 and A_i = (Y_(i) − T_1)/T_2 for i = 1, ..., n. It follows from the equivariance of the mle that
\[
\hat\mu(a) = \frac{\hat\mu(y) - t_1}{t_2},\quad \hat\sigma(a) = \frac{\hat\sigma(y)}{t_2},\quad
\hat\mu(y) = t_2\hat\mu(a) + t_1,\quad \hat\sigma(y) = t_2\hat\sigma(a),
\]
\[
t_1 = \hat\mu(y) - \frac{\hat\mu(a)\hat\sigma(y)}{\hat\sigma(a)},\quad\text{and}\quad t_2 = \frac{\hat\sigma(y)}{\hat\sigma(a)}.
\]
The quantities µ̂(a) and σ̂(a) depend on the data solely through A and therefore are constants after conditioning on A = a. The Jacobian of the transformation from (T_1, T_2) to (µ̂, σ̂) is
\[
\left|\frac{\partial T'}{\partial\hat\theta}\right| = \begin{vmatrix}1 & 0\\ -\dfrac{\hat\mu(a)}{\hat\sigma(a)} & \dfrac{1}{\hat\sigma(a)}\end{vmatrix} = \frac{1}{\hat\sigma(a)}.
\]
Accordingly, the density of θ̂ conditional on A is
\[
f_{\hat\mu,\hat\sigma|A}(m,s|a,\theta)
= \left(\frac{s^{n-2}}{c\,\sigma^n}\right)\prod_{i=1}^n h\!\left[\frac{s}{\sigma}\left(\frac{a_i - \hat\mu(a)}{\hat\sigma(a)}\right) + \frac{m-\mu}{\sigma}\right]
= \left(\frac{s^{n-2}}{c\,\sigma^n}\right)\prod_{i=1}^n h\!\left(\frac{s}{\sigma}z_i + \frac{m-\mu}{\sigma}\right),
\]
because (a_i − µ̂(a))/σ̂(a) = (y_(i) − µ̂(y))/σ̂(y), and where
\[
c = k\,\hat\sigma(a)^{n-1} = \int_0^\infty\!\!\int_{-\infty}^\infty \left(\frac{s^{n-2}}{\sigma^n}\right)\prod_{i=1}^n h\!\left(\frac{s}{\sigma}z_i + \frac{m-\mu}{\sigma}\right)dm\,ds.
\]
Laplace's method can be used to obtain an approximation to the constant c.

Theorem 23 (Mill's Ratio) Denote the pdf and cdf of the standard normal distribution evaluated at x by ϕ(x) and Φ(x). The quantity [1 − Φ(x)]/ϕ(x) is known as Mill's ratio. If x > 0, then
\[
\frac{1-\Phi(x)}{\varphi(x)} = \frac{1}{x} - \frac{1}{x^3} + \frac{3}{x^5} - \frac{15}{x^7} + O\!\left(|x|^{-9}\right)
\quad\text{and}\quad
\Phi(x) = 1 - \frac{\varphi(x)}{x}\left[1 - \frac{1}{x^2} + \frac{3}{x^4} - \frac{15}{x^6} + O\!\left(|x|^{-8}\right)\right].
\]

Proof: Write x = 1/ε and construct the Taylor series expansion of
\[
q(\varepsilon) = \frac{1-\Phi(1/\varepsilon)}{\varphi(1/\varepsilon)}
\]
around ε = 0. L'Hôpital's rule is needed to evaluate the derivatives of q at ε = 0. It can be shown that these derivatives are
\[
q^{(2r)}(0) = 0 \quad\text{and}\quad q^{(2r+1)}(0) = (-1)^r(2r+1)!\,(1)(3)(5)\cdots(2r-1) \quad\text{for } r = 0, 1, \ldots, \infty.
\]
The claimed result is obtained by substituting the above derivatives into the expansion and writing ε as 1/x.
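A quick numerical check of the expansion (not from the notes), comparing the exact Mill's ratio from scipy against the four-term approximation; the error shrinks at the stated O(|x|^{-9}) rate.

import numpy as np
from scipy.stats import norm

for x in [3.0, 5.0, 10.0]:
    exact = norm.sf(x) / norm.pdf(x)             # [1 - Phi(x)] / phi(x)
    approx = 1/x - 1/x**3 + 3/x**5 - 15/x**7
    print(x, exact, approx, abs(exact - approx))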

Corollary 1: If c is a positive number, then
\[
\int_{c\sqrt{n}}^\infty \varphi(z)\,dz = 1 - \Phi(c\sqrt{n}) = \frac{\varphi(c\sqrt{n})}{c\sqrt{n}}\left[1 + O(n^{-1})\right] = o\!\left(n^{-h}\right)
\]
for any h > 0, because
\[
\lim_{n\to\infty} n^he^{-c^2n/2} = 0
\]
for any h > 0.

Corollary 2: If c is a positive number, then
\[
\int_{c\sqrt{n}}^\infty z\varphi(z)\,dz = \varphi(c\sqrt{n}) = o\!\left(n^{-h}\right)
\]
for any h > 0.


Corollary 3: If c is a positive number and r ≥ 2 is an integer, then
\[
\int_{c\sqrt{n}}^\infty z^r\varphi(z)\,dz = o\!\left(n^{-h}\right)
\]
for any h > 0. To verify the above result, let u = z^{r−1} and dv = zϕ(z)dz. Then du = (r−1)z^{r−2}dz, v = −ϕ(z), and
\[
\int_{c\sqrt{n}}^\infty z^r\varphi(z)\,dz
= \left.-z^{r-1}\varphi(z)\right|_{c\sqrt{n}}^\infty + (r-1)\int_{c\sqrt{n}}^\infty z^{r-2}\varphi(z)\,dz
= c^{r-1}n^{(r-1)/2}\varphi(c\sqrt{n}) + (r-1)\int_{c\sqrt{n}}^\infty z^{r-2}\varphi(z)\,dz
\]
\[
= o\!\left(n^{-h}\right) + (r-1)\int_{c\sqrt{n}}^\infty z^{r-2}\varphi(z)\,dz.
\]
Now apply Corollary 1 or Corollary 2, or use integration by parts repeatedly until you can apply Corollary 1 or Corollary 2.

Theorem 24 (Laplace's Method) Suppose that u is a p-vector and that f(u) and a(u) are scalar functions that satisfy
\[
f(u) = O(1),\quad a(u) = O(1),\quad\text{and}\quad a(u) > 0.
\]
Define I(n) as
\[
I(n) = \int\cdots\int_R f(u)\,e^{-na(u)}\,du
\]
and denote the minimizer of a(u) by ũ. Denote f and a evaluated at u = ũ by f̃ and ã. Also, denote the rth derivatives of f and a with respect to u, evaluated at u = ũ, by f̃^{(r)} and ã^{(r)}. Specifically,
\[
\tilde f^{(1)} = \left.\frac{\partial f(u)}{\partial u}\right|_{u=\tilde u},\quad
\tilde f^{(2)} = \left.\frac{\partial^2 f(u)}{\partial u'\otimes\partial u}\right|_{u=\tilde u},\quad
\tilde a^{(1)} = \left.\frac{\partial a(u)}{\partial u}\right|_{u=\tilde u},\quad
\tilde a^{(2)} = \left.\frac{\partial^2 a(u)}{\partial u'\otimes\partial u}\right|_{u=\tilde u},
\]
\[
\tilde a^{(3)} = \left.\frac{\partial^3 a(u)}{\partial u'\otimes\partial u'\otimes\partial u}\right|_{u=\tilde u},
\quad\text{and}\quad
\tilde a^{(4)} = \left.\frac{\partial^4 a(u)}{\partial u'\otimes\partial u\otimes\partial u'\otimes\partial u}\right|_{u=\tilde u}.
\]
For notational simplicity and to make the method more transparent, denote the p × p matrix ã^{(2)} by Σ^{-1}. If ũ ∈ R, the region of integration, then
\[
I(n) = n^{-\frac{p}{2}}e^{-n\tilde a}(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}\left[\tilde f + \frac{g}{n} + O\!\left(n^{-2}\right)\right], \quad\text{where}
\]
\[
g = \frac{1}{2}\mathrm{trace}\!\left(\Sigma\tilde f^{(2)}\right)
- \frac{1}{6}\left\{\mathrm{trace}\!\left[2N_p(\Sigma\otimes\Sigma)\left(\tilde a^{(3)}\otimes\tilde f^{(1)}\right)\right] + \sigma'\!\left(\tilde a^{(3)}\otimes\tilde f^{(1)}\right)\sigma\right\}
- \frac{\tilde f}{24}\left\{\mathrm{trace}\!\left[2N_p(\Sigma\otimes\Sigma)\,\tilde a^{(4)}\right] + \sigma'\tilde a^{(4)}\sigma\right\}
\]
\[
+ \frac{\tilde f}{72}\left[\mathrm{vec}\!\left(\tilde a^{(3)}\right)\otimes\mathrm{vec}\!\left(\tilde a^{(3)}\right)\right]'
\mathrm{vec}\!\left[8(N_p\otimes N_p)(\Sigma\otimes\sigma\otimes\Sigma) + 2N_{p^2}\!\left(2N_p\otimes I_{p^2}\right)(\Sigma\otimes\Sigma\otimes\sigma) + (\sigma\otimes\sigma\sigma')\right],
\]
\[
\text{where } N_q = \frac{1}{2}\left(I_{(q,q)} + I_{q^2}\right).
\]


To order O(n^{-1}), the approximation is
\[
I(n) = \frac{e^{-n\tilde a}(2\pi)^{\frac{p}{2}}}{|n\tilde a^{(2)}|^{\frac{1}{2}}}\left[\tilde f + O\!\left(n^{-1}\right)\right].
\]
If p = 1, then the approximations simplify to
\[
I(n) = n^{-\frac{p}{2}}e^{-n\tilde a}(2\pi)^{\frac{p}{2}}\sigma\left[\tilde f + \frac{g}{n} + O\!\left(n^{-2}\right)\right]
\quad\text{and}\quad
I(n) = \frac{e^{-n\tilde a}(2\pi)^{\frac{p}{2}}}{\sqrt{n\tilde a^{(2)}}}\left[\tilde f + O\!\left(n^{-1}\right)\right], \quad\text{where}
\]
\[
\sigma^2 = \frac{1}{\tilde a^{(2)}} \quad\text{and}\quad
g = \frac{1}{2}\sigma^2\tilde f^{(2)} - \frac{1}{2}\sigma^4\tilde a^{(3)}\tilde f^{(1)} - \frac{1}{8}\sigma^4\tilde a^{(4)}\tilde f + \frac{5}{24}\sigma^6\left(\tilde a^{(3)}\right)^2\tilde f.
\]

Proof: First expand f(u) and a(u) around u = ũ:
\[
f(u) = \tilde f + \tilde f^{(1)\prime}(u-\tilde u) + \frac{1}{2}(u-\tilde u)'\tilde f^{(2)}(u-\tilde u) + O\!\left(|u-\tilde u|^3\right) \quad\text{and}
\]
\[
a(u) = \tilde a + \tilde a^{(1)\prime}(u-\tilde u) + \frac{1}{2}(u-\tilde u)'\tilde a^{(2)}(u-\tilde u)
+ \frac{1}{6}(u-\tilde u)'\tilde a^{(3)}\left[(u-\tilde u)\otimes(u-\tilde u)\right]
+ \frac{1}{24}\left[(u-\tilde u)\otimes(u-\tilde u)\right]'\tilde a^{(4)}\left[(u-\tilde u)\otimes(u-\tilde u)\right] + O\!\left(|u-\tilde u|^5\right).
\]
The p-vector ã^{(1)} is zero because ũ is the global minimizer of a(u) and a(u) is differentiable. Also, the p × p matrix ã^{(2)} is positive definite because ũ is the minimizer. Substitute these expansions into the integral and perform a change of variable from u to z = √n(u − ũ). The inverse transformation and Jacobian are
\[
u = \tilde u + \frac{1}{\sqrt{n}}z \quad\text{and}\quad
|J| = \left|\frac{\partial u'}{\partial z}\right| = \left|I_p\frac{1}{\sqrt{n}}\right| = \frac{1}{n^{\frac{p}{2}}}.
\]
Lastly, evaluate the integral as though R = ℝ^p. Odd moments of a random variable with distribution N(0, Σ) are zero. Second-, fourth-, and sixth-order moments can be obtained from the moment generating function. See the appendix in Boik (2002, Biometrika, 89, 159–182). The remainder is of order O(n^{-2}) rather than O(n^{-3/2}) because the O(n^{-3/2}) term is a function of odd-order moments and is zero. Corollaries 1–3 of Theorem 23 can be used to show that the order of the approximation is the same for a region of integration R as for ℝ^p, provided that ũ ∈ R.

Pace and Salvan (1997, pp. 354–355) give an expression for the Laplace approximation using index notation. Index notation provides a simple representation of the quantities, but the matrix notation is easier to translate into computer code.
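To make the p = 1 formulas concrete, here is a small check (not from the notes) applied to the Gamma integral n! = ∫_0^∞ t^n e^{-t} dt. Substituting t = nu gives n^{n+1} ∫ exp{−n a(u)} du with a(u) = u − ln u and f(u) = 1, so ũ = 1, ã = 1, ã^{(2)} = 1, ã^{(3)} = −2, ã^{(4)} = 6, and g = −ã^{(4)}/8 + 5(ã^{(3)})²/24 = 1/12; the approximation reproduces the familiar Stirling series √(2πn) n^n e^{-n}(1 + 1/(12n)).

import numpy as np
from math import factorial

a2, a3, a4 = 1.0, -2.0, 6.0
sigma2 = 1.0 / a2
g = -sigma2**2 * a4 / 8.0 + 5.0 * sigma2**3 * a3**2 / 24.0     # = 1/12
for n in [5, 10, 20]:
    I_first = np.exp(-n) * np.sqrt(2 * np.pi / (n * a2))        # O(n^{-1}) version
    I_second = n**-0.5 * np.exp(-n) * np.sqrt(2 * np.pi) * np.sqrt(sigma2) * (1 + g / n)
    print(n, factorial(n), n**(n + 1) * I_first, n**(n + 1) * I_second)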

The Laplace approximation can be used to approximate the value of c in the conditional density of µ̂ and σ̂ conditional on A = a (see Theorem 22). Recall that
\[
c = \int_0^\infty\!\!\int_{-\infty}^\infty \left(\frac{s^{n-2}}{\sigma^n}\right)\prod_{i=1}^n h\!\left(\frac{s}{\sigma}z_i + \frac{m-\mu}{\sigma}\right)dm\,ds.
\]
This integral depends on the unknown parameters µ and σ. Transform from µ̂ and σ̂ to
\[
W_1 = -\frac{(\hat\mu-\mu)}{\hat\sigma} \quad\text{and}\quad W_2 = \frac{\sigma}{\hat\sigma}.
\]
The inverse transformations and Jacobian are
\[
\hat\mu = \mu - \frac{W_1\sigma}{W_2},\quad \hat\sigma = \frac{\sigma}{W_2},\quad\text{and}\quad
|J| = \begin{vmatrix}-\dfrac{\sigma}{W_2} & 0\\[4pt] \dfrac{W_1\sigma}{W_2^2} & -\dfrac{\sigma}{W_2^2}\end{vmatrix} = \frac{\sigma^2}{W_2^3}.
\]
Accordingly, the constant can be obtained as
\[
c = \int_0^\infty\!\!\int_{-\infty}^\infty \left(\frac{1}{w_2^{n+1}}\right)\prod_{i=1}^n h\!\left(\frac{z_i - w_1}{w_2}\right)dw_1\,dw_2.
\]


Note that (W1,W2) are pivotal quantities because their distribution does not depend on θ.Let

a(w) =

(1 +

1

n

)ln(w2) −

1

n

n∑

i=1

ln

[h

(zi − w1

w2

)].

Then,

c =

∫ ∞

0

∫ ∞

−∞e−na(w)dw1 dw2.

To apply Laplace's approximation, the minimizer of a(w) is needed. Note that the conditional pdf of W looks like the pdf of a random sample Z₁, Z₂, …, Z_n from the location-scale family with parameters w₁ and w₂, except that the scale parameter has exponent n + 1 rather than n. This suggests that a good initial guess for the minimizer of a(w) is w ≈ (0, 1)′. Numerical methods are needed to find the exact minimizer, w̄ say. After computing w̄, the approximate value of c is obtained from

c = e^{−nā} (2π) |nā^{(2)}|^{-1/2} [ 1 + O(n^{-1}) ].

Numerical integration is usually required to refine the constant. One must be careful, because the constant can be very large or very small. It usually is best to do most computations on the log scale. For example, denote the Laplace approximation to the constant by c̃. That is,

ln(c̃) = −nā + ln(2π) − (1/2) ln|nā^{(2)}|.

Then, define k as

k = ∫_{ℓ₃}^{ℓ₄} ∫_{ℓ₁}^{ℓ₂} exp{ −ln(c̃) − (n + 1) ln(w₂) + Σ_{i=1}^n ln[ h( (z_i − w₁)/w₂ ) ] } dw₁ dw₂,

where ℓ_i for i = 1, …, 4 are chosen to satisfy

P(ℓ₁ ≤ W₁ ≤ ℓ₂) ≈ 1 and P(ℓ₃ ≤ W₂ ≤ ℓ₄) ≈ 1.

The large sample distribution of W is

W ∼ N(µ_w, Σ_w), where µ_w = w̄ and Σ_w = (1/n)[ā^{(2)}]^{-1}.

Accordingly, limits such as

ℓ₁ = w̄₁ − bσ_{w₁},  ℓ₂ = w̄₁ + bσ_{w₁},  ℓ₃ = w̄₂ − bσ_{w₂},  and  ℓ₄ = w̄₂ + bσ_{w₂}

for b > 6 are reasonable. I suggest using b = 10 or more, but set ℓ₃ to a small positive number if it is negative. After integrating, compute ln c as ln c = ln c̃ + ln k.
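The sketch below (my own illustration, not from the notes) carries out this log-scale refinement for a hypothetical standard normal kernel h and simulated configuration statistics z₁, …, z_n: it minimizes a(w), forms ln c̃ from the Laplace formula, evaluates k by quadrature over the ±10 standard-error box, and returns ln c = ln c̃ + ln k.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import dblquad

# Assumptions: standard normal kernel h; hypothetical z_1, ..., z_n.
rng = np.random.default_rng(0)
n = 15
z = rng.standard_normal(n)
log_h = lambda u: -0.5 * u**2 - 0.5 * np.log(2 * np.pi)

def a(w):                                        # a(w1, w2) as defined above, w2 > 0
    w1, w2 = w
    return (1 + 1 / n) * np.log(w2) - np.mean(log_h((z - w1) / w2))

w_bar = minimize(a, x0=np.array([0.0, 1.0]), method="Nelder-Mead").x

def hessian(f, x, eps=1e-4):                     # numerical a^(2) at the minimizer
    k, H = len(x), np.zeros((len(x), len(x)))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * eps, np.eye(k)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

a2 = hessian(a, w_bar)
log_c_tilde = -n * a(w_bar) + np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(n * a2)[1]

se = np.sqrt(np.diag(np.linalg.inv(a2)) / n)     # large-sample SDs of (W1, W2)
l1, l2 = w_bar[0] - 10 * se[0], w_bar[0] + 10 * se[0]
l3, l4 = max(w_bar[1] - 10 * se[1], 1e-6), w_bar[1] + 10 * se[1]

integrand = lambda w1, w2: np.exp(-log_c_tilde - (n + 1) * np.log(w2)
                                  + np.sum(log_h((z - w1) / w2)))
k_val, _ = dblquad(integrand, l3, l4, lambda w2: l1, lambda w2: l2)
log_c = log_c_tilde + np.log(k_val)              # refined ln c
print(log_c_tilde, log_c)
```

Because the integrand is scaled by exp(−ln c̃), it stays of moderate size near the mode, which is the point of working on the log scale.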

After the exact value of ln c has been obtained, exact marginal confidence intervals for µ and σ can be obtained by finding the percentiles of the marginal distributions of W₁ and W₂. Define F_{W₁}(w₁|a) and F_{W₂}(w₂|a) as

F_{W₁}(w₁|a) = ∫_{ℓ₃}^{ℓ₄} ∫_{ℓ₁}^{w₁} exp{ −ln(c) − (n + 1) ln(v₂) + Σ_{i=1}^n ln[ h( (z_i − v₁)/v₂ ) ] } dv₁ dv₂ and

F_{W₂}(w₂|a) = ∫_{ℓ₃}^{w₂} ∫_{ℓ₁}^{ℓ₂} exp{ −ln(c) − (n + 1) ln(v₂) + Σ_{i=1}^n ln[ h( (z_i − v₁)/v₂ ) ] } dv₁ dv₂.

Solve

F_{W₁}(q₁|a) = α/4,  F_{W₁}(q₂|a) = 1 − α/4,  F_{W₂}(q₃|a) = α/4,  and  F_{W₂}(q₄|a) = 1 − α/4

for q₁, …, q₄. Then

σ̂q₁ + µ̂ ≤ µ ≤ σ̂q₂ + µ̂ and σ̂q₃ ≤ σ ≤ σ̂q₄

are 100(1 − α)% simultaneous confidence intervals for µ and σ based on the Bonferroni inequality.


4.5 LIKELIHOOD PRINCIPLE

The LP states that

. . . the “evidential meaning” of experimental results is characterized fully by the likelihood function, without other reference to the structure of the experiment, . . . (Birnbaum, 1962, p. 269).

That is, all you need is likelihood. Birnbaum showed that the LP is a consequence of two other principles that are generally accepted by the statistical community; namely, sufficiency and conditionality.

Hacking (1965) thought that the LP went too far in some ways and not far enough in others. First, the LP is deficient because it does not say how to use the likelihood function. Second, the LP is too strong because it uses the likelihood function without reference to the statistical model. Fraser (1963) and others (see Berger and Wolpert, 1984, p. 47) have criticized the LP on the grounds that the model itself might not be sufficient. Accordingly, the LP should be applied only if the model is believed to capture the entire relationship between the data and the parameters of interest. Third, Hacking argued that belief in the statistical model itself could be uncertain. The investigator should be allowed to formulate alternative models and to evaluate their relative strengths; i.e., model checking should be allowed. Hacking removed these deficiencies by formulating a LL. Royall (1997) attributed the LL to Hacking (1965) and stated the law as follows.

If Hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) > pB(x), and the likelihood ratio, pA(x)/pB(x), measures the strength of that evidence (Royall, 1997, p. 3).

Royall (2000, pp. 760; 2001, 2004) omitted the “only if” condition.

Edwards (1972, reprinted in 1992) argued that it is not necessary to make a distinction between the LL and the LP. He combined them into a single likelihood axiom stating that

Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data, and the likelihood ratio is to be interpreted as the degree to which the data support the one hypothesis against the other. (Edwards, 1992, p. 31).

Joshi (1983) apparently agreed that there is no need to have a LL and a LP. He wrote that the total LP consists of Birnbaum's LP together with the LL, the two parts being kept separate only by convention.

References

Berger, J. O., & Wolpert, R. L. (1984). The Likelihood Principle. Hayward, CA: Institute of Mathematical Statistics.

Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57, 269–326.

Edwards, A. W. F. (1992). Likelihood: Expanded Edition. Baltimore: Johns Hopkins University Press.

Fraser, D. A. S. (1963). On the sufficiency and likelihood principles. Journal of the American Statistical Association, 58, 641–647.

Hacking, I. (1965). Logic of Statistical Inference. London: Cambridge University Press.

Joshi, V. M. (1983). Likelihood principle. In S. Kotz, N. L. Johnson, & C. B. Read (Eds.), Encyclopedia of Statistical Sciences, Vol. 4, 644–647.

Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall.

Royall, R. (2000). On the probability of observing misleading statistical evidence. Journal of the American Statistical Association, 95, 760–768.

Royall, R. (2001). The likelihood paradigm for statistical evidence. This volume.


4.6 COMPLETENESS

Consider the family F = {f_Y(y; θ), θ ∈ Θ}. Let T(Y) be a statistic and denote the family of distributions of T by F_T = {F_T(t; θ), θ ∈ Θ}. The family of densities is said to be complete if

E[h(T)] = 0 ∀ θ ∈ Θ =⇒ P[h(T) = 0] = 1,

where h is a scalar measurable function of T.

Example: Suppose that Y_i ∼ U(θ, 1), independently, for i = 1, …, n. It is readily shown that T = Y_{(1)} is minimal sufficient and that the density of T is

f_T(t; θ) = n(1 − t)^{n−1} (1 − θ)^{-n} I_{(θ,1)}(t).

Suppose that h(T) is a differentiable function of T. Then

E[h(T)] = 0 ∀ θ < 1 =⇒ ∫_θ^1 h(t)(1 − t)^{n−1} dt = 0 ∀ θ < 1
=⇒ ∂/∂θ ∫_θ^1 h(t)(1 − t)^{n−1} dt = 0 ∀ θ < 1
=⇒ −h(θ)(1 − θ)^{n−1} = 0 ∀ θ < 1 =⇒ h(t) = 0 ∀ t < 1.

Accordingly, F_T is complete.

Technically, we must examine measurable functions, not merely differentiable functions. Suppose that h(T) is a measurable function of T. Then

E[h(T)] = 0 ∀ θ < 1 =⇒ ∫_θ^1 h(t)(1 − t)^{n−1} dt = 0 ∀ θ < 1.

Write h(t) = h⁺(t) − h⁻(t), where the positive part, h⁺(t), and the negative part, h⁻(t), are defined as

h⁺(t) = max[0, h(t)] and h⁻(t) = −min[0, h(t)].

Then,

E[h(T)] = 0 ∀ θ < 1 =⇒ ∫_θ^1 h⁺(t)(1 − t)^{n−1} dt = ∫_θ^1 h⁻(t)(1 − t)^{n−1} dt ∀ θ < 1
=⇒ ∫_{θ₁}^{θ₂} h⁺(t)(1 − t)^{n−1} dt = ∫_{θ₁}^{θ₂} h⁻(t)(1 − t)^{n−1} dt ∀ θ₁ < θ₂ < 1.

Define V⁺(A) and V⁻(A) by

V⁺(A) = ∫_A h⁺(t)(1 − t)^{n−1} dt and V⁻(A) = ∫_A h⁻(t)(1 − t)^{n−1} dt,

where A ∈ B, the Borel field defined on (−∞, 1). Then V⁺(A) and V⁻(A) are two measures defined on B, but

V⁺(A) = V⁻(A) ∀ A ∈ B =⇒ h⁺(t)(1 − t)^{n−1} = h⁻(t)(1 − t)^{n−1} except on sets of measure zero
=⇒ h⁺(t) = h⁻(t) except on sets of measure zero
=⇒ P[h⁺(T) − h⁻(T) = 0] = P[h(T) = 0] = 1.

Accordingly, F_T is complete.

Example from Stigler (Am. Stat., 1972, 28–29): Let F = {F_Y(y|N), N ≥ 1}, where

F_Y(y|N) = P(Y = y|N) = 1/N for y = 1, 2, …, N; and 0 otherwise.


For N = 1, E[h(Y)] = h(1) and E[h(Y)] = 0 =⇒ h(1) = 0. For N = 2, E[h(Y)] = (1/2)[h(1) + h(2)] and E[h(Y)] = 0 =⇒ (1/2)[h(1) + h(2)] = 0 =⇒ h(2) = 0 because h(1) = 0. By induction, E[h(Y)] = 0 ∀ N =⇒ h(Y) = 0 with probability 1. Accordingly, the family is complete.

Now examine the family F* = F − {f_Y(y|n)} for any n ≥ 1. The family F* is smaller than F and is not complete. To verify that F* is not complete, examine

h(y) = 0 for y = 1, 2, …, n − 1, n + 2, n + 3, …;  h(y) = a for y = n;  and h(y) = −a for y = n + 1.

Then E[h(Y)] = 0 for N = 1, 2, …, n − 1. For N ≥ n + 1,

E[h(Y)] = (1/N)[0 + 0 + ⋯ + 0 + a − a + 0 + ⋯] = 0.

Accordingly, the family F* is not complete.

Example: Let Y be a random sample of size one from Unif(θ, θ + 1). Then Y is minimal sufficient. Let h(Y) be any bounded periodic non-zero function with period one that satisfies

∫₀¹ h(a + u) du = 0

for any a. For example, h(y) = sin(2πy) will work. Then,

E[h(Y)] = ∫_θ^{θ+1} h(u) du = ∫₀¹ h(z + θ) dz = 0, where u = z + θ.

Accordingly, the family is not complete.

Theorem 25 (Basu) Let F = {f_Y(y; θ), θ ∈ Θ}. Suppose that T is a minimal sufficient complete statistic and that A is distribution constant. Then T and A are independent.

Proof: Denote the support set for T by T. Note that P(A ∈ R) does not depend on θ because A is distribution constant. Also,

P(A ∈ R) = ∫_T P(A ∈ R|T = t) f_T(t; θ) dt = ∫_T P(A ∈ R) f_T(t; θ) dt
=⇒ ∫_T [P(A ∈ R|T = t) − P(A ∈ R)] f_T(t; θ) dt = 0 ∀ R.

Note that P(A ∈ R|T = t) − P(A ∈ R) does not depend on θ and, therefore, can be written as h(t) = P(A ∈ R|T = t) − P(A ∈ R). Completeness of T implies that h(T) = 0 with probability 1. Accordingly, the distribution of A conditional on T is identical to the marginal distribution and this implies that A and T are independent.

4.6.1 Completeness in Exponential Families

Theorem 26 establishes that if the distribution of Y belongs to a full-rank exponential family, then the minimal sufficient statistic also has a distribution that belongs to a full-rank exponential family. First, however, the model will be reparameterized using the natural parameterization.

Consider the family F = {f_Y(y; θ), θ ∈ Θ}, where

f_Y(y; θ) = exp{a(θ)′b(y)} c(θ) d(y),

θ is a k-vector, a(θ) is a k-vector, and the components of a(θ) are linearly independent. The model can be reparameterized by defining η as η ≝ a(θ) and writing θ as θ = a^{-1}(η). The vector η is called the natural parameter. The reparameterized probability function is

f_Y(y; η) = exp{η′b(y)} c*(η) d(y),

where c*(η) = c[a^{-1}(η)].


Theorem 26 Consider a random sample of size n from a full-rank exponential family distribution. The natural parameterization of the probability function can be written as

f_Y(y; θ) = exp{η′t} [c*(η)]^n ∏_{i=1}^n d(y_i), where t = Σ_{i=1}^n b(y_i).

It follows from Theorem 16 that

T = Σ_{i=1}^n b(Y_i)

is minimal sufficient. Then, the distribution of T also is a member of the full-rank exponential class.

Proof: Transform from Y to h(Y) = (T′ Z′)′, where Z is chosen to make the transformation one to one. The inverse transformation is

y = h^{-1}(t, z) or y_i = h_i^{-1}(t, z).

If the probability function is continuous, then the Jacobian of the transformation is

|J| = | ∂y′ / ∂(t′ z′)′ |.

The marginal probability function of T is

f_T(t; θ) = ∫ exp{η′t} [c*(η)]^n ∏_{i=1}^n d[h_i(t, z)] |J| dz
         = exp{η′t} [c*(η)]^n ∫ ∏_{i=1}^n d[h_i(t, z)] |J| dz
         = exp{η′t} [c*(η)]^n d*(t), where d*(t) = ∫ ∏_{i=1}^n d[h_i(t, z)] |J| dz.

If Y has a discrete distribution, then drop the Jacobian and replace integration by summation. In either case, the distribution of T belongs to the exponential class.

Theorem 27 Consider the family F = {f_T(t; θ), θ ∈ Θ}, where F is a member of the full-rank exponential class and where Θ contains an open set. Then F is a complete family of distributions.

Proof: Suppose that there exists a measurable function h(t) that satisfies E[h(T)] = 0 ∀ θ ∈ Θ. Write h in terms of its positive and negative parts so that h(t) = h⁺(t) − h⁻(t), where h⁺ and h⁻ each are nonnegative functions. Assume that f_T is parameterized in terms of its natural parameter and write the probability function of T as

f_T(t; θ) = exp{θ′t} c(θ) d(t).

Then,

E[h(T)] = 0 ∀ θ ∈ Θ
=⇒ ∫ h⁺(t) exp{θ′t} c(θ) d(t) dt = ∫ h⁻(t) exp{θ′t} c(θ) d(t) dt ∀ θ ∈ Θ
=⇒ ∫ h⁺(t) exp{θ′t} dµ(t) = ∫ h⁻(t) exp{θ′t} dµ(t) ∀ θ ∈ Θ,  (∗)

where dµ(t) = d(t) dt. Let θ₀ be an interior point in Θ and let N(θ₀) be a neighborhood of θ₀. That is,

N(θ₀) = {θ; θ ∈ Θ, |θ − θ₀| < ε}

for some ε > 0. Because θ₀ ∈ Θ, it follows that

∫ h⁺(t) exp{θ₀′t} dµ(t) = ∫ h⁻(t) exp{θ₀′t} dµ(t).


Denote the common value of the integral by c. If c = 0, then it follows that h⁺(t) = h⁻(t) = 0 almost everywhere with respect to µ(t) and the family of densities is complete. If c > 0, then

g⁺(t) = (1/c) h⁺(t) exp{θ₀′t} and g⁻(t) = (1/c) h⁻(t) exp{θ₀′t}

are densities with respect to µ(t). Denote the moment generating functions of g⁺ and g⁻ by M⁺(ξ) and M⁻(ξ). That is,

M⁺(ξ) = E⁺(e^{ξ′T}) = (1/c) ∫ h⁺(t) exp{(θ₀ + ξ)′t} dµ(t) and
M⁻(ξ) = E⁻(e^{ξ′T}) = (1/c) ∫ h⁻(t) exp{(θ₀ + ξ)′t} dµ(t).

It is apparent that the moment generating functions exist for ξ in a neighborhood of zero because θ₀ + ξ ∈ N(θ₀) ⊂ Θ. Also, it follows from equation (∗) that M⁺(ξ) = M⁻(ξ). From the uniqueness of moment generating functions, it follows that g⁺(t) = g⁻(t) almost everywhere with respect to µ(t). Accordingly, h⁺(t) = h⁻(t) almost everywhere with respect to µ(t) and F is complete.

4.6.2 UMVUE

Theorem 28 Consider the family F = {Y, f_Y(y; θ), θ ∈ Θ}. Assume that

1. the derivatives of ln[f_Y(y; θ)] with respect to θ exist to order ≥ 2 and are integrable,

2. the order of integration (summation) and differentiation can be interchanged, and

3. the support of Y does not depend on θ.

The score function is defined as

S(θ) ≝ ∂ ln[f_Y(y; θ)] / ∂θ.

Then,

E[S(θ)] = 0 and Var[S(θ)] = I_θ, where I_θ ≝ E[S(θ)S(θ)′].

Furthermore,

I_θ = −E[ ∂² ln[f_Y(y; θ)] / (∂θ ⊗ ∂θ′) ].

Proof: Assume that the distribution is continuous. If it isn't, then replace integration by summation. The expectation of the score function is

E[S(θ)] = ∫_Y (∂ ln[f_Y(y; θ)]/∂θ) f_Y(y; θ) dy = ∫_Y (1/f_Y(y; θ)) (∂f_Y(y; θ)/∂θ) f_Y(y; θ) dy
        = ∫_Y ∂f_Y(y; θ)/∂θ dy = (∂/∂θ) ∫_Y f_Y(y; θ) dy = ∂1/∂θ = 0.

Also,

∂0/∂θ = 0 =⇒ (∂/∂θ) ∫_Y f_Y(y; θ) (∂ ln[f_Y(y; θ)]/∂θ′) dy = 0 ∀ θ ∈ Θ
=⇒ ∫_Y (∂f_Y(y; θ)/∂θ)(∂ ln[f_Y(y; θ)]/∂θ′) dy + ∫_Y f_Y(y; θ) ∂² ln[f_Y(y; θ)]/(∂θ ⊗ ∂θ′) dy = 0 ∀ θ ∈ Θ
=⇒ ∫_Y f_Y(y; θ)(∂ ln[f_Y(y; θ)]/∂θ)(∂ ln[f_Y(y; θ)]/∂θ′) dy + ∫_Y f_Y(y; θ) ∂² ln[f_Y(y; θ)]/(∂θ ⊗ ∂θ′) dy = 0 ∀ θ ∈ Θ
=⇒ E[S(θ)S(θ)′] + E[ ∂² ln[f_Y(y; θ)]/(∂θ ⊗ ∂θ′) ] = 0 ∀ θ ∈ Θ
=⇒ I_θ = −E[ ∂² ln[f_Y(y; θ)]/(∂θ ⊗ ∂θ′) ].

Theorem 29 (Cramer-Rao Lower Bound) Suppose that T is an unbiased estimator of τ, where τ is a vector-valued differentiable function of θ. If I_θ > 0 and the conditions of Theorem 28 are satisfied, then

Var(T) ≥ (∂τ/∂θ′) I_θ^{-1} (∂τ′/∂θ).

That is,

Var(T) − (∂τ/∂θ′) I_θ^{-1} (∂τ′/∂θ)

is nonnegative definite.

Proof: First, it will be shown that

Cov(T, S) = ∂τ/∂θ′,

where S = S(θ) is the score function:

Cov(T, S) = E(TS′) because E(S) = 0
=⇒ Cov(T, S) = ∫_Y T(y)(∂ ln[f_Y(y; θ)]/∂θ′) f_Y(y; θ) dy = ∫_Y T(y)(1/f_Y(y; θ))(∂f_Y(y; θ)/∂θ′) f_Y(y; θ) dy
            = ∫_Y T(y) ∂f_Y(y; θ)/∂θ′ dy = (∂/∂θ′) ∫_Y T(y) f_Y(y; θ) dy = ∂τ/∂θ′,

because T is unbiased for τ. Accordingly,

Corr(ℓ′T, h′S)² = [ ℓ′(∂τ/∂θ′)h ]² / [ ℓ′ Var(T)ℓ · h′I_θh ]

and, by the Cauchy-Schwarz inequality, this quantity is ≤ 1. Now maximize the squared correlation with respect to h:

Corr(ℓ′T, h′S)² ≤ 1 ∀ ℓ, h
=⇒ max_h Corr(ℓ′T, h′S)² ≤ 1 ∀ ℓ, and
max_h Corr(ℓ′T, h′S)² = ℓ′(∂τ/∂θ′) I_θ^{-1} (∂τ′/∂θ)ℓ / [ ℓ′ Var(T)ℓ ].

Accordingly,

ℓ′(∂τ/∂θ′) I_θ^{-1} (∂τ′/∂θ)ℓ ≤ ℓ′ Var(T)ℓ ∀ ℓ
=⇒ ℓ′{ Var(T) − (∂τ/∂θ′) I_θ^{-1} (∂τ′/∂θ) }ℓ ≥ 0 ∀ ℓ
=⇒ Var(T) − (∂τ/∂θ′) I_θ^{-1} (∂τ′/∂θ)

is nonnegative definite.
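As a small numerical check (my own, with hypothetical settings), the sketch below verifies the Cramer-Rao bound for the i.i.d. Bernoulli(p) model with τ(p) = p: the total information is n/[p(1 − p)], so the bound is p(1 − p)/n, and the sample mean attains it.

```python
import numpy as np

# Cramer-Rao bound for i.i.d. Bernoulli(p), estimating tau(p) = p.
# Total information: I_p = n / (p (1 - p)); bound = (d tau / d p)^2 / I_p = p(1-p)/n.
rng = np.random.default_rng(0)
n, p = 50, 0.3
bound = p * (1 - p) / n

# Monte Carlo check that the sample mean attains the bound.
p_hat = rng.binomial(n, p, size=200_000) / n
print(bound, p_hat.var())        # the two numbers should be close
```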

Theorem 30 (Rao-Blackwell) Suppose that T(Y): q × 1 is an unbiased estimator of τ(θ) and that S = S(Y) is a sufficient statistic for the family F = {f_Y(y; θ), θ ∈ Θ}. Denote E(T|S) by T*. Then,

1. T* is a statistic,

2. T* is unbiased for τ(θ), and

3. Var(T*) ≤ Var(T).

Proof: It is known that the distribution of T conditional on S does not depend on θ because S is sufficient. Accordingly, E(T|S) does not depend on θ. Unbiasedness of T* can be established by employing iterated expectation:

τ(θ) = E(T) = E_S[E(T|S)] = E_S(T*).

Furthermore,

Var(T) = E[Var(T|S)] + Var[E(T|S)] = E[Var(T|S)] + Var(T*)
=⇒ Var(T*) = Var(T) − E[Var(T|S)] ≤ Var(T),

because E[Var(T|S)] is nonnegative definite.

Theorem 31 (Lehmann-Scheffe II) Consider the family F = {f_Y(y; θ), θ ∈ Θ}. Suppose that S = S(Y) is minimal sufficient and that the family of densities of S is complete. Let g(S) be a vector-valued function of S whose expectation exists. Then g(S) is the unique UMVUE of γ(θ) ≝ E[g(S)].

Proof: By construction g is unbiased for γ(θ). Suppose that T(Y) is any other unbiased estimator of γ. It follows from Theorem 30 that T* = E(T|S) also is unbiased for γ and that Var(T*) ≤ Var(T). Furthermore, T* is a function of S and

h(S) = g(S) − T*(S) =⇒ E[h(S)] = 0 =⇒ P[g(S) = T*(S)] = 1

because the family of densities is complete. Accordingly, g(S) is a UMVUE because Var[g(S)] = Var[T*(S)] ≤ Var[T(Y)], where T is any other unbiased estimator of γ, and g(S) is the unique UMVUE because it is the only function of S that is unbiased for γ.

Theorem 32 (Rao) Consider the family F = {f_Y(y; θ), θ ∈ Θ}. Denote the class of all unbiased estimators of zero by U. Suppose that T(Y): q × 1 is unbiased for τ(θ). Then, T is UMVUE iff Cov(T, U) = 0 ∀ U ∈ U.

Proof: Part 1. Suppose that T is UMVUE for τ. Choose any U ∈ U and consider the statistic T*_{ℓ,λ} = ℓ′T + λ′U. Define Σ by

Σ = Var(T′ U′)′ = [ Σ_TT  Σ_TU ; Σ_UT  Σ_UU ].

It can be assumed without loss of generality that Σ_UU is nonsingular. Otherwise, write the full-rank SVD of Σ_UU as Σ_UU = FDF′. Then

Var[(I − FF′)U] = 0 =⇒ (I − FF′)U = k, a constant, with probability 1
=⇒ U = FF′U + k = FF′U because E(U) = 0
=⇒ λ′U = λ*′U*,

where λ* = F′λ, U* = F′U, and Var(U*) = D, which is nonsingular. The variance of T*_{ℓ,λ} can be written as

Var(T*_{ℓ,λ}) = (ℓ′ λ′) Σ (ℓ′ λ′)′ = ℓ′Σ_TTℓ + 2ℓ′Σ_TUλ + λ′Σ_UUλ.

It follows that

min_λ Var(T*_{ℓ,λ}) = ℓ′(Σ_TT − Σ_TUΣ_UU^{-1}Σ_UT)ℓ ≤ ℓ′Σ_TTℓ ∀ ℓ.

Furthermore, because T is UMVUE, it follows that ℓ′T is UMVUE for ℓ′τ. Therefore,

Var(ℓ′T) ≤ Var(T*_{ℓ,λ}) ∀ λ, ℓ
=⇒ Var(ℓ′T) ≤ min_λ Var(T*_{ℓ,λ}) ≤ Var(ℓ′T) ∀ ℓ
=⇒ ℓ′Σ_TTℓ ≤ ℓ′(Σ_TT − Σ_TUΣ_UU^{-1}Σ_UT)ℓ ≤ ℓ′Σ_TTℓ ∀ ℓ
=⇒ ℓ′Σ_TUΣ_UU^{-1}Σ_UTℓ = 0 ∀ ℓ
=⇒ Σ_TUΣ_UU^{-1}Σ_UT = 0 =⇒ Σ_UT = 0.

Proof: Part 2. Suppose that E(T) = τ(θ) and that Cov(T, U) = 0 ∀ U ∈ U. Let T* be any other unbiased estimator of τ. Then

E[T(T − T*)′] = Cov(T, T − T*) because E(T − T*) = 0
             = 0 because T − T* ∈ U, and
E[T(T − T*)′] = E{T[(T − τ) − (T* − τ)]′} = Var(T) − Cov(T, T*)
=⇒ Cov(T, T*) = Var(T).

Examine the squared correlation between ℓ′T and ℓ′T*:

[Corr(ℓ′T, ℓ′T*)]² ≤ 1 ∀ ℓ
=⇒ [ℓ′ Cov(T, T*)ℓ][ℓ′ Cov(T, T*)′ℓ] / [ℓ′ Var(T)ℓ · ℓ′ Var(T*)ℓ] ≤ 1 ∀ ℓ ∋ Var(ℓ′T) ≠ 0, Var(ℓ′T*) ≠ 0
=⇒ ℓ′ Var(T)ℓ / ℓ′ Var(T*)ℓ ≤ 1 ∀ ℓ ∋ Var(ℓ′T*) ≠ 0
=⇒ ℓ′ Var(T)ℓ ≤ ℓ′ Var(T*)ℓ ∀ ℓ
=⇒ Var(T) ≤ Var(T*),

using Cov(T, T*) = Var(T).

Chapter 5

LIKELIHOOD BASED INFERENCE

5.1 MULTIVARIATE CENTRAL LIMIT THEOREM

Theorem 33 (Central Limit Theorem) Suppose that y_i for i = 1, …, n are independent p × 1 random vectors, where E(y_i) = µ_i and Var(y_i) = Σ_i. Define G_{2,n} and G_{3,n} by

G_{2,n} ≝ Σ_{i=1}^n Σ_i and G_{3,n} ≝ Σ_{i=1}^n E[(y_i − µ_i) ⊗ (y_i − µ_i) ⊗ (y_i − µ_i)].

Assume that

G_{2,n} is positive definite for n ≥ n₀,
lim_{n→∞} (1/n) G_{2,n} < ∞, and
lim_{n→∞} || (G_{2,n}^{-1/2} ⊗ G_{2,n}^{-1/2} ⊗ G_{2,n}^{-1/2}) G_{3,n} || = 0.

Let

ȳ = n^{-1} Σ_{j=1}^n y_j.

Then,

√n(ȳ − µ̄_n) →dist N(0, Σ) as n → ∞, where

µ̄_n = (1/n) Σ_{j=1}^n µ_j and Σ = lim_{n→∞} (1/n) G_{2,n}.

Outline of Proof: Define Ḡ_{2,n} and z_n by

Ḡ_{2,n} = (1/n) G_{2,n} and z_n = √n Ḡ_{2,n}^{-1/2}(ȳ − µ̄_n).

Denote the characteristic function of y_j by φ_{y_j}(t) = E(e^{it′y_j}). Recall that the characteristic function of a random variable with distribution N(0, I) is exp{−(1/2)t′t}. We will show that

lim_{n→∞} φ_{z_n}(t) = e^{−(1/2)t′t}.

Let u be a scalar. Make the following definitions:

φ_{y_j}(ut) ≝ E(e^{iut′y_j});
φ′_{y_j}(ut) ≝ ∂φ_{y_j}(ut)/∂u = E(e^{iut′y_j} it′y_j); and
φ′′_{y_j}(ut) ≝ ∂²φ_{y_j}(ut)/(∂u)² = −E[e^{iut′y_j}(t′y_j)²].

Then, the following results are readily established:

lim_{u→0} φ_{y_j}(ut) = 1;
lim_{u→0} φ′_{y_j}(ut) = it′µ_j; and
lim_{u→0} φ′′_{y_j}(ut) = −[(t′µ_j)² + t′Σ_jt].

The characteristic function of z_n is

φ_{z_n}(t) = E[ e^{i√n t′Ḡ_{2,n}^{-1/2}(ȳ − µ̄_n)} ] = E[ e^{i(1/√n) t′Ḡ_{2,n}^{-1/2} Σ_j y_j} ] e^{−i√n t′Ḡ_{2,n}^{-1/2} µ̄_n}
= [ ∏_{j=1}^n φ_{y_j}( (1/√n) Ḡ_{2,n}^{-1/2} t ) ] e^{−i√n t′Ḡ_{2,n}^{-1/2} µ̄_n}
= exp{ Σ_{j=1}^n ln φ_{y_j}( (1/√n) Ḡ_{2,n}^{-1/2} t ) − i√n t′Ḡ_{2,n}^{-1/2} µ̄_n }
= exp{ Σ_{j=1}^n [ ln φ_{y_j}( (1/√n) Ḡ_{2,n}^{-1/2} t ) − i(1/√n) t′Ḡ_{2,n}^{-1/2} µ_j ] }.

Let u = 1/√n and expand ln φ_{y_j}(Ḡ_{2,n}^{-1/2}t/√n) − it′Ḡ_{2,n}^{-1/2}µ_j/√n around u = 0:

ln φ_{y_j}(uḠ_{2,n}^{-1/2}t) − iut′Ḡ_{2,n}^{-1/2}µ_j
= ln(1) + it′Ḡ_{2,n}^{-1/2}µ_j u − (1/2) t′Ḡ_{2,n}^{-1/2}Σ_jḠ_{2,n}^{-1/2}t u² + (i³/6) u³ K′′′_j(u*Ḡ_{2,n}^{-1/2}t) − iut′Ḡ_{2,n}^{-1/2}µ_j,

where K_j(uḠ_{2,n}^{-1/2}t) = ln φ_{y_j}(uḠ_{2,n}^{-1/2}t) is the cumulant generating function of y_j and u* ∈ (0, u). The final step is to replace u by 1/√n and u* by 1/√n*, where n* > n, and take the limit as n → ∞:

φ_{z_n}(t) = exp{ −(1/(2n)) Σ_{j=1}^n t′Ḡ_{2,n}^{-1/2}Σ_jḠ_{2,n}^{-1/2}t + (i³/6) Σ_{j=1}^n (1/n^{3/2}) K′′′_j( (1/√n*) Ḡ_{2,n}^{-1/2}t ) }
= exp{ −(1/2)t′t + (i³/6) Σ_{j=1}^n (1/n^{3/2}) K′′′_j( (1/√n*) Ḡ_{2,n}^{-1/2}t ) }

=⇒ lim_{n→∞} φ_{z_n}(t) = exp{ −(1/2)t′t + lim_{n→∞} (i³/6) Σ_{j=1}^n (1/n^{3/2}) K′′′_j( (1/√n*) Ḡ_{2,n}^{-1/2}t ) }
= exp{ −(1/2)t′t + (i³/6) lim_{n→∞} (t ⊗ t ⊗ t)′ (G_{2,n}^{-1/2} ⊗ G_{2,n}^{-1/2} ⊗ G_{2,n}^{-1/2}) G_{3,n} }
= exp{ −(1/2)t′t }.

The CLT can be established under weaker conditions than were assumed in Theorem 33. In particular, if the Lindeberg condition is satisfied, then the CLT holds.


Lindeberg Condition. For every ε > 0,

lim_{n→∞} (1/n) Σ_{j=1}^n E[ y_j′y_j I(y_j′y_j > ε√n) ] = 0.

This condition implies that

lim_{n→∞} Σ_k ( Σ_{j=1}^n Σ_j )^{-1} = 0 for k = 1, …, n.

That is, no single random vector dominates the others. Note that if the y_j are iid, then the Lindeberg condition simplifies to

lim_{n→∞} E[ y′y I(y′y > ε√n) ] = 0,

which is satisfied whenever Var(y) is finite.

The Lindeberg condition is a sufficient, but not a necessary, condition. Feller showed that if another condition is satisfied (uniformly asymptotically negligible summands), then the Lindeberg condition is necessary. If p = 1, then a somewhat stronger sufficient condition is that of Liapounov; namely, moments of order 2 + δ that, for some δ > 0, are not too large. See Severini (2000).
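A quick simulation sketch (my own, with hypothetical heterogeneous distributions) illustrates Theorem 33 for p = 1: the sum of independent, non-identically distributed variables, centered and scaled by G_{2,n}^{1/2}, behaves approximately like a standard normal.

```python
import numpy as np

# Independent, non-identically distributed y_ij with mean sigma_i and variance
# sigma_i^2 (exponential), so G_{2,n} = sum_i sigma_i^2.  The standardized sum
# should be approximately N(0, 1).
rng = np.random.default_rng(1)
n = 400
sigma = 0.5 + rng.random(n)                   # heterogeneous scales
G2n = np.sum(sigma**2)

reps = 20_000
y = rng.exponential(scale=sigma, size=(reps, n))
z = (y.sum(axis=1) - sigma.sum()) / np.sqrt(G2n)
print(z.mean(), z.var())                      # approximately 0 and 1
```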

5.2 SLUTSKY’S THEOREM AND THE DELTA METHOD

The following theorem is very useful when obtaining asymptotic distributions. For a proof, see Theorem 3.4.2 in

Sen, P. K., & Singer, J. M. (1993). Large Sample Methods in Statistics: An Introduction with Applications. London: Chapman & Hall.

Theorem 34 (Slutsky) Let t_n be a random p-vector that satisfies t_n →dist t as n → ∞, where t is a random p-vector. Suppose that a_n is a random k-vector and that B_n is a random k × p matrix that satisfy a_n →prob a and B_n →prob B, where a is a vector of constants and B is a matrix of constants. Then

1. a_n + B_n t_n →dist a + Bt

2. a_n + B_n^{-1} t_n →dist a + B^{-1}t if k = p and B^{-1} exists.

Theorem 35

1. If g is a continuous function and t_n →dist t, then g(t_n) →dist g(t).

2. If g is a continuous function and t_n →prob t, then g(t_n) →prob g(t).

Proof: See Rao (1973, Section 2c).

Theorem 36 (Delta Method) Let t_n be a random p-vector with asymptotic distribution √n(t_n − θ) →dist N(0, Σ). Suppose that g(t_n) is a differentiable function.

1. The asymptotic distribution of g(t_n) is

√n[g(t_n) − g(θ)] →dist N[0, G(θ)ΣG(θ)′], where G(θ) = ∂g(t_n)/∂t_n′ |_{t_n=θ}.

2. If G(θ) is a continuous function of θ, then

√n [G(t_n)ΣG(t_n)′]^{-1/2} [g(t_n) − g(θ)] →dist N(0, I).

Proof: Expand g(t_n) in a Taylor series around t_n = θ:

g(t_n) = g(θ) + G(θ)(t_n − θ) + O_p(n^{-1})
=⇒ √n[g(t_n) − g(θ)] = G(θ)√n(t_n − θ) + O_p(n^{-1/2}).

Use Slutsky's Theorem (Theorem 34) to obtain result (1). Use the law of large numbers to obtain (2). If Σ is a function of θ, then result (2) requires that Σ(θ) be a continuous function.
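As a quick illustration (my own hypothetical example), the sketch below applies result (1) to the log-odds transformation g(p) = ln[p/(1 − p)] of a binomial proportion. Since √n(p̂ − p) →dist N(0, p(1 − p)) and G(p) = 1/[p(1 − p)], the delta method gives Var[g(p̂)] ≈ 1/[n p(1 − p)].

```python
import numpy as np

# Delta-method check: g(p) = log(p / (1 - p)) applied to a binomial proportion.
rng = np.random.default_rng(2)
n, p = 200, 0.25
delta_var = 1.0 / (n * p * (1 - p))            # G(p)^2 * p(1-p) / n

p_hat = rng.binomial(n, p, size=100_000) / n
logit = np.log(p_hat / (1 - p_hat))            # assumes 0 < p_hat < 1 in every replicate
print(delta_var, logit.var())                  # the two variances should be close
```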

5.3 DISTRIBUTION OF MLES

This section is based on material from the following sources.

Lehmann, E. L. (1999). Elements of Large Sample Theory, New York: Springer-Verlag.

Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd ed., New York: Wiley.

5.3.1 Assumptions

Let y_i for i = 1, …, n be a set of independently distributed random p-vectors with distributions y_i ∼ f_i(y_i|θ), θ ∈ Θ. Accordingly, the joint density is

f(y|θ) = ∏_{i=1}^n f_i(y_i|θ),

and the log likelihood function (log of the joint pdf) is

ℓ(θ|y) = ℓ(θ) = Σ_{i=1}^n ln[f_i(y_i|θ)] = Σ_{i=1}^n ℓ_i(θ), where ℓ_i = ln[f_i(y_i|θ)].

It is assumed that ℓ satisfies the following:

1. Denote the support of y_i by Y_i. Assume that Y_i does not depend on θ.

2. Assume that θ is identifiable. That is,

f_i(y|θ₁) = f_i(y|θ₂) ∀ y ∈ Y_i =⇒ θ₁ = θ₂.

3. Assume that the parameter space is an open subset of IR^ν, where ν = dim(θ).

4. Assume that the log likelihood, ℓ(θ), is a third-order differentiable function of θ in an open neighborhood of Θ that contains the true value. These derivatives can be obtained by differentiating under the integral sign.

5. Denote the total information by I_θ. That is,

I_θ = E[ (∂ℓ(θ)/∂θ)(∂ℓ(θ)/∂θ)′ ].

It is assumed that I_θ is finite, positive definite, and that lim_{n→∞} I_θ^{-1} = 0.

6. Denote the true value of θ by θ₀. Assume that if θ is in a neighborhood of θ₀, then

| ∂³ℓ_i(θ)/(∂θ ⊗ ∂θ′ ⊗ ∂θ′) | < M_i(y_i),

where M_i(y_i) satisfies E_{θ₀}[M_i(y_i)] < ∞.


5.3.2 Consistency of MLEs

Theorem 37 With probability 1 as n → ∞, the likelihood equations have a root that is consistent for θ.

Proof: Denote the true value of θ by θ₀ and consider the likelihood function evaluated at θ₀ + δ, where δ is an arbitrary vector in a neighborhood of zero. By Jensen's inequality,

E_{θ₀}[ ln( f(y|θ₀ + δ) / f(y|θ₀) ) ] < 0.

Accordingly, by the law of large numbers,

Pr[ ℓ(θ₀ + δ)/n − ℓ(θ₀)/n < 0 ] → 1 as n → ∞.

Because δ is arbitrary, it follows that the likelihood function has a local maximum at θ₀. Furthermore, because ℓ(θ) is continuous and differentiable, the derivative at this maximum is zero and θ₀ is a solution to

∂ℓ(θ)/∂θ = 0.

5.3.3 Distribution of Score Function

The function

S(θ) = ∂ℓ(θ)/∂θ = (∂/∂θ) Σ_{i=1}^n ln[f_i(y_i|θ)] = (∂/∂θ) Σ_{i=1}^n ℓ_i(θ)

is called the score function. It is a function of y and the unknown parameters; thus it is not a statistic.

Theorem 38 The asymptotic distribution of the score function is the following:

S(θ)/√n →dist N(0, Ī_{θ,∞}),

where Ī_{θ,∞} is average Fisher information. Average information can be computed as

Ī_{θ,∞} = lim_{n→∞} Ī_{θ,n}, where
Ī_{θ,n} = n^{-1} E_θ[ (∂ ln[f(y|θ)]/∂θ)(∂ ln[f(y|θ)]/∂θ′) ] or
Ī_{θ,n} = −n^{-1} E_θ[ ∂² ln[f(y|θ)]/(∂θ ⊗ ∂θ′) ].

The notation E_θ(·) means expectation evaluated when the parameter value is θ.

The notation Eθ(·) means expectation evaluated when the parameter value is θ.

Proof:

S(θ)√n

=√n(S − 0), where S =

1

n

n∑

i=1

Si and Si =∂ ln [fi(yi|θ)]

∂θ.

All that is necessary is to find the expectation and variance of Si and then to apply the multivariate CLT (Theorem33).

E(Si) = Eθ

[∂ ln [fi(y|θ)]

∂θ

]=

∫∂ ln [fi(y|θ)]

∂θfi(y|θ) dy

=

∫1

fi(y|θ)∂fi(y|θ)∂θ

fi(y|θ) dy =

∫∂fi(y|θ)∂θ

dy

=∂

∂θ

∫fi(y|θ) dy =

∂θ1 = 0

74 CHAPTER 5. LIKELIHOOD BASED INFERENCE

=⇒ Var(Si) = Eθ

[(∂ ln [fi(yi|θ)]

∂θ

)(∂ ln [fi(yi|θ)]

∂θ′

)].

Also, Si for i = 1, . . . , n are jointly independent, so

E(S) = 0 and

Var(S) = Var

(n−1

n∑

i=1

Si

)= n−2

n∑

i=1

Eθ(SiS′i)

= n−2n∑

i=1

[(∂ ln [fi(yi|θ)]

∂θ

)(∂ ln [fi(yi|θ)]

∂θ′

)]

= n−2 Eθ

[(∂ ln [f(y|θ)]

∂θ

)(∂ ln [f(y|θ)]

∂θ′

)]= n−1 Iθ,n

=⇒ √n(S − 0)

dist−→ N(0, Iθ,∞).

To obtain the alternative expression for Iθ,n, note that

∫∂ ln [f(y|θ)]

∂θf(y|θ) dy = 0 ∀ θ in an open neighborhood of the true value

=⇒ 0 =∂

∂θ′

∫∂ ln [f(y|θ)]

∂θf(y|θ) dy

=

∫∂2 ln [f(y|θ)]∂θ ⊗ ∂θ′

f(y|θ) dy +

∫∂ ln [f(y|θ)]

∂θ

∂f(y|θ)∂θ′

dy

=⇒ Eθ

[(∂ ln [f(y|θ)]

∂θ

)(∂ ln [f(y|θ)]

∂θ′

)]= −Eθ

[∂2 ln [f(y|θ)]∂θ ⊗ ∂θ′

].

Note: Replace integration by summation for discrete random variables.

Example 1: Beta pdf. Suppose y_i ∼ iid Beta(α, β) for i = 1, …, n. Then, for θ = (α β)′,

S(θ) = ( Σ_{i=1}^n ln(y_i) − nψ(α) + nψ(α + β) ; Σ_{i=1}^n ln(1 − y_i) − nψ(β) + nψ(α + β) ) and

Ī_{θ,n} = Ī_{θ,∞} = [ ψ′(α) − ψ′(α + β)   −ψ′(α + β) ; −ψ′(α + β)   ψ′(β) − ψ′(α + β) ],

where

ψ(t) = ∂ ln Γ(t)/∂t and ψ′(t) = ∂² ln Γ(t)/(∂t)².

Example 2: Bernoulli pdf. Suppose that y_i are independently distributed as Bernoulli(π_i), where logit(π_i) = x_i′θ. Then,

S(θ) = Σ_{i=1}^n x_i(y_i − π_i) and Ī_{θ,n} = (1/n) Σ_{i=1}^n x_iπ_i(1 − π_i)x_i′.
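The sketch below (my own, with simulated data) evaluates the Beta score and average information from Example 1 and checks by simulation that E[S(θ)] ≈ 0 and that Var[S(θ)/√n] ≈ Ī_{θ,n}.

```python
import numpy as np
from scipy.special import digamma, polygamma

# Example 1 (Beta), theta = (alpha, beta)'.
def beta_score(y, alpha, beta):
    n = y.size
    return np.array([
        np.sum(np.log(y)) - n * digamma(alpha) + n * digamma(alpha + beta),
        np.sum(np.log1p(-y)) - n * digamma(beta) + n * digamma(alpha + beta),
    ])

def beta_avg_info(alpha, beta):
    tri = polygamma(1, alpha + beta)              # psi'(alpha + beta)
    return np.array([[polygamma(1, alpha) - tri, -tri],
                     [-tri, polygamma(1, beta) - tri]])

rng = np.random.default_rng(3)
alpha, beta, n = 2.0, 5.0, 100
scores = np.array([beta_score(rng.beta(alpha, beta, n), alpha, beta)
                   for _ in range(5000)])
print(scores.mean(axis=0))                        # approximately (0, 0)
print(np.cov(scores.T / np.sqrt(n)))              # approximately the average information
print(beta_avg_info(alpha, beta))
```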

5.3.4 Asymptotic Distribution of MLEs

Denote the true value of θ by θ₀ and assume that θ̂ is a consistent root of S(θ) = 0. Denote the dimension of θ by ν.


Theorem 39 (Distribution of MLE)

S(θ₀)/√n  a.d.=  Ī_{θ₀,n} √n(θ̂ − θ₀) and √n(θ̂ − θ₀) →dist N(0, Ī_{θ₀,∞}^{-1}), where

Ī_{θ₀,∞} = lim_{n→∞} Ī_{θ₀,n} and Ī_{θ₀,n} = (1/n) I_{θ₀,n}.

Proof: From Taylor's theorem with remainder, it follows that

S(θ̂) = 0 = ∂ℓ(θ₀)/∂θ₀ + [∂²ℓ(θ₀)/(∂θ₀ ⊗ ∂θ₀′)](θ̂ − θ₀) + (1/2)[∂³ℓ(θ₁)/(∂θ₁ ⊗ ∂θ₁′ ⊗ ∂θ₁′)][(θ̂ − θ₀) ⊗ (θ̂ − θ₀)],

where θ₁ = αθ̂ + (1 − α)θ₀ for some α ∈ (0, 1). Accordingly,

S(θ₀) = −{ ∂²ℓ(θ₀)/(∂θ₀ ⊗ ∂θ₀′) + (1/2)[∂³ℓ(θ₁)/(∂θ₁ ⊗ ∂θ₁′ ⊗ ∂θ₁′)][(θ̂ − θ₀) ⊗ I_ν] }(θ̂ − θ₀);

θ̂ − θ₀ = −{ ∂²ℓ(θ₀)/(∂θ₀ ⊗ ∂θ₀′) + (1/2)[∂³ℓ(θ₁)/(∂θ₁ ⊗ ∂θ₁′ ⊗ ∂θ₁′)][(θ̂ − θ₀) ⊗ I_ν] }^{-1} S(θ₀); and

√n(θ̂ − θ₀) = −{ (1/n)∂²ℓ(θ₀)/(∂θ₀ ⊗ ∂θ₀′) + (1/(2n))[∂³ℓ(θ₁)/(∂θ₁ ⊗ ∂θ₁′ ⊗ ∂θ₁′)][(θ̂ − θ₀) ⊗ I_ν] }^{-1} S(θ₀)/√n.

The proof is completed by noting that

(1/n) ∂²ℓ(θ₀)/(∂θ₀ ⊗ ∂θ₀′) = −Ī_{θ₀,n} + O_p(n^{-1/2}) →prob −Ī_{θ₀,∞} by the law of large numbers;

| ∂³ℓ(θ₁)/(∂θ₁ ⊗ ∂θ₁′ ⊗ ∂θ₁′) | < Σ_{i=1}^n M_i(y_i) and E_{θ₀}[M_i(y_i)] < ∞ by assumption; and

θ̂ − θ₀ →prob 0 by consistency.

Accordingly,

(1/n) | ∂³ℓ(θ₁)/(∂θ₁ ⊗ ∂θ₁′ ⊗ ∂θ₁′) | < (1/n) Σ_{i=1}^n M_i(y_i), lim_{n→∞} (1/n) Σ_{i=1}^n M_i(y_i) = O(1),

and it follows from Slutsky's Theorem (Theorem 34) that

√n(θ̂ − θ₀) − Ī_{θ₀,n}^{-1} S(θ₀)/√n = o_p(1) and that √n(θ̂ − θ₀) →dist N(0, Ī_{θ₀,∞}^{-1}).

Example 1: Beta pdf. Example 2: Bernoulli pdf.
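Continuing the Beta example, here is a minimal numerical sketch (my own, not from the notes) that computes the MLE θ̂ = (α̂, β̂)′ by maximizing the log likelihood and forms Wald intervals from Ī_{θ̂,n}^{-1}/n, in line with Theorem 39. The data and starting values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import polygamma
from scipy.stats import beta as beta_dist

# MLE and Wald intervals for i.i.d. Beta(alpha, beta) data.
rng = np.random.default_rng(4)
alpha0, beta0, n = 2.0, 5.0, 500
y = rng.beta(alpha0, beta0, n)

negloglik = lambda th: -np.sum(beta_dist.logpdf(y, th[0], th[1]))
fit = minimize(negloglik, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = fit.x

# Average information per observation (Example 1), evaluated at the MLE.
tri = polygamma(1, a_hat + b_hat)
avg_info = np.array([[polygamma(1, a_hat) - tri, -tri],
                     [-tri, polygamma(1, b_hat) - tri]])
cov = np.linalg.inv(avg_info) / n               # Var(theta_hat) ~ I_bar^{-1} / n
se = np.sqrt(np.diag(cov))
print(a_hat, b_hat)
print(a_hat - 1.96 * se[0], a_hat + 1.96 * se[0])   # 95% Wald interval for alpha
```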


5.3.5 Distribution of Maximized Log Likelihood

Denote the dimension of θ by ν and denote the MLE of θ by θ̂.

Theorem 40 The asymptotic distribution of the maximized log likelihood is

2[ℓ(θ̂) − ℓ(θ)]  a.d.=  n(θ̂ − θ)′ Ī_{θ,n} (θ̂ − θ) →dist χ²_ν and
2[ℓ(θ̂) − ℓ(θ)]  a.d.=  (1/n) S(θ)′ Ī_{θ,n}^{-1} S(θ) →dist χ²_ν.

Proof:

ℓ(θ̂) = ℓ(θ) + (∂ℓ(θ)/∂θ′)(θ̂ − θ) + (1/2)(θ̂ − θ)′[∂²ℓ(θ)/(∂θ ⊗ ∂θ′)](θ̂ − θ) + O_p(n^{-1/2})
     = ℓ(θ) + (S(θ)′/√n) √n(θ̂ − θ) + (n/2)(θ̂ − θ)′[−Ī_{θ,n} + O_p(n^{-1/2})](θ̂ − θ) + O_p(n^{-1/2})
     = ℓ(θ) + √n(θ̂ − θ)′ Ī_{θ,n} √n(θ̂ − θ) − (n/2)(θ̂ − θ)′ Ī_{θ,n}(θ̂ − θ) + O_p(n^{-1/2})  by Theorem 38
=⇒ ℓ(θ̂) = ℓ(θ) + (n/2)(θ̂ − θ)′ Ī_{θ,n}(θ̂ − θ) + O_p(n^{-1/2})
=⇒ 2[ℓ(θ̂) − ℓ(θ)] = n(θ̂ − θ)′ Ī_{θ,n}(θ̂ − θ) + O_p(n^{-1/2}) →dist χ²_ν  by Theorem 39.

5.3.6 Distributions Under Reparameterization

The results in this section rely heavily on the chain rule for matrix functions. Two special cases of the chain rule are as follows.

1. Suppose that z = z(w) is a scalar function of the vector w and w = w(x) is a vector function of the vector x. Then

∂z/∂x = (∂w′/∂x ⊗ I₁)(I₁ ⊗ ∂z/∂w) = (∂w′/∂x)(∂z/∂w).  (5.1)

2. Suppose that z = z(w) is a vector function of the vector w and w = w(x) is a vector function of the vector x. Then

∂z′/∂x = (∂w′/∂x ⊗ I₁)(I₁ ⊗ ∂z′/∂w) = (∂w′/∂x)(∂z′/∂w).  (5.2)

It follows from (5.2) that

∂z/∂x′ = (∂z/∂w′)(∂w/∂x′).

Theorem 41 (Derivatives Under Reparameterization) Suppose that θ can be written as g(β), where θ is ν × 1, β is ν₁ × 1, ν₁ ≤ ν, and g(β) is a twice differentiable function of β. Denote the log likelihood function written in terms of β by ℓ[g(β)]. Let S[g(β)] be the score function written in terms of β. That is,

S[g(β)] ≝ ∂ℓ[g(β)]/∂β.

Define F by

F ≝ ∂θ/∂β′ = ∂g(β)/∂β′.

Assume that F: ν × ν₁ has rank ν₁. The first two derivatives of ℓ[g(β)] with respect to β are as follows:

∂ℓ[g(β)]/∂β = S[g(β)] = F′S(θ) and
∂²ℓ[g(β)]/(∂β ⊗ ∂β′) = F′[∂²ℓ(θ)/(∂θ ⊗ ∂θ′)]F + [I_{ν₁} ⊗ S(θ)′] ∂²g(β)/(∂β ⊗ ∂β′).

Proof: Use the chain rule in (5.1), where z = ℓ(θ), w = θ = g(β), and x = β, to obtain

∂ℓ[g(β)]/∂β = (∂θ′/∂β)(∂ℓ(θ)/∂θ) = F′ ∂ℓ(θ)/∂θ = F′S(θ).

Use the product rule and the chain rule in (5.2), where z = F′S(θ), w = θ = g(β), and x = β, to obtain

∂²ℓ[g(β)]/(∂β ⊗ ∂β′) = ∂[S(θ)′F]/∂β = (∂S(θ)′/∂β)F + [I_{ν₁} ⊗ S(θ)′](∂F/∂β)
= (∂θ′/∂β)(∂S(θ)′/∂θ)F + [I_{ν₁} ⊗ S(θ)′](∂F/∂β)
= F′[∂²ℓ(θ)/(∂θ ⊗ ∂θ′)]F + [I_{ν₁} ⊗ S(θ)′] ∂²θ/(∂β ⊗ ∂β′).

Theorem 42 (Score) The asymptotic distribution of the score function is

S[g(β)]/√n  a.d.=  F′ Ī_{θ,n} √n(θ̂ − θ) →dist N(0, Ī_{β,∞}),

where Ī_{β,∞} = F′ Ī_{θ,∞} F.

Proof: Use the expression for S[g(β)] in Theorem 41 and apply Theorems 38 and 39.

Theorem 43 (Distribution of MLE) Denote the maximizer of ℓ[g(β)] by β̂. Then

S[g(β)]/√n  a.d.=  Ī_{β,n} √n(β̂ − β) and
√n(β̂ − β)  a.d.=  Ī_{β,n}^{-1} F′ Ī_{θ,n} √n(θ̂ − θ) →dist N(0, Ī_{β,∞}^{-1}).

Proof:

S[g(β̂)] = 0 = S[g(β)] + (∂S[g(β)]/∂β′)(β̂ − β) + O_p(1)
= ∂ℓ[g(β)]/∂β + [∂²ℓ[g(β)]/(∂β ⊗ ∂β′)](β̂ − β) + O_p(1)
= F′S(θ) + { F′[∂²ℓ(θ)/(∂θ ⊗ ∂θ′)]F + [I_{ν₁} ⊗ S(θ)′](∂F/∂β) }(β̂ − β) + O_p(1)  using Theorem 41
= F′S(θ) + F′[∂²ℓ(θ)/(∂θ ⊗ ∂θ′)]F(β̂ − β) + O_p(1)  because [I_{ν₁} ⊗ S(θ)′](∂F/∂β)(β̂ − β) = O_p(1)
= F′S(θ) − F′ Ī_{θ,n} F n(β̂ − β) + O_p(1)  because ∂²ℓ(θ)/(n ∂θ ⊗ ∂θ′) = −Ī_{θ,n} + O_p(n^{-1/2})

=⇒ F′S(θ)/√n = F′ Ī_{θ,n} F √n(β̂ − β) + O_p(n^{-1/2})
=⇒ F′S(θ)/√n  a.d.=  F′ Ī_{θ,n} F √n(β̂ − β)
=⇒ √n(β̂ − β)  a.d.=  (F′ Ī_{θ,n} F)^{-1} F′S(θ)/√n
=⇒ √n(β̂ − β) →dist N(0, Ī_{β,∞}^{-1}).

Theorem 44 (Distribution of Maximized Log Likelihood Function) Denote the maximizer of ℓ[g(β)] by β̂. Then

2{ ℓ[g(β̂)] − ℓ[g(β)] } →dist χ²_{ν₁}.

Proof: Note that ℓ[g(β)] = ℓ(θ) because θ = g(β). Thus,

ℓ[g(β̂)] = ℓ(θ) + S[g(β)]′(β̂ − β) + (1/2)(β̂ − β)′(∂S[g(β)]′/∂β)(β̂ − β) + O_p(n^{-1/2})
= ℓ(θ) + S(θ)′F(β̂ − β) − (n/2)(β̂ − β)′ Ī_{β,n}(β̂ − β) + O_p(n^{-1/2})
= ℓ(θ) + (n/2)(β̂ − β)′ Ī_{β,n}(β̂ − β) + O_p(n^{-1/2}).

Now apply Theorem 43.

Theorem 45 (Distribution of Likelihood Ratio Test Statistic) Consider testing H₀: θ = g(β) against Hₐ: θ ∈ Θ. The likelihood ratio test statistic is

U = exp{ℓ[g(β̂)]} / exp{ℓ(θ̂)}.

The asymptotic distribution of U under H₀ is

−2 ln(U) = 2{ ℓ(θ̂) − ℓ[g(β̂)] } →dist χ²_{ν−ν₁}.

Proof: It follows from Theorems 40, 43, and 44 that

2{ ℓ(θ̂) − ℓ[g(β̂)] } = n(θ̂ − θ)′{ Ī_{θ,n} − Ī_{θ,n} F Ī_{β,n}^{-1} F′ Ī_{θ,n} }(θ̂ − θ) + O_p(n^{-1/2}).

Now apply Theorem 39.


5.3.7 Large Sample Tests

Consider a random vector y having joint density f(y|θ), where θ is a vector of parameters. As before, write the log likelihood function as ℓ(θ|y) = ln[f(y|θ)].

Suppose that a test of H₀: θ ∈ Θ₀ against Hₐ: θ ∈ Θₐ is desired. It is assumed that Θ₀ ∩ Θₐ = ∅. Define Θ by Θ = Θ₀ ∪ Θₐ. The likelihood ratio (LR) criterion is defined as

U(y) = exp{ℓ₀} / exp{ℓₐ},

where

ℓ₀ = sup_{θ∈Θ₀} ℓ(θ|y) and ℓₐ = sup_{θ∈Θ} ℓ(θ|y).

The null hypothesis is rejected for small values of the LR criterion. Denote the dimension of the restricted parameter space Θ₀ by ν₁ and denote the dimension of the full parameter space Θ by ν. The dimension of a parameter space is equal to the number of functionally independent parameters which are free to vary. Suppose that θ ∈ Θ₀ ⟺ θ = g(β), where β is a ν₁-vector. Then, under the assumptions in Section 5.3.1, Theorem 45 reveals that

−2 ln(U) = 2(ℓₐ − ℓ₀) →dist χ²_{ν−ν₁} as n → ∞,

provided that H₀ is true. Accordingly, a size α test is to reject H₀ if −2 ln(U) > χ²_{1−α,ν−ν₁}.

5.3.7.1 Example

Consider a one-way classification:

y_{ij} = µ_i + ε_{ij} for i = 1, …, k and j = 1, …, n_i,

where the ε_{ij} are independently distributed as ε_{ij} ∼ N(0, σ²_i). Consider a test of H₀: σ²₁ = σ²₂ = ⋯ = σ²_k against Hₐ: σ²_i ≠ σ²_{i′} for some i ≠ i′. The parameter vectors under Hₐ and H₀ are

θ = (σ²₁, σ²₂, …, σ²_k, µ₁, µ₂, …, µ_k)′;  β = (σ²₁, µ₁, µ₂, …, µ_k)′;  and g(β) = (σ²₁, σ²₁, …, σ²₁, µ₁, µ₂, …, µ_k)′.

It is readily shown that the likelihood ratio test statistic is

U = exp{ℓ[g(β̂)]} / exp{ℓ(θ̂)} = ∏_{i=1}^k (SSE_i/n_i)^{n_i/2} / (SSE/N)^{N/2},

where N = Σ_{i=1}^k n_i; SSE_i = Σ_{j=1}^{n_i} (y_{ij} − ȳ_{i·})²; and SSE = Σ_{i=1}^k SSE_i. From Theorem 45, it follows that

−2 ln(U) = N ln(SSE/N) − Σ_{i=1}^k n_i ln(SSE_i/n_i) →dist χ²_{k−1}

when H₀ is true.

A more accurate approximation can be obtained by replacing N by ν = N − k and replacing n_i by ν_i = n_i − 1. Making these substitutions yields

−2 ln(U*) = ν ln(SSE/ν) − Σ_{i=1}^k ν_i ln(SSE_i/ν_i) →dist χ²_{k−1}

when H₀ is true. Greater accuracy can be obtained by multiplying −2 ln(U*) by a correction factor, B, so that E[−2B ln(U*)] = k − 1 under H₀. To obtain B (the Bartlett correction factor), the expectation of the log of a chi-squared random variable is required. Differentiating both sides of

∫₀^∞ w^{α/2−1} exp{−w/2} / [Γ(α/2) 2^{α/2}] dw = 1

with respect to α reveals that if W ∼ χ²_α, then E[ln(W)] = ψ(α/2) + ln(2). If the asymptotic expansion

ψ(x) = ln(x) − 1/(2x) − 1/(12x²) + O(x^{-4})

is used, then it can be shown that

B^{-1} = 1 + [1/(3(k − 1))] ( Σ_{i=1}^k 1/ν_i − 1/ν ).
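The sketch below (my own, with hypothetical data for k = 3 groups) implements the Bartlett-corrected likelihood ratio test of equal variances described above.

```python
import numpy as np
from scipy.stats import chi2

# Bartlett-corrected LR test of equal variances in a one-way classification.
def bartlett_corrected_test(groups):
    k = len(groups)
    nu_i = np.array([len(g) - 1 for g in groups])
    sse_i = np.array([np.sum((g - g.mean())**2) for g in groups])
    nu, sse = nu_i.sum(), sse_i.sum()
    stat = nu * np.log(sse / nu) - np.sum(nu_i * np.log(sse_i / nu_i))   # -2 ln(U*)
    B_inv = 1.0 + (np.sum(1.0 / nu_i) - 1.0 / nu) / (3.0 * (k - 1))
    stat_corrected = stat / B_inv                                         # -2 B ln(U*)
    return stat_corrected, chi2.sf(stat_corrected, df=k - 1)

rng = np.random.default_rng(5)
groups = [rng.normal(loc=m, scale=1.0, size=n) for m, n in [(0, 12), (1, 15), (2, 20)]]
print(bartlett_corrected_test(groups))        # (statistic, p-value); H0 is true here
```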

5.3.8 Asymptotic Distribution Under Violations

In this section it is assumed that y₁, …, y_n are independently distributed with densities (or pmfs) g_i(y_i) for i = 1, …, n. Denote the joint density as

g(y) = ∏_{i=1}^n g_i(y_i).

A parameter θ will be estimated by maximizing

L(θ|y) = ∏_{i=1}^n f(y_i|θ)

even though L(θ|y) is the wrong likelihood function. The true parameter value, θ₀, is defined as follows:

θ₀ = arg max_θ ∫ ln[f(y|θ)] g(y) dy.

Under mild regularity conditions (i.e., differentiability of f, interchange of integration and differentiation), θ₀ is the solution to the equation

∫ S(y|θ) g(y) dy = 0, where S(y|θ) = ∂ ln[f(y|θ)]/∂θ

is an analogue of the score function for n observations. It follows that

E_g[S(y|θ₀)] = ∫ S(y|θ₀) g(y) dy = 0.

Furthermore, the scaled score-like function can be written as

S(y|θ₀)/√n = √n{ S̄(y|θ₀) − E_g[S̄(y|θ₀)] }, where S̄(y|θ₀) = (1/n) Σ_{i=1}^n S(y_i|θ₀).

Accordingly, it follows from the central limit theorem that

S(y|θ₀)/√n →dist N(0, Ī_{θ₀,∞}), where

Ī_{θ₀,n} = (1/n) E_g[ ( Σ_{i=1}^n ∂ ln f(y_i|θ₀)/∂θ₀ )( Σ_{i=1}^n ∂ ln f(y_i|θ₀)/∂θ₀ )′ ] and Ī_{θ₀,∞} = lim_{n→∞} Ī_{θ₀,n}.


Note that if {y_i}_{i=1}^n are iid, then

Ī_{θ₀,n} = Ī_{θ₀,∞} = E_g[ (∂ ln f(y_i|θ₀)/∂θ₀)(∂ ln f(y_i|θ₀)/∂θ₀)′ ].

To find the asymptotic distribution of θ̂, expand

∂ ln[L(θ|Y)]/∂θ |_{θ=θ̂} = 0

around θ̂ = θ₀ to obtain the following:

0 = Σ_{i=1}^n S(y_i|θ̂) = Σ_{i=1}^n S(y_i|θ₀) + Σ_{i=1}^n (∂S(y_i|θ)/∂θ′)|_{θ=θ₀} (θ̂ − θ₀) + O_p(1)

=⇒ √n(θ̂ − θ₀) = −[ (1/n) Σ_{i=1}^n (∂S(y_i|θ)/∂θ′)|_{θ=θ₀} ]^{-1} (1/√n) Σ_{i=1}^n S(y_i|θ₀) + O_p(n^{-1/2})
              = J̄_{θ₀,n}^{-1} S(Y|θ₀)/√n + O_p(n^{-1/2}), where

J̄_{θ₀,n} = (1/n) J_{θ₀,n},
J_{θ₀,n} = −E_g( ∂S(y|θ)/∂θ′ |_{θ=θ₀} ) = −Σ_{i=1}^n E_{g_i}( ∂² ln[f(y_i|θ)]/(∂θ ⊗ ∂θ′) |_{θ=θ₀} ), and
S(Y|θ₀) = Σ_{i=1}^n S(y_i|θ₀).

It follows from the CLT and Theorem 35 that

√n(θ̂ − θ₀) →dist N(0, Ω_{θ₀}), where
Ω_{θ₀} = J̄_{θ₀,∞}^{-1} Ī_{θ₀,∞} J̄_{θ₀,∞}^{-1}, and J̄_{θ₀,∞} = lim_{n→∞} J̄_{θ₀,n}.

Note, in this application, J denotes minus one times the expectation of the second derivative of ln[L(θ|y)], and not the observed information matrix. The result is summarized in Theorem 46.

Theorem 46 Under suitable regularity conditions,

√n(θ̂ − θ₀) →dist N(0, Ω_{θ₀}), where

θ₀ = arg max_θ ∫ ln[f(y|θ)] g(y) dy,
Ω_{θ₀} = J̄_{θ₀,∞}^{-1} Ī_{θ₀,∞} J̄_{θ₀,∞}^{-1}, and J̄_{θ₀,∞} = lim_{n→∞} J̄_{θ₀,n}.

If an additional assumption, namely

E[S(y_i|θ₀)] = 0 ∀ i,

is satisfied, then the covariance matrix can be estimated by

Ω̂_{θ̂} = [Ĵ_{θ̂}]^{-1} Î_{θ̂} [Ĵ_{θ̂}]^{-1}, where

Ĵ_{θ̂} = −(1/n) Σ_{i=1}^n ∂² ln[f(y_i|θ)]/(∂θ ⊗ ∂θ′) |_{θ=θ̂} and
Î_{θ̂} = (1/n) Σ_{i=1}^n (∂ ln[f(y_i|θ)]/∂θ)(∂ ln[f(y_i|θ)]/∂θ)′ |_{θ=θ̂}.

The above estimator is called a sandwich estimator.
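The sketch below (my own illustration, with hypothetical data) computes the sandwich estimator for a deliberately misspecified model: a Poisson working likelihood with θ = ln(mean) is fit to overdispersed counts, so Ĵ and Î differ and the sandwich variance exceeds the naive inverse-information variance.

```python
import numpy as np

# Sandwich estimator under misspecification: Poisson working model, theta = log(mean),
# fit to overdispersed (negative binomial) counts.
rng = np.random.default_rng(6)
n = 2000
y = rng.negative_binomial(n=2, p=0.4, size=n)     # true data-generating process g

theta_hat = np.log(y.mean())                      # working MLE: exp(theta_hat) = y_bar
score_i = y - np.exp(theta_hat)                   # per-observation scores at theta_hat
J_hat = np.exp(theta_hat)                         # minus the average Hessian
I_hat = np.mean(score_i**2)                       # average outer product of scores

var_sandwich = I_hat / (n * J_hat**2)             # J^{-1} I J^{-1} / n
var_naive = 1.0 / (n * J_hat)
print(var_sandwich, var_naive)                    # sandwich exceeds naive under overdispersion
```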


5.3.9 Likelihood Ratio Test Under Local Alternatives

Suppose that y_i are independently distributed as f_i(y_i; θ) for i = 1, …, n. Partition θ as

θ = (ψ′ λ′)′,

where ψ is ν₁ × 1 and λ is ν₂ × 1. Interest is in testing H₀: ψ = ψ₀ against Hₐ: ψ ≠ ψ₀. It follows from Theorem 45 that

W →dist χ²_{ν₁} under H₀, where W = −2[ℓ(θ̂₀) − ℓ(θ̂)].

In this section, the nonnull distribution of W is obtained under the assumption that

ψₐ = ψ₀ + (1/√n)δ,

where δ is a fixed ν₁ × 1 vector. Define λ̂_{ψ₀} as

λ̂_{ψ₀} = arg max_λ L(θ|y, ψ = ψ₀).

Thus, λ̂_{ψ₀} is a solution to

∂ℓ(θ)/∂λ |_{ψ=ψ₀} = 0.

Theorem 47 Under suitable regularity conditions,

√n(λ̂_{ψ₀} − λ) = Ī_{λλ′,n}^{-1} [ Ī_{λψₐ′,n} √n(ψ̂ − ψₐ) + Ī_{λλ′,n} √n(λ̂ − λ) − Ī_{λψₐ′,n} √n(ψ₀ − ψₐ) ] + O_p(n^{-1/2}),

where Ī_{θₐ,n} is partitioned as

Ī_{θₐ,n} = [ Ī_{ψₐψₐ′,n}  Ī_{ψₐλ′,n} ; Ī_{λψₐ′,n}  Ī_{λλ′,n} ].

Proof: Write S(θ̂₀) and S(θₐ) as

S(θ̂₀) = ( ∂ℓ(θ)/∂ψ ; ∂ℓ(θ)/∂λ ) |_{ψ=ψ₀, λ=λ̂_{ψ₀}} = ( S(ψ₀; λ̂_{ψ₀}) ; 0 ) and
S(θₐ) = ( S(ψₐ; λ) ; S(λ; ψₐ) ),

respectively. Expand S(θ̂₀)/√n around θ̂₀ = θₐ = (ψₐ′ λ′)′. Accordingly,

S(θ̂₀)/√n = ( S(ψ₀; λ̂_{ψ₀})/√n ; 0 )
= S(θₐ)/√n + [∂²ℓ(θ)/(n ∂θ ⊗ ∂θ′)]|_{θ=θₐ} √n(θ̂₀ − θₐ) + O_p(n^{-1/2})
= Ī_{θₐ,n} √n(θ̂ − θₐ) − Ī_{θₐ,n} √n(θ̂₀ − θₐ) + O_p(n^{-1/2})

=⇒ S(ψ₀; λ̂_{ψ₀})/√n = Ī_{ψₐψₐ′,n} √n(ψ̂ − ψₐ) + Ī_{ψₐλ′,n} √n(λ̂ − λ)
   − Ī_{ψₐψₐ′,n} √n(ψ₀ − ψₐ) − Ī_{ψₐλ′,n} √n(λ̂_{ψ₀} − λ) + O_p(n^{-1/2}) and

0 = Ī_{λψₐ′,n} √n(ψ̂ − ψₐ) + Ī_{λλ′,n} √n(λ̂ − λ) − Ī_{λψₐ′,n} √n(ψ₀ − ψₐ) − Ī_{λλ′,n} √n(λ̂_{ψ₀} − λ) + O_p(n^{-1/2})

=⇒ √n(λ̂_{ψ₀} − λ) = Ī_{λλ′,n}^{-1} [ Ī_{λψₐ′,n} √n(ψ̂ − ψₐ) + Ī_{λλ′,n} √n(λ̂ − λ) − Ī_{λψₐ′,n} √n(ψ₀ − ψₐ) ] + O_p(n^{-1/2}).

Theorem 48 Under suitable regularity conditions,

W →dist χ²_{ν₁,φ}, where φ = δ′ Ī_{ψₐψₐ′·λ,∞} δ / 2.

Proof: Expand 2[ℓ(θ̂₀) − ℓ(θₐ)] around θ̂₀ = θₐ:

2[ℓ(θ̂₀) − ℓ(θₐ)] = 0 + 2{ [S(θₐ)′/√n] √n(θ̂₀ − θₐ) + (1/(2n)) √n(θ̂₀ − θₐ)′ [∂²ℓ(θₐ)/(∂θₐ ⊗ ∂θₐ′)] √n(θ̂₀ − θₐ) } + O_p(n^{-1/2})
= 2[S(θₐ)′/√n] √n(θ̂₀ − θₐ) − √n(θ̂₀ − θₐ)′ Ī_{θₐ,n} √n(θ̂₀ − θₐ) + O_p(n^{-1/2})
= 2√n(θ̂ − θₐ)′ Ī_{θₐ,n} √n(θ̂₀ − θₐ) − √n(θ̂₀ − θₐ)′ Ī_{θₐ,n} √n(θ̂₀ − θₐ) + O_p(n^{-1/2}),

because S(θₐ)/√n = Ī_{θₐ,n} √n(θ̂ − θₐ). It follows from Theorem 40 that

2[ℓ(θ̂) − ℓ(θₐ)] = n(θ̂ − θₐ)′ Ī_{θₐ,n}(θ̂ − θₐ) + O_p(n^{-1/2}).

Accordingly,

W = −2[ℓ(θ̂₀) − ℓ(θ̂)] = 2[ℓ(θ̂) − ℓ(θ̂₀)] = 2[ℓ(θ̂) − ℓ(θₐ)] − 2[ℓ(θ̂₀) − ℓ(θₐ)]
= √n(θ̂ − θₐ)′ Ī_{θₐ,n} √n(θ̂ − θₐ) − 2√n(θ̂ − θₐ)′ Ī_{θₐ,n} √n(θ̂₀ − θₐ) + √n(θ̂₀ − θₐ)′ Ī_{θₐ,n} √n(θ̂₀ − θₐ) + O_p(n^{-1/2})
= √n(θ̂ − θ̂₀)′ Ī_{θₐ,n} √n(θ̂ − θ̂₀) + O_p(n^{-1/2}).

Use Theorem 47 to write θ̂ − θ̂₀ = (θ̂ − θₐ) − (θ̂₀ − θₐ) as

√n(θ̂ − θ̂₀) = √n( (ψ̂ − ψₐ) − (ψ₀ − ψₐ) ; (λ̂ − λ) − (λ̂_{ψ₀} − λ) )
= ( √n(ψ̂ − ψₐ) + δ ; −Ī_{λλ′,n}^{-1} Ī_{λψₐ′,n} √n(ψ̂ − ψₐ) − Ī_{λλ′,n}^{-1} Ī_{λψₐ′,n} δ ) + O_p(n^{-1/2})
= ( I_{ν₁} ; −Ī_{λλ′,n}^{-1} Ī_{λψₐ′,n} ) [ √n(ψ̂ − ψₐ) + δ ] + O_p(n^{-1/2}).

Accordingly,

W = [ √n(ψ̂ − ψₐ) + δ ]′ ( I_{ν₁} ; −Ī_{λλ′,n}^{-1} Ī_{λψₐ′,n} )′ Ī_{θₐ,n} ( I_{ν₁} ; −Ī_{λλ′,n}^{-1} Ī_{λψₐ′,n} ) [ √n(ψ̂ − ψₐ) + δ ] + O_p(n^{-1/2})
= [ √n(ψ̂ − ψₐ) + δ ]′ Ī_{ψₐψₐ′·λ,n} [ √n(ψ̂ − ψₐ) + δ ], where Ī_{ψₐψₐ′·λ,n} = Ī_{ψₐψₐ′,n} − Ī_{ψₐλ′,n} Ī_{λλ′,n}^{-1} Ī_{λψₐ′,n}

=⇒ W →dist χ²_{ν₁,φ}, because [ √n(ψ̂ − ψₐ) + δ ] →dist N(δ, Ī_{ψₐψₐ′·λ,∞}^{-1}).


5.3.10 Wald’s Test

Suppose that y_i are independently distributed as f_i(y_i; θ) for i = 1, …, n. Partition θ as

θ = (ψ′ λ′)′,

where ψ is ν₁ × 1 and λ is ν₂ × 1. Interest is in testing H₀: ψ = ψ₀ against Hₐ: ψ ≠ ψ₀. It is known from Theorem 39 that

√n(θ̂ − θ₀) →dist N(0, Ī_{θ₀,∞}^{-1}),

where θ₀ is the true value of θ. Accordingly, if H₀ is true, then

√n(ψ̂ − ψ₀) →dist N(0, Ī_{ψ₀ψ₀′·λ,∞}^{-1}), where Ī_{ψ₀ψ₀′·λ,∞} = Ī_{ψ₀ψ₀′,∞} − Ī_{ψ₀λ′,∞} Ī_{λλ′,∞}^{-1} Ī_{λψ₀′,∞}

is the average partial information of ψ controlling for λ. Wald (1943) proposed the following statistic for testing H₀:

W_w = √n(ψ̂ − ψ₀)′ Ī_{ψψ′·λ,n} √n(ψ̂ − ψ₀),

where Ī_{ψψ′·λ,n} is the average partial information of ψ controlling for λ, evaluated at θ̂. Consistency of this estimator for Ī_{ψ₀ψ₀′·λ,n}, together with Slutsky's Theorem, implies that W_w →dist χ²_{ν₁} under H₀, where ν₁ = dim(ψ).
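A small numerical sketch (my own, with a hypothetical normal example) of Wald's test: take θ = (µ, σ²)′ and test H₀: µ = µ₀ with σ² a nuisance parameter. For the normal model the average information is diag(1/σ², 1/(2σ⁴)), so the partial information for µ is 1/σ² and W_w = n(µ̂ − µ₀)²/σ̂².

```python
import numpy as np
from scipy.stats import chi2

# Wald test of H0: mu = mu0 in the N(mu, sigma^2) model.
rng = np.random.default_rng(7)
y = rng.normal(loc=0.2, scale=1.5, size=100)
mu0 = 0.0

mu_hat = y.mean()
sigma2_hat = np.mean((y - mu_hat)**2)            # MLE of sigma^2
partial_info = 1.0 / sigma2_hat                  # partial information, evaluated at theta_hat
W_w = len(y) * (mu_hat - mu0)**2 * partial_info
print(W_w, chi2.sf(W_w, df=1))                   # statistic and asymptotic p-value
```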

5.3.11 Rao’s Score Test

Suppose that y_i are independently distributed as f_i(y_i; θ) for i = 1, …, n. Partition θ as

θ = (ψ′ λ′)′,

where ψ is ν₁ × 1 and λ is ν₂ × 1. Interest is in testing H₀: ψ = ψ₀ against Hₐ: ψ ≠ ψ₀. Rao (1947) proposed the following statistic for testing H₀:

W_s = [S(ψ₀; λ̂_{ψ₀})]′ I_{ψ₀ψ₀′·λ̂_{ψ₀},n}^{-1} S(ψ₀; λ̂_{ψ₀}), where I_{ψ₀ψ₀′·λ̂_{ψ₀},n} = I_{ψ₀ψ₀′,n} − I_{ψ₀λ′,n} I_{λλ′,n}^{-1} I_{λψ₀′,n}

is the partial information of ψ controlling for λ evaluated at θ̂₀ = (ψ₀′ λ̂′_{ψ₀})′.

Theorem 49 (Rao Score Test) If H₀: ψ = ψ₀ is true and the usual regularity conditions are satisfied, then

W_s →dist χ²_{ν₁}, where ν₁ = dim(ψ).

Proof: Note that

S(θ̂₀) = ( ∂ℓ(θ)/∂ψ ; ∂ℓ(θ)/∂λ ) |_{ψ=ψ₀, λ=λ̂_{ψ₀}} = ( S(ψ₀; λ̂_{ψ₀}) ; 0 ).

Expanding n^{-1/2} S(θ̂₀) around θ̂₀ = θ₀ = (ψ₀′ λ′)′ yields

( S(ψ₀; λ̂_{ψ₀})/√n ; 0 ) = S(θ₀)/√n + [∂²ℓ(θ₀)/(n ∂θ₀ ⊗ ∂θ₀′)] ( √n(ψ₀ − ψ₀) ; √n(λ̂_{ψ₀} − λ) ) + O_p(n^{-1/2})

=⇒ ( S(ψ₀; λ̂_{ψ₀})/√n ; 0 ) = Ī_{θ₀,n} √n(θ̂ − θ₀) − Ī_{θ₀,n} ( 0 ; √n(λ̂_{ψ₀} − λ) ) + O_p(n^{-1/2})

=⇒ ( S(ψ₀; λ̂_{ψ₀})/√n ; 0 ) = [ Ī_{ψ₀ψ₀′,n}  Ī_{ψ₀λ′,n} ; Ī_{λψ₀′,n}  Ī_{λλ′,n} ] ( √n(ψ̂ − ψ₀) ; √n(λ̂ − λ) )
   − [ Ī_{ψ₀ψ₀′,n}  Ī_{ψ₀λ′,n} ; Ī_{λψ₀′,n}  Ī_{λλ′,n} ] ( 0 ; √n(λ̂_{ψ₀} − λ) ) + O_p(n^{-1/2})

=⇒ ( S(ψ₀; λ̂_{ψ₀})/√n ; 0 ) = ( Ī_{ψ₀ψ₀′,n} √n(ψ̂ − ψ₀) + Ī_{ψ₀λ′,n} √n(λ̂ − λ) − Ī_{ψ₀λ′,n} √n(λ̂_{ψ₀} − λ) ;
   Ī_{λψ₀′,n} √n(ψ̂ − ψ₀) + Ī_{λλ′,n} √n(λ̂ − λ) − Ī_{λλ′,n} √n(λ̂_{ψ₀} − λ) ) + O_p(n^{-1/2})

=⇒ √n(λ̂_{ψ₀} − λ) = Ī_{λλ′,n}^{-1} [ Ī_{λψ₀′,n} √n(ψ̂ − ψ₀) + Ī_{λλ′,n} √n(λ̂ − λ) ] + O_p(n^{-1/2}), and

S(ψ₀; λ̂_{ψ₀})/√n = Ī_{ψ₀ψ₀′,n} √n(ψ̂ − ψ₀) − Ī_{ψ₀λ′,n} Ī_{λλ′,n}^{-1} Ī_{λψ₀′,n} √n(ψ̂ − ψ₀) + O_p(n^{-1/2})
                = Ī_{ψ₀ψ₀′·λ,n} √n(ψ̂ − ψ₀) + O_p(n^{-1/2}).

Note that

W_s = [S(ψ₀; λ̂_{ψ₀})]′ I_{ψ₀ψ₀′·λ̂_{ψ₀},n}^{-1} S(ψ₀; λ̂_{ψ₀}) = [ S(ψ₀; λ̂_{ψ₀})/√n ]′ Ī_{ψ₀ψ₀′·λ̂_{ψ₀},n}^{-1} [ S(ψ₀; λ̂_{ψ₀})/√n ]

and that

Ī_{ψ₀ψ₀′·λ̂_{ψ₀},n} = Ī_{ψ₀ψ₀′·λ,n} + O_p(n^{-1/2}).

Accordingly,

W_s = [ S(ψ₀; λ̂_{ψ₀})/√n ]′ Ī_{ψ₀ψ₀′·λ,n}^{-1} [ S(ψ₀; λ̂_{ψ₀})/√n ] + O_p(n^{-1/2})
    = √n(ψ̂ − ψ₀)′ Ī_{ψ₀ψ₀′·λ,n} Ī_{ψ₀ψ₀′·λ,n}^{-1} Ī_{ψ₀ψ₀′·λ,n} √n(ψ̂ − ψ₀) + O_p(n^{-1/2})
    = √n(ψ̂ − ψ₀)′ Ī_{ψ₀ψ₀′·λ,n} √n(ψ̂ − ψ₀) + O_p(n^{-1/2}), and

W_s →dist χ²_{ν₁} because √n(ψ̂ − ψ₀) →dist N(0, Ī_{ψ₀ψ₀′·λ,∞}^{-1}).

5.4 AKAIKE’S AND BAYESIAN INFORMATION CRITERIA

Pronounced ah ka-ee kay.

5.4.1 References

Most of these references were compiled by Ken Burnham.

Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (Eds.), Second International Symposium on Information Theory (pp. 267–281). Budapest: Akademiai Kiado.

Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Applications of Statistics (pp. 27–41). Amsterdam: North-Holland.

Akaike, H. (1981). Likelihood of a model and information criteria. Journal of Econometrics, 16, 3–14.

Akaike, H. (1983). Statistical inference and measurement of entropy. In G. E. P. Box, T. Leonard, and C-F. Wu (Eds.), Scientific Inference, Data Analysis, and Robustness (pp. 165–189). London: Academic Press.

Akaike, H. (1985). Prediction and entropy. In A. C. Atkinson and S. E. Fienberg (Eds.), A Celebration of Statistics (pp. 1–24). New York: Springer.

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332.

Akaike, H. (1992). Information theory and an extension of the maximum likelihood principle. In S. Kotz and N. L. Johnson (Eds.), Breakthroughs in Statistics, Vol. 1 (pp. 610–624). London: Springer-Verlag.

Akaike, H. (1994). Implications of the informational point of view on the development of statistical science. In H. Bozdogan (Ed.), Engineering and Scientific Applications, Vol. 3, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach (pp. 27–38). Dordrecht, Netherlands: Kluwer Academic Publishers.

Anderson, D. R., Burnham, K. P., and White, G. C. (1994). AIC model selection in overdispersed capture-recapture data. Ecology, 75, 1780–1793.

Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.

Burnham, K. P., & Anderson, D. R. (1998). Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag.

Burnham, K. P., Anderson, D. R., and White, G. C. (1994). Evaluation of the Kullback-Leibler discrepancy for model selection in open population capture-recapture models. Biometrical Journal, 36, 299–315.

Cheeseman and R. W. Oldford (Eds.). Selecting Models from Data. New York: Springer-Verlag.

deLeeuw, J. (1992). Introduction to Akaike (1973) information theory and an extension of the maximum likelihood principle. In S. Kotz and N. L. Johnson (Eds.), Breakthroughs in Statistics, Vol. 1 (pp. 599–609). London: Springer-Verlag.

Findley, D. F. (1985). On the unbiasedness property of AIC for exact or approximating linear stochastic time series models. Journal of Time Series Analysis, 6, 229–252.

Hurvich, C. M., and Tsai, C-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.

Hurvich, C. M., and Tsai, C-L. (1991). Bias of the corrected AIC criterion for underfitted regression and time series models. Biometrika, 78, 499–509.

Hurvich, C. M., and Tsai, C-L. (1995). Relative rates of convergence for efficient model selection criteria in linear regression. Biometrika, 82, 418–425.

Hurvich, C. M., Shumway, R., and Tsai, C-L. (1990). Improved estimators of Kullback-Leibler information for autoregressive model selection in small samples. Biometrika, 77, 709–719.

Kapur, J. N., and Kesavan, H. K. (1992). Entropy Optimization Principles with Applications. London: Academic Press.

Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.

Linhart, H., and Zucchini, W. (1986). Model Selection. New York: John Wiley and Sons.

Reschenhofer, E. (1996). Prediction with vague prior knowledge. Communications in Statistics — Theory and Methods, 25, 601–608.

Reschenhofer, E. (1996). Improved estimation of the expected Kullback-Leibler discrepancy in case of misspecification. Technical report.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

Sakamoto, Y. (1982). Efficient use of Akaike's information criterion for model selection in a contingency table analysis. Metron, 40, 257–275.

Sakamoto, Y. (1991). Categorical Data Analysis by AIC. Tokyo: KTK Scientific Publishers.

Sakamoto, Y., Ishiguro, M., and Kitagawa, G. (1986). Akaike Information Criterion Statistics. Tokyo: KTK Scientific Publishers.

Schwartz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika, 63, 117–126.

Shibata, R. (1981). An optimal selection of regression variables. Biometrika, 68, 45–54. Correction (1982), 69, 492.

Shibata, R. (1983). A theoretical view of the use of AIC. In O. D. Anderson (Ed.), Time Series Analysis: Theory and Practice (pp. 237–244). Amsterdam: North-Holland.

Shibata, R. (1989). Statistical aspects of model selection. In J. C. Willems (Ed.), From Data to Model (pp. 215–240). London: Springer-Verlag.

Speed, T. P., and Yu, B. (1993). Model selection and prediction: Normal regression. Annals of the Institute of Statistical Mathematics, 1, 35–54.

Stone, M. (1977). An asymptotic equivalence of choice of model by cross validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39, 44–47.

Stone, M. (1982). Local asymptotic admissibility of a generalization of Akaike's model selection rule. Annals of the Institute of Statistical Mathematics (Part A), 34, 123–133.

Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics — Theory and Methods, 7, 13–26.

Takeuchi, K. (1976). Distribution of information statistics and criteria for adequacy of models. Mathematical Sciences, 153, 12–18 (in Japanese).

Zhang, P. (1993). On the convergence rate of model selection criteria. Communications in Statistics — Theory and Methods, 22, 2765–2775.

5.4.2 Kullback-Leibler Information

Let y be a p-vector with distribution y ∼ g(y). Suppose that the pdf of y is to be approximated by ĝ(y). In many cases ĝ(y) is a member of a parametric family, F = {f(y|θ), θ ∈ Θ}. Two questions arise. First, how can the goodness of fit be measured? Second, how can θ be chosen to obtain the best fit? One approach to these questions is to use the Kullback-Leibler (K-L) information measure. The K-L information number is a measure of the distance between two distributions. It is defined by

KL(g, ĝ) = E_g[ ln(g/ĝ) ] = ∫ ln[ g(y)/ĝ(y) ] g(y) dy.

Properties of K-L information:

1. KL(g, ĝ) ≥ 0.

2. KL(g, ĝ) always exists, but it need not be finite.

3. KL(g, ĝ) = 0 if and only if g = ĝ except, possibly, on sets of measure zero.

4. If y₁, …, y_n are independent, then the K-L information measure for the entire sample is

KL[g(y), ĝ(y)] = Σ_{i=1}^n KL[g_i(y_i), ĝ(y_i)], where g(y) = ∏_{i=1}^n g_i(y_i) and ĝ(y) = ∏_{i=1}^n ĝ(y_i).

That is, the K-L information measure is additive for independent random variables.


Property (1) can be proven as follows:

KL(g, ĝ) = ∫ z ln(z) ĝ(y) dy = E_ĝ[ Z ln(Z) ], where Z = g(y)/ĝ(y).

Note that z ln(z) is a convex function. Now apply Jensen's inequality.
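A quick numerical sketch (my own, with hypothetical densities): the K-L information between g = N(0, 1) and an approximating density ĝ = N(0.5, 1.5²), computed by quadrature and compared with the closed form for two normals. Both values are nonnegative, consistent with property (1).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# K-L information KL(g, g_approx) for two normal densities.
g = norm(loc=0.0, scale=1.0)
g_approx = norm(loc=0.5, scale=1.5)

integrand = lambda y: g.pdf(y) * (g.logpdf(y) - g_approx.logpdf(y))
kl_numeric, _ = quad(integrand, -np.inf, np.inf)

mu0, s0, mu1, s1 = 0.0, 1.0, 0.5, 1.5
kl_closed = np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5
print(kl_numeric, kl_closed)      # both nonnegative and equal
```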

5.4.3 Asymptotic Distribution Results

Let y_i for i = 1, …, n be independently distributed random vectors (or scalars). Assume that the density of y_i exists and denote the distribution and density functions by G_i(y_i) and g_i(y_i). Similarly, denote the distribution and density functions for the entire sample by G(y) and g(y) = ∏_{i=1}^n g_i(y_i). In practice, parametric forms of G and g may not be known. Nonetheless, the investigator might attempt to approximate g by a member of the family F = {f(y|θ), θ ∈ Θ}.

The difference between f(y|θ) and g(y) can be indexed by the Kullback-Leibler information

KL(g, f) = ∫ ln[ g(y)/f(y|θ) ] g(y) dy.

The optimum member of F is the one that minimizes KL(g, f) with respect to θ. That is, f(y|θ_g) is the optimum member of the family F, where θ_g is the solution to

∫ (∂ ln[f(y|θ)]/∂θ) g(y) dy = 0.

For convenience, denote the log likelihood function by ℓ(θ|y) and denote the first two derivatives by

ℓ^{(1)}(θ|y) = ∂ℓ(θ|y)/∂θ and ℓ^{(2)}(θ|y) = ∂²ℓ(θ|y)/(∂θ ⊗ ∂θ′).

Denote the maximizer of ℓ(θ|y) with respect to θ by θ̂ = θ̂_y. It is assumed that the maximizer is obtained as the solution to ℓ^{(1)}(θ|y) = 0. The subscript y serves as a reminder that θ̂_y is a function of y.

Theorem 50 The score function, ℓ^{(1)}(θ_g|y), and the MLE, θ̂_y, have the following distributions:

ℓ^{(1)}(θ_g|y)/√n →dist N(0, I_{θ_g}),
√n(θ̂_y − θ_g)  a.d.=  J_{θ_g}^{-1} ℓ^{(1)}(θ_g|y)/√n, and
√n(θ̂_y − θ_g) →dist N(0, J_{θ_g}^{-1} I_{θ_g} J_{θ_g}^{-1}),

where

I_{θ_g} = (1/n) E_g[ ℓ^{(1)}(θ_g|y) ℓ^{(1)}(θ_g|y)′ ] = (1/n) ∫ ℓ^{(1)}(θ_g|y) ℓ^{(1)}(θ_g|y)′ g(y) dy and
J_{θ_g} = −(1/n) E_g[ ℓ^{(2)}(θ_g|y) ] = −(1/n) ∫ ℓ^{(2)}(θ_g|y) g(y) dy.

Heuristic proof: From the definition of θ_g, it follows that E_g[ℓ^{(1)}(θ_g|y)] = 0. The asymptotic distribution of ℓ^{(1)}(θ_g|y)/√n follows from the central limit theorem. To obtain the relationship between θ̂_y and ℓ^{(1)}(θ_g|y), expand ℓ^{(1)}(θ̂_y|y) around θ̂_y = θ_g:

ℓ^{(1)}(θ̂_y|y) = 0 = ℓ^{(1)}(θ_g|y) + ℓ^{(2)}(θ_g|y)(θ̂_y − θ_g) + O_p(1)
=⇒ ℓ^{(1)}(θ_g|y)/√n = −[ℓ^{(2)}(θ_g|y)/n] √n(θ̂_y − θ_g) + O_p(n^{-1/2})
=⇒ ℓ^{(1)}(θ_g|y)/√n = [J_{θ_g} + O_p(n^{-1/2})] √n(θ̂_y − θ_g) + O_p(n^{-1/2})
=⇒ ℓ^{(1)}(θ_g|y)/√n = J_{θ_g} √n(θ̂_y − θ_g) + O_p(n^{-1/2})
=⇒ J_{θ_g}^{-1} ℓ^{(1)}(θ_g|y)/√n = √n(θ̂_y − θ_g) + O_p(n^{-1/2}).

Note that if g(y) = f(y|θ_g), then J_{θ_g} = I_{θ_g} and the usual results are obtained.

5.4.4 AIC and TIC

Let z be an n-vector of future random variables having distribution g(z). To make predictions about z, the density f(z|θ̂_y) often is used, where θ̂_y is the MLE of θ based on y. Akaike (1973) was interested in judging the accuracy of f(z|θ̂_y). He suggested that the goodness of fit of the hypothesized model be measured by the expected K-L distance from the true model:

γ(g, f) = E KL[g, f(z|θ̂_y)] = ∫ { ∫ ln[ g(z)/f(z|θ̂_y) ] g(z) dz } g(y) dy.

Using properties of logs, the above measure can be written as the difference between two quantities. The first quantity does not depend on f and the second quantity is called predictive accuracy:

PA(g, f) = ∫ { ∫ ln[ f(z|θ̂_y) ] g(z) dz } g(y) dy.  (5.3)

If a member of the family F is close to the true distribution, g, then expected K-L information, γ(g, f), will be small. Equivalently, if F contains a member that is close to g, then predictive accuracy will be large. In a model selection problem, several families of distributions are under consideration. Model selection in the Akaike framework proceeds by choosing the family having largest predictive accuracy. The remaining problem is to find a consistent estimator of predictive accuracy.

Theorem 51 (Predictive Accuracy) The predictive accuracy of family F can be written as

PA = E_Y[ℓ(θ_g|y)] − (1/2) tr( I_{θ_g} J_{θ_g}^{-1} ) + O(n^{-1}).

Proof: Expand ln[f(z|θ̂_y)] = ℓ(θ̂_y|z) around θ̂_y = θ_g:

ℓ(θ̂_y|z) = ℓ(θ_g|z) + ℓ^{(1)}(θ_g|z)′(θ̂_y − θ_g) + (1/2)(θ̂_y − θ_g)′ℓ^{(2)}(θ_g|z)(θ̂_y − θ_g) + O_p(n^{-1/2}).

Now take expectations with respect to z and y, keeping in mind that z and y are independent and identically distributed:

PA = E_Y E_Z[ℓ(θ̂_y|z)] = E_Z[ℓ(θ_g|z)] − (n/2) E_Y[ (θ̂_y − θ_g)′ J_{θ_g} (θ̂_y − θ_g) ] + O(n^{-1}),

because E_Z[ℓ^{(1)}(θ_g|z)] = 0 and E_Z[ℓ^{(2)}(θ_g|z)/n] = −J_{θ_g},

= E_Y[ℓ(θ_g|y)] − (1/2) tr[ J_{θ_g}^{-1} I_{θ_g} J_{θ_g}^{-1} J_{θ_g} ] + O(n^{-1}),

because Var[√n(θ̂_y − θ_g)] = J_{θ_g}^{-1} I_{θ_g} J_{θ_g}^{-1} + O(n^{-1}) (see Theorem 50).

A natural estimator of predictive accuracy is the observed log likelihood function evaluated at the MLE. Asthe next theorem shows, this estimator is biased.

Theorem 52 The expectation of ℓ(θ̂_y|y) is

E_Y[ℓ(θ̂_y|y)] = E_Y[ℓ(θg|y)] + (1/2) tr(I_{θg} J_{θg}^{-1}) + O(n^{-1})

= PA + tr(I_{θg} J_{θg}^{-1}) + O(n^{-1}).

Proof: Expand ln[f(y|θ̂_y)] = ℓ(θ̂_y|y) around θ̂_y = θg:

ℓ(θ̂_y|y) = ℓ(θg|y) + ℓ^(1)(θg|y)(θ̂_y − θg) + (1/2)(θ̂_y − θg)′ ℓ^(2)(θg|y)(θ̂_y − θg) + Op(n^{-1/2})

= ℓ(θg|y) + n(θ̂_y − θg)′ J_{θg} (θ̂_y − θg) + (n/2)(θ̂_y − θg)′[−J_{θg} + Op(n^{-1/2})](θ̂_y − θg) + Op(n^{-1/2})   using Theorem 50

= ℓ(θg|y) + (n/2)(θ̂_y − θg)′ J_{θg} (θ̂_y − θg) + Op(n^{-1/2}).

Now take expectations with respect to y:

E_Y[ℓ(θ̂_y|y)] = E_Y[ℓ(θg|y)] + (n/2) E_Y[(θ̂_y − θg)′ J_{θg} (θ̂_y − θg)] + O(n^{-1})

= E_Y[ℓ(θg|y)] + (1/2) tr[J_{θg}^{-1} I_{θg} J_{θg}^{-1} J_{θg}] + O(n^{-1})

= PA + tr(I_{θg} J_{θg}^{-1}) + O(n^{-1}).

There are several estimators of predictive accuracy. The most widely used is AIC. If the family F contains the true density, then AIC is asymptotically unbiased for predictive accuracy. If the family F does not contain the true density, then AIC can be asymptotically biased. In this case, an alternative estimator called TIC can be used. Details about TIC can be found in Takeuchi (1976) and Stone (1977).

Theorem 53 (AIC & TIC) If g(y) = f(y|θ) for some θ, then

AIC = `(θy|y) − ν

is asymptotically unbiased for predictive accuracy. In general, an asymptotically unbiased estimator of predictiveaccuracy is

TIC = `(θy|y) − tr(J−1

θyIθy

),

where

Iθy

=1

n

n∑

i=1

(∂ ln [f(yi|θ)]

∂θ

)(∂ ln [f(yi|θ)]

∂θ

)′∣∣∣∣∣θ=θy

and

5.4. AKAIKE’S AND BAYESIAN INFORMATION CRITERIA 91

Jθy

=1

n

n∑

i=1

∂2 ln [f(yi|θ)]∂θ ⊗ ∂θ′

∣∣∣∣∣θ=θy

.

Proof: If g(y) = f(y|θ) for some θ, then Iθg = Jθg and it follows that AIC is asymptotically unbiased. From the

law of large numbers it follows that Iθy

is consistent for Iθg and Jθy

is consistent for Jθg . Accordingly, TIC is

asymptotically unbiased for predictive accuracy.

Model selection proceeds by choosing the family that maximizes AIC or TIC. Caution: some software packages multiply AIC by 2 or by −2. In the latter case, the family that minimizes −2·AIC is selected.
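The following Python sketch (not part of the notes) computes AIC and TIC for a normal (µ, σ²) family fitted by maximum likelihood, using analytic per-observation scores and Hessians; the gamma data-generating distribution is an arbitrary choice used to make the family misspecified.

# Hedged sketch: AIC and TIC for a N(mu, sigma^2) family fit by ML.
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(2.0, 1.5, size=200)     # data need not come from the family
n = y.size
mu, s2 = y.mean(), y.var()            # MLEs (variance with divisor n)

loglik = -0.5 * n * np.log(2 * np.pi * s2) - ((y - mu) ** 2).sum() / (2 * s2)

# Per-observation scores s_i and Hessians H_i for theta = (mu, sigma^2),
# evaluated at the MLE.
r = y - mu
scores = np.column_stack([r / s2, -0.5 / s2 + r**2 / (2 * s2**2)])
I_hat = scores.T @ scores / n                      # (1/n) sum s_i s_i'
H11 = -np.ones(n) / s2
H12 = -r / s2**2
H22 = 0.5 / s2**2 - r**2 / s2**3
J_hat = -np.mean(np.stack([np.stack([H11, H12]),
                           np.stack([H12, H22])]), axis=2)   # -(1/n) sum H_i

nu = 2                                             # dim(theta)
AIC = loglik - nu
TIC = loglik - np.trace(np.linalg.solve(J_hat, I_hat))
print(AIC, TIC)

Because the normal family does not contain the gamma density, tr(Ĵ⁻¹Î) generally differs from ν = 2, which is exactly the situation in which TIC is preferred to AIC.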

5.4.5 BIC

Schwarz (1978) constructed a Bayesian Information Criterion (BIC) for model selection. A heuristic derivation of BIC follows.

Let F_i = f_i(y|θ_i) be the ith family of densities under consideration for i = 1, . . . , k. Prior to observing data, the investigator's belief in the ith family is indexed by τ_i. That is, Pr(F_i) = τ_i. The investigator's prior beliefs about the parameter vector θ_i are summarized by the prior density m_i(θ_i). The goal of the Bayesian analysis is to observe the data, compute the posterior probability of the ith family, and choose the family that has the highest posterior probability.

The posterior probability of the ith family is

Pr(F_i|y) = Pr(F_i, y)/Pr(y) = Pr(y|F_i) Pr(F_i) / Σ_{j=1}^k Pr(y|F_j) Pr(F_j) = f_i(y)τ_i / Σ_{j=1}^k f_j(y)τ_j,

where

f_i(y) = ∫ f_i(y|θ_i) m_i(θ_i) dθ_i.

Bayesian model selection proceeds by selecting family i if τ_i f_i(y) is a maximum.

Theorem 54 (BIC) Assume the usual regularity conditions and denote the MLE of θ_i by θ̂_i. Then

ln[τ_i f_i(y)] = BIC_i + Op(1),   where

BIC_i = ln[f_i(y|θ̂_i)] − (ν_i/2) ln(n),

and ν_i = dim(θ_i).

Heuristic Proof: Expand ln[f_i(y|θ_i)] in a Taylor series around θ_i = θ̂_i:

ln[f_i(y|θ_i)] = ln[f_i(y|θ̂_i)] + ∂ln[f_i(y|θ_i)]/∂θ_i |_{θ_i=θ̂_i} (θ_i − θ̂_i)
  + (1/2)(θ_i − θ̂_i)′ ∂²ln[f_i(y|θ_i)]/(∂θ_i ⊗ ∂θ_i′) |_{θ_i=θ̂_i} (θ_i − θ̂_i) + Op(n^{-1/2})

= ln[f_i(y|θ̂_i)] − (n/2)(θ_i − θ̂_i)′ Ĵ_θ (θ_i − θ̂_i) + Op(n^{-1/2})

because ∂ln[f_i(y|θ_i)]/∂θ_i |_{θ_i=θ̂_i} = 0,

and where Ĵ_θ = −(1/n) ∂²ln[f_i(y|θ_i)]/(∂θ_i ⊗ ∂θ_i′) |_{θ_i=θ̂_i}.

Substitute the above expansion into the equation for f_i(y) to obtain

f_i(y) = f_i(y|θ̂_i) ∫ exp{−(n/2)(θ_i − θ̂_i)′ Ĵ_θ (θ_i − θ̂_i)} m_i(θ_i) [1 + Op(n^{-1/2})] dθ_i.

Transform from θ_i to z = Ĵ_θ^{1/2} √n(θ_i − θ̂_i). The inverse transformation is θ_i = Ĵ_θ^{-1/2} z/√n + θ̂_i and the Jacobian is n^{-ν_i/2} |Ĵ_θ|^{-1/2}. Therefore,

f_i(y) = [f_i(y|θ̂_i) / (n^{ν_i/2} |Ĵ_θ|^{1/2})] ∫ exp{−(1/2) z′z} m_i(Ĵ_θ^{-1/2} z/√n + θ̂_i) [1 + Op(n^{-1/2})] dz.

Expand m_i(Ĵ_θ^{-1/2} z/√n + θ̂_i) around z = 0 to obtain

m_i(Ĵ_θ^{-1/2} z/√n + θ̂_i) = m_i(θ̂_i) + Op(n^{-1/2}).

Putting the pieces together yields

f_i(y) = [f_i(y|θ̂_i) m_i(θ̂_i) / (n^{ν_i/2} |Ĵ_θ|^{1/2})] ∫ exp{−(1/2) z′z} [1 + Op(n^{-1/2})] dz

= [f_i(y|θ̂_i) m_i(θ̂_i) (2π)^{ν_i/2} / (n^{ν_i/2} |Ĵ_θ|^{1/2})] [1 + Op(n^{-1/2})]

⟹ ln[τ_i f_i(y)] = ln[f_i(y|θ̂_i)] − (ν_i/2) ln(n) + Op(1)

because ln(τ_i) + (ν_i/2) ln(2π) + ln[m_i(θ̂_i)] − (1/2) ln|Ĵ_θ| + Op(n^{-1/2}) = Op(1).

Model selection proceeds by selecting the family with the largest BIC. Note that the forms of AIC and BIC are very similar. The motivations underlying the procedures, however, are quite different.
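The Schwarz criterion is easy to compute once the maximized log likelihood is available. The Python sketch below (an illustration, not part of the notes) uses BIC_i = ln f_i(y|θ̂_i) − (ν_i/2) ln(n) to choose the degree of a normal polynomial regression; the simulated quadratic model and the parameter values are arbitrary.

# Hedged sketch: BIC for normal polynomial regression models.
import numpy as np

rng = np.random.default_rng(2)
n = 150
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.7, size=n)

def bic(degree):
    X = np.vander(x, degree + 1, increasing=True)   # 1, x, ..., x^degree
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / n                           # ML estimate of sigma^2
    loglik = -0.5 * n * (np.log(2 * np.pi * s2) + 1.0)
    nu = degree + 2                                  # regression coefficients plus sigma^2
    return loglik - 0.5 * nu * np.log(n)

for d in range(5):
    print(d, round(bic(d), 2))   # the largest BIC should occur at degree 2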

Chapter 6

EDGEWORTH EXPANSIONS

6.1 CHARACTERISTIC FUNCTIONS

Let Y be a random p-vector with cdf F_Y(y). The characteristic function of Y is

φ_Y(t) = E(e^{it′Y}) = ∫ e^{it′y} dF_Y(y),

where i² = −1 and t is a fixed p-vector.

Theorem 55 The characteristic function exists for all t.

Proof: The Euler form of e^{it′Y} is obtained by expanding e^{it′Y} around it′Y = 0:

e^{it′Y} = Σ_{j=0}^∞ (it′Y)^j / j!

= [1 − (t′Y)²/2! + (t′Y)⁴/4! − (t′Y)⁶/6! + · · ·] + i[(t′Y)¹/1! − (t′Y)³/3! + (t′Y)⁵/5! − · · ·]

= cos(t′Y) + i sin(t′Y).

Recall that the norm of a complex number is

|a + bi| = √[(a + bi)(a − bi)] = √(a² + b²).

Accordingly,

φ_Y(t) = ∫ e^{it′y} dF_Y(y) = ∫ [cos(t′y) + i sin(t′y)] dF_Y(y)   and

|φ_Y(t)| = |∫ e^{it′y} dF_Y(y)| ≤ ∫ |cos(t′y) + i sin(t′y)| dF_Y(y) = ∫ dF_Y(y) = 1.

It can be shown that φ_Y(t) is uniformly continuous in t. Furthermore, if the MGF of Y exists, then the characteristic function of Y is φ_Y(t) = MGF_Y(it). For example, if Y ∼ N(µ, Σ), then φ_Y(t) = exp{it′µ − (1/2)t′Σt}.

The Si function is defined as

Si(y) = (1/π) ∫_{−∞}^{∞} e^{ity}/(it) dt.

Writing e^{ity} in Euler form reveals that

Si(y) = (1/π) ∫_{−∞}^{∞} [sin(ty)/t − i cos(ty)/t] dt = (1/π) ∫_{−∞}^{∞} sin(ty)/t dt

because cos(ty)/t is an odd function. Billingsley (Probability and Measure, 1986, p. 239) showed that

(1/π) ∫_{−∞}^{∞} sin(ty)/t dt = sgn(y) = 1 if y > 0;  0 if y = 0;  and −1 if y < 0.


Theorem 56 (Inversion Theorem) Suppose that Y is a scalar random variable with cdf F_Y(y) and characteristic function φ_Y(t). Then

(1/2π) ∫_{−∞}^{∞} [(1 − e^{−ity})/(it)] φ_Y(t) dt = F_Y(y) − F_Y(0) + (1/2)[P(Y = 0) − P(Y = y)].

Proof: Define I_c as

I_c ≝ (1/2π) ∫_{−c}^{c} [(1 − e^{−ity})/(it)] φ_Y(t) dt.

Then,

I_c = (1/2π) ∫_{−c}^{c} [(1 − e^{−ity})/(it)] ∫_{−∞}^{∞} e^{itu} dF_Y(u) dt = (1/2π) ∫_{−c}^{c} ∫_{−∞}^{∞} [e^{itu} − e^{it(u−y)}]/(it) dF_Y(u) dt

= (1/2π) ∫_{−c}^{c} ∫_{−∞}^{∞} {cos(tu) + i sin(tu) − cos[t(u − y)] − i sin[t(u − y)]}/(it) dF_Y(u) dt

= (1/2π) ∫_{−∞}^{∞} ∫_{−c}^{c} { −i (cos(tu) − cos[t(u − y)])/t + (sin(tu) − sin[t(u − y)])/t } dt dF_Y(u),

where the interchange is justified because the integral is uniformly convergent,

= (1/2π) ∫_{−∞}^{∞} ∫_{−c}^{c} {sin(tu) − sin[t(u − y)]}/t dt dF_Y(u)

because cos(tu)/t is an odd function. Now take the limit as c → ∞:

lim_{c→∞} I_c = (1/2) ∫_{−∞}^{∞} [sgn(u) − sgn(u − y)] dF_Y(u)

= (1/2)[−P(Y < 0) + 0·P(Y = 0) + P(Y > 0)] − (1/2)[−P(Y < y) + 0·P(Y = y) + P(Y > y)]

= (1/2)[−F_Y(0) + P(Y = 0) + 1 − F_Y(0) + F_Y(y) − P(Y = y) − 1 + F_Y(y)]

= F_Y(y) − F_Y(0) + (1/2)[P(Y = 0) − P(Y = y)].

Note that if F_Y is continuous and differentiable, then

(1/2π) ∫_{−∞}^{∞} [(1 − e^{−ity})/(it)] φ_Y(t) dt = F_Y(y) − F_Y(0)   and

f_Y(y) = d/dy [F_Y(y) − F_Y(0)] = (1/2π) ∫_{−∞}^{∞} e^{−ity} φ_Y(t) dt.

6.2 HERMITE POLYNOMIALS

Let z be a scalar and denote the standard normal pdf by ϕ(z). That is, ϕ(z) = (2π)^{-1/2} exp{−z²/2}. The rth Hermite polynomial, H_r(z), is defined as

H_r(z) ≝ [(−1)^r / ϕ(z)] d^r ϕ(z)/(dz)^r,   for r = 1, 2, . . . .

By convention, H_0(z) = 1. To generate these polynomials efficiently, note that

ϕ(z − t) = (1/√(2π)) e^{−(z−t)²/2} = ϕ(z) e^{tz − t²/2}.

Expand ϕ(z − t) around t = 0:

dϕ(z − t)/dt |_{t=0} = (−1) dϕ(z)/dz   and   d^j ϕ(z − t)/(dt)^j |_{t=0} = (−1)^j d^j ϕ(z)/(dz)^j   by the chain rule

⟹ ϕ(z − t) = Σ_{j=0}^∞ [(−1)^j / j!] [d^j ϕ(z)/(dz)^j] t^j = Σ_{j=0}^∞ (t^j/j!) H_j(z) ϕ(z).

Also, note that

e^{tz − t²/2} = Σ_{j=0}^∞ (tz − t²/2)^j / j!.

Accordingly,

Σ_{j=0}^∞ (tz − t²/2)^j / j! = Σ_{j=0}^∞ (t^j/j!) H_j(z).

Now match coefficients of t^r and solve for H_r(z). The result is

H_r(z) = z^r − P_{r,2} z^{r−2}/(2¹ 1!) + P_{r,4} z^{r−4}/(2² 2!) − P_{r,6} z^{r−6}/(2³ 3!) + · · · ,   where P_{r,m} = r!/(r − m)! if r ≥ m, and P_{r,m} = 0 if r < m.

The first six Hermite polynomials are the following:

H_1(z) = z,  H_2(z) = z² − 1,  H_3(z) = z³ − 3z,  H_4(z) = z⁴ − 6z² + 3,

H_5(z) = z⁵ − 10z³ + 15z,  and  H_6(z) = z⁶ − 15z⁴ + 45z² − 15.

Theorem 57 (Orthogonal Properties of H_r) If H_r(z) is the rth Hermite polynomial, then

∫_{−∞}^{∞} H_m(z) H_n(z) ϕ(z) dz = 0 if m ≠ n,  and = m! if m = n.

Proof: Later.

Set m = 0 (recall that H_0(z) = 1) to reveal that

∫_{−∞}^{∞} H_n(z) ϕ(z) dz = 0   for n = 1, 2, . . . .
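The polynomials H_r used here are the probabilists' Hermite polynomials, so the orthogonality relation can be checked numerically. The Python sketch below (not part of the notes) relies on numpy's hermite_e module, which implements the same polynomials.

# Hedged sketch: numerical check of the orthogonality of Hermite polynomials.
import numpy as np
from numpy.polynomial import hermite_e as He
from scipy import integrate
from math import factorial

phi = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # standard normal pdf

def H(r, z):
    return He.hermeval(z, [0.0] * r + [1.0])             # selects He_r

for m in range(5):
    for k in range(5):
        val, _ = integrate.quad(lambda z: H(m, z) * H(k, z) * phi(z),
                                -np.inf, np.inf)
        expected = factorial(m) if m == k else 0.0
        assert abs(val - expected) < 1e-6
print("orthogonality verified for r = 0, ..., 4")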

Theorem 58 Denote the rth Hermite polynomial by H_r(z) and denote the standard normal pdf by ϕ(z). Then,

(1/2π) ∫_{−∞}^{∞} e^{−itz}(−it)^r e^{−t²/2} dt = (−1)^r H_r(z) ϕ(z),   for r = 1, 2, . . . ,   and

∫_{−∞}^{z} H_r(u) ϕ(u) du = Φ(z) if r = 0,  and = −H_{r−1}(z) ϕ(z) if r ≥ 1.

Proof: It follows from the inversion theorem (Theorem 56) that

ϕ(z) = (1/2π) ∫_{−∞}^{∞} e^{−itz} e^{−t²/2} dt.

Take the rth derivative with respect to z of each side to obtain

d^r ϕ(z)/(dz)^r = (−1)^r ϕ(z) H_r(z)   by the definition of H_r(z)

= (1/2π) ∫_{−∞}^{∞} e^{−itz}(−it)^r e^{−t²/2} dt.

The second result can be established by using the definition of H_r(z). That is, if r ≥ 1, then

∫_{−∞}^{z} H_r(u) ϕ(u) du = ∫_{−∞}^{z} [(−1)^r / ϕ(u)] [d^r ϕ(u)/(du)^r] ϕ(u) du

= (−1)^r ∫_{−∞}^{z} d^r ϕ(u)/(du)^r du = (−1)^r d^{r−1} ϕ(u)/(du)^{r−1} |_{−∞}^{z}

= (−1)^r (−1)^{r−1} H_{r−1}(u) ϕ(u) |_{−∞}^{z} = −H_{r−1}(z) ϕ(z).

6.3 EDGEWORTH EXPANSION FOR Ȳ

Suppose that Y_i are iid for i = 1, . . . , N with mean µ and variance σ². Define Z as Z = √N(Ȳ − µ)/σ. It is readily shown that the characteristic function of Z is

φ_Z(t) = [φ_Y(t/(√N σ))]^N exp{−√N itµ/σ}.

Accordingly, the cumulant generating function is

CGF_Z(t) = N ln[φ_Y(t/(√N σ))] − √N itµ/σ = N Σ_{j=2}^∞ [it/(σ√N)]^j κ_j(Y)/j!

= −t²/2 + (it)³κ_3(Y)/(6√N σ³) + (it)⁴κ_4(Y)/(24N σ⁴) + O(N^{-3/2}).

An Edgeworth expansion for the distribution of Z is obtained by employing the inversion theorem and integrating the above expansion term by term.

Theorem 59 [Edgeworth] If the cdf of Ȳ is continuous and differentiable, then the pdf and cdf of Ȳ are

f_Ȳ(y) = (√N/σ) ϕ(z) [1 + ρ_3 H_3(z)/(6√N) + ρ_4 H_4(z)/(24N) + ρ_3² H_6(z)/(72N) + O(N^{-3/2})],   and

F_Ȳ(y) = Φ(z) − ϕ(z) [ρ_3 H_2(z)/(6√N) + ρ_4 H_3(z)/(24N) + ρ_3² H_5(z)/(72N) + O(N^{-3/2})],   where

z = √N(y − µ)/σ,   ρ_j = κ_j(Y)/σ^j,   κ_j(Y) is the jth cumulant of Y,

and Φ is the standard normal cdf.

Proof: It follows from Theorem 56 that

f_Z(z) = (1/2π) ∫_{−∞}^{∞} e^{−itz} φ_Z(t) dt = (1/2π) ∫_{−∞}^{∞} e^{−itz} exp{CGF_Z(t)} dt.

Replacing CGF_Z by its expansion yields

f_Z(z) = (1/2π) ∫_{−∞}^{∞} e^{−itz} exp{−t²/2 + (it)³κ_3(Y)/(6√N σ³) + (it)⁴κ_4(Y)/(24N σ⁴) + O(N^{-3/2})} dt

= (1/2π) ∫_{−∞}^{∞} e^{−itz} e^{−t²/2} [1 + (it)³ρ_3/(6√N) + (it)⁴ρ_4/(24N) + (it)⁶ρ_3²/(72N) + O(N^{-3/2})] dt

= ϕ(z) [1 + ρ_3 H_3(z)/(6√N) + ρ_4 H_4(z)/(24N) + ρ_3² H_6(z)/(72N) + O(N^{-3/2})]   using Theorem 58.

The pdf of Ȳ is obtained by transforming from Z to Ȳ. The cdf of Z is obtained by integrating the pdf:

F_Z(z) = ∫_{−∞}^{z} f_Z(u) du = ∫_{−∞}^{z} ϕ(u) [1 + ρ_3 H_3(u)/(6√N) + ρ_4 H_4(u)/(24N) + ρ_3² H_6(u)/(72N) + O(N^{-3/2})] du

= Φ(z) − ϕ(z) [ρ_3 H_2(z)/(6√N) + ρ_4 H_3(z)/(24N) + ρ_3² H_5(z)/(72N) + O(N^{-3/2})]

by Theorem 58. To obtain the cdf of Ȳ, use

F_Ȳ(y) = P(Ȳ ≤ y) = P(Z ≤ z),   where z = √N(y − µ)/σ.

6.3.1 Cornish-Fisher Expansions

Cornish and Fisher (1937) constructed an expansion so that the percentiles of the distribution of Z (or Ȳ) can be expressed in terms of the percentiles of the N(0, 1) distribution and vice-versa. The results are summarized in Theorem 60.

Theorem 60 (Cornish-Fisher) Denote the 100α percentile of the distribution of Z = √N(Ȳ − µ)/σ by y_α and denote the 100α percentile of the N(0, 1) distribution by z_α. Then,

y_α = z_α + [ρ_3/(6√N)](z_α² − 1) + [ρ_4/(24N)](z_α³ − 3z_α) − [ρ_3²/(36N)](2z_α³ − 5z_α) + O(N^{-3/2}),   and

z_α = y_α − [ρ_3/(6√N)](y_α² − 1) − [ρ_4/(24N)](y_α³ − 3y_α) + [ρ_3²/(36N)](4y_α³ − 7y_α) + O(N^{-3/2}).

Furthermore, the above expansions hold for all α ∈ (0, 1). If α is a random variable with a Unif(0, 1) distribution, then z_α is a realization of a N(0, 1) random variable and y_α is a realization of a random variable having the same distribution as Z = √N(Ȳ − µ)/σ. Accordingly,

P(Z* ≤ z) = Φ(z) + O(N^{-3/2}),   where

Z* = Z − [ρ_3/(6√N)](Z² − 1) − [ρ_4/(24N)](Z³ − 3Z) + [ρ_3²/(36N)](4Z³ − 7Z),   and   Z = √N(Ȳ − µ)/σ.

Proof: It follows from the definitions of z_α and y_α that α = Φ(z_α) = F_Z(y_α). Accordingly, by Theorem 59,

α = Φ(z_α) = Φ(y_α) − ϕ(y_α)[ρ_3 H_2(y_α)/(6√N) + ρ_4 H_3(y_α)/(24N) + ρ_3² H_5(y_α)/(72N) + O(N^{-3/2})]

⟹ Φ(z_α) − Φ(y_α) = −ϕ(y_α)[ρ_3 H_2(y_α)/(6√N) + ρ_4 H_3(y_α)/(24N) + ρ_3² H_5(y_α)/(72N) + O(N^{-3/2})].

Note that Φ(z_α) = F_Z(z_α) + O(N^{-1/2}). It follows that z_α − y_α = O(N^{-1/2}). Expand Φ(z_α) in a Taylor series around z_α = y_α:

Φ(z_α) = Φ(y_α) + ϕ(y_α)(z_α − y_α) − (1/2) H_1(y_α) ϕ(y_α)(z_α − y_α)² + O(N^{-3/2}).

Substituting this expansion into the expression for Φ(z_α) − Φ(y_α) yields

Φ(z_α) − Φ(y_α) = ϕ(y_α)(z_α − y_α) − (1/2) H_1(y_α) ϕ(y_α)(z_α − y_α)² + O(N^{-3/2})

= −ϕ(y_α)[ρ_3 H_2(y_α)/(6√N) + ρ_4 H_3(y_α)/(24N) + ρ_3² H_5(y_α)/(72N) + O(N^{-3/2})]

⟹ (z_α − y_α) − (1/2) H_1(y_α)(z_α − y_α)² = −ρ_3 H_2(y_α)/(6√N) − ρ_4 H_3(y_α)/(24N) − ρ_3² H_5(y_α)/(72N) + O(N^{-3/2})

⟹ (z_α − y_α) = (1/2) H_1(y_α)(z_α − y_α)² − ρ_3 H_2(y_α)/(6√N) − ρ_4 H_3(y_α)/(24N) − ρ_3² H_5(y_α)/(72N) + O(N^{-3/2})

⟹ (z_α − y_α)² = [(1/2) H_1(y_α)(z_α − y_α)² − ρ_3 H_2(y_α)/(6√N) − ρ_4 H_3(y_α)/(24N) − ρ_3² H_5(y_α)/(72N) + O(N^{-3/2})]²

= ρ_3² H_2(y_α)²/(36N) + O(N^{-3/2})

⟹ z_α = y_α + H_1(y_α) ρ_3² H_2(y_α)²/(72N) − ρ_3 H_2(y_α)/(6√N) − ρ_4 H_3(y_α)/(24N) − ρ_3² H_5(y_α)/(72N) + O(N^{-3/2}).

The expansions for z_α and for Z* that were claimed in the statement of the theorem are obtained by explicitly writing out the Hermite polynomials. To obtain the expansion for y_α, write y_α as

y_α = δ_0 + δ_1/√N + δ_2/N + O(N^{-3/2}).

Substitute this expansion into the expansion for z_α and iteratively solve for the δ terms. That is, first make the substitution keeping only O(1) terms and then solve for δ_0. Second, make the substitution keeping all terms of magnitude O(N^{-1/2}) or larger and then solve for δ_1, etc. Specifically, write y_α as δ_0, substitute into the expansion for z_α and then drop terms smaller than O(1) to obtain z_α = δ_0. Then, write y_α as y_α = z_α + (1/√N)δ_1, substitute into the expansion for z_α and then drop terms smaller than O(N^{-1/2}) to obtain

z_α = z_α + δ_1/√N − [ρ_3/(6√N)](z_α² − 1)

⟹ δ_1 = (ρ_3/6)(z_α² − 1)   ⟹   y_α = z_α + [ρ_3/(6√N)](z_α² − 1) + δ_2/N + O(N^{-3/2}).

Substituting this expression into the expansion for z_α and solving for δ_2 yields

δ_2 = (ρ_4/24)(z_α³ − 3z_α) − (ρ_3²/36)(2z_α³ − 5z_α)   and

y_α = z_α + [ρ_3/(6√N)](z_α² − 1) + [ρ_4/(24N)](z_α³ − 3z_α) − [ρ_3²/(36N)](2z_α³ − 5z_α) + O(N^{-3/2}).

6.4 GENERAL EDGEWORTH EXPANSIONS

Suppose that W is a random quantity having a continuous distribution that depends on n, where n typically is an index of sample size. If W →_d N(0, 1) as n → ∞, then, under mild regularity conditions, an Edgeworth expansion can be constructed for the distribution of W. The result is summarized in Theorem 61.

Theorem 61 Suppose that the distribution of a continuous random variable W satisfies

W →_d N(0, 1) as n → ∞,

E(W) = ω_1/√n + O(n^{-3/2}),   Var(W) = 1 + ω_2/n + O(n^{-2}),

ρ_3(W) = ω_3/√n + O(n^{-3/2}),   ρ_4(W) = ω_4/n + O(n^{-2}),

where ω_j = O(1) for j = 1, . . . , 4. Then, the pdf and cdf of W can be expanded as follows:

f_W(w) = ϕ(w)[1 + ω_1 H_1(w)/√n + g_2 H_2(w)/(2n) + ω_3 H_3(w)/(6√n) + g_4 H_4(w)/(24n) + ω_3² H_6(w)/(72n)] + O(n^{-3/2}),   and

F_W(w) = Φ(w) − ϕ(w)[ω_1/√n + g_2 H_1(w)/(2n) + ω_3 H_2(w)/(6√n) + g_4 H_3(w)/(24n) + ω_3² H_5(w)/(72n)] + O(n^{-3/2}),   where

g_2 = ω_2 + ω_1²,   and   g_4 = ω_4 + 4ω_1ω_3.

Proof: The claimed results can be verified by the approach used to verify Theorem 59. Also, see Theorem 2 in Boik (2005) for an extension to the t distribution.

As an example, suppose that U ∼ χ²_{ν,λ}. The cumulant generating function of U is

CGF_U(t) = Σ_{j=1}^∞ [(it)^j / j!] [2^{j−1}(j − 1)! ν + 2^j j! λ].

It follows that E(U) = ν + 2λ and Var(U) = 2ν + 8λ. Define W as

W ≝ [U − (ν + 2λ)] / √(2ν + 8λ).

It is readily shown that the cumulant generating function of W is

CGF_W(t) = Σ_{j=2}^∞ [(it)^j / j!] [2^{j−1}(j − 1)! ν + 2^j j! λ] / (2ν + 8λ)^{j/2},   and that

ρ_3(W) = (8ν + 48λ)/(2ν + 8λ)^{3/2}   and   ρ_4(W) = (48ν + 384λ)/(2ν + 8λ)².

Accordingly, an Edgeworth expansion for the distribution of W can be obtained from Theorem 61, where

ω_1 = ω_2 = 0,   ω_3 = (4ν + 24λ)/(ν + 4λ),   ω_4 = (24ν + 192λ)/(ν + 4λ),   and   n = 2ν + 8λ.

Note that n → ∞ as ν → ∞ and/or as λ → ∞. Results for (ν, λ) = (8, 0), (8, 2), and (8, 10) are displayed below.

[Figure: Edgeworth Expansions of the χ²_{ν,λ} PDF. Three panels compare the exact density with the normal and Edgeworth approximations to the densities of χ²(8,0), χ²(8,2), and χ²(8,10).]
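The panels above can be reproduced, at least approximately, from Theorem 61. The Python sketch below (not part of the notes) evaluates the Edgeworth density of U ∼ χ²_{ν,λ} at a few points and compares it with the exact noncentral chi-square density; the mapping nc = 2λ to scipy's noncentrality parameter is an assumption based on E(U) = ν + 2λ.

# Hedged sketch: Edgeworth approximation (Theorem 61) to the chi-square(nu, lambda) pdf.
import numpy as np
from numpy.polynomial import hermite_e as He
from scipy import stats

nu, lam = 8.0, 2.0
n = 2 * nu + 8 * lam                   # expansion index from the example above
omega3 = (4 * nu + 24 * lam) / (nu + 4 * lam)
omega4 = (24 * nu + 192 * lam) / (nu + 4 * lam)

def edgeworth_pdf(u):
    sigma = np.sqrt(2 * nu + 8 * lam)
    w = (u - nu - 2 * lam) / sigma
    phi = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)
    H3 = He.hermeval(w, [0, 0, 0, 1])
    H4 = He.hermeval(w, [0, 0, 0, 0, 1])
    H6 = He.hermeval(w, [0, 0, 0, 0, 0, 0, 1])
    fw = phi * (1 + omega3 * H3 / (6 * np.sqrt(n))
                  + omega4 * H4 / (24 * n)
                  + omega3**2 * H6 / (72 * n))
    return fw / sigma                  # Jacobian of the transformation W -> U

u = np.array([4.0, 10.0, 20.0])
print(edgeworth_pdf(u))
print(stats.ncx2.pdf(u, nu, 2 * lam))  # exact density for comparison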

6.4.1 Expansion of Approximate Pivotal Quantities in Regression Analysis

The following material is taken from Boik (2005). Consider the linear model

Y = Xβ + ε,   (6.1)

where Y is an N × 1 random vector of responses, X is an N × p matrix of explanatory variables with rank(X) = r ≤ p, β is a p × 1 vector of unknown regression coefficients, and ε is an N × 1 vector of random deviations. It is assumed that the column space generated by X contains a column of ones. That is, the model contains an intercept term. Without loss of generality, it is assumed that the first column of X is a column of ones and, therefore, the first entry in β, say β_0, is an intercept term. If columns 2, 3, . . . , p of X consist of quantitative explanatory variables, as they might in a regression model, then X typically has rank p. If X contains columns that code for treatments and interactions, however, then the rank of X may be less than p. In any case, it is assumed that the rank of X does not change as N increases and that each nonzero eigenvalue of X′X diverges to +∞ as N → ∞. This condition ensures that information about any estimable function of β increases as N increases. It is assumed that the N components ε_1, . . . , ε_N are independently and identically distributed with an arbitrary continuous distribution having mean zero, variance σ², and finite higher-order moments at least to order six. In particular, the distribution of ε_i may exhibit strong skewness.

Suppose that the investigator has interest in making an inference about an estimable linear function of β, say ψ = c′β, where c is a p × 1 vector of known constants. Under the stated assumptions on ε, the best linear unbiased estimator of ψ is

ψ̂ = c′β̂,   where   β̂ = (X′X)⁻X′Y.

The exact distribution of ψ̂ depends on the distribution of ε. The first two moments, however, depend only on the mean and variance of ε. Specifically,

E(ψ̂) = ψ   and   Var(ψ̂) = σ²_ψ/n,   where   σ²_ψ = σ² q_0,   q_0 = n c′(X′X)⁻c,

and n = N − r. The conventional unbiased estimator of σ²_ψ is

σ̂²_ψ = S² q_0,   where   S² = (1/n) Y′AY,   A = I_N − X(X′X)⁻X′.

Note that I_N − A is the perpendicular projection operator onto the column space generated by X.

The starting point for expanding the distribution of the approximate pivotal quantity V = √n(ψ̂ − ψ)/σ̂_ψ is to obtain the finite-sample moments of the distribution. Exact moments of V cannot be obtained without additional knowledge about the distribution of the vector of random errors, ε. Nonetheless, approximations to the moments can be obtained by using a stochastic Taylor series expansion of V^j around S² = σ², where j = 0, 1, 2, . . . . The general form of this expansion is as follows:

V^j = n^{j/2}(ψ̂ − ψ)^j / σ̂_ψ^j = [n^{j/2}(ψ̂ − ψ)^j / σ_ψ^j] [1 + (S² − σ²)/σ²]^{−j/2}

= [n^{j/2}(ψ̂ − ψ)^j / σ_ψ^j] [1 − (j/2)(S² − σ²)/σ² + (j(j + 2)/8)((S² − σ²)/σ²)² + Op(n^{-3/2})]

= Z_1^j [1 − j Z_2/(2√n) + j(j + 2) Z_2²/(8n) + Op(n^{-3/2})],   where

Z_1 = √n(ψ̂ − ψ)/σ_ψ   and   Z_2 = √n(S² − σ²)/σ².

This expansion enables one to express the moments of V^j as functions of the joint moments of Z_1 and Z_2. See Boik (2005) for additional details.

6.4.2 Maximum Likelihood Estimators

Suppose that y_i for i = 1, . . . , N are independently distributed random vectors and that the pdf for y_i is f_i(y_i; θ). Denote the mle of θ by θ̂. Furthermore, suppose that an expansion of the distribution of the scalar valued quantity √N g(θ̂; θ) is desired, where

√N g(θ̂; θ) = Op(1),   g(θ; θ) = 0,   and   g^{(r)}(θ) = O(1) for r = 1, 2, . . . ,

where g^{(r)}(θ) is the rth derivative of g(θ̂; θ) with respect to θ̂ evaluated at θ̂ = θ. Specifically,

g^{(1)}(θ) = ∂g(θ̂; θ)/∂θ̂ |_{θ̂=θ},   g^{(2)}(θ) = ∂²g(θ̂; θ)/(∂θ̂′ ⊗ ∂θ̂) |_{θ̂=θ},

and   g^{(3)}(θ) = ∂³g(θ̂; θ)/(∂θ̂′ ⊗ ∂θ̂′ ⊗ ∂θ̂) |_{θ̂=θ}.

Then,

√N g(θ̂; θ) = [g^{(1)}(θ)]′ √N(θ̂ − θ) + (1/(2√N)) √N(θ̂ − θ)′ g^{(2)}(θ) √N(θ̂ − θ)
  + (1/(6N)) √N(θ̂ − θ)′ g^{(3)}(θ) [√N(θ̂ − θ) ⊗ √N(θ̂ − θ)] + Op(N^{-3/2}).

In principle, cumulants of √N g(θ̂; θ) can be computed to order O(N^{-3/2}) by evaluating the expectations

E{[√N g(θ̂; θ)]^r} = E{[ [g^{(1)}(θ)]′ √N(θ̂ − θ) + (1/(2√N)) √N(θ̂ − θ)′ g^{(2)}(θ) √N(θ̂ − θ)
  + (1/(6N)) √N(θ̂ − θ)′ g^{(3)}(θ) (√N(θ̂ − θ) ⊗ √N(θ̂ − θ)) ]^r}

for r = 1, 2, . . . . If the moments of √N(θ̂ − θ) can be computed, then an Edgeworth expansion for √N g(θ̂; θ) can be constructed by employing Theorem 61. Theorem 62 describes an expansion that can be used to obtain expressions for the moments of the mles.

Theorem 62 Suppose that y_i for i = 1, . . . , N are independently distributed random vectors and that the pdf for y_i is f_i(y_i; θ). Assume that the usual regularity conditions are satisfied. Denote the log likelihood function for θ given the entire sample by ℓ(θ) and denote its rth derivative with respect to θ by ℓ^{(r)}. Specifically,

ℓ^{(1)} = ∂ℓ(θ)/∂θ,   ℓ^{(2)} = ∂²ℓ(θ)/(∂θ ⊗ ∂θ′),

ℓ^{(3)} = ∂³ℓ(θ)/(∂θ ⊗ ∂θ′ ⊗ ∂θ′),   and   ℓ^{(4)} = ∂⁴ℓ(θ)/(∂θ ⊗ ∂θ′ ⊗ ∂θ′ ⊗ ∂θ′).

Define Z_r as

Z_r ≝ √N(ℓ̄^{(r)}(θ) − K_r) = ℓ^{(r)}(θ)/√N − √N K_r,   where

ℓ̄^{(r)}(θ) = (1/N) ℓ^{(r)}(θ)   and   K_r = (1/N) E(ℓ^{(r)}).

Note that K_1 = 0, K_r = O(1) for r ≥ 2, K_2 = −I_θ (minus Fisher's information matrix), Z_r = Op(1), and Z_1 = S(θ)/√N (the scaled score function). Denote the mle of θ by θ̂. Then,

√N(θ̂ − θ) = δ_0 + δ_1/√N + δ_2/N + Op(N^{-3/2}),   where

δ_0 = I_θ^{-1} ℓ^{(1)}(θ)/√N,   δ_1 = I_θ^{-1} [Z_2 δ_0 + (1/2) K_3 (δ_0 ⊗ δ_0)],   and

δ_2 = I_θ^{-1} [Z_2 δ_1 + K_3 (δ_0 ⊗ δ_1) + (1/2) Z_3 (δ_0 ⊗ δ_0) + (1/6) K_4 (δ_0 ⊗ δ_0 ⊗ δ_0)].

Proof: Expand ℓ^{(1)}(θ̂)/√N = 0 in a Taylor series around θ̂ = θ:

0 = ℓ^{(1)}(θ̂)/√N = ℓ^{(1)}(θ)/√N + ℓ̄^{(2)}(θ) √N(θ̂ − θ) + (1/(2√N)) ℓ̄^{(3)}(θ) [√N(θ̂ − θ) ⊗ √N(θ̂ − θ)]
  + (1/(6N)) ℓ̄^{(4)}(θ) [√N(θ̂ − θ) ⊗ √N(θ̂ − θ) ⊗ √N(θ̂ − θ)] + Op(N^{-3/2})

= Z_1 + (K_2 + Z_2/√N) √N(θ̂ − θ) + (1/(2√N)) (K_3 + Z_3/√N) [√N(θ̂ − θ) ⊗ √N(θ̂ − θ)]
  + (1/(6N)) (K_4 + Z_4/√N) [√N(θ̂ − θ) ⊗ √N(θ̂ − θ) ⊗ √N(θ̂ − θ)] + Op(N^{-3/2}).

Write √N(θ̂ − θ) as

√N(θ̂ − θ) = δ_0 + δ_1/√N + δ_2/N + Op(N^{-3/2}),

substitute this expression into the expansion of ℓ^{(1)}(θ̂)/√N and iteratively solve for δ_j. Specifically,

0 = Z_1 + (K_2 + Z_2/√N)(δ_0 + δ_1/√N + δ_2/N) + (1/(2√N))(K_3 + Z_3/√N)[(δ_0 + δ_1/√N) ⊗ (δ_0 + δ_1/√N)]
  + (1/(6N))(K_4 + Z_4/√N)(δ_0 ⊗ δ_0 ⊗ δ_0) + Op(N^{-3/2})

⟹ 0 = (1/N⁰)[Z_1 + K_2 δ_0] + (1/N^{1/2})[K_2 δ_1 + Z_2 δ_0 + (1/2) K_3 (δ_0 ⊗ δ_0)]
  + (1/N)[K_2 δ_2 + Z_2 δ_1 + (1/2) K_3 (δ_0 ⊗ δ_1) + (1/2) K_3 (δ_1 ⊗ δ_0) + (1/2) Z_3 (δ_0 ⊗ δ_0) + (1/6) K_4 (δ_0 ⊗ δ_0 ⊗ δ_0)] + Op(N^{-3/2}).

Note that the right-hand side of the above relation is zero only if the term of magnitude O(N⁰) is zero, the term of magnitude O(N^{-1/2}) is zero, and the term of magnitude O(N^{-1}) is zero. The claim of the theorem is obtained by solving first for δ_0, second for δ_1, and third for δ_2. The solution can be simplified by using K_3(a ⊗ b) = K_3(b ⊗ a), where a and b are any vectors that have the same dimension as θ.

6.4.3 Example: MLE of Gamma Shape Parameter

As an example, consider a random sample of size N from the gamma distribution with density

f_Y(y; α) = [y^{α−1} e^{−y/β} / (β^α Γ(α))] I_{(0,∞)}(y),

where β is known. Before differentiating the likelihood function, an expression for the cumulants of

V = √N {ln(G) − E[ln(G)]}

will be obtained, where

G = (∏_{i=1}^N Y_i)^{1/N}   and   ln(G) = (1/N) Σ_{i=1}^N ln(Y_i).

Transform from Y to U = ln(Y):

f_U(u) = [e^{αu} exp{−e^u/β} / (β^α Γ(α))] I_{(−∞,∞)}(u).

It readily follows that the moment generating and cumulant generating functions of U are

M_U(t) = Γ(α + t) β^t / Γ(α)   and   CGF_U(t) = t[ln(β) + ψ(α)] + Σ_{j=2}^∞ (t^j/j!) ψ^{(j−1)}(α),

where ψ(α) = d ln Γ(α)/dα is the digamma function and ψ^{(j)}(α) denotes its jth derivative with respect to α. That is, ψ^{(1)}(α) = ψ′(α), ψ^{(2)}(α) = ψ′′(α), ψ^{(3)}(α) = ψ′′′(α), etc. It follows that E(ln G) = ln(β) + ψ(α), the cumulant generating function of V is

CGF_V(t) = Σ_{j=2}^∞ (t^j/j!) ψ^{(j−1)}(α) / N^{j/2 − 1},

and the first six moments of V are

E(V) = 0,   E(V²) = ψ^{(1)}(α),   E(V³) = ψ^{(2)}(α)/√N,   E(V⁴) = ψ^{(3)}(α)/N + 3[ψ^{(1)}(α)]²,

E(V⁵) = ψ^{(4)}(α)/N^{3/2} + 10 ψ^{(2)}(α) ψ^{(1)}(α)/√N = 10 ψ^{(2)}(α) ψ^{(1)}(α)/√N + O(N^{-3/2}),

E(V⁶) = ψ^{(5)}(α)/N² + 15 (ψ^{(3)}(α) ψ^{(1)}(α)/N + [ψ^{(1)}(α)]³) + 10 [ψ^{(2)}(α)]²/N

= 15 (ψ^{(3)}(α) ψ^{(1)}(α)/N + [ψ^{(1)}(α)]³) + 10 [ψ^{(2)}(α)]²/N + O(N^{-2}).

The log likelihood function is

ℓ(α) = (α − 1) Σ_{i=1}^N ln(Y_i) − N Ȳ/β − Nα ln(β) − N ln Γ(α).

The required values for ℓ^{(r)}(α), Z_i, and K_i are as follows:

ℓ^{(1)}(α) = Σ_{i=1}^N ln(Y_i) − N ln(β) − N ψ(α),   ℓ^{(2)}(α) = −N ψ′(α),   ℓ^{(3)}(α) = −N ψ′′(α),   ℓ^{(4)}(α) = −N ψ′′′(α),

K_2 = −ψ′(α),   K_3 = −ψ′′(α),   K_4 = −ψ′′′(α),

Z_1 = √N[ln(G) − ln(β) − ψ(α)],   Z_2 = 0,   and   Z_3 = 0.

Accordingly,

I_α = ψ′(α),   I_α^{-1} = 1/ψ′(α),   and   δ_0 = √N[ln G − ln β − ψ(α)] / ψ′(α).

It is known from first-order theory that

√N(α̂ − α) →_d N(0, 1/ψ′(α)).

It follows that W →_d N(0, 1), where W = √(N ψ′(α)) (α̂ − α).

From Theorem 62, the expansion for W is

W = √(N ψ′(α)) (α̂ − α) = √(ψ′(α)) [δ_0 + δ_1/√N + δ_2/N + Op(N^{-3/2})],   where

δ_0 = √N[ln G − ln β − ψ(α)] / ψ′(α),   δ_1 = − ψ′′(α) {√N[ln G − ln β − ψ(α)]}² / (2[ψ′(α)]³),   and

δ_2 = {3[ψ′′(α)]² − ψ′(α)ψ′′′(α)} {√N[ln G − ln β − ψ(α)]}³ / (6[ψ′(α)]⁵).

With the aid of a Maple program, the first four cumulants of W are as follows:

κ_1(W) = − ψ^{(2)}(α) / (2√N [ψ^{(1)}(α)]^{3/2}) + O(N^{-3/2}),

κ_2(W) = 1 + {5[ψ^{(2)}(α)]² − 2ψ^{(1)}(α)ψ^{(3)}(α)} / (2N [ψ^{(1)}(α)]³) + O(N^{-2}),

κ_3(W) = −2 ψ^{(2)}(α) / (√N [ψ^{(1)}(α)]^{3/2}) + O(N^{-3/2}),   and

κ_4(W) = 3{4[ψ^{(2)}(α)]² − ψ^{(1)}(α)ψ^{(3)}(α)} / (N [ψ^{(1)}(α)]³) + O(N^{-2}).

The Maple program is listed below (p1, p2, p3 denote ψ^{(1)}(α), ψ^{(2)}(α), ψ^{(3)}(α); m1-m6 are the moments of V given above; a1-a4 are the first four raw moments of √N(α̂ − α)):

# moments of V = sqrt(N)*(ln G - E ln G); pj denotes psi^(j)(alpha)
m1:=0;
m2:=p1;
m3:=p2/sqrt(N);
m4:=p3/N+3*p1^2;
m5:=10*p2*p1/sqrt(N);
m6:=15*p3*p1/N+15*p1^3+10*p2^2/N;
# coefficients of V, V^2, V^3 in the expansion of sqrt(N)*(alphahat - alpha)
c0:=1/p1;
c1:=-p2/(sqrt(N)*2*p1^3);
c2:=(3*p2^2-p1*p3)/(N*6*p1^5);
# building blocks: products of coefficients times the matching moment of V
d0:=0;
d1:=c1*m2;
d2:=c2*m3;
d01:=c0*c1*m3;
d02:=c0*c2*m4;
d00:=c0^2*m2;
d001:=c0^2*c1*m4;
d002:=c0^2*c2*m5;
d11:=c1^2*m4;
d0011:=c0^2*c1^2*m6;
d000:=c0^3*m3;
d0000:=c0^4*m4;
d0001:=c0^3*c1*m5;
d0002:=c0^3*c2*m6;
# raw moments of sqrt(N)*(alphahat - alpha)
a1:=d0+d1+d2;
a2:=d00+d11+2*d01+2*d02;
a3:=d000+3*d001;
a4:=d0000+4*d0001+4*d0002+6*d0011;
# cumulants of W = sqrt(psi'(alpha))*sqrt(N)*(alphahat - alpha)
k1:=asympt(sqrt(p1)*a1,N,4);
k2:=asympt(p1*(a2-a1^2),N,4);
k22:=factor(coeff(k2,N^(-1)));
k3:=asympt(p1^(3/2)*(a3-3*a2*a1+2*a1^3),N,4);
k4:=asympt(p1^2*(a4-4*a3*a1-3*a2^2+12*a2*a1^2-6*a1^4),N,2);

Chapter 7

SADDLEPOINT EXPANSIONS

Suppose that W is a scalar continuous random variable with (real) cumulant generating function

K_W(t) = (t²/2)(1 + ω_2/n) + (t³/(6√n)) ω_3 + (t⁴/(24n)) ω_4 + O(n^{-3/2}).

In this chapter, a saddlepoint expansion for the distribution of W is constructed.

7.1 TILTED EXPONENTIAL PDF

If W has pdf f_W(w), moment generating function M_W(φ), and (real) cumulant generating function K_W(φ), then

f_W(w; φ) = exp{φw − K_W(φ)} f_W(w)

is a pdf for all φ for which K_W(φ) exists. This density is called an exponentially tilted pdf. To verify that f_W(w; φ) is a pdf, first note that f_W(w; φ) is nonnegative. Also,

∫_{−∞}^{∞} f_W(w; φ) dw = M_W(φ) exp{−K_W(φ)} = exp{K_W(φ) − K_W(φ)} = 1.

The motivation for performing the exponential tilting is to embed f_W(w) within the exponential class. Define Z as

Z ≝ [W − E(W|φ)] / √Var(W|φ).

The moments of W given φ can be obtained from the moment generating function:

M_{W|φ}(t) = E_{W|φ}(e^{tW}) = ∫_{−∞}^{∞} exp{tw + φw − K_W(φ)} f_W(w) dw

= ∫_{−∞}^{∞} exp{(t + φ)w − K_W(t + φ)} f_W(w) dw · exp{K_W(t + φ) − K_W(φ)} = exp{K_W(t + φ) − K_W(φ)},

provided that K_W(t + φ) exists. Accordingly, the rth cumulant of W given φ is

κ_r(W|φ) = ∂^r[K_W(t + φ) − K_W(φ)]/(∂t)^r |_{t=0} = K_W^{(r)}(φ),   where   K_W^{(r)}(φ) = ∂^r K_W(φ)/(∂φ)^r.

Accordingly,

Z = [W − K_W^{(1)}(φ)] / √(K_W^{(2)}(φ)).

Note that K_W(φ) has the expansion

K_W(φ) = (φ²/2)(1 + ω_2/n) + (φ³/(6√n)) ω_3 + (φ⁴/(24n)) ω_4 + O(n^{-3/2}).

The magnitude of K_W^{(r)}(φ) depends on the value of r and is given by

K_W^{(r)}(φ) = O(1) for r ≤ 2,   and   K_W^{(r)}(φ) = O(n^{-r/2 + 1}) for r ≥ 2.

As a preliminary step in obtaining an expansion for the distribution of W, an expansion for the distribution of Z is obtained. The moment generating function of Z is

M_{Z|φ}(t) = E(e^{tZ}) = M_{W|φ}(t/√(K_W^{(2)}(φ))) exp{−t K_W^{(1)}(φ)/√(K_W^{(2)}(φ))}.

It follows that the cumulant generating function of Z given φ is

K_{Z|φ}(t) = K_{W|φ}(t/√(K_W^{(2)}(φ))) − t K_W^{(1)}(φ)/√(K_W^{(2)}(φ))

= K_W(φ + t/√(K_W^{(2)}(φ))) − K_W(φ) − t K_W^{(1)}(φ)/√(K_W^{(2)}(φ)).

Define ρ_r(Z|φ) as

ρ_r(Z|φ) ≝ n^{r/2 − 1} ∂^r K_{Z|φ}(t)/(∂t)^r |_{t=0} = n^{r/2 − 1} K_W^{(r)}(φ) / [K_W^{(2)}(φ)]^{r/2}.

Then the rth cumulant of Z given φ is

κ_r(Z|φ) = ρ_r(Z|φ) / n^{r/2 − 1}.

Specifically,

κ_1(Z|φ) = √n ρ_1(Z|φ) = 0,   κ_2(Z|φ) = ρ_2(Z|φ) = 1,   and   κ_r(Z|φ) = ρ_r(Z|φ)/n^{r/2 − 1} = O(n^{-r/2 + 1})   for r = 2, 3, . . . .

It follows from Theorem 61 that the density of Z given φ can be expanded as

f_Z(z; φ) = ϕ(z)[1 + ρ_3(Z|φ) H_3(z)/(6√n) + ρ_4(Z|φ) H_4(z)/(24n) + ρ_3²(Z|φ) H_6(z)/(72n) + O(n^{-3/2})].

Recall that primary interest is in the distribution of W rather than the distribution of Z. The distribution of W given φ can be obtained by transforming from Z to W:

|J| = |dZ/dW| = 1/√(K_W^{(2)}(φ))

⟹ f_W(w; φ) = [ϕ(z)/√(K_W^{(2)}(φ))] [1 + ρ_3(Z|φ) H_3(z)/(6√n) + ρ_4(Z|φ) H_4(z)/(24n) + ρ_3²(Z|φ) H_6(z)/(72n) + O(n^{-3/2})],

where z = [w − K_W^{(1)}(φ)] / √(K_W^{(2)}(φ)).

7.2 THE SADDLEPOINT SOLUTION

The remaining issue is to decide what value of φ to employ. The problem is simplified if the exponential tilt is inverted. Recall that

f_W(w; φ) = exp{φw − K_W(φ)} f_W(w).

It follows that

f_W(w) = exp{−φw + K_W(φ)} f_W(w; φ).

Accordingly,

f_W(w) = exp{K_W(φ) − φw} [ϕ(z)/√(K_W^{(2)}(φ))] [1 + ρ_3(Z|φ) H_3(z)/(6√n) + ρ_4(Z|φ) H_4(z)/(24n) + ρ_3²(Z|φ) H_6(z)/(72n) + O(n^{-3/2})].

The expansion of the density of W given φ is valid for any φ for which K_W^{(r)}(φ) exists. The saddlepoint solution is to evaluate φ at the value that satisfies z = 0. That is, the chosen φ is the maximizer of f_W(w; φ):

∂ ln[f_W(w; φ)]/∂φ = 0   ⟹   w = K_W^{(1)}(φ̂).

The maximizer, φ̂, is the MLE of φ in the tilted exponential distribution. The density of W evaluated at φ̂ simplifies to

f_W(w) = [exp{K_W(φ̂) − φ̂w} / (√(2π) √(K_W^{(2)}(φ̂)))] [1 + (3ρ_4(Z|φ̂) − 5ρ_3²(Z|φ̂))/(24n) + O(n^{-3/2})]

= [exp{K_W(φ̂) − φ̂w} / (√(2π) √(K_W^{(2)}(φ̂)))] [1 + O(n^{-1})].

The latter expansion is called the saddlepoint expansion. Note that, unlike the Edgeworth expansion, the saddlepoint expansion actually is a density.

The saddlepoint expansion can be improved by renormalization. The idea is to approximate the density of W by

f_W(w) = c · [exp{K_W(φ̂) − φ̂w} / (√(2π) √(K_W^{(2)}(φ̂)))] [1 + O(n^{-1})],

where the value of c is the one that causes the function to integrate to one.

7.2.1 Example: Noncentral χ²

Suppose that Q ∼ χ²_{ν,λ}. It is readily shown that the moment and cumulant generating functions are

M_Q(t) = (1 − 2t)^{-ν/2} exp{λ[1/(1 − 2t) − 1]}   and

K_Q(t) = −(ν/2) ln(1 − 2t) + λ[1/(1 − 2t) − 1] = Σ_{j=1}^∞ (t^j/j!)[2^{j−1}(j − 1)! ν + 2^j j! λ].

In particular, E(Q) = ν + 2λ and Var(Q) = 2ν + 8λ. Define W as

W = (Q − µ_Q)/σ_Q,   where   µ_Q = ν + 2λ   and   σ_Q = √(2ν + 8λ).

The moment and cumulant generating functions of W are

M_W(φ) = E(e^{Wφ}) = E(exp{φQ/σ_Q − φµ_Q/σ_Q}) = M_Q(φ/σ_Q) exp{−φµ_Q/σ_Q}

= (1 − 2φ/σ_Q)^{-ν/2} exp{λ[1/(1 − 2φ/σ_Q) − 1]} exp{−φµ_Q/σ_Q}   and

and

K_W(φ) = −(ν/2) ln(1 − 2φ/σ_Q) + λ[1/(1 − 2φ/σ_Q) − 1] − φµ_Q/σ_Q.

The parameter space for φ is (−∞, σ_Q/2).

The first two moments of W given φ are

E(W|φ) = K_W^{(1)}(φ) = ν/(σ_Q − 2φ) + 2λσ_Q/(σ_Q − 2φ)² − µ_Q/σ_Q   and

Var(W|φ) = K_W^{(2)}(φ) = [2σ_Q(ν + 4λ) − 4νφ] / (σ_Q − 2φ)³.

Equating W to E(W|φ) and solving for φ yields

φ̂ = (σ_Q/2)[1 − ν/(2Q) ± √(ν² + 8λQ)/(2Q)].

To satisfy the restriction φ̂ < σ_Q/2, it is necessary to choose the solution

φ̂ = (σ_Q/2)[1 − ν/(2Q) − √(ν² + 8λQ)/(2Q)].

The saddlepoint approximation to the density of W is

f_W(w) = [1/√(2π K_W^{(2)}(φ̂))] exp{K_W(φ̂) − wφ̂} [1 + O(n^{-1})],

where n = 2ν + 8λ. Transforming from W to Q yields

f_Q(q) = [1/√(2π σ_Q² K_W^{(2)}(φ̂))] exp{K_W(φ̂) − φ̂(q − µ_Q)/σ_Q} [1 + O(n^{-1})].

The accuracy of the saddlepoint approximation is illustrated below.

[Figure: Edgeworth and Saddlepoint Expansions of the χ²_{ν,λ} PDF. Left panel: exact density of χ²(8,2) with the normal, Edgeworth, and saddlepoint approximations. Right panel: exact density of χ²(8,2) with the saddlepoint and renormalized saddlepoint approximations.]
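The left panel can be approximated directly from the formulas above. The Python sketch below (not part of the notes) evaluates the saddlepoint density of Q ∼ χ²_{ν,λ} at a few points and compares it with the exact density; as before, nc = 2λ in scipy's parameterization is an assumption based on E(Q) = ν + 2λ.

# Hedged sketch: saddlepoint approximation to the chi-square(nu, lambda) pdf.
import numpy as np
from scipy import stats

nu, lam = 8.0, 2.0
mu = nu + 2 * lam
sigma = np.sqrt(2 * nu + 8 * lam)

def saddlepoint_pdf(q):
    q = np.asarray(q, dtype=float)
    phi = 0.5 * sigma * (1 - nu / (2 * q)
                           - np.sqrt(nu**2 + 8 * lam * q) / (2 * q))
    u = 1 - 2 * phi / sigma
    KW = -0.5 * nu * np.log(u) + lam * (1 / u - 1) - phi * mu / sigma
    K2 = (2 * sigma * (nu + 4 * lam) - 4 * nu * phi) / (sigma - 2 * phi) ** 3
    w = (q - mu) / sigma
    fW = np.exp(KW - phi * w) / np.sqrt(2 * np.pi * K2)
    return fW / sigma                    # Jacobian of the transformation W -> Q

q = np.array([4.0, 10.0, 20.0, 30.0])
print(saddlepoint_pdf(q))
print(stats.ncx2.pdf(q, nu, 2 * lam))    # exact density for comparison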

7.3 SADDLEPOINT APPROXIMATIONS TO MLES IN EXPONENTIAL FAMILIES

Suppose that Y_i for i = 1, . . . , n are iid random d-vectors, each with continuous pdf

f_Y(y; θ) = exp{θ′t(y) − ω(θ)} d(y),

where θ is a ν-dimensional parameter, ω(θ) is a scalar function of θ, and t(y) is a ν-dimensional function of y. That is, the pdf is a member of the full-rank exponential family with natural parameter θ. The joint pdf of Y_i for i = 1, . . . , n is

f_Y(y; θ) = exp{θ′ Σ_{i=1}^n t(y_i) − nω(θ)} ∏_{i=1}^n d(y_i).

Define T as

T = Σ_{i=1}^n t(Y_i).

It follows from Theorem 26 that T is minimal sufficient; the family of densities of T is complete; and the pdf of T is a member of the exponential class:

f_T(t; θ) = exp{θ′t − nω(θ)} h(t),

where h(t) is a nonnegative function of t.

The goal in this section is to construct a saddlepoint approximation to the density of the mle of θ, namely θ̂. The strategy is to first construct a saddlepoint approximation to the density of T and then to transform from T to θ̂. If the function h(t) is known, then approximation methods are not needed. In this case, all that is required is to transform from T to θ̂. The mle satisfies

nΩ^{(1)}(θ̂) − T = 0,   where   Ω^{(1)}(θ̂) = ∂ω(θ)/∂θ |_{θ=θ̂}.

Accordingly,

∂/∂θ̂′ [nΩ^{(1)}(θ̂) − T] = 0   ⟹   nΩ^{(2)}(θ̂) − ∂T/∂θ̂′ = 0,   where   Ω^{(2)}(θ̂) = ∂²ω(θ)/(∂θ ⊗ ∂θ′) |_{θ=θ̂}

⟹ the Jacobian is |∂T/∂θ̂′| = n^ν |Ω^{(2)}(θ̂)|.

The exact pdf of θ̂ is

f_θ̂(θ̂; θ) = n^ν |Ω^{(2)}(θ̂)| exp{nθ′Ω^{(1)}(θ̂) − nω(θ)} h[nΩ^{(1)}(θ̂)].

Generally, the function h(t) is not known. Nonetheless, it is known that

h(t) = f_T(t; θ) exp{−θ′t + nω(θ)}.

Consider replacing f_T(t; θ) by its Edgeworth expansion. It is readily shown that the cumulant generating function of T is

K_T(ξ) = n[ω(θ + ξ) − ω(θ)].

Accordingly,

E(T) = nΩ^{(1)}(θ),   Var(T) = nΩ^{(2)}(θ),   and   Z →_d N(0, I_ν) as n → ∞,   where

Z = n^{-1/2} [Ω^{(2)}(θ)]^{-1/2} [T − nΩ^{(1)}(θ)].

The Edgeworth expansion for the density of Z is

f_Z(z; θ) = [exp{−(1/2) z′z} / (2π)^{ν/2}] [1 + [ρ_3(Z)]′ H_3(z)/(6√n) + O(n^{-1})],

where ρ_3(Z)/√n is the third-order multivariate cumulant of Z, and H_3(z) is the third-order multivariate Hermite polynomial evaluated at z. Transforming from Z to T yields

|∂Z/∂T′| = n^{-ν/2} |Ω^{(2)}(θ)|^{-1/2}

⟹ f_T(t; θ) = [exp{−(1/2n) [t − nΩ^{(1)}(θ)]′ [Ω^{(2)}(θ)]^{-1} [t − nΩ^{(1)}(θ)]} / ((2π)^{ν/2} n^{ν/2} |Ω^{(2)}(θ)|^{1/2})] [1 + [ρ_3(Z)]′ H_3(z)/(6√n) + O(n^{-1})].

It follows that

h(t) = [exp{−(1/2n) [t − nΩ^{(1)}(θ)]′ [Ω^{(2)}(θ)]^{-1} [t − nΩ^{(1)}(θ)]} / ((2π)^{ν/2} n^{ν/2} |Ω^{(2)}(θ)|^{1/2})] exp{−θ′t + nω(θ)} [1 + [ρ_3(Z)]′ H_3(z)/(6√n) + O(n^{-1})].

Note that the value of the parameter θ in the approximation to h(t) is arbitrary. To ensure greatest accuracy, the parameter value can be chosen so that the O(n^{-1/2}) term goes to zero. This term is a cubic function of z and is zero iff z = 0. Therefore, the chosen value of θ satisfies t − nΩ^{(1)}(θ) = 0. This value is the mle of θ given T = t. Setting θ = θ̂, the approximation to h(t) is

h(t) = [1 / ((2π)^{ν/2} n^{ν/2} |Ω^{(2)}(θ̂)|^{1/2})] exp{−θ̂′t + nω(θ̂)} [1 + O(n^{-1})],

and the saddlepoint approximation to the density of T is

f_T(t; θ) = [exp{(θ − θ̂)′t − n[ω(θ) − ω(θ̂)]} / (n^{ν/2} |Ω^{(2)}(θ̂)|^{1/2} (2π)^{ν/2})] [1 + O(n^{-1})]

= [exp{ℓ(θ; t) − ℓ(θ̂; t)} / (n^{ν/2} |Ω^{(2)}(θ̂)|^{1/2} (2π)^{ν/2})] [1 + O(n^{-1})],

where ℓ(θ; t) is the loglikelihood function given t.

Lastly, to obtain the saddlepoint approximation to the density of θ̂, transform from T to θ̂. The result is

f_θ̂(θ̂; θ) = [n^{ν/2} |Ω^{(2)}(θ̂)|^{1/2} exp{ℓ(θ; t) − ℓ(θ̂; t)} / (2π)^{ν/2}] [1 + O(n^{-1})],

where t = nΩ^{(1)}(θ̂). Accuracy can be improved by renormalization.

7.3.1 Example: MLE of Gamma Shape Parameter

As an example, consider a random sample of size n from the gamma distribution with density

f_Y(y; α) = [y^{α−1} e^{−y/β} / (β^α Γ(α))] I_{(0,∞)}(y),

where β is known. The density can be written as

f_Y(y; α) = e^{α ln(y) − α ln(β) − ln Γ(α)} (1/y) e^{−y/β} I_{(0,∞)}(y).

Accordingly,

T = Σ_{i=1}^n ln(Y_i),   ω(α) = α ln(β) + ln Γ(α),

Ω^{(1)}(α) = ln(β) + ψ(α),   and   Ω^{(2)}(α) = ψ′(α).

The mle α̂ satisfies T − n ln(β) − nψ(α̂) = 0. The saddlepoint approximation to the density of α̂ is

f_α̂(α̂; α) = [√n [ψ′(α̂)]^{1/2} e^{ℓ(α; t) − ℓ(α̂; t)} / √(2π)] [1 + O(n^{-1})],   where

t = n ln(β) + nψ(α̂),   and   ℓ(α; t) − ℓ(α̂; t) = n[(α − α̂)ψ(α̂) + ln Γ(α̂) − ln Γ(α)].

The accuracy of the approximation is illustrated below for n = 4, α = 1/2, and β = 2.

[Figure: Saddlepoint Expansion of the Density of the MLE of α. Sampling from χ²(1): α = 0.5, n = 4. The exact density of α̂ is compared with the saddlepoint and renormalized saddlepoint approximations.]
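The figure can be mimicked with a few lines of code. The Python sketch below (not part of the notes) evaluates the saddlepoint density of α̂, renormalizes it on a grid, and checks it against a Monte Carlo sample of MLEs obtained by solving ψ(α̂) = (1/n) Σ ln(y_i) − ln(β); the grid, the bracketing interval for the root finder, and the simulation size are arbitrary choices.

# Hedged sketch: saddlepoint density of the MLE of the gamma shape parameter.
import numpy as np
from scipy import integrate, optimize, special

n, alpha, beta = 4, 0.5, 2.0

def saddlepoint_pdf(a):
    a = np.asarray(a, dtype=float)
    expo = n * ((alpha - a) * special.digamma(a)
                + special.gammaln(a) - special.gammaln(alpha))
    return np.sqrt(n * special.polygamma(1, a) / (2 * np.pi)) * np.exp(expo)

grid = np.linspace(0.02, 5.0, 500)
dens = saddlepoint_pdf(grid)
dens_renorm = dens / integrate.trapezoid(dens, grid)

# Monte Carlo: alpha_hat solves digamma(a) = mean(log y) - log(beta).
rng = np.random.default_rng(4)
mles = []
for _ in range(10_000):
    y = rng.gamma(alpha, beta, size=n)
    target = np.log(y).mean() - np.log(beta)
    mles.append(optimize.brentq(lambda a: special.digamma(a) - target, 1e-8, 1e3))
mles = np.array(mles)

print("saddlepoint mass before renormalization:", integrate.trapezoid(dens, grid))
print("renormalized saddlepoint mean:", integrate.trapezoid(grid * dens_renorm, grid))
print("Monte Carlo mean of alpha_hat:", mles.mean())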

Chapter 8

GENERALIZED ESTIMATING FUNCTIONS

8.1 LEAST SQUARES

Let y_i be a random p-vector with distribution

y_i ∼_{ind} (B′x_i, Σ),

for i = 1, . . . , n, where B is a q × p matrix of unknown parameters and x_i is a q-vector of known covariates. Note that the model can be written as

Y = XB + U,

where Y = (y_1, y_2, . . . , y_n)′ and X = (x_1, x_2, . . . , x_n)′ are the matrices whose rows are y_i′ and x_i′, respectively, and Disp(U) = Σ ⊗ I_n. The usual least squares estimator chooses B̂ as the minimizer of

SSE = [vec(Y − XB)]′[vec(Y − XB)] = Σ_{i=1}^n [y_i − (I_p ⊗ x_i′)β]′[y_i − (I_p ⊗ x_i′)β],

where β = vec(B). The usual normal equations are obtained by differentiation:

∂SSE/∂β = 0   ⟹   g(β) = 0,

where

g(β) = Σ_{i=1}^n (I_p ⊗ x_i)(y_i − B′x_i).

The equation g(β) = 0 is a special case of an estimating equation. Note that g satisfies E(g) = 0. From the Gauss-Markov Theorem, we know that the solution to g(β) = 0 is BLUE.

8.2 GENERALIZED ESTIMATING FUNCTIONS IN CLASS G2

The following reference was used to prepare these notes:

Godambe, V. P. (1991). Estimating functions: An overview. In V. P. Godambe (Ed.), Estimating Functions, Oxford: Clarendon Press.

To generalize the estimating equation, assume that

y_i ∼_{ind} [h(θ, x_i), Σ_i],

where y_i is a random p-vector; θ is a q-vector of unknown parameters; and Σ_i could be a function of θ and/or x_i. The class G2 of generalized estimating equations consists of all estimating equations that can be written as

g(θ) = Σ_{i=1}^n F_i [y_i − h(θ, x_i)],

where F_i is a q × p matrix and could be a function of θ. Note that E(g) = 0. An estimator of θ is obtained as the solution to g(θ) = 0. Note that the normal equations for the least squares criterion constitute a generalized estimating equation in class G2.

8.3 OPTIMAL CHOICE OF F_i

The choice of F_i determines the properties of the estimator. One sensible criterion is to choose F_i so that the variance of θ̂ is minimized. To get an expression for Var(θ̂), expand g(θ̂) = 0 around θ̂ = θ:

g(θ̂) = 0 = g(θ) + Op(n^{1/2})

= g(θ) + [∂g(θ)/∂θ′](θ̂ − θ) + Op(1).

The required matrix of derivatives is

∂g(θ)/∂θ′ = Σ_{i=1}^n [∂F_i/∂θ′][I_q ⊗ (y_i − h(θ, x_i))] − Σ_{i=1}^n F_i ∂h(θ, x_i)/∂θ′

= Op(n^{1/2}) − Σ_{i=1}^n F_i H_i,   where   H_i = ∂h(θ, x_i)/∂θ′.

Using θ̂ − θ = Op(n^{-1/2}), the expansion simplifies to

0 = g(θ) − Σ_{i=1}^n F_i H_i (θ̂ − θ) + Op(1).

Solving for (θ̂ − θ) yields

(θ̂ − θ) = (Σ_{i=1}^n F_i H_i)^{-1} g(θ) + (Σ_{i=1}^n F_i H_i)^{-1} Op(1).

Using (Σ_{i=1}^n F_i H_i)^{-1} = O(n^{-1}) and O(n^{-1}) Op(1) = Op(n^{-1}) yields

(θ̂ − θ) = (Σ_{i=1}^n F_i H_i)^{-1} g(θ) + Op(n^{-1}).

Note that E(θ̂) = θ + O(n^{-1}). Therefore, θ̂ is asymptotically unbiased. The above expansion also implies that

Var[√n(θ̂ − θ)] = n (Σ_{i=1}^n F_i H_i)^{-1} [Σ_{i=1}^n F_i Σ_i F_i′] (Σ_{i=1}^n H_i′ F_i′)^{-1} + O(n^{-1/2}).

Also, note that Var[(θ̂ − θ)] = O(n^{-1}), so θ̂ is consistent.

Theorem 63 Var[√n(θ̂ − θ)] is approximately minimized by defining F_i as

F_i = H_i′ Σ_i^{-1}.

Proof: If F_i = H_i′ Σ_i^{-1}, then

Var[√n(θ̂ − θ)] = n (Σ_{i=1}^n H_i′ Σ_i^{-1} H_i)^{-1} + O(n^{-1/2}).

Examine

∆ = (Σ_{i=1}^n F_i H_i)^{-1} [Σ_{i=1}^n F_i Σ_i F_i′] (Σ_{i=1}^n H_i′ F_i′)^{-1} − (Σ_{i=1}^n H_i′ Σ_i^{-1} H_i)^{-1}.

The proof is completed by showing that ∆ is psd. Note that

∆ = Σ_{i=1}^n L_i Σ_i L_i′,   where

L_i = (Σ_{j=1}^n F_j H_j)^{-1} F_i − (Σ_{j=1}^n H_j′ Σ_j^{-1} H_j)^{-1} H_i′ Σ_i^{-1}.

Therefore, ∆ ≥ 0.

Note that the least squares normal equations constitute an optimal estimating equation.
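To make the optimal choice concrete, the Python sketch below (not part of the notes) solves the estimating equation Σ_i F_i[y_i − h(θ, x_i)] = 0 with F_i = H_i′Σ_i^{-1} by Fisher scoring, for a scalar exponential mean function h(θ, x_i) = exp(x_i′θ) and a known constant working variance; the mean function, the variance value, and the simulation settings are illustrative choices only.

# Hedged sketch: Fisher scoring for a G2 estimating equation with optimal F_i.
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 500, 0.25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.3, 0.7])
y = np.exp(X @ theta_true) + rng.normal(scale=np.sqrt(sigma2), size=n)

theta = np.array([np.log(y.mean()), 0.0])        # crude but adequate start
for _ in range(50):
    mu = np.exp(X @ theta)
    H = mu[:, None] * X                          # row i is H_i = dh/dtheta'
    g = H.T @ ((y - mu) / sigma2)                # sum_i H_i' Sigma_i^{-1}(y_i - h_i)
    A = H.T @ (H / sigma2)                       # sum_i H_i' Sigma_i^{-1} H_i
    step = np.linalg.solve(A, g)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimate:", theta)
print("approximate variance:", np.diag(np.linalg.inv(A)))

The printed variance corresponds to Var[√n(θ̂ − θ)]/n ≈ (Σ_i H_i′Σ_i^{-1}H_i)^{-1} from Theorem 63.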

8.4 EXAMPLES

Zeger, S. L., Liang, K., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44, 1049-1060.

8.5 OPTIMAL ESTIMATING FUNCTIONS

Using a result by Kale (1962, An extension of Cramer-Rao inequality for statistical estimation, Scandinavian Actuarial Journal, 45, 60-89), it can be shown that in the class of all estimating functions that satisfy E(g) = 0, the optimal estimating function is the score function.

8.6 ASYMPTOTIC DISTRIBUTIONS OF ESTIMATORS

8.7 REFERENCES

Rao, C. R. (1947). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44, 40-57.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426-482.