
Optimal Design of Experiments


SIAM's Classics in Applied Mathematics series consists of books that were previously allowed to go out of print. These books are republished by SIAM as a professional service because they continue to be important resources for mathematical scientists.

Editor-in-Chief

Robert E. O'Malley, Jr., University of Washington

Editorial Board

Richard A. Brualdi, University of Wisconsin-Madison
Nicholas J. Higham, University of Manchester
Leah Edelstein-Keshet, University of British Columbia
Herbert B. Keller, California Institute of Technology
Andrzej Z. Manitius, George Mason University
Hilary Ockendon, University of Oxford

Ingram Olkin, Stanford University
Peter Olver, University of Minnesota
Ferdinand Verhulst, Mathematisch Instituut, University of Utrecht

Classics in Applied Mathematics

C. C. Lin and L. A. Segel, Mathematics Applied to Deterministic Problems in the Natural Sciences
Johan G. F. Belinfante and Bernard Kolman, A Survey of Lie Groups and Lie Algebras with Applications and Computational Methods
James M. Ortega, Numerical Analysis: A Second Course
Anthony V. Fiacco and Garth P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques
F. H. Clarke, Optimization and Nonsmooth Analysis
George F. Carrier and Carl E. Pearson, Ordinary Differential Equations
Leo Breiman, Probability
R. Bellman and G. M. Wing, An Introduction to Invariant Imbedding
Abraham Berman and Robert J. Plemmons, Nonnegative Matrices in the Mathematical Sciences
Olvi L. Mangasarian, Nonlinear Programming
*Carl Friedrich Gauss, Theory of the Combination of Observations Least Subject to Errors: Part One, Part Two, Supplement. Translated by G. W. Stewart
Richard Bellman, Introduction to Matrix Analysis
U. M. Ascher, R. M. M. Mattheij, and R. D. Russell, Numerical Solution of Boundary Value Problems for Ordinary Differential Equations
K. E. Brenan, S. L. Campbell, and L. R. Petzold, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations
Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems
J. E. Dennis, Jr. and Robert B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations
Richard E. Barlow and Frank Proschan, Mathematical Theory of Reliability
Cornelius Lanczos, Linear Differential Operators
Richard Bellman, Introduction to Matrix Analysis, Second Edition
Beresford N. Parlett, The Symmetric Eigenvalue Problem
Richard Haberman, Mathematical Models: Mechanical Vibrations, Population Dynamics, and Traffic Flow

*First time in print.


Classics in Applied Mathematics (continued)

Peter W. M. John, Statistical Design and Analysis of Experiments
Tamer Basar and Geert Jan Olsder, Dynamic Noncooperative Game Theory, Second Edition
Emanuel Parzen, Stochastic Processes
Petar Kokotovic, Hassan K. Khalil, and John O'Reilly, Singular Perturbation Methods in Control: Analysis and Design
Jean Dickinson Gibbons, Ingram Olkin, and Milton Sobel, Selecting and Ordering Populations: A New Statistical Methodology
James A. Murdock, Perturbations: Theory and Methods
Ivar Ekeland and Roger Temam, Convex Analysis and Variational Problems
Ivar Stakgold, Boundary Value Problems of Mathematical Physics, Volumes I and II
J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables
David Kinderlehrer and Guido Stampacchia, An Introduction to Variational Inequalities and Their Applications
F. Natterer, The Mathematics of Computerized Tomography
Avinash C. Kak and Malcolm Slaney, Principles of Computerized Tomographic Imaging
R. Wong, Asymptotic Approximations of Integrals
O. Axelsson and V. A. Barker, Finite Element Solution of Boundary Value Problems: Theory and Computation
David R. Brillinger, Time Series: Data Analysis and Theory
Joel N. Franklin, Methods of Mathematical Economics: Linear and Nonlinear Programming, Fixed-Point Theorems
Philip Hartman, Ordinary Differential Equations, Second Edition
Michael D. Intriligator, Mathematical Optimization and Economic Theory
Philippe G. Ciarlet, The Finite Element Method for Elliptic Problems
Jane K. Cullum and Ralph A. Willoughby, Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. I: Theory
M. Vidyasagar, Nonlinear Systems Analysis, Second Edition
Robert Mattheij and Jaap Molenaar, Ordinary Differential Equations in Theory and Practice
Shanti S. Gupta and S. Panchapakesan, Multiple Decision Procedures: Theory and Methodology of Selecting and Ranking Populations
Eugene L. Allgower and Kurt Georg, Introduction to Numerical Continuation Methods
Leah Edelstein-Keshet, Mathematical Models in Biology
Heinz-Otto Kreiss and Jens Lorenz, Initial-Boundary Value Problems and the Navier-Stokes Equations
J. L. Hodges, Jr. and E. L. Lehmann, Basic Concepts of Probability and Statistics, Second Edition
George F. Carrier, Max Krook, and Carl E. Pearson, Functions of a Complex Variable: Theory and Technique
Friedrich Pukelsheim, Optimal Design of Experiments


Optimal Design of Experiments

Friedrich Pukelsheim
University of Augsburg

Augsburg, Germany

Society for Industrial and Applied Mathematics
Philadelphia


Copyright © 2006 by the Society for Industrial and Applied Mathematics

This SIAM edition is an unabridged republication of the work first published by John Wiley & Sons, Inc., New York, 1993.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Cataloging-in-Publication Data:

Pukelsheim, Friedrich, 1948-
  Optimal design of experiments / Friedrich Pukelsheim. -- Classic ed.
    p. cm. -- (Classics in applied mathematics ; 50)
  Originally published: New York : J. Wiley, 1993.
  Includes bibliographical references and index.
  ISBN 0-89871-604-7 (pbk.)
  1. Experimental design. I. Title. II. Series.

  QA279.P85 2006
  519.5'7--dc22

Partial royalties from the sale of this book are placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM, and qualified individuals are encouraged to write directly to SIAM for guidelines.

SIAM is a registered trademark.

2005056407


Contents

Preface to the Classics Edition, xvii
Preface, xix
List of Exhibits, xxi
Interdependence of Chapters, xxiv
Outline of the Book, xxv
Errata, xxix

1. Experimental Designs in Linear Models 1

1.1. Deterministic Linear Models, 1
1.2. Statistical Linear Models, 2
1.3. Classical Linear Models with Moment Assumptions, 3
1.4. Classical Linear Models with Normality Assumption, 4
1.5. Two-Way Classification Models, 4
1.6. Polynomial Fit Models, 6
1.7. Euclidean Matrix Space, 7
1.8. Nonnegative Definite Matrices, 9
1.9. Geometry of the Cone of Nonnegative Definite Matrices, 10
1.10. The Loewner Ordering of Symmetric Matrices, 11
1.11. Monotonic Matrix Functions, 12
1.12. Range and Nullspace of a Matrix, 13
1.13. Transposition and Orthogonality, 14
1.14. Square Root Decompositions of a Nonnegative Definite Matrix, 15
1.15. Distributional Support of Linear Models, 15
1.16. Generalized Matrix Inversion and Projections, 16
1.17. Range Inclusion Lemma, 17
1.18. General Linear Models, 18
1.19. The Gauss-Markov Theorem, 20
1.20. The Gauss-Markov Theorem under a Range Inclusion Condition, 21
1.21. The Gauss-Markov Theorem for the Full Mean Parameter System, 22
1.22. Projectors, Residual Projectors, and Direct Sum Decomposition, 23
1.23. Optimal Estimators in Classical Linear Models, 24
1.24. Experimental Designs and Moment Matrices, 25
1.25. Model Matrix versus Design Matrix, 27
1.26. Geometry of the Set of All Moment Matrices, 29
1.27. Designs for Two-Way Classification Models, 30
1.28. Designs for Polynomial Fit Models, 32

Exercises, 33

2. Optimal Designs for Scalar Parameter Systems 35

2.1. Parameter Systems of Interest and Nuisance Parameters, 35
2.2. Estimability of a One-Dimensional Subsystem, 36
2.3. Range Summation Lemma, 37
2.4. Feasibility Cones, 37
2.5. The Ice-Cream Cone, 38
2.6. Optimal Estimators under a Given Design, 41
2.7. The Design Problem for Scalar Parameter Subsystems, 41
2.8. Dimensionality of the Regression Range, 42
2.9. Elfving Sets, 43
2.10. Cylinders that Include the Elfving Set, 44
2.11. Mutual Boundedness Theorem for Scalar Optimality, 45
2.12. The Elfving Norm, 47
2.13. Supporting Hyperplanes to the Elfving Set, 49
2.14. The Elfving Theorem, 50
2.15. Projectors for Given Subspaces, 52
2.16. Equivalence Theorem for Scalar Optimality, 52
2.17. Bounds for the Optimal Variance, 54
2.18. Eigenvectors of Optimal Moment Matrices, 56
2.19. Optimal Coefficient Vectors for Given Moment Matrices, 56
2.20. Line Fit Model, 57
2.21. Parabola Fit Model, 58
2.22. Trigonometric Fit Models, 58
2.23. Convexity of the Optimality Criterion, 59

Exercises, 59

3. Information Matrices 61

3.1. Subsystems of Interest of the Mean Parameters, 61
3.2. Information Matrices for Full Rank Subsystems, 62
3.3. Feasibility Cones, 63
3.4. Estimability, 64
3.5. Gauss-Markov Estimators and Predictors, 65
3.6. Testability, 67
3.7. F-Test of a Linear Hypothesis, 67
3.8. ANOVA, 71
3.9. Identifiability, 72
3.10. Fisher Information, 72
3.11. Component Subsets, 73
3.12. Schur Complements, 75
3.13. Basic Properties of the Information Matrix Mapping, 76
3.14. Range Disjointness Lemma, 79
3.15. Rank of Information Matrices, 81
3.16. Discontinuity of the Information Matrix Mapping, 82
3.17. Joint Solvability of Two Matrix Equations, 85
3.18. Iterated Parameter Subsystems, 85
3.19. Iterated Information Matrices, 86
3.20. Rank Deficient Subsystems, 87
3.21. Generalized Information Matrices for Rank Deficient Subsystems, 88
3.22. Generalized Inverses of Generalized Information Matrices, 90
3.23. Equivalence of Information Ordering and Dispersion Ordering, 91
3.24. Properties of Generalized Information Matrices, 92
3.25. Contrast Information Matrices in Two-Way Classification Models, 93

Exercises, 96

4. Loewner Optimality 98

4.1. Sets of Competing Moment Matrices, 98
4.2. Moment Matrices with Maximum Range and Rank, 99
4.3. Maximum Range in Two-Way Classification Models, 99
4.4. Loewner Optimality, 101
4.5. Dispersion Optimality and Simultaneous Scalar Optimality, 102
4.6. General Equivalence Theorem for Loewner Optimality, 103
4.7. Nonexistence of Loewner Optimal Designs, 104
4.8. Loewner Optimality in Two-Way Classification Models, 105
4.9. The Penumbra of the Set of Competing Moment Matrices, 107
4.10. Geometry of the Penumbra, 108
4.11. Existence Theorem for Scalar Optimality, 109
4.12. Supporting Hyperplanes to the Penumbra, 110
4.13. General Equivalence Theorem for Scalar Optimality, 111

Exercises, 113

5. Real Optimality Criteria 114

5.1. Positive Homogeneity, 114
5.2. Superadditivity and Concavity, 115
5.3. Strict Superadditivity and Strict Concavity, 116
5.4. Nonnegativity and Monotonicity, 117
5.5. Positivity and Strict Monotonicity, 118
5.6. Real Upper Semicontinuity, 118
5.7. Semicontinuity and Regularization, 119
5.8. Information Functions, 119
5.9. Unit Level Sets, 120
5.10. Function-Set Correspondence, 122
5.11. Functional Operations, 124
5.12. Polar Information Functions and Polar Norms, 125
5.13. Polarity Theorem, 127
5.14. Compositions with the Information Matrix Mapping, 129
5.15. The General Design Problem, 131
5.16. Feasibility of Formally Optimal Moment Matrices, 132
5.17. Scalar Optimality, Revisited, 133

Exercises, 134

6. Matrix Means 135

6.1. Classical Optimality Criteria, 135
6.2. D-Criterion, 136
6.3. A-Criterion, 137
6.4. E-Criterion, 137
6.5. T-Criterion, 138
6.6. Vector Means, 139
6.7. Matrix Means, 140
6.8. Diagonality of Symmetric Matrices, 142
6.9. Vector Majorization, 144
6.10. Inequalities for Vector Majorization, 146
6.11. The Hölder Inequality, 147
6.12. Polar Matrix Means, 149
6.13. Matrix Means as Information Functions and Norms, 151
6.14. The General Design Problem with Matrix Means, 152
6.15. Orthogonality of Two Nonnegative Definite Matrices, 153
6.16. Polarity Equation, 154
6.17. Maximization of Information versus Minimization of Variance, 155

Exercises, 156

7. The General Equivalence Theorem 158

7.1. Subgradients and Subdifferentials, 158
7.2. Normal Vectors to a Convex Set, 159
7.3. Full Rank Reduction, 160
7.4. Subgradient Theorem, 162
7.5. Subgradients of Isotonic Functions, 163
7.6. A Chain Rule Motivation, 164
7.7. Decomposition of Subgradients, 165
7.8. Decomposition of Subdifferentials, 167
7.9. Subgradients of Information Functions, 168
7.10. Review of the General Design Problem, 170
7.11. Mutual Boundedness Theorem for Information Functions, 171
7.12. Duality Theorem, 172
7.13. Existence Theorem for Optimal Moment Matrices, 174
7.14. The General Equivalence Theorem, 175
7.15. General Equivalence Theorem for the Full Parameter Vector, 176
7.16. Equivalence Theorem, 176
7.17. Equivalence Theorem for the Full Parameter Vector, 177
7.18. Merits and Demerits of Equivalence Theorems, 177
7.19. General Equivalence Theorem for Matrix Means, 178
7.20. Equivalence Theorem for Matrix Means, 180
7.21. General Equivalence Theorem for E-Optimality, 180
7.22. Equivalence Theorem for E-Optimality, 181
7.23. E-Optimality, Scalar Optimality, and Eigenvalue Simplicity, 183
7.24. E-Optimality, Scalar Optimality, and Elfving Norm, 183

Exercises, 185


8. Optimal Moment Matrices and Optimal Designs 187

8.1. From Moment Matrices to Designs, 187
8.2. Bound for the Support Size of Feasible Designs, 188
8.3. Bound for the Support Size of Optimal Designs, 190
8.4. Matrix Convexity of Outer Products, 190
8.5. Location of the Support Points of Arbitrary Designs, 191
8.6. Optimal Designs for a Linear Fit over the Unit Square, 192
8.7. Optimal Weights on Linearly Independent Regression Vectors, 195
8.8. A-Optimal Weights on Linearly Independent Regression Vectors, 197
8.9. C-Optimal Weights on Linearly Independent Regression Vectors, 197
8.10. Nonnegative Definiteness of Hadamard Products, 199
8.11. Optimal Weights on Given Support Points, 199
8.12. Bound for Determinant Optimal Weights, 201
8.13. Multiplicity of Optimal Moment Matrices, 201
8.14. Multiplicity of Optimal Moment Matrices under Matrix Means, 202
8.15. Simultaneous Optimality under Matrix Means, 203
8.16. Matrix Mean Optimality for Component Subsets, 203
8.17. Moore-Penrose Matrix Inversion, 204
8.18. Matrix Mean Optimality for Rank Deficient Subsystems, 205
8.19. Matrix Mean Optimality in Two-Way Classification Models, 206

Exercises, 209

9. D-, A-, E-, T-Optimality 210

9.1. D-, A-, E-, T-Optimality, 210
9.2. G-Criterion, 210
9.3. Bound for Global Optimality, 211
9.4. The Kiefer-Wolfowitz Theorem, 212
9.5. D-Optimal Designs for Polynomial Fit Models, 213
9.6. Arcsin Support Designs, 217
9.7. Equivalence Theorem for A-Optimality, 221
9.8. L-Criterion, 222
9.9. A-Optimal Designs for Polynomial Fit Models, 223
9.10. Chebyshev Polynomials, 226
9.11. Lagrange Polynomials with Arcsin Support Nodes, 227
9.12. Scalar Optimality in Polynomial Fit Models, I, 229
9.13. E-Optimal Designs for Polynomial Fit Models, 232
9.14. Scalar Optimality in Polynomial Fit Models, II, 237
9.15. Equivalence Theorem for T-Optimality, 240
9.16. Optimal Designs for Trigonometric Fit Models, 241
9.17. Optimal Designs under Variation of the Model, 243

Exercises, 245

10. Admissibility of Moment and Information Matrices 247

10.1. Admissible Moment Matrices, 247
10.2. Support Based Admissibility, 248
10.3. Admissibility and Completeness, 248
10.4. Positive Polynomials as Quadratic Forms, 249
10.5. Loewner Comparison in Polynomial Fit Models, 251
10.6. Geometry of the Moment Set, 252
10.7. Admissible Designs in Polynomial Fit Models, 253
10.8. Strict Monotonicity, Unique Optimality, and Admissibility, 256
10.9. E-Optimality and Admissibility, 257
10.10. T-Optimality and Admissibility, 258
10.11. Matrix Mean Optimality and Admissibility, 260
10.12. Admissible Information Matrices, 262
10.13. Loewner Comparison of Special C-Matrices, 262
10.14. Admissibility of Special C-Matrices, 264
10.15. Admissibility, Minimaxity, and Bayes Designs, 265

Exercises, 266

11. Bayes Designs and Discrimination Designs 268

11.1. Bayes Linear Models with Moment Assumptions, 268
11.2. Bayes Estimators, 270
11.3. Bayes Linear Models with Normal-Gamma Prior Distributions, 272
11.4. Normal-Gamma Posterior Distributions, 273
11.5. The Bayes Design Problem, 275
11.6. General Equivalence Theorem for Bayes Designs, 276
11.7. Designs with Protected Runs, 277
11.8. General Equivalence Theorem for Designs with Bounded Weights, 278
11.9. Second-Degree versus Third-Degree Polynomial Fit Models, I, 280
11.10. Mixtures of Models, 283
11.11. Mixtures of Information Functions, 285
11.12. General Equivalence Theorem for Mixtures of Models, 286
11.13. Mixtures of Models Based on Vector Means, 288
11.14. Mixtures of Criteria, 289
11.15. General Equivalence Theorem for Mixtures of Criteria, 290
11.16. Mixtures of Criteria Based on Vector Means, 290
11.17. Weightings and Scalings, 292
11.18. Second-Degree versus Third-Degree Polynomial Fit Models, II, 293
11.19. Designs with Guaranteed Efficiencies, 296
11.20. General Equivalence Theorem for Guaranteed Efficiency Designs, 297
11.21. Model Discrimination, 298
11.22. Second-Degree versus Third-Degree Polynomial Fit Models, III, 299

Exercises, 302

12. Efficient Designs for Finite Sample Sizes 304

12.1. Designs for Finite Sample Sizes, 304
12.2. Sample Size Monotonicity, 305
12.3. Multiplier Methods of Apportionment, 307
12.4. Efficient Rounding Procedure, 307
12.5. Efficient Design Apportionment, 308
12.6. Pairwise Efficiency Bound, 310
12.7. Optimal Efficiency Bound, 311
12.8. Uniform Efficiency Bounds, 312
12.9. Asymptotic Order O(n⁻¹), 314
12.10. Asymptotic Order O(n⁻²), 315
12.11. Subgradient Efficiency Bounds, 317
12.12. Apportionment of D-Optimal Designs in Polynomial Fit Models, 320
12.13. Minimal Support and Finite Sample Size Optimality, 322
12.14. A Sufficient Condition for Completeness, 324
12.15. A Sufficient Condition for Finite Sample Size D-Optimality, 325
12.16. Finite Sample Size D-Optimal Designs in Polynomial Fit Models, 328

Exercises, 329


13. Invariant Design Problems 331

13.1. Design Problems with Symmetry, 331
13.2. Invariance of the Experimental Domain, 335
13.3. Induced Matrix Group on the Regression Range, 336
13.4. Congruence Transformations of Moment Matrices, 337
13.5. Congruence Transformations of Information Matrices, 338
13.6. Invariant Design Problems, 342
13.7. Invariance of Matrix Means, 343
13.8. Invariance of the D-Criterion, 344
13.9. Invariant Symmetric Matrices, 345
13.10. Subspaces of Invariant Symmetric Matrices, 346
13.11. The Balancing Operator, 348
13.12. Simultaneous Matrix Improvement, 349

Exercises, 350

14. Kiefer Optimality 352

14.1. Matrix Majorization, 352
14.2. The Kiefer Ordering of Symmetric Matrices, 354
14.3. Monotonic Matrix Functions, 357
14.4. Kiefer Optimality, 357
14.5. Heritability of Invariance, 358
14.6. Kiefer Optimality and Invariant Loewner Optimality, 360
14.7. Optimality under Invariant Information Functions, 361
14.8. Kiefer Optimality in Two-Way Classification Models, 362
14.9. Balanced Incomplete Block Designs, 366
14.10. Optimal Designs for a Linear Fit over the Unit Cube, 372

Exercises, 379

15. Rotatability and Response Surface Designs 381

15.1. Response Surface Methodology, 381
15.2. Response Surfaces, 382
15.3. Information Surfaces and Moment Matrices, 383
15.4. Rotatable Information Surfaces and Invariant Moment Matrices, 384
15.5. Rotatability in Multiway Polynomial Fit Models, 384
15.6. Rotatability Determining Classes of Transformations, 385
15.7. First-Degree Rotatability, 386
15.8. Rotatable First-Degree Symmetric Matrices, 387
15.9. Rotatable First-Degree Moment Matrices, 388
15.10. Kiefer Optimal First-Degree Moment Matrices, 389
15.11. Two-Level Factorial Designs, 390
15.12. Regular Simplex Designs, 391
15.13. Kronecker Products and Vectorization Operator, 392
15.14. Second-Degree Rotatability, 394
15.15. Rotatable Second-Degree Symmetric Matrices, 396
15.16. Rotatable Second-Degree Moment Matrices, 398
15.17. Rotatable Second-Degree Information Surfaces, 400
15.18. Central Composite Designs, 402
15.19. Second-Degree Complete Classes of Designs, 403
15.20. Measures of Rotatability, 405
15.21. Empirical Model-Building, 406

Exercises, 406

Comments and References 408

1. Experimental Designs in Linear Models, 408
2. Optimal Designs for Scalar Parameter Systems, 410
3. Information Matrices, 410
4. Loewner Optimality, 412
5. Real Optimality Criteria, 412
6. Matrix Means, 413
7. The General Equivalence Theorem, 414
8. Optimal Moment Matrices and Optimal Designs, 417
9. D-, A-, E-, T-Optimality, 418
10. Admissibility of Moment and Information Matrices, 421
11. Bayes Designs and Discrimination Designs, 422
12. Efficient Designs for Finite Sample Sizes, 424
13. Invariant Design Problems, 425
14. Kiefer Optimality, 426
15. Rotatability and Response Surface Designs, 426

Biographies 428

1. Charles Loewner 1893-1968, 428
2. Gustav Elfving 1908-1984, 430
3. Jack Kiefer 1924-1981, 430

Bibliography 432

Index 448


Preface to the Classics Edition

Research into the optimality theory of the design of statistical experiments originated around 1960. The first papers concentrated on one specific optimality criterion or another. Before long, when interrelations between these criteria were observed, the need for a unified approach emerged. Invoking tools from convex optimization theory, the optimal design problem is indeed amenable to a fairly complete solution. This is the topic of Optimal Design of Experiments, and over the years the material developed here has proved comprehensive, useful, and stable. It is a pleasure to see the book reprinted in the SIAM Classics in Applied Mathematics series.

Ever since the inception of optimal design theory, the determinant of the moment matrix of a design was recognized as a very specific criterion function. In fact, determinant optimality in polynomial fit models permits an analysis other than the one presented here, based on canonical moments and classical polynomials. This alternate part of the theory is developed by H. DETTE and W. J. STUDDEN in their monograph The Theory of Canonical Moments with Applications in Statistics, Probability, and Analysis, and the references listed there complement and update the bibliography given here.

Since the book's initial publication in 1993, its results have been put to good use in deriving optimal designs on the circle, optimal mixture designs, or optimal designs in other linear statistical models. However, many practical design problems of applied statistics are inherently nonlinear. Even then, local linearization may open the way to apply the present results, thus aiding in identifying good, practical designs.

FRIEDRICH PUKELSHEIM

Augsburg, Germany

October 2005


Preface

... dans ce meilleur des [modèles] possibles ... tout est au mieux.
Candide (1759), Chapitre I, VOLTAIRE

The working title of the book was a bit long, Optimality Theory of Experimental Designs in Linear Models, but focused on two pertinent points. The setting is the linear model, the simplest statistical model, where the results are strongest. The topic is design optimality, de-emphasizing the issue of design construction. A more detailed Outline of the Book follows the Contents.

The design literature is full of fancy nomenclature. In order to circumvent expert jargon I mainly speak of a design being φ-optimal for K'θ in T, that is, being optimal under an information function φ, for a parameter system of interest K'θ, in a class T of competing designs. The only genuinely new notions that I introduce are Loewner optimality (because it refers to the Loewner matrix ordering) and Kiefer optimality (because it pays due homage to the man who was a prime contributor to the topic).

The design problems originate from statistics, but are solved using special tools from linear algebra and convex analysis, such as the information matrix mapping of Chapter 3 and the information functions of Chapter 5. I have refrained from relegating these tools into a set of appendices, at the expense of some slowing of the development in the first half of the book. Instead, the auxiliary material is developed as needed, and it is hoped that the exposition conveys some of the fascination that grows out of merging three otherwise distinct mathematical disciplines.

The result is a unified optimality theory that embraces an amazingly wide variety of design problems. My aim is not encyclopedic coverage, but rather to outline typical settings such as D-, A-, and E-optimal polynomial regression designs, Bayes designs, designs for model discrimination, balanced incomplete block designs, or rotatable response surface designs. Pulling together formerly separate entities to build a greater community will always face opponents who fear an assault on their way of thinking. On the contrary, my intention is constructive, to generate a frame for those design problems that share


a common goal. The goal of investigating optimal, theoretical designs is to provide a gauge for identifying efficient, practical designs.

Il meglio è l'inimico del bene.
Dictionnaire Philosophique (1770), Art Dramatique, VOLTAIRE

ACKNOWLEDGMENTS

The writing of this book became a pleasure when I began experiencing encouragement from so many friends and colleagues, ranging from good advice of how to survive a book project, to the tedious work of weeding out wrong theorems. Above all I would like to thank my Augsburg colleague Norbert Gaffke who, with his vast knowledge of the subject, helped me several times to overcome paralyzing deadlocks. The material of the book called for a number of research projects which I could only resolve by relying on the competence and energy of my co-authors. It is a privilege to have cooperated with Norman Draper, Sabine Rieder, Jim Rosenberger, Bill Studden, and Ben Torsney, whose joint efforts helped shape Chapters 15, 12, 11, 9, 8, respectively.

Over the years, the manuscript has undergone continuous mutations, as a reaction to the suggestions of those who endured the reading of the early drafts. For their constructive criticism I am grateful to Ching-Shui Cheng, Holger Dette, Berthold Heiligers, Harold Henderson, Olaf Krafft, Rudolf Mathar, Wolfgang Näther, Ingram Olkin, Andrej Pázman, Norbert Schmitz, Shayle Searle, and George Styan. The additional chores of locating typos, detecting doubly used notation, and searching for missing definitions were undertaken by Markus Abt, Wolfgang Bischoff, Kenneth Nordström, Ingolf Terveer, and the students of various classes I taught from the manuscript. Their labor turned a manuscript that initially was everywhere dense in error into one which I hope is finally everywhere dense in content.

Adalbert Wilhelm carried out most of the calculations for the numerical examples; Inge Dötsch so cheerfully kept retyping what seemed in final form. Ingo Eichenseher and Gerhard Wilhelms contributed the public domain PostScript driver dvilw to produce the exhibits. Sol Feferman, Timo Mäkeläinen, and Dooley Kiefer kindly provided the photographs of Loewner, Elfving, and Kiefer in the Biographies. To each I owe a debt of gratitude.

Finally I wish to acknowledge the support of the Volkswagen-Stiftung, Hannover, for supporting sabbatical leaves with the Departments of Statistics at Stanford University (1987) and at Penn State University (1990), and granting an Akademie-Stipendium to help finish the project.

FRIEDRICH PUKELSHEIM

Augsburg, Germany
December 1992


List of Exhibits

1.1 The statistical linear model, 3
1.2 Convex cones in the plane R², 11
1.3 Orthogonal decompositions induced by a linear mapping, 14
1.4 Orthogonal and oblique projections, 24
1.5 An experimental design worksheet, 28
1.6 A worksheet with run order randomized, 28
1.7 Experimental domain designs, and regression range designs, 32

2.1 The ice-cream cone, 38
2.2 Two Elfving sets, 43
2.3 Cylinders, 45
2.4 Supporting hyperplanes to the Elfving set, 50
2.5 Euclidean balls inscribed in and circumscribing the Elfving set, 55

3.1 ANOVA decomposition, 71
3.2 Regularization of the information matrix mapping, 81
3.3 Discontinuity of the information matrix mapping, 84

4.1 Penumbra, 108

5.1 Unit level sets, 121

6.1 Conjugate numbers, p + q = pq, 148

7.1 Subgradients, 159
7.2 Normal vectors to a convex set, 160
7.3 A hierarchy of equivalence theorems, 178


8.1 Support points for a linear fit over the unit square, 194

9.1 The Legendre polynomials up to degree 10, 214
9.2 Polynomial fits over [-1;1]: optimal designs for θ in T, 218
9.3 Polynomial fits over [-1;1]: optimal designs for θ, 219
9.4 Histogram representation of the design, 220
9.5 Fifth-degree arcsin support, 220
9.6 Polynomial fits over [-1;1]: optimal designs for θ in T, 224
9.7 Polynomial fits over [-1;1]: optimal designs for θ, 225
9.8 The Chebyshev polynomials up to degree 10, 226
9.9 Lagrange polynomials up to degree 4, 228
9.10 E-optimal moment matrices, 233
9.11 Polynomial fits over [-1;1]: optimal designs for θ in T, 236
9.12 Arcsin support efficiencies for individual parameters, 240

10.1 Cuts of a convex set, 254
10.2 Line projections and admissibility, 259
10.3 Cylinders and admissibility, 261

11.1 Discrimination between a second- and a third-degree model, 301

12.1 Quota method under growing sample size, 306
12.2 Efficient design apportionment, 310
12.3 Asymptotic order of the E-efficiency loss, 317
12.4 Asymptotic order of the D-efficiency loss, 322
12.5 Nonoptimality of the efficient design apportionment, 323
12.6 Optimality of the efficient design apportionment, 329

13.1 Eigenvalues of moment matrices of symmetric three-point designs, 334

14.1 The Kiefer ordering, 355
14.2 Some 3 × 6 block designs for 12 observations, 370


14.3 Uniform vertex designs, 373
14.4 Admissible eigenvalues, 375

15.1 Eigenvalues of moment matrices of central composite designs, 405


Interdependence of Chapters

1 Experimental Designs in Linear Models
2 Optimal Designs for Scalar Parameter Systems
3 Information Matrices
4 Loewner Optimality
5 Real Optimality Criteria
6 Matrix Means
7 The General Equivalence Theorem
8 Optimal Moment Matrices and Optimal Designs
9 D-, A-, E-, T-Optimality
10 Admissibility of Moment and Information Matrices
11 Bayes Designs and Discrimination Designs
12 Efficient Designs for Finite Sample Sizes
13 Invariant Design Problems
14 Kiefer Optimality
15 Rotatability and Response Surface Designs


Outline of the Book

CHAPTERS 1, 2, 3, 4: LINEAR MODELS AND INFORMATION MATRICES

Chapters 1 and 3 are basic. Chapter 1 centers around the Gauss-Markov Theorem, not only because it justifies the introduction of designs and their moment matrices in Section 1.24. Equally important, it permits us to define in Section 3.2 the information matrix for a parameter system of interest K'θ in a way that best supports the general theory. The definition is extended to rank deficient coefficient matrices K in Section 3.21. Because of the dual purpose the Gauss-Markov Theorem is formulated as a general result of matrix algebra. First results on optimal designs are presented in Chapter 2, for parameter subsystems that are one-dimensional, and in Chapter 4, in the case where optimality can be achieved relative to the Loewner ordering among information matrices. (This is rare, see Section 4.7.) These results also follow from the General Equivalence Theorem in Chapter 7, whence Chapters 2 and 4 are not needed for their technical details.

CHAPTERS 5, 6: INFORMATION FUNCTIONS

Chapters 5 and 6 are reference chapters, developing the concavity properties of prospective optimality criteria. In Section 5.8, we introduce information functions which by definition are required to be positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous. Information functions submit themselves to pleasing functional operations (Section 5.11), of which polarity (Section 5.12) is crucial for the sequel. The most important class of information functions are the matrix means φ_p with parameter p ∈ [-∞; 1]. They are the topic of Chapter 6, starting from the classical D-, A-, E-criterion as the special cases p = 0, p = -1, p = -∞, respectively.
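For concreteness, here is a minimal numerical sketch (assuming NumPy; the function name and example matrix are illustrative, not taken from the book) of how the matrix means act on a positive definite matrix C as power means of its eigenvalues, with the D-, A-, and E-criteria appearing at p = 0, p = -1, and p = -∞.

```python
# Illustrative sketch: matrix means as power means of eigenvalues.
import numpy as np

def matrix_mean(C, p):
    """p-th power mean of the eigenvalues of a symmetric positive definite C."""
    lam = np.linalg.eigvalsh(C)
    if p == 0:                       # D-criterion: geometric mean of eigenvalues
        return float(np.exp(np.mean(np.log(lam))))
    if p == -np.inf:                 # E-criterion: smallest eigenvalue
        return float(lam.min())
    return float(np.mean(lam ** p) ** (1.0 / p))   # A-criterion at p = -1

C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(matrix_mean(C, 0), matrix_mean(C, -1), matrix_mean(C, -np.inf))
```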


CHAPTERS 7, 8, 12: OPTIMAL APPROXIMATE DESIGNS AND EFFICIENT DISCRETE DESIGNS

The General Equivalence Theorem 7.14 is the key result of optimal design theory, offering necessary and sufficient conditions for a design's moment matrix M to be φ-optimal for K'θ in 𝓜. The generic result of this type is due to Kiefer and Wolfowitz (1960), concerning D-optimality for θ in the set of all moment matrices. The present theorem is more general in three respects: in allowing the competing moment matrices to form a set 𝓜 which is compact and convex, rather than restricting attention to the largest possible set of all moment matrices; in admitting parameter subsystems K'θ rather than concentrating on the full parameter vector θ; and in permitting as optimality criterion any information function φ, rather than restricting attention to the classical D-criterion. Specifying these quantities gives rise to a number of corollaries which are discussed in the second half of Chapter 7. The first half is a self-contained exposition of arguments which lead to a proof of the General Equivalence Theorem, based on subgradients and normal vectors to a convex set. Duality theory of convex analysis might be another starting point; here we obtain a duality theorem as an intermediate step, as Theorem 7.12. Yet another approach would be based on directional derivatives; however, their calculus is quite involved when it comes to handling a composition φ ∘ C_K like the one underlying the optimal design problem.

Chapter 8 deals with the practical consequences which the General Equivalence Theorem implies about the support points x_i and the weights w_i of an optimal design ξ. The theory permits a weight w_i to be any real number between 0 and 1, prescribing the proportion of observations to be drawn under x_i. In contrast, a design for sample size n replaces w_i by an integer n_i, as the replication number for x_i. In Chapter 12 we propose the efficient design apportionment as a systematic and easy way to pass from w_i to n_i. This discretization procedure is the most efficient one, in the sense of Theorem 12.7. For growing sample size n, the efficiency loss relative to the optimal design stays bounded, of asymptotic order n⁻¹; in the case of differentiability, the order improves to n⁻².
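The sketch below shows one way such a rounding can be organized (the initialization and adjustment rules are paraphrased assumptions, not a quotation of the book's procedure): start from n_i = ⌈(n - ℓ/2) w_i⌉ for ℓ support points and then add or remove single replications until the counts sum to n.

```python
# Sketch of an efficient-rounding style apportionment of weights into replications.
import math

def apportion(weights, n):
    l = len(weights)
    counts = [math.ceil((n - l / 2.0) * w) for w in weights]
    while sum(counts) < n:   # add a run where the count is smallest relative to its weight
        j = min(range(l), key=lambda i: counts[i] / weights[i])
        counts[j] += 1
    while sum(counts) > n:   # remove a run where the count is largest relative to its weight
        j = max(range(l), key=lambda i: (counts[i] - 1) / weights[i])
        counts[j] -= 1
    return counts

print(apportion([0.5, 0.25, 0.25], 7))   # -> [3, 2, 2]
```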

CHAPTERS 9, 10, 11: INSTANCES OF DESIGN OPTIMALITY

D-, A-, and E-optimal polynomial regression designs over the interval [-1; 1] are characterized and exhibited in Chapter 9. Chapter 10 discusses admissibility of the moment matrix of a polynomial regression design, and of the contrast information matrix of a block design in a two-way classification model. Prominent as these examples may be, it is up to Chapter 11 to exploit the power of the General Equivalence Theorem to its fullest. Various sets of competing moment matrices are considered, such as those for Bayes designs, for designs with bounded weights, for mixture model designs, for mixture criteria designs, and for designs with guaranteed efficiencies. And they are evaluated using an information function that is a composition of a set of m information functions together with an information function on the nonnegative orthant R^m.

CHAPTERS 13, 14, 15: OPTIMAL INVARIANT DESIGNS

As with other statistical problems, invariance considerations can be of great help in reducing the dimensionality and complexity of the general design problem, at the expense of handling some additional theoretical concepts. The foundations are laid in Chapter 13, investigating various groups and their actions as they pertain to an experimental domain design τ, a regression range design ξ, a moment matrix M(ξ), an information matrix C_K(M), or an information function φ(C). The idea of "increased symmetry" or "greater balancedness" is captured by the matrix majorization ordering of Section 14.1. This concept is brought together with the Loewner matrix ordering to create the Kiefer ordering of Section 14.2: An information matrix C is at least as good as another matrix D when, relative to the Loewner ordering, C is above some intermediate matrix which is majorized by D. The concept is due to Kiefer (1975) who introduced it in a block design setting and called it universal optimality. We demonstrate its usefulness with balanced incomplete block designs (Section 14.9), optimal designs for a linear fit over the unit cube (Section 14.10), and rotatable designs for response surface methodology (Chapter 15).

The final Comments and References include historical remarks and mention the relevant literature. I do not claim to have traced every detail to its first contributor and I must admit that the book makes no mention of many other important design topics, such as numerical algorithms, orthogonal arrays, mixture designs, polynomial regression designs on the cube, sequential and adaptive designs, designs for nonlinear models, robust designs, etc.


Errata

Page ±Line Text Correction

313291156157169169203217222

241

270330347357361378390


+ 12Exh. 1.7-11-2+11+13-7-12_7

-8

_2

+4+3-7+ 15+ 11+9+13,-3

Section 1.26lower right: interchangeB~B, BB~X = \C\iE >0,GKCDCK'G' :s x k

Section 1.251/2 and 1/6B~ BK, B k B ~\X\ = CjE >0,GKCDCK'G' + F :s x (s — k)ds

i [in denominator]r [in numerator]

Exhibit 9.4KNND(s)

Exhibit 9.2Ks

NND(k)

a(jk)Il

m

b(jk)lI1+m


OPTIMAL DESIGN OF EXPERIMENTS


CHAPTER 1

Experimental Designs in Linear Models

This chapter provides an introduction to experimental designs for linear models. Two linear models are presented. The first is classical, having a dispersion structure in which the dispersion matrix is proportional to the identity matrix. The second model is more general, with a dispersion structure that does not impose any rank or range assumptions. The Gauss-Markov Theorem is formulated to cover the general model. The classical model provides the setting to introduce experimental designs and their moment matrices. Matrix algebra is reviewed as needed, with particular emphasis on nonnegative definite matrices, projectors, and generalized inverses. The theory is illustrated with two-way classification models, and models for a line fit, parabola fit, and polynomial fit.

1.1. DETERMINISTIC LINEAR MODELS

Many practical and theoretical problems in science treat relationships of the type

y = g(t, θ),

where the observed response or yield, y, is thought of as a particular value of a real-valued model function or response function, g, evaluated at the pair of arguments (t, θ). This decomposition reflects the distinctive role of the two arguments: The experimental conditions t can be freely chosen by the experimenter from a given experimental domain T, prior to running the experiment. The parameter system θ is assumed to lie in a parameter domain Θ, and is not known to the experimenter. This is paraphrased by saying that the experimenter controls t, whereas "nature" determines θ.

The choice of the function g is central to the model-building process. One

of the simplest relationships is the deterministic linear model

y = f(t)'θ,

where f(t) = (f_1(t), ..., f_k(t))' and θ = (θ_1, ..., θ_k)' are vectors in k-dimensional Euclidean space R^k. All vectors are taken to be column vectors; a prime indicates transposition. Hence f(t)'θ is the usual Euclidean scalar product, f(t)'θ = Σ_{j≤k} f_j(t)θ_j. Linearity pertains to the parameter system θ, not to the experimental conditions t.

Linearity shifts the emphasis from the model function g to the regression function f. Assuming that the experimenter knows both the regression function f and the experimental conditions t, a compact notation results upon introducing the k × 1 regression vector x = f(t), and the regression range X = {f(t) : t ∈ T} ⊆ R^k. From an applied point of view the experimental domain T plays a more primary role than the regression range X, but the latter is expedient for a consistent development. The linear model, in its deterministic form discussed so far, thus takes the simple form y = x'θ.

1.2. STATISTICAL LINEAR MODELS

In many experiments the response can be observed only up to an additive random error e, distorting the model to become

y = x'θ + e.

Because of random error, repeated experimental runs typically lead to different observed responses, even if the regression vector x and the parameter system θ remain identical. Therefore any evaluation of the experiment can involve a statement on the distribution of the response only, rather than on any one of its specific values. A (statistical) linear model thus treats response and error as real-valued random variables Y and E, governed by a probability distribution P and satisfying the relationship

Y = x'θ + E.

In this model, the term e may subsume quite diverse sources of error, ranging from random errors resulting from inaccuracies in the measuring devices, to systematic errors that are due to inappropriateness of a model function.

A schematic arrangement of these quantities is presented in Exhibit 1.1.

EXHIBIT 1.1 The statistical linear model. The response Y decomposes into the deterministic mean effect x'θ plus the random error E.

1.3. CLASSICAL LINEAR MODELS WITH MOMENT ASSUMPTIONS

To proceed, we need to be more specific about the underlying distributional assumptions. For point estimation, the distributional assumptions solely pertain to expectation and variance relative to the underlying distribution P,

E_P[Y] = x'θ,    V_P[Y] = σ².

For this reason θ is called the mean parameter vector, while the model variance σ² > 0 provides an indication of the variability inherent in the observation Y. Another way of expressing this is to say that the random error E has mean value zero and variance σ², neither of which depends on the regression vector x nor on the parameter vector θ of the mean response.

The k × 1 parameter vector θ and the scalar parameter σ² comprise a total of k + 1 unknown parameter components. Clearly, for any reasonable inference, the number n of observations must be at least equal to k + 1. We consider a set of n observations,

Y_i = x_i'θ + E_i,    i = 1, ..., n,

with possibly different regression vectors x_i in experimental run i. The joint distribution of the n responses Y_i is specified by assuming that they are uncorrelated.

Considerable simplicity is gained by using vector notation. Let

Y = (Y_1, ..., Y_n)',    X = (x_1, ..., x_n)',    E = (E_1, ..., E_n)'

denote the n × 1 response vector Y, the n × k model matrix X, and the n × 1 error vector E, respectively. (Henceforth the random quantities Y and E are n × 1 vectors rather than scalars!) The (i, j)th entry x_ij of the matrix X is the same as the jth component of the regression vector x_i, that is, the regression vector x_i appears as the ith row of the model matrix X. The model equation thus becomes

Y = Xθ + E.

With I_n as the n × n identity matrix, the model is succinctly represented by the expectation vector and dispersion matrix of Y,

E_P[Y] = Xθ,    D_P[Y] = σ²I_n,

and is termed the classical linear model with moment assumptions. In other words, the mean vector E_P[Y] is given by the linear relationship Xθ between the regression vectors x_1, ..., x_n and the parameter vector θ, while the dispersion matrix D_P[Y] is in its classical, that is, simplest, form of being proportional to the identity matrix.
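As a small numerical aside (an illustration assuming NumPy, not an example from the text), the following simulates one realization of the classical linear model Y = Xθ + E for a line fit and computes the least squares estimate; the matrix X'X/n formed at the end anticipates the moment matrices of Section 1.24.

```python
# Sketch: one realization of Y = X theta + E and its least squares fit.
import numpy as np

rng = np.random.default_rng(0)
t = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])          # experimental conditions
X = np.column_stack([np.ones_like(t), t])          # rows are f(t_i)' = (1, t_i) for a line fit
theta = np.array([1.0, 2.0])                       # "nature's" mean parameters
sigma = 0.3
Y = X @ theta + sigma * rng.standard_normal(len(t))

theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # least squares estimate of theta
print(theta_hat)
print(X.T @ X / len(t))                            # per-observation moment matrix X'X/n
```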

1.4. CLASSICAL LINEAR MODELS WITH NORMALITY ASSUMPTION

For purposes of hypothesis testing and interval estimation, assumptions on the first two moments do not suffice and the entire distribution of Y is required. Hence in these cases there is a need for a classical linear model with normality assumption,

Y ∼ N_{Xθ; σ²I_n},

in which Y is assumed to be normally distributed with mean vector Xθ and dispersion matrix σ²I_n. If the model matrix X is known then the normal distribution P = N_{Xθ; σ²I_n} is determined by θ and σ². We display these parameters by writing E_{θ,σ²}[···] in place of E_P[···], etc. Moreover, the letter P soon signifies a projection matrix.

1.5. TWO-WAY CLASSIFICATION MODELS

The two-sample problem provides a simple introductory example. Consider two populations with mean responses α_1 and α_2. The observed responses from the two populations are taken to have a common variance σ² and to be uncorrelated. With replications j = 1, ..., n_i for populations i = 1, 2 this yields a linear model

Y_ij = α_i + E_ij.

Assembling the components into n × 1 vectors, with n = n_1 + n_2, we get the model equation Y = Xθ + E.

Here the n × 2 model matrix X and the parameter vector θ = (α_1, α_2)' are built from the regression vectors x_1 = (1, 0)' and x_2 = (0, 1)', repeated n_1 and n_2 times as the rows of X. The experimental design consists of the replication numbers n_1 and n_2, telling the experimenter how many responses are to be observed from which population.

It is instructive to identify the quantities of this example with those of the general theory. The experimental domain T is simply the two-element set {1, 2} of population labels. The regression function takes values f(1) = (1, 0)' and f(2) = (0, 1)' in R², inducing the set X = {(1, 0)', (0, 1)'} as the regression range.

The generalization from two to a populations leads to the one-way classification model. The model is still Y_ij = α_i + E_ij, but the subscript ranges turn into i = 1, ..., a and j = 1, ..., n_i. The mean parameter vector becomes θ = (α_1, ..., α_a)', and the experimental domain is T = {1, ..., a}. The regression function f maps i into the ith Euclidean unit vector e_i of R^a, with ith entry one and zeros elsewhere. Hence the regression range is X = {e_1, ..., e_a}. Further generalization is aided by a suitable terminology. Rather than speaking of different populations, i = 1, ..., a, we say that the "factor" population takes "levels" i = 1, ..., a. More factors than one occur in multiway classification models.

The two-way classification model with no interaction may serve as a prototype. Suppose level i of a first factor "A" has mean effect α_i, while level j of a second factor "B" has mean effect β_j. Assuming that the two effects are additive, the model reads

Y_ijk = α_i + β_j + E_ijk,

with replications k = 1, ..., n_ij, for levels i = 1, ..., a of factor A and levels j = 1, ..., b of factor B. The design problem now consists of choosing the replication numbers n_ij. An extreme, but feasible, choice is n_ij = 0, that is, no observation is made with factor A on level i and factor B on level j. The parameter vector θ is the k × 1 vector (α_1, ..., α_a, β_1, ..., β_b)', with k = a + b. The experimental domain is the discrete rectangle T = {1, ..., a} × {1, ..., b}. The regression function f maps (i, j) into (e_i', d_j')', where e_i is the ith Euclidean unit vector of R^a and d_j is the jth Euclidean unit vector of R^b. We return to this model in Section 1.27.
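A short sketch (illustrative values, assuming NumPy) of how the regression vectors f(i, j) = (e_i', d_j')' and chosen replication numbers n_ij assemble into a model matrix:

```python
# Sketch: model matrix of an additive two-way classification model.
import numpy as np

def regression_vector(i, j, a, b):
    """f(i, j) = (e_i', d_j')' for factor A level i and factor B level j (1-based)."""
    e = np.zeros(a); e[i - 1] = 1.0
    d = np.zeros(b); d[j - 1] = 1.0
    return np.concatenate([e, d])

a, b = 2, 3
replications = {(1, 1): 2, (1, 3): 1, (2, 2): 2}   # n_ij = 0 for all remaining cells
X = np.vstack([regression_vector(i, j, a, b)
               for (i, j), n_ij in replications.items()
               for _ in range(n_ij)])
print(X)        # a 5 x (a + b) model matrix
```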

So far, the experimental domain has been a finite set; next it is going to be an interval of the real line R.

1.6. POLYNOMIAL FIT MODELS

Let us first look at a line fit model,

Y_ij = α + β t_i + E_ij.

Intercept α and slope β form the parameter vector θ of interest, whereas the experimental conditions t_i come from an interval T ⊆ R. For the sake of concreteness, we think of t ∈ T as a "dose level". The design problem then consists of determining how many and which dose levels t_1, ..., t_ℓ are to be observed, and how often. If the experiment calls for n_i replications of dose level t_i, the subscript ranges in the model are j = 1, ..., n_i, for i = 1, ..., ℓ. Here the regression function has values f(t) = (1, t)', generating a line segment embedded in the plane R² as regression range X.

The parabola fit model has mean response depending on the dose level quadratically,

Y_ij = α + β t_i + γ t_i² + E_ij.

This changes the regression function to f(t) = (1, t, t²)', and the regression range X turns into the segment of a parabola embedded in the space R³.

These are special instances of polynomial fit models of degree d ≥ 1, the model equation becoming

Y_ij = θ_0 + θ_1 t_i + θ_2 t_i² + ··· + θ_d t_i^d + E_ij.

The regression range X is a one-dimensional curve embedded in R^k, with k = d + 1. Often the description of the experiment makes it clear that the experimental condition is a single real variable t; a linear model for a line fit (parabola fit, polynomial fit of degree d) is then referred to as a first-degree model (second-degree model, dth-degree model).

This generalizes to the powerful class of m-way dth-degree polynomial fit models. In these models the experimental condition t = (t_1, ..., t_m)' has m components, that is, the experimental domain T is a subset of R^m, and the model function f(t)'θ is a polynomial of degree d in the m variables t_1, ..., t_m.


For instance, a two-way third-degree model is given by

with ℓ experimental conditions t_i = (t_i1, t_i2)' ∈ T ⊆ R², and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. As a second example consider the three-way second-degree model

with ℓ experimental conditions t_i = (t_i1, t_i2, t_i3)' ∈ T ⊆ R³, and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. Both models have ten mean parameters.

The two examples illustrate saturated models because they feature every possible dth-degree power or cross product of the variables t_1, ..., t_m. In general, a saturated m-way dth-degree model has

(m + d choose d) = (m + d)! / (m! d!)

mean parameters. An instance of a nonsaturated two-way second-degree model is

with ℓ experimental conditions t_i = (t_i1, t_i2)' ∈ T ⊆ R², and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ.

The discussion of these examples is resumed in Section 1.27, after a proper definition of an experimental design.
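A brief sketch (an illustration in Python, not from the original text) of the regression vector of a one-way dth-degree fit and of the monomial count of a saturated m-way dth-degree model; both ten-parameter examples above are recovered.

```python
# Sketch: polynomial regression vectors and the saturated parameter count.
from math import comb
from itertools import combinations_with_replacement
import numpy as np

def poly_regression_vector(t, d):
    """One-way d-th degree fit: f(t) = (1, t, t^2, ..., t^d)'."""
    return np.array([t ** j for j in range(d + 1)])

def saturated_terms(m, d):
    """All monomials in t_1, ..., t_m of degree at most d."""
    return [c for deg in range(d + 1)
            for c in combinations_with_replacement(range(m), deg)]

print(poly_regression_vector(0.5, 3))                 # [1.  0.5  0.25  0.125]
print(len(saturated_terms(2, 3)), comb(2 + 3, 3))     # 10 10  (two-way third-degree)
print(len(saturated_terms(3, 2)), comb(3 + 2, 2))     # 10 10  (three-way second-degree)
```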

1.7. EUCLIDEAN MATRIX SPACE

In a classical linear model, interest concentrates on inference for the mean parameter vector θ. The performance of appropriate statistical procedures tends to be measured by dispersion matrices, moment matrices, or information matrices. This calls for a review of matrix algebra. All matrices used here are real.

First let us recall that the trace of a square matrix is the sum of its diagonal entries. Hence a square matrix and its transpose have identical traces. Another important property is that, under the trace operator, matrices commute provided they are conformable,

trace AB = trace BA    for all A ∈ R^{n×k} and B ∈ R^{k×n}.

We often apply this rule to quadratic forms given by a symmetric matrix A, in using x'Ax = trace Axx' = trace xx'A, as is convenient in a specific context.

Let R^{n×k} denote the linear space of real matrices with n rows and k columns. The Euclidean matrix scalar product

⟨A, B⟩ = trace A'B

turns R^{n×k} into a Euclidean space of dimension nk. For k = 1, we recover the Euclidean scalar product for vectors in R^n. The symmetry of scalar products, trace A'B = ⟨A, B⟩ = ⟨B, A⟩ = trace B'A, reproduces the property that a square matrix and its transpose have identical traces. Commutativity under the trace operator yields ⟨A, B⟩ = trace A'B = trace BA' = ⟨B', A'⟩ = ⟨A', B'⟩, that is, transposition preserves the scalar products between the matrix spaces of reversed numbers of rows and columns, R^{n×k} and R^{k×n}.
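A quick numerical check of these trace identities (an illustration with random matrices, assuming NumPy):

```python
# Sketch: trace identities behind the Euclidean matrix scalar product.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 4))
x = rng.standard_normal(3)
S = C @ C.T                                                  # a symmetric 3 x 3 matrix

print(np.isclose(np.trace(A.T @ B), np.trace(B.T @ A)))      # <A, B> = <B, A>
print(np.isclose(np.trace(A.T @ B), np.trace(A @ B.T)))      # <A, B> = <A', B'>
print(np.isclose(np.trace(A @ C), np.trace(C @ A)))          # trace AB = trace BA
print(np.isclose(x @ S @ x, np.trace(S @ np.outer(x, x))))   # x'Ax = trace Axx'
```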

In general, although not always, our matrices have at least as many rows as columns. Since we have to deal with extensive matrix products, this facilitates a quick check that factors properly conform. It is also in accordance with writing vectors of Euclidean space as columns. Notational conventions that are similarly helpful are to choose Greek letters for unknown parameters in a statistical model, and to use uppercase and lowercase letters to discriminate between a random variable and any one of its values, and between a matrix and any one of its entries.

Because of their role as dispersion operators, our matrices often are symmetric. We denote by Sym(k) the subspace of symmetric matrices, in the space R^{k×k} of all square, that is, not necessarily symmetric, matrices. Recall from matrix algebra that a symmetric k × k matrix A permits an eigenvalue decomposition

A = Σ_{i≤k} λ_i z_i z_i' = Z'Δ_λ Z.

The real numbers λ_1, ..., λ_k are the eigenvalues of A counted with their respective multiplicities, and the vectors z_1, ..., z_k ∈ R^k form an orthonormal system of eigenvectors. In general, such a decomposition fails to be unique, since if the eigenvalue λ_i has multiplicity greater than one then many choices for the eigenvectors z_i become feasible.

The second representation of an eigenvalue decomposition, A = Z'Δ_λ Z, assembles the pertinent quantities in a slightly different way. We define the operator Δ_λ by requiring that it creates a diagonal matrix with the argument vector λ = (λ_1, ..., λ_k)' on the diagonal. The orthonormal vectors z_1, ..., z_k form the k × k matrix Z' = (z_1, ..., z_k), whence Z' is an orthogonal matrix. The equality with the first representation now follows from

Z'Δ_λ Z = (z_1, ..., z_k) Δ_λ (z_1, ..., z_k)' = Σ_{i≤k} λ_i z_i z_i'.

Matrices that have matrices or vectors as entries, such as Z', are termed block matrices. They provide a convenient technical tool in many areas. The algebra of block matrices parallels the familiar algebra of matrices, and may be verified as needed.
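A numerical sketch (an illustrative example, assuming NumPy) of the two forms of the eigenvalue decomposition; numpy.linalg.eigh returns the eigenvalues together with an orthonormal matrix of eigenvectors, whose columns play the role of z_1, ..., z_k.

```python
# Sketch: A = sum_i lambda_i z_i z_i' = Z' Delta_lambda Z for a symmetric A.
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, Zp = np.linalg.eigh(A)            # columns of Zp are the eigenvectors z_i, so Zp = Z'
rank_one_sum = sum(l * np.outer(z, z) for l, z in zip(lam, Zp.T))
sandwich = Zp @ np.diag(lam) @ Zp.T    # Z' Delta_lambda Z
print(np.allclose(A, rank_one_sum), np.allclose(A, sandwich))   # True True
print(np.allclose(Zp.T @ Zp, np.eye(3)))                        # Z' is orthogonal
```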

In the space Sym(k), the subsets of nonnegative definite matrices, NND(k), and of positive definite matrices, PD(k), are central to the sequel. They are defined through

NND(k) = {A ∈ Sym(k) : x'Ax ≥ 0 for all x ∈ R^k},
PD(k) = {A ∈ Sym(k) : x'Ax > 0 for all x ∈ R^k, x ≠ 0}.

Of the many ways of characterizing nonnegative definiteness or positive definiteness, frequent use is made of the following.

1.8. NONNEGATIVE DEFINITE MATRICES

Lemma. Let A be a symmetric k × k matrix with smallest eigenvalue λ_min(A). Then we have

A ∈ NND(k)  ⟺  λ_min(A) ≥ 0  ⟺  trace AB ≥ 0 for all B ∈ NND(k);
A ∈ PD(k)  ⟺  λ_min(A) > 0  ⟺  trace AB > 0 for all nonzero B ∈ NND(k).

Proof. Assume A ∈ NND(k), and choose an eigenvector z ∈ R^k of norm 1 corresponding to λ_min(A). Then we obtain 0 ≤ z'Az = λ_min(A)z'z = λ_min(A). Now assume λ_min(A) ≥ 0, and choose an eigenvalue decomposition A = Σ_{i≤k} λ_i z_i z_i'. This yields trace AB = Σ_{i≤k} λ_i z_i'Bz_i ≥ 0 for all B ∈ NND(k). To complete the circle, we verify x'Ax ≥ 0 for all x ∈ R^k by choosing B = xx'.

For positive definiteness the arguments follow the same lines, upon observing that we have z_i'Bz_i > 0 for at least one index i provided B ∈ NND(k) is nonzero.
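A small numerical illustration of the characterization in the lemma (the example matrices are chosen for illustration, assuming NumPy):

```python
# Sketch: nonnegative definiteness via the smallest eigenvalue and trace(AB).
import numpy as np

def is_nnd(A, tol=1e-12):
    return np.linalg.eigvalsh(A).min() >= -tol

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])       # eigenvalues 0 and 2: nonnegative definite
B = np.array([[1.0, -2.0],
              [-2.0, 1.0]])      # eigenvalues -1 and 3: not nonnegative definite
print(is_nnd(A), is_nnd(B))      # True False

x = np.array([1.0, 1.0])
print(x @ B @ x, np.trace(B @ np.outer(x, x)))   # both -2.0: the quadratic form x'Bx is negative
```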

The set of all nonnegative definite matrices NND(k) has a beautiful geometrical shape, as follows.

1.9. GEOMETRY OF THE CONE OF NONNEGATIVE DEFINITE MATRICES

Lemma. The set NND(k) is a cone which is convex, pointed, and closed, and has interior PD(k) relative to the space Sym(k).

Proof. The proof is by first principles, recalling the definition of the properties involved. For A ∈ NND(k) and δ > 0 evidently δA ∈ NND(k), thus NND(k) is a cone. Next for A, B ∈ NND(k) we clearly have A + B ∈ NND(k) since

x'(A + B)x = x'Ax + x'Bx ≥ 0    for all x ∈ R^k.

Because NND(k) is a cone, we may replace A by (1 - α)A and B by αB, where α lies in the open interval (0; 1). Hence given any two matrices A and B the set NND(k) also includes the straight line (1 - α)A + αB from A to B, and this establishes convexity. If A ∈ NND(k) and also -A ∈ NND(k), then A = 0, whence the cone NND(k) is pointed.

The remaining two properties, that NND(k) is closed and has PD(k) for its interior, are topological in nature. Let

ℬ = {B ∈ Sym(k) : trace B² ≤ 1}

be the closed unit ball in Sym(k) under the Euclidean matrix scalar product. Replacing B ∈ Sym(k) by an eigenvalue decomposition Σ_j λ_j y_j y_j' yields trace B² = Σ_j λ_j²; thus B ∈ ℬ has eigenvalues λ_j satisfying |λ_j| ≤ 1. It follows that B ∈ ℬ fulfills x'Bx ≤ |x'Bx| ≤ Σ_j |λ_j|(x'y_j)² ≤ x'(Σ_j y_j y_j')x = x'x for all x ∈ R^k.

A set is closed when its complement is open. Therefore we pick an arbitrary k × k matrix A which is symmetric but fails to be nonnegative definite. By definition, x'Ax < 0 for some vector x ∈ R^k. Define δ = -x'Ax/(2x'x) > 0. For every matrix B ∈ ℬ, we then have

x'(A + δB)x = x'Ax + δ x'Bx ≤ x'Ax + δ x'x = (1/2) x'Ax < 0.

EXHIBIT 1.2 Convex cones in the plane ℝ². Left: the linear subspace generated by x ∈ ℝ² is a closed convex cone that is not pointed. Right: the open convex cone generated by x, y ∈ ℝ², together with the null vector, forms a pointed cone that is neither open nor closed.

Thus the set A + δℬ is included in the complement of NND(k), and it follows that the cone NND(k) is closed.

Interior points are identified similarly. Let A ∈ int NND(k), that is, A + δℬ ⊆ NND(k) for some δ > 0. If x ≠ 0 then the choice B = −xx'/x'x ∈ ℬ leads to

    0 ≤ x'(A + δB)x = x'Ax − δ x'x,    that is,    x'Ax ≥ δ x'x > 0.

Hence every matrix A interior to NND(k) is positive definite, int NND(k) ⊆ PD(k). It remains to establish the converse inclusion. Every matrix A ∈ PD(k) has 0 < λ_min(A) = δ, say. For B ∈ ℬ and x ∈ ℝ^k, we obtain x'Bx ≥ −x'x, and

    x'(A + δB)x ≥ x'Ax − δ x'x ≥ λ_min(A) x'x − δ x'x = 0.

Thus A + δℬ ⊆ NND(k) shows that A is interior to NND(k).

There are, of course, convex cones that are not pointed but closed, or pointed but not closed, or neither pointed nor closed. Exhibit 1.2 illustrates two such instances in the plane ℝ².

1.10. THE LOEWNER ORDERING OF SYMMETRIC MATRICES

True beauty shines in many ways, and order is one of them. We prefer to view the closed cone NND(k) of nonnegative definite matrices through the partial ordering ≥, defined on Sym(k) by

    A ≥ B  ⟺  A − B ∈ NND(k),

which has come to be known as the Loewner ordering of symmetric matrices. The notation B ≤ A in place of A ≥ B is self-explanatory. We also define the closely related variant > by

    A > B  ⟺  A − B ∈ PD(k),
which is based on the open cone of positive definite matrices. The geometric properties of the set NND(k) of being conic, convex, pointed, and closed, translate into related properties for the Loewner ordering:

    A ≥ B and δ ≥ 0                ⟹  δA ≥ δB;
    A ≥ B and C ≥ D                ⟹  A + C ≥ B + D;
    A ≥ B and B ≥ A                ⟹  A = B;
    A_j ≥ B for all j and A_j → A  ⟹  A ≥ B.

The third property in this list says that the Loewner ordering is antisymmetric. In addition, it is reflexive and transitive,

    A ≥ A;        A ≥ B and B ≥ C  ⟹  A ≥ C.
Hence the Loewner ordering enjoys the three properties that constitute a partial ordering.

For scalars, that is, k = 1, the Loewner ordering reduces to the familiar total ordering of the real line. Or the other way around, the total ordering ≥ of the real line ℝ is extended to the partial ordering ≥ of the matrix spaces Sym(k), with k > 1. The crucial distinction is that, in general, two matrices may not be comparable. An example is furnished by

for which neither A ≥ B nor B ≥ A holds true. Order relations always call for a study of monotonic functions.
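As an editorial aside (not in the original text), the following Python sketch, assuming NumPy, checks Loewner comparability by testing whether the difference is nonnegative definite; the pair of diagonal matrices below is our own illustration of two matrices that are not comparable.

    import numpy as np

    def loewner_geq(A, B, tol=1e-12):
        """A >= B in the Loewner ordering iff A - B is nonnegative definite."""
        return np.linalg.eigvalsh(A - B).min() >= -tol

    A = np.diag([1.0, 0.0])
    B = np.diag([0.0, 1.0])
    print(loewner_geq(A, B), loewner_geq(B, A))   # False False: not comparable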

1.11. MONOTONIC MATRIX FUNCTIONS

We consider functions that have a domain of definition and a range that are equipped with partial orderings. Such functions are called isotonic when they are order preserving, and antitonic when they are order reversing. A function is called monotonic when it is isotonic or antitonic. Two examples may serve to illustrate these concepts.

A first example is supplied by a linear form A ↦ trace AB on Sym(k), determined by a matrix B ∈ Sym(k). If this linear form is isotonic relative to the Loewner ordering, then A ≥ 0 implies trace AB ≥ 0, and Lemma 1.8 proves that the matrix B is nonnegative definite. Conversely, if B is nonnegative definite and A ≥ C, then again Lemma 1.8 yields trace(A − C)B ≥ 0, that is, trace AB ≥ trace CB. Thus a linear form A ↦ trace AB is isotonic relative to the Loewner ordering if and only if B is nonnegative definite. In particular the trace itself is isotonic, A ↦ trace A, as follows with B = I_k.

It is an immediate consequence that the Euclidean matrix norm ‖A‖ = (trace A²)^{1/2} is an isotonic function from the closed cone NND(k) into the real line. For if A ≥ B ≥ 0, then we have

    ‖A‖² = trace A² ≥ trace AB ≥ trace B² = ‖B‖².

As a second example, matrix inversion A ↦ A^{-1} is claimed to be an antitonic mapping from the open cone PD(k) into itself. For if A ≥ B > 0 then we get

    AB^{-1}A = (A − B)B^{-1}(A − B) + (A − B) + A ≥ A.
Pre- and postmultiplication by A^{-1} gives A^{-1} ≤ B^{-1}, as claimed. A minimization problem relative to the Loewner ordering is taken up in the Gauss-Markov Theorem 1.19. Before turning to this topic, we review the role of matrices when they are interpreted as linear mappings.

1.12. RANGE AND NULLSPACE OF A MATRIX

A rectangular matrix A ∈ ℝ^{n×k} may be identified with a linear mapping carrying x ∈ ℝ^k into Ax ∈ ℝ^n. Its range or column space, and its nullspace or kernel are

    range A = {Ax : x ∈ ℝ^k},        nullspace A = {x ∈ ℝ^k : Ax = 0}.

The range is a subspace of the image space ℝ^n. The nullspace is a subspace of the domain of definition ℝ^k. The rank and nullity of A are the dimensions of the range of A and of the nullspace of A, respectively.

If the matrix A is symmetric, then its rank coincides with the number of nonvanishing eigenvalues, and its nullity is the number of vanishing eigenvalues. Symmetry involves transposition, and transposition indicates the presence of a scalar product (because A' is the unique matrix B that satisfies ⟨Ax, y⟩ = ⟨x, By⟩ for all x, y). In fact, Euclidean geometry provides the following vital connection that the nullspace of the transpose of a matrix is the orthogonal complement of its range. Let

    L^⊥ = {x ∈ ℝ^n : x'y = 0 for all y ∈ L}

denote the orthogonal complement of a subspace L of the linear space ℝ^n.

EXHIBIT 1.3 Orthogonal decompositions induced by a linear mapping. Range and nullspace of a matrix A ∈ ℝ^{n×k} and of its transpose A' orthogonally decompose the domain of definition ℝ^k and the image space ℝ^n.

1.13. TRANSPOSITION AND ORTHOGONALITY

Lemma. Let A be an n × k matrix. Then we have

    nullspace A' = (range A)^⊥.

Proof. A few transcriptions establish the result:

    x ∈ nullspace A'  ⟺  A'x = 0  ⟺  y'A'x = 0 for all y ∈ ℝ^k
                      ⟺  (Ay)'x = 0 for all y ∈ ℝ^k  ⟺  x ∈ (range A)^⊥.
Replacing A' by A yields nullspace A = (range A')^⊥. Thus any n × k matrix A comes with two orthogonal decompositions, of the domain of definition ℝ^k, and of the image space ℝ^n. See Exhibit 1.3.
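The following Python sketch is an editorial illustration of Lemma 1.13, not part of the original text; it assumes NumPy and uses the singular value decomposition, and the function names are ours.

    import numpy as np

    def range_basis(A, tol=1e-12):
        """Orthonormal basis of range A via the SVD."""
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        return U[:, s > tol]

    def nullspace_basis(A, tol=1e-12):
        """Orthonormal basis of nullspace A via the SVD."""
        _, s, Vt = np.linalg.svd(A)
        rank = int((s > tol).sum())
        return Vt[rank:].T

    A = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]])   # a 3 x 2 matrix of rank 1
    R = range_basis(A)                 # subspace of R^3
    N = nullspace_basis(A.T)           # nullspace of A', also in R^3
    print(np.allclose(R.T @ N, 0.0))   # True: nullspace A' = (range A)^perp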

1.14. SQUARE ROOT DECOMPOSITIONS OF A NONNEGATIVE DEFINITE MATRIX

As a first application of Lemma 1.13 we investigate square root decompositions of nonnegative definite matrices. If V is a nonnegative definite n × n matrix, a representation of the form

    V = UU'

is called a square root decomposition of V, and U is called a square root of V. Various such decompositions are easily obtained from an eigenvalue decomposition

    V = ∑_{i≤n} λ_i z_i z_i'.

For instance, a feasible choice is U = (±√λ_1 z_1, …, ±√λ_n z_n) ∈ ℝ^{n×n}. If V has nonvanishing eigenvalues λ_1, …, λ_k, other choices are U = (±√λ_1 z_1, …, ±√λ_k z_k) ∈ ℝ^{n×k}; here V = UU' is called a full rank decomposition for the reason that the square root U has full column rank.

Every square root U of V has the same range as V, that is,

    range V = range U.

To prove this, we use Lemma 1.13, in that the ranges of V and U coincide if and only if the nullspaces of V and U' are the same. But U'z = 0 clearly implies Vz = 0. Conversely, Vz = 0 entails 0 = z'Vz = z'UU'z = (U'z)'(U'z), and thus forces U'z = 0.
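A short editorial sketch (not in the original text, assuming NumPy) builds a full rank square root from an eigenvalue decomposition and checks that its range agrees with that of V.

    import numpy as np

    def square_root(V, tol=1e-12):
        """A full rank square root U with V = UU' for nonnegative definite V."""
        lam, Z = np.linalg.eigh(V)
        keep = lam > tol
        return Z[:, keep] * np.sqrt(lam[keep])   # columns sqrt(lam_i) z_i

    V = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 0.0],
                  [0.0, 0.0, 0.0]])
    U = square_root(V)
    print(np.allclose(V, U @ U.T))                               # V = UU'
    print(np.linalg.matrix_rank(U) == np.linalg.matrix_rank(V))  # same range dimension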

The range formula, for every n × k matrix X,

    range X'VX = range X'V,

is a direct consequence of a square root decomposition V = UU', since range V = range U implies range X'V = range X'U = range X'VX ⊆ range X'V.

Another application of Lemma 1.13 is to clarify the role of mean vectors and dispersion matrices in linear models.

1.15. DISTRIBUTIONAL SUPPORT OF LINEAR MODELS

Lemma. Let Y be an n × 1 random vector with mean vector μ and dispersion matrix V. Then we have

    Y ∈ μ + range V    with probability 1,

that is, the distribution of Y is concentrated on the affine subspace that results if the linear subspace range V is shifted by the vector μ.

Proof. The assertion is true if V is positive definite. Otherwise we must show that Y − μ lies in the proper subspace range V with probability 1. In view of Lemma 1.13, this is the same as Y − μ ⊥ nullspace V with probability 1. Here nullspace V may be replaced by any finite set {z_1, …, z_k} of vectors spanning it. For each j = 1, …, k we obtain

    E[(z_j'(Y − μ))²] = z_j'Vz_j = 0,

thus Y − μ ⊥ z_j with probability 1. The exceptional nullsets may depend on the subscript j, but their union produces a global nullset outside of which Y − μ is orthogonal to {z_1, …, z_k}, as claimed.

In most applications the mean vector μ is a member of the range of V. Then the affine subspace μ + range V equals range V and is actually a linear subspace, so that Y falls into the range of V with probability 1. In a classical linear model as expounded in Section 1.3, the mean vector μ is of the form Xθ with unknown parameter system θ. Hence the containment μ = Xθ ∈ range V holds true for all vectors θ provided

    range X ⊆ range V.

Such range inclusion conditions deserve careful study as they arise in many places. They are best dealt with using projectors, and projectors are natural companions of generalized inverse matrices.

1.16. GENERALIZED MATRIX INVERSION AND PROJECTIONS

For a rectangular matrix A ∈ ℝ^{n×k}, any matrix G ∈ ℝ^{k×n} fulfilling AGA = A is called a generalized inverse of A. The set of all generalized inverses of A,

    A⁻ = {G ∈ ℝ^{k×n} : AGA = A},

is an affine subspace of the matrix space ℝ^{k×n}, being the solution set of an inhomogeneous linear matrix equation. If a relation is invariant to the choice of members in A⁻, then we often replace the matrix G by the set A⁻. For instance, the defining property may be written as AA⁻A = A.

A square and nonsingular matrix A has its usual inverse A^{-1} for its unique generalized inverse, A⁻ = {A^{-1}}. In this sense generalized matrix inversion is a generalization of regular matrix inversion.

Our explicit convention of treating A⁻ as a set of matrices is a bit unusual, even though it is implicit in all of the work on generalized matrix inverses. Namely, often only results that are invariant to the specific choice of a generalized inverse are of interest. For example, in the following lemma, the product X'GX is the same for every generalized inverse G of V. We indicate this by inserting the set V⁻ in place of the matrix G.

However, the central optimality result for experimental designs is of opposite type. The General Equivalence Theorem 7.14 states that a certain property holds true, not for every, but for some generalized inverse. In fact, the theorem becomes false if this point is missed. Our notation helps to alert us to this pitfall.

A matrix P ∈ ℝ^{n×n} is called a projector onto a subspace 𝒦 ⊆ ℝ^n when P is idempotent, that is, P² = P, and has 𝒦 for its range.

Let us verify that the following characterizing interrelation between generalized inverses and projectors holds true:

    G ∈ A⁻  ⟺  AG is a projector onto range A.

For the direct part, note first that AG is idempotent. Moreover the inclusions

    range A = range AGA ⊆ range AG ⊆ range A

show that the range of AG and the range of A coincide. For the converse part, we use that the projector AG has the same range as the matrix A. Thus every vector Ax with x ∈ ℝ^k has a representation AGy with y ∈ ℝ^n, whence AGAx = AGAGy = AGy = Ax. Since x can be chosen arbitrarily, this establishes AGA = A.
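An editorial sketch, not in the original text: assuming NumPy, the Moore-Penrose pseudoinverse serves as one particular generalized inverse, and AG is verified to be a projector onto the range of A.

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [1.0, 0.0]])
    G = np.linalg.pinv(A)            # one generalized inverse of A

    print(np.allclose(A @ G @ A, A))     # AGA = A
    P = A @ G                            # AG is a projector onto range A
    print(np.allclose(P @ P, P))         # idempotent
    print(np.allclose(P @ A, A))         # leaves range A pointwise fixed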

The intimate relation between range inclusions and projectors, alluded to in Section 1.15, can now be made more explicit.

1.17. RANGE INCLUSION LEMMA

Lemma. Let X be an n × k matrix and V be an n × s matrix. Then we have

    range X ⊆ range V  ⟺  VV⁻X = X.

If range X ⊆ range V and V is a nonnegative definite n × n matrix, then the product

    X'V⁻X

does not depend on the choice of generalized inverse for V, is nonnegative definite, and has the same range as X' and the same rank as X.

Proof. The range of X is included in the range of V if and only if X = VW for some conformable matrix W. But then VGX = VGVW = VW = X for all G ∈ V⁻. Conversely we may assume the slightly weaker property that VGX = X for at least one generalized inverse G of V. Clearly this is enough to make sure that the range of X is included in the range of V.

Now let V be nonnegative definite, in addition to X = VW. Then the matrix X'GX = W'VGVW = W'VW is the same for all choices G ∈ V⁻, and is nonnegative definite. Furthermore the ranks of X'V⁻X = W'VW and VW = X are equal. In particular, the ranges of X'V⁻X and X' have the same dimension. Since the first is included in the second, they must then coincide.

We illustrate by example what can go wrong if the range inclusion condition is violated. The set of generalized inverses of

is

This is also the set of possible products X'GX with G ∈ V⁻ if for X we choose the 2 × 2 identity matrix. Hence the product X'V⁻X is truly a set and not a singleton. Among the members

some are not symmetric (β ≠ γ), some are not nonnegative definite (α < 0, β = γ), and some do not have the same range as X' and the same rank as X (α = β = γ = 1).

Frequent use of the lemma is made with other matrices in place of X and V. The above presentation is tailored to the linear model context, which we now resume.
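As an editorial illustration of Lemma 1.17, not part of the original text: assuming NumPy, the sketch below constructs two different generalized inverses of a singular V and checks that X'GX is the same for both, provided range X ⊆ range V. The construction G = G⁺ + Z − G⁺VZVG⁺ is our own device for producing further generalized inverses.

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.standard_normal((4, 2))
    V = U @ U.T                              # nonnegative definite, rank 2
    X = U @ rng.standard_normal((2, 2))      # range X contained in range V

    def generalized_inverse(V, Z):
        """pinv(V) plus a perturbation that keeps V G V = V."""
        Gp = np.linalg.pinv(V)
        return Gp + Z - Gp @ V @ Z @ V @ Gp

    G1 = generalized_inverse(V, np.zeros((4, 4)))
    G2 = generalized_inverse(V, rng.standard_normal((4, 4)))
    print(np.allclose(V @ G2 @ V, V))                # G2 is a generalized inverse
    print(np.allclose(X.T @ G1 @ X, X.T @ G2 @ X))   # X'GX does not depend on G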

1.18. GENERAL LINEAR MODELS

A central result in linear model theory is the Gauss-Markov Theorem 1.19. The version below is stated purely in terms of matrices, as a minimization problem relative to the Loewner ordering. However, it is best understood in the setting of a general linear model in which, by definition, the n × 1 response vector Y is assumed to have mean vector and dispersion matrix given by

    E[Y] = Xθ,        D[Y] = σ²V.

Here the n × k model matrix X and the nonnegative definite n × n matrix V are assumed known, while the mean parameter vector θ ∈ Θ and the model variance σ² > 0 are taken to be unknown. The dispersion matrix need no longer be proportional to the identity matrix as in the classical linear model discussed in Section 1.3. Indeed, the matrix V may be rank deficient, even admitting the deterministic extreme V = 0.

The theorem considers unbiased linear estimators LY for Xθ, that is, n × n matrices L satisfying the unbiasedness requirement

    E[LY] = LXθ = Xθ    for all θ ∈ Θ.

In a general linear model, it is implicitly assumed that the parameter domain Θ is the full space, Θ = ℝ^k. Under this assumption, LY is unbiased for Xθ if and only if LX = X, that is, L is a left identity of X. There always exists a left identity, for instance, L = I_n. Hence the mean vector Xθ always admits an unbiased linear estimator.

More generally, we may wish to estimate s linear forms c_1'θ, …, c_s'θ of θ, with coefficient vectors c_j ∈ ℝ^k. For a concise vector notation, we form the k × s coefficient matrix K = (c_1, …, c_s). Thus interest is in the parameter subsystem K'θ. A linear estimator LY for K'θ is determined by an s × n matrix L. Unbiasedness holds if and only if

    LX = K'.                                                    (1)

There are two important implications.

First, K'θ is called estimable when there exists an unbiased linear estimator for K'θ. This happens if and only if there is some matrix L that satisfies (1). In the sequel such a specific solution is represented as L = U', with an n × s matrix U. Therefore estimability means that K' is of the form U'X.

Second, if K'θ is estimable, then the set of all matrices L that satisfy (1) determines the set of all unbiased linear estimators LY for K'θ. In other words, in order to study the unbiased linear estimators for K'θ = U'Xθ, we have to run through the solutions L of the matrix equation

    LX = U'X.                                                   (2)

It is this equation (2) to which the Gauss-Markov Theorem 1.19 refers.

The theorem identifies unbiased linear estimators L̂Y for the mean vector Xθ which among all unbiased linear estimators LY have a smallest dispersion matrix. Thus the quantity to be minimized is σ²LVL', relative to the Loewner ordering. The crucial step in the proof is the computation of the covariance matrix between the optimality candidate L̂Y and a competitor LY,

    C[L̂Y, LY] = σ² L̂VL'.
1.19. THE GAUSS-MARKOV THEOREM

Theorem. Let X be an n × k matrix and V be a nonnegative definite n × n matrix. Suppose U is an n × s matrix.

A solution L̂ of the equation LX = U'X attains the minimum of LVL', relative to the Loewner ordering and over all solutions L of the equation LX = U'X,

    L̂VL̂' = min { LVL' : LX = U'X },

if and only if

    L̂VR' = 0,

where R is a projector given by R = I_n − XG for some generalized inverse G of X.

A minimizing solution L̂ exists; a particular choice is U'(I_n − VR'HR), with any generalized inverse H of RVR'. The minimum admits the representation

    L̂VL̂' = U'(V − VR'(RVR')⁻RV)U,

and does not depend on the choice of the generalized inverses involved.

Proof. For a fixed generalized inverse G of X we introduce the projectors P = XG and R = I_n − P. Every solution L satisfies L = L(P + R) = LXG + LR = U'P + LR.

I. First the converse part is proved. Assume the matrix L̂ solves L̂X = U'X and fulfills L̂VR' = 0, and let L be any other solution. We get

    L̂V(L − L̂)' = L̂V((L − L̂)R)' = L̂VR'(L − L̂)' = 0,

and symmetry yields (L − L̂)VL̂' = 0. Multiplying out (L − L̂ + L̂)V(L − L̂ + L̂)' = (L − L̂)V(L − L̂)' + 0 + 0 + L̂VL̂', we obtain the minimizing property of L̂,

    LVL' = (L − L̂)V(L − L̂)' + L̂VL̂' ≥ L̂VL̂'.                    (1)

II. Next we tackle existence. Because of RX = 0 the matrix L̂ = U'(I_n − VR'HR) solves L̂X = U'X − U'VR'HRX = U'X. It remains to show that L̂VR' = 0. To this end, we note that range RV = range RVR', by the square root discussion in Section 1.14. Lemma 1.17 says that VR'HRV = VR'(RVR')⁻RV, as well as RVR'(RVR')⁻RV = RV. This gives

    L̂VR' = U'(VR' − VR'HRVR') = U'(VR' − VR'(RVR')⁻RVR') = U'(VR' − VR') = 0.
Hence L̂ fulfills the necessary conditions from the converse part, and thus attains the minimum. Furthermore the minimum permits the representation

    L̂VL̂' = U'(I_n − VR'HR)V(I_n − R'H'RV)U = U'(V − VR'(RVR')⁻RV)U.

III. Finally the existence proof and the converse part are jointly put to use in the direct part. Since the matrix L̂ is minimizing, any other minimizing solution L satisfies (1) with equality. This forces (L − L̂)V(L − L̂)' = 0, and further entails (L − L̂)V = 0. Postmultiplication by R' yields LVR' = L̂VR' = 0.

The statistic RY is an unbiased estimator for the null vector, E[RY] = RXθ = 0. In the context of a general linear model, the theorem therefore says that an unbiased estimator LY for U'Xθ has a minimum dispersion matrix if and only if LY is uncorrelated with the unbiased null estimator RY, that is, C[LY, RY] = σ²LVR' = 0.

Our original problem of estimating Xθ emerges with U = I_n. The solution matrices L of the equation LX = X are the left identities of X. A minimum variance unbiased linear estimator for Xθ is given by L̂Y, with L̂ = I_n − VR'HR. The minimum dispersion matrix takes the form σ²(V − VR'(RVR')⁻RV).
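A small editorial check of this construction for U = I_n, not part of the original text: assuming NumPy, pseudoinverses stand in for the arbitrary generalized inverses G and H of the theorem.

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 5, 2
    X = rng.standard_normal((n, k))
    B = rng.standard_normal((n, 3))
    V = B @ B.T                                  # a rank deficient dispersion matrix

    G = np.linalg.pinv(X)                        # one generalized inverse of X
    R = np.eye(n) - X @ G                        # residual projector R = I_n - XG
    H = np.linalg.pinv(R @ V @ R.T)              # one generalized inverse of RVR'
    L_hat = np.eye(n) - V @ R.T @ H @ R          # the minimizing solution for U = I_n

    print(np.allclose(L_hat @ X, X))             # left identity of X
    print(np.allclose(L_hat @ V @ R.T, 0.0))     # uncorrelated with the null estimator RY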

A simpler formula for the minimum dispersion matrix becomes available under a range inclusion condition as in Section 1.17.

1.20. THE GAUSS-MARKOV THEOREM UNDER A RANGE INCLUSION CONDITION

Theorem. Let X be an n × k matrix and V be a nonnegative definite n × n matrix such that the range of V includes the range of X. Suppose U is an n × s matrix. Then the minimum of LVL' over all solutions L of the equation LX = U'X admits the representation

    min { LVL' : LX = U'X } = U'X(X'V⁻X)⁻X'U,

and is attained by L̂ = U'X(X'V⁻X)⁻X'H where H is any generalized inverse of V.

Proof. The matrix X'V⁻X = W does not depend on the choice of the generalized inverse for V and has the same range as X', by Lemma 1.17. A second application of Lemma 1.17 shows that a similar statement holds for XW⁻X'. Hence the optimality candidate L̂ is well defined. It also satisfies L̂X = U'X since

    L̂X = U'XW⁻X'HX = U'XW⁻W = U'X.

Furthermore, it fulfills L̂VR' = 0, because of VH'X = VV⁻X = X and

    L̂VR' = U'XW⁻X'HVR' = U'XW⁻X'R' = U'XW⁻(RX)' = 0.

By Theorem 1.19, the matrix L̂ is a minimizing solution. From X'V⁻VV⁻X = X'V⁻X, we now obtain the representation

    L̂VL̂' = U'XW⁻(X'HVH'X)W⁻X'U = U'XW⁻WW⁻X'U = U'X(X'V⁻X)⁻X'U.
The preceding two theorems investigate linear estimators LY that are unbiased for a parameter system U'Xθ. The third, and last, version concentrates on estimating the parameter vector θ itself. A linear estimator LY, with L ∈ ℝ^{k×n}, is unbiased for θ if and only if

    E[LY] = LXθ = θ    for all θ ∈ ℝ^k.

This reduces to LX = I_k, that is, L is a left inverse of X. For a left inverse L of X to exist it is necessary and sufficient that X has full column rank k.

1.21. THE GAUSS-MARKOV THEOREM FOR THE FULL MEAN PARAMETER SYSTEM

Theorem. Let X be an n × k matrix with full column rank k and V be a nonnegative definite n × n matrix.

A left inverse L̂ of X attains the minimum of LVL' over all left inverses L of X,

    L̂VL̂' = min { LVL' : LX = I_k },

if and only if

    L̂VR' = 0,
where R is a projector given by R = I_n − XG for some generalized inverse G of X.

A minimizing left inverse L̂ exists; a particular choice is G − GVR'HR, with any generalized inverse H of RVR'. The minimum admits the representation

    L̂VL̂' = G(V − VR'(RVR')⁻RV)G',

and does not depend on the choice of the generalized inverses involved. Moreover, if the range of V includes the range of X then the minimum is

    (X'V⁻X)^{-1},

and is attained by any matrix L̂ = (X'V⁻X)^{-1}X'H where H is any generalized inverse of V.

Proof. Notice that every generalized inverse G of X is a left inverse of X, since premultiplication of XGX = X by (X'X)^{-1}X' gives GX = I_k. With U' = G, Theorem 1.19 and Theorem 1.20 establish the assertions.

The estimators LY that result from the various versions of the Gauss-Markov Theorem are closely related to projecting the response vector Y onto an appropriate subspace of ℝ^n. Therefore we briefly digress and comment on projectors.

1.22. PROJECTORS, RESIDUAL PROJECTORS, AND DIRECT SUM DECOMPOSITION

Projectors were introduced in Section 1.16. If the matrix P ∈ ℝ^{n×n} is a projector, P = P², then it decomposes the space ℝ^n into a direct sum consisting of the subspaces 𝒦 = range P and ℒ = nullspace P. To see this, observe that the nullspace of P coincides with the range of the residual projector R = I_n − P. Therefore every vector x ∈ ℝ^n satisfies

    x = Px + Rx ∈ 𝒦 + ℒ.

But then the vector x lies in the intersection 𝒦 ∩ ℒ if and only if x = Px and x = Rx, or equivalently, x = Rx = RPx = 0. Hence the spaces 𝒦 and ℒ are disjoint except for the null vector.

Symmetry of P adds orthogonality to the picture. Namely, we then have ℒ = nullspace P = nullspace P' = (range P)^⊥ = 𝒦^⊥, by Lemma 1.13.

EXHIBIT 1.4 Orthogonal and oblique projections. The projection onto the first component in ℝ² along one direction is orthogonal. The (dashed) projection along another direction is nonorthogonal relative to the Euclidean scalar product.

Thus projectors that are symmetric correspond to orthogonal sum decompositions of the space ℝ^n, and are called orthogonal projectors. This translation into geometry often provides a helpful view. In Exhibit 1.4, we sketch a simple illustration.

In telling the full story we should speak of P as being the projector "onto 𝒦 along ℒ", that is, onto its range along its nullspace. But brevity wins over exactness. The fact remains that projectors in ℝ^n correspond to direct sum decompositions ℝ^n = 𝒦 ⊕ ℒ, without reference to any scalar product. In Lemma 2.15, we present a method for computing projectors if the subspaces 𝒦 and ℒ arise as ranges of nonnegative definite matrices A and B.

We mention in passing that a symmetric n × n matrix V permits yet another eigenvalue decomposition of the form V = ∑_{j≤ℓ} λ_j P_j, where λ_1, …, λ_ℓ are the distinct eigenvalues of V and P_1, …, P_ℓ are the orthogonal projectors onto the corresponding eigenspaces. In this form, the eigenvalue decomposition is unique up to enumeration, in contrast to the representations given in Section 1.7.

1.23. OPTIMAL ESTIMATORS IN CLASSICAL LINEAR MODELS

From now on, a minimum variance unbiased linear estimator is called an optimal estimator, for short. Returning to the classical linear model,

    E[Y] = Xθ,        D[Y] = σ²I_n,
Theorem 1.20 shows that the optimal estimator for the mean vector Xθ is PY and that it has dispersion matrix σ²P, where P = X(X'X)⁻X' is the orthogonal projector onto the range of X. Therefore the estimator PY evidently depends on the model matrix X through its range only. Representing this subspace as the range of another matrix still leads to the same estimator PY and the same dispersion matrix σ²P.

This outlook changes dramatically as soon as the parameter vector θ itself has a definite physical meaning and is to be investigated with its given components, as is the case in most applications. From Theorem 1.21, the optimal estimator for θ then is (X'X)^{-1}X'Y, and has dispersion matrix σ²(X'X)^{-1}. Hence changing the model matrix X in general affects both the optimal estimator and its dispersion matrix.
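An editorial sketch of these two estimators in the classical model, not part of the original text (assuming NumPy; the random data are only for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 8, 3
    X = rng.standard_normal((n, k))              # full column rank with probability one
    Y = rng.standard_normal(n)

    P = X @ np.linalg.solve(X.T @ X, X.T)        # orthogonal projector onto range X
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # optimal estimator for theta
    print(np.allclose(P @ Y, X @ theta_hat))     # the optimal estimator for X theta is PY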

1.24. EXPERIMENTAL DESIGNS AND MOMENT MATRICES

The stage is set now to introduce the notion of experimental designs. In Section 1.3, the model matrix was built up according to X = (x_1, …, x_n)', starting from the regression vectors x_i. These are at the discretion of the experimenter who can choose them so that in a classical linear model, the optimal estimator (X'X)^{-1}X'Y for the mean parameter vector θ attains a dispersion matrix σ²(X'X)^{-1} as small as possible, relative to the Loewner ordering. Since matrix inversion is antitonic, as seen in Section 1.11, the experimenter may just as well aim to maximize the precision matrix, that is, the inverse dispersion matrix,

    (1/σ²) X'X = (1/σ²) ∑_{i≤n} x_i x_i'.

The sum repeats the regression vector x_i ∈ 𝒳 according to how often it occurs in x_1, …, x_n. Since the order of summation does not matter, we may assume that the distinct regression vectors x_1, …, x_ℓ, say, are enumerated in the initial section, while replications are accounted for in the final section x_{ℓ+1}, …, x_n.

We introduce for i ≤ ℓ the frequency counts n_i as the number of times the particular regression vector x_i occurs among the full list x_1, …, x_n. This motivates the definitions of experimental designs and their moment matrices which are fundamental to the sequel.

DEFINITION. An experimental design for sample size n is given by a finite number of regression vectors x_1, …, x_ℓ in 𝒳, and nonzero integers n_1, …, n_ℓ such that

    ∑_{i≤ℓ} n_i = n.
In other words an experimental design for sample size n, denoted by ξ_n, specifies ℓ ≤ n distinct regression vectors x_i, and assigns to them frequencies n_i that sum to n. It tells the experimenter to observe n_i responses under the experimental conditions that determine the regression vector x_i.

The vectors that appear in the design ξ_n are called the support of ξ_n, supp ξ_n = {x_1, …, x_ℓ}. The matrix ∑_{i≤ℓ} n_i x_i x_i' = X'X is called the moment matrix of ξ_n and is denoted by M(ξ_n). Then the precision matrix of the optimal estimator for θ may be written as

    (1/σ²) X'X = (1/σ²) M(ξ_n) = (n/σ²) ∑_{i≤ℓ} (n_i/n) x_i x_i'.              (1)
The set of all designs for sample size n is denoted by Ξ_n. Experimental designs for finite sample size lead to often untractable integer optimization problems. Much more smoothness evolves if we take a slightly different point of view. Indeed, the last sum in (1) is an average over the regression range 𝒳, placing rational weight n_i/n on the regression vector x_i. The clue is to allow the weights to vary continuously in the closed unit interval [0;1]. This emerges as the limiting case for sample size n tending to infinity.

DEFINITION. An experimental design for infinite sample size (or a design, for short) is a distribution on the regression range 𝒳 which assigns all its mass to a finite number of points.

A general design, denoted by ξ, specifies ℓ ≥ 1 regression vectors x_i and weights w_i, where w_1, …, w_ℓ are positive numbers summing to 1. It tells the experimenter to observe a proportion w_i out of all responses under the experimental conditions that come with the regression vector x_i. A vector in the regression range carrying positive weight under the design ξ is called a support point of ξ. The set of all support points is called the support of ξ and is denoted by supp ξ. We use the notation Ξ for the set of all designs.

The discussion of precision matrices suggests that any performance measure of a design ought to be based on its moment matrix.

DEFINITION. The moment matrix of a design ξ ∈ Ξ is the k × k matrix defined by

    M(ξ) = ∫_𝒳 xx' dξ.

The representation as an integral is useful since in general the support of a design is not known. It nicely exhibits that the moment matrices depend linearly on the designs, in the sense that

    M((1 − α)ξ + αη) = (1 − α)M(ξ) + αM(η)    for all ξ, η ∈ Ξ and α ∈ [0;1].
However, the integral notation hides the fact that a moment matrix always reduces to a finite sum.
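The following editorial sketch, not in the original text, computes this finite sum for a design with given support points and weights (assuming NumPy; the line fit example is our own choice).

    import numpy as np

    def moment_matrix(points, weights):
        """M(xi) = sum_i w_i x_i x_i' for a design with finite support."""
        X = np.asarray(points, dtype=float)      # one support point per row
        w = np.asarray(weights, dtype=float)
        return (X * w[:, None]).T @ X

    # A design on the regression range of the line fit model f(t) = (1, t)':
    support = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
    weights = [0.25, 0.5, 0.25]
    print(moment_matrix(support, weights))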

A design ξ_n for sample size n must be standardized according to ξ_n/n to become a member of the set Ξ. In (1) the precision matrix of the optimal estimator for θ then takes the form

    (n/σ²) M(ξ_n/n).

Thus standardized precision grows directly proportional to the sample size n, and decreases inversely proportional to the model variance σ². For this reason the moment matrices M(ξ) of designs ξ ∈ Ξ are often identified with precision matrices for θ standardized with respect to sample size n and model variance σ².

Of course, in order to become realizable, a design for infinite sample size must in general be approximated by a design for sample size n. An efficient apportionment method is proposed in Chapter 12.

1.25. MODEL MATRIX VERSUS DESIGN MATRIX

Once a finite sample size design is selected, a worksheet can be set up into which the experimenter enters the observations that are obtained under the appropriate experimental conditions. In practical applications, designs ξ on the regression range 𝒳 uniquely correspond with designs τ on the experimental domain 𝒯. The set of all designs τ on 𝒯 is denoted by T. (The Greek letters τ, T are in line with t, 𝒯, as are ξ, Ξ with x, 𝒳.) The correspondence between ξ on 𝒳 and τ on 𝒯 is always plainly visible. The formal relation is that ξ on 𝒳 is the distribution of the vector f relative to the underlying probability measure τ on 𝒯, that is, ξ = τ ∘ f^{-1}.

From an applied point of view, the designs τ on the experimental domain 𝒯 play a more primary role than the designs ξ on the regression range 𝒳. However, for the mathematical development, we concentrate on ξ. Because of the obvious correspondence between ξ and τ, no ambiguity will arise.

Exhibit 1.5 displays a worksheet for a design of sample size n = 12, consisting of ℓ = 6 experimental conditions, each with n_i = 2 replications. When the experiment is carried out, the 12 observations ought to be made in random order, as a safeguard against a systematic bias that might be induced by any "standard order". Hence while such a worksheet is useful for the statistician, the experimenter may be better off with a version such as in Exhibit 1.6. There, the experimental runs are randomized and presentation of the model matrix X is suppressed. The matrix that appears in Exhibit 1.6 still tells the experimenter which experimental conditions are to be realized in which run, but it does not reflect the underlying statistical model.

EXHIBIT 1.5 An experimental design worksheet. The design is for sample size 12, for a nonsaturated two-way second-degree model. Each run i = 1, …, 6 has two replications, j = 1, 2. Randomization of the run order must be implemented by the experimenter.

    Run   Experimental conditions    Regression vector           Replicated observations
     i       t_i1     t_i2           x_i1  x_i2  x_i3  x_i4          y_i1     y_i2
     1        -1       -1              1    -1    -1     1
     2         0       -1              1     0    -1     0
     3         1       -1              1     1    -1     1
     4        -1        1              1    -1     1     1
     5         0        1              1     0     1     0
     6         1        1              1     1     1     1

EXHIBIT 1.6 A worksheet with run order randomized. A worksheet for the same experiment as in Exhibit 1.5, except that now run order is randomized.

    Run   Experimental conditions    Observations
     i       t_i1     t_i2               y_i
     1         1       -1
     2         0        1
     3         0       -1
     4         1        1
     5         1        1
     6         1       -1
     7        -1       -1
     8        -1       -1
     9         0        1
    10        -1        1
    11        -1        1
    12         0       -1

For this reason it is instructive to differentiate between a design matrix and a model matrix. A design matrix is any matrix that determines a design τ_n for sample size n on the experimental domain 𝒯, while the model matrix X of Section 1.3 also reflects the modeling assumptions.

The transition from designs for finite sample size to designs for infinite sample size originates from the usual step towards asymptotic statistics, of letting sample size n tend to infinity in order to obtain procedures that do not depend on n. It also pleases the mathematical mind to formulate a tractable, smooth problem. Lemma 1.26 provides a first indication of the type of smoothness involved.

The convex hull, conv S, of a subset S of some linear space consists of all convex combinations ∑_{i≤t} α_i s_i, with a finite number t of points s_i from S and with arbitrary positive weights α_i summing to 1. In the space Sym(k) of symmetric matrices, the set of all moment matrices, M(Ξ), is the convex hull of the set S = {xx' : x ∈ 𝒳} of the rank one matrices formed from regression vectors x. We now show that the set M(Ξ) is also compact.

1.26. GEOMETRY OF THE SET OF ALL MOMENT MATRICES

Lemma. If the regression range 𝒳 is a compact set in ℝ^k, then the set M(Ξ) of moment matrices of all designs on 𝒳 is a compact and convex subset of the cone NND(k).

Proof. Being a convex hull, the set M(Ξ) is convex. Since the generating set S = {xx' : x ∈ 𝒳} is included in NND(k) so is M(Ξ). Moreover, S is compact, being the image of the compact set 𝒳 under the continuous mapping x ↦ xx' from 𝒳 to Sym(k). In Euclidean space, the convex hull of a compact set is compact, thus completing the proof.

In general, the set M(Ξ_n) of moment matrices obtained from all experimental designs for sample size n fails to be convex, but it is still compact. To see compactness, note that such a design ξ_n is characterized by n regression vectors x_1, …, x_n, counting multiplicities. The set of moment matrices M(Ξ_n) thus consists of the continuous images M(ξ_n) = ∑_{j≤n} x_j x_j' of the n-fold Cartesian product 𝒳^n of the regression range with itself. If 𝒳 is compact, so are 𝒳^n and the continuous image M(Ξ_n).

The regression range 𝒳 is compact, of course, if the experimental domain 𝒯 is compact and the regression function f is continuous.

The closedness property that is implied by compactness is of a more technical nature. However, boundedness is persuasive also on practical grounds: The fitting of a linear model would appear to adequately approximate the true expected response over a bounded region only. With an unbounded regression range, we would be comparing experiments in rather different environments.

1.27. DESIGNS FOR TWO-WAY CLASSIFICATION MODELS

We illustrate the basic concepts with the examples of Section 1.5. In the two-sample problem, an experimental design for finite sample size n is of the form ξ_n((1,0)') = n_1 and ξ_n((0,1)') = n_2, with n_1 + n_2 = n. It directs the experimenter to observe n_i responses from population i. Its moment matrix is

    M(ξ_n) = ( n_1   0  )
             (  0   n_2 ).

The optimal estimator for θ = (α_1, α_2)' becomes

    (X'X)^{-1}X'Y = (ȳ_1., ȳ_2.)',

using the common abbreviations y_i. = ∑_{j≤n_i} y_ij and ȳ_i. = (1/n_i) y_i. for i = 1, 2. This is the familiar result that the optimal estimators for the population means are the sample averages within each population.

Generalization to the one-way classification model is straightforward. A finite sample size design is given by ξ_n(e_i) = n_i, calling for n_i observations to be made at level i. Its moment matrix is diagonal with entries n_i. The optimal estimator for the mean effect of level i is the within-group sample average ȳ_i..

Two-way classification models lead to something more interesting. Here an experimental design ξ_n for sample size n is of the form ξ_n(i, j) = n_ij, with ∑_{i≤a} ∑_{j≤b} n_ij = n. More generally, a design ξ for infinite sample size is best thought of as an a × b weight matrix W, with the (i, j)th entry w_ij equal to the weight that ξ assigns to the level combination (i, j). Since the numbers w_ij are nonnegative and sum to 1, the rectangular matrix W is a probability distribution on the rectangular domain 𝒯. (It is the same as the measure τ from Section 1.25, but is now preferably thought of as a matrix. The design ξ ∈ Ξ differs from its weight matrix W ∈ T only in that ξ lives on the regression range 𝒳, while W lives on the experimental domain 𝒯.) The weight w_ij determines the fraction of responses to be observed with factor A on level i and factor B on level j. We call W an a × b block design.

Let 1_a denote the unity vector, that is, the a × 1 vector with all components equal to 1. Given a weight matrix W, its row sum vector r, its column sum vector s, and their entries are given by

    r = W1_b,    s = W'1_a,    r_i = ∑_{j≤b} w_ij,    s_j = ∑_{i≤a} w_ij.

The components r_i and s_j give the proportions of observations that use level i of factor A and level j of factor B, respectively.

In many applications, factor A is some sort of "treatment" on which interest concentrates, and r represents the treatment replication vector. Factor B is a "blocking factor" of only secondary interest, but is necessary to represent the experiment by a linear model, and s is then called the blocksize vector.

We emphasize that, in our development, the replication number r_i of treatment i and the size s_j of block j are relative frequencies rather than absolute frequencies.

If ξ_n ∈ Ξ_n is a design for sample size n and the standardized design ξ_n/n ∈ Ξ has weight matrix W ∈ T, then the a × b matrix

    N = nW = (n_ij)

has integer entries n_ij and is called the incidence matrix of the design ξ_n. Either W or N may serve as design matrix, in the sense of Section 1.26.

In order to find the moment matrix, we must specify the model. As in Section 1.5, we consider the two-way classification model with no interaction,

We again take Δ_r to be the a × a diagonal matrix with row sum vector r on the diagonal, while Δ_s is the b × b diagonal matrix formed from the column sum vector s. With this notation, the moment matrix M of a design ξ has an appealing representation in terms of its associated weight matrix W,

    M(ξ) = ( Δ_r   W  )
           ( W'   Δ_s ).

However, the optimal estimator (nM)^{-1}X'Y fails to exist since M is singular! Indeed, postmultiplying M by (1_a', −1_b')' gives the null vector. This means that only certain subsystems of the mean parameter vector θ are identifiable, as we shall see in Section 3.20.
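An editorial sketch, not in the original text, assembles this block moment matrix from a weight matrix W and confirms its singularity (assuming NumPy; the uniform 2 × 3 design is our own example).

    import numpy as np

    def block_design_moment_matrix(W):
        """Moment matrix [[Delta_r, W], [W', Delta_s]] of an a x b block design W."""
        W = np.asarray(W, dtype=float)
        r = W.sum(axis=1)                       # treatment replication vector
        s = W.sum(axis=0)                       # blocksize vector
        top = np.hstack([np.diag(r), W])
        bottom = np.hstack([W.T, np.diag(s)])
        return np.vstack([top, bottom])

    W = np.full((2, 3), 1.0 / 6.0)              # uniform weights on a 2 x 3 domain
    M = block_design_moment_matrix(W)
    a, b = W.shape
    v = np.concatenate([np.ones(a), -np.ones(b)])
    print(np.allclose(M @ v, 0.0))              # M is singular: M (1_a', -1_b')' = 0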

EXHIBIT 1.7 Experimental domain designs and regression range designs. Top: a design τ on the experimental domain 𝒯 = [−1;1]. Left: the induced design ξ on the regression range 𝒳 ⊆ ℝ² for the line fit model. Right: the induced design ξ on the regression range 𝒳 ⊆ ℝ³ for the parabola fit model.

1.28. DESIGNS FOR POLYNOMIAL FIT MODELS

In the polynomial fit model of Section 1.6, the compactness assumption of Lemma 1.26 is satisfied if the interval figuring as experimental domain 𝒯 is compact. Any design τ ∈ T, on the experimental domain 𝒯, induces a design ξ ∈ Ξ, on the regression range 𝒳, as discussed in Section 1.25. Exhibit 1.7 sketches this relation, for polynomial fit models of degree one and two over the symmetric unit interval 𝒯 = [−1;1].

While the design ξ on 𝒳 incorporates the model assumptions through the support points x ∈ 𝒳, this is not so for a design τ on 𝒯. For this reason, we denote the moment matrix for τ by M_d(τ) = ∫_𝒯 f(t)f(t)' dτ, indicating the degree of the model through the subscript d. The resulting matrices are classical moment matrices in that they involve the moments μ_j of τ, for j = 0, …, 2d,

    M_d(τ) = ( μ_{i+j} )_{i,j=0,…,d},    where μ_j = ∫_𝒯 t^j dτ.
In general, closed form inversion of M_d(τ) is no longer possible, and the optimal estimator must be found numerically.
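A brief editorial sketch, not in the original text, computes M_d(τ) from the moments of a finitely supported design on [−1;1] (assuming NumPy; the three point design is our own example).

    import numpy as np

    def polynomial_moment_matrix(points, weights, degree):
        """M_d(tau) with entries mu_{i+j}, the moments of tau up to order 2d."""
        t = np.asarray(points, dtype=float)
        w = np.asarray(weights, dtype=float)
        mu = np.array([np.sum(w * t**j) for j in range(2 * degree + 1)])
        return np.array([[mu[i + j] for j in range(degree + 1)]
                         for i in range(degree + 1)])

    # A three point design on T = [-1, 1] for the parabola fit model (d = 2):
    M2 = polynomial_moment_matrix([-1.0, 0.0, 1.0], [0.25, 0.5, 0.25], degree=2)
    print(M2)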

It may happen that interest concentrates not on the full mean parameter vector, but on a single component of the mean parameters. For instance, the experimenter may wish to learn more about the coefficient θ_d of the highest term t^d of a dth-degree polynomial regression function. The surprising finding is that for one-dimensional parameter systems an explicit construction of optimal designs becomes feasible, with the help of some convex geometry.

EXERCISES

1.1 Show that a saturated m-way dth-degree model has (d+m choose m) mean parameters. This is the number of ways d "exponent units" can be put into m + 1 compartments labeled 1, t_1, …, t_m.

1.2 Verify that a one-way dth-degree model matrix X = (x_1, …, x_{d+1})', with rows x_i = (1, t_i, …, t_i^d)', is a Vandermonde matrix [Horn and Johnson (1985), p. 29]. Discuss its rank.

1.3 Show that the eigenvalues of P are 1 or 0, and rank P = trace P, for every projector P.

1.4 Show that P ≥ Q if and only if range P ⊇ range Q, for all orthogonal projectors P, Q.

1.5 Show that p_ii = ∑_j p_ij² and |p_ij| ≤ max{p_ii, p_jj} ≤ 1, for every orthogonal projector P.

1.6 Let P = X(X'X)^{-1}X' be the orthogonal projector originating from an n × k model matrix X of rank k, with rows x_i. Show that (i) p_ii = x_i'(X'X)^{-1}x_i, (ii) ∑_i p_ii/n = k/n, (iii) P ≤ XX'/λ_min(X'X) and max_i p_ii ≤ R²/λ_min(X'X) = c/n, with c = R²/λ_min(M) ≥ 1, where R = max_i ‖x_i‖ and M = (1/n)X'X, (iv) if 1_n ∈ range X then P ≥ (1/n)1_n1_n' and min_i p_ii ≥ 1/n.

1.7 Discuss the following three equivalent versions of the Gauss-Markov Theorem:

    i.   LVL' ≥ X(X'V⁻X)⁻X' for all L ∈ ℝ^{n×n} with LX = X.
    ii.  V ≥ X(X'V⁻X)⁻X'.
    iii. trace VW ≥ trace X(X'V⁻X)⁻X'W for all W ∈ NND(n).

1.8 Let the responses Y_1, …, Y_n have an exchangeable (or completely symmetric) dispersion structure V = σ²I_n + σ²ρ(1_n1_n' − I_n), with variance σ² > 0 and correlation ρ. Show that V is positive definite if and only if ρ ∈ (−1/(n−1); 1).

1.9 (continued) Let ξ_n be a design for sample size n, with model matrix X, standardized moment matrix M = X'X/n = ∫ xx' dξ_n/n, and standardized mean vector m = X'1_n/n = ∫ x dξ_n/n. Show that lim_{n→∞} (σ²/n) X'V^{-1}X = (M − mm')/(1 − ρ), provided ρ ∈ (0; 1). Discuss M − mm', as a function of ξ = ξ_n/n.

C H A P T E R 2

Optimal Designs for Scalar Parameter Systems

In this chapter optimal experimental designs for one-dimensional parameter systems are derived. The optimality criterion is the standardized variance of the minimum variance unbiased linear estimator. A discussion of estimability leads to the introduction of a certain cone of matrices, called the feasibility cone. The design problem is then one of minimizing the standardized variance over all moment matrices that lie in the feasibility cone. The optimal designs are characterized in a geometric way. The approach is based on the set of cylinders that include the regression range, and on the interplay of the design problem and a dual problem. The construction is illustrated with models that have two or three parameters.

2.1. PARAMETER SYSTEMS OF INTEREST AND NUISANCE PARAMETERS

Our aim is to characterize and compute optimal experimental designs. Any concept of optimality calls upon the experimenter to specify the goals of the experiment; it is only relative to such goals that optimality properties of a design would make any sense.

In the present chapter we presume that the experimenter's goal is point estimation in a classical linear model, as set forth in Section 1.23. There we concentrated on the full mean parameter vector.

However, the full parameter system often splits into a subsystem of interest and a complementary subsystem of nuisance parameters. Nuisance parameters assist in formulating a statistical model that adequately describes the experimental reality, but the primary concern of the experiment is to learn more about the subsystem of interest. Therefore the performance of a design is evaluated relative to the subsystem of interest, only.

One-dimensional subsystems of interest are treated first, in the present chapter. They are much simpler and help motivate the general results for multidimensional subsystems. The general discussion is taken up in Chapter 3.

2.2. ESTIMABILITY OF A ONE-DIMENSIONAL SUBSYSTEM

As before, let θ be the full k × 1 vector of mean parameters in a classical linear model,

    E[Y] = Xθ,        D[Y] = σ²I_n.

Suppose the system of interest is given by c'θ, where the coefficient vector c ∈ ℝ^k is prescribed prior to experimentation. To avoid trivialities, we assume c ≠ 0.

The most important special case is an individual parameter component θ_j, which is obtained from c'θ with c the Euclidean unit vector e_j in the space ℝ^k. Or interest may be in the grand mean θ̄ = ∑_{j≤k} θ_j/k, which is of the form c'θ if c is chosen to be 1_k/k, with the k × 1 unity vector 1_k.

An estimator for c'θ is said to be optimal when it minimizes the variance among all unbiased linear estimators for c'θ, as discussed in Section 1.23. Let us first look at the unbiasedness requirement. An estimator u'Y, with u ∈ ℝ^n, is unbiased for c'θ if and only if

    E[u'Y] = u'Xθ = c'θ    for all θ ∈ ℝ^k.

This reduces to the relation c = X'u between the vectors c and u which determine the subsystem of interest and the linear estimator.

The parameter system c'θ is estimable if and only if there exists at least one vector u ∈ ℝ^n such that the linear estimator u'Y is unbiased for c'θ (see Section 1.18). This entails u'X = c'. Hence c'θ is estimable if and only if the vector c lies in the range of the matrix X'. But this matrix has the same range as X'X, and its associated moment matrix M = (1/n)X'X, by Section 1.14. Therefore a design ξ, as introduced in Section 1.24, renders a subsystem c'θ estimable if and only if the estimability condition c ∈ range M(ξ) is satisfied.

It is conceptually helpful to separate the estimability condition from any reference to designs ξ. Upon defining the subset 𝒜(c) of nonnegative definite matrices that have a range containing the vector c,

    𝒜(c) = {A ∈ NND(k) : c ∈ range A},

the estimability condition is cast into the form M(ξ) ∈ 𝒜(c). This form more clearly displays the present state of affairs. The key variables are the moment matrices M(ξ), whereas the coefficient vector c is fixed.
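An editorial sketch of the estimability condition c ∈ range M, not in the original text (assuming NumPy; the least squares residual test is our own device).

    import numpy as np

    def is_estimable(c, M, tol=1e-9):
        """c'theta is estimable under M iff c lies in the range of M."""
        c = np.asarray(c, dtype=float)
        M = np.asarray(M, dtype=float)
        residual = c - M @ np.linalg.lstsq(M, c, rcond=None)[0]
        return np.linalg.norm(residual) <= tol * max(1.0, np.linalg.norm(c))

    M = np.diag([1.0, 0.0])                 # a singular moment matrix
    print(is_estimable([1.0, 0.0], M))      # True:  e_1 lies in range M
    print(is_estimable([0.0, 1.0], M))      # False: e_2 does not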

The matrix subset 𝒜(c) is called the feasibility cone for c'θ. In order to see what it looks like, we need an auxiliary result that, for nonnegative definite matrices, the range of a sum is the sum of the ranges. The sum 𝒦 + ℒ of two subspaces 𝒦 and ℒ comprises all vectors x + y with x ∈ 𝒦 and y ∈ ℒ.

2.3. RANGE SUMMATION LEMMA

Lemma. Let A and B be nonnegative definite k × k matrices. Then we have

    range(A + B) = (range A) + (range B).

In particular, if A ≥ B ≥ 0 then the range of A includes the range of B.

Proof. A passage to the orthogonal complement based on Lemma 1.13 transforms the range formula into a nullspace formula,

    nullspace(A + B) = (nullspace A) ∩ (nullspace B).
The converse inclusion is obvious. It remains to consider the direct inclusion. If (A + B)z = 0, then also z'Az + z'Bz = 0. Because of nonnegative definiteness, both terms vanish individually. However, for a nonnegative definite matrix A, the scalar z'Az vanishes if and only if the vector Az is null, as utilized in Section 1.14.

In particular, if A ≥ B ≥ 0, then A = B + C with C = A − B ≥ 0. Therefore range A = (range B) + (range C) ⊇ range B.

The following description shows that every feasibility cone lies somewhere between the open cone PD(k) and the closed cone NND(k). The description lacks an explicit characterization of which parts of the boundary NND(k) \ PD(k) belong to the feasibility cone, and which parts do not.

2.4. FEASIBILITY CONES

Theorem. The feasibility cone 𝒜(c) for c'θ is a convex subcone of NND(k) which includes PD(k).

Proof. By definition, 𝒜(c) is a subset of NND(k). Every positive definite k × k matrix lies in 𝒜(c). The cone property and the convexity property jointly are the same as

    A, B ∈ 𝒜(c) and δ > 0   ⟹   δA ∈ 𝒜(c) and A + B ∈ 𝒜(c).

The first property is evident, since multiplication by a nonvanishing scalar does not change the range of a matrix. That 𝒜(c) contains the sum of any two of its members follows from Lemma 2.3.

EXHIBIT 2.1 The ice-cream cone. The cone NND(2) in Sym(2) is isomorphic to the ice-cream cone in ℝ³.

2.5. THE ICE-CREAM CONE

Feasibility cones reappear in greater generality in Section 3.3. At this point, we visualize the situation for the smallest nontrivial order, k = 2. The space Sym(2) has typical members

    A = ( α  β )
        ( β  γ ),    with α, β, γ ∈ ℝ,

and hence is of dimension 3. We claim that in Sym(2), the cone NND(2) looks like the ice-cream cone of Exhibit 2.1 which is given by

    { (x, y, z)' ∈ ℝ³ : (x² + y²)^{1/2} ≤ z }.
This claim is substantiated as follows. The mapping from Sym(2) into ℝ³ which takes

    A = ( α  β )
        ( β  γ )

into (α, √2 β, γ)' is linear and preserves scalar products. Linearity is evident. The scalar products coincide since for

    A = ( α  β ),        B = ( a  b )
        ( β  γ )             ( b  c )

we have

    ⟨A, B⟩ = trace AB = αa + 2βb + γc = (α, √2 β, γ)(a, √2 b, c)'.

Thus the matrix

    A = ( α  β )
        ( β  γ )
and the vector (a, \/2/3, y)' enjoy identical geometrical properties.For a more transparent coordinate representation, we apply a further or-

thogonal transformation into a new coordinate system,

Hence the matrix

is mapped into the vector

while the vector (jc,_y,z) ' is mapped into the matrix

For instance, the identity matrix /2 corresponds to the vector

Page 73: Pukelsheim Optimal DoE

40 CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS

The matrix

    A = ( α  β )
        ( β  γ )

is nonnegative definite if and only if αγ − β² ≥ 0 as well as α ≥ 0 and γ ≥ 0. In the new coordinate system, this translates into z² ≥ x² + y² as well as z ≥ y and z ≥ −y. These three properties are equivalent to z ≥ (x² + y²)^{1/2}. Therefore the cone NND(2) is isometrically isomorphic to the ice-cream cone, and our claim is proved.

The interior of the ice-cream cone is characterized by strict inequality, (x² + y²)^{1/2} < z. This corresponds to the fact that the closed cone NND(2) has the open cone PD(2) for its interior, by Lemma 1.9. This correspondence is a consequence of the isomorphism just established, but is also easily seen directly as follows. A singular 2 × 2 matrix has rank equal to 0 or 1. The null matrix is the tip of the ice-cream cone. Otherwise we have A = dd' for some nonvanishing vector

    d = ( a )
        ( b ).

Such a matrix

    dd' = ( a²   ab )
          ( ab   b² )

is mapped into the vector

    ( √2 ab, (b² − a²)/√2, (a² + b²)/√2 )',

which satisfies x² + y² = z² and hence lies on the boundary of the ice-cream cone.

What does this geometry mean for the feasibility cone 𝒜(c)? In the first place, feasibility cones contain all positive definite matrices, as stated in Theorem 2.4. A nonvanishing singular matrix A = dd' fulfills the defining property of the feasibility cone, c ∈ range dd', if and only if the vectors c and d are proportional. Thus, for dimension k = 2, we have

    𝒜(c) = PD(2) ∪ { δcc' : δ > 0 }.

In geometric terms, the cone 𝒜(c) consists of the interior of the ice-cream cone and the ray emanating in the direction (2ab, b² − a², b² + a²)' for c = (a, b)'.

2.6. OPTIMAL ESTIMATORS UNDER A GIVEN DESIGN

We now return to the task set in Section 2.2 of determining the optimal estimator for c'θ. If estimability holds under the model matrix X, then the parameter system satisfies c'θ = u'Xθ for some vector u ∈ ℝ^n. The Gauss-Markov Theorem 1.20 determines the optimal estimator for c'θ to be

    c'(X'X)⁻X'Y.

Whereas the estimator involves the model matrix X, its variance depends merely on the associated moment matrix M = (1/n)X'X,

    V[c'(X'X)⁻X'Y] = σ² c'(X'X)⁻c = (σ²/n) c'M⁻c.

Up to the common factor σ²/n, the optimal estimator has variance c'M(ξ)⁻c. It depends on the design ξ only through the moment matrix M(ξ).

2.7. THE DESIGN PROBLEM FOR SCALAR PARAMETER SUBSYSTEMS

We recall from Section 1.24 that Ξ is the set of all designs, and that M(Ξ) is the set of all moment matrices. Let c ≠ 0 be a given coefficient vector in ℝ^k. The design problem for a scalar parameter system c'θ can now be stated as follows:

    Minimize  c'M⁻c    subject to  M ∈ M(Ξ) ∩ 𝒜(c).

In short, this means minimizing the variance subject to estimability. Notice that the primary variables are matrices rather than designs! The optimal variance of this problem is, by definition,

    inf { c'M⁻c : M ∈ M(Ξ) ∩ 𝒜(c) }.

A moment matrix M is called optimal for c'θ in M(Ξ) when M lies in the feasibility cone 𝒜(c) and c'M⁻c attains the optimal variance. A design ξ is called optimal for c'θ in Ξ when its moment matrix M(ξ) is optimal for c'θ in M(Ξ).

In the design problem, the quantity c'M⁻c does not depend on the choice of generalized inverse for M, and is positive. This has nothing to do with M being a moment matrix but hinges entirely on the feasibility cone 𝒜(c).

Namely, if A lies in 𝒜(c), then Lemma 1.17 entails that c'A⁻c is well defined and positive.

Another way of verifying these properties is as follows. Every matrix A ∈ 𝒜(c) admits a vector h ∈ ℝ^k solving c = Ah. Therefore we obtain

    c'A⁻c = h'AA⁻Ah = h'Ah,

saying that c'A⁻c is well defined. Furthermore h'Ah = 0 if and only if Ah = 0, by our considerations on matrix square roots in Section 1.14. The assumption c ≠ 0 then forces c'A⁻c = h'Ah > 0.
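An editorial sketch, not in the original text, evaluates the criterion c'M⁻c for two designs in the line fit model and compares them for the slope coefficient (assuming NumPy; the two moment matrices are our own illustration).

    import numpy as np

    def standardized_variance(c, M):
        """c'M^-c, the variance of the optimal estimator up to the factor sigma^2/n."""
        c = np.asarray(c, dtype=float)
        return float(c @ np.linalg.pinv(M) @ c)

    # Line fit model over [-1, 1]: compare two designs for c = (0, 1)' (the slope).
    c = np.array([0.0, 1.0])
    M_even = 0.5 * (np.outer([1, -1], [1, -1]) + np.outer([1, 1], [1, 1]))
    M_skew = 0.25 * np.outer([1, -1], [1, -1]) + 0.75 * np.outer([1, 1], [1, 1])
    print(standardized_variance(c, M_even), standardized_variance(c, M_skew))   # 1.0 vs 4/3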

2.8. DIMENSIONALITY OF THE REGRESSION RANGE

Every optimization problem poses the immediate question whether it is nonvacuous. For the design problem, this means whether the set of moment matrices M(Ξ) and the feasibility cone 𝒜(c) intersect.

The range of a moment matrix M(ξ) is the same as the subspace ℒ(supp ξ) ⊆ ℝ^k that is spanned by the support points of the design ξ. If ξ is supported by the regression vectors x_1, …, x_ℓ, say, then Lemma 2.3 and Section 1.14 entail

    range M(ξ) = ∑_{i≤ℓ} range x_i x_i' = ℒ(supp ξ).

Therefore the intersection M(Ξ) ∩ 𝒜(c) is nonempty if and only if the coefficient vector c lies in the regression space ℒ(𝒳) ⊆ ℝ^k spanned by the regression range 𝒳 ⊆ ℝ^k.

A sufficient condition is that 𝒳 contains k linearly independent vectors x_1, …, x_k. Then the moment matrix (1/k)∑_{i≤k} x_i x_i' is positive definite, whence M(Ξ) meets 𝒜(c) for all c ≠ 0.

EXHIBIT 2.2 Two Elfving sets. Left: the Elfving set for the line fit model over [−1;1] is a square. Right: the Elfving set for the parabola fit model has no familiar shape. The rear dashed arc standing up on the (x, y)-plane is 𝒳, the front solid arc that is hanging down is −𝒳.

2.9. ELFVING SETS

The central tools to solve the optimal design problem are the Elfving set ℛ, and the set 𝒩 of cylinders including it. First we define the Elfving set ℛ to be the convex hull of the regression range 𝒳 and its negative image −𝒳 = {−x : x ∈ 𝒳},

    ℛ = conv( 𝒳 ∪ (−𝒳) ).

Two instances of an Elfving set are shown in Exhibit 2.2. In order to develop a better feeling for Elfving sets, we recall that a moment matrix M(ξ) = ∫_𝒳 xx' dξ consists of the second order uncentered moments of the design ξ. Hence it is invariant under a change of sign of x. This may serve as a justification to symmetrize the regression range by adjoining to 𝒳 its negative image −𝒳.

Now consider a design η on the symmetrized regression range 𝒳 ∪ (−𝒳). Suppose η has ℓ support points. These are of the form ε_i x_i, with x_i ∈ 𝒳 and ε_i ∈ {−1, +1}, carrying weights w_i > 0 that sum to 1. Hence

    ∑_{i≤ℓ} w_i ε_i x_i

is the mean vector of η. This expression also appears as the generic term in the formation of the convex hull in the definition of ℛ. From this point of view, the Elfving set ℛ consists of the mean vectors of all designs on the symmetrized regression range 𝒳 ∪ (−𝒳).

However, the geometric shape of ℛ turns out to be more important. The Elfving set ℛ is a symmetric compact convex subset of ℝ^k that contains the origin in its relative interior. Indeed, symmetry and convexity are built into the definition. Compactness holds true if the regression range 𝒳 is compact, with the same argument as in Lemma 1.26. The relative interior of ℛ is the interior relative to the subspace ℒ(𝒳) that is generated by x ∈ 𝒳. Specifically, with regression vectors x_1, …, x_ℓ ∈ 𝒳 that span ℒ(𝒳), the Elfving set ℛ includes the polytope conv{±x_1, …, ±x_ℓ} which, in turn, contains a ball open relative to ℒ(𝒳) around the origin. Hence the origin lies in the relative interior of ℛ.

2.10. CYLINDERS THAT INCLUDE THE ELFVING SET

If the regression range 𝒳 consists of a finite number of points, then the Elfving set ℛ is a polytope, that is, the convex hull of a finite set. It then has vertices and faces. This is in contrast to the smooth boundary of the ball {z ∈ ℝ^k : z'Nz ≤ 1} that is obtained from the scalar product ⟨x, y⟩_N = x'Ny, as given by a positive definite matrix N. Such a scalar product ball has no vertices or faces. Nevertheless Elfving sets and scalar product balls are linked to each other in an intrinsic manner.

A scalar product ball given by a positive definite matrix N is an ellipsoid, because of the full rank of N. If we drop the full rank assumption, the ellipsoid may degenerate to a cylinder. For a nonnegative definite matrix N ∈ NND(k), we call the set of vectors

    { z ∈ ℝ^k : z'Nz ≤ 1 }

the cylinder induced by N. It includes the nullspace of N, or in geometric terminology, it recedes to infinity in all directions of this nullspace (see Exhibit 2.3).

Elfving sets allow many shapes other than cylinders. However, we may approximate a given Elfving set ℛ from the outside, by considering all cylinders that include ℛ. Since cylinders are symmetric and convex, inclusion of ℛ is equivalent to inclusion of the regression range 𝒳. Identifying a cylinder with the matrix inducing it, we define the set 𝒩 of cylinders that include ℛ,

EXHIBIT 2.3 Cylinders. Left: the cylinder induced by a rank deficient matrix N recedes to infinity in the directions of the nullspace of N. Right: the cylinder induced by a positive definite matrix N is a compact ellipsoid, and has no direction of recession.

or X, by

These geometric considerations may sound appealing or not, we have yetto convince ourselves that they can be put to good use. The key result is thatthey provide bounds for the design problem, as follows.

2.11. MUTUAL BOUNDEDNESS THEOREM FOR SCALAR OPTIMALITY

Theorem. Let M be a moment matrix that is feasible for c'θ, M ∈ M(Ξ) ∩ A(c), and let N be a cylinder that includes the regression range X, N ∈ 𝒩.

Then we have c'M⁻c ≥ c'Nc, with equality if and only if M and N fulfill conditions (1) and (2) given below. More precisely, we have

    c'M⁻c  ≥  (c'M⁻c) trace MN  ≥  c'Nc,

with respective equality if and only if, for every design ξ ∈ Ξ which has moment matrix M,

    (1)  x'Nx = 1    for all x in the support of ξ,
    (2)  MN = c(c'M⁻c)⁻¹c'N.


Proof. Inequality (i) follows from the assumption that M is the moment matrix of a design ξ ∈ Ξ, say. By definition of 𝒩, we have 1 ≥ x'Nx for all x ∈ X, and integration with respect to ξ leads to

    1 ≥ ∫ x'Nx dξ = trace MN.

Moreover the upper bound is attained if and only if x'Nx = 1, for all x in the support of ξ, thereby establishing condition (1).

Inequality (ii) relates to the assumption that M lies in the feasibility cone A(c). The fact that c ∈ range M opens the way to applying the Gauss-Markov Theorem 1.20 to obtain M ≥ min_{L ∈ ℝ^{k×k}: Lc=c} LML' = c(c'M⁻c)⁻¹c'. Since N is nonnegative definite, the linear form A ↦ trace AN is isotonic, by Section 1.11. This yields

    trace MN ≥ trace c(c'M⁻c)⁻¹c'N = (c'M⁻c)⁻¹ c'Nc.

It remains to show that in this display, equality forces condition (2), the converse being obvious. We start from two square root decompositions M = KK' and N = HH', and introduce the matrix

The definition of A does not depend on the choice of generalized inverses for M. We know this already from the expression c'M⁻c. Because of M ∈ A(c), we have MM⁻c = c. For K'M⁻c, invariance then follows from K = MM⁻K and K'Gc = K'M⁻MGMM⁻c = K'M⁻c, for G ∈ {M⁻}. Next we compute the squared norm of A:

Thus equality in (ii) implies that A has norm 0 and hence vanishes. Pre- and postmultiplication of A by K and H' gives 0 = MN − c(c'M⁻c)⁻¹c'N, which is condition (2).

Essentially the same result will be met again in Theorem 7.11 where we discuss a multidimensional parameter system. The somewhat strange sequencing of vectors, scalars, and matrices in condition (2) is such as to readily carry over to the more general case.


The theorem suggests that the design problem is accompanied by the dual problem:

    Maximize  c'Nc    subject to  N ∈ 𝒩.

The two problems bound each other in the sense that every feasible value for one problem provides a bound for the other:

    c'Nc ≤ c'M⁻c    for all N ∈ 𝒩 and all M ∈ M(Ξ) ∩ A(c).

Equality holds as soon as we find matrices M and N such that c'M⁻c = c'Nc. Then M is an optimal solution of the design problem, and N is an optimal solution of the dual problem, and M and N jointly satisfy conditions (1) and (2) of the theorem.

But so far nothing has been said about whether such a pair of optimal matrices actually exists. If the infimum or the supremum were not to be attained, or if they were to be separated by strict inequality (a duality gap), then the theorem would be of limited usefulness. It is at this point that scalar parameter systems permit an argument much briefer than in the general case, in that the optimal matrices submit themselves to an explicit construction.
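
The following numerical sketch, not part of the original text, illustrates the mutual bound for the line fit model over T = [−1; 1] with c = (0, 1)' (the slope): the cylinder induced by N = diag(0, 1) satisfies x'Nx = t² ≤ 1 on X, so c'Nc = 1 is a lower bound for c'M⁻c under every design, and the design with mass 1/2 at ±1 attains it. Python with NumPy is assumed; the helper names are illustrative.

# Sketch: primal values c'M^{-1}c stay above the dual value c'Nc = 1.
import numpy as np

c = np.array([0.0, 1.0])
N = np.diag([0.0, 1.0])                                   # x'Nx = t^2 <= 1 on X

def moment_matrix(points, weights):
    X = np.column_stack([np.ones(len(points)), points])   # rows (1, t)
    return X.T @ (np.asarray(weights)[:, None] * X)

rng = np.random.default_rng(0)
for _ in range(5):
    t = rng.uniform(-1, 1, size=4)
    w = rng.dirichlet(np.ones(4))
    M = moment_matrix(t, w)
    primal = c @ np.linalg.solve(M, c)                     # c'M^{-1}c
    print(primal >= c @ N @ c - 1e-12)                     # bounded below by c'Nc = 1

M_opt = moment_matrix([-1.0, 1.0], [0.5, 0.5])             # optimal design for the slope
print(c @ np.linalg.solve(M_opt, c))                       # attains the bound: 1.0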

2.12. THE ELFVING NORM

The design problem for a scalar subsystem c'θ is completely resolved by the Elfving Theorem 2.14. As a preparation, we take another look at the Elfving set ℛ in the regression space ℒ(X) ⊆ ℝ^k that is spanned by the regression range X. For a vector z ∈ ℒ(X), the number

    ρ(z) = inf{δ > 0 : z ∈ δℛ}

is the scale factor needed to blow up or shrink the set ℛ so that z comes to lie on its boundary. It is a standard fact from convex analysis that, on the space ℒ(X), the function ρ is a norm. In our setting, we call ρ the Elfving norm. Moreover, the Elfving set ℛ figures as its unit ball,

    ℛ = {z ∈ ℒ(X) : ρ(z) ≤ 1}.

Boundary points of ℛ are characterized through ρ(z) = 1, and this property of ρ is essentially all we need.

Scalar parameter systems c'θ are peculiar in that their coefficient vector c can be embedded in the regression space. This is in contrast to multidimensional parameter systems K'θ with coefficient matrices K ∈ ℝ^{k×s} of rank



s > 1. The relation between the coefficient vector c and the Elfving set ℛ is the key to the solution of the problem.

Rescaling 0 ≠ c ∈ ℒ(X) by ρ(c) places c/ρ(c) on the boundary of ℛ. As a member of the convex set ℛ, the vector c/ρ(c) admits a representation

    c/ρ(c) = Σ_{i≤ℓ} η(ε_i x_i) ε_i x_i = z,    (1)

say, with ε_i ∈ {±1} and x_i ∈ X for i = 1, …, ℓ, and η a design on the symmetrized regression range X ∪ (−X).

The fact that c/ρ(c) lies on the boundary of ℛ prevents η from putting mass at opposite points, that is, at points x₁ ∈ X and x₂ ∈ −X with x₂ = −x₁. Suppose this happens and, without loss of generality, assume η(x₁) ≥ η(−x₁) > 0. Then the vector z from (1) has norm smaller than 1:

In the inequalities, we have first employed the triangle inequality on ρ, then used ρ(ε_i x_i) ≤ 1 for ε_i x_i ∈ ℛ, and finally added 2η(−x₁) > 0. Hence we get 1 = ρ(c/ρ(c)) = ρ(z) < 1, a contradiction.

The bottom line of the discussion is the following. Given a design η on the symmetrized regression range X ∪ (−X) such that η satisfies (1), we define a design ξ on the regression range X through

    ξ(x) = η(x) + η(−x)    for all x ∈ X.

We have just shown that the two terms η(x) and η(−x) cannot be positive


at the same time. In other words the design ξ satisfies

Thus representation (1) takes the form

    c/ρ(c) = ∫_X ε(x) x dξ = Σ_{x ∈ supp ξ} ξ(x) ε(x) x,

with ε a function on X which on the support of ξ takes values ±1. These designs ξ and their moment matrices M(ξ) will be shown to be optimal in the Elfving Theorem 2.14.
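
For a finite regression range {x_1, …, x_ℓ}, the discussion above implies that ρ(c) is the smallest value of Σ_i |a_i| over all representations c = Σ_i a_i x_i, and the optimal design then puts weight |a_i|/ρ(c) on x_i. The following sketch, not part of the original text, solves this small linear program with SciPy for the three parabola points f(−1), f(0), f(1) and c = (−1, 0, 2)', reproducing the Elfving norm 5 and the weights (1/5, 3/5, 1/5) that reappear in Section 2.21.

# Sketch: Elfving norm of c over a finite regression range via an l1-minimal
# representation c = sum_i a_i x_i, solved as a linear program.
import numpy as np
from scipy.optimize import linprog

Xpts = np.array([[1, -1, 1],
                 [1,  0, 0],
                 [1,  1, 1]], dtype=float).T      # columns f(-1), f(0), f(1)
c = np.array([-1.0, 0.0, 2.0])

m = Xpts.shape[1]
# variables (p, n) with a = p - n, p, n >= 0; minimize 1'p + 1'n
res = linprog(c=np.ones(2 * m),
              A_eq=np.hstack([Xpts, -Xpts]), b_eq=c,
              bounds=[(0, None)] * (2 * m))
a = res.x[:m] - res.x[m:]
rho = np.abs(a).sum()
weights = np.abs(a) / rho

print(rho, rho**2)        # Elfving norm 5 and optimal variance 25
print(weights)            # optimal design weights (1/5, 3/5, 1/5)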

2.13. SUPPORTING HYPERPLANES TO THE ELFVING SET

Convex combinations are "interior representations". They find their counterpart in "exterior representations" based on supporting hyperplanes. Namely, since c/ρ(c) is a boundary point of the Elfving set ℛ, there exists a supporting hyperplane to the set ℛ at the point c/ρ(c), that is, there exist a nonvanishing vector h ∈ ℒ(X) ⊆ ℝ^k and a real number γ such that

    z'h ≤ γ = (c/ρ(c))'h    for all z ∈ ℛ.

Examples are shown in Exhibit 2.4. In view of the geometric shape of the Elfving set ℛ, we can simplify this condition as follows. Since ℛ is symmetric we may insert z as well as −z, turning the left inequality into |z'h| ≤ γ. As a consequence, γ is nonnegative. But γ cannot vanish, otherwise ℛ lies in a hyperplane of the regression space ℒ(X) and has empty relative interior. This is contrary to the fact, mentioned in Section 2.9, that the interior of ℛ relative to ℒ(X) is nonempty. Hence γ is positive. Dividing by γ > 0 and setting h := h/γ ≠ 0, the supporting hyperplane to ℛ at c/ρ(c) is given by

    z'h ≤ 1 = (c/ρ(c))'h    for all z ∈ ℛ,    (1)

with some vector h ∈ ℒ(X), h ≠ 0. The square of inequality (1) proves that the matrix defined by N = hh' satisfies

    x'Nx = (x'h)² ≤ 1    for all x ∈ X.


EXHIBIT 2.4 Supporting hyperplanes to the Elfving set. The diagram applies to the line fit model. Bottom: at the boundary point c shown on an edge, the unique supporting hyperplane is the one orthogonal to the vector h shown. Top: at the vertex d, the supporting hyperplanes are those orthogonal to a one-parameter family of vectors h(λ) with λ ∈ [0; 1].

Hence N lies in the set 𝒩 of cylinders that include ℛ, introduced in Section 2.10. The equality in (1) determines the particular value

    c'Nc = (c'h)² = (ρ(c))².

Therefore Theorem 2.11 tells us that (ρ(c))² is a lower bound for the optimal variance of the design problem. In fact, this is the optimal variance.

2.14. THE ELFVING THEOREM

Theorem. Assume that the regression range X ⊆ ℝ^k is compact, and that the coefficient vector c ∈ ℝ^k lies in the regression space ℒ(X) and has Elfving norm ρ(c) > 0. Then a design ξ ∈ Ξ is optimal for c'θ in Ξ if and only if there exists a function ε on X which on the support of ξ takes values ±1 such that

    c/ρ(c) = ∫_X ε(x) x dξ.

There exists an optimal design for c'θ in Ξ, and the optimal variance is (ρ(c))².

Proof. As a preamble, let us review what we have already established. The Elfving set and the Elfving norm, as introduced in the preceding sections,


are

    ℛ = conv(X ∪ (−X)),    ρ(z) = inf{δ > 0 : z ∈ δℛ}.

There exists a vector h ∈ ℒ(X) ⊆ ℝ^k that determines a supporting hyperplane to ℛ at c/ρ(c). The matrix N = hh' induces a cylinder that includes ℛ or X, as discussed in Section 2.13, with c'Nc = (ρ(c))².

Now the proof is arranged like that of the Gauss-Markov Theorem 1.19 by verifying, in turn, sufficiency, existence, and necessity. First the converse is proved. Assume there is a representation of the form c/ρ(c) = Σ_{x ∈ supp ξ} ξ(x)ε(x)x. Let M be the moment matrix of ξ. We have ε(x)x'h ≤ 1 for all x ∈ X. In view of

    1 = (c/ρ(c))'h = Σ_{x ∈ supp ξ} ξ(x) ε(x) x'h,

every support point x of ξ satisfies ε(x)x'h = 1. We get x'h = 1/ε(x) = ε(x), and

    Mh = ∫_X x x'h dξ = ∫_X ε(x) x dξ = c/ρ(c).

This shows that M lies in the feasibility cone A(c). Moreover, this yields h'Mh = c'h/ρ(c) = 1 and c'M⁻c = (ρ(c))² h'MM⁻Mh = (ρ(c))². Thus the lower bound (ρ(c))² is attained. Hence the bound is optimal, as are M, ξ, and N.

Next we tackle existence. Indeed, we have argued in Section 2.12 that c/ρ(c) does permit a representation of the form ∫_X ε(x)x dξ. The designs ξ leading to such a representation are therefore optimal.

Finally this knowledge is put to use in the direct part of the proof. If a design ξ is optimal, then M(ξ) ∈ A(c) and c'M(ξ)⁻c = (ρ(c))². Thus conditions (1) and (2) of Theorem 2.11 hold with M = M(ξ) and N = hh'. Condition (1) yields

    (x'h)² = 1    for all x in the support of ξ.    (i)

Condition (2) is postmultiplied by h/h'h. Insertion of c'M(ξ)⁻c = (ρ(c))² and c'h = ρ(c) produces

    M(ξ)h = c/ρ(c).    (ii)

Upon defining ε(x) = x'h, the quantities ε(x) have values ±1, by condition (i), while condition (ii) takes the desired form c/ρ(c) = ∫_X ε(x) x dξ.


The theorem gives the solution in terms of designs, even though the formulation of the design problem apparently favors moment matrices. This, too, is peculiar to the case of scalar parameter systems.

Next we cast the result into a form that reveals its place in the theory to be developed. To this end we need to take up once more the discussion of projections and direct sum decompositions from Section 1.22.

2.15. PROJECTORS FOR GIVEN SUBSPACES

Lemma. Let A and B be nonnegative definite k × k matrices such that the ranges provide a direct sum decomposition of ℝ^k:

    range A ⊕ range B = ℝ^k.

Then (A + B)⁻¹ is a generalized inverse of A and A(A + B)⁻¹ is the projector onto the range of A along the range of B.

Proof. The matrix A + B is nonsingular, by Lemma 2.3. Upon setting G = (A + B)⁻¹, we have I_k = AG + BG. We claim AGB = 0, which evidently follows from

nullspace AG = range B.

The direct inclusion holds since, if x is a vector such that AGx = 0, then x = AGx + BGx = BGx ∈ range B. Equality of the two subspaces is a consequence of the fact that they have the same dimension, k − rank A. Namely, because of the nonsingularity of G, k − rank A is the nullity common to AG and A. It is also the rank of B, in view of the direct sum assumption.

With AGB = 0, postmultiplication of I_k = AG + BG by A produces A = AGA. This verifies G to be a generalized inverse of A for which the projector AG has nullspace equal to the range of B.
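
A quick numerical illustration of the lemma, not part of the original text, using NumPy and randomly generated nonnegative definite matrices whose ranges are complementary with probability one:

# Sketch: G = (A + B)^{-1} is a generalized inverse of A, and AG projects
# onto range A along range B.
import numpy as np

rng = np.random.default_rng(1)
k, r = 5, 2
U = rng.standard_normal((k, r))
V = rng.standard_normal((k, k - r))
A, B = U @ U.T, V @ V.T                    # complementary ranges (almost surely)

G = np.linalg.inv(A + B)
P = A @ G                                  # candidate projector onto range A

print(np.allclose(A @ G @ A, A))           # G is a generalized inverse of A
print(np.allclose(P @ P, P))               # P is idempotent
print(np.allclose(P @ U, U))               # acts as the identity on range A
print(np.allclose(P @ V, 0 * V))           # annihilates range B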

The Elfving Theorem 2.14 now permits an alternative version that puts all the emphasis on moment matrices, as does the General Equivalence Theorem 7.14.

2.16. EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY

Theorem. A moment matrix M ∈ M(Ξ) is optimal for c'θ in M(Ξ) if and only if M lies in the feasibility cone A(c) and there exists a generalized inverse G of M such that


Proof. First the converse is proved, providing a sufficient condition for optimality. We use the inequality

for matrices A ∈ M(Ξ) ∩ A(c), obtained from the Gauss-Markov Theorem 1.20 with U = I_k, X = c, V = A. This yields the second inequality in the chain

But we have c'G'c = c'Gc = c'M⁻c, since M lies in the feasibility cone A(c). Thus we obtain c'A⁻c ≥ c'M⁻c and M is optimal.

The direct part, the necessity of the condition, needs more attention. From the proof of the Elfving Theorem 2.14, we know that there exists a vector h ∈ ℝ^k such that this vector and an optimal moment matrix M jointly satisfy the three conditions

This means that the equality c'M⁻c = c'Nc holds true with N = hh'; compare the Mutual Boundedness Theorem 2.11.

The remainder of the proof, the construction of a suitable generalized inverse G of M, is entirely based on conditions (0), (1), and (2). Given any other moment matrix A ∈ M(Ξ) belonging to a competing design η ∈ Ξ we have, from condition (0),

Conditions (1) and (2) combine into

Next we construct a symmetric generalized inverse G of M with the property that

From (4) we have c'h ≠ 0. This means that c'(αh) = 0 forces α = 0, or formally,

    (range h) ∩ (nullspace c') = {0}.


With Lemma 1.13, a passage to the orthogonal complement gives

    nullspace h' + range c = ℝ^k.

But the vector c is a member of the range of M, and the left hand sum can only grow bigger if we replace range c by range M. Hence, we get

    nullspace h' + range M = ℝ^k.

This puts us in a position where we can find a nonnegative definite k × k matrix H with a range that is included in the nullspace of h' and that is complementary to the range of M:

    range H ⊆ nullspace h',    range M ⊕ range H = ℝ^k.

The first inclusion is equivalent to range h ⊆ nullspace H, that is, Hh = 0. Now we choose for M the generalized inverse G = (M + H)⁻¹ of Lemma 2.15. Postmultiplication of I_k = GM + GH by h yields h = GMh, whence condition (5) is verified.

Putting together steps (3), (5), (2), and (4) we finally obtain, for all A ∈ M(Ξ),

2.17. BOUNDS FOR THE OPTIMAL VARIANCE

Bounds for the optimal variance (ρ(c))² with varying coefficient vector c can be obtained in terms of the Euclidean norm ‖c‖ = √(c'c) of the coefficient vector c. Let r and R be the radii of the Euclidean balls inscribed in and circumscribing the Elfving set ℛ,

    r = min_{z: ρ(z)=1} ‖z‖,    R = max_{z: ρ(z)=1} ‖z‖.

The norm ‖z‖, as a convex function, attains its maximum over all vectors z with ρ(z) = 1 at a particular vector z in the generating set X ∪ (−X). Also the norm is invariant under sign changes, whence maximization can be restricted


EXHIBIT 2.5 Euclidean balls inscribed in and circumscribing the Elfving set, for a regression range X consisting of three points in the plane.

to the regression range X. Therefore, we obtain the alternative representation

    R = max_{x ∈ X} ‖x‖.

Exhibit 2.5 illustrates these concepts. By definition we have, for all vectors c ≠ 0 in the space ℒ(X) spanned by X,

    r ≤ ‖c‖/ρ(c) ≤ R.

If the upper bound is attained, ‖c‖/ρ(c) = R, then c/ρ(c) or −c/ρ(c) lies in the regression range X. In this case, the one-point design assigning mass 1 to the single point c/ρ(c), and having moment matrix M = cc'/(ρ(c))², is optimal for c'θ in M(Ξ). Clearly, c/ρ(c) is an eigenvector of the optimal moment matrix M, corresponding to the eigenvalue c'c/(ρ(c))² = R². The following corollary shows that the eigenvector property pertains to every optimal moment matrix, not just to those stemming from one-point designs.

Attainment of the lower bound, ‖c‖/ρ(c) = r, does not generally lead to optimal one-point designs. Yet it still embraces a very similar result on the eigenvalue properties of optimal moment matrices.


2.18. EIGENVECTORS OF OPTIMAL MOMENT MATRICES

Corollary. Let the moment matrix M ∈ M(Ξ) be optimal for c'θ in M(Ξ). If 0 ≠ c ∈ ℒ(X) and ‖c‖/ρ(c) = r, that is, c/ρ(c) determines the Euclidean ball inscribed in ℛ, then c is an eigenvector of M corresponding to the eigenvalue r². If 0 ≠ c ∈ ℒ(X) and ‖c‖/ρ(c) = R, that is, c/ρ(c) determines the Euclidean ball circumscribing ℛ, then c is an eigenvector of M corresponding to the eigenvalue R².

Proof. The hyperplane given by h = c/(ρ(c)r²) or h = c/(ρ(c)R²) supports ℛ at c/ρ(c), since ‖c‖/ρ(c) is the radius of the ball inscribed in or circumscribing ℛ. The proof of the Elfving Theorem 2.14 shows that c/ρ(c) = Mh, that is, Mc = r²c or Mc = R²c.

The eigenvalues of any moment matrix M are bounded from above by R². This is an immediate consequence of the Cauchy inequality,

On the other hand, the squared in-ball radius r² need not bound the eigenvalues from below. Theorem 7.24 embraces situations in which the smallest eigenvalue is r², with rather powerful consequences. In general, however, nothing prevents M from becoming singular.

For instance, suppose the regression range is the Euclidean unit ball of the plane, X = {x ∈ ℝ² : ‖x‖ ≤ 1} = ℛ. The ball circumscribing ℛ coincides with the ball inscribed in ℛ. The only optimal moment matrix for c'θ is the rank one matrix cc'/‖c‖². Here c is an eigenvector corresponding to the eigenvalue r² = 1, but the smallest eigenvalue is zero.

The next corollary answers the question, for a given moment matrix M, which coefficient vectors c are such that M is optimal for c'θ in M(Ξ).

2.19. OPTIMAL COEFFICIENT VECTORS FOR GIVEN MOMENT MATRICES

Corollary. Let M be a moment matrix in M(Ξ). The set of all nonvanishing coefficient vectors c ∈ ℝ^k such that M is optimal for c'θ in M(Ξ) is given by the set of vectors c = δMh where δ > 0 and the vector h ∈ ℝ^k is such that it satisfies (x'h)² ≤ 1 = h'Mh for all x ∈ X.

Proof. For the direct inclusion, the formula c = ρ(c)Mh from the proof of the Elfving Theorem 2.14 represents c in the desired form. For the converse inclusion, let ξ be a design with moment matrix M and let x be a support point. From h'Mh = 1, we find that ε(x) = x'h equals ±1. With δ > 0, we get c = δMh = δ Σ_{x ∈ supp ξ} ξ(x)ε(x)x, whence optimality follows.


Thus, in terms of cylinders that include the Elfving set, a moment matrix M is optimal in M(Ξ) for at least one scalar parameter system if and only if there exists a cylinder N ∈ 𝒩 of rank one such that trace MN = 1. Closely related conditions appear in the admissibility discussion of Corollary 10.10.

2.20. LINE FIT MODEL

We illustrate these results with the line fit model of Section 1.6,

with a compact interval [a; b] as experimental domain T. The vectors (1, t)' with t ∈ [a; b] constitute the regression range X, so that the Elfving set ℛ becomes the parallelogram with vertices ±(1, a)' and ±(1, b)'. Given a coefficient vector c, it is easy to compute its Elfving norm ρ(c) and to depict c/ρ(c) as a convex combination of the four vertices of ℛ.

As an example, we consider the symmetric and normalized experimental domain T = [−1; 1]. The intercept α has coefficient vector c = (1, 0)'. A design τ on T is optimal for the intercept if and only if the first design moment vanishes, μ₁ = ∫_T t dτ = 0. This follows from the Elfving Theorem 2.14, since

    (1, 0)' = c/ρ(c) = ∫_T ε(t) (1, t)' dτ

forces ε(t) = 1 if τ(t) > 0 and μ₁ = 0. The optimal variance for the intercept is 1.

The slope β has coefficient vector c = (0, 1)'. The design τ which assigns mass 1/2 to the points ±1 is optimal for the slope, because with ε(1) = 1 and ε(−1) = −1, we get

    (0, 1)' = ½ (1, 1)' − ½ (1, −1)' = ∫_T ε(t) (1, t)' dτ.

The optimal design is unique, since there are no alternative convex representations of c. The optimal variance for the slope is 1.
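
As a numerical cross-check, not part of the original text, the following NumPy lines evaluate c'M⁻¹c for a few designs on T = [−1; 1]; any design with vanishing first moment yields variance 1 for the intercept, and the two-point design at ±1 yields variance 1 for the slope.

# Sketch: optimal variances in the line fit model over [-1, 1].
import numpy as np

def variance(points, weights, c):
    X = np.column_stack([np.ones(len(points)), points])
    M = X.T @ (np.asarray(weights)[:, None] * X)
    return c @ np.linalg.solve(M, c)

# designs with first moment zero are optimal for the intercept
print(variance([-0.5, 0.5], [0.5, 0.5], np.array([1.0, 0.0])))              # 1.0
print(variance([-1.0, 0.0, 1.0], [0.25, 0.5, 0.25], np.array([1.0, 0.0])))  # 1.0

# the design with mass 1/2 at +-1 is optimal for the slope
print(variance([-1.0, 1.0], [0.5, 0.5], np.array([0.0, 1.0])))              # 1.0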

2.21. PARABOLA FIT MODEL

In the model for a parabola fit, the regression space is three-dimensional and an illustration becomes slightly more tedious. The model equation is


with regression function f(t) = (1, t, t²)'. The Elfving set ℛ is the convex hull of the two parabolic arcs ±(1, t, t²)' with t ∈ [a; b], as shown in Exhibit 2.2.

If the experimental domain is [a; b] = [−1; 1], then the radius of the ball circumscribing ℛ is R = √3 = ‖c‖, with c = (1, ±1, 1)'. The radius of the ball inscribed in ℛ is r = 1/√5 = ‖c‖/5, with c = (−1, 0, 2)'. The vector c/5 has Elfving norm 1 since it lies in the face with vertices f(−1), −f(0), and f(1),

    c/5 = (1/5) f(−1) − (3/5) f(0) + (1/5) f(1).

Thus an optimal design for c'θ in M(Ξ) is obtained by allocating mass 1/5 at each of the two endpoints ±1 and putting the remaining mass 3/5 into the midpoint 0. Its moment matrix

    M = ( 1    0    2/5
          0    2/5  0
          2/5  0    2/5 )

has eigenvector c corresponding to the eigenvalue r² = 1/5, as stated by Corollary 2.18. The other two eigenvalues are larger, 2/5 and 6/5. We return to this topic in Section 9.12.
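
The eigenvalue statements are easy to verify numerically; the following sketch, not part of the original text, rebuilds the moment matrix of the design with weights (1/5, 3/5, 1/5) at −1, 0, 1 and checks Mc = c/5, the spectrum, and the optimal variance ρ(c)² = 25.

# Sketch: eigenstructure of the optimal moment matrix in the parabola fit model.
import numpy as np

f = lambda t: np.array([1.0, t, t * t])
points, weights = [-1.0, 0.0, 1.0], [0.2, 0.6, 0.2]
M = sum(w * np.outer(f(t), f(t)) for t, w in zip(points, weights))

c = np.array([-1.0, 0.0, 2.0])
print(M @ c, c / 5)                          # Mc = c/5, so c is an eigenvector
print(np.linalg.eigvalsh(M))                 # eigenvalues 1/5, 2/5, 6/5
print(c @ np.linalg.solve(M, c))             # optimal variance 25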

2.22. TRIGONOMETRIC FIT MODELS

An example with much smoothness is the model for a trigonometric fit of degree d,

with k = 2d + 1 mean parameters α, β₁, γ₁, …, β_d, γ_d. The experimental domain is the "unit circle" T = [0; 2π). The endpoint 2π may be omitted in view of the periodicity of the regression function

Because of this periodicity, we may as well start from the compact interval T = [0; 2π], in order to satisfy the compactness assumption of the Elfving Theorem 2.14.

For degree one, d = 1, the Elfving set ℛ is the convex hull of the two Euclidean discs {(±1, cos t, sin t)' : t ∈ [0; 2π)}. The Elfving set is a


truncated tube,

The radii from Section 2.17 are r = 1 and R = √2. Depending on where the ray generated by the coefficient vector c penetrates the boundary of ℛ, an optimal design for c'θ in M(Ξ) can be found with one or two points of support. Other optimal designs for this model are computed in Section 9.16.

2.23. CONVEXITY OF THE OPTIMALITY CRITERION

The elegant result of the Elfving Theorem 2.14 is borne out by the abundance of convexity that is present in the design problem of Section 2.7. For instance, the set of matrices M(Ξ) ∩ A(c) over which minimization takes place is the intersection of two convex sets and hence is itself convex, as follows from Lemma 1.26 and Lemma 2.4. (However, it is neither open nor closed.)

Moreover, the optimality criterion c'M⁻c that is to be minimized is a convex function of M. We did not inquire into the convexity and continuity properties of the criterion function since we could well do without these properties. For multidimensional parameter systems, solutions are in no way as explicit as in the Elfving Theorem 2.14. Therefore we have to follow up a series of intermediate steps before reaching results of comparable strength in Chapter 7. The matrix analogue of the scalar term c'M⁻c is studied first.

EXERCISES

2.1 Show that sup_{h ∈ ℝ^k} (2c'h − h'Mh) equals c'M⁻c or ∞ according as M ∈ A(c) or not [Whittle (1973), p. 129].

2.2 Show that sup_{ℓ ∈ ℝ^k: ℓ'c=1} 1/(ℓ'Mℓ) = sup_{h ∈ ℝ^k: Mh≠0} (c'h)²/(h'Mh) equals c'M⁻c or ∞ according as M ∈ A(c) or not [Studden (1968), p. 1435].

2.3 Show that

[Studden (1968), p. 1437].

2.4 Let the set ℛ ⊆ ℝ^k be symmetric, compact, convex, with zero in its interior. Show that ρ(z) = inf{δ > 0 : z ∈ δℛ} defines a norm on ℝ^k, and that the unit ball {ρ ≤ 1} of ρ is ℛ.


2.5 For the dual problem of Section 2.11, show that if N ∈ 𝒩 is an optimal solution, then so is Nc(c'Nc)⁻c'N.

2.6 On the regression range X = …, show that the unique optimal design for θ₁ in Ξ is … = 1/3 [Silvey (1978), p. 554].

2.7 In the model with regression function f(t) = … over T = [0; 1], show that the unique optimal design for θ₁ in Ξ is … = 1 [Atwood (1969), p. 1581].

2.8 In the line fit model over T = [−1; 1], show that the unique optimal design for θ₁ + θ₂ in Ξ is the one-point design with ξ(1) = 1 [Bandemer (1977), p. 217].

2.9 In the line fit model over T = [−1; 0], show that the unique optimal design for θ₁ + θ₂ in Ξ is

2.10 In the parabola fit model over T = [−1; 1], determine the designs that are optimal for c'θ in Ξ, with (i) c = (1, 1, 1)', (ii) c = (1, 0, −1)', (iii) c = (1, 0, −2)' [Lauter (1976), p. 63].

2.11 In the preceding problems, find the optimal solutions N ∈ 𝒩 of the dual problem of Section 2.11.


C H A P T E R 3

Information Matrices

Information matrices for subsystems of mean parameters in a classical linear model are studied. They are motivated through dispersion matrices of Gauss-Markov estimators, by the power of the F-test, and as Fisher information matrices. Their functional properties are derived from a representation as the minimum over a set of linear functions. The information matrix mapping then is upper semicontinuous on the closed cone of nonnegative definite matrices, but it fails to be continuous. The rank of information matrices is shown to reflect identifiability of the parameter system of interest. Most of the development is carried out for parameter systems of full rank, but is seen to generalize to the rank deficient case.

3.1. SUBSYSTEMS OF INTEREST OF THE MEAN PARAMETERS

The full mean parameter vector θ and a scalar subsystem c'θ represent just two distinct and extreme cases of a more general situation. The experimenter may wish to study s out of the total of k components, s < k, rather than being interested in all of them or a single one. This possibility is allowed for by studying linear parameter subsystems that have the form

    K'θ    for some k × s matrix K.

We call K the coefficient matrix of the parameter subsystem K'θ. One-dimensional subsystems are covered as special cases through s = 1 and K = c. The full parameter system is included through K = I_k.

The most restrictive aspect about parameter subsystems so defined is that they are linear functions of the full parameter vector θ. Nonlinear functions, such as θ₁/θ₂, or cos(θ₁ + √θ₂), say, are outside the scope of the theory that we develop. If a genuinely nonlinear problem has to be investigated, a linearization using the Taylor theorem may permit a valid analysis of the problem.


3.2. INFORMATION MATRICES FOR FULL RANK SUBSYSTEMS

What is an appropriate way of evaluating the performance of a design if the parameter system of interest is K'θ? In the first place, this depends on the underlying model which again we take to be the classical linear model,

Secondly it might conceivably be influenced by the type of inference the experimenter has in mind.

However, it is our contention that point estimation, hypothesis testing, and general parametric model-building all guide us to the same central notion of an information matrix. Its definition assumes that the k × s coefficient matrix K has full column rank s.

DEFINITION. For a design ξ with moment matrix M the information matrix for K'θ, with k × s coefficient matrix K of full column rank s, is defined to be C_K(M) where the mapping C_K from the cone NND(k) into the space Sym(s) is given by

    C_K(A) = min{LAL' : L ∈ ℝ^{s×k}, LK = I_s}.

Here the minimum is taken relative to the Loewner ordering, over all left inverses L of K. The notation L is mnemonic for a left inverse; these matrices are of order s × k and hence may have more columns than rows, deviating from the conventions set out in Section 1.7. We generally call C_K(A) an information matrix for K'θ, without regard to whether A is the moment matrix of a design or not.

The matrix C_K(A) exists as a minimum because it matches the Gauss-Markov Theorem in the form of Theorem 1.21, with X replaced by K and V by A. Moreover, with residual projector R = I_k − KL for some left inverse L of K, Theorem 1.21 offers the representations

whenever the left inverse L of K satisfies LAR' = 0. Such left inverses L of K are said to be minimizing for A. They are obtained as solutions of the linear matrix equation

    L(K, AR') = (I_s, 0).

Occasionally, we make use of the existence (rather than of the form) of left inverses L of K that are minimizing for A.
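
As a numerical aside, not part of the original text, the defining minimum property can be checked directly: for a positive definite A the value of the minimum is (K'A⁻¹K)⁻¹, and LAL' − (K'A⁻¹K)⁻¹ is nonnegative definite for every left inverse L of K. The sketch below parametrizes left inverses as L = K⁺ + Z(I_k − KK⁺) with arbitrary Z.

# Sketch: LAL' dominates C_K(A) in the Loewner ordering for random left inverses L.
import numpy as np

rng = np.random.default_rng(2)
k, s = 5, 2
K = rng.standard_normal((k, s))
root = rng.standard_normal((k, k))
A = root @ root.T + np.eye(k)                       # positive definite, hence feasible

C = np.linalg.inv(K.T @ np.linalg.solve(A, K))      # C_K(A) = (K'A^{-1}K)^{-1}

Kplus = np.linalg.pinv(K)                           # one particular left inverse
P = np.eye(k) - K @ Kplus
for _ in range(5):
    L = Kplus + rng.standard_normal((s, k)) @ P     # a random left inverse of K
    gap = L @ A @ L.T - C
    print(np.linalg.eigvalsh(gap).min() >= -1e-9)   # LAL' - C_K(A) is nonneg. definite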


We provide a detailed study of information matrices, their statistical meaning (Section 3.3 to Section 3.10) and their functional properties (Section 3.11 to Section 3.19). Finally (Section 3.20 to Section 3.25), we introduce generalized information matrices for those cases where the coefficient matrix K is rank deficient.

3.3. FEASIBILITY CONES

Two instances of information matrices have already been dealt with, even though the emphasis at the time was somewhat different. The most important case occurs if the full parameter vector θ is of interest, that is, if K = I_k. Since the unique (left) inverse L of K is then the identity matrix I_k, the information matrix for θ reproduces the moment matrix M,

    C_{I_k}(M) = M.

In other words, for a design ξ the matrix M(ξ) has two meanings. It is the moment matrix of ξ and it is the information matrix for θ. The distinction between these two views is better understood if moment matrices and information matrices differ rather than coincide.

This happens with scalar subsystems c'θ, the second special case. If the matrix M lies in the feasibility cone A(c), Theorem 1.21 provides the representation

    C_c(M) = (c'M⁻c)⁻¹.

Here the information for c'θ is the scalar (c'M⁻c)⁻¹, in contrast to the moment matrix M.

The design problem of Section 2.7 calls for the minimization of the variance c'M⁻c over all feasible moment matrices M. Clearly this is the same as maximizing (c'M⁻c)⁻¹. The task of maximizing information sounds reasonable. It is this view that carries over to greater generality.

The notion of feasibility cones generalizes from the one-dimensional discussion of Section 2.2 to cover an arbitrary subsystem K'θ, by forming the subset of nonnegative definite matrices A such that the range of K is included in the range of A. The definition does not involve the rank of the coefficient matrix K.

DEFINITION. For a parameter subsystem K'θ, the feasibility cone A(K) is defined by

    A(K) = {A ∈ NND(k) : range K ⊆ range A}.

A matrix A ∈ Sym(k) is called feasible for K'θ when A ∈ A(K); a design ξ is called feasible for K'θ when M(ξ) ∈ A(K).


Feasibility cones will also be used with other matrices in place of K. Their geometric properties are the same as in the one-dimensional case studied in Theorem 2.4. They are convex subcones of the closed cone NND(k), and always include the open cone PD(k). If the rank of K is as large as can be, k, then its feasibility cone A(K) coincides with PD(k) and is open. But if the rank of K is smaller than k, then singular matrices A may, or may not, lie in A(K), depending on whether their range includes the range of K. Here the feasibility cone is neither open nor closed.

In the scalar case, we assumed that the coefficient vector c does not vanish. More generally, suppose that the coefficient matrix K has full column rank s. Then the Gauss-Markov Theorem 1.21 provides the representation

    C_K(A) = (K'A⁻K)⁻¹    for all A ∈ A(K).

It is in this form that information matrices appear in statistical inference. The abstract definition chosen in Section 3.2 does not exhibit its merits until we turn to their functional properties.

Feasibility cones combine various inferential aspects, namely those of estimability, testability, and identifiability. The following sections elaborate on these interrelations.

3.4. ESTIMABILITY

First we address the problem of estimating the parameters of interest, K'θ, for a model with given model matrix X. The subsystem K'θ is estimable if and only if there exists at least one n × s matrix U such that

as pointed out in Section 1.18. This entails K = X'U, or equivalently, range K ⊆ range X' = range X'X. With moment matrix M = (1/n)X'X, we obtain the estimability condition

    M ∈ A(K),

that is, the moment matrix M must lie in the feasibility cone for K'θ. This visibly generalizes the result of Section 2.2 for scalar parameter systems. It also embraces the case of the full parameter system θ discussed in Section 1.23. There estimability forces the model matrix X and its associated moment matrix M to have full column rank k. In the present terms, this means M ∈ A(I_k) = PD(k).


3.5. GAUSS-MARKOV ESTIMATORS AND PREDICTORS

Assuming estimability, the optimal estimator γ̂ for the parameter subsystem γ = K'θ is given by the Gauss-Markov Theorem 1.20. Using U'X = K', the estimator γ̂ for γ is

    γ̂ = (1/n) K'M⁻X'Y.    (1)

Upon setting M = (1/n)X'X, the dispersion matrix of the estimator γ̂ is

    (σ²/n) K'M⁻K.

The dispersion matrix becomes invertible provided the coefficient matrix K has full column rank s, leading to the precision matrix

    (n/σ²) (K'M⁻K)⁻¹ = (n/σ²) C_K(M).

The common factor n/σ² does not affect the comparison of these matrices in the Loewner ordering. In view of the order reversing property of matrix inversion from Section 1.11, minimization of dispersion matrices is the same as maximization of information matrices. In this sense, the problem of designing the experiment in order to obtain precise estimators for K'θ calls for a maximization of the information matrices C_K(M).

A closely related task is that of prediction. Suppose we wish to predict additional responses Y_{n+j} = x'_{n+j}θ + e_{n+j}, for j = 1, …, s. We assemble the prediction sites x_{n+j} in a k × s matrix K = (x_{n+1}, …, x_{n+s}), and take the random vector γ̂ from (1) as a predictor for the random vector Z = (Y_{n+1}, …, Y_{n+s})'. Clearly, γ̂ is linear in the available response vector Y = (Y₁, …, Y_n)', and unbiased for the future response vector Z,

Any other unbiased linear predictor LY for Z has mean squared-error matrix

The first term appears identically in the estimation problem, the second term is constant. Hence γ̂ is an unbiased linear predictor with minimum mean squared-error matrix,


For the purpose of prediction, the design problem thus calls for minimization of K'M⁻K, as it does from the estimation viewpoint.

The importance of the information matrix C_K(M) is also justified on computational grounds. Numerically, the optimal estimate θ̂ for θ is obtained as the solution of the normal equations

    X'X θ̂ = X'Y.

Multiplication by 1/n turns the left hand side into Mθ̂. On the right hand side, the k × 1 vector T = (1/n)X'Y is called the vector of totals. If the regression vectors x₁, …, x_ℓ are distinct and if Y_{ij} for j = 1, …, n_i are the replications under x_i, then we get

    T = Σ_{i≤ℓ} (n_i/n) x_i Ȳ_i.,

where Ȳ_i. = (1/n_i) Σ_{j≤n_i} Y_{ij} is the average over all observations under regression vector x_i. With this notation, the normal equations are Mθ̂ = T.

Similarly, if the subsystem γ = K'θ is of interest then the optimal estimate γ̂ = K'θ̂ is the unique solution of the reduced normal equations

    C_K(M) γ̂ = LT,

where L is a left inverse of K that is minimizing for M. The s × 1 vector LT is called the adjusted vector of totals.

In order to verify that γ̂ solves the reduced normal equations, we use the representation γ̂ = (1/n)K'M⁻X'Y. Every left inverse L of K that is minimizing for M satisfies 0 = LMR' = LM − LML'K'. Because of MM⁻X' = X', we get what we want,

    C_K(M) γ̂ = LML'K'M⁻(1/n)X'Y = LMM⁻(1/n)X'Y = L(1/n)X'Y = LT.

Our goal of maximizing information matrices typically forces the eigenvalues to become as equal as possible. For instance, if we maximize the smallest eigenvalue of C_K(M) then, since M varies over the compact set M(Ξ), the largest eigenvalue of C_K(M) will be close to a minimum. Thus an optimal information matrix C_K(M) usually gives rise to reduced normal equations in which the condition number of the coefficient matrix C_K(M) is close to the optimum.
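
The following sketch, not part of the original text, solves the reduced normal equations for the first s components in a simulated model and checks that the result agrees with the corresponding components of the full solution; here K = (I_s, 0)', the left inverse L = (I_s, −M₁₂M₂₂⁻¹) can be checked to satisfy LMR' = 0 and hence to be minimizing for M, and C_K(M) is the Schur complement discussed in Section 3.11.

# Sketch: full versus reduced normal equations for the first s components of theta.
import numpy as np

rng = np.random.default_rng(3)
n, k, s = 40, 4, 2
X = rng.standard_normal((n, k))
Y = X @ np.arange(1.0, k + 1) + 0.1 * rng.standard_normal(n)

M = X.T @ X / n
T = X.T @ Y / n                                     # vector of totals

theta_full = np.linalg.solve(M, T)                  # full normal equations

M11, M12, M22 = M[:s, :s], M[:s, s:], M[s:, s:]
L = np.hstack([np.eye(s), -M12 @ np.linalg.inv(M22)])   # minimizing left inverse of K
C = M11 - M12 @ np.linalg.inv(M22) @ M12.T          # C_K(M), the Schur complement
gamma = np.linalg.solve(C, L @ T)                   # reduced normal equations

print(np.allclose(gamma, theta_full[:s]))           # True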

It is worthwhile to contemplate the role of the Gauss-Markov Theorem. We are using it twice. The first application is to a single model and a family of estimators, and conveys the result that the optimal estimator for K'θ has


a precision matrix proportional to C_K(M) if the given model has moment matrix M. Here the role of the Gauss-Markov Theorem is the standard one, to determine the optimal member in the family of unbiased linear estimators.

The second use of the Gauss-Markov Theorem occurs in the context of a family of models and a single estimation procedure, when maximizing C_K(M) over a set M of competing moment matrices. Here the role of the Gauss-Markov Theorem is a nonstandard one, to represent the matrix C_K(M) in a way that aids the analysis.

3.6. TESTABILITY

To test the linear hypothesis K'θ = 0, we adopt the classical linear model with normality assumption of Section 1.4. Let the response vector Y follow an n-variate normal distribution N_{μ; σ²I_n}, with mean vector of the form μ = Xθ. Speaking of the "hypothesis K'θ = 0" is a somewhat rudimentary way to describe the hypothesis H₀ that the mean vector μ is of the form Xθ for some θ ∈ ℝ^k with K'θ = 0. This is to be tested against the alternative H₁ that the mean vector μ is of the form Xθ for some θ ∈ ℝ^k with K'θ ≠ 0.

If the testing problem is to make any sense, then the hypothesis H₀ and the alternative H₁ must be disjoint. They intersect if and only if there exist some vectors θ₁ and θ₂ such that

    Xθ₁ = Xθ₂,    K'θ₁ = 0,    K'θ₂ ≠ 0.

Hence disjointness holds if and only if Xθ₁ = Xθ₂ implies K'θ₁ = K'θ₂. Because of linearity we may further switch to differences, θ₀ = θ₁ − θ₂. Thus Xθ₀ = 0 must entail K'θ₀ = 0. This is the same as requiring that the nullspace of X is included in the nullspace of K'. Using Lemma 1.13, a passage to orthogonal complements yields

    range K ⊆ range X' = range X'X.

With moment matrix M = (1/n)X'X, we finally obtain the testability condition

    M ∈ A(K),

pointing yet again to the importance of the feasibility cone A(K).

3.7. F-TEST OF A LINEAR HYPOTHESIS

Assuming testability, the F-test for the hypothesis K'θ = 0 has various desirable optimality properties. Rather than dwelling on those, we outline the


underlying rationale that ties the test statistic F to the optimal estimators for the mean vector under the full model, and under the hypothesis. We proceed in five steps.

I. First we find appropriate matrix representations for the full model, and for the hypothetical model. The full model permits any mean vector of the form μ = Xθ for some vector θ, that is, μ can vary over the range of the model matrix X. With projector P = X(X'X)⁻X', we therefore have μ = Pμ in the full model. A similar representation holds in the submodel specified by the hypothesis H₀. To this end, we introduce the matrices

From Lemma 1.17, the definition of P₁ does not depend on the particular version of the generalized inverses involved. The matrices P₀ and P₁ are idempotent, and symmetric; hence they are orthogonal projectors. They satisfy P₁ = PP₁ and P₀ = P(I_n − P₁). We claim that under the hypothesis H₀, the mean vector μ varies over the range of the projector P₀. Indeed, for mean vectors of the form μ = Xθ with K'θ = 0 we have P₁Xθ = 0 and

    P₀μ = P(I_n − P₁)Xθ = PXθ = Xθ = μ.

Hence μ lies in the range of P₀. Conversely, if μ is in the range of P₀ then we obtain, upon choosing any vector θ₀ = GX'P₀μ with G ∈ {(X'X)⁻} and using K'(X'X)⁻X'P₁ = K'(X'X)⁻X' from Lemma 1.17,

In terms of projectors we are therefore testing the hypothesis H₀: μ = P₀μ, within the full model μ = Pμ. The alternative H₁ thus is μ ≠ P₀μ, within μ = Pμ.

II. Secondly, we briefly digress and solve the associated estimation problems. The optimal estimator for μ in the grand model is PY. The optimal estimator for μ under the hypothesis H₀ is P₀Y, the orthogonal projection of the response vector Y onto the range of the matrix P₀. The difference between the two optimal estimators,

    PY − P₀Y = PP₁Y = P₁Y,

should be small if μ belongs to the hypothesis. A plausible measure of size of this n × 1 vector is its squared Euclidean length,

    ‖P₁Y‖² = Y'P₁Y.


(The reduction from the vector P₁Y to the scalar Y'P₁Y can be rigorously justified by invariance arguments.) A large model variance σ² entails a large variability of the response vector Y, and hence of the quadratic form Y'P₁Y. Therefore its size is evaluated, not on an absolute scale, but relative to an estimate of the model variance. An unbiased, in fact, optimal, estimator for σ² is

    σ̂² = Y'RY / (n − rank X),

invoking the residual projector R = I_n − P = I_n − X(X'X)⁻X'. Indeed, this is the usual estimator for σ², in that Y'RY is the residual sum of squares and n − rank X is the number of associated degrees of freedom.

III. In the third step, we return to the testing problem. Now the normality assumption comes to bear. The random vector P₁Y is normally distributed with mean vector P₁μ and dispersion matrix σ²P₁, while the random vector RY follows a centered normal distribution with dispersion matrix σ²R. Moreover, the covariance matrix of the two vectors vanishes,

    C[P₁Y, RY] = σ² P₁R = 0.

Hence the two statistics P₁Y and RY are independent, and so are the quadratic forms Y'P₁Y and Y'RY. Under the hypothesis, the mean vector P₁μ = μ − P₀μ vanishes. Therefore the statistic

    F = (Y'P₁Y / rank P₁) / (Y'RY / (n − rank X))

has a central F-distribution with numerator degrees of freedom rank P₁ = rank K, and with denominator degrees of freedom rank R = n − rank X. Large values of F go together with the squared norm of the difference vector P₁Y outgrowing the estimate of the model variance σ². In summary, then, the F-test uses F as test statistic. Large values of F indicate a significant deviation from the hypothesis K'θ = 0. The critical value for significance level α is F⁻¹_{s,n−k}(1 − α), that is, the 1 − α quantile of the central F-distribution with numerator degrees of freedom s = rank K and denominator degrees of freedom n − k, where k = rank X. Tables of the F-distributions are widely available. For moderate degrees of freedom and α = .05 the critical value lies in the neighborhood of 4.
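
A compact numerical sketch, not part of the original text, of the statistic: it simulates a classical linear model in which the hypothesis holds, builds the projectors P, P₁, and R as above for K = (I_s, 0)', and evaluates F; the values should look like draws from a central F distribution with s and n − k degrees of freedom.

# Sketch: the F statistic assembled from the projectors of this section.
import numpy as np

rng = np.random.default_rng(4)
n, k, s = 30, 4, 2
X = rng.standard_normal((n, k))
K = np.vstack([np.eye(s), np.zeros((k - s, s))])     # K = (I_s, 0)'
theta = np.array([0.0, 0.0, 1.0, -1.0])              # hypothesis K'theta = 0 is true
Y = X @ theta + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T
C_inv = np.linalg.inv(K.T @ XtX_inv @ K)
P1 = X @ XtX_inv @ K @ C_inv @ K.T @ XtX_inv @ X.T
R = np.eye(n) - P

F = (Y @ P1 @ Y / s) / (Y @ R @ Y / (n - k))
print(F)                                             # compare with F_{s, n-k} quantiles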

IV. Fourthly, we discuss the global behavior of this test. For a full appreciation of the F-test we need to know how it performs, not just under the hypothesis, but in the entire model μ = Xθ. Indeed, the statistic F quite generally follows an F-distribution, except that the distribution may become


noncentral, with noncentrality parameter

Here we have used the formula X'X(X'X)⁻K = K, and the abbreviation

The noncentrality parameter vanishes, μ'P₁μ = 0, if and only if the hypothesis H₀ : K'θ = 0 is satisfied. The expected value of the test statistic F reflects the noncentrality effect. Namely, it is well known that, for n > 2 + rank X, the statistic F has expectation

The larger the noncentrality parameter, the larger the values for F we expect, and the more clearly the test detects a significant deviation from the hypothesis H₀.

V. In the fifth and final step, we study how the F-test responds to a change in the model matrix X. That is, we ask how the test for K'θ = 0 compares in a grand model with a moment matrix M as before, relative to one with an alternate moment matrix A, say. We assume that the testability condition A ∈ A(K) is satisfied. The way the noncentrality parameter determines the expectation of the statistic F involves the matrices

It indicates that the F-test is uniformly better under moment matrix M than under moment matrix A provided

If K is of full column rank, this is the same as requiring C_K(M) ≥ C_K(A) in the Loewner ordering. Again we end up with the task of maximizing the information matrix C_K(M).

Some details have been skipped over. It is legitimate to compare two F-tests by their noncentrality parameters only with "everything else being equal". More precisely, model variance σ² and sample size n ought to be the


EXHIBIT 3.1 ANOVA decomposition. An observed yield y decomposes into P₀y + P₁y + Ry. The term P₀y is fully explained by the hypothesis. The sum of squares ‖P₁y‖² is indicative of a deviation from the hypothesis, while ‖Ry‖² measures the model variance.

same. That the variance per observation, σ²/n, is constant is also called for in the estimation problem. Furthermore, the testing problem even requires equality of the ranks of two competing moment matrices M and A because they affect the denominator degrees of freedom of the F-distribution.

Nevertheless the more important aspects seem to be captured by the matrices C_K(M) and C_K(A). Therefore we keep concentrating on maximizing information matrices.

3.8. ANOVA

The rationale underlying the F-test is often subsumed under the heading analysis of variance. The key connection is the decomposition of the identity matrix into orthogonal projectors,

    I_n = P₀ + P₁ + R.

These projectors correspond to three nested subspaces:

i. ℝⁿ = range(P₀ + P₁ + R), the sample space;
ii. ℒ = range(P₀ + P₁), the mean space under the full model;
iii. ℋ = range P₀, the mean space under the hypothesis H₀.

Accordingly the response vector Y decomposes into the three terms P₀Y, P₁Y, and RY, as indicated in Exhibit 3.1.


The term P₀Y is entirely explained as a possible mean vector, both under the full model and under the hypothesis. The quadratic forms

    Y'P₁Y / rank P₁    and    Y'RY / (n − rank X)

are mean sums of squares estimating the deviation of the observed mean vector from the hypothetical model, and estimating the model variance.

The sample space ℝⁿ is orthogonally decomposed with respect to the Euclidean scalar product because in the classical linear model, the dispersion matrix is proportional to the identity matrix. For a general linear model with dispersion matrix V > 0, the decomposition ought to be carried out relative to the pseudo scalar product ⟨x, y⟩_V = x'V⁻¹y.

3.9. IDENTIFIABILITY

Estimability and testability share a common background, identifiability. We assume that the response vector Y follows a normal distribution N_{Xθ; σ²I_n}. By definition, the subsystem K'θ is called identifiable (by distribution) when all parameter vectors θ₁, θ₂ ∈ ℝ^k satisfy the implication

    Xθ₁ = Xθ₂  ⟹  K'θ₁ = K'θ₂.

The premise means that the parameter vectors θ₁ and θ₂ specify one and the same underlying distribution. The conclusion demands that the parameters of interest then coincide as well.

We have seen in Section 3.6 that, in terms of the moment matrix M = (1/n)X'X, we can transcribe this definition into the identifiability condition

    M ∈ A(K).

Identifiability of the subsystem of interest is a necessary requirement for parametric model-building, with no regard to the intended statistical inference. Estimability and testability are but two particular instances of intended inference.

3.10. FISHER INFORMATION

A central notion in parametric modeling, not confined to the normal distribution, is the Fisher information matrix for the model parameter θ. There are two alternative definitions. It is the dispersion matrix of the first logarithmic derivative with respect to θ of the model density. Or, up to a change of sign, it is the expected matrix of the second logarithmic derivative. In a classical


linear model with normality assumption, the Fisher information matrix for the full mean parameter system turns out to be (n/σ²)M if the underlying design has moment matrix M.

For the first s out of k parameters, θ₁, …, θ_s, the Fisher information matrix is taken to be (n/σ²)C, where the matrices M and C are related through their inverses,

    C⁻¹ = (M⁻¹)₁₁.

We refer to this relation as the dispersion formula. The subscripting in (M⁻¹)₁₁ indicates that C⁻¹ is the upper left s × s block in the inverse of M. The formula creates the impression (wrongly, as we see in the next section) that two inversions are necessary, of M and of C, to obtain a reasonably simple relationship for C in terms of M.

In our notation, we can write (θ₁, …, θ_s)' = K'θ by choosing for K the block matrix (I_s, 0)'. With this notation we find

    C = ((M⁻¹)₁₁)⁻¹ = C_K(M).

Thus calling C_K(M) the information matrix for K'θ also ties in with the dispersion formula.

We do not embark on the general theory of Fisher information for verifying the dispersion formula, but illustrate it by example. Strictly speaking, the classical linear model with normality assumption has k + 1 parameters, the k components of θ plus the scalar model variance σ². Therefore the Fisher information matrix for this model is of order (k + 1) × (k + 1),

In the present development, the mean parameter system θ is of interest, that is, the first k out of all k + 1 parameters. For its inverse information matrix, the dispersion formula yields (σ²/n)M⁻¹, whence its information matrix becomes (n/σ²)M. Disregarding the common factor n/σ², we are thus right in treating M to be the information matrix for θ.

This concludes our overview of the statistical background why C_K(M) is rightly called the information matrix for K'θ. Whatever statistical procedure we have in mind, a reasonable design ξ must be such that its moment matrix M lies in the feasibility cone A(K), in which case C_K(M) takes the closed form expression (K'M⁻K)⁻¹.

3.11. COMPONENT SUBSETS

The complexity of computing C_K(A) is best appreciated if the subsystem of interest consists of s components of the full vector θ, rather than of an


arbitrary linear transformation of rank s. Without loss of generality we consider the first s out of the k mean parameters θ₁, …, θ_k, that is,

The corresponding block partitioning of a k × k matrix A is indicated through

    A = ( A₁₁  A₁₂
          A₂₁  A₂₂ ),

with s × s block A₁₁, (k − s) × (k − s) block A₂₂, and s × (k − s) block A₁₂ = A₂₁'.

The feasibility cone for the first s out of all k mean parameters comprises all nonnegative definite matrices such that the range includes the leading s coordinate subspace. Its members A obey the relation

    C_K(A) = ((A⁻)₁₁)⁻¹,

calling for a generalized inversion of the k × k matrix A, followed by a regular inversion of the upper left s × s subblock. This emphasis on inversions is misleading; the dependence of C_K(A) on A is actually less complex.

A much simpler formula is provided by the Gauss-Markov Theorem 1.21, and even applies to all nonnegative definite matrices rather than being restricted to those in the feasibility cone. With any left inverse L of K and residual projector R = I_k − KL, this formula reads

    C_K(A) = LAL' − LAR'(RAR')⁻RAL'.

The liberty of choosing the left inverse L can be put to good use in making this expression as simple as possible. For K = (I_s, 0)', we take

    L = (I_s, 0).

With this choice, the formula for C_K(A) turns into

    C_K(A) = A₁₁ − A₁₂A₂₂⁻A₂₁.

The matrix A₁₁ − A₁₂A₂₂⁻A₂₁ is called the Schur complement of A₂₂ in A. Our derivation implies that it is well defined and nonnegative definite, but a direct proof of these properties is instructive.
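
As a quick numerical check, not part of the original text, the Schur complement and the dispersion-formula expression of Section 3.10 coincide for K = (I_s, 0)' and a positive definite A:

# Sketch: inverse of the upper-left block of A^{-1} equals the Schur complement.
import numpy as np

rng = np.random.default_rng(5)
k, s = 5, 2
root = rng.standard_normal((k, k))
A = root @ root.T + np.eye(k)

via_inverse = np.linalg.inv(np.linalg.inv(A)[:s, :s])
schur = A[:s, :s] - A[:s, s:] @ np.linalg.solve(A[s:, s:], A[s:, :s])

print(np.allclose(via_inverse, schur))               # True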


3.12. SCHUR COMPLEMENTS

Lemma. A symmetric block matrix

    A = ( A₁₁  A₁₂
          A₂₁  A₂₂ )

is nonnegative definite if and only if A₂₂ is nonnegative definite, the range of A₂₂ includes the range of A₂₁, and the Schur complement A₁₁ − A₁₂A₂₂⁻A₂₁ is nonnegative definite.

Proof. The direct part is proved first. With A being nonnegative definite, the block A₂₂ is also nonnegative definite. We claim that the range of A₂₂ includes the range of A₂₁. By Lemma 1.13, we need to show that the nullspace of A₂₂ is included in the nullspace of A₂₁' = A₁₂. Hence we take any vector y with A₂₂y = 0. For every δ > 0 and for every vector x, we then get

Letting δ tend to zero, we obtain x'A₁₂y ≥ 0 for all x, hence also for −x. Thus we get x'A₁₂y = 0 for all x, and therefore A₁₂y = 0.

It follows from Lemma 1.17 that A₂₂A₂₂⁻A₂₁ = A₂₁ and that A₁₂A₂₂⁻A₂₁ does not depend on the choice of the generalized inverse for A₂₂. Finally nonnegative definiteness of A₁₁ − A₁₂A₂₂⁻A₂₁ is a consequence of the identity

    A₁₁ − A₁₂A₂₂⁻A₂₁ = (I_s, −A₁₂A₂₂⁻) A (I_s, −A₁₂A₂₂⁻)'.

For the converse part of the proof, we need only invert the last identity:

There are several lessons to be learnt from the Schur complement representation of information matrices. Firstly, we can interpret it as consisting of the term A₁₁, the information matrix for (θ₁, …, θ_s)' in the absence of nuisance parameters, adjusted for the loss of information A₁₂A₂₂⁻A₂₁, caused by entertaining the nuisance parameters (θ_{s+1}, …, θ_k)'.


Secondly, Lemma 1.17 says that the penalty term A₁₂A₂₂⁻A₂₁ has the same range as A₁₂. Hence it may vanish, making an inversion of A₂₂ obsolete. It vanishes if and only if A₁₂ = 0, which may be referred to as parameter orthogonality of the parameters of interest and the nuisance parameters. For instance, in a linear model with normality assumption, the mean parameter vector θ and the model variance σ² are parameter orthogonal (see Section 3.10).

Thirdly, the complexity of computing C_K(A) is determined by the inversion of the (k − s) × (k − s) matrix A₂₂. In general, we must add the complexity of computing a left inverse L of the k × s coefficient matrix K.

For a general subsystem of interest K'θ with coefficient matrix K of full column rank s, four formulas for the information matrix C_K(A) are now available:

    (1)  C_K(A) = (K'A⁻K)⁻¹    for A ∈ A(K);
    (2)  C_K(A) = LAL' − LAR'(RAR')⁻RAL'    for any left inverse L of K, with R = I_k − KL;
    (3)  C_K(A) = LAL'    for any left inverse L of K that is minimizing for A;
    (4)  C_K(A) = min{LAL' : L ∈ ℝ^{s×k}, LK = I_s}.

Formula (1) clarifies the role of C_K(A) in statistical inference. It is restricted to the feasibility cone A(K) but can be extended to the closed cone NND(k) by semicontinuity (see Section 3.13). Formula (2) utilizes an arbitrary left inverse L of K and the residual projector R = I_k − KL. It is of Schur complement type, focusing on the loss of information due to the presence of nuisance parameters. The left inverse L can be chosen in order to relieve the computational burden as much as possible. Formula (3) is based on a left inverse L of K that is minimizing for A, and shifts the computational task over to determining a solution of the linear matrix equation L(K, AR') = (I_s, 0).

Formula (4) has been adopted as a definition and is instrumental when we next establish the functional properties of the mapping C_K. The formula provides a quasi-linear representation of C_K, in the sense that the mappings A ↦ LAL' are linear in A, and that C_K is the minimum of a collection of such linear mappings.

3.13. BASIC PROPERTIES OF THE INFORMATION MATRIX MAPPING

Theorem. Let K be a k × s coefficient matrix of full column rank s, with associated information matrix mapping

    C_K : NND(k) → Sym(s),    C_K(A) = min{LAL' : L ∈ ℝ^{s×k}, LK = I_s}.


Then C_K is positively homogeneous, matrix superadditive, and nonnegative, as well as matrix concave and matrix isotonic:

    C_K(δA) = δ C_K(A)    for all δ > 0,
    C_K(A + B) ≥ C_K(A) + C_K(B),
    C_K(A) ≥ 0,
    C_K((1 − α)A + αB) ≥ (1 − α)C_K(A) + αC_K(B)    for all α ∈ [0; 1],
    A ≥ B  ⟹  C_K(A) ≥ C_K(B),

for all A, B ∈ NND(k). Moreover C_K enjoys any one of the following three equivalent properties:

a. (Upper semicontinuity) The level sets

    {C_K ≥ C} = {A ∈ NND(k) : C_K(A) ≥ C}

are closed, for all C ∈ Sym(s).

b. (Sequential semicontinuity criterion) For all sequences (A_m)_{m≥1} in NND(k) that converge to a limit A, we have

    C_K(A_m) ≥ C_K(A) for all m ≥ 1    ⟹    lim_{m→∞} C_K(A_m) = C_K(A).

c. (Regularization) For all A, B ∈ NND(k) we have

    lim_{m→∞} C_K(A + (1/m)B) = C_K(A).

Proof. The first three properties are immediate from the definition of C_K(A) as the minimum over the matrices LAL'. Superadditivity and homogeneity imply matrix concavity:

Superadditivity and nonnegativity imply matrix monotonicity,

Moreover C_K enjoys property b. Suppose the matrices A_m ≥ 0 converge


to A such that C_K(A_m) ≥ C_K(A). With a left inverse L of K that is minimizing for A we obtain

    C_K(A) ≤ C_K(A_m) ≤ LA_mL' → LAL' = C_K(A).

Hence C_K(A_m) converges to the limit C_K(A). It remains to establish the equivalence of (a), (b), and (c). For this, we need no more than that the mapping C_K : NND(k) → Sym(s) is matrix isotonic. Let 𝓑 = {B ∈ Sym(k) : trace B² ≤ 1} be the closed unit ball in Sym(k), as encountered in Section 1.9. There we proved that every matrix B ∈ 𝓑 fulfills |x'Bx| ≤ x'x for all x ∈ ℝ^k. In terms of the Loewner ordering this means −I_k ≤ B ≤ I_k, for all B ∈ 𝓑.

First we prove that (a) implies (b). Let (A_m)_{m≥1} be a sequence in NND(k) converging to A, and satisfying C_K(A_m) ≥ C_K(A) for all m ≥ 1. For every ε > 0, the sequence (A_m)_{m≥1} eventually stays in the ball A + ε𝓑. This entails A_m ≤ A + εI_k and, by monotonicity of C_K, also C_K(A_m) ≤ C_K(A + εI_k) for eventually all m. Hence the sequence (C_K(A_m))_{m≥1} is bounded in Sym(s), and possesses cluster points C. From C_K(A_m) ≥ C_K(A), we obtain C ≥ C_K(A).

Let (C_K(A_{m_ℓ}))_{ℓ≥1} be a subsequence converging to C. For all δ > 0, the sequence (C_K(A_{m_ℓ}))_{ℓ≥1} eventually stays in C + δ𝓒, where 𝓒 is the closed unit ball in Sym(s). This implies C_K(A_{m_ℓ}) ≥ C − δI_s ∈ Sym(s), for eventually all ℓ. In other words, A_{m_ℓ} lies in the level set {C_K ≥ C − δI_s}. This set is closed, by assumption (a), and hence also contains A. Thus we get C_K(A) ≥ C − δI_s and, with δ tending to zero, C_K(A) ≥ C. This entails C = C_K(A). Since the cluster point C was arbitrary, (b) follows.

Next we show that (b) implies (c). Indeed, the sequence A_m = A + (1/m)B converges to A and, by monotonicity of C_K, fulfills C_K(A_m) ≥ C_K(A). Now (b) secures convergence, thus establishing part (c).

Finally we demonstrate that (c) implies (a). For a fixed matrix C ∈ Sym(s) let (A_ℓ)_{ℓ≥1} be a sequence in the level set {C_K ≥ C} converging to a limit A. For every m ≥ 1, the sequence (A_ℓ)_{ℓ≥1} eventually stays in the ball A + (1/m)𝓑. This implies A_ℓ ≤ A + (1/m)I_k and, by monotonicity of C_K, also C_K(A_ℓ) ≤ C_K(A + (1/m)I_k). The left hand side is bounded from below by C since A_ℓ ∈ {C_K ≥ C}. The right hand side converges to C_K(A) because of regularity. But C ≤ C_K(A) means A ∈ {C_K ≥ C}, establishing closedness of the level set {C_K ≥ C}.

The main advantage of defining the mapping C_K as the minimum of the linear functions LAL' is the smoothness that it acquires on the boundary of its domain of definition, expressed through upper semicontinuity (a). Our definition requires closedness of the upper level sets {C_K ≥ C} of the matrix-valued function C_K, and conforms as it should with the usual concept of upper semicontinuity for real-valued functions, compare Lemma 5.7. Property (b)


provides a handy sequential criterion for upper semicontinuity. Part (c) tiesin with regular, that is, nonsingular, matrices since A + (l/m)B is positivedefinite whenever B is positive definite.

A prime application of regularization consists in extending the formula CK(A) = (K'A⁻K)⁻¹ from the feasibility cone A(K) to all nonnegative definite matrices,

   CK(A) = lim_{δ→0+} (K'(A + δI_k)⁻¹K)⁻¹.

By homogeneity this is the same as

   CK(A) = lim_{α→0+} (K'((1−α)A + αI_k)⁻¹K)⁻¹,

where the point is that the positive definite matrices (1−α)A + αI_k converge to A along the straight line from I_k to A. The formula remains correct with the identity matrix I_k replaced by any positive definite matrix B. In other words, the representation (K'A⁻¹K)⁻¹ permits a continuous extension from the interior cone PD(k) to the closure NND(k), as long as convergence takes place "along a straight line" (see Exhibit 3.2).
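To see the regularization numerically, here is a minimal NumPy sketch (the particular K and A are our own illustrative choices, not taken from the text): for a singular moment matrix, the values (K'((1−α)A + αI)⁻¹K)⁻¹ settle down as α decreases.

```python
import numpy as np

# Straight-line regularization: for a singular A, approximate C_K(A) by
# (K'((1-a)A + a*I)^{-1} K)^{-1} with a -> 0+.  Matrices are illustrative.
K = np.array([[1.0], [0.0], [0.0]])     # scalar subsystem c = e_1, k = 3
A = np.diag([2.0, 1.0, 0.0])            # singular moment matrix, c in range A
for a in [1e-1, 1e-3, 1e-6]:
    A_a = (1 - a) * A + a * np.eye(3)
    C_a = np.linalg.inv(K.T @ np.linalg.inv(A_a) @ K)
    print(f"a = {a:g}: C_K approx = {C_a[0, 0]:.6f}")
# The values tend to 2.0, the information (c'A^-c)^{-1} for c = e_1.
```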

Section 3.16 illustrates by example that the information matrix mapping CK may well fail to be continuous if the convergence is not along straight lines.

Next we show that the rank behavior of CK(A) completely specifies the feasibility cone A(K). The following lemma singles out an intermediate step concerning ranges.

3.14. RANGE DISJOINTNESS LEMMA

Lemma. Let the k × s coefficient matrix K have full column rank s, and let A be a nonnegative definite k × k matrix. Then the matrix

   AK = K CK(A) K'

is the unique matrix with the three properties

   A ≥ AK ≥ 0,   range AK ⊆ range K,   (range(A − AK)) ∩ (range K) = {0}.

Proof. First we show that AK enjoys each of the three properties. Nonnegative definiteness of AK is evident. Let L be a left inverse of K that is minimizing for A, and set R = I_k − KL. Then we have CK(A) = LAL', and


EXHIBIT 3.2 Regularization of the information matrix mapping. On the boundary of the cone NND(k), convergence of the information matrix mapping CK holds true provided the sequence A_1, A_2, ... tends along a straight line from inside PD(k) towards the singular matrix A, but may fail otherwise.

the Gauss-Markov Theorem 1.19 yields

This establishes the first property. The second property, the range inclusion condition, is obvious from the definition of AK.

Now we turn to the range disjointness property of A − AK and K. For vectors u ∈ R^k and v ∈ R^s with (A − AK)u = Kv, we have


In the penultimate line, the product LAR' vanishes since the left inverse L of K is minimizing for A, by the Gauss-Markov Theorem 1.19. Hence the ranges of A − AK and K are disjoint except for the null vector.

Second, consider any other matrix B with the three properties. Because of B ≥ 0 we can choose a square root decomposition B = UU'. From the second property, we get range U = range B ⊆ range K. Therefore we can write U = KW. Now A ≥ B entails AK ≥ B, by way of

Thus A − B = (A − AK) + (AK − B) is the sum of two nonnegative definite matrices. Invoking the third property and the range summation Lemma 2.3, we obtain

A matrix with range null must be the null matrix, forcing B = AK.

The rank behavior of information matrices measures the extent of identifiability, indicating how much of the range of the coefficient matrix is captured by the range of the moment matrix.

3.15. RANK OF INFORMATION MATRICES

Theorem. Let the k × s coefficient matrix K have full column rank s, and let A be a nonnegative definite k × k matrix. Then

   rank CK(A) = dim((range A) ∩ (range K)).

In particular, CK(A) is positive definite if and only if A lies in the feasibility cone A(K).

Proof. Define AK = K CK(A) K'. We know that A = (A − AK) + AK is the sum of two nonnegative definite matrices, by the preceding lemma. From Lemma 2.3 we obtain

   range A ∩ range K = (range(A − AK) ∩ range K) + (range AK ∩ range K)
                     = range AK.


Since K has full column rank this yields

   rank CK(A) = rank AK = dim((range A) ∩ (range K)).

In particular, CK(A) has rank s if and only if range A ∩ range K = range K, that is, the matrix A lies in the feasibility cone A(K).

The somewhat surprising conclusion is that the notion of feasibility cones is formally redundant, even though it is essential to statistical inference. The theorem actually suggests that the natural order of first checking feasibility and then calculating CK(A) be reversed. First calculate CK(A) from the Schur complement type formula

   CK(A) = LAL' − LAR'(RAR')⁻RAL',

with any left inverse L of K and residual projector R = I_k − KL. Then check nonsingularity of CK(A) to see whether A is feasible. Numerically it is much simpler to find out whether the nonnegative definite matrix CK(A) is singular or not, rather than to verify a range inclusion condition.
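The recipe of this paragraph is easy to code. The following NumPy sketch (function names and test matrices are ours, for illustration only) computes CK(A) from the Schur-complement-type formula and decides feasibility by checking whether the result is nonsingular:

```python
import numpy as np

def information_matrix(A, K):
    """C_K(A) by the Schur-complement-type formula
    C_K(A) = L A L' - L A R' (R A R')^- R A L',
    with L a left inverse of K and R = I - K L (K of full column rank)."""
    k = A.shape[0]
    L = np.linalg.pinv(K)                  # one particular left inverse of K
    R = np.eye(k) - K @ L                  # residual projector
    middle = np.linalg.pinv(R @ A @ R.T)   # a generalized inverse of RAR'
    return L @ A @ L.T - L @ A @ R.T @ middle @ R @ A @ L.T

def is_feasible(A, K, tol=1e-10):
    """Check feasibility of A for K'theta by testing whether C_K(A) is
    nonsingular (Theorem 3.15), instead of verifying a range inclusion."""
    C = information_matrix(A, K)
    return np.linalg.matrix_rank(C, tol=tol) == K.shape[1]

# Illustrative example (not from the text):
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # first two of three parameters
A_good = np.diag([1.0, 2.0, 0.0])                     # range contains range K
A_bad  = np.diag([1.0, 0.0, 3.0])                     # range misses e_2
print(is_feasible(A_good, K), is_feasible(A_bad, K))  # True False
```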

The case of the first s out of all k mean parameters is particularly transparent. As elaborated in Section 3.11, its information matrix is the ordinary Schur complement of A22 in A. The present theorem states that the rank of the Schur complement determines feasibility of A, in that the Schur complement has rank s if and only if the range of A includes the leading s-dimensional coordinate subspace.

For scalar subsystems c'θ, the design problem of Section 2.7 only involves those moment matrices that lie in the feasibility cone A(c). It becomes clear now why this restriction does not affect the optimization problem. If the matrix A fails to lie in the cone A(c), then the information for c'θ vanishes, having rank equal to

   rank Cc(A) = dim((range A) ∩ (range c)) = 0.

The formulation of the design problem in Section 2.7 omits only such moment matrices that provide zero information for c'θ. Clearly this is legitimate.

3.16. DISCONTINUITY OF THE INFORMATION MATRIX MAPPING

We now convince ourselves that matrix upper semicontinuity as in Theorem 3.13 is all we can generally hope for. If the coefficient matrix K is square and nonsingular, then we have CK(A) = K⁻¹AK⁻¹' for A ∈ NND(k). Here the mapping CK is actually continuous on the entire closed cone NND(k). But continuity breaks down as soon as the rank of K falls below k.


This discontinuity is most easily recognized for scalar parameter systems c'θ. We simply choose vectors d_m ≠ 0 orthogonal to c and converging to zero. The matrices A_m = (c + d_m)(c + d_m)' then do not lie in the feasibility cone A(c), whence Cc(A_m) = 0. On the other hand, the limit cc' has information one for c'θ,

   Cc(cc') = (c'(cc')⁻c)⁻¹ = 1.
The same reasoning applies as long as the range of the coefficient matrix K is not the full space R^k. To see this, let KK' = Σ_{j≤s} λ_j z_j z_j' be an eigenvalue decomposition. We can choose vectors d_m ≠ 0 orthogonal to the range of K and converging to zero. Then the matrices A_m = Σ_{j≤s} λ_j (z_j + d_m)(z_j + d_m)' satisfy

whence CK(A_m) is singular. On the other hand, the limit matrix KK' comes with the information matrix

   CK(KK') = (K'(KK')⁻K)⁻¹ = I_s.

The matrices A_m do converge to KK', but convergence is not along a straight line.

In this somewhat abstract reasoning, the approximating matrices A_m fail to lie in the feasibility cone, in contrast to the limit matrices cc' and KK'. The following example is less artificial and does not suffer from this overemphasis of the feasibility cone.

Consider a line fit model with experimental domain T = [−1; 1]. A design τ on T is optimal for the intercept α if and only if its first moment vanishes, μ_1 = 0, as seen in Section 2.20. In particular, the symmetric two-point designs τ_{1,m} that assign equal mass 1/2 to the points ±1/m are optimal.

For parameter r ∈ R, we now introduce the two-point designs τ_{r,m} that assign mass 1/2 to the support points 1/m and −r/m. If m > |r|, then the support points lie in the experimental domain [−1; 1]. As m tends to infinity, the corresponding moment matrices A_{r,m} converge to a limit that does not depend on r,

   A_{r,m} = [1  (1−r)/(2m); (1−r)/(2m)  (1+r²)/(2m²)]  →  [1 0; 0 0]   as m → ∞.

The limit matrix [1 0; 0 0] is an optimal moment matrix for the intercept α, belonging, for example, to the design τ which assigns all its mass to the single point zero.


EXHIBIT 3.3 Discontinuity of the information matrix mapping. For the intercept in a line fit model, the information of the two-point designs τ_{r,m} exhausts all values between the minimum zero (r = −1) and the maximum one (r = 1). Yet the moment matrices A_{r,m} converge to a limit not depending on r.

For the matrices A_{r,m}, the Schur complement formula of Lemma 3.12 yields the information value for the intercept α,

   Cc(A_{r,m}) = 1 − (1−r)²/(2(1+r²)),

constant in m. With varying parameter r, the designs τ_{r,m} exhaust all possible information values, between the minimum zero and the maximum one, as shown in Exhibit 3.3.
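A small numerical check of this example (NumPy; the helper functions are our own) confirms that the intercept information of τ_{r,m} depends on r but not on m, even though all the moment matrices A_{r,m} approach the same limit:

```python
import numpy as np

def moment_matrix(points, weights):
    """Moment matrix of a design in the line fit model f(t) = (1, t)'."""
    M = np.zeros((2, 2))
    for t, w in zip(points, weights):
        f = np.array([1.0, t])
        M += w * np.outer(f, f)
    return M

def intercept_information(M):
    """Schur complement of M[1,1] in M: information for the intercept."""
    return M[0, 0] - M[0, 1] ** 2 / M[1, 1]

# Designs tau_{r,m} with mass 1/2 at 1/m and -r/m (illustrative check):
for r in [-1.0, 0.0, 0.5, 1.0]:
    vals = [intercept_information(moment_matrix([1/m, -r/m], [0.5, 0.5]))
            for m in (2, 10, 100)]
    print(f"r = {r:+.1f}: information = {vals}")   # constant in m
# Yet every A_{r,m} tends to diag(1, 0), whose intercept information is 1.
```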

For r ≠ 1, the unsymmetric two-point designs τ_{r,m} provide an instance of discontinuity,

   Cc( lim_{m→∞} A_{r,m} ) = 1 > 1 − (1−r)²/(2(1+r²)) = lim_{m→∞} Cc(A_{r,m}).

The matrices A_{r,m} are nonsingular. Also the limiting matrix [1 0; 0 0] is feasible. Hence in the present example convergence takes place entirely within the feasibility cone A(c), but not along straight lines. This illustrates that the lack of continuity is peculiar not just to the complement of the feasibility cone, but to the boundary at large.

Next we show that the notion of information matrices behaves consistently in the case of iterated parameter systems. As a preparation we prove a result from matrix algebra.


3.17. JOINT SOLVABILITY OF TWO MATRIX EQUATIONS

Lemma. Let the matrices A ∈ R^{t×s}, B ∈ R^{t×k}, C ∈ R^{k×n}, D ∈ R^{s×n} be given. Then the two linear matrix equations AX = B and XC = D are jointly solvable if and only if they are individually solvable and AD = BC.

Proof. The direct part is trivial, AD = AXC = BC. For the converse part, we first note that the single equation AX = B is solvable if and only if range B ⊆ range A. By Lemma 1.17, this is the same as AA⁻B = B. Similarly C'X' = D' is solvable if and only if C'C'⁻D' = D'. Altogether we now have the three conditions

   AA⁻B = B,   (1)
   C'C'⁻D' = D',   (2)
   AD = BC.   (3)

With generalized inverses G of A and H of C, the matrix X = DH + GB − GADH solves both equations: AX = ADH + AGB − AGADH = B results from (1), and XC = DHC + GBC − GADHC = D + G(BC − AD) = D results from (2) and (3).
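The construction in the proof can be verified numerically. In the sketch below (NumPy; dimensions and matrices are arbitrary illustrative choices) a jointly solvable pair is generated from a common solution X0, and the formula X = DH + GB − GADH is checked against both equations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a jointly solvable pair AX = B, XC = D from a random X0.
t, s, k, n = 4, 3, 5, 2
A = rng.standard_normal((t, s))
C = rng.standard_normal((k, n))
X0 = rng.standard_normal((s, k))
B, D = A @ X0, X0 @ C            # guarantees individual solvability and AD = BC

G = np.linalg.pinv(A)            # a generalized inverse of A
H = np.linalg.pinv(C)            # a generalized inverse of C
X = D @ H + G @ B - G @ A @ D @ H   # joint solution from the proof

print(np.allclose(A @ X, B), np.allclose(X @ C, D))   # True True
```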

3.18. ITERATED PARAMETER SUBSYSTEMS

Information matrices for K'θ depend on moment matrices. But moment matrices themselves are information matrices for θ. The idea of "iterating on information" holds true more generally as soon as the subsystems under investigation are appropriately nested.

To be precise, let us consider a first subsystem K'θ with k × s coefficient matrix K of rank s. The information matrix mapping for K'θ is given by

   CK : NND(k) → Sym(s),   CK(A) = min { LAL' : L ∈ R^{s×k}, LK = I_s }.

Left inverses of K that are minimizing for A are denoted by L_{K,A}.

Now we adjoin a sub-subsystem H'K'θ, given by an s × r coefficient matrix H of rank r. The information matrix mapping associated with H is

   CH : NND(s) → Sym(r),   CH(B) = min { LBL' : L ∈ R^{r×s}, LH = I_r }.

Left inverses of H that are minimizing for B are denoted by L_{H,B}.


However, the subsystem H'K'θ can also be read as (KH)'θ, with k × r coefficient matrix KH and information matrix mapping

   C_{KH} : NND(k) → Sym(r),   C_{KH}(A) = min { LAL' : L ∈ R^{r×k}, LKH = I_r }.

Left inverses of KH that are minimizing for A are denoted by L_{KH,A}. The interrelation between these three information matrix mappings makes the notion of iterated information matrices more precise.

3.19. ITERATED INFORMATION MATRICES

Theorem. With the preceding notation we have

   C_{KH}(A) = CH(CK(A))   for all A ∈ NND(k).

Moreover, a left inverse L_{KH,A} of KH is minimizing for A if and only if it factorizes into some left inverse L_{K,A} of K that is minimizing for A and some left inverse L_{H,CK(A)} of H that is minimizing for CK(A),

   L_{KH,A} = L_{H,CK(A)} L_{K,A}.

Proof. There exist left inverses L_{K,A} of K minimizing for A and L_{H,CK(A)} of H minimizing for CK(A). With this, we define the matrix L = L_{H,CK(A)} L_{K,A}. Obviously LKH = L_{H,CK(A)} L_{K,A} K H = L_{H,CK(A)} H = I_r, that is, L is a left inverse of KH. The three associated residual projectors are denoted by

   R_K = I_k − K L_{K,A},   R_H = I_s − H L_{H,CK(A)},   R_{KH} = I_k − KH L.

For L to be minimizing for A, it remains to show that L A R'_{KH} = 0. But we have R_{KH} = R_K + K R_H L_{K,A}, as well as L_{K,A} A R'_K = 0 and L_{H,CK(A)} CK(A) R'_H = 0. Thus we get

   L A R'_{KH} = L_{H,CK(A)} L_{K,A} A R'_K + L_{H,CK(A)} L_{K,A} A L'_{K,A} R'_H K' = 0 + L_{H,CK(A)} CK(A) R'_H K' = 0.


This establishes the converse part of the factorization formula, that L_{H,CK(A)} L_{K,A} is a left inverse of KH that is minimizing for A, and it proves the iteration formula,

   C_{KH}(A) = L A L' = L_{H,CK(A)} ( L_{K,A} A L'_{K,A} ) L'_{H,CK(A)} = L_{H,CK(A)} CK(A) L'_{H,CK(A)} = CH(CK(A)).

For the direct part of the factorization formula, let L_{KH,A} be a left inverse of KH that is minimizing for A, with residual projector R_{KH} = I_k − KH L_{KH,A}. Setting R_K = I_k − KL, with an arbitrary left inverse L of K, we have R_K = R_K R_{KH}, and L_{KH,A} A R'_K = L_{KH,A} A R'_{KH} R'_K = 0.

We consider the two matrix equations

   (L_{KH,A} K) X = L_{KH,A}   and   X (K, A R'_K) = (I_s, 0).

The first equation has solution X = H L_{KH,A}, for instance. The solutions of the second equation are the left inverses of K that are minimizing for A. The four coefficient matrices satisfy L_{KH,A} K (I_s, 0) = L_{KH,A} (K, A R'_K). In view of Lemma 3.17, the two equations admit a joint solution which we also denote by X.

The first equation provides the factorization L_{KH,A} = (L_{KH,A} K) X. The factor X is a left inverse of K that is minimizing for A, by the second equation. The factor L_{KH,A} K is a left inverse of H. Setting R_H = I_s − H L_{KH,A} K, the first equation and XK = I_s yield

Hence L_{KH,A} K is a left inverse of H that is minimizing for CK(A), and the proof is complete.
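As a numerical illustration of the iteration formula (our own sketch, assuming a positive definite A and full-column-rank K and H, with CK computed by the Schur-complement-type formula of Section 3.15):

```python
import numpy as np

def info_matrix(A, K):
    """C_K(A) via the Schur-complement-type formula with L = pinv(K),
    R = I - K L (sketch for full-column-rank K)."""
    k = A.shape[0]
    L = np.linalg.pinv(K)
    R = np.eye(k) - K @ L
    mid = np.linalg.pinv(R @ A @ R.T)
    return L @ A @ L.T - L @ A @ R.T @ mid @ R @ A @ L.T

rng = np.random.default_rng(1)
k, s, r = 5, 3, 2
X = rng.standard_normal((8, k))
A = X.T @ X                                   # a positive definite moment matrix
K = rng.standard_normal((k, s))               # rank s (almost surely)
H = rng.standard_normal((s, r))               # rank r (almost surely)

lhs = info_matrix(A, K @ H)                   # C_{KH}(A)
rhs = info_matrix(info_matrix(A, K), H)       # C_H(C_K(A))
print(np.allclose(lhs, rhs))                  # True
```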

This concludes the study of information matrices for subsystems K'θ with coefficient matrix K of full column rank s.

3.20. RANK DEFICIENT SUBSYSTEMS

Not all parameter systems K'θ that are of statistical interest have a coefficient matrix K that is of full column rank.


For instance, in the two-way classification model of Section 1.27, the mean parameters comprise a subvector α = (α_1, ..., α_a)', where α_i is the mean effect of level i of factor A. Often these effects are of interest (and identifiable) only after being corrected by the mean effect ᾱ = (1/a) Σ_{i≤a} α_i. This gives rise to the special coefficient matrices defined by

   Ja = (1/a) 1_a 1_a',   Ka = I_a − Ja.

The subsystem Ka α consists of the parameter functions (α_1 − ᾱ, ..., α_a − ᾱ)', and is called the system of centered contrasts of factor A.

The coefficient matrix Ka fails to have full column rank a. In fact, the matrix Ja is the orthogonal projector onto the one-dimensional subspace of R^a spanned by the unity vector 1_a = (1, ..., 1)'. Its residual projector Ka projects onto the orthogonal complement, and therefore has rank a − 1. Together they provide an orthogonal decomposition of the parameter subspace R^a, corresponding to

   I_a = Ja + Ka.

The matrix Ja is called the averaging matrix or the diagonal projector, and Ka is called the centering matrix or the orthodiagonal projector. We return to this model in Section 3.25.
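In code, the averaging and centering matrices are one line each; the following NumPy sketch (the level number a and the vector of effects are illustrative choices) also shows the centered contrasts and the orthogonal decomposition:

```python
import numpy as np

a = 4                                   # number of levels of factor A (illustrative)
one = np.ones((a, 1))
J_a = one @ one.T / a                   # averaging matrix (projects onto span 1_a)
K_a = np.eye(a) - J_a                   # centering matrix (rank a - 1)

alpha = np.array([3.0, 1.0, 2.0, 6.0])  # some effects
print(K_a @ alpha)                      # centered contrasts alpha_i - alpha_bar
print(np.linalg.matrix_rank(K_a), np.allclose(J_a + K_a, np.eye(a)))  # 3 True
```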

3.21. GENERALIZED INFORMATION MATRICES FOR RANK DEFICIENT SUBSYSTEMS

It is neither helpful nor wise to remedy rank deficiency through a full rank reparametrization. In most applications, the parameters have a definite meaning, and this meaning is destroyed or at least distorted by reparametrization. Instead the notion of information matrices is generalized in order to cover coefficient matrices that do not have full column rank.

Which difficulties do we encounter if the k × s coefficient matrix K has rank r < s? In the case of full column rank, r = s, a feasible moment matrix A ∈ A(K) has information matrix CK(A) given by the formula

   CK(A) = (K'A⁻K)⁻¹.

If K is rank deficient, r < s, so is the product K'A⁻K, and the matrix K'A⁻K fails to be invertible.

The lack of invertibility cannot, at this point, be overcome by generalized inversion. A singular matrix K'A⁻K has many generalized inverses, and we have no clue which one deserves to play the distinctive role of the information matrix for K'θ.


If invertibility breaks down, which properties should carry over to the rank deficient case? Here is a list of desiderata.

i. The extension must not be marred by too much algebra of specific versions of generalized inverse matrices.

ii. A passage to the case of full column rank coefficient matrices must be visibly simple.

iii. The power of the F-test depends on the matrix AK = K(K'A⁻K)⁻K', as outlined in Section 3.7, and any generalization must be in line with this.

iv. The estimation problem outlined in Section 3.5 leads to dispersion matrices of the form K'A⁻K even if the coefficient matrix K fails to have full column rank. The transition to generalized information matrices ought to be antitonic relative to the Loewner ordering, so that minimization of dispersion matrices still corresponds to maximization of generalized information matrices.

v. The functional properties from the full rank case should continue to hold true.

We contend that these points are met best by the following definition.

DEFINITION. For a design ξ with moment matrix M, the generalized information matrix for K'θ, with a k × s coefficient matrix K that may be rank deficient, is defined to be the k × k matrix MK, where the mapping A ↦ AK from the cone NND(k) to Sym(k) is given by

   AK = min { QAQ' : Q ∈ R^{k×k}, QK = K },

the minimum being taken relative to the Loewner ordering.

The major disadvantage of the definition is that the number of rows and columns of generalized information matrices AK no longer exhibits the reduced dimensionality of the parameter subsystem of interest. Both AK and A are k × k matrices, and the notation is meant to indicate this. Rank, not number of rows and columns, measures the extent of identifiability of the subsystem K'θ. Otherwise the definition of a generalized information matrix stands up well against the desiderata listed above.

i. The Gauss-Markov Theorem 1.19 provides the representation

   AK = A − AR'(RAR')⁻RA,

where the residual projector R = I_k − KG is formed with an arbitrary generalized inverse G of K. This preserves the full liberty of choosing generalized inverses of K, and of RAR'.
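The representation in item (i) translates directly into a small routine. The sketch below (NumPy; the function name and test matrices are ours) computes AK for a rank deficient K, here the centering matrix of a three-level factor:

```python
import numpy as np

def generalized_information(A, K):
    """Generalized information matrix A_K = A - A R' (R A R')^- R A,
    with R = I - K G for a generalized inverse G of K.
    A sketch; K may be rank deficient."""
    k = A.shape[0]
    G = np.linalg.pinv(K)
    R = np.eye(k) - K @ G
    mid = np.linalg.pinv(R @ A @ R.T)
    return A - A @ R.T @ mid @ R @ A

# Centered contrasts of a 3-level factor (illustrative rank deficient K):
a = 3
K = np.eye(a) - np.ones((a, a)) / a      # centering matrix, rank a - 1
A = np.diag([0.5, 0.3, 0.2])             # a feasible moment matrix
AK = generalized_information(A, K)
print(np.linalg.matrix_rank(AK))         # 2 = rank K, so A is feasible for K'theta
```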


ii. Let us assume that the coefficient matrix K has full column rank s. A passage from AK to CK(A) is easily achieved by premultiplication of AK by L and postmultiplication by L', where L is any left inverse of K,

   CK(A) = L AK L'.

Conversely we obtain AK from CK(A) by premultiplication by K and postmultiplication by K'. Indeed, inserting KL = I_k − R, we find

using RAR'(RAR')⁻RA = RA as in the existence part of the proof of the Gauss-Markov Theorem 1.19.

iii. The testing problem is covered since we have AK = K(K'A⁻K)⁻K' if A lies in the feasibility cone A(K), by the Gauss-Markov Theorem 1.20.

iv. The estimation problem calls for a comparison of the Loewner orderings among generalized information matrices AK, and among dispersion matrices K'A⁻K. We establish the desired result in two steps. The first compares the generalized inverses of an arbitrary moment matrix A and its generalized information matrix AK.

3.22. GENERALIZED INVERSES OF GENERALIZED INFORMATION MATRICES

Lemma. Let A be a nonnegative definite k × k matrix and let AK be its generalized information matrix for K'θ. Then every generalized inverse of A is also a generalized inverse of AK.

Proof. Let Q be a left identity of K that is minimizing for A, AK = QAQ'. We know from the Gauss-Markov Theorem 1.19 that this equality holds if and only if QK = K and QAR' = 0, with R = I_k − KL for some generalized inverse L of K. The latter property means that QA = QAL'K'. Postmultiplication by Q' yields QAQ' = QAL'K'Q' = QAL'K' = QA. We get


the last equation following from symmetry. Now we take a generalized inverse G of A. Then we obtain

and this exhibits G as a generalized inverse of AK.

In the second step, we verify that the passage between dispersion matrices and generalized information matrices is order reversing, closely following the derivation for positive definite matrices from Section 1.11. There we started from

for positive definite matrices A ≥ B > 0. For nonnegative definite matrices C ≥ D ≥ 0, the basic relation is

which holds if D⁻ is a generalized inverse of D that is nonnegative definite. For instance we may choose D⁻ = GDG' with G ∈ D⁻.

3.23. EQUIVALENCE OF INFORMATION ORDERING AND DISPERSION ORDERING

Lemma. Let A and B be two matrices in the feasibility cone A(K). Then we have

   AK ≥ BK   if and only if   K'A⁻K ≤ K'B⁻K.

Proof. For the direct part, we apply relation (1) of the preceding section to C = AK and D = BK, and insert for D⁻ a nonnegative definite generalized inverse B⁻ of B,

Upon premultiplication by K'A⁻ and postmultiplication by its transpose, all terms simplify using AK A⁻ K = K, because of Lemma 3.22, and since the proof of Theorem 3.15 shows that AK and K have the same range and thus Lemma 1.17 applies. This leads to 0 ≤ K'B⁻K − K'A⁻K.

The converse part of the proof is similar, in that relation (1) of the preceding section is used with C = K'B⁻K and D = K'A⁻K. Again invoking Lemma 1.17 to obtain DD⁻K' = K', we get 0 ≤ CD⁻C − C. Now using CC⁻K' = K', premultiplication by KC⁻ and postmultiplication by its transpose yield 0 ≤ KD⁻K' − KC⁻K'. But we have KD⁻K' = AK and KC⁻K' = BK, by Theorem 1.20.


The lemma verifies desideratum (iv) of Section 3.21 that, on the feasibility cone A(K), the transition between dispersion matrices K'A⁻K and generalized information matrices AK is antitonic, relative to the Loewner orderings on the spaces Sym(s) and Sym(k).

It remains to resolve desideratum (v) of Section 3.21, that the functional properties of Theorem 3.13 continue to hold for generalized information matrices. Before we state these properties more formally, we summarize the four formulas for the generalized information matrix AK that parallel those of Section 3.12 for CK(A):

   AK = K(K'A⁻K)⁻K',                            (1)
   AK = A − AR'(RAR')⁻RA,                       (2)
   AK = QAQ',                                   (3)
   AK = min { QAQ' : Q ∈ R^{k×k}, QK = K }.     (4)

Formula (1) holds on the feasibility cone A(K), by the Gauss-Markov Theorem 1.20. Formula (2) is of Schur complement type, utilizing the residual projector R = I_k − KG, G ∈ K⁻. Formula (3) involves a left identity Q of K that is minimizing for A, as provided by the Gauss-Markov Theorem 1.19. Formula (4) has been adopted as the definition, and leads to functional properties analogous to those for the mapping CK, as follows.

3.24. PROPERTIES OF GENERALIZED INFORMATION MATRICES

Theorem. Let K be a nonzero k × s coefficient matrix that may be rank deficient, with associated generalized information matrix mapping

   A ↦ AK = min { QAQ' : Q ∈ R^{k×k}, QK = K },   defined on NND(k).

Then the mapping A ↦ AK is positively homogeneous, matrix superadditive, and nonnegative, as well as matrix concave and matrix isotonic. Moreover it enjoys any one of the three equivalent upper semicontinuity properties (a), (b), (c) of Theorem 3.13.

Given a matrix A ∈ NND(k), its generalized information matrix AK is uniquely characterized by the three properties

   A ≥ AK ≥ 0,   range AK ⊆ range K,   (range(A − AK)) ∩ (range K) = {0}.

It satisfies the range formula range AK = (range A) ∩ (range K), as well as the iteration formula A_{KH} = (AK)_{KH}, with any other s × r coefficient matrix H.


Proof. The functional properties follow from the definition as a minimum of the linear functions A ↦ QAQ', as in the full rank case of Theorem 3.13. The three characterizing properties are from Lemma 3.14, and the proof remains unaltered except for choosing for K generalized inverses G rather than left inverses L. The range formula is established in the proof of Theorem 3.15.

It remains to verify the iteration formula. Two of the defining relations are

   AK = min { QAQ' : QK = K },   (AK)_{KH} = min { R AK R' : R KH = KH }.

There exist matrices Q and R attaining the respective minima. Obviously RQ satisfies RQKH = RKH = KH, and therefore the definition of A_{KH} yields the first inequality in the chain

   A_{KH} ≤ RQ A Q'R' = R AK R' = (AK)_{KH} ≤ A_{KH}.

The second inequality follows from monotonicity of the generalized information matrix mapping for (KH)'θ, applied to AK ≤ A.

This concludes the discussion of the desiderata (i)-(v), as stipulated in Section 3.21. We will return to the topic of rank deficient parameter systems in Section 8.18.

3.25. CONTRAST INFORMATION MATRICES IN TWO-WAY CLASSIFICATION MODELS

A rank deficient parameter subsystem occurs with the centered contrasts Ka α of factor A in a two-way classification model, as mentioned in Section 3.20. By Section 1.5 the full parameter system is θ = (α', β')', and includes the effects β of factor B. Hence the present subsystem of interest can be written as

   K'θ = (Ka, 0)(α', β')' = Ka α,   with K = (Ka, 0)'.

In Section 1.27 we have seen that an a × b block design W has moment matrix

   M = [Δ_r  W; W'  Δ_s],

where Δ_r and Δ_s are the diagonal matrices formed from the treatment replication vector r = W 1_b and blocksize vector s = W' 1_a. We claim that the generalized information matrix for the centered treatment contrasts Ka α is

   MK = [Δ_r − W Δ_s⁻ W'  0; 0  0].        (1)

Except for vanishing subblocks, the matrix MK coincides with the Schur complement of Δ_s in M,

   Δ_r − W Δ_s⁻ W'.        (2)

The matrix (2) is called the contrast information matrix; it is often referred to as the C-matrix of the simple block design W. Therefore the definition of generalized information matrices for rank deficient subsystems is in accordance with the classical notion of C-matrices.

Contrast information matrices enjoy the important property of having row sums and column sums equal to zero. This is easy to verify directly, since postmultiplication of Δ_r − W Δ_s⁻ W' by the unity vector 1_a produces the null vector. We can deduce this result also from Theorem 3.24. Namely, the matrix MK has its range included in that of K, whence the range of Δ_r − W Δ_s⁻ W' is included in the range of Ka. Thus the nullspace of the matrix Δ_r − W Δ_s⁻ W' includes the nullspace of Ka, forcing (Δ_r − W Δ_s⁻ W') 1_a to be the null vector.

To prove the claim (1), we choose for K the generalized inverse K⁻ = (Ka, 0), with residual projector

   R = I_{a+b} − K K⁻ = [Ja  0; 0  I_b].

Let 1̄_a denote the a × 1 vector with all components equal to 1/a. In other words, this is the row sum vector that corresponds to the uniform distribution on the a levels of factor A. From Ja = 1_a 1̄_a' and Δ_r 1_a = r, as well as W' 1_a = s, we obtain


Using these relations, we further get

Hence for RMR' we can choose the generalized inverse

Multiplication now yields

In the last line we have used Δ_s Δ_s⁻ W' = W'. This follows from Lemma 1.17 since the submatrix Δ_s lies in the feasibility cone A(W), by Lemma 3.12. (Alternatively we may look at the meaning of the terms involved. If the blocksize s_j is zero, then s_j s_j⁻ = 0 and the j-th column of W vanishes. If s_j is positive, then s_j s_j⁻ = s_j s_j⁻¹ = 1.) In summary, we obtain

and formula (1) is verified. We return to this model in Section 4.3.
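A short numerical sketch (NumPy; the weight matrix W is an illustrative choice) computes the C-matrix Δ_r − WΔ_s⁻W' directly and confirms the zero row and column sums noted above:

```python
import numpy as np

def c_matrix(W):
    """Contrast information matrix (C-matrix) Delta_r - W Delta_s^- W'
    of an a x b block design with weight matrix W (a sketch)."""
    r = W.sum(axis=1)                       # treatment replication vector
    s = W.sum(axis=0)                       # blocksize vector
    s_inv = np.where(s > 0, 1.0 / np.where(s > 0, s, 1.0), 0.0)  # Delta_s^-
    return np.diag(r) - W @ np.diag(s_inv) @ W.T

# Illustrative 3 x 2 block design (weights sum to one):
W = np.array([[0.25, 0.10],
              [0.15, 0.20],
              [0.10, 0.20]])
C = c_matrix(W)
print(np.allclose(C.sum(axis=0), 0), np.allclose(C.sum(axis=1), 0))  # True True
```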

The Loewner ordering fails to be a total ordering. Hence a maximization or a minimization problem relative to it may, or may not, have a solution. The Gauss-Markov Theorem 1.19 is a remarkable instance where minimization relative to the Loewner ordering can be carried out. The design problem embraces too general a situation; maximization of information matrices in the Loewner ordering is generally infeasible. However, on occasion Loewner optimality can be achieved, and this is explored next.


EXERCISES

3.1 For the compact, convex set

and the matrix

verify

Is CK(M) compact, convex?

3.2 Let the matrix X = (X_1, X_2) and the vector θ = (θ_1', θ_2')' be partitioned so that Xθ = X_1θ_1 + X_2θ_2. Show that the information matrix for θ_2 is X_{2·1}'X_{2·1}, where X_{2·1} = (I − X_1(X_1'X_1)⁻X_1')X_2.

3.3 In the general linear model E[Y] = Kθ and D[Y] = A, find the joint moments of Y and the zero mean statistic Z = RY, where R = I − KL for some generalized inverse L of K. What are the properties of the covariance adjusted estimator Y − AR'(RAR')⁻Z of E[Y]? [Rao (1967), p. 359].

3.4 In the linear model Y ~ N_n(Xθ, σ²I_n), show that a 1 − α confidence ellipsoid for γ = K'θ is given by

where F_{s,n−k}(1 − α) is the 1 − α quantile of the F-distribution with numerator degrees of freedom s and denominator degrees of freedom n − k.

3.5 Let K ∈ R^{k×s} be of full column rank s. For all A ∈ NND(k) and z ∈ R^s, show that (i) K'A⁻K ⊆ (CK(A))⁻, (ii) z ∈ range CK(A) if and only if Kz ∈ range A [Heiligers (1991), p. 26].


3.6 Show that K(K'K)⁻²K' ∈ (KK')⁻ and K'(KK')⁻K = I_s, for all K ∈ R^{k×s} with rank K = s.

3.7 Show that (K'A⁻K)⁻¹ ≤ (K'K)⁻¹K'AK(K'K)⁻¹ for all A ∈ A(K), with equality if and only if range AK ⊆ range K.

3.8 In Section 3.16, show that the path

is tangent to NND(2) at δ = 0.

3.9 Let K ∈ R^{k×s} be of arbitrary rank. For all A, B ∈ NND(k), show that AK ≥ BK if and only if range AK ⊇ range BK and c'A⁻c ≤ c'B⁻c for all c ∈ range BK.

3.10 For all A ∈ NND(k), show that (i) (AK)_K = AK, (ii) if range K = range H then AK = AH, (iii) if range A ⊆ range K then AK = A.

3.11 Show that (xx')_K equals xx' or 0 according as x ∈ range K or not.

3.12 Show that A_c equals c(c'A⁻c)⁻¹c' or 0 according as c ∈ range A or not.

3.13 In the two-way classification model of Section 3.25, show that L_M = (I_a, −W Δ_s⁻) is a left identity of K = (Ka, 0)' that is minimizing for M.

3.14 (continued) Show that the Moore-Penrose inverse of K'M⁻K is Δ_r − W Δ_s⁻ W'.

3.15 (continued) Show that Δ_r⁻ + Δ_r⁻ W D⁻ W' Δ_r⁻ ∈ (Δ_r − W Δ_s⁻ W')⁻, where D = Δ_s − W' Δ_r⁻ W is the information matrix for the centered block contrasts [John (1987), p. 39].


C H A P T E R 4

Loewner Optimality

Optimality of information matrices relative to the Loewner ordering is discussed. This concept coincides with dispersion optimality, and with simultaneous optimality for all scalar subsystems. An equivalence theorem is presented. A nonexistence theorem shows that the set of competing moment matrices must not be too large for Loewner optimal designs to exist. As an illustration, product designs for a two-way classification model are shown to be Loewner optimal if the treatment replication vector is fixed. The proof of the equivalence theorem for Loewner optimality requires a refinement of the equivalence theorem for scalar optimality. This refinement is presented in the final sections of the chapter.

4.1. SETS OF COMPETING MOMENT MATRICES

The general problem we consider is to find an optimal design ξ in a specified subclass H of the set of all designs, H ⊆ Ξ. The optimality criteria to be discussed in this book depend on the design ξ only through the moment matrix M(ξ). In terms of moment matrices, the subclass H of Ξ leads to a subset M(H) of the set of all moment matrices, M(H) ⊆ M(Ξ). Therefore we study a subset of moment matrices,

   M ⊆ M(Ξ),

which we call the set of competing moment matrices. Throughout this book we make the grand assumption that there exists at least one competing moment matrix that is feasible for the parameter subsystem K'θ of interest,

   M ∩ A(K) ≠ ∅.

Otherwise there is no design under which K'θ is identifiable, and our optimization problem would be statistically meaningless.


The subset M of competing moment matrices is often convex. Simple consequences of convexity are the following.

4.2. MOMENT MATRICES WITH MAXIMUM RANGE AND RANK

Lemma. Let the set M of competing moment matrices be convex. Then there exists a matrix M ∈ M which has a maximum range and rank, that is, for all A ∈ M we have

   range A ⊆ range M   and   rank A ≤ rank M.

Moreover the information matrix mapping CK permits regularization along straight lines in M,

   CK(A) = lim_{α→0+} CK((1−α)A + αB)   for all A, B ∈ M.

Proof. We can choose a matrix M ∈ M with rank as large as possible, rank M = max{rank A : A ∈ M}. This matrix also has a range as large as possible. To show this, let A ∈ M be any competing moment matrix. Then the set M contains B = ½A + ½M, by convexity. From Lemma 2.3, we obtain

range A ⊆ range B and range M ⊆ range B. The latter inclusion holds with equality, since the rank of M is assumed maximal. Thus range A ⊆ range B = range M, and the first assertion is proved.

Moreover the set M contains the path (1 − α)A + αB running within M from A to B. Positive homogeneity of the information matrix mapping CK, established in Theorem 3.13, permits us to extract the factor 1 − α. This gives

   CK((1 − α)A + αB) = (1 − α) CK(A + (α/(1 − α))B).

This expression has limit CK(A) as α tends to zero, by matrix upper semicontinuity,

   lim_{α→0+} CK((1 − α)A + αB) = CK(A).

Often there exist positive definite competing moment matrices, and the maximum rank in M then simply equals k.

4.3. MAXIMUM RANGE IN TWO-WAY CLASSIFICATION MODELS

The maximum rank of moment matrices in a two-way classification model is now shown to be a + b − 1. As in Section 1.27, we identify designs ξ ∈ Ξ with their a × b weight matrices W ∈ T. An a × b block design W has a degenerate moment matrix,

   M(W) (1_a', −1_b')' = 0.

Hence the rank is at most equal to a + b − 1. This value is actually attained for the uniform design, assigning uniform mass 1/(ab) to every point (i, j) in the experimental domain T = {1, ..., a} × {1, ..., b}. The uniform design has weight matrix 1̄_a 1̄_b' = (1/(ab)) 1_a 1_b'. We show that

occurs only if x = δ1_a and y = −δ1_b, for some δ ∈ R. But

entails x = −(y.)1_a, while

gives y = −(x.)1_b. Premultiplication of x = −(y.)1_a by 1̄_a' yields x. = −y. = δ, say. Therefore the uniform design has a moment matrix of maximum rank a + b − 1.

An important class of designs for the two-way classification model are the product designs. By definition, their weights are the products of row sums r_i and column sums s_j, w_ij = r_i s_j. In matrix terms, this means that the weight matrix W is of the form

   W = rs',

with row sum vector r and column sum vector s. In measure theory terms, the joint distribution W is the product of the marginal distributions r and s.

An exact design for sample size n determines frequencies n_ij that sum to n. With marginal frequencies n_i. = Σ_{j≤b} n_ij and n_.j = Σ_{i≤a} n_ij, the product property turns into n_ij = n_i. n_.j / n. For this reason a product design is sometimes called a proportional frequency design.

We call a vector r positive and write r > 0 when all its entries are positive. A product design with positive row and column sum vectors again has


maximal rank,

   rank M(rs') = a + b − 1.

The proof of rank maximality for the uniform design carries over, since

implies x = −(s'y)1_a and y = −(r'x)1_b with r'x = −s'y.

The present model also serves as an example that subsets M of the full set M(Ξ) of all moment matrices are indeed of interest. For instance, fixing the row sum vector r means fixing the replication numbers r_i for levels i = 1, ..., a of factor A. The weight matrices that achieve the given row sum vector r lead to the subsets T(r) = {W ∈ T : W1_b = r} and M(r) = M(T(r)). The subset T(r) of block designs with given treatment replication vector r is convex, as is the subset M(r) of corresponding moment matrices. The set M(r) is also compact, being a closed subset of the compact set M(Ξ), see Lemma 1.26. If the row sum vector r is positive, then the maximum rank in M(r) is a + b − 1, and is attained by product designs rs' with s > 0. We return to this model in Section 4.8.
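The rank claims are easy to check numerically. In the sketch below (NumPy; the helper and the particular r and s are ours), the moment matrices of the uniform design and of a product design with positive marginals both have rank a + b − 1:

```python
import numpy as np

def moment_matrix_blockdesign(W):
    """Moment matrix of an a x b block design with weight matrix W:
    M = [[Delta_r, W], [W', Delta_s]] (illustrative helper)."""
    r, s = W.sum(axis=1), W.sum(axis=0)
    return np.block([[np.diag(r), W], [W.T, np.diag(s)]])

a, b = 3, 4
uniform = np.full((a, b), 1.0 / (a * b))
r = np.array([0.5, 0.3, 0.2])
s = np.array([0.4, 0.3, 0.2, 0.1])
product = np.outer(r, s)                        # product design rs'
for W in (uniform, product):
    M = moment_matrix_blockdesign(W)
    print(np.linalg.matrix_rank(M), a + b - 1)  # both print 6 6
```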

4.4. LOEWNER OPTIMALITY

It is time now to be more specific about the meanings of optimality. The most gratifying criterion refers to the Loewner comparison of information matrices. It goes hand in hand with estimation problems, testing hypotheses, and general parametric model-building, as outlined in Section 3.4 to Section 3.10.

DEFINITION. A moment matrix M ∈ M is called Loewner optimal for K'θ in M when the information matrix for K'θ satisfies

   CK(M) ≥ CK(A)   for all A ∈ M,

provided the k × s coefficient matrix K has full column rank s. If K is rank deficient, we take recourse to the generalized information matrices for K'θ and demand

   MK ≥ AK   for all A ∈ M.

The optimality properties of designs ξ are determined by their moment matrices M(ξ). Given a subclass of designs, H ⊆ Ξ, a design ξ ∈ Ξ is called Loewner optimal for K'θ in H when its moment matrix M(ξ) is Loewner optimal for K'θ in M(H).

Page 135: Pukelsheim Optimal DoE

102 CHAPTER 4: LOEWNER OPTIMALITY

The following theorem summarizes the various aspects of Loewner optimality, emphasizing in turn (a) information matrices, (b) dispersion matrices, and (c) simultaneous scalar optimality. We do not require the coefficient matrix K to be of full column rank s. From (b), every Loewner optimal moment matrix M is feasible, that is, M ∈ A(K). Part (c) comes closest to the idea of uniform optimality, a notion that some authors prefer.

4.5. DISPERSION OPTIMALITY AND SIMULTANEOUS SCALAR OPTIMALITY

Theorem. Let the set M of competing moment matrices be convex. Then for every moment matrix M ∈ M, the following statements are equivalent:

a. (Information optimality) M is Loewner optimal for K'θ in M.

b. (Dispersion optimality) M ∈ A(K) and K'M⁻K ≤ K'A⁻K for all A ∈ M ∩ A(K).

c. (Simultaneous scalar optimality) M is optimal for c'θ in M, for all vectors c in the range of K.

Proof. The proof that (a) implies (c) is a case of iterated information, since for some vector z ∈ R^s, the coefficient vector c satisfies c = Kz. Monotonicity yields M_c = (MK)_c ≥ (AK)_c = A_c, for all A ∈ M, by Theorem 3.24. Hence M is optimal for c'θ in M.

Next we establish that (c) implies (b). Firstly, being optimal for c'θ, the matrix M must lie in the feasibility cone A(c). Since this is true for all vectors c in the range of K, the matrix M actually lies in the cone A(K). Secondly, the scalar inequalities z'K'M⁻Kz ≤ z'K'A⁻Kz, with arbitrary vector z ∈ R^s, evidently prove (b).

Finally we show that (b) implies (a). Let B ∈ M ∩ A(K) be a feasible competing moment matrix that satisfies K'M⁻K ≤ K'B⁻K. Lemma 3.23 gives MK ≥ BK. Hence M is Loewner optimal in M ∩ A(K). This extends to all matrices A ∈ M, feasible or not, by regularization. For α ∈ (0;1), feasibility of B entails feasibility of (1 − α)A + αB, whence MK ≥ ((1 − α)A + αB)K. A passage to the limit as α tends to zero yields MK ≥ AK. Thus M is Loewner optimal in M.

Part (c) enables us to deduce an equivalence theorem for Loewner optimality that is similar in nature to the Equivalence Theorem 2.16 for scalar optimality. There we concentrated on the set M(Ξ) of all moment matrices. We now use (and prove from Section 4.9 on) that Theorem 2.16 remains true with the set M(Ξ) replaced by a set M of competing matrices that is compact and convex.

Page 136: Pukelsheim Optimal DoE

4.6. GENERAL EQUIVALENCE THEOREM FOR LOEWNER OPTIMALITY 103

4.6. GENERAL EQUIVALENCE THEOREM FOR LOEWNER OPTIMALITY

Theorem. Assume the set M of competing moment matrices is compact and convex. Let the competing moment matrix M ∈ M have maximum rank. Then M is Loewner optimal for K'θ in M if and only if

   K'M⁻AM⁻K ≤ K'M⁻K   for all A ∈ M.

Proof. As a preamble, we verify that the product AGK is invariant to the choice of generalized inverse G ∈ M⁻. In view of the maximum range assumption, the range of M includes the range of K, as well as the range of every other competing moment matrix A. Thus from K = MM⁻K and A = MM⁻A, we get AGK = AM⁻MGMM⁻K = AM⁻K for all G ∈ M⁻. It follows that K'M⁻AM⁻K is well defined, as is K'M⁻K.

The converse provides a sufficient condition for optimality. We present an argument that does not involve the rank of K, utilizing the generalized information matrices AK from Section 3.21. Let G be a generalized inverse of M and introduce Q = MK G' = K(K'M⁻K)⁻K'G'. From Lemma 1.17, we obtain QK = K. This yields

The second inequality uses the inequality from the theorem.

The direct part, necessity of the condition, invokes Theorem 2.16 with the subset M in place of the full set M(Ξ), postponing a rigorous proof of this result to Theorem 4.13. Given any vector c = Kz in the range of K, the matrix M is optimal for c'θ in M, by part (c) of Theorem 4.5. Because of Theorem 2.16, there exists a generalized inverse G ∈ M⁻, possibly dependent on the vector c and hence on z, such that

   c'GAG'c ≤ c'M⁻c   for all A ∈ M.

As verified in the preamble, the product AGK is invariant to the choice of the generalized inverse G. Hence the left hand side becomes z'K'M⁻AM⁻Kz. Since z is arbitrary, the desired matrix inequality follows.

The theorem allows for arbitrary subsets of all moment matrices, M ⊆ M(Ξ), as long as they are compact and convex. We repeat that the necessity part of the proof invokes Theorem 2.16, which covers the case M = M(Ξ) only. The present theorem, with M = M(Ξ), is vacuous.

4.7. NONEXISTENCE OF LOEWNER OPTIMAL DESIGNS

Corollary. No moment matrix is Loewner optimal for K'θ in M(Ξ) unless the coefficient matrix K has rank 1.

Proof. Assume that ξ is a design that is Loewner optimal for K'θ in Ξ, with moment matrix M and with support points x_1, ..., x_t. If t = 1, then ξ is a one-point design on x_1 and range K ⊆ range x_1x_1' forces K to have rank 1.

Otherwise t ≥ 2. Applying Theorem 4.6 to A = x_i x_i', we find

   K'M⁻x_i x_i'M⁻K ≤ K'M⁻K   for all i = 1, ..., t.

Here equality must hold since a single strict inequality leads to the contradiction

because of t ≥ 2. Now Lemma 1.17 yields the assertion by way of comparing ranks, rank K = rank K'M⁻K = rank K'M⁻x_i x_i'M⁻K = 1.

The destructive nature of this corollary is deceiving. Firstly, and above all, an equivalence theorem gives necessary and sufficient conditions for a design or a moment matrix to be optimal, and this is genuinely distinct from an existence statement. Indeed, the statements

   "If a design is optimal then it must look like this,"
   "If it looks like this then it must be optimal,"

in no way assert that an optimal design exists. If existence fails to hold, then the statements are vacuous, but logically true.

Secondly, the nonexistence result is based on the Equivalence Theorem 4.6. Equivalence theorems provide an indispensable tool to study the existence problem. The present corollary ought to be taken as a manifestation of the constructive contribution that an equivalence theorem adds to the theory. In Chapter 8, we deduce from the General Equivalence Theorem insights about the number of support points of an optimal design, their location, and their weights.

Thirdly, the corollary stresses the role of the subset M of competing moment matrices, pointing out that the full set M(Ξ) is too large to permit Loewner optimal moment matrices. The opposite extreme occurs if the subset consists of a single moment matrix, M_0 = {M_0}. Of course, M_0 is Loewner optimal for K'θ in M_0. The relevance of Loewner optimality lies somewhere between these two extremes.

Subsets M of competing moment matrices tend to be of interest for the reason that they show more structure than the full set M(Ξ). The special structure of M often permits a direct derivation of Loewner optimality, circumventing the Equivalence Theorem 4.6.

4.8. LOEWNER OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS

This section continues the discussion of Section 4.3 for the two-way classification model. We are interested in the centered contrasts of factor A,

   K'θ = (Ka, 0)(α', β')' = Ka α.

By Section 3.25, the contrast information matrix of a block design W is Δ_r − W Δ_s⁻ W'. If a moment matrix is feasible for the centered contrasts Ka α, then W must have a positive row sum vector. For if r_i vanishes, then the i-th row and column of the contrast information matrix Δ_r − W Δ_s⁻ W' are zero. Then its nullity is larger than 1, and its range cannot include the range of Ka.

Any product design has weight matrix W = rs', and fulfills W Δ_s⁻ W' = rs'Δ_s⁻sr' = rr'. This is so since Δ_s⁻s is a vector with the j-th entry equal to 1 or 0 according as s_j is positive or vanishes, entailing s'Δ_s⁻s = Σ_{j: s_j>0} s_j = 1. Therefore all product designs with row sum vector r share the same contrast information matrix Δ_r − rr'. They are feasible for the treatment contrasts if and only if r is positive.

Let r be a fixed row sum vector that is positive. We claim the following.

Claim. The product designs rs' with arbitrary column sum vector s are the only Loewner optimal designs for the centered contrasts of factor A in the set T(r) of block designs with row sum vector equal to r, with contrast information matrix Δ_r − rr'.

Proof. There is no loss of generality in assuming the column sum vector s to be positive. This secures a maximum rank for the moment matrix M of the product design rs'. A generalized inverse G for M and the matrix GK are given by

Page 139: Pukelsheim Optimal DoE

106 CHAPTER 4: LOEWNER OPTIMALITY

Now we take a competing moment matrix A, with top left block A_11. In Theorem 4.6, the left hand side of the inequality turns into

The right hand side equals K'M⁻K = Ka Δ_r⁻¹ Ka. Hence the two sides coincide if A_11 = Δ_r, that is, if A lies in the subset M(T(r)). Thus Theorem 4.6 proves that the product design rs' is Loewner optimal, Δ_r − rr' ≥ Δ_r − W Δ_s⁻ W' for all W ∈ T(r). But then every weight matrix W ∈ T(r) that is optimal must satisfy W Δ_s⁻ W' = rr', forcing W to have rank 1 and hence to be of the form W = rs'. Thus our claim is verified.

Brief contemplation opens a more direct route to this result. For an arbitrary block design W ∈ T(r) with column sum vector s, we have W Δ_s⁻ s = r as well as s'Δ_s⁻s = 1. Therefore, we obtain the inequality

   W Δ_s⁻ W' ≥ (W Δ_s⁻ s)(s'Δ_s⁻ s)⁻¹(s'Δ_s⁻ W') = rr'.

Equality holds if and only if W = rs'. This verifies the assertion without any reference to Theorem 4.6.
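The claim can also be checked numerically for a concrete pair of designs with the same row sums (NumPy sketch; the matrices are illustrative choices): the difference of the two contrast information matrices is nonnegative definite.

```python
import numpy as np

def c_matrix(W):
    """C-matrix Delta_r - W Delta_s^- W' of a block design (sketch)."""
    r, s = W.sum(axis=1), W.sum(axis=0)
    s_inv = np.where(s > 0, 1.0 / np.where(s > 0, s, 1.0), 0.0)
    return np.diag(r) - W @ np.diag(s_inv) @ W.T

r = np.array([0.5, 0.3, 0.2])                     # fixed row sum vector
s = np.array([0.6, 0.4])
W_prod = np.outer(r, s)                           # product design rs'
W_other = np.array([[0.40, 0.10],                 # another design in T(r)
                    [0.10, 0.20],
                    [0.10, 0.10]])
diff = c_matrix(W_prod) - c_matrix(W_other)       # should be >= 0 (Loewner)
print(np.round(np.linalg.eigvalsh(diff), 8) >= 0)  # all True
```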

We emphasize that, in contrast to the majority of the block design literature, we here study designs for infinite sample size. For a given sample size n, designs such as W = rs' need not be realizable, in that the numbers nw_ij = n r_i s_j may well fail to be integers. Yet, as pointed out in Section 1.24, the general theory leads to general principles which are instructive. For example, if we can choose a design for sample size n that is a proportional frequency design, then this product structure guarantees a desirable optimality property.

It is noteworthy that arbitrary column sum vectors s are admitted for the product designs rs'. Specifically, all observations can be taken in the first block, s = (1, 0, ..., 0)', whence the corresponding moment matrix has rank a rather than maximum rank a + b − 1. Therefore Loewner optimality may hold true even if the maximum rank hypothesis of Theorem 4.6 fails to hold.

The class of designs with moment matrices that are feasible for the centered contrasts decomposes into the cross sections T(r) of designs with positive row sum vector r. Within one cross section, the information matrix Δ_r − rr' is Loewner optimal. Between all cross sections, Loewner optimality is ruled out by Corollary 4.7. We return to this model in Section 8.19.

Page 140: Pukelsheim Optimal DoE

4.9. THE PENUMBRA OF THE SET OF COMPETING MOMENT MATRICES 107

4.9. THE PENUMBRA OF THE SET OF COMPETING MOMENT MATRICES

Before pushing on into the more general theory, we pause and generalize the Equivalence Theorem 2.16 for scalar optimality, as is needed by the General Equivalence Theorem 4.6 for Loewner optimality. This makes for a self-contained exposition of the present chapter. But our main motivation is to acquire a feeling of where the general development is headed. Indeed, the present Theorem 4.13 is but a special case of the General Equivalence Theorem 7.14.

As in Section 2.13, we base our argument on the supporting hyperplane theorem. However, we leave the vector space R^k which includes the regression range X, and move to the matrix space Sym(k) which includes the set of competing moment matrices M. The reason is that, unlike the full set M(Ξ), a general convex subset M cannot be generated as the convex hull of a set of rank 1 matrices.

Even though our reference space changes from R^k to Sym(k), our argument is geometric. Whereas before, points in the reference space were column vectors, they now turn into matrices. And whereas before, we referred to the Euclidean scalar product of vectors, we now utilize the Euclidean matrix scalar product ⟨A, B⟩ = trace AB on Sym(k). A novel feature which we exploit instantly is the Loewner ordering A ≤ B that is available in the matrix space Sym(k).

In attacking the present problem along the same lines as the Elfving Theorem 2.14, we need a convex set in the space Sym(k) which takes the place of the Elfving set R of Section 2.9. This set of matrices is given by

   P = M − NND(k) = { M − B : M ∈ M, B ∈ NND(k) }.

We call P the penumbra of M since it may be interpreted as a union of shadow lines for a light source located in M + B, where B ∈ NND(k). The point M ∈ M then casts the shadow half line {M − δB : δ ≥ 0}, generating the shadow cone M − NND(k) as B varies over NND(k). This is a translation of the cone −NND(k) so that its tip comes to lie in M. The union over M ∈ M of these shadow cones is the penumbra P (see Exhibit 4.1).

There is an important alternative way of expressing that a matrix A ∈ Sym(k) belongs to the penumbra P,

   A ∈ P   if and only if   A ≤ M for some M ∈ M.

This is to say that P collects all matrices A that in the Loewner ordering lie below some moment matrix M ∈ M. An immediate consequence is the following.

Page 141: Pukelsheim Optimal DoE

108 CHAPTER 4: LOEWNER OPTIMALITY

EXHIBIT 4.1 Penumbra. The penumbra P originates from the set M ⊆ NND(k) and recedes in all directions of −NND(k). By definition of p²(c), the rank 1 matrix cc'/p²(c) is a boundary point of P, with supporting hyperplane determined by N ≥ 0. The picture shows the equivalent geometry in the plane R².

4.10. GEOMETRY OF THE PENUMBRA

Lemma. Let the set M of competing moment matrices be compact and convex. Then the penumbra P is a closed convex set in the space Sym(k).

Proof. Given two matrices A = M − B and Ã = M̃ − B̃ with M, M̃ ∈ M and B, B̃ ≥ 0, convexity follows with α ∈ (0;1) from

In order to show closedness, suppose (A_m)_{m≥1} is a sequence of matrices in P converging to A in Sym(k). For appropriate matrices M_m ∈ M we have A_m ≤ M_m. Because of compactness of the set M, the sequence (M_m)_{m≥1} has a cluster point M ∈ M, say. Thus A ≤ M and A lies in P. The proof is complete.

Let c ∈ R^k be a nonvanishing coefficient vector of a scalar subsystem c'θ. Recall the grand assumption of Section 4.1 that there exists at least one competing moment matrix that is feasible, M ∩ A(c) ≠ ∅. The generalization of the design problem of Section 2.7 is

Page 142: Pukelsheim Optimal DoE

4.11. EXISTENCE THEOREM FOR SCALAR OPTIMALITY 109

In other words, we wish to determine a competing moment matrix M ∈ M that is feasible for c'θ, and that leads to a variance c'M⁻c of the estimate for c'θ which is an optimum compared to the variance under all other feasible competing moment matrices.

In the Elfving Theorem 2.14, we identified the optimal variance as the square of the Elfving norm, (p(c))². Now the corresponding quantity turns out to be the nonnegative number

   p²(c) = inf { δ > 0 : cc' ∈ δP }.

For δ > 0, we have (δM) − NND(k) = δP. Hence a positive value p²(c) > 0 provides the scale factor needed to blow up or shrink the penumbra P so that the rank one matrix cc' comes to lie on its boundary. The scale factor p²(c) has the following properties.

4.11. EXISTENCE THEOREM FOR SCALAR OPTIMALITY

Theorem. Let the set M of competing moment matrices be compact and convex. There exists a competing moment matrix that is feasible for c'θ, M ∈ M ∩ A(c), such that

   cc' ≤ p²(c) M.

Every such matrix is optimal for c'θ in M, and the optimal variance is p²(c).

Proof. For δ > 0, we have cc' ∈ (δM) − NND(k) if and only if cc' ≤ δM for some matrix M ∈ M. Hence there exists a sequence of scalars δ_m > 0 and a sequence of moment matrices M_m ∈ M such that cc' ≤ δ_m M_m and p²(c) = lim_{m→∞} δ_m. Because of compactness, the sequence (M_m)_{m≥1} has a cluster point M ∈ M, say. Hence we have cc' ≤ p²(c)M. This forces p²(c) to be positive. Otherwise cc' ≤ 0 implies c = 0, contrary to the assumption c ≠ 0 from Section 2.7. With p²(c) > 0, Lemma 2.3 yields c ∈ range cc' ⊆ range M. Hence M is feasible, M ∈ A(c).

The inequality c'M⁻c ≤ p²(c) follows from c'M⁻(cc')M⁻c ≤ p²(c) c'M⁻MM⁻c = p²(c) c'M⁻c. The converse inequality, p²(c) = inf{δ > 0 : cc' ∈ δP} ≤ c'M⁻c, is a consequence of Theorem 1.20, since c(c'M⁻c)⁻¹c' ≤ M means cc' ≤ (c'M⁻c)M ∈ (c'M⁻c)P.

Finally we take any other competing moment matrix that is feasible, A ∈ M ∩ A(c). It satisfies cc' ≤ (c'A⁻c)A ∈ (c'A⁻c)P, and therefore p²(c) ≤ c'A⁻c. Hence p²(c) = c'M⁻c is the optimal variance, and M is optimal for c'θ in M.

Geometrically, the theorem tells us that the penumbra P includes the segment {αcc' : α ≤ 1/p²(c)} of the line {αcc' : α ∈ R}. Moreover the matrix cc'/p²(c) lies on the boundary of the penumbra P. Whereas in Section 2.13 we used a hyperplane in R^k supporting the Elfving set R at its boundary point c/p(c), we now take a hyperplane in Sym(k) supporting the penumbra P at cc'/p²(c). Since the set P has been defined so as to recede in all directions of −NND(k),

the supporting hyperplane permits a matrix N normal to it that is nonnegative definite.

4.12. SUPPORTING HYPERPLANES TO THE PENUMBRA

Lemma. Let the set M of competing moment matrices be compact and convex. There exists a nonnegative definite k × k matrix N such that

   trace AN ≤ 1 = c'Nc/p²(c)   for all A ∈ M.

Proof. In the first part of the proof, we make the preliminary assumption that the set M of competing moment matrices intersects the open cone PD(k).

Our arguments parallel those of Section 2.13. Namely, the matrix cc'/p²(c) lies on the boundary of the penumbra P. Thus there exists a supporting hyperplane to the set P at the point cc'/p²(c), that is, there exist a nonvanishing matrix Ñ ∈ Sym(k) and a real number γ such that

The penumbra P includes −NND(k), the negative of the cone NND(k). Hence if A is a nonnegative definite matrix, then −δA lies in P for all δ ≥ 0, giving

This entails trace AÑ ≥ 0 for all A ≥ 0. By Lemma 1.8, the matrix Ñ is nonnegative definite. Owing to the preliminary assumption, there exists a positive definite moment matrix B ∈ M. With 0 ≠ Ñ ≥ 0, we get 0 < trace BÑ ≤ γ. Dividing by γ > 0 and setting N = Ñ/γ ≠ 0, we obtain a matrix N with the desired properties.

In the second part of the proof, we treat the general case that the set M intersects just the feasibility cone A(c), and not necessarily the open cone PD(k). By Lemma 4.2, we can find a moment matrix M ∈ M with maximum range, range A ⊆ range M for all A ∈ M. Let r be the rank of M and choose

Page 144: Pukelsheim Optimal DoE

4.13. GENERAL EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY 111

an orthonormal basis u_1, ..., u_r ∈ R^k for its range. Then the k × r matrix U = (u_1, ..., u_r) satisfies range M = range U and U'U = I_r. Thus the matrix UU' projects onto the range of M. From Lemma 1.17, we get UU'c = c and UU'A = A for all A ∈ M.

Now we return to the discussion of the scale factor p²(c). The preceding properties imply that for all δ > 0 and all matrices M ∈ M, we have

Therefore the scale factor p²(c) permits the alternative representation

The set U'MU of reduced moment matrices contains the positive definite matrix U'MU.

With c̃ = U'c, the first part of the proof supplies a nonnegative definite r × r matrix Ñ such that trace ÃÑ ≤ 1 = c̃'Ñc̃/p²(c) for all Ã ∈ U'MU. Since Ã is of the form U'AU, we have trace ÃÑ = trace AUÑU'. On the other side, we get c̃'Ñc̃ = c'UÑU'c. Therefore the k × k matrix N = UÑU' satisfies the assertion, and the proof is complete.

The supporting hyperplane inequality in the lemma is phrased with reference to the set M, although the proof extends its validity to the penumbra P. But the extended validity is beside the point. The point is that P is instrumental in securing nonnegative definiteness of the matrix N. This done, we dismiss the penumbra P. Nonnegative definiteness of N becomes essential in the proof of the following general result.

4.13. GENERAL EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY

Theorem. Assume the set M of competing moment matrices is compact and convex, and intersects the feasibility cone A(c).

Then a competing moment matrix M ∈ M is optimal for c'θ in M if and only if M lies in the feasibility cone A(c) and there exists a generalized inverse G of M such that

   c'GAG'c ≤ c'M⁻c   for all A ∈ M.

Proof. The converse is established as in the proof of Theorem 2.16. For the direct part we choose a nonvanishing matrix N ≥ 0 from Lemma 4.12 and define the vector h = Nc/p(c), with p(c) = √(p²(c)). From Lemma 4.12,

Page 145: Pukelsheim Optimal DoE

112 CHAPTER 4: LOEWNER OPTIMALITY

we have p²(c) = c'Nc, and c'h = p(c). We claim that the vector h satisfies the three conditions

   h'Ah ≤ 1 for all A ∈ M,   (0)
   h'Mh = 1,   (1)
   Mh = c/p(c).   (2)

By definition of h, we get h'Ah = c'NANc/p²(c) = trace ANc(c'Nc)⁻¹c'N. For nonnegative definite matrices V, Theorem 1.20 provides the general inequality X(X'V⁻X)⁻X' ≤ V, provided range X ⊆ range V. Since N is nonnegative definite, the inequality applies with V = N and X = Nc, giving Nc(c'NN⁻Nc)⁻c'N ≤ N. This and the supporting hyperplane inequality yield

Thus condition (0) is established.

For the optimal moment matrix M, the variance c'M⁻c coincides with the optimal value p²(c) from Theorem 4.11. Together with p²(c) = c'Nc, we obtain

The converse inequality, h'Mh ≤ 1, appears in condition (0). This proves condition (1).

Now we choose a square root decomposition M = KK', and introduce the vector a = K'h − K'M⁻c(c'M⁻c)⁻¹c'h (compare the proof of Theorem 2.11). The squared norm of a vanishes,

by condition (1), and because of c'M⁻c = p²(c) = (c'h)². This establishes condition (2).

The remainder of the proof, construction of a suitable generalized inverse G of M, is duplicated from the proof of Theorem 2.16.

The theorem may also be derived as a corollary to the General Equivalence Theorem 7.14. Nevertheless, the present derivation points to a further way of proceeding. Sets M of competing moment matrices that are genuine subsets of the set M(Ξ) of all moment matrices force us to deal properly with matrices. They preclude shortcut arguments based on regression vectors that provide such an elegant approach to the Elfving Theorem 2.14.

Page 146: Pukelsheim Optimal DoE

EXERCISES 113

Hence our development revolves around moment matrices and information matrices, and matrix problems in matrix space. It is then by means of information functions, to be introduced in the next chapter, that the matrices of interest are mapped into the real line.

EXERCISES

4.1 What is wrong with the following argument:

4.2 In the two-way classification model, use the iteration formula to calcu-late the generalized information matrix for Kaa from the generalizedinformation matrix

[Krafft (1990), p. 461].

4.3 (continued) Is there a Loewner optimal design for Kaa in the class{W e T: W'la — s} of designs with given blocksize vector 5?

4.4 (continued) Consider the design

for sample size n = a+b—1. Show that the information matrix of W /n forthe centered treatment contrasts is Ka/n. Find the <j> -efficiency relativeto an equireplicated product design 1 as''.

Page 147: Pukelsheim Optimal DoE

C H A P T E R 5

Real Optimality Criteria

On the closed cone of nonnegative definite matrices, real optimality criteriaare introduced as functions with such properties as are appropriate to measurelargeness of information matrices. It is argued that they are positively homo-geneous, superadditive, nonnegative, nonconstant, and upper semicontinuous.Such criteria are called information functions. Information functions conformwith the Loewner ordering, in that they are matrix isotonic and matrix concave.The concept of polar information functions is discussed in detail, providing thebasis for the subsequent duality investigations. Finally, a result is proved listingthree sufficient conditions for the general design problem so that all optimalmoment matrices lie in the feasibility cone for the parameter system of interest.

5.1. POSITIVE HOMOGENEITY

An optimality criterion is a function tf> from the closed cone of nonnegativedefinite 5 x 5 matrices into the real line,

with properties that capture the idea of whether an information matrix islarge or small. A transformation from a high-dimensional matrix cone to theone-dimensional real line can retain only partial aspects and the question is,which. No criterion suits in all respects and pleases every mind.

An essential aspect of an optimality criterion is the ordering that it in-duces among information matrices. Relative to the criterion </> an informa-tion matrix C is at least as good as another information matrix D when<t>(C) > 4>(D)' With our understanding of information matrices it is essentialthat a reasonable criterion be isotonic relative to the Loewner ordering,

114

Page 148: Pukelsheim Optimal DoE

5.2. SUPERADDITIVITY AND CONCAVITY 115

compare Chapter 4. We use the same sign > to indicate the Loewner orderingamong symmetric matrices and the usual ordering of the real line.

A second property, similarly compelling, is concavity,

In other words, information cannot be increased by interpolation, otherwisethe situation <j> ((I - a)C + aD) < (1 - «)</>(C) + a<J>(D) will occur. Ratherthan carrying out the experiment belonging to (1 - a)C + aD, we achievemore information through interpolation of the two experiments associatedwith C and D. This is absurd.

A third property is positive homogeneity,

A criterion <f> that is positively homogeneous satisfies </> ((n/<r2)C) =(n/a2)(f)(C). Indeed, Section 3.5 has shown that the true information matrixis not C/c(M), but (n/cr2)CK(M). It is directly proportional to the numberof observations n, and inversely proportional to the model variance or2. Ifthe optimality criterion 0 is positively homogeneous, then we can omit thecommon factor n/a2 and concentrate on the matrix C = CK(M).

For a closer study, it is preferable to divide the properties of concavityand monotonicity into more primitive ones. A real-valued function </> on theclosed cone NND(s) is called super additive when

The following lemma tells us that, in the presence of positive homogeneity,superadditivity is just another view of concavity.

5.2. SUPERADDITIVITY AND CONCAVITY

Lemma. For every positively homogeneous function <f> : NND(s) —> R,the following two statements are equivalent:

a. (Superadditivity) <j> is superadditive.b. (Concavity) <f> is concave.

Proof. In both directions, we make use of positive homogeneity. Assum-ing (a), we get

Page 149: Pukelsheim Optimal DoE

116 CHAPTERS: REAL OPTIMALITY CRITERIA

Thus superadditivity implies concavity. Conversely, (b) entails supperadditiv-ity,

An analysis of the proof shows that strict superadditivity is the same asstrict concavity, in the following sense. The strict versions of these propertiescannot hold if D is positively proportional to C, denoted by D oc C. Becausethen we have D = 8C with 8 > 0, and positive homogeneity yields

Furthermore, we apply the strict versions only in cases where at least oneterm, C or D, is positive definite. Hence we call a function <f> strictly super-additive on PD(s) when

A function <f) is said to be strictly concave on PD(s) when

5.3. STRICT SUPERADDITrVITY AND STRICT CONCAVITY

Corollary. For every positively homogeneous function <f> : NND(s) —»IR,the following two statements are equivalent:

a. (Strict superadditivity) </> is strictly superadditive on PD(.s).b. (Strict concavity) <f> is strictly concave on PD(s).

Proof. The proof is a refinement of the proof of Lemma 5.2.

Next we show that, given homogeneity and concavity, monotonicity re-duces to nonnegativity. A function <f> on the closed cone NND(s) is said tobe nonnegative when

Page 150: Pukelsheim Optimal DoE

5.4. NONNEGATIVITY AND MONOTONICITY 117

it is called positive on PD(s) when

Notice that a positively homogeneous function <£ vanishes for the null matrix,<£(0) = 0, because of <£(0) = <f>(2 • 0) = 2<£(0). In particular, <f> is constantonly if it vanishes identically.

5.4. NONNEGATIVITY AND MONOTONICITY

Lemma. For every positively homogeneous and superadditive function<}> : NND(s) —>• 1R, the following three statements are equivalent:

a. (Nonnegativity) <f> is nonnegative.b. (Monotonicity) <£ is isotonic.c. (Positivity on PD(,s)) Either $ is nonnegative and (f> is positive on

PD(s), or else (f> is identically zero.

Proof. First we show that (a) implies (b). If <£ is nonnegative, then weapply superadditivity to obtain <f> (C) = <f> (C - D + D) > <f> (C - D) + <f> (D) ><f>(D), for all C > D > 0. Next we show that (b) implies (c). Here C > 0forces <f>(C) > <£(0) = 0, so that <£ is nonnegative. Because of homogene-ity, 0 is constant only if it vanishes identically. Otherwise there exists amatrix D > 0 with $(£>) > 0. We need to show that <f>(C) is positive forall matrices C e PD(s). Since by Lemma 1.9, the open cone PD(s) is theinterior of NND(s), there exists some e > 0 such that C — eD € NND(s),that is, C > eD. Monotonicity and homogeneity entail <£(C) > £<I>(D) > 0.This proves (c). The properties called for by (c) obviously encompass (a). LJ

Of course, we do not want to deal with the constant function <f> = 0.All other functions </> that are positively homogeneous, superadditive, andnonnegative are then positive on the open cone PD(.s), by part (c).

Often, although not always, we take $ to be standardized, $(/s) — 1- Ev-ery homogeneous function 0 can be standardized according to (l/<£(/s))<£,without changing the preordering that <£ induces among information matri-ces.

Every homogeneous function <f> on NND(s) vanishes at the null matrix.When it is positive otherwise, we say </> is positive,

Such functions <f> characterize the null matrix, in the sense that <£(C) = 0holds if and only if C = 0. A function <f> on NND(^) is called strictly isotonic

Page 151: Pukelsheim Optimal DoE

118 CHAPTERS: REAL OPTIMALITY CRITERIA

when

When this condition holds just for positive definite matrices, wesay that <f> is strictly isotonic on PD(.y).

5.5. POSITIVITY AND STRICT MONOTONICITY

Corollary. For every positively homogeneous and superadditive function<f> : NND(s) -> R, the following two statements are equivalent:

a. (Positivity) <f> is positive.b. (Strict monotonicity) <j> is strictly isotonic.

Moreover, if (f> is strictly superadditive on PD(s), then <f> is strictly isotonicon PD(s).

Proof. That (a) implies (b) follows as in the proof of Lemma 5.4. Con-versely, (b) with D = 0 covers (a) as a special case. The statement on thestrict versions is a direct consequence from the definitions.

We illustrate these properties using the trace as criterion function, 4>(C) =trace C. This function is linear even on the full space Sym(.s), hence it ispositively homogeneous and superadditive. Restricted to NND(s), it is alsostrictly isotonic and positive. It fails to be strictly superadditive. Standard-ization requires a transition to trace C/s. Being linear, this criterion is alsocontinuous on Sym (.$•)•

5.6. REAL UPPER SEMICONTINUITY

In general, our optimality criteria <p are defined on the closed cone NND(s)rather than on the linear space Sym(s). Therefore we replace continuity bysemicontinuity.

A real-valued function <f> on NND(s) is called upper semicontinuous whenthe upper level sets {(f> > a} — {C > 0 : $(C) > a} are closed, for all a GR. This definition conforms with the one for the matrix-valued informationmatrix mapping CK, in part (a) of Theorem 3.13. There we met an alternativesequential criterion (b) which here takes the form liiriw--^ <f>(Cm) = 4>(C),for all sequences (Cm)m>\ in NND(s) that converge to a limit C and thatsatisfy <(>(Cm) > </>(C) for all m > 1. This secures a "regular" behavior at theboundary, in the same sense as in part (c) of Theorem 3.13.

Page 152: Pukelsheim Optimal DoE

5.8. INFORMATION FUNCTIONS 119

5.7. SEMICONTINUITY AND REGULARIZATION

Lemma. For every isotonic function $ : NND(s) —> R, the followingthree statements are equivalent:

a. (Upper semicontinuity) The level sets

are closed, for all a e R.b. (Sequential semicontinuity criterion) For all sequences (Cm}m>\ in

NND(^) that converge to a limit C we have

c. (Regularization) For all C,D e NND(s), we have

Proof. This is a special case of the proof of Theorem 3.13, with s = 1and CK = <f>.

5.8. INFORMATION FUNCTIONS

Criteria that enjoy all the properties discussed so far are called informationfunctions.

DEFINITION. An information function <f> on NND(.s) is a function <f> :NND(s) —> R that is positively homogeneous, superadditive, nonnegative,nonconstant, and upper semicontinuous.

The most prominent information functions are the matrix means <j>p, forp € [-00; 1], to be discussed in detail in Chapter 6. They comprise the classicalD-, A-, E-, and T-criteria as special cases.

The list of defining properties of information functions can be rearranged,in view of the preceding lemmas and in view of the general discussion inSection 5.1, by requiring that an information function be isotonic, concave,and positively homogeneous, as well as enjoying the trivial property of beingnonconstant and the more technical property of being upper semicontinuous.It is our experience that the properties as listed in the definition are moreconvenient to work with.

Information functions enjoy many pleasant properties to which we willturn in the sequel. We characterize an information function by its unit level

Page 153: Pukelsheim Optimal DoE

120 CHAPTERS: REAL OPTIMALITY CRITERIA

set, thus visualizing it geometrically (Section 5.10). We establish that the setof all information functions is closed under appropriate functional operations,providing some kind of reassurance that we have picked a reasonable classof criteria (Section 5.11). We introduce polar information functions (Sec-tion 5.12) which provide the basis for the duality discussion of the optimaldesign problem. We study the composition of an information function withthe information matrix mapping (Section 5.14), serving as the objective func-tion for the optimal design problem (Section 5.15).

5.9. UNIT LEVEL SETS

There exists a bewildering multitude of information functions. We depict thismultitude more visibly by associating with each information function <f> itsunit level set

The following Theorem 5.10 singles out the characteristic properties of suchsets.

In general, we say that a closed convex subset C C NND(s) is boundedaway from the origin when it does not contain the null matrix, and that itrecedes in all directions of NND(s) when

The latter property means that C + NND(s) C C for all C e C, that is, if thecone NND(^) is translated so that its tip comes to lie in C € C then all ofthe translate is included in the set C. Exhibit 5.1 shows some unit level setsC = {®p > 1} in Rj.

Given a unit level set C C NND(^), we reconstruct the corresponding infor-mation function as follows. The reconstruction formula for positive definitematrices is

Thus <f>(C) is the scale factor that pulls the set C towards the null matrix orpushes it away to infinity so that C comes to lie on its boundary. However,for rank deficient matrices C, it may happen that none of the sets SC with8 > 0 contains the matrix C whence the supremum in (1) is over the emptyset and falls down to -oo. We avoid this pitfall by the general definition

For S > 0, nothing has changed as compared to (1) since then (SC) +

Page 154: Pukelsheim Optimal DoE

5.9. UNIT LEVEL SETS 121

EXHIBIT 5.1 Unit level sets. Unit contour lines of the vector means 3>p of Section 6.6 inIR+, for p = -co, —1,0,1/2,1. The corresponding unit level sets are receding in all directionsof R+, as indicated by the dashed lines.

NND(s) = S(C + (l/S)NND(s)) - SC. But for 8 = 0, condition (2) turnsinto C 6 (OC) + NND(s) = NND(s), and holds for every matrix C > 0.Hence the supremum in (2) is always nonnegative.

The proof of the correspondence between information functions and unitlevel sets is tedious though not difficult. We ran into similar considerationswhile discussing the Elfving norm p(c) in Section 2.12, and the scale factorp2(c) in Section 4.10. The passage from functions to sets and back to functionsmust appear at some point in the development, either implicitly or explicitly.We prefer to make it explicit, now.

Page 155: Pukelsheim Optimal DoE

122 CHAPTERS: REAL OPTIMALITY CRITERIA

5.10. FUNCTION-SET CORRESPONDENCE

Theorem. The relations

define a one-to-one correspondence between the information functions <f>on NND(s) and the nonempty closed convex subsets C of NND(s) that arebounded away from the origin and recede in all directions of NND(s).

Proof. We denote by 4> the set of all information functions, and by Fthe collection of subsets C of NND(s) which are nonempty, closed, convex,bounded away from the origin, and recede in all directions of NND(j). Thedefining properties of information functions </> e 3> parallel the propertiesenjoyed by the sets C e F.

An information function </> e 4> is

o finitei positively homogeneousii superadditiveiii nonnegative

iv nonconstantv upper semicontinuous

A

0IIIIII

IVV

unit level set C e F is

a subset of NND(s)bounded away from the originconvexreceding in all directions ofNND(s)nonemptyclosed

In the first part of the proof, we assume that an information function <f>is given and take the set C to be defined through the first equation in thetheorem. We establish for C the properties 0-V, and verify the second equalityin the theorem. This part demonstrates that the mapping $ i - > { < £ > l } o n < f >has its range included in F, and is injective.

0. Evidently C is a subset of NND(s).1. Because of homogeneity 0(0) is zero, and the null matrix is not a mem-

ber of C.II. Concavity of <£ implies

for all C, D 6 C and a e (0; 1). Hence (1 - a)C + aD is a member of C, andthe set C is convex.

III. Fix a matrix C e C. Then superadditivity and nonnegativity of <£ entail for all D > 0. Thus C + NND(s) is

included in C, or in other words, C recedes in all directions of NND(s).IV. Because of (f>(Is) > 0, we have V. Closedness of C is an immediate consequence of the upper semiconti-

nuity of <f>.

Page 156: Pukelsheim Optimal DoE

5.10. FUNCTION-SET CORRESPONDENCE 123

In order to verify the second equality in the theorem we fix C € NND(s)and set a = sup{8 > 0 : C € (8C) + NND(^)}. It is not hard to show that<I>(C) = 0 if and only if a = 0. Otherwise <f>(C) is positive, as is ex. From

we learn that <f>(C) < a. Conversely we know for S < a that (1/S)C e C. Butwe have just seen that C is closed. Letting 5 tend to a, we get (l/a)C G C,and 0((l/a)C) > 1. This yields <j>(C) > a. In summary, e£(C) = a and thefirst part of the proof is complete.

In the second part of the proof we assume that a set C with the properties0-V is given, and take the function <£ to be defined through the secondformula in the theorem. We show that $ satisfies properties o-v, and verifythe first equality in the theorem. This part demonstrates that the mapping<f> •-» {$ > 1} from 4> to F is surjective.

0. We need to show that <£ is finite. Otherwise we have <f>(C) — oo forsome matrix C. This entails C e 8C for all 8 > 0. Closedness of C yields0 = limg^oo C/8 G C, contrary to the assumption that C is bounded awayfrom the origin.

1. Next comes positive homogeneity. For C > 0 and 8 > 0, we obtain

iii. Since 8 = 0 is in the set over which the defining supremum is formed,we get 4>(C) >0.

ii. In order to establish superadditivity of $, we distinguish three cases.In case <f>(C) = 0 = <£(/)), nonnegativity (iii) yields <f>(D). In case <j>(C) > 0 = <HD), we choose any 5 > 0 such that We obtain

and <f>(C + D) > 8. A passage to the supremum givesIn case </>(C) > 0 and <£(/)) > 0, we choose any y, 8 > 0 such

that C 6 yC and D € 8C. Then convexity of C yields

Thus we have and therefore

Page 157: Pukelsheim Optimal DoE

124 CHAPTERS: REAL OPTIMALITY CRITERIA

iv. Since C is nonempty there exists a matrix C € C. For any such matrixwe have <f>(C] > 1. By Lemma 5.4, <£ is nonconstant.

v. In case a < 0, the level set is closed. Thus upper semicontinuity of <f> follows from closedness of C viaequality of the sets

To prove the direct inclusion, we take any matrix C e {<£ > a} and choosenumbers Sm > 0 converging to <f>(C) such that C e (8mC) + NND(s) = 8mCfor all m > 1. The matrices Dm = C/8m are members of C. This sequenceis bounded, because of \\Dm\\ — \\C\\/8m —> \\C\\/<f>(C). Along a convergentsubsequence, we obtainclosedness. This yields

by

whence Converselv. if C G aC then the definition of <6 immediatelventails <; This proves for all

In particular, a = 1 yields the first formula in the theorem, Altogether the mapping <j> H-» {<£ > 1} from <£ onto F is bijective, withinverse mapping given by the second formula in the theorem.

The cone NND(^) includes plenty of nonempty closed convex subsets thatare bounded away from the origin and recede in all directions of NND(s), andso there are plenty of information functions. The correspondence betweeninformation functions and their unit level sets matches the well-known cor-respondence between norms and their unit balls. However, the geometricorientation is inverted. Unit balls of norms include the origin and excludeinfinity, while unit level sets of information functions exclude the origin andinclude infinity, in a somewhat loose terminology. The difference in orienta-tion also becomes manifest when we introduce polar information functions inSection 5.12. First we overview some functional operations for constructingnew information functions from old ones.

5.11. FUNCTIONAL OPERATIONS

New information functions can be constructed from given ones by elemen-tary operations. We show that the class of all information functions is closedunder formation of nonnegative combinations and least upper bounds of fi-

Page 158: Pukelsheim Optimal DoE

5.12. POLAR INFORMATION FUNCTIONS AND POLAR NORMS 125

nite families of information functions, and under pointwise infima of arbitraryfamilies.

Given a finite family of information functions, fa,..., <f>m, every nonneg-ative combination produces a new information function,

if at least one of the coefficients 5, > 0 is positive. In particular, sums andaverages of finitely many information functions are information functions.The pointwise minimum is also an information function,

Moreover, the pointwise infimum of a family </>/ with / ranging over an arbi-trary index set J is an information function,

unless it degenerates to the constant zero. Upper semicontinuity follows sincethe level sets (inf/ej <fo > a} = Cli&iifa > «} are intersections of closed sets,and hence closed. This applies to the least upper bound of a finite family

where <i> denotes the class of all information functions. The set over whichthe infimum is sought is nonempty, containing for instance the sum ^,-<»i fa-ll is also bounded from below, for instance by fa. Therefore the infimumcannot degenerate, and lub,<m <fc is an information function.

These structural properties of information functions suggest that our def-inition of information functions is not only statistically reasonable, but alsomathematically sound. The second half of Chapter 11, from Section 11.10 on-wards, makes extensive use of compositions of information functions. How-ever, the functional operation of greatest import is to come next: polarity.

5.12. POLAR INFORMATION FUNCTIONS AND POLAR NORMS

Polarity is a special case of a duality relationship, and as such based on thescalar product of the underlying linear space, (C,D) = trace CD for allC,£> G Sym(.s). The polar function <f>°° of a given information function </> is

Page 159: Pukelsheim Optimal DoE

126 CHAPTERS: REAL OPTIMALITY CRITERIA

best thought of as the largest function satisfying the (concave version of thegeneralized) Holder inequality,

For the definition, it suffices that <£ is defined and positive on the opencone PD(s).

DEFINITION. For a function <f> : PD(s) —> (0;oo), the polar function<f>°° : NND(s) -»[0; oo) is defined by

That the function </>°° so defined satisfies the Holder inequality is evidentfrom the definition for C > 0. For C > 0, it follows through regularizationprovided <f> is isotonic, by Lemma 5.7.

In contrast, a real-valued function <£ on the space Sym(.s) is called a normwhen </> is

absolutely homogeneous: for all Sym(s)

subadditive: for all ( Sym(s), and

positive: for all

A norm <£ has polar function <£° defined by

This leads to the (convex version of the generalized) Holder inequality,

Alternatively the polar function <j>° is the smallest function to satisfy thisinequality.

The principal distinction between a norm and an information function is, ofcourse, that the first is convex and the second is concave. Another difference,more subtle but of no less importance, is that norms are defined everywherein the underlying space, while information functions have a proper subset fortheir domain of definition.

One consequence concerns continuity. A norm is always continuous, be-ing a convex function on a linear space which is everywhere finite. An in-formation function is required, by definition, to be upper semicontinuous.

Page 160: Pukelsheim Optimal DoE

5.13. POLARITY THEOREM 127

Semicontinuity is not an automatic consequence of the other four propertiesthat constitute an information function. On the other hand, an informationfunction is necessarily isotonic, by Lemma 5.4, while a norm is not.

Another feature emerges from the function-set correspondence of Sec-tion 5.10. An information function </> is characterized by its unit level set{<£ > 1}. A norm <£ corresponds to its unit ball {<£ < 1}. The distinct ori-entation towards infinity or towards the origin determines which version ofthe Holder inequality is appropriate, as well as the choice of an infimumor a supremum in the definitions of the polars. In order that our notationindicates this orientation, we have chosen to denote polars of concave andconvex functions by a superscript oo or 0, respectively. Some matrix norms<£ have the property that

In this case, the similarity between the polars of information functions andthe polars of norms becomes even more apparent.

The next theorem transfers the well-known polarity relationship fromnorms to information functions. Polars of information functions are them-selves information functions. And polarity (nomen est omen} is an idempo-tent operation. The second polar recovers the original information function.

5.13. POLARITY THEOREM

Theorem. For every function <£ : NND(s) —» R that is positively homo-geneous, positive on PD(s), and upper semicontinuous, the polar function <f>°°is an information function on NND(s). For every information function 4> onNND(s) the polar function of </>°° recovers </>,

Proof. The definition of the polar function is <£°° — mfoo'/'c* wherethe functions ^c(I>) - (C,D)/<f>(C) are information functions on NND(^).Hence <t>°° is positively homogeneous, superadditive, nonnegative, and uppersemicontinuous, as mentioned in Section 5.11. It remains to show that <£°° isnonconstant. We exploit positive homogeneity to obtain the estimate

where the set T> = {C > 0 : trace C — 1} is compact. Owing to uppersemicontinuity, the supremum of </> over T> is attained and finite. Thereforethe value <f>°°(Is) is positive, whence the function 0°° is nonconstant.

Page 161: Pukelsheim Optimal DoE

128 CHAPTERS: REAL OPTIMALITY CRITERIA

In the second part of the proof, we take <f> to be an information functionon NND(s), and derive the relation f>. The argument is based onthe unit level sets associated with

It suffices to show that hen the function-set correspondenceof Theorem 5.10 tells us that <£ and ( coincide.

We claim that the two sets C°° and C satisfy the relation

For the direct inclusion, take a matrix D € C°°, that is, D > 0 and 4>°°(D) > 1.For C <E C, the Holder inequality then yields (C, D) > <t>(C)<t>°°(D) > 1. Theconverse inclusion holds since every matrix D > 0 satisfying (C, D) > I forall C e C fulfills <f>°°(D) - infc>0{C/<HC),£>) > infc>0:<MC)>i{C,D) > 1.Thus (1) is established. Applying formula ((1) to

Formulae (1) and (2) entail the direct inclusion The converse inclusion is proved by showing that the complement of C

is included in the complement of C0000, based on a separating hyperplaneargument. We choose a matrix E e NND(s) \ C. Then there exists a matrixD e Sym(s) such that the linear form {-,£>) strongly separates the matrix Eand the set C. That is, for some y e R, we have

Since the set C recedes in all directions of NND(.s), the inequality shows inparticular that for all matrices C e C and A > 0 we have {£, D) < (C+8A, D)for all 8 > 0. This forces (A,D) > 0 for all A > 0, whence Lemma 1.8necessitates D > 0. By the same token, 0 < (E,D) < y. Upon settingD = D/v the. sfrnna sp.naration of F. and C takes the form

Now (1) gives D € C°°, whence (2) yields Thereforeand the prooi is complete.

and we get

and

Page 162: Pukelsheim Optimal DoE

5.14. COMPOSITIONS WITH THE INFORMATION MATRIX MAPPING 129

Thus a function $ on NND(s) that is positively homogeneous, positiveon PD(s), and upper semicontinous is an information function if and onlyif it coincides with its second polar. This suggests a method of checkingwhether a given candidate function <f> is an information function, namely,by verifying <f> = 00000. The method is called quasi-linearization, since itamounts to representing <b as an infimum over the family of linear functions

This looks like being a rather roundabout way to identify an informationfunction. However, for our purposes we need to find the polar function any-way. Hence, the quasi-linearization method of computing polar functions andto obtain, as a side product, the information function properties is as efficientas can be.

Another instance of quasi-linearization is the definition of informationmatrices in Section 3.2, CK(A) = minL€Kix*. LK=is LAL1. The functional prop-erties in Theorem 3.13 follow immediately from the definition, underliningthe power of the quasi-linearization method.

The theory centers around moment matrices A e NND(/c). Let K'6 be aparameter system of interest with a coefficient matrix K that has full columnrank s. We wish to study A H-> <£ o CK(A), the composition of the infor-mation matrix mapping CK : NND(/c) —> NND(s) of Section 3.13 with aninformation function <£ on NND(s) of Section 5.8. This composition turnsout to be an information function on the cone NND(fc) of k x k matrices.

5.14. COMPOSITIONS WITH THE INFORMATION MATRIXMAPPING

Theorem. Let the k x s coefficient matrix K have full column rank s.For every information function $ on NND(s), the composition with the in-formation matrix mapping, <j> o CK, is an information function on NND(fc).Its polar is given by

Proof. In the first part of the proof, we verify the properties of an infor-mation function as enumerated in the proof of Theorem 5.10.

0. Clearly <f> o CK is finite.1. Since CK and <f> are positively homogeneous, so is the composition <f> o

CK.ii. Superadditivity of CK and $, and monotonicity of <f> imply superaddi-

tivity of <f> o CK-

Page 163: Pukelsheim Optimal DoE

130 CHAPTERS: REAL OPTIMALITY CRITERIA

iii. Nonnegativity of CK and of <f> entail nonnegativity of $ o CK.iv. Positive definiteness of CK(Ik), from Theorem 3.15, and positivity of <f>

on PD(s), from Lemma 5.4, necessitate <£ o €#(4) > 0. Hence <f> o CK isnonconstant.

v. In order to establish upper semicontinuity of <f> o CK we use regular-ization,

From Theorem 3.13. we know that the matrices satisfyand converge Monotonicity of <f> yields

Therefore part (b) of Lemma 5.7 ascertains the convergence of

The second part of the proof relies on a useful representation of the polarinformation function <f>°° that is based on the unit level set of <f>,

To see this, we write the definition iMaking use of regularization, we obtain the inequality chain

Hence equality holds throughout this chain, thus proving (1).The unit level set of the composition <£ o CK is

Indeed, for with we get withConversely, if for. there exists a matrix

such that then monotomcitv imoliesFor all and we get

for all

Page 164: Pukelsheim Optimal DoE

5.15. THE GENERAL DESIGN PROBLEM 131

To see this, we estimateKCK' (see Section 3.21). Monotonicitv leads to the lower bound (A.B)(KCK'B}. The lower bound is attained at KCK' Now

proves (3).Finally we apply (1) to <f> o CK and then to <f> to obtain from (2) and (3),

for all B > 0,

A first use of the polar function of the composition <f> o CK is made inLemma 5.16 to discuss whether optimal moment matrices are necessarilyfeasible. First we define the design problem in its full generality.

5.15. THE GENERAL DESIGN PROBLEM

Let K'0 be a parameter subsystem with a coefficient matrix K of full columnrank s. We recall the grand assumption of Section 4.1 that M is a set ofcompeting moment matrices that intersects the feasibility cone A(K). Givenan information function <f> on NND(s) the general design problem then reads

This calls for maximizing information as measured by the information func-tion $, in the set M of competing moment matrices. The optimal value ofthis problem is, by definition,

A moment matrix M e M is said to be formally <j>-optimal for K'B in Mwhen </> (CK(M)) attains the optimal value v (<£). If, in addition, the matrix Mlies in the feasibility cone A(K), then M is called <f>-optimal for K'6 in M.

The optimality properties of designs £ are determined by their momentmatrices M(£). Given a subclass H, a design £ e H is called (f>-optimalfor K'6in H when its moment matrix M(£) is </>-optimal for K'6 in M(E).

Page 165: Pukelsheim Optimal DoE

132 CHAPTERS: REAL OPTIMALITY CRITERIA

However, an optimal design is not an end in itself, but an aid to identi-fying efficient practical designs. The appropriate notion of efficiency is thefollowing.

DEFINITION. The $-efficiency of a design £ e H is defined by

It is a number between 0 and 1, and gives the extent (often quoted in per-cent) to which the design £ exhausts the maximum information v(<f>) for K'6in M.

A formally optimal moment matrix M that fails to be feasible for K'Bis statistically useless, even though it solves a well-defined mathematical op-timization problem. However, pathological instances do occur wherein for-mally optimal moment matrices are not feasible! An example is given inSection 6.5. The appropriate tool to check feasibility of M is given in Sec-tion 3.15. The information matrix CK(M) must have rank5.

The following lemma singles out three conditions under which every for-mally optimal matrix is feasible. When an information function is zero forall singular nonnegative definite matrices, we briefly say that it vanishes forsingular matrices.

5.16. FEASIBILITY OF FORMALLY OPTIMAL MOMENTMATRICES

Lemma. If the set M of competing moment matrices is compact, thenthere exists a moment matrix M G M that is formally <£-optimal for K' 0in M, and the optimal value v(<f>) is positive.

In order that every formally <£-optimal moment matrix for K'6 in M liesin the feasibility cone A(K), and thus is <£-optimal for K'Q in M, any oneof the following conditions is sufficient:

a. (Condition on M) The set M is included in the feasibility cone A(K).b. (Condition on </>) The information function <f> vanishes for singular

matrices.c. (Condition on<j>°°) The polar information function <£°° vanishes for sin-

gular matrices and is strictly isotonic on PD(s), and for every formallyoptimal moment matrix M e M there exists a matrix D e NND(*) thatsolves the polarity equation

Page 166: Pukelsheim Optimal DoE

5.17. SCALAR OPTIMALITY, REVISITED 133

Proof. By Theorem 5.14, the composition <f> o CK is upper semicontinu-ous, and thus attains its supremum over the compact set M. Hence a formallyoptimal matrix for the design problem exists. Because of our grand assump-tion in Section 4.1, the intersection MnA(K) contains at least one matrix B,say. Its information matrix C^(B) is positive definite and has a positive in-formation value <£ (CK(B)}, by Theorem 3.15 and Lemma 5.4. Therefore theoptimal value is positive, v(<f>) > <fr(CK(B)) > 0.

Under condition (a), all competing moment matrices are members ofA(K), including those that are formally optimal. Under condition (b), thecriterion </> vanishes for singular information matrices CK(A). As v(</>) >0, any formally optimal moment matrix M has a nonsingular informationmatrix CK(M). Then M must lie in A(K), by Theorem 3.15.

Finally we turn to condition (c). Let z e Rs be a vector such thatz'CK(M)z = 0. We show that z vanishes. Since <f>°° is zero for singularmatrices, <(>(CK(M)) <f>°°(D) = 1 forces D to be positive definite. If z ^ 0,we obtain the contradiction

as follows from the Holder inequality, the property z'CK(M)z = 0, the po-larity equation, and strict monotonicity of <f>°° on PD(.$i). Hence z = 0. Thisentails positive definiteness of C#(M), and feasibility of M.

In the Duality Theorem 7.12, we find that for a formally optimal momentmatrix M e M., there always exists a matrix D € NND(s) satisfying thepolarity equation of part (c). Theorem 7.13 will cast the present lemma intoits final form, just demanding in part (c) that the polar function <f>°° vanishesfor singular matrices and is strictly isotonic for positive definite matrices.

5.17. SCALAR OPTIMALITY, REVISITED

Loewner optimality and information functions come to bear only if the co-efficient matrix K has a rank larger than 1. We briefly digress to see how theseconcepts are simplified if the parameter system of interest is one-dimensional.

For a scalar system c'O, the information "matrix" mapping A H-» CC(A) isactually real-valued. It is positive on the feasibility cone A(c) where CC(A) —(c'A~cYl > 0, and zero outside. The Loewner ordering among informationnumbers reverses the ordering among the variances c'A'c. Hence Loewneroptimality for c'O is equivalent to the variance optimality criterion of Sec-tion 2.7.

Page 167: Pukelsheim Optimal DoE

134 CHAPTERS: REAL OPTIMALITY CRITERIA

The concept of information functions becomes trivial, for scalar optimality.They are functions <£ on NND(l) = [0;oo) satisfying 0(y) = <f>(l)y forall y > 0, by homogeneity. Thus all they achieve is to contribute a scalingby the constant <£(!) > 0. The composition <f> o Cc orders any pair ofmoment matrices in the same way as does Cc alone. Therefore Cc is the onlyinformation function on the cone NND(fc) that is of interest. It is standardizedif and only if c has norm 1. The polar function of Cc is Q°(jB) = c'Bc. Thisfollows from Theorem 5.14, since the identity mapping $(7) = y has polar0°°(S) = infy>o yS/y = 8. The function B i-> c'Bc is the criterion functionof the dual problem of Section 2.11.

In summary, information functions play a role only if the dimensionality sof the parameter subsystem of interest is larger than one, s > 1. The mostprominent information functions are matrix means, to be discussed next.

EXERCISES

5.1 Show that <J> > & if and only if {</> > 1} D (<A > 1), for all informationfunctions

5.2 In Section 5.11, what are the unit level sets of

5.3 Discuss the behavior of the sets SC as S tends to zero or infinity, with Cbeing (i) the unit level set {<f> > 1} of an information function $ asin Section 5.9, (ii) the penumbra M — NND(fc) of a set of competingmoment matrices M as in Section 4.9, (iii) the Elfving set conv(A? u(—«¥)) of a regression range X as in Section 2.9.

5.4 Show that a linear function <f> : Sym(.s) —* IR is an information functionon NND(s) if and only if for some D e NND(s) and for all C e Sym($)one has <f>(C) = trace CD.

5.5 Is forforan information function?

5.6 Show that neither of the matrix normsare isotonic on NND(s) [Marshall and olkin (1969), p. 170].

5.7 Which properties must a function <£ on NND(s) have so that </>(|C|) isa matrix norm on Sym(.s), where |C| = C+ + C_ is the matrix modulusof Section 6.7?

Page 168: Pukelsheim Optimal DoE

C H A P T E R 6

Matrix Means

The classical criteria are introduced: the determinant criterion, the average-variance criterion, the smallest-eigenvalue criterion, and the trace criterion.They are just four particular cases of the matrix means <f>p, with parameterp e [—oo;l]. The matrix mean of a given matrix is the same as the vectormean of the eigenvalue vector of the matrix. This and a majorization inequalityshow the matrix mean (f>p to be an information function. Its polar is propor-tional to the matrix mean <$>q where the numbers p and q are conjugate in theinterval [—oo; 1].

6.1. CLASSICAL OPTIMALITY CRITERIA

The ultimate purpose of any optimality criterion is to measure "largeness" ofa nonnegative definite 5 x 5 matrix C. In the preceding chapter, we studiedthe implications of general principles that a reasonable criterion must meet.We now list specific criteria which submit themselves to these principles,and which enjoy a great popularity in practice. The most prominent criteriaare

the determinant criterion,the average-variance criterion,definite)the smallest-eigenvalue criterion, andthe trace criterion,

Each of these criteria reflects particular statistical aspects, to be discussedin Section 6.2 to Section 6.5. Furthermore, they form but four particularmembers of the one-dimensional family of matrix means <f>p, as defined inSection 6.7. In the remainder of the chapter, we convince ourselves that thematrix mean <j>p qualifies as an information function if the parameter p liesin the interval [-00; 1].

135

positive

Page 169: Pukelsheim Optimal DoE

136 CHAPTER 6: MATRIX MEANS

6.2. D-CRITERION

The determinant criterion <fo(C) differs from the determinant del C by tak-ing the 5 th root, whence both functions induce the same preordering amonginformation matrices. From a practical point of view, one may therefore dis-pense with the 5 th root and consider the determinant directly. However, thedeterminant is positively homogeneous of degree s, rather than 1. For com-paring different criteria, and for applying the theory of information functions,the version <fo(C) = (det C)1/5 is appropriate.

Maximizing the determinant of information matrices is the same as mini-mizing the determinant of dispersion matrices, because of the formula

Indeed, in Section 3.5 the inverse C l of an information matrix was identifiedto be the standardized dispersion matrix of the optimal estimator for the pa-rameter system of interest. Its determinant is called the generalized variance,and is a familiar way in multivariate analysis to measure the size of a disper-sion matrix. This is the origin of the great popularity that the determinantcriterion enjoys in applications.

In a linear model with normality assumption, the optimal estimator K'0for an estimable parameter system K'6 has distribution N^/fl.(ff2/n)C-i, withC = (K'M'KY1 and M = (\/ri)X'X. It turns out that the confidence ellip-soid for K'0 has volume inversely proportional to (det C)1/2. Hence a largevalue of det C secures a small volume of the confidence ellipsoid. This isalso true for the ellipsoid of concentration which, by definition, is such thaton it the uniform distribution has the same mean vector K'6 and dispersionmatrix (ar2/ri)C~l as has K'0.

For testing the linear hypothesis K'0 — 0, a uniform comparison of powerleads to the Loewner ordering, as expounded in Section 3.7. Instead we mayevaluate the Gaussian curvature of the power function, to find a design sothat the F-test has good power uniformly over all local alternatives close tothe hypothesis. Maximization of the Gaussian curvature again amounts tomaximizing det C.

Another pleasing property is based on the formula det(H'CH) =(det //2)(det C), with a nonsingular 5 x 5 matrix H. Suppose the parametersystem K'0 is reparametrized according to H'K'6. This is a special case ofiterated information matrices, for which Theorem 3.19 provides the identities

Thus the determinant assigns proportional values to Cx(A) and CKu(A), and

Page 170: Pukelsheim Optimal DoE

6.4. E-CRITERION 137

the two function induced orderings of information matrices are identical. Inother words, the determinant induced ordering is invariant under reparamet-rization. It can be shown that the determinant is the only criterion for whichthe function induced ordering has this invariance property.

Yet another invariance property pertains to the determinant function it-self, rather than to its induced ordering. The criterion is invariant underreparametrizations with matrices H that fulfill det H = ±1, since then wehave det CKH(^) = det CK(A), We verify in Section 13.8 that this invarianceproperty is characteristic for the determinant criterion.

6.3. A-CRITERION

Invariance under reparametrization loses its appeal if the parameters of in-terest have a definite physical meaning. Then the average-variance criterionprovides a reasonable alternative. If the coefficient matrix is partitioned intoits columns, K = (c\,... ,cs), then the inverse l/<£_i can be represented as

This is the average of the standardized variances of the optimal estimatorsfor the scalar parameter systems c[6,...,c'sd formed from the columns of K.

From the point of view of computational complexity, the criterion </>_! isparticularly simple to evaluate since it only requires the computation of the sdiagonal entries of the dispersion matrix K'A~K. Again we can pass backand forth between the information point of view and the dispersion point ofview. Maximizing the average-variance criterion among information matricesis the same as minimizing the average of the variances given above.

6.4. E-CRITERION

The criterion $-00, evaluation of the smallest eigenvalue, also gains in under-standing by a passage to variances. It is the same as minimizing the largesteigenvalue of the dispersion matrix,

Minimizing this expression guards against the worst possible variance amongall one-dimensional subsystems z'K'O, with a vector z of norm 1. In termsof variance, it is a minimax approach, in terms of information a maximinapproach. This criterion plays a special role in the admissibility investigationsof Section 10.9.

Page 171: Pukelsheim Optimal DoE

138 CHAPTER 6: MATRIX MEANS

The eigenvalue criterion <£_oo is one extreme member of the matrix meanfamily <£p, corresponding to the parameter

6.5. T-CRITERION

The other extreme member of the <(>p family is the trace criterion <j>\. Byitself the trace criterion is rather meaningless. We have made a point inSection 5.1 that a criterion ought to be concave so that information cannotbe increased by interpolation. The trace criterion is linear, and this is so weakthat interpolation becomes legitimate. Yet trace optimality has its place in thetheory, mostly accompanied by further conditions that prevent it from goingastray. An example is Kiefer optimality of balanced incomplete block designs(see Section 14.9).

The trace criterion is useless if the regression vectors x e X have a constantsquared length c, say. Then the moment matrix Af (£) of any design £ € Hsatisfies

whence the criterion <fo is constant. For instance, in the trigonometric fitmodel of Section 2,22, it assigns the value d + 1 to all moment matrices,

Similarly <fo is constant in the two-way classification model of Section 1.5,since

In these situations the criterion <fo provides no distinction whatsoever.The trace criterion also exemplifies the pathologies discussed in

Section 5.15, that a formally optimal moment matrix may fail to be feasi-ble. As an example, we take the parabola fit model of Section 1.6, with allthree parameters being of interest. We assume that the experimental domainis the symmetric unit interval T = [—!;!]. The <fo information of a design ron T is

The moments /u; = $<\.\\ t> dr for j = 2,4 attain the maximum value 1 if andonly if the design is concentrated on the points ±1. Thus every <ft-optimal

Page 172: Pukelsheim Optimal DoE

6.6. VECTOR MEANS 139

design T has at most two support points, ±1, whence a formally optimalmoment matrix has rank at most equal to 2. No such moment matrix can befeasible for a three-dimensional parameter system.

The weaknesses of the trace criterion are an exception in the matrix meanfamily <£p, with p e [-00; 1]. Theorem 6.13 shows that the other matrix meansare concave without being linear. Furthermore they fulfill at least one of theconditions (b) or (c) of Lemma 5.16 for every formally optimal momentmatrix to be feasible (see Theorem 7.13).

6.6. VECTOR MEANS

Before turning to matrix means, we review the vector means 4>p on thespace Rs. The nonnegative orthant W+ = [0;oo)* is a closed convex conein R5. Its interior is formed by those vectors A e Rs that are positive, A > 0.It is convenient to (1) define the means <J>P for positive vectors, and (2) extendthe definition to the closed cone Us

+ by continuity, and (3) cover all of thespace Us by a modulus reduction.

For positive vectors, A = ( A i , . . . , A s) ' > 0, the vector mean 4>p is definedby

For vectors A e Us+ with at least one component 0, continuous extension

yields

For arbitrary vectors, A e Rs, we finally define

The definition of <J>P extends from positive vectors A > 0 to all vectorsA £ RJ in just the same way for both p > 1 and for p < 1. It is not thedefinition, but the functional properties that lead to a striking distinction.For p > 1, the vector mean 4>p is convex on all of the space R*; for p < 1, itis concave but only if restricted to the cone IR+. We find it instructive to followup this distinction, for the vector means as well as for the matrix means, even

Page 173: Pukelsheim Optimal DoE

140 CHAPTER 6: MATRIX MEANS

though for the purposes of optimal designs it suffices to investigate the meansof order p e [-00; 1], only.

If A = als is positively proportional to the unity vector ls then themeans <&p do not depend on p, that is, 4>p(a75) = a for all a > 0. Theyare standardized in such way that <t>PC/5) = 1. For p ^ ±00, the dependenceof <£p on a single component of A > 0 is strictly isotonic provided the othercomponents are fixed.

With the full argument vector A held fixed, the dependence on the parame-ter/? € [—00; oo] is continuous. Verification is straightforward for p tending to±00. For/? tending to 0 it follows by applying the 1'Hospital rule to log ̂ (A).If the vector A has at least two distinct components then the means ^(A)are strictly increasing in /?.

The most prominent members of this family are the arithmetic mean 4>!,the geometric mean <J>0, and the harmonic mean <!>_!. Our notation suggeststhat they correspond to the trace criterion <fo, the determinant criterion <fo,and the average-variance criterion <£_i. The precise relationship is as follows.

6.7. MATRIX MEANS

Again we find it instructive to define the matrix means <j>p for every parameterp e [-00; oo], on all of the space Sym^). Later, we contrast the convexbehavior for /? > 1 with the concave behavior for p < 1. For p > 1, thematrix mean <j>p is a norm on the space Sym(.s); for p < 1, it is an informationfunction on the cone NND(s).

For a matrix C e Sym(5i), we let A(C) = (Ai , . . . , A,)' be the vector consist-ing of the eigenvalues A; of C, in no particular order but repeated accordingto their multiplicities. The matrix mean <f>p is defined through the vectormean <£>p,

Since the vector means 3>p are invariant under permutations of their argu-ment vector A, the order in which A(C) assembles the eigenvalues of C doesnot matter. Hence <f>p(C) is well defined.

An alternative representation of <f>p(C) avoids the explicit use of eigen-values, but instead uses real powers of nonnegative definite matrices. Thisrepresentation enters into the General Equivalence Theorem 7.14 for <f>p-optimality. There the conditions are stated in terms of not eigenvalues, butmatrices and powers of matrices. To this end let us review the definition ofthe powers Cp for arbitrary real parameters n nrnvide.H C is nnsitive defi-nite. Using an eigenvalue decompositiondefinition is

the

Page 174: Pukelsheim Optimal DoE

6.7. MATRIX MEANS 141

For integer values p, the meaning of Cp is the usual one. In general, weobtain trace Cp trace

For positive definite matrices, C 6 PD(s), the matrix mean is repre-sented by

For singular nonnegative definite matrices, C e NND(s) with rank C < s, wehave

For arbitrary symmetric matrices, C e Sym(s), we finally get

where 1C I, the modulus of C, is denned as follows. With eigenvalue decom-position the positive part C+ and the negative part C_ aregiven by

They are nonnegative definite matrices, and fulfill C = C+ - C_. Themodulus of C is defined bv |C| = C+ + C_. It is nonneeative definite andsatisfies thus substan-tiating (3)

It is an immediate consequence of the definition that the matrix means q>p

on the space Sym(.s) are absolutely homogeneous, nonnegative (even positiveif p 6 (0;ooj), standardized, and continuous. This provides all the propertiesthat constitute a norm on the space Sym(^), or an information function onthe cone NND(s), except for subadditivity or superadditivity. This is wherethe domain of definition of (f>p must be restricted for sub- and superadditivityto hold true, from Sym(s) for p > 1, to NND(s) for p < 1.

We base our derivation on the well-known polarity relation for the vec-tor means <J>P and the associated Holder inequality in Rs, using the quasi-linearization technique of Section 5.13. The key difficulty is the transitionfrom the Euclidean vector scalar product (A,/x) = X'fi on Rs, to the Eu-clidean matrix scalar product ( C , D ) = trace CD on Sym(s). Our approach

Page 175: Pukelsheim Optimal DoE

142 CHAPTER 6: MATRIX MEANS

uses a few properties of vector majorization that are rigorously derived inSection 6.9. We begin with a lemma that provides a tool to recognize diagonalmatrices.

6.8. DIAGONALITY OF SYMMETRIC MATRICES

Lemma. Let C be a symmetric s x s matrix with vector 8(C) = (en,.. . ,css)' of diagonal elements and vector A(C) = (Ai , . . . ,A s ) ' of eigenvalues.Then the matrix C is diagonal if and only if the vector S(C) is a permutationof the vector A(C).

Proof. We write c = 8(C) and A = A(C), for short. If C is diagonal,C = Ac, then the components c;; are the eigenvalues of C. Hence the vector cis a permutation of the eigenvalue vector A. This proves the direct part.

For the converse, we first show that Ac may be obtained from C throughan averaging process. We denote by Sign(s) the subset of all diagonal s x smatrices Q with entries qy} € {±1} for / < s, and call it the sign-changegroup. This is a group of order 2s.

We claim that the average over the matrices QCQ with Q e Sign(s) is thediagonal matrix Ac,

To this end let e, be the ; th Euclidean unit vector in IR5. We have e-QCQej =quqjjCij. The diagonal elements of the matrix average in (1) are then

The off-diagonal elements are, with i ̂ /,

Hence (1) is established.Secondly, for the squared matrix norm, we verify the invariance property

Page 176: Pukelsheim Optimal DoE

6.8. DIAGONALITY OF SYMMETRIC MATRICES 143

On the one hand, we have ||(>C()||2 = trace QCQQCQ = trace C2 =£)y<5A2. On the other hand, we have ||AC||2 = trace ACAC = ]Cy<5c/;

=

£]7<SA2, where the last equality follows from the assumption that c is apermutation of A.

Finally we compute the squared norm of the convex combination (1) andutilize the invariance property (2) to obtain, because of convexity of thesquared norm,

Hence (3) holds with equality. Since the convexity of the squared norm isstrict and every weight 1/2* is positive, equality in (3) forces all matrices inthe sum to be the same, QCQ = C for all Q € Sign(s). Going back to (1)we find that Ac = 1/25 £G€Sign(*) C = C, that is, C is diagonal. D

The proof relies on equation (1), to transform C into QCQ, then takethe uniform average as Q varies over the sign-change group Sign(s), andthereby reproduce Ac. The vectors A(C) and 5(C) are related to each otherin a similar fashion; to transform A(C) into Q\(C), then take some averageas Q varies over a finite group Q, and finally reproduce 8(C). The relation iscalled vector majorization. It is entirely a property of column vectors, and achange of notation may underline this. Instead of S(C) and A(C), we choosetwo arbitrary vectors x and y in IR*.

The group Q in question is the permutation group Perm(/c), that is, thesubset of all permutation matrices in the space Rkxk. A permutation IT of thenumbers ! , . . . ,& induces the permutation matrix

where €j is the yth Euclidean unit vector of Uk. It is readily verified thatthe mapping IT i-» Q^ is a group isomorphism between permutations, andpermutation matrices. Hence Perm(/c) is a group of order k\.

A permutation matrix Q^ acts on a vector y e Uk by permuting its entriesaccording to TT~I,

Page 177: Pukelsheim Optimal DoE

144 CHAPTER 6. MATRIX MEANS

We are interested in averages of the type

with weights aQ that satisfy mmQePeTm(k) aQ>Qvector x has its components less spread out than those of y in the sense that xresults from averaging over all possible permutations Qy of y. This averagingproperty is also reflected by the fact that the matrix 5 is doubly stochastic.

A matrix S £ R/cx* is called doubly stochastic when all elements are non-negative, and all row sums as well as all column sums are 1,

The set of all doubly stochastic k x k matrices is a compact and convexsubset of the matrix space Rkxk, closed under transposition and matrix mul-tiplication. Every permutation matrix is doubly stochastic, as is every averageY^QePerm(k) aQQ- Conversely, the Birkhoff theorem states that every doublystochastic matrix is an average of permutation matrices. We circumvent theBirkhoff theorem in our exposition of vector majorization, by basing its def-inition on the property of 5 being doubly stochastic.

6.9. VECTOR MAJORIZATION

The majorization ordering compares vectors of the same dimension, express-ing that one vector has its entries more balanced, or less spread out, thananother vector. For two vectors x,y e Rk, the relation x -< y holds when xcan be obtained from y by a doubly stochastic transformation,

for some doubly stochastic k x k matrix 5.

In this case, the vector x is said to be majorized by the vector y.Not all pairs of vectors are comparable under majorization. A necessary

reauirement is that the comnonent sums of the two vectors are the same:if then The relation isreflexive and transitive,

Hence the majorization ordering constitutes a preordering,For this preordering to be a partial ordering, antisymmetry is missing.

We claim that vector majorization satisfies a weaker version which we call

The

Page 178: Pukelsheim Optimal DoE

6.9. VECTOR MAJORIZATION 145

antisymmetry modulo Perm(/c),

To prove the direct part in (1), we consider decreasing rearrangements andpartial sum sequences. Given a vector x its decreasing rearrangement, jq, isdefined to be the vector that has the same components as x, but in decreasingorder. For x to be a permutation of another vector y, it is evidently necessaryand sufficient that the two vectors share the same decreasing rearrangements,JC| = yi. This equality holds if and only if the partial sums over the consecutiveinitial sections of jq = (x^,..., x^)' and y^ — (y^,. . . , y i / t)' coincide,

If jc -x y, then the two sums are the same for h — k, since jc,y,jtj,y| shareone and the same component sum.

We show that majorization implies an ordering of the partial sums,

To this end, let x be majorized by y, that is, x = Sy for some doubly stochasticmatrix S. We can choose two permutation matrices Q and R that map x and yinto their decreasing rearrangements, jt| = Qx and yj = Ry. Since R' is theinverse of /?, we obtain y = R % It follows that x± = Qx = QSy = QSR 'yj =/*yi, where the matrix P = QSR' is doubly stochastic. For h < k, we get

with coefficients that satisfy and Thisfinally yields

We have verified property (3). If jc and y majorize each other, then the twoinequalities from (3) yield the equality in (2). Hence the direct part in (1)is established. For the converse, we only need to apply the majorizationdefinition to x = Qy and y = Q'x. The proof of our claim (1) is complete.

Page 179: Pukelsheim Optimal DoE

146 CHAPTER 6: MATRIX MEANS

We are now in a position to resume the discussion of Section 6.8, andderive the majorization relation between the diagonal vector 8(C) and theeigenvalue vector A(C) of a positive definite matrix C.

6.10. INEQUALITIES FOR VECTOR MAJORIZATION

Lemma. Let C be a positive definite 5 x 5 matrix with vector 8(C) =(en,.. . ,c s s) ' of diagonal elements and vector A(C) = ( A i , . . . , A,)' of eigen-values. Then we have:

a. (Schur inequality) 8(C) is majorized by A(C).b. (Monotonicity of concave or convex functions) For every strictly con-

cave function g : (0; oo) —> R, we have the inequality

while for strictly convex functions g the inequality is reversed. In eithercase equality holds if and only if the matrix C is diagonal.

c. (Monotonicity of vector means) For parameter p € (-00; 1) the vectormeans <&p obey the inequality

while for parameter p e (l;oo), the inequality is reversed. In eithercase equality holds if and only if the matrix C is diagonal.

Proof. We write and for short. For oart fa), wechoose an eigenvalue decomposition and definethe s x s matrix S with entries Since Z' = (zi,...,zs) is anorthogonal sxs matrix, the rows and columns of S sum to 1. Thus 5 is doublystochastic, and we have, for all i < s.

This yields c = S\, that is, c is majorized by A.To prove part (b), let g be a strictly concave function. For a single subscript

/ < s, equalitv (1) vields

Page 180: Pukelsheim Optimal DoE

6.11. THE HOLDER INEQUALITY 147

Summation over / gives

as claimed in part (b). Equality holds in (3) if and only if, for all / < s,equality holds in (2). Then strict concavity necessitates that in (1) positivecoefficients s/y come with an identical eigenvalue Ay = c,,, that is, stj > 0implies c(/ = Ay for all i,j < s. Therefore A; = Y^i-s >os'i^i — lL,i<ssijc"> ^or

all 7 < 5, that is, A = S'c. Thus c and A are majorized by each other. Fromthe antisymmetry property (1) in Section 6.9, the vector c is a permutationof the vector A. Hence C is diagonal by Lemma 6.8. The case of a strictlyconvex function g is proved similarly.

The proof of part (c) reduces to an application of part (b). First considerthe case p e (—oo;0). Since the function g(x) = xp is strictly convex, part (b)yields Ej<s8(cjj) < £;•<,?(*/), that is, g(*p(c)) < g(*p(X)). But g is alsostrictly antitonic, whence we obtain $>p(c} > $P(A). For p — 0, we use thestrictly concave and strictly isotonic function g(x) — log x, for p e (0; 1) thestrictly concave and strictly isotonic function g(x) = xp, and for p € (l;oo)the strictly convex and strictly isotonic function g(x) — X?. LJ

We next approach the Holder inequality for the matrix means <j>p. Twonumbers p,q e (-00; oo) are called conjugate when p + q — pq. Except forp — q = 0, the defining relation permits the familiar form l/p + l/q = 1. Forfinite numbers p and q, conjugacy implies that both lie either in the interval(—oo;l), or in (l;oo). The limiting case, as p tends to ±00, has q = 1.

Therefore we extend the notion of conjugacy to the closed real line[-00; oo] by saying that the two numbers -oo and 1 are conjugate in theinterval [—00; 1], while the two numbers 1 and oo are conjugate in [1; oo]. Theonly self-conjugate numbers are p = q = 0, and p = q = 2 (see Exhibit 6.1).

6.11. THE HOLDER INEQUALITY

Theorem. Let p and q be conjugate numbers in [—oo; 1], and let p and qbe conjugate numbers in [l;oo]. Then any two nonnegative definite 5 x 5matrices C and D satisfy

Assume C to be positive definite, C > 0. In the case p,q € (—oo;l),equality holds in the left inequality if and only if D is positively proportionalto Cp~l, or equivalently, C is positively proportional to Dq~^. In the case

Page 181: Pukelsheim Optimal DoE

148 CHAPTER 6: MATRIX MEANS

EXHIBIT 6.1 Conjugate numbers, p + q = pq. In the interval [-00; 1] one has p < 0 if andonly if q > 0; in the interval [l;oo] one has p > 2 if and only if <? < 2. The only self-conjugatenumbers are p = q = 0 and p — q = 2.

p,q € (l;oo), the equality condition for the right inequality is the samewith p in place of p, and q in place of q.

Proof. First we investigate the cases p,q € (—oo; 1), for positive definitematrices C and D. We choose an eigenvalue decomposition Z 'AAZ. With the orthogonal 5x5 matrix Z' = (z\,..., zs) that comes with D,we define C = ZCZ'. Then the scalar product (C, D) of the matrices C and /)is the same as the scalar product (S(C),A(Z))) of the diagonal vector 5(C)of C and the eigenvalue vector \(D} of D.

The Holder inequality for vector means provides the lower bound0>p (8(C))<bq (A(/))). Lemma 6.10 bounds 4>p (S(C)) from below by <t>p (A(C)).By construction, C has the same eigenvalues as C, that is, A(C) = A(C). Thus

Page 182: Pukelsheim Optimal DoE

6.12. POLAR MATRIX MEANS 149

we obtain the inequality chain

We denote the components of A(D) by Ay , while 8(C) has components Equality holds in the first inequality of (1) if and only if, for some a > 0and for all j < s, we have A in case p,q ^ 0, and

in case p = q — 0. In the second inequality of (1), equality

holds if and only if the matrix C is diagonal. Hence c;/ are the eigenvaluesof C as well as of C, and we havein (1) entails

Altogether equality

That this condition is also sufficient for equality is seen bv straightforwardverification. Evidently is equivalent with

The case p,q e (l;oo) uses similar arguments, with reversed inequalitiesin (1). The extension from positive definite matrices to nonnegative definitematrices C and D follows by continuity. The extension to parameter valuesP->4 £ {-0°,!} and p,q € {l,oo} follows by a continuous passage to thelimit.

If p, q / 0 and C, D > 0, then D is proportional to Cp~l if and only if Cp

and Dq are proportional. The latter condition looks more symmetric in Cand D.

The polar function of a matrix mean <f>p is defined by

In either case the denominators are positive and so the definition makessense. The formulae are the same as those in Section 5.12. It turns out thatthe polar functions again are matrix means, up to a scale factor.

6.12. POLAR MATRIX MEANS

Theorem. Let p and q be conjugate numbers in [-co; 1], and let p and qbe conjugate numbers in [1; oo]. Then the polar functions of the matrix means

Page 183: Pukelsheim Optimal DoE

150 CHAPTER 6: MATRIX MEANS

are

Proof. We again write (C, D) = trace CD. In the first part, we concen-trate on the case p,q < 1. Fix a positive definite matrix D. The Holderinequality proves one half of the polarity formula,

The other half follows from inserting the particular choice Dq~l for C,provided p,q € (-00; 1). This choice fulfills the equality condition of theHolder inequality. Hence we obtain

For/? = -oo, we insert for C the choice /5. Forp = 1, we choose the sequenceCm = zz' + (l/m)/s > 0 and let m tend to infinity, where z is any eigenvectorof D corresponding to the smallest eigenvalue Amin(D).

This shows that the functions <f>£° and s<f>q coincide on the open conePD(,s). Because of continuity they also coincide on the closed cone NND(s).The first polarity formula is established.

In the second part of the proof, we turn to the case p, q > 1. For D = 0, wehave <^(0) = 0 — s^(0). In order to handle symmetric matrices D ^ 0, weutilize from Section 6.7, the positive and negative parts £>+ and £>_ and themodulus \D\ = D+ + Z)_. Let C be another symmetric matrix. Monotonicityof the trace, as discussed in Section 1.11, entails -(C±,DT) < 0 < (C±,DT).Hence we obtain a bound for the trace scalar product,

This bound and representation (3) from Section 6.7 yield

(This estimate has no counterpart for polars of information functions.)We restrict ourselves to the case p, q e (1; oo) and a nonsingular matrix D.

The other cases then follow by continuity. Nonsingularity of D forces |D| to

Page 184: Pukelsheim Optimal DoE

6.13. MATRIX MEANS AS INFORMATION FUNCTIONS AND NORMS 151

be positive definite. Hence the Holder inequality applies:

Together with (2), this establishes the first half of the polarity formula,

The other half follows if we insert the choice ) forPositive and negative parts fulfill the orthogonality relationentailing

From i and we obtain

Hence equality holds, and the proof is complete.

The functional properties of the matrix means may be summarized asfollows.

6.13. MATRIX MEANS AS INFORMATION FUNCTIONS ANDNORMS

Theorem. Let p and q be conjugate numbers in [-00; 1]. Then the matrixmean (f>p is an information function on NND(s), with s<j>q as its polar function;$p is strictly concave on PD(s) if p e (-oo;l), (j>p is strictly isotonic onNND(s) if p e (0; 1], and <f>p is strictly isotonic on PD(s) if p e (-00; 0].

Let p and q be conjugate numbers in [l;oo]. Then the matrix mean <j>p

is a norm on Sym(s), with sfa as its polar function; <j>p is strictly convex ifp e (l;oo), 4>p is strictly isotonic on NND(s) if p e [l;oo).

Proof. With p and q interchanged, Theorem 6.12 gives <f>p = (l/s)<f>™.Hence <f>p is an information function, by Theorem 5.13. Theorem 6.12 alsocomprises the polarity formula <j>™ = s<f>q. It remains to investigate strictconcavity and strict monotonicity.

Page 185: Pukelsheim Optimal DoE

152 CHAPTER 6: MATRIX MEANS

Strict concavity on PD(s) follows from Corollary 5.3, provided tj>p is strictly

superadditive on PD(s). To this end we show that, for C > 0, D ^ 0 and

p e (-00; 1), the equality (f>p(C+D) = <f>p(C)+<t>p(D) forces D to be positivelyproportional to C. Indeed, upon introducing E — (C+D)p~l, we have equalityin the Holder inequality whence

Now we assume that equality holds. Then E minimizes {C, F)/<f>px>(F) over

F > 0. The Holder inequality states that this occurs only if E is positivelyproportional to CP~I. But E = (C + DY'1 and E = aCp~l with a > 0 implyC + D = a1/^-1^, and D = (aV(p-i) _ i)c. If a < 1, then D < 0 and if

a = 1, then D = 0. Either case contradicts D ^ 0. Hence a > 1 and D is

positively proportional to C. This establishes strict concavity on PD(^), forpe (-00; l).

From Corollary 5.5, if p £ (0;1], then (f>p is positive and hence strictlyisotonic. Moreover, if p e (-00; 0], then <f>p is strictly superadditive on PD(s)and hence strictly isotonic on PD(s). The proof for the norms <f>p is similarand hence omitted.

Having established the functional properties of the matrix means <j>p, wereturn to the design problem.

6.14. THE GENERAL DESIGN PROBLEM WITH MATRIX MEANS

In Section 5.15, we introduced the general design problem for a parametersystem K'6, with coefficient matrix K of full column rank s. If the optimalitycriterion is a matrix mean <f>p, with parameter p e [-co; 1], the problem takesthe form

For C > 0 fixed, <f>p(C) is isotonic in p. Hence the optimal value function(CK(M)) is isotonic in p, that is,

In particular, the optimal values v(<f>p) are bounded fromabove by v(d>\).

for alll

Page 186: Pukelsheim Optimal DoE

6.15. ORTHOGONALITY OF TWO NONNEGATIVE DEFINITE MATRICES 153

The matrix means of greatest importance are singled out in Section 6.1to Section 6.5, the D-, A-, E-, and T-criteria <fr>, 4>_i, <f>-oo, and <f>\. The de-terminant criterion <fo is self-polar. The average-variance criterion <£>_! hasthe matrix mean s<fo/2 for its polar function. The smallest-eigenvalue crite-rion 0-oo and the trace criterion <fo form a polar pair. In summary, for theclassical criteria, the polarity relations are

In Section 6.5, we illustrated by example that there need not exist a <f>\-optimal moment matrix for K'6 in M. However, this anomaly can happenonly for the trace criterion <fo. Indeed, if p E [-00;0], then the matrix mean<^>p vanishes for singular matrices, and part (b) of Lemma 5.16 applies. Ifp e (0; 1), then the mean <f>p has polar function s<f>q which vanishes for sin-gular matrices and is strictly isotonic on PD(s). This verifies the first half ofpart (c) in Lemma 5.16. The second half, solvability of the polarity equation<l>p(C)4>™(D) = trace CD = 1, is established later, in Theorem 7.13.

At any rate, the matrices solving the polarity equation in part (c) ofLemma 5.16 play a vital role. In Theorem 7.9, we show on an abstract levelthat the polarity equation is solvable provided C is positive definite. Forthe matrix means <f>p we can determine these solutions quite explicitly. Anauxiliary matrix result is established first.

6.15. ORTHOGONALITY OF TWO NONNEGATIVE DEFINITEMATRICES

Let A and B be two n x k matrices. We call A orthogonal to B when thematrix scalar product (A,B) = trace A'B vanishes,

This notion of orthogonality refers to the scalar product in the space Rnxk.It has nothing to do with an individual k x k matrix A being orthogonal. Ofcourse, the latter means A 'A — Ik and n — k.

If A and B are square and symmetric, then for the scalar product to vanish,it is clearly sufficient that the matrix product AB is null. For nonnegativedefinite matrices A and B the converse also holds true:

In order to establish the direct part, we choose two square root decom-positions, A = UU' and B = VV (see Section 1.14). We get (A,B) =trace UU'VV = \.race(U'V)'U'V = \\U'V\\2. Hence if A is orthogonal

Page 187: Pukelsheim Optimal DoE

154 CHAPTER 6: MATRIX MEANS

to B, then U'V is null. Premultiplication by U and postmultiplication by V'leads to Q=UU'VV = AB.

Moreover the equation AB = 0 permits a geometrical interpretation interms of the ranges of A and B. The equation means that the range of B isincluded in the nullspace of A. By Lemma 1.13, the latter is the orthogonalcomplement of the range of A. Hence AB is the null matrix if and only ifthe ranges of A and B are orthogonal subspaces. For nonnegative definitematrices A and R we thus have

Notice the distinct meanings of orthogonality. The left hand side refers toorthogonality of the points A and B in the space Sym(fc), while the righthand side refers to orthogonality of the two subspaces range A and range Bin the space HI*. Lemma 2.3 is a similar juxtaposition for sums of matricesand sums of subspaces.

The equivalences (1), (2), and (3) are used repeatedly and with othernonnegative definite matrices in place of A and B.

6.16. POLARITY EQUATION

Lemma. Let C be a positive definite 5 x 5 matrix, and let p be a numberin [—00;!]. Then D e NND(s) solves the polarity equation

if and only if

where the set 5 consists of all rank 1 matrices zz' such that z is a norm 1eigenvector of C corresponding to its smallest eigenvalue Amjn(C).

Proof. For the direct part, let D > 0 be a solution of the polarity equa-tion. In the case p e (-00; 1), the polarity equation holds true if and onlyif D = aCP~l for some a > 0, by Theorem 6.11. From trace C(aCp~l) -a trace Cp = 1, we find a = I/trace Cp.

In the case p = 1, trace monotonicity of Section 1.11 and the polarity

Page 188: Pukelsheim Optimal DoE

6.17. MAXIMIZATION OF INFORMATION VERSUS MINIMIZATION OF VARIANCE 155

equation yield

where the penultimate line employs the definitions of fa and </>-oo, and thelast line uses the polarity formula s^-oo = 4>\°' Thus we have traceC(D -hmm(D)Is) = 0, with C positive definite. This entails D — Amin(D)/5, thatis, D is positively proportional to Is — C1"1. From trace C(als) = 1, we geta = I/trace C1.

The last case, p = -oo, is more delicate. With C > Amin(C)/s, we infer asbefore

This entails trace (C - Amin(C)/s)£> = 0, but here D may be singular. How-ever, C - Amin(C)/5 is orthogonal to D in the sense of Section 6.15, that is,CD = Amin(C)D. With eigenvalue decomposition D = ̂ 2j<s hjZjZJ, postmul-tiplication by Zj yields \jCzj = Amjn(C)A,z ;, for all ; < s. If Ay ^ 0, thenZj is a norm 1 eigenvector of C corresponding to Amin(C), and ZjZJ G S.Furthermore, the polarity equation implies that the nonnegative numbersAmin(C)A, sum to 1, £;<,Amin(C)A;- - trace Amin(C)D - trace CD = 1.Hence Amin(C)D = Y^j:\>o(^mm(C)^j}zjZ- is a convex combination of rank1 matrices from S. The proof of the direct part is complete.

The converse follows by straightforward verification.

6.17. MAXIMIZATION OF INFORMATION VERSUSMINIMIZATION OF VARIANCE

Every information function </> is log concave, that is, log $ is concave. To seethis, we need only apply the strictly isotonic and strictly concave logarithmfunction,

Page 189: Pukelsheim Optimal DoE

156 CHAPTER 6: MATRIX MEANS

In view of log </> = — log(l/0), log concavity of </> is the same as log con-vexity of 1/0. Log convexity always implies convexity, as is immediate fromappealing to the strictly isotonic and strictly convex exponential function.Hence the design problem, a maximization problem with concave objectivefunction 0, is paired with a minimization problem in which the objectivefunction N *-* 1/0 °° (#'#.£) is actually log convex, not just convex.

The relation between 0 and 1/0 has the effect that some criteria are cov-ered by the information function concept which at first glance look different.For instance, suppose the objective is to minimize a convex matrix mean 0pof the dispersion matrix C"1, in order to make the dispersion matrix as smallas possible. A moment's reflection yields, with p = -p,

Hence the task of minimizing the norm 0p of dispersion matrices C l isactually included in our approach, in the form of maximizing the informationfunction 0P of information matrices C, with p = —p. By the same token,minimization of linear functions of the dispersion matrix is equivalent tomaximization of a suitable information function (compare the notion of linearoptimality in Section 9.8).

Maximization of information matrices appears to be the appropriate opti-mality concept for experimental designs, much more so than minimization ofdispersion matrices. The necessary and sufficient conditions for a design tohave maximum information come under the heading "General EquivalenceTheorem".

EXERCISES

6.1 Show that the nroiection of ( onto the cone NND(s) is thepositive part C+, that is,

6.2 Disprove that if C = A - B and A,B > 0 then A > C+ and B > C_,for C € Sym(.y).

6.3 Show that the matrix modulus satisfies , for all C e Sym(s) [Mathias

(1990), p. 129].

6.4 For a given matrix C € NND(s), how many solutions has the equation in SymCO?

6.5 Are |(0,1,1)' and |(3,1,1)' comparable by vector majorization?

Page 190: Pukelsheim Optimal DoE

EXERCISES 157

6.6 Verify that, for p < 1, the matrix mean $p fails to be concave onSvmf.vV

6.7 Prove the oolaritv equation for vector means, for i andA vector solves if

and only if in case andconv S in case where the set 5 consists

of all Euclidean unit vectors ef such that y, is the smallest componentof y.

6.8 Let w-i,..., ws be nonnegative numbers. For p ^ 0, ±00 and(0; oo)s, the weighted vector mean is defined by

What is the aoorooriate definition for Extend the domainof definition of < to the space Us.

6.9 (continued) Show that if wi,...,ws are positive and sum to 1, thenthe polar of the weighted vector mean isw^Vs) on NND(s), where p and q are conjugate over [-00; 1] [Gut-mair (1990), p. 29].

6.10 Let W > 0 be a nonnegative definite matrix. For p ^ 0, ±00 and C > 0,the weighted matrix mean is defined byWhat is the appropriate definition for p — 0, ±00? Extend the domainof definition of <^ to the space Sym(,s).

6.11 Show that the polar of the weighted matrix mean $\ '(C) = trace WCequals l/Amax(WD~) or 0 according as D e A(W) or not [Pukelsheim(1980), p. 359].

Page 191: Pukelsheim Optimal DoE

C H A P T E R 7

The General EquivalenceTheorem

The General Equivalence Theorem provides necessary and sufficient condi-tions for a moment matrix to be <f> -optimal for the parameter system of interestin a compact and convex set of competing moment matrices, where <f> is aninformation function. The theorem covers nondifferentiable information func-tions, and singular moment matrices. The proof is based on convex analysis,the tools being subgradients and normal vectors to a convex set. The particularversions of the theorem are given that are applicable to the matrix means <f>p.

7.1. SUBGRADIENTS AND SUBDIFFERENTIALS

The topic of this chapter is the General Equivalence Theorem 7.14 whichprovides necessary and sufficient conditions for a moment matrix to solvethe design problem. The main results, from Section 7.10 onwards, pertain tomoment matrices M. In Chapter 8 we turn to designs £ proper. The auxiliaryresults, up to Section 7.9, develop tools from convex analysis.

This approach avoids an overemphasis of differentiability. A differentiabil-ity assumption precludes such objective functions as the smallest-eigenvaluecriterion 4>_oo, and it does not permit moment matrices to become singu-lar. In the latter case, we even have to face a lack of continuity, as met inSection 3.16.

We make extensive use of two notions of convex analysis: subgradientsand normal vectors to a convex set. To begin with, we discuss subgradientsand subdifferentials of concave functions, in the linear space of symmetricmatrices Sym(fc) with Euclidean scalar product (A,B) = trace AB.

For a concave function g : NND(£) —> R and for a nonnegative definitematrix M e NND(fc), a symmetric matrix B e Sym(fc) is called a subgradientof g at M when it satisfies the subgradient inequality

158

Page 192: Pukelsheim Optimal DoE

7.2. NORMAL VECTORS TO A CONVEX SET 159

EXHIBIT 7.1 Subgradients. Each subgradient Bi,B2,B3,... to a concave function g at apoint M is such that it determines an affine function A »-» g(M) + (A — M, Bj) that bounds gfrom above and coincides with g at M.

The set of all subgradients of g at M is called the subdifferential of g at M,and is denoted by dg(M). When the set dg(M) is nonempty the function gis said to be subdifferentiable at M. In the design problem g is a composition4> ° C/c-

The subgradient inequality has a pleasing geometrical interpretation. Letus consider the linear function T(A) = (A,B). Then the right hand side ofthe subgradient inequality is the affine function A i-» g(M) + T(A - M),with value g(M) at A = M. Hence any subgradient of g at M gives rise to anaffine function that globally bounds g from above and, locally at M, coincideswith g (see Exhibit 7.1).

We rigorously prove whatever properties of subgradients we need. How-ever, we briefly digress and mention some more general aspects. The sub-differential dg(M) is a closed and convex set. If M is singular then thesubdifferential dg(M) may be empty or not. If M is positive definite thendg(M) is nonempty. The function g is differentiable at M if and only if thesubdifferential dg(M) is a one-point set. In this case the unique subgradientis the gradient, dg(M) = {Vg(M)}. A similar relation holds for generalizedmatrix inversion (compare Section 1.16). The matrix A is invertible if andonly if the set A~ of generalized inverses is a one-point set. In this case theunique generalized inverse is the inverse, A' = {A'1}.

7.2. NORMAL VECTORS TO A CONVEX SET

The other central notion, normal vectors to a convex set, generalizes theconcept of orthogonal vectors to a linear subspace. Let M be a convex set ofsymmetric k x k matrices, and let M be a member of M. A matrix B e Sym(A:)

Page 193: Pukelsheim Optimal DoE

160 CHAPTER?: THE GENERAL EQUIVALENCE THEOREM

EXHIBIT 12 Normal vectors to a convex set. Any normal matrix B to a set M at a point Mis such that the angle with the directions A - M is greater than or equal to a right angle, forall A 6 M.

is said to be normal to M at M when it satisfies the normality inequality

Geometrically, this means that the angle between B and all the directions A —M from M to A e M is greater than or equal to a right angle, as shown inExhibit 7.2.

If M. is a subspace, then every matrix B normal to M at M is actuallyorthogonal to M. If M is a subspace containing the matrix A, then it alsocontains SA for 8 > 0. The inequality (8A, B) < (M, B), for all 6 > 0, forces{>!,#} to be nonpositive, (A,B) < 0. Since the same reasoning applies to—A e M, we obtain (-4,5) = 0, for all A £ M. This shows that the matrix Bis orthogonal to the subspace M.

The following lemma states that, under our grand assumption that theset of competing moment matrices M. intersects the feasibility cone A(K),we can formulate an "equivalent problem" in which the set M intersects theopen cone of positive definite matrices. The "equivalent problem" is obtainedby reparametrizing the original quantities. This is the only place in our de-velopment, and a technical one, where we make use of a reparametrizationtechnique. A similar argument was used in the second part of the proof ofLemma 4.12.

7.3. FULL RANK REDUCTION

Lemma. Let the set M. C NND(/:) be convex. Then there ev'ct a mimh*»rr < k and a k x r matrix U with U'U = Ir such that the setof reduced r x r moment matrices satisfie and the reduced

Page 194: Pukelsheim Optimal DoE

7.3. FULL RANK REDUCTION 161

information matrix mapping Cu >K fulfills

Proof. By Lemma 4.2, the set M contains a moment matrix M withmaximum rank r, say. Let be an eigenvalue decomposition

of M. Then the k x r matrix U = (u\,...,ur) satisfies U'U = Ir, and UU'projects onto the range of M. The set M. — U 'M.U contains the r x r matrixU'MU = U'U&\ U 'U = AA which is positive definite. Hence the intersectionof M and PD(r) is nonempty.

Since the range of M is maximal it includes the range of M, for all M e M.By our grand assumption of Section 4.1, there is a matrix in M. such that itsrange includes the range of K, hence so does M. Recalling that UU' projectsonto the range of M, we get

Now we verify the equality string, upon setting > for

The first and the last equation hold by definition. The two minima are equalbecause the two sets of matrices over which the minima are formed are thesame. Namely, given GU'MUG' with GU'K = Is, we have that L = GU'is a left inverse of K with LML' = GU'MUG'. Conversely, given LML'with LK = /,, we see that G = LU satisfies GU'K = LUU'K = LK = Is

and GU'MUG' = LUU'MUU'L' = LML'.This establishes the first equation of the lemma. For the second equation,

we use Cv>K(M) = CK(M), and conclude with CK(M) = CK(U(U'MU)U ') =CK(UMU').

The lemma says something about the reparametrized system K'6 =(UU'K)'O = (U'K)'U'O = K'6, where K = U'K and 0 = U'O. Theset of information matrices C#(M) for K'6, as M varies over the set M ofcompeting moment matrices, coincides with the set of reduced informationmatrices C~(M) for K'6, as M varies over the set M of reduced moment

/C

Page 195: Pukelsheim Optimal DoE

162 CHAPTER?: THE GENERAL EQUIVALENCE THEOREM

matrices. Therefore, the moment matrix M is optimal for K'6 in M if andonly HU'MU is optimal for K'8 in M.

We employ the lemma in the Duality Theorem 7.12. Otherwise it allows usto concentrate on the assumption that the set M contains a positive definitematrix, that is, M intersects the interior of the cone NND(fc). The followingtheorem provides the essential tool for our optimality investigations.

7.4. SUBGRADIENT THEOREM

Theorem. Let g : NND(fc) —» R be a concave function, and let the setM C NND(fc) be convex and intersect the cone PD(fc). Then a momentmatrix M e M maximizes g over M if and only if there exists a subgradientof g at M that is normal to M at M, that is, there exists a matrix B € dg(M)such that

Proof. The converse part of the proof is a plain consequence of thenotions of subdifferentiability and normality, g(A) < g(M) + (A - M,B) <g(M) for all A e M.

The direct part is more challenging in that we need to exhibit the exis-tence of a subgradient B that is normal to M. at M. We invoke the sep-arating hyperplane theorem in the space Sym(fc) x IR with scalar product((A, a), (B, ft)) = (A,B) + aft = (trace AB) + aft. In this space, we introducethe two sets

It is easily verified that both sets are convex. Optimality of M forces anypoint (A, a) in the intersection /C n C to fulfill a < g(A) < g(M) < a. This isimpossible. Hence the sets 1C and £ are disjoint.

Therefore there exists a hyperplane separating /C and £, that is, there exista pair (0,0) ^ (B,ft) £ Sym(k) x R and a real number y such that

In other words, the hyperplane H = {(A, a) e Sym(fc) x IR : (A, B)+aft — y}is such that the set /C is included in one closed half space associated with H,while £ is included in the opposite closed half space.

In (1), we now insert (A/,g(M)) e K. for (A,a) andfor (A, a):

Page 196: Pukelsheim Optimal DoE

7.5. SUBGRAD1ENTS OF ISOTONIC FUNCTIONS 163

This yields e$ > 0 for all e > 0, whence /3 is nonnegative. We exclude thecase j8 = 0. Otherwise (1) turns into sup~^ft(^4, B} < iniAeM(A,B}, with B ̂

s\^\)

0. That is, in Sym(fc) there exists a hyperplane separating the sets NND(fc)and M. But by assumption, the set M contains a matrix that is positivedefinite. Therefore NND(fc) and M cannot be separated and /3 is positive.

Knowing j8 > 0, we define B — -B//3. Replacement of y by g(M)/3 +(M,B) from (2) turns the first inequality in (1) into -(M,B). With a = g(A), we get

Therefore B is a subgradient of g at M. The second inequality in (1) becomes

Lettine a tend to e(M} we see that the suberadient B is normal to M at M.

7.5. SUBGRADIENTS OF ISOTONIC FUNCTIONS

Lemma. Let the function g : NND(fc) —> U be isotonic and concave.Then for all M e NND(fc), every subgradient of g at M is nonnegative defi-nite. Let the function <f> : NND(s) —» R be strictly isotonic on PD(s) and con-cave. Then for all C € PD(s), every subgradient of 4> at C is positive definite.

Proof. Let B be a subgradient of g at M > 0. In the subgradient in-equality g(^4) < g(M) + (A - M,B), we insert A — M + zz', with z € IR*.Monotonicity yields 0 < g(M + zz') — g(M), whence we get

Hence the matrix B is nonnegative definite.Let £ be a subgradient of $ at C > 0. Then strict monotonicity on PD(s)

entails

It remains to compute subdifferentials, and to this end we exploit theproperties that are enjoyed by an information function. First we make use ofconcavity and monotonicity. Again we have in mind compositions g = <£ o CK

on NND(/c), given by information functions <£ on NND(^).

Therefore the matrix E is positive definite.

Page 197: Pukelsheim Optimal DoE

164 CHAPTER?: THE GENERAL EQUIVALENCE THEOREM

We now study the subdifferential of compositions of <£ with the informa-tion matrix mapping CK, where the k x 5 coefficient matrix K has full columnrank s.

7.6. A CHAIN RULE MOTIVATION

The tool for computing derivatives of compositions is the chain rule. Thedefinition of subgradients in Section 7.1 applies to functions with values inthe real line R, with its usual (total) ordering. We extend the idea to functionswith values in the linear space Sym(.s), with its Loewner (partial) ordering.A subgradient mapping T of the information matrix mapping CK at M isdefined to be a linear mapping T from the space Sym(A:) into the spaceSym(s) that satisfies the subgradient inequality

where the inequality sign refers to the Loewner ordering in the space Sym(.s).Can we find such subgradient mappings Tl

The answer is in the affirmative. The information matrix CK(M) is a min-imum relative to the Loewner ordering,

where L is a left inverse of K that is minimizing for A/. We define a linearmapping T from Sym(A:) to Sym(s) by T(A) = LAL', and claim that T is asubgradient mapping of CK at M. Indeed, for all A > 0 we have

It remains open whether all subgradient mappings of CK at M arise in thisway.

Now let <f> be an isotonic and concave optimality criterion. We turn to thecomposition of a subgradient D of $ at C/t(Af), with a subgradient map-ping T(A) — LAL' of CK at M. Thus we want to merge two subgradientinequalities,

The subgradient D is nonnegative definite, by the first part of Lemma 7.5.As seen in Section 1.11, the linear form C i-» (C,D) is then isotonic relative

Page 198: Pukelsheim Optimal DoE

7.7. DECOMPOSITION OF SUBGRADIENTS 165

to the Loewner ordering. Now we apply the first inequality to E = CK(A).Making CK(A) larger according to the second inequality, we get, for all A

Therefore the matrix T'(D) — L'DL is a subgradient of the composition0 o CK at M.

However, this argument leaves open the question whether all subgradientsof <t> o CK at M are obtained this way. The answer is in the affirmative if Mis positive definite, and in the negative if M is singular. Indeed, let F be anonnegative definite matrix orthogonal to M, that is, F > 0 and (F,M} =0.Then we get 0 < (A,F) = (A - M, F), for all A > 0. Hence the matrix

is also a subgradient of </> o CK at M. It emerges as Corollary 7.8 to the(somewhat technical) next theorem that all subgradients are of this formif M is feasible for K'6.

7.7. DECOMPOSITION OF SUBGRADIENTS

Theorem. Let the function <f> : NND(s) —> U. be isotonic and concave,and let M be a nonnegative definite k x k matrix with generalized informa-tion matrix MK = KCK(M}K' for K'6, where the k x s matrix K has fullcolumn rank s. Then for every symmetric k x k matrix B, the following threestatements are equivalent:

a. B is a subgradient of </> o CK at M.

b. K'BK is a subgradient of <f> at CK(M), and B is nonnegative definiteand orthogonal to M - MK.

c. K'BK is a subgradient of <f> at CK(M), and there exist a generalizedinverse G of M and a nonnegative definite k x k matrix F orthogonalto M such that

B = GMKBMKG' + F.

Proof. The proof builds on the properties of generalized informationmatrices as summarized in Theorem 3.24.

Page 199: Pukelsheim Optimal DoE

166 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

I. First we show that (a) implies (b). Let B be a subgradient of <f> o CKat M. Nonnegative definiteness of B is established in Lemma 7.5. We have

With A = MK, the subgradient inequality gives 0 < (MK — M,B) = -(M -MK,B). But M — MK is nonnegative definite, by Lemma 3.14 and soLemma 1.8 provides the converse inequality (M — MK, B) > 0. This proves Bto be orthogonal to M — MK.

Setting C = CK(M), we get (M,B) = (MK,B) = (KCK',B). Forevery nonnegative definite s x s matrix E, we have CK(KEK') —minL€Ksxk. LK=Is LKEK'L1 = E, giving

Thus K 'BK is a subgradient of <£ at C, and (b) is established.II. Secondly we assume (b) and construct matrices G and F that fulfill (c).

With R being a projector onto the nullspace of M, we introduce the k x kmatrix

This definition is legitimate: F is the generalized information matrix of Bfor R'0, in the terminology of Section 3.21. By Theorem 3.24, the matrix Fenjoys the three properties

By (1), F is nonnegative definite. From (2), we have MF = 0, that is, F isorthogonal to M. By assumption, the matrix B is orthogonal to M - MK,whence we have MB = MKB and BM = BMK. This yields

A passage to the complement turns (3) into (nullspace(fi-F))+(range M) =Uk. As in the proof of Theorem 2.16, this puts us in a position where we canfind a nonnegative definite k x k matrix H with a range that is included in

Page 200: Pukelsheim Optimal DoE

7.8. DECOMPOSITION OF SUBDIFFERENTIALS 167

the nullspace of B - F and that is complementary to the range of M,

The first inclusion means (B - F)H = 0 whence (4) gives (M + H)(B -F)(M + H) — MKBMK. Choosing for M the generalized inverse G = (M +HY1 from Lemma 2.15, we obtain B - F = GMKBMKG. This provides therepresentation required in part (c).

III. Thirdly we verify that (c) implies (a). For A > 0, we argue that

The first inequality holds because K'BK is a subgradient of $ at CK(M).The equality uses the representation AK — KCK(A)K'. For the second in-equality we observe that B inherits nonnegative definiteness via MKBMK

from K'BK, by Lemma 7.5. Thus monotonicity yields (AK,B} < (A,B).Since G'MG is a generalized inverse of M it is also a generalized inverseof MK, by Lemma 3.22. Inserting the assumed form of B we obtain (M, B} =(MKG'MGMK,B) + (M,F} = (MK,B).

Except for the full column rank assumption on the coefficient matrix K,the theorem is as general as can be. The matrix M need neither be positivedefinite nor lie in the feasibility cone A(K), and the optimality criterion $ isrequired to be isotonic and concave only.

Part (c) tells us how to decompose a subgradient B of </> o CK at M intovarious bits and pieces that are simple to handle individually. For the purposeof construction, we reverse the issue. Given a subgradient D of $ at C, ageneralized inverse G of M, and a nonnegative definite matrix F orthogonalto M, is GKCDCK'G' + F a subgradient of 4> o CK at M? This is indeedtrue provided M is feasible.

7.8. DECOMPOSITION OF SUBDIFFERENTIALS

Corollary. For every nonnegative definite k x k matrix M, the subdif-ferentials of 0 o CK at M and of 0 at C = CK(M) fulfill the inclusion

Page 201: Pukelsheim Optimal DoE

168 CHAPTER?: THE GENERAL EQUIVALENCE THEOREM

relations

Moreover, if M lies in the feasibility cone A(K], then the three sets are equal.

Proof. The first inclusion is verified in Section 7.6. The second followsfrom part (c) in Theorem 7.7, upon replacing MK by KCK' and K'BKby D. Moreover, let M be feasible. For any member B = GKCDCK'G1 +F of the last set, we define Z = CK'G1. Because of feasibility, Theo-rem 3.15 yields LK = CK'G'K = (K'M-K)-1K'M~K = Is and LML' =CK'G'MGKC = C. Thus B = L'DL + F is a member of the first set. D

Lemma 7.5 provides some partial insight into the subgradients D of (f> atC = CK(M), for isotonic and concave criteria <£. The following theorem saysmuch more for information functions <f>. It places all the emphasis on thepolarity equation

which we have already encountered earlier, in part (c) of Lemma 5.16, andwhich for the matrix means <J>P is solved explicitly in Lemma 6.16.

7.9. SUBGRADIENTS OF INFORMATION FUNCTIONS

Theorem. Let </» be an information function on NND($). Then for everypair C and D of nonnegative definite s x s matrices, the following threestatements are equivalent:

and

and

a. (Subdifferential of <b. (Polarity equation)c. (Subdifferential of (

In particular, the Subdifferential of <f> at C is

provided </>(C) is positive. Moreover, if C is positive definite, then <£ is sub-differentiable at C, d<£(C) ^ 0-

Page 202: Pukelsheim Optimal DoE

7.9. SUBGRADIENTS OF INFORMATION FUNCTIONS 169

Proof. It suffices to establish the equivalence of (a) and (b). The equiva-lence of (b) and (c) then follows from the polarity correspondence <£ = (f)0000

of Theorem 5.13.First we derive (b) from (a). We insert SC with 8 > 0 into the subgradient

inequality, dtf>(C) < </>(C) + (8C - C,<j>(C)D}. Cancelling 0(C) > 0, we get8 - 1 < (8 - 1){C,D). The values 8 = 2 and 8 = 1/2 yield {C,D) = 1. Withthis, the subgradient inequality simplifies (j>(E) < <f>(C)(E,D) for all E > 0.For positive definite matrices E, we subdivide by <£(£) > 0 to obtain 1 <<£(C)inf£>0 (E,D)/4>(E) = 4>(Q<j>°°(D). The Holder inequality contributesthe converse direction, <#>(C)^OC(D) < (C,D) = 1. Thus equality holds, and(b) is established.

Next we prove (a) from (b). In view of <#>(C)^°°(D) = 1, the factor 0(C)must be positive. For E > 0, the definition of the polar function yields 1 =<£(C)<£°°(D) < <£(C){£, £>}/<£(£) and 0(£) < <J>(C) + (E - C,<t>(C)D).Regularization extends this inequality to all matrices E > 0. Hence (j>(C)Dis a subgradient of <j> at C.

Moreover, if C is positive definite, then d<f>(C) is nonempty. To see this, wenotice that positive definiteness of C makes the setZ> = { D > 0 : (C,D) — 1}compact. Let D € T> be such that the upper semicontinuous function <£°°attains its maximum over T>. The double polarity relation of Theorem 5.13entails

Hence D solves the polarity equation, </>(C)<£°°(.D) = 1 = (C,D). A similarargument was employed in the proof of Theorem 5.13.

For a composition of the form </> o CK, we now have a couple of ways ofrepresenting the subdifferential. If M € A(K] then C = CK(M) is positivedefinite, d(<f> o CK(M)} is nonempty, and it fulfills

The first two equalities follow from Corollary 7.8. The last equality applies thepresent theorem to the information function <f> o CK on NND(fc) and the polarfunction N »-»<t>°°(K'NK) from Theorem 5.14. For the matrix means <f>p, thepolarity equation submits itself to the explicit solution of Lemma 6.16, thusalso giving us complete command over all subgradients.

Page 203: Pukelsheim Optimal DoE

170 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

If a scalar parameter system c'B is of interest, then <£ is the identity on[0;oo), with derivative one on (0;oo), as remarked in Section 5.17. Hence forM € A(c}, we get

These expressions are familiar to us from the proof of the Equivalence The-orem 2.16 for scalar optimality, except for the matrix F. The General Equiv-alence Theorem 7.14 follows just the same lines.

If subdifferentiability of 4> o CK at M fails to hold, that is, d($ oCK)(M) =0, then M is neither feasible for K' B nor formally $-optimal for K'6 in M.In this case M is of no interest for the general design problem of Section 5.15.

7.10. REVIEW OF THE GENERAL DESIGN PROBLEM

With these results from convex analysis, we now attack the design problem inits full generality. The objective is to maximize the information as measuredby some information function $ on NND(s):

The set M C NND(fc) of competing moment matrices is assumed to becompact and convex. The k x 5 coefficient matrix K is assumed to be of fullcolumn rank s, with information matrix mapping CR. We avoid trivialities bythe grand assumption of Section 4.1 that there exists at least one competingmoment matrix that is feasible, M n A(K) ^ 0.

Although our prime interest is in the design problem, we attack it indirectlyby first discussing its dual problem, for the following reason. Theorem 7.4 tellsus that a moment matrix MI is optimal if and only if there exists a subgradi-ent BI with certain properties. Similarly another moment matrix M2 will beoptimal if and only if there exists a subgradient B2 with similar properties. Inorder to discuss multiplicity of optimal moment matrices, we need to knowhow MI relates to B2. The answer is given by the dual problem which rep-resents B (more precisely, its scaled version N = B/<f>(C)) in a context ofits own. We start with a family of upper bounds for the optimal value v((f>).Derivation of these bounds is based on the polar function <£°° of Section 5.12,but elementary otherwise.

Page 204: Pukelsheim Optimal DoE

7.11. MUTUAL BOUNDEDNESS THEOREM FOR INFORMATION FUNCTIONS 171

7.11. MUTUAL BOUNDEDNESS THEOREM FOR INFORMATIONFUNCTIONS

Theorem. Let M 6 M. be a competing moment matrix and let N be amatrix in the set

Then we have 4>(CK(M)) < l/(f>°°(K'NK), with equality if and only if Mand N fulfill conditions (1), (2), and (3) given below. More precisely, uponsetting C •= CK(M} we have

with respective equality if and only if

Proof. Inequality (i) and equality condition (1) are an immediate conse-quence of how the set M is defined. Inequality (ii) uses M > MK = KCK'and monotonicity of the linear form A >-> trace AN, from Theorem 3.24 andSection 1.11. Equality in (ii) means that N is orthogonal to M - MK > 0. BySection 6.15, this is the same as condition (2). Inequality (iii) is the Holderinequality from Section 5.12, leading to condition (3).

The theorem generalizes the results for scalar optimaliy of Theorem 2.11,suggesting that the general design problem is accompanied by the dual prob-lem:

The design problem and the dual problem bound each other in the sense thatevery value for one problem provides a bound to the other, <f>(CK(M)) <l/<f>°°(K'NK). Another way of expressing this relation is to write

Page 205: Pukelsheim Optimal DoE

172 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

For a moment matrix M to attain the supremum and therefore to be formallyoptimal for K'6, that is, disregarding the identifiability condition M € -A(K),it is sufficient that there exists a matrix TV e A/" such that <}>(CK(M)} =l/<f>°°(K'NK). We now appeal to the Subgradient Theorem 7.4 and the FullRank Reduction Lemma 7.3 to show that the condition is also necessary.

7.12. DUALITY THEOREM

Theorem. We have

In particular, a moment matrix M e M is formally optimal for K'O in M ifand only if there exists a matrix TV € A/" such that

Moreover, any two matrices M e M and N e M satisfy <f>(CK(M)) —l/<t>°°(K'NK) if and only if they jointly satisfy the three conditions (1), (2),and (3) of Theorem 7.11.

Proof. There exists a moment matrix M € M that is formally optimalfor K'O in M, and the optimal value v(<f>) is positive, by Lemma 5.16. ThusC = CK(M) satisfies <f>(C) = v(4>) > 0.

In the first part of the proof, we assume that the set M contains a positivedefinite matrix, MnPD(k) ^ 0. Then Theorem 7.4 secures the existence of asubgradient B o f ( j > o C K a t M that is normal to M at M. From Theorem 7.9,the matrix N = B/<I>(C) satisfies the polarity equation 4»(C)^°°(K'NK) =trace M N = 1. Since B is normal to M at M so is N:

Thus the matrix N is a member of the set A/", and it satisfies </>(C) =1/4>°°(K'NK). Hence M is an optimal solution of the design problem, Nis an optimal solution of the dual problem, and the two problems share acommon optimal value.

In the second part, we make do with the assumption that the set M meetsthe feasibility cone A(K). From Lemma 7.3, we first carry through a fullrank reduction based on the k x r matrix U and the set M = U'MU thatappear there. Because of M n PD(r) ^ 0, the first part of the present proof

Page 206: Pukelsheim Optimal DoE

7.12. DUALITY THEOREM 173

is applicable and yields

Here we have set ft = {N e NND(r) : trace AN < 1 for all A € M}.We claim that this set is related to the set M C NND(fc) of Theorem 7.11through

For the direct inclusion, we take a matrix N e A/" and define N = UNV.Then A4 = V'MU implies N £ M since trace MN = trace MUNU' =trace MN < 1 for all M € X. Furthermore U'U = Ir entails U'NU = N,whence N is a member of U 'AfU. For the converse inclusion, we are givena matrix N eM. From the proof of Lemma 7.3, we borrow

Thus N = U'NU lie-in Af since trace MN - traceU'MUU'NU =trace MN < 1 for all M e A-l. This proves (2). Now we may continue in(1),

and observing UU'K — K concludes the proof.

We illustrate the present theorem with the pathological example of theparabola fit model of Section 6.5. We consider the design which places equalmass 1/2 on the two points ±1. Its information matrix for the full parametervector 9 is singular,

Under the trace criterion, the information for 6 is <£i(M) = (1/3) trace M =1. Now we define the matrix N — 73/3 > 0. For the moment matrix M2(r) ofan arbitrary design r on T = [—1; 1], we compute

Page 207: Pukelsheim Optimal DoE

174 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

Thus the matrix N lies in the set .A/", providing for the design problem theupper bound

Since <f>i(M) attains this bound, the moment matrix M is formally optimal.However, M provides only two degrees of freedom for three unknown pa-rameters whence feasibility fails to hold, as pointed out in Section 6.5.

The counterpart of this nonexistence example are the following three suf-ficient conditions which secure that every formally optimal moment matrixis automatically feasible.

7.13. EXISTENCE THEOREM FOR OPTIMAL MOMENT MATRICES

Theorem. There exists a moment matrix M e M that is formally <f>-optimal for K'O in M, and the optimal value v(<f>) is positive.

In order that every formally <£-optimal moment matrix for K'O in M liesin the feasibility cone A(K), and thus is </>-optimal for K'O in M, any oneof the following conditions is sufficient:

a. (Condition on M) The set M is included in the feasibility cone A(K).b. (Condition on <f>) The information function <£ vanishes for singular

matrices.c. (Condition on <f>°°) The polar information function <£°° vanishes for

singular matrices and is strictly isotonic on PD(s).

Specifically, for the matrix means <f>p with parameter p e [-00; 1), there existsa <t>p-optimal moment matrix for K'O in M..

Proof. Parts (a) and (b) are copied from Lemma 5.16 and part (c) is thefirst half of condition (c) of that lemma. Hence it suffices to show that thesecond half is always true. But for every formally optimal matrix A/, thereexists a solution N e M of the dual problem, and M and N jointly satisfyconditions (1), (2), (3) of Theorem 7.11. Hence C = CK(M) and D = K'NKsolve the polarity equation of Lemma 5.16.

The arguments concerning the matrix means <f>p are compiled in Theo-rem 6.13. For p G [-00; 0], they vanish for singular matrices C (see (2) inSection 6.7). For p e (0; 1) the polar function s<j>q vanishes for singular matri-ces and is strictly isotonic on PD(s).

For the design problem, we are interested in feasible moment matricesonly. Therefore we recast the Duality Theorem 7.12 into the following formwhich is the key result of optimal design theory. The assumptions underlyingthe design problem are those from Section 7.10.

Page 208: Pukelsheim Optimal DoE

7.14. THE GENERAL EQUIVALENCE THEOREM 175

7.14. THE GENERAL EQUIVALENCE THEOREM

Theorem. Let M e M be a competing moment matrix that is feasiblefor K'd, with information matrix C = CK(M). Then M is $-optimal for K' 6in M if and only if there exists a nonnegative definite 5 x 5 matrix D thatsolves the polarity equation

and there exists a generalized inverse G of M such that the matrix N =GKCDCK'G' satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for A weinsert M, or any other matrix M e M that is </>-optimal for K'B in M.

Proof. For the direct part, we do not need the feasibility assumption.Let M be a formally </>-optimal moment matrix for K'B in M. We cannotappeal directly to the Subgradient Theorem 7.4 since there we require M nPD(fc) ^ 0 while here we only assume Mr\A(K) ^ 0. Instead we piece thingstogether as follows. The Duality Theorem 7.12 provides a matrix N G Af with4>(C) - l/<f>°°(K'NK). Conditions (3), (2), and (1) of Theorem 7.11 yield

Theorem 7.9 states that <j>(C)N is a subgradient of <f> o C^ at M. FromCorollary 7.8, it has a representation

with D e d<f>(C), G £ M-, and 0 < F_LM. It follows from Lemma 7.5that the matrix D — D/<f>(C) is nonnegative definite. From part (b) of Theo-rem 7.9, it also solves the polarity equation. The matrix N = GKCDCK'G' <GKCDCK'G' + F/<f>(C) = N satisfies trace AN < trace AN < 1, by thetrace monotonicity of Section 1.11 and because of N e Af.

The converse refers to the more elementary Theorem 7.11 only. The nor-mality inequality shows that the matrix N = GKCDCK'G' lies in the set Af.By feasibility of M, we have K'GK = C~l and K'NK = D. Now the po-larity equation gives <£(C) = l/(f>°°(K'NK). Hence M and W are optimalsolutions of the primal problem and of the dual problem, respectively.

In case of optimality, M and any other optimal matrix M e M satisfytrace M N = 1 = trace MAf, by condition (1) of Theorem 7.11.

Page 209: Pukelsheim Optimal DoE

176 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

We record the variants that emerge if the full parameter vector 6 is ofinterest, and if maximization takes place over the full set M(E) of all momentmatrices.

7.15. GENERAL EQUIVALENCE THEOREM FOR THE FULLPARAMETER VECTOR

Theorem. Let M e M be a competing moment matrix that is positivedefinite. Then M is $-optimal for 0 in M if and only if there exists a non-negative definite k x k matrix N that solves the polarity equation

and that satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for A weinsert M, or any other matrix M e M that is <£-optimal for 6 in M.

Proof. For the full parameter vector 0, any feasible moment matrix M eM is positive definite by Theorem 3.15. In the General Equivalence Theo-rem 7.14, we then have G — M~l, K = Ik, and Cjk(M) — M. The polarityequation and the normality inequality simplify accordingly.

7.16. EQUIVALENCE THEOREM

Theorem. Let M e M(S) be a moment matrix that is feasible for K'0,with information matrix C = CK(M). Then M is ^-optimal for K'8 in M(H)if and only if there exists a nonnegative definite s x s matrix D that solvesthe polarity equation

and there exists a generalized inverse G of M such that the matrix N =GKCDCK'G' satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for x weinsert any support point *,- of any design £ e H that is <£-optimal for K'Bin H.

Page 210: Pukelsheim Optimal DoE

7.18. MERITS AND DEMERITS OF EQUIVALENCE THEOREMS 177

Proof. Since M (E) contains the rank 1 moment matrices A = xx', thenormality inequality of Theorem 7.14 implies the present normality inequal-ity. Conversely, the present inequality implies that of Theorem 7.14, for ifthe moment matrix A e M(H) belongs to the design 17 e E, then we gettrace AN = E^suPP *, *l(x)x'Nx < I .

In the case of optimality of A/, let £ e E be a design that is <£-optimalfor K'S in H, with moment matrix M = A/(£) and with support points*!,...,*£ e X. Then Jt/Afjt, < 1 for some / entails the contradictiontrace MN = £,.<< £(*,-) jc/ATjc,- < 1 - trace MN.

7.17. EQUIVALENCE THEOREM FOR THE FULL PARAMETERVECTOR

Theorem. Let M € M(S) be a moment matrix that is positive definite.Then M is </>-optimal for 0 in M(E) if and only if there exists a nonnegativedefinite k x k matrix N that solves the polarity equation

and that satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for x weinsert any support point jc, of any design £ € H that is <£-optimal for 9 in H.

Proof. The proof parallels that of the previous two theorems.

7.18. MERITS AND DEMERITS OF EQUIVALENCE THEOREMS

Equivalence theorems entail an amazing range of consequences. It is worth-while to pause and reflect on what we have achieved so far. Theorem 7.14allows for a fairly general set M of competing moment matrices, requiringonly that M is a compact and convex subset of the cone NND(&), and that itcontains at least one feasible matrix for K'B. It is for this generality of M thatwe term it a General Equivalence Theorem. Most practical applications re-strict attention to the set M(H) of all moment matrices, and in such cases wesimply speak of an Equivalence Theorem. Exhibit 7.3 provides an overview.

In any case the optimality characterization comes in two parts, the polarityequation and the normality inequality. The polarity equation refers to 5 x5 matrices, as does the information function <£. The normality inequalityapplies to k x k matrices, as does the information matrix mapping CK. Thus

Page 211: Pukelsheim Optimal DoE

178 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

EXHIBIT 7.3 A hierarchy of equivalence theorems. A General Equivalence Theorem(GET) allows for a compact and convex subset M of the set M(H) of all moment matrices.Over the full set M(a) we speak of an Equivalence Theorem (ET). Either theorem simplifiesif interest is in the full parameter vector 6.

the two parts parallel the two components </> and CK of the composition</> ° CK.

In order to check the optimality of a candidate matrix M, we must insome way or other compare M with the competing matrices A. Equivalencetheorems contribute a way of comparison that is linear in A, and hence simpleto evaluate. More involved computations do appear but need only be doneonce, for the optimality candidate A/, such as determining the informationmatrix C = C#(A/), a solution D of the polarity equation, and an appropriategeneralized inverse G of M. The latter poses no problem if M is positivedefinite; then G is the regular inverse, G = M~l. In general, however, thechoice of the generalized inverse G does matter. Some version of G satisfythe normality inequality, others do not.

Yet more important, for the matrix means </>p, reference to the polarityequation disappears. Lemma 6.16 exhibits all of its solutions. If p is finitethen the solution is unique.

7.19. GENERAL EQUIVALENCE THEOREM FOR MATRIX MEANS

Theorem. Consider a matrix mean <f>p with parameter p finite, p e(-00; 1]. Let M € M be a competing moment matrix that is feasible for K'O,with information matrix C = CK(M). Then M is ^-optimal for K'6 in Mif and only if there exists a generalized inverse G of M that satisfies thenormality inequality

Page 212: Pukelsheim Optimal DoE

7.19. GENERAL EQUIVALENCE THEOREM FOR MATRIX MEANS 179

In case of optimality, equality obtains in the normality inequality if for A weinsert M or any other matrix M € M that is $p-optimal for K'6 in M.

Specifically, let M e M be a competing moment matrix that is positivedefinite. Then M is $p-optimal for 0 in M if and only if

Proof. The polarity equation has solution D = Cp l/ trace Cp, byLemma 6.16. Hence the General Equivalence Theorem 7.14 specializes tothe present one.

ALTERNATE PROOF. If the optimality candidate M is positive definite thenthe theorem permits an alternate proof based on differential calculus, pro-vided we are ready to use the facts that the functions fp(X) = trace Xp, forp / 0, and fo(X) — det X are differentiable at a positive definite matrix C,with gradients

The chain rule then yields the gradient of the matrix mean </>p at C,

Because of positive definiteness of M, the information matrix mapping C%becomes differentiable at M. Namely, utilizing twice that matrix inversionX-1 has at M the differential -M~1XM-1, the differential of CK(X) =(K'X~lKYl at M is found to be

Composition with the gradient of <f>p at C yields the gradient of <J>P o CKatM,

That is, the gradient is proportional to M~lKCp+lK'M~l. Hence it is normalto M at M if and only if the normality inequality of Theorem 7.19 holdstrue.

Page 213: Pukelsheim Optimal DoE

180 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

7.20. EQUIVALENCE THEOREM FOR MATRIX MEANS

Theorem. Consider a matrix mean <f>p with parameter p finite, p e(-00; 1]. Let M be a moment matrix that is feasible for K'0, with informationmatrix C = CK(M). Then M is ^-optimal for K'O in M(H) if and only ifthere exists a generalized inverse G of M that satisfies the normality inequality

In case of optimality, equality obtains in the normality inequality if for x weinsert any support point */ of any design £ € H that is <£p-optimal for K'Oin H.

Specifically, let M £ A/(E) be a moment matrix that is positive definite.Then M is ^-optimal for 6 in M(H) if and only if

Proof. The results are special cases of Theorem 7.19.

For the smallest-eigenvalue criterion <£-oo, Lemma 6.16 provides many so-lutions of the polarity equation, unless the smallest eigenvalue of C has mul-tiplicity 1. Some care is needed to appropriately accommodate this situation.

7.21. GENERAL EQUIVALENCE THEOREM FOR E-OPTIMALITY

Theorem. Let M e M be a competing moment matrix that is feasi-ble for K'O, with information matrix C — CK(M). Then M is <£_oo-optimalfor K'O in M if and only if there exist a nonnegative definite s x s matrix £with trace equal to 1 and a generalized inverse G of M that satisfy the nor-mality inequality

In case of optimality, equality obtains in the normality inequality if for A weinsert M, or any other matrix M e M that is 0_oo-optimal for K'O in M\and any matrix E appearing in the normality inequality and any momentmatrix M 6 M that is $-00-optimal for K'O in M, with information matrixC = CK(M\ fulfill

where the set S consists of all rank 1 matrices zz' such that z 6 Rs is a norm1 eigenvector of C corresponding to its smallest eigenvalue.

Page 214: Pukelsheim Optimal DoE

7.22. EQUIVALENCE THEOREM FOR E-OPTIMALITY 181

Specifically, let M e M be a competing moment matrix that is positivedefinite. Then M is ^-oo-optimal for 0 in M if and only if there exists anonnegative definite k x k matrix E with trace equal to 1 such that

Proof. We write A = Amjn(C), for short. For the direct part, we solve thepolarity equation in the General Equivalence Theorem 7.14. Lemma 6.16states that the solutions are D = E/\ with E e conv 5, where S comprisesthe rank 1 matrices zz' formed from the norm 1 eigenvectors z of C thatcorrespond to A. Hence if M is ^-oo-optimal, then the normality inequalityof Theorem 7.14 implies the present one.

For the converse, the fact that trace E — 1 entails that every spectraldecomposition E = Y^j<raizizj represents £ as a convex combination ofthe rank 1 matrices z\z{, • • . ,zrz'r. The eigenvectors Zj that come with E areeigenvectors of C corresponding to its smallest eigenvalue, A. To see this wenotice that the nonnegative definite sxs matrix F = C-A/5 and the matrix Esatisfy

The first inequality exploits the monotonicity behavior discussed in Sec-tion 1.11. The middle equality expands C into CK'G'MGKC. The last in-equality uses the special case trace MGKCECK'G' < A of the normalityinequality. Thus we have trace FE = 0. Section 6.15 now yields FE = 0, thatis, CE = A£. Postmultiplication by Zj gives Czj = Az;, for all ; = 1,... ,r,giving E e conv S. It follows that the matrix D = E/\ solves the polarityequation and satisfies the normality inequality of the General EquivalenceTheorem 7.14, thus establishing (/^oo-optimality of M.

Now let M e M be another moment matrix that is </>_oo -optimal forK'd in M, with information matrix C. Then M and M share the sameoptimal value, Amin(C) = A. Since the dual problem has optimal solutionN = GKCECK'G'/\, condition (1) of Theorem 7.11jields trace MN = 1.Hence the normality inequality is an equality for A — M. Moreover, with con-dition (2) of Theorem 7.11, we may continue 1 = trace MN — trace CE/\.Therefore the nonnegative definite sxs matrix F — C - \IS is orthogonalto E. Again we get FE — 0, thus establishing property (1). Postmultiplicationof (1) by Zj shows that z, is a eigenvector of C, whence follows (2)

7.22. EQUIVALENCE THEOREM FOR E-OPTIMALITY

Theorem. Let M be a moment matrix that is feasible for K'6, with in-formation matrix C = CK(M). Then M is 4>_oo-optimal for K'6 in M(H) if

Page 215: Pukelsheim Optimal DoE

182 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

and only if there exist a nonnegative definite s xs matrix E with trace equalto 1 and a generalized inverse G of M that satisfy the normality inequality

In case of optimality, equality obtains in the normality inequality if for x weinsert any support point *,• of any design £ € H that is <£_oo-optimal for K'6in H; and any matrix E appearing in the normality inequality and any momentmatrix M G A/(H) that is ^-oo-optimal for K'B in A/(E), with informationmatrix C = CK(M), fulfill conditions (1) and (2) of Theorem 7.21.

Specifically, let M e M(H) be a moment matrix that is positive definite.Then M is <£_oo-optimal for 0 in A/(H) if and only if there exists a nonnegativedefinite k x k matrix E with trace equal to 1 such that

Proof. The theorem is an immediate consequence of Theorem 7.21.

If M is ^-oo-optimal for K '6 in M and the smallest eigenvalue A of CK(M)has multiplicity 1, then the matrix E in the two preceding theorems is uniquelygiven by zz', where z e R* is a norm 1 eigenvector of CK(M) correspondingto A. If the smallest eigenvalue has multiplicity greater than 1 then little ornothing can be said of which matrices E > 0 with trace E = 1 satisfy thenormality inequality. It may happen that all rank 1 matrices E = zz' thatoriginate with eigenvectors z fail, and that any such matrix E must be positivedefinite.

As an illustration, we elaborate on the example of Section 2.18, with re-gression range X = {x e R2 : \\x\\ < 1}. The design £(J) = 1/2 = £(J) withmoment matrix M = I2/2 is <£_oo-optimal for 6 in H. Indeed, the positivedefinite matrix E = I2/2 satisfies x Ex = ||*||2/2 < Amin(Af), for all x e X.On the other hand, the norm 1 eigenvectors of M corresponding to 1/2 arethe vectors z e R2 with ||z|| = 1. For x — z, we get x'zz'x = 1 > Amjn(M).Hence no rank 1 matrix E — zz' fulfills the normality inequality.

The same example illustrates that 0_oo-optimality may obtain without anyscalar optimality property. For every coefficient vector c ^ 0, the ElfvingTheorem 2.14 shows that the unique optimal design for c'B is the one-pointdesign in c/\\c\\ e X, with optimal variance (p(c))2 = ||c||2, where p is theElfving norm of Section 2.12. Indeed, the variance incurred by the $_oo-optimal moment matrix I2/2 is twice as large, c'(I2/2)~lc = 2||c||2.

Nevertheless, there are some intriguing interrelations with scalar optimal-ity which may help to isolate a ^-oo-optimal design. Scalar optimality alwayscomes into play if the smallest eigenvalue of the </>_oo-optimal informationmatrix has multiplicity 1.

Page 216: Pukelsheim Optimal DoE

7.24. E-OPTIMALITY, SCALAR OPTIMALITY, AND ELFVING NORM 183

7.23. E-OPTIMALITY, SCALAR OPTIMALITY, AND EIGENVALUESIMPLICITY

Theorem. Let M € M. be a competing moment matrix that is feasiblefor K'O, and let ±z e Rs be an eigenvector corresponding to the smallesteigenvalue of the information matrix Cx(M). Then M is <^_oo-optimal forK'B in M and the matrix E = zz'/\\z\\2 satisfies the normality inequality ofTheorem 7.21 if and only if M is optimal for z'K'O in M.

If the smallest eigenvalue of C#(M) has multiplicity 1, then M is </>-oo-optimal for K'B in M. if and only if M is optimal for z'K'O in M..

Proof. We show that the normality inequality of Theorem 7.21 for <£_oo-optimality coincides with that of Theorem 4.13 for scalar optimality. WithE = zz'/\\z\\2, the normality inequality in Theorem 7.21 reads

The normality inequality of Theorem 4.13 is

With c = Kz the two left hand sides are the same. So are the right handsides, because of c'M~c = z'K'M~Kz = z'C~lz = \\z\\2/\mia(CK(M)).

If the smallest eigenvalue of CK(M} has multiplicity 1 then the only choicefor the matrix E in Theorem 7.21 is E = zz'/\\z\\2.

Under the assumption of the theorem, the right hand sides of the two nor-mality inequalities in the proof are the same, ||z||2/Amin(C/f(M)) =z'K'M~Kz. The following theorem treats this equality in greater detail. Itbecomes particularly pleasing when the optimal solution is sought in the setof all moment matrices, Af (E). Then the ElfVing Theorem 2.14 representsthe optimal variance for z'K'O as (p(Kz))2, where p is the Elfving norm ofSection 2.12.

7.24. E-OPTIMALITY, SCALAR OPTIMALITY, AND ELFVINGNORM

Theorem. Every moment matrix M e M(H) and every vector 0 ̂ z € R5

fulfill

Page 217: Pukelsheim Optimal DoE

184 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

If equality holds, then M is </>_oo-optimal for K'O in M(E) and any momentmatrix M e M(E) that is </>_oo-optimal for K'O in M(E) is also optimalfor z'K'e inM(H).

Proof. Let M 6 M(H) be a moment matrix that is feasible for K'6. Thenwe get 0 7^ ATz € range /C C range M C £(#), where C(X} is the regressionrange introduced in Section 2.12. It follows that Kz has a positive ElfVingnorm, p(Kz) > 0.

If M is not feasible for K'O, then the information matrix C = CK(M) issingular, by Theorem 3.15, and \min(CK(M)) = 0 < (\\z\\/p(Kz))2. Other-wise maximization of the smallest eigenvalue of C > 0 is the same as min-imization of the largest eigenvalue of C'1 = K'M'K. Using AmaxCC"1) =maxjiju^i z'C~lz, we have for all 0 / z € IR5, the trivial inequalities

If equality holds for M and z, then attaining the upper bound (\\z\\/p(Kz))2, the moment matrix M is 4>-oo-optimal for K'O in A/(S). For every(^-oo-optimal moment matrix M for K'O in A/(E), we get

Now (p(tfz))2 = z 'K'M'Kz shows that M is optimal for z 'K'O in M(E).

For ||z || = 1, the inverse optimal variance l/(p(Kz))2 is the maximum in-formation for z'K'O in M(S). Therefore the theorem establishes the mutualboundedness of the problems of maximizing the smallest eigenvalue Amin(M)as M varies over M(S), and of minimizing the largest information for z'K'Oas z varies over the unit sphere in Rs. The two problems are dual to eachother, but other than in Theorem 7.12 here duality gaps do occur, as is evi-denced by the example at the end of Section 7.22.

Duality does hold in the parabola fit model of Section 2.21. Indeed, thevector c = (—1,0,2)' fulfills ||c||/p(c) = 1/%/S. The optimal moment matrix Mfor c'B has smallest eigenvalue 1/5. The present theorem proves that M is$-00-optimal for 0 in A/(S). This will guide us to determine the </>_oo-optimaldesign for polynomial fit models of arbitrary degree d > 1, in Section 9.13.

Page 218: Pukelsheim Optimal DoE

EXERCISES 185

In the present chapter, our concern has been to establish a set of neces-sary and sufficient conditions for a moment matrix to be optimal. We haveconcentrated on moment matrices M rather than on designs £. In the nextchapter we explore what consequences these results imply for the optimalityof a design £, in terms of the support points and weights of £.

EXERCISES

7.1 Let T be a linear mapping from a scalar product space C into a scalarproduct space /C. The transposed operator T' is defined to be theunique linear mapping from /C into C that satisfies (Tx,y) = (x,T'y)for all jt e £, y e /C. Find the transposed operator of (i) T(x) = Ax :Uk -> R", where A e R"x*, (ii) T(A) = LAL' : Sym(k) -» Sym(s),where L e Rs*k.

7.2 For p G (-00; 1], prove that <f>p is differentiable in C > 0 by showingthat the subdifferential is a singleton,

7.3 Prove that </>_oo is differentiable in C > 0 if and only if the smallesteigenvalue of C has multiplicity 1, by discussing whether d<f>-oo(C) =conv 5 is a singleton or not.

7.4 For p e (0; 1), show that (f>p is not subdifferentiable at singular matrices

7.5 True or false: if <£(C) > 0, then 6<f>(Q ^ 0?

7.6 Let </> be an information function, and let C > 0 be such that <f> (C) = 0.Show that D is a subgradient of <f> at C if and only if D is orthogonalto C and 0°°(D) > 1.

7.7 For the dual problem of Section 7.11, show that if N € M is an optimalsolution, then so is NK(K'NK)-K'N.

7.8 Show that a moment matrix M e M n A(K) is </>-optimal for K'Oin .M if and only if there exists a solution D e NND(s) of the polarityequation and there exists a left inverse L e RixAr of K that is minimizingfor M such that N = L 'DL satisfies the normality inequality.

Page 219: Pukelsheim Optimal DoE

186 CHAPTER 7: THE GENERAL EQUIVALENCE THEOREM

7.9 Demonstrate by example that in the Equivalence Theorem 7.16, thegeneralized inverse G cannot generally be taken to be the Moore-Penrose inverse M+ [Pukelsheim (1981), p. 17].

7.10 Demonstrate by example that in the Equivalence Theorem 7.16, thedomain of the normality inequality cannot generally be reduced fromx € X to x € X n range M [Pukelsheim (1981), p. 17].

7.11 Use the General Equivalence Theorem to deduce Theorem 4.13.

7.12 Show that any matrix E e NND(s) with trace E = 1 satisfies (i) E < Is,and (ii) c'Ec = c'c if and only if E - cc'/\\c\\2, where 0 ̂ c e R5.

7.13 For the two-point regression range X = < (]), (Jj i, show that 72 is

the unique 4>-<x-optimal matrix for 0 in Af(E). Which matrices E of

O. O- !(»?)>!(!!). i( -r.1) satisfyTheorem 7-22?

7.14 Show that ^(^oo) < r2, where f(^_oo) is the <£_oo-optimal value for 6in E and r is the Euclidean in-ball radius of the Elfving set Tl =conv(;ru (-#)).

7.15 (continued) Demonstrate by example that f(</>_oo) < r2 is possible.

Page 220: Pukelsheim Optimal DoE

C H A P T E R 8

Optimal Moment Matricesand Optimal Designs

The interrelation between optimal moment matrices and optimal designs isstudied. Necessary conditions on optimal support points are derived in termsof their number, their location, and their weights. The optimal weights onlinearly independent regression vectors are given as a fixed point of a nonlinearequation. Multiplicity of optimal moment matrices is characterized by a linearrelationship provided the criterion is strictly concave.

8.1. FROM MOMENT MATRICES TO DESIGNS

The General Equivalence Theorem 7.14 concentrates on moment matrices.This has the advantage of placing the optimization problem in a linear spaceof finite dimensions. However, the statistical interest is in the designs them-selves. Occasionally we experience a seamless transition between optimalmoment matrices and optimal designs, such as in the Elfving Theorem 2.14on scalar optimality. However, the general rule is that the passage from mo-ment matrices to designs is difficult, and limited in its extent. All we canaim at are necessary conditions that aid in identifying the support pointsand the weights of optimal designs. This is the major theme of the presentchapter.

First we establish an upper bound on the number of support points. Thebound applies to all designs, not just to optimal ones. The theorem statesthat every k x k moment matrix A e M (H) is achieved by a design 17 whichhas at most \k(fc + 1) + 1 support points, and that for a feasible momentmatrix A, the 5x5 information matrix CK(A) (possibly scaled by some 8 > 1)is realized by a design £ with a support size that obeys the tighter bound^s(s + 1) + s(rank A - s).

187

Page 221: Pukelsheim Optimal DoE

188 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

8.2. BOUND FOR THE SUPPORT SIZE OF FEASIBLE DESIGNS

Theorem. Assume that the k x s coefficient matrix K of the parametersystem K'O is of full column rank s. Then for every moment matrix A eM(H), there exists a design 17 e H such that

For every design 17 € H that is feasible for K'O, there exists a desigsuch that

Proof. From Lemma 1.26, the set of all moment matrices M(H) admitsa representation as a convex hull,

The set M(H) is a convex and compact subset of the linear space Sym(&) ofsymmetric matrices. The latter has dimension \k(k + 1), whence (1) and (2)are immediate consequences of the Caratheodory theorem.

For the tighter bound (5), we replace the linear space Sym(A:) by a hyper-plane in the range of an appropriate linear transformation T on Sym(A;). Theconstruction of T is reminiscent of the full rank reduction in Lemma 7.3. Letthe moment matrix A/ of 17 have rank r. Feasibility entails s < r < k. Wechoose some k x r matrix U with U'U = lr such that UU' projects onto therange of M. Then the matrix U'M~K is well defined, as follows from thepreamble in the proof of Theorem 4.6.

We claim that U'M'K has rank s. To this end we write UU' = MGwhere G is a generalized inverse of M (see Section 1.16). From MG —(UU1)1 = G'M and M2G2M2 = MG'MGM2 = M2, we infer G2 e (M2)~.Now we get K'G'UU'GK = K'G'MG2K = K'(M2)~K. By Lemma 1.17,the last matrix has rank s, and so has U'GK = U'M~K.

With a left inverse L of U'M~K, we define the projector R — lr —U'M'KL. Now we introduce the linear transformation T through

In view of T(A) = T(UU'AUU'}, the range of T is spanned by T(UBU') =B - R 'BR with B e Sym(r). For counting dimensions, we may choose R to

Page 222: Pukelsheim Optimal DoE

8.2. BOUND FOR THE SUPPORT SIZE OF FEASIBLE DESIGNS 189

be in its simplest form

This induces the block partitioning

The symmetric sxs matrix B\\ contributes ^s(s+\} dimensions, while anothers(r - s) dimensions are added by the rectangular s x (r - s) matrix 512 = B2'rTherefore the dimension of the range of T is equal to |s(.s + l) + ,s(r-s).

In the range of T, we consider the convex set M generated by the finitelymany matrices T(xx') which arise from the support points x of 17,

Since M. contains T(M}, the scale factor e = inf{5 > 0 : T(M) € 8M}satisfies e < 1. If e = 0 then T(M) = 0, and RU'M'K = 0 yields thecontradiction

Hence e > 0, and we may define 8 = l/e > 1. Our construction ensures that5T(M) lies in the boundary of M.. Let T~t be the hyperplane in the range of Tthat supports M in 8T(M). Thus we obtain 8T(M) € H n M = conv(H n5(17)). By the Caratheodory theorem, 8T(M) is a convex combination of atmost

members of H n 5(rj). Hence there exists a design £ such that T(M(£)) =dT(M), the support points of £ are support points of 17, and the support sizeof £ is bounded by (5).

It remains to establish formula (3). We have rangeM(£) C range M. Thisentails UU'M(£)UU' = M(£). Because of RU'M'K = 0, we obtain

Evidently the range of K is included in the range of M(£) whence M(£) isfeasible for K'B. Premultiplication by K'M(g)~ and inversion finally yield

In the case of optimality, we obtain the following corollary.

Page 223: Pukelsheim Optimal DoE

190 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

8.3. BOUND FOR THE SUPPORT SIZE OF OPTIMAL DESIGNS

Corollary- Let $ be an information function. If there exists a </> -optimalmoment matrix M for K'O in M(H), then there exists a </>-optimal design £for K'B in H such that its support size is bounded according to

Proof. Every design £ with a feasible moment matrix for K'O needs atleast s support points. This is so because the summation in range M (£) —ZLcesupp f (range **') must extend over at least s terms in order to includethe range of K, by Lemma 2.3.

For a design 17 that has M for its moment matrix, we choose an improveddesign £ as in Theorem 8.2, with support size bounded from above by (5) ofTheorem 8.2. Positive homogeneity and nonnegativity of <f> yield

Optimality of M forces the design £ to be optimal as well.

If there are many optimal designs then all obey the lower bound, but somemay violate the upper bound. The corollary only ascertains that at least oneoptimal design respects both bounds simultaneously.

For scalar optimality, we have 5 = 1. Thus if the moment matrix M isoptimal for c'd in A/(H), then there exists an optimal design £ for c'O in Hsuch that

The upper bound k also emerges when the Caratheodory theorem is appliedin the context of the Elfving Theorem 2.14. For the full parameter vector 6,we have s = k, and the bounds become

In Section 8.6, we provide examples of optimal designs that require \k(k + \}support points. On the other hand, polynomial fit models illustrate attainmentof the lower bound k in the very strict sense that otherwise a design becomesinadmissible (see Theorem 10.7).

Before studying the location of the support points we single out a matrixlemma.

8.4. MATRIX CONVEXITY OF OUTER PRODUCTS

Lemma. Let Y and Z be two k x s matrices. Then we have, relative tothe Loewner ordering,

Page 224: Pukelsheim Optimal DoE

8.5. LOCATION OF THE SUPPORT POINTS OF ARBITRARY DESIGNS 191

for all a € (0; 1), with equality if and only if Y — Z.

Proof. With X — (1 - a)Y + «Z, the assertion follows from

The following theorem states that in order to find optimal support points,we need to search the "extreme points" X of the regression range X only.Taken literally, this does not make sense. The notion of an extreme pointapplies to convex sets and X generally fails to be convex. Therefore we firstpass to the Elfving set 72. = conv(#u (-<¥)) of Section 2.9. Since 72. is convex,it has points which are extreme, that is, which do not lie on a straight lineconnecting any other two distinct points of 72.. Convex analysis tells us thatevery extreme point of 72, is a member of the generating set X U (-X). Wedefine those extreme points of 72. which lie in X to form the subset X.

8.5. LOCATION OF THE SUPPORT POINTS OF ARBITRARYDESIGNS

Theorem. Let X be the set of those regression vectors in X that areextreme points of the Elfving set 72. = conv(^ U ( — X ) ] . Then for everydesign rj 6 H with support not included in X, there exists a design £ e Hwith support included in X such that

Proof. Let x\,...,xt e X be the support points of 17. Being closedand convex, the set 72. is the convex hull of its extreme points. Hence forevery i = 1,. . . ,^, there exist extreme points V n , . . . , V j r t . of 72. such thatJt, = £]/<M( otijyij, with min;<M( «,; > 0 and £)y-<fl. a,; = 1. Since at least onesupport point jc, of TJ is assumed not to Be an extreme point of 7£,Lemma 8.4 yields *,-*/ ^ £,<„, a^iyy/y Thus we get M(TJ) =

Page 225: Pukelsheim Optimal DoE

192 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

where the design £ is denned to allocateweight 17 (*/)«,; to the point y<;, for / = 1,... ,^ and ;' = 1,... ,n,. The pointsyij are extreme in 72, and hence contained in X or in —X. If y/; e —X is ex-treme in 72. then, because of point symmetry of 71, -y/y- e ̂ is also extremein 71. From the fact that (—yi/)(—y/;)' — -ft;)'/;' we may replace Vij £ —X \yy—yij € X. This proves the existence of a design £ with supp £ C #, and withmoment matrix Af (£) improving upon M(TJ).

The preceding result relates to the interior representation of convex setsby sweeping mass allocated in the interior of the ElfVing set into the ex-treme points out on the boundary. This improvement applies to arbitrarydesigns 77 e 5. For designs £ that are optimal in H the support points */ mustattain equality in the normality inequality of the Equivalence Theorem 7.16,x-Nxi = 1. This property relates to the exterior representation of convex sets,in that the matrix N induces a cylinder which includes the Elfving set 71. Al-together the selection of support points of designs optimal in H is narroweddown in two different ways, to search the extreme points x of 72. that lie inX, or to solve the quadratic equation x'Nx = 1 that arises with an optimalsolution N of the dual problem.

Polynomial fit models profit from the second tool, as detailed in Sec-tion 9.5. The following example has a regression range X so simple thatplain geometry suggests concentrating on the extreme points X.

8.6. OPTIMAL DESIGNS FOR A LINEAR FIT OVER THE UNITSQUARE

Over the unit square as the regression range, X = [0;1]2, we consider atwo-way first-degree model without constant term,

also called a multiple linear regression model. We claim the following.

Claim. For p € [-00; 1), the unique design £p € H that is ^-optimal for6 in H, assigns weights w(p) to (J) and \(l - w(p)) to (J) and (J), where

Proof. The proof of our claim is arranged in three steps, for p > -oo.The case p = — oo is treated separately, in a fourth step.

Page 226: Pukelsheim Optimal DoE

8.6. OPTIMAL DESIGNS FOR A LINEAR FIT OVER THE UNIT SQUARE 193

I. The regression vectors x € X that are extreme points of 7£ visibly are

For p € (—oo; 1), the matrix mean (f>p is strictly isotonic on PD(2), by Theo-rem 6.13. Therefore Theorem 8.5 forces the ^-optimal design £p to have itssupport included in X, whence its moment matrix takes the form

II. By interchanging the weights W2 and H>3 we create the design £p, withmoment matrix

If £p and £p are distinct, w>2 / n>3, then their mixture r\ = \(£p + £p) improvesupon £p. Namely, the matrix mean <f>p is strictly concave on PD(s), by The-orem 6.13, and the matrices M(£p) and M(£p} share the same eigenvalues.This entails

contradicting the optimality of £p. Thus gp and £p are equal, and we haveW2 = H>3 = i(l — Wi). Upon setting w(p) = w\, the moment matrix of gp

becomes

III. It now suffices to maximize cf>p over the one-parameter family

This maximization is carried out by straightforward calculus. The matrix Mw

has eigenvalues \(\ + 3w) and |(1 - H>). Hence the objective function fp =<f>p o M is

Page 227: Pukelsheim Optimal DoE

194 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

EXHIBIT 8.1 Support points for a linear fit over the unit square. For p e (-00; 1), thesupport of the (ftp-optimal design £p for d comprises the three extreme points (Q)>(?))(}) in72. n X. The 0-oo-optimal design £-00 is supported by the two points (Q), ("), and has optimal

information one for c'6 with c = \(\}-

The unique critical point of fp is w(p) as given above. It lies in the interiorof the interval [0; 1], and hence cannot but maximize the concave function f

IV. For ^-oo-optimality, the limiting value lim/,_>_00 w(p) = 0 suggests thecandidate matrix M0 = \h- Indeed, the smallest eigenvalue ^(1 - w) of Mw isclearly maximized at w = 0. Thus our claim is proved. See also Exhibit 8.1.

The argument carries over to the linear fit model over the A>dimensionalcube [0; \}k (see Section 14.10). The example is instructive from various per-spectives.

The smallest-eigenvalue optimal design £_oo makes do with a minimumnumber of support points (two). Moreover, there is an intriguing relation-ship with scalar optimality. The eigenvector corresponding to the smallesteigenvalue of the matrices Mw is c = (_?j). It is not hard to see that £-<» isuniquely optimal for c'O in H. This is akin to Theorem 7.23.

For p > — oo, the designs gp require all three points in the set X for theirsupport, and hence attain the upper bound |/c(A:+l) = 3 of Theorem 8.3. The

Page 228: Pukelsheim Optimal DoE

8.7. OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS 195

average-variance optimal design £_j has weight vv(-l) = (\/3-l)/(A/3+3) =0.1547 which, very roughly, picks up half the variation between w(-oo) = 0and H-(O) = 1/3. The determinant optimal design & distributes the weight 1/3uniformly over the three points in X, even though the support size exceedsthe minimum two. (In Corollary 8.12 we see that a determinant optimaldesign always has uniform weights l//c if the support size is a minimum, k.)The design with u>(l) = 1 is formally (fo-optimal for 6 in H. This is a one-point design in (|), and fails to be feasible for 0. Theorem 9.15 proves thisto be true in greater generality.

This concludes our discussion of the number and the location of the sup-port points of optimal designs. Next we turn to the computation of the op-timal weights. Theorem 8.7 and its corollaries apply to the situation wherethere are not very many support points: x\,..., xt are assumed to be linearlyindependent.

8.7. OPTIMAL WEIGHTS ON LINEARLY INDEPENDENTREGRESSION VECTORS

Theorem. Assume that the i regression vectors Xi,...,xi € X C Rk

are linearly independent, and form the rows of the matrix X £ Uexk, thatis, X' = (*i,.. .,*£). Let H be the set of designs for which the support isincluded in {*],... ,xe}. Let £ 6 H be a design that is feasible for K'O, withinformation matrix C = CK(M(g)}.

Then the design £ is $-optimal for K'6 in H if and only if there ex-ists a nonnegative definite s x s matrix D solving the polarity equation^(C)^°°(D) = trace CD = 1 such that the weights w,- = £(*,-) satisfy

where a\\,... ̂ a^ are the diagonal elements of the nonnegative definite i x tmatrix A = UCDCU' with U = (XX')'1XK.

Proof. If Xi is not a support point of £, then we quite generally get a,-,- = 0.To see this, we may assume that the initial r weights are positive while theothers vanish, w i , . . . , wr > 0 = wr+i = ••• — wf. Then x\,...,xr span therange of Af, by Lemma 2.3. Because of feasibility, there exists some r x smatrix H such that

By assumption, X has full row rank £, whence

(H',G)ei = 0 and an = 0, by the definitions of A and U.For i>r, we get

Page 229: Pukelsheim Optimal DoE

196 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

Let Aw be the diagonal matrix with weight vector w = (wj , . . . , we)' e Ue

on the diagonal. The moment matrix M of £ may then be represented asM =X'bwX. Hence

is a specific generalized inverse of M, where A+ is the diagonal matrix withdiagonal entries 1/vv, or 0 according as tv, > 0 or w, = 0. With the Euclideanunit vector et of Uk we have x, — X'ei. Thus we obtain

which equals a/y/w? or 0 according as w, > 0 or H>, = 0.For the direct part of the proof, we assume £ to be <f> -optimal in E. The

Equivalence Theorem 7.16 provides a solution D > 0 of the polarity equationand some generalized inverse G e M~ such that

If xi is a support point of £, then (2) holds with equality, while the left handside is au/wj. This yields w, = ^/a^. If *, is not a support point of £, thenWii = 0 = an, as shown in the beginning of the proof.

For the converse, (2) is satisfied with G — GQ since the left hand side onlytakes the values 1 or 0. By Theorem 7.16, the design £ is optimal

For the matrix means </>p with parameter p 6 (-00; 1], the polarity equa-tion has solution D = Cp~1/^ trace Cp, by Lemma 6.16. Hence the matrixA = UCDCU' is proportional to B = UCf>+lU'. With diagonal elementsb\\,...,bw of B, the optimal weights satisfy

Fromthe optimal value becomes

Application to average-variance optimality is most satisfactory. Namely, ifp = -1, then B = UU' only involves the given regression vectors, and noweights.

we obtain Hence

Page 230: Pukelsheim Optimal DoE

8.9. C-OPTIMAL WEIGHTS ON LINEARLY INDEPENDENT REGRESSION VECTORS 197

8.8. A-OPTEVfAL WEIGHTS ON LINEARLY INDEPENDENTREGRESSION VECTORS

Corollary. A design £ is <£_i -optimal for K' 9 in H if and only if theweights w, = £(*,) satisfy

where 6 n , . . . , f t^ are the diagonal elements of the matrix B = UU' withU = (XX'Y^XK. The optimal value is

Proof. The corollary is the special case p = —I from the preceding dis-cussion.

If X is square and nonsingular, t = k, then we obtain U = X' lK.li thefull parameter B is of interest, then K = Ik and B = (XX1)'1. The average-variance optimal weights for the arcsin support designs in Section 9.10 arecomputed in this way.

If a scalar parameter system c'O is of interest, s = I, then (XX1 )~lXc isa vector, and matters simplify further. We can even add a statement guaran-teeing the existence of linearly independent regression vectors that supportan optimal design.

8.9. C-OPTIMAL WEIGHTS ON LINEARLY INDEPENDENTREGRESSION VECTORS

Corollary. There exist linearly independent regression vectors x\,..., Xiin X that support an optimal design £ for c'O in H. The weights w, = £(*,)satisfy

where u\,...,ut are the components of the vector u — (XX')~lXc. Theoptimal variance is

Proof. An optimal design 17 for c'd in H exists, according to the ElfvingTheorem 2.14, or the Existence Theorem 7.13. From Section 8.3, we knowthat k support points suffice, but there nothing is said about linear indepen-dence.

Page 231: Pukelsheim Optimal DoE

198 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

If the optimal design TJ has linearly dependent support points x\,... , x f ,then the support size can be reduced while still supporting an optimal de-sign £, as follows. The ElfVing Theorem 2.14 provides the representation

where yt = £(*/)*/. Because of linear dependence there are scalars / A I , . . . , /A£,not all zerOj such that 0 = ^2i<e /A,V,. We set /AO = T^,i<t M/- If Po is negative,then we replace the scalars /A, by — /A,-. Hence t*Q>0 and at least one of thescalars /A, is positive.

Now we investigate the "likelihood ratios" i7(jc/)//A/ and introduce

say. Because of we may rewrite (1) as

with Wf = rj(xi) — a/A/. The numbers w, are nonnegative, for if /A, < 0, thenthis is obvious, and if tt, > 0, then ^(jc/)//^, > a by definition of a. Weapply the ElfVing norm p to (2) and use the triangular inequality to obtain

By the ElfVing Theorem 2.14, the design £ with weights £(*,) = w/ isoptimal for c'Q. It has a smaller support size than 77, because o

We may continue this reduction process until it leavesus with support points that are linearly independent. Finally Corollary 8.8

applies, with

As an example we consider the line fit model, with experimental domainT = [-1; 1). We claim the following.

Claim. The design £ that assigns to the points jci = (\), x2 = (j), theweights

is optimal for c'6 in H, with optimal variance

Proof. The set of regression vectors that are extreme points of the ElfVingset is X = {*i,Jt2}. By Theorem 8.5, there exists an optimal design £ in H

Page 232: Pukelsheim Optimal DoE

8.11. OPTIMAL WEIGHTS ON GIVEN SUPPORT POINTS 199

with support included in X. The present corollary then proves our claim, byway of

Without the assumption of linear independence, we are left with a moreabstract result. First we single out a lemma on the Hadamard product, thatis, entrywise multiplication, of vectors and matrices.

8.10. NONNEGATIVE DEFINITENESS OF HADAMARDPRODUCTS

Lemma. Let A and B be nonnegative definite i x i matrices. Then theirHadamard product A * B = ((«,,£>,,)) is nonnegative definite.

Proof. Being nonnegative definite, B admits a square root representationB = UU' with U = (MI, ... ,ue) € R /x/. For jc e R£, we then get

8.11. OPTIMAL WEIGHTS ON GIVEN SUPPORT POINTS

Theorem. Assume that the design £ € E is </>-optimal for K'O in H. LetE e Rsxs be a square root of the information matrix, CK(M(g)) — C — EE'.Let D € NND(s') and G € M(£)~ satisfy the polarity equation and thenormality inequality of the Equivalence Theorem 7.16. Let the support points*!,...,*£ of £ form the rows of the matrix X e IR*x*,thatis,^' = ( jc i , . . . , jCf),and let the corresponding weights H>, = £(*,) form the vector w e IR£.

Then, with A = XGKE(E'DE}1I2E'K'G'X' e NND(^), the weight vec-tor w solves

and the individual weights w, satisfy H>, < I/a? < Amax(CD), for all / = 1,..., i.

Page 233: Pukelsheim Optimal DoE

200 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

Proof. The proof builds on expanding the quadratic formx/GKCDCK'G'xf from Theorem 7.16 until the matrix M(£) appears in themiddle. We set N = GKCDCK'G', and v/ = E'K'G'x; for i < L Theo-rem 7.16 states that

We expand E'DE into (E'DE)ll2ls(E'DE)1/2. Since C is nonsingular, sois E. From E'-lE~l = C~l = K'M(^)~K = K'G'M(£)GK, we get 7S =E'K'G'M(C)GKE. Inserting M(£) = £•<< *>;•*;*/> we continue in (1),

This proves (A*A)w = l f . From the lower bound a?M>, for (2), we obtairwi < 1/fl2/. On the other hand, we may bound (1) from above by

Since E'DE and EE'D = CD share the same eigenvalues, this yields

By Lemma 8.10, the coefficient matrix A *A of the equation (A*A)w = le

is nonnegative definite. The equation is nonlinear in w since A dependson w, through G, E, and D. Generally, the fixed points w of the equation(A * A)w = lt appear hard to come by.

For the matrix means <f>p with parameter p e (-00; 1], the solution ofthe polarity equation is D = Cp~l/lrace Cp, by Lemma 6.16. With E =C1/2 we find that A is proportional to XGKCP/^K'G'X1. For p = -2 thepower p/2 + 1 vanishes, but the optimal moment matrix still enters througha generalized inverse G. For the full parameter vector 6, we get G = M"1

and A oc XMpl2'lX'. For a scalar system c'O we obtain ,4 oc XGcc'G'X.The polarity equation of the General Equivalence Theorem 7.14 entails

Amax(CZ)) < trace CD — 1. Hence the upper bound Amax(CD) on the weightsH>, is not unreasonable. For the matrix means <f>p with p e (—oo;l], we getAmax(CD) = Amax(Cp)/trace Cp. While in general this bound depends on

Page 234: Pukelsheim Optimal DoE

8.13. MULTIPLICITY OF OPTIMAL MOMENT MATRICES 201

the optimal information matrix C, an exception emerges for determinantoptimality.

8.12. BOUND FOR DETERMINANT OPTIMAL WEIGHTS

Corollary. Every $o-optimal design £ for K'd in 5 has all its weightsbounded from above by 1/5. Moreover, every (^-optimal design £ for 6 inH that is supported by k regression vectors has uniform weights l/k.

Proof. With D = C~l/s, we get wt < Amax(C£>) = Amax(/,)/5 = 1/5.Moreover, if the full parameter vector 8 is of interest, then w, < l/k. Ifthere are no more than k support points, then each weight must attain theupper bound l/k in order to sum to 1

Next we turn to multiplicity of optimal designs. We assume that the in-formation function 4> is strictly concave, forcing uniqueness of the optimalinformation matrix C. Optimal moment matrices M need not be unique, butare characterized as solutions of a linear matrix equation.

8.13. MULTIPLICITY OF OPTIMAL MOMENT MATRICES

Theorem. Assume that the information function </> is strictly concaveon PD(5). Let the moment matrix M € At be </>-optimal for K'd in M,with generalized inverse G of M that satisfies the normality inequality of theGeneral Equivalence Theorem 7.14. Then any other moment matrix M e Mis also </>-optimal for K'0 in M if and only if

Proof. For the direct part, we relate the given optimal solution M ofthe primal problem to the optimal solution N = GKCK(M}DCK(M}K'G'of the dual problem of Section 7.11. We obtain NK = GKCK(M)D andK'NK = D. Any other optimal moment matrix M, and the dual optimalsolution N that comes with M, jointly fulfill equation (2) of Theorem 7.11.Postmultiplication of this equation by K gives

At this point, we make use of the strict concavity of <f> on PD(5), twice.Firstly, Corollary 5.5 states that <£ is strictly isotonic on PD(5). This forces D

Page 235: Pukelsheim Optimal DoE

202 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

to be positive definite, for otherwise z 'Dz = 0 and z ^ 0 lead to the contra-diction

much as in the proof of Theorem 5.16. Therefore we may cancel D in (1). Sec-ondly, strict concavity entails uniqueness of the optimal information matrix.Cancelling CK(M) — CK(M) in (1), we now obtain the equation MGK = K.

For the converse part we note that M is feasible for K'9. Premultiply-ing MGK = K^by K'M~, we obtain K'M'K = K'M~MGK = K'GK =K'M~K. Thus MJias the same information matrix for K'O as has M. Since Mis optimal, so is M.

The theorem may be reformulated in terms of designs. Given a $-optimalmoment matrix M for K'6 in M and a generalized inverse G of M from theGeneral Equivalence Theorem 7.14, a design £ is 0-optimal for K'9 in H ifand only if M(£)GK = K.

For general reference, we single out what happens with the matrix means

8.14. MULTIPLICITY OF OPTIMAL MOMENT MATRICES UNDERMATRIX MEANS

Corollary. If p e (-00; 1), then given a moment matrix M e M that is<£p-optimal for K'O in M, any other moment matrix M e M is also <f>p-optimal for K'6 in M if and only if MGK - K, with G e M~ satisfying thenormality inequality of Theorem 7.19.

Given a moment matrix M e M that is </>_oo-optimal for K'6 in M,then for any other moment matrix M € M. to be also <£_oo-optimal for K'Qin Ai, it is sufficient that MGK = K, and it is necessary that MGKE = KE,with G e M~ and E satisfying the normality inequality of Theorem 7.21.

Proof. For p e (-oo;l) the matrix means <f>p are strictly concave onPD(s), by Theorem 6.13. Hence Theorem 8.13 applies. For p = —oo, wecontinue equation (1) in the proof of Theorem 8.13 using D = E/Amin(C),from Lemma 6.16, and CE = Amin(C)£ = C£, from (1) in Theorem 7.21.

Page 236: Pukelsheim Optimal DoE

8.16. MATRIX MEAN OPTIMALITY FOR COMPONENT SUBSETS 203

In the remainder of this chapter we treat more specialized topics: simul-taneous optimality relative to all matrix means, and matrix mean optimalityif the parameter system of interest K'd comprises the first s out of all kcomponents of 0, or if it is rank deficient.

If a moment matrix remains ^-optimal while the parameter p of thematrix means runs through an interval (p\,p2), then optimality extends tothe end points p\ and p2, by continuity. The extreme cases p\ ~ -oo andp2 ~ 0 are of particular interest since <£_oo and <fo are continuous extensionsof the matrix means <f)p with p € (-00;0) (see Section 6.7).

8.15. SIMULTANEOUS OPTIMALITY UNDER MATRIX MEANS

Lemma. If M is <f>p-optimal for K'6 in M, for all p e (-00;0), then Mis also <£_oo-optimal and (^-optimal for K'6 in M.

Proof. Fix a competing matrix A e M. For every p € (-00; 0) we have<!>P(CK(M}} > ({>P(CK(A)). By continuity the inequality extends top = -oo,0.Hence M is ^-optimal also for

For the first 5 out of the k components Q\,..., Bk, we partition any momentmatrix according to

with s x s block A/n, (k - s) x (k - s) block A/22, and s x k block M\i = A/2'rThe vectors x = (xi,...,xk)' e Uk such that jcs+1 = • • • = Jt* = 0 form asubspace which we denote by IRS x {0}. Theorem 7.20 and Theorem 7.22specialize as follows.

8.16. MATRIX MEAN OPTIMALITY FOR COMPONENT SUBSETS

Corollary. Let M e M (E) be a moment matrix such that its range in-cludes the leading s coordinate subspace, Rs x {0}, with information matrixc = MH — M^M^MII.

If p € (—oo; 1], then M is ^-optimal for (6\,..., 8s)f in M (S) if and only

if there exists some s x (k - s) matrix B such that

In case of optimality, B satisfies BA/22 = —^tu-

Page 237: Pukelsheim Optimal DoE

204 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

The matrix M is <£_oo-optimal for (#1,..., 6S)' in M(H) if and only if thereexist a nonnegative definite s x s matrix E with trace equal to 1 and somes x (k — s) matrix B such that

In case of optimality, B satisfies EBM-& = -EM\2-

Proof. We set K = (/s,0)' and choose a generalized inverse G of Maccording to Theorem 7.20. Partitioning the 5 x k matrix CK'G' = (A,B)into a left s xs block A and a right s x(k-s) block B, we postmultiply by Kto get /I - CK'G'K = 7S. This yields GKCDCK'G' = (IS,B)'D(IS,B).The normality inequality now follows with Z) = Cp~1/ trace C^ and Z) =£/Amin(C') from Lemma 6.16.

In the case of optimality, we have, upon setting N = (1S,B)'D(IS,B),

Because of condition (2) of Theorem 7.11, we may equate the bottom blocksto obtain DBM22 — -DM\2. If p > -oo, then D cancels; if p = -oo, thenAminCC") cancels.

In order to discuss parameter systems K'0 with a k x s coefficient matrixof less than full column rank, we briefly digress to discuss the basic propertiesof Moore-Penrose matrix inverses.

8.17. MOORE-PENROSE MATRIX INVERSION

Given a rectangular matrix A e Ukxs, its Moore-Penrose inverse A+ is de-fined to be the unique s x k matrix that solves the four Moore-Penroseequations

The Moore-Penrose inverse of A obeys the formulae

Therefore, in order to compute A+, it suffices to find the Moore-Penroseinverse of the nonnegative definite matrix AA', or of A 'A. If a nonneg-ative definite matrix C has r positive eigenvalues A l 5 . . . , A r , counted with

Page 238: Pukelsheim Optimal DoE

8.18. MATRIX MEAN OPTIMALITY FOR RANK DEFICIENT SUBSYSTEMS 205

their respective multiplicities, then any eigenvalue decomposition permits astraightforward transition back and forth between C and C+,

It is easily verified that the matrix A+ obtained by (1) and (2) solves theMoore-Penrose equations. One can also show that this solution is the onlyone.

8.18. MATRIX MEAN OPTIMALITY FOR RANK DEFICIENTSUBSYSTEMS

If K is rank deficient and a moment matrix M is feasible for K'O, M eA(K), then the dispersion matrix K'M~K is well defined, but has at leastone vanishing eigenvalue, by Lemma 1.17. Hence K'M~K is singular andregular inversion to obtain the information matrix C fails. Instead one maytake recourse to generalized information matrices, to a reparametrizationargument, or to Moore-Penrose inverses.

Indeed, the positive eigenvalues of K'M~K and (K'M~K}+ are inversesof each other, as are all eigenvalues of K'M~K and (K'M~K)~l if K hasfull column rank s. On the other hand, the matrix mean <j>p(C) depends on Conly through its eigenvalues. Therefore we introduce a rank deficient matrixmean <£p', by requiring it to be the vector mean of the positive eigenvalues\l,...,\r of C = (K'M~K)+:

The definition of the rank deficient matrix mean <£p is not consistent with howthe matrix mean </>p is defined in Section 6.7. If p e [-00; 0], then the ordinarymatrix mean 4>p vanishes identically for singular matrices C, whereas </>p' doesnot.

Nevertheless, with the rank deficient matrix means </>p', the equivalencetheorems remain valid as stated, except that in Theorem 7.19, negative powersof C = (K'M'KY must be interpreted as positive powers of K'M'K. Thisleads to the normality inequality

trace AGKC(K'M~K)l-pCK'G' < trace C(K'M-K)l-p for all A e M.

In Theorem 7.21, Amin(C) becomes the smallest among the positive eigenval-ues of C, entailing l/Amin(C) - \max(K'M~K).

An alternate approach leading to the same answers is based on repara-metrization, K'6 — UH'd, with some k x r matrix H which has full column

Page 239: Pukelsheim Optimal DoE

206 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

rank r and with some s x r matrix U which satisfies U'U = lr. Then we get

Respecting multiplicities, the positive eigenvalues of (K 'M K)+ are the sameas all eigenvalues of (H'M~H)~l. Therefore, by adopting (H'M~H)~l asthe information matrix for K'O, the resulting optimality criterion coincideswith the one that is built on Moore-Penrose inversion, <f>p((H'M~H)~l) =*p(\l,...,\,) = 4>;((K'M-Kr).

Furthermore, if the coefficient matrix K is a k x k orthogonal projec-tor, then Moore-Penrose inversion and the generalized information matricesof Section 3.21 lead to the same answer. Namely, for a feasible momentmatrix M its generalized information matrix for K'O is MK =K(K'M-K)-K'. With K = K2 = K', it is easy to verify that (K'M~K)+ =MK.

Such coefficient matrices arise with the system of centered contrasts in two-way classification models. There the coefficient matrix is the ax a centeringmatrix Ka. Equivalently, by filling up with zero matrices to obtain a k x kmatrix, k = a + b, the coefficient matrix may be taken to be the orthogonalk x k projector

8.19. MATRIX MEAN OPTIMALITY IN TWO-WAYCLASSIFICATION MODELS

For the centered contrasts of the factor A in a two-way classification model,Section 4.8 establishes Loewner optimality of the product designs rs' in theset T(r) of designs with given treatment marginals r. Relative to the full set Tof all a x b block designs we now claim the following.

Claim. The equireplicated product designs las' with arbitrary columnsum vector s are the unique ^'-optimal designs for the centered contrasts offactor A in the set T of all block designs, for every p e [-00; 1]; the optimalcontrast information matrix is Ka/a, and the optimal value is I/a.

Proof. The centered contrasts of factor A have a rank deficient co-efficient matrix

An equireplicated product design 1 as' has row sum vector 1 a = la/a, that is,the levels i = 1,..., a of factor A are replicated an equal number of times. Its

Page 240: Pukelsheim Optimal DoE

8.19. MATRIX MEAN OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS 207

moment matrix Af, a generalized inverse G, and the product GK are givenby

(see Section 4.8). Hence the standardized dispersion matrix is K'GK — aKa,and contrast information matrix is C = Ka/a. The powers of C are Cp+l =Ka/a

p+l.Let A e M(T) be a competing moment matrix,

If the row sum vector of W coincides with one given before, r = r, then weget K'GAGK = a2Ka&rKa. The left hand side of the normality inequality ofSection 8.18 becomes

The right hand side takes on the same values, trace Cp = (a - l)/ap. Thisproves ^'-optimality of 1 as' for Kaa in T for every p € [-oo;l], by Theo-rem 7.19 and Lemma 8.15.

Uniqueness follows from Corollary 8.14, since by equating the bottomblocks in

we obtain aW 'Ka = 0 and W — 1 as'. The optimal value is found to be

In other words, the optimal information for the contrasts is inversely propor-tional to the number of levels. This completes the proof of our claim.

A maximal parameter system may also be of interest. The expected re-sponse decomposes into a mean effect, the centered contrasts of factor A,and the centered contrasts of factor B,

Page 241: Pukelsheim Optimal DoE

208 CHAPTER 8: OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

This suggests an investigation of the parameter sys-tem

Here we claim that simultaneous <£p'-optimality pertains to the uniform design

1 al b. This is the unique product design that is equireplicated and equiblock-sized. It assigns uniform weight \/(ab) to each combination (/,;') of factor AandB.

Claim. The uniform design 1 al b is the unique ^'-optimal design for themaximal parameter system (1) in the set T of all block designs, for everyp € [-00; 1]; the dispersion matrix D = K'(M(1 alb)]~K and the </>p'-optimalvalue are

Proof, With the generalized inverse G of M = M(l al b ) as in Section 4.8,we get MGK — K for the coefficient matrix K from (1). Hence M is feasible.Furthermore C = D+ is readily calculated from D, and

say. With regression vectors

from Section 1.5, the normality inequality of Theorem 7.19 becomes(e/,d/)Af(e/,<*/)' = 1 + (a - l)/aP + (b - l)/bP = trace CP. Therefore theuniform design is (^'-optimal, for every p e [-00; 1].

Uniqueness follows from Corollary 8.14 since

forces r = s = and W = This proves the claim.

Page 242: Pukelsheim Optimal DoE

EXERCISES 209

In the following chapter we continue our discussion of designs that areoptimal under the matrix means <f>p, with a particular emphasis on 0o-, $_i-,$_oo-, and <fo-optimality in polynomial fit models.

EXERCISES

8.1 In the third-degree polynomial fit model over T = [-1; 1], consider thescalar parameter system c'S with c — (1,2,4,8)'. Show that the optimaldesign for c'O in T assigns weights 5/52, 12/52, 20/52, 15/52 to thepoints -1,-1/2,1/2,1, and has information 26~2 [Hoel and Levine(1964), p. 1557].

8.2 (continued) Show that the optimal design for c'O on the equispacedsupport points -1,-1/3,1/3,1 assigns weights 35/464, 135/464, 189/464, 105/464, and has information 29~2 and efficiency 0.8. Show thatthe uniform design on ±l,±l/3 is 65% efficient for c'O.

8.3 In the proof of Corollary 8.16, show that equality of the top blocks inMNK = KCK'NK yields BM2i = -M12Af22M2i, and that this identityis implied by BM22 = —^12-

8.4 Show that the solution to the Moore-Penrose equations is unique.

8.5 Show that (PA)+ = (PA)+P and (AQ)+ = Q(AQ)+, for all A e RBX*,orthogonal n x n projectors P, and orthogonal k x k projectors Q.

8.6 Show that A+ > B+ if and only if rank A = rank B, for all B > A > 0[Milliken and Akdeniz (1977)].

8.7 Show that K+AK+t > (K'A~K)+, for all A <E A(K) [Zyskind (1967),Gaffke and Krafft (1977)].

8.8 Show that if U 6 R^xr has rankr and $ is an information func-tion on NND(r), then A i—> tf>(U'AU) is an information functionon NND(fc).

8.9 (continued) Furthermore assume range U = range K and U'U = Ir.Show that the matrix means satisfy <f>p(U'AKU) — <&p(\i(AK),...,\r(AK)) for all A e NND(fc).

8.10 (continued) Show that if K' = K+, then <}>p(U'AKU) = <}>p((K'A-K)+)for all A € A(K).

Page 243: Pukelsheim Optimal DoE

C H A P T E R 9

D-, A-, E-, T-Optimality

Optimality of moment matrices and designs for the full parameter vector inthe set of all designs is discussed, with respect to the determinant criterion, theaverage-variance criterion, the smallest-eigenvalue criterion, and the trace cri-terion. Optimal designs are computed for polynomial fit models, where designswith arcsin support are seen to provide an efficient alternative. In trigonometricfit models, optimal designs are obtained not only for infinite sample size, buteven for every finite sample size. Finally we illustrate by example that the samedesign may be optimal in models with different regression functions.

9.1. D-, A-, E-, T-OPTIMALITY

The most popular optimality criteria in the design of experiments are thedeterminant criterion fa, the average-variance criterion </>_i, the smallest-eigenvalue criterion <£_oo, and the trace criterion <fo, introduced in Section 6.2to Section 6.5. The equivalence theorems for these criteria take a particularlysimple form, and even more so if the optimum is sought in the set of allmoment matrices, M = A/(E), and if the parameter system of interest is thefull mean parameter vector 0 (see Theorem 7.20 and Theorem 7.22). In thepresent chapter, we discuss these criteria in greater detail.

To begin with, we introduce yet another, global criterion. It turns outto be equivalent to the determinant criterion even though it looks entirelyunrelated.

9.2. G-CRITERION

A special case of scalar optimality arises if the experimenter wishes to investi-gate x '0, the mean value for the response that comes with a particular regres-sion vector x — /(/). The performance of a design £ with a feasible momentmatrix M is then measured by the information value CX(M) = (x'M~x)~l

210

Page 244: Pukelsheim Optimal DoE

9.3. BOUND FOR GLOBAL OPTIMALITY 211

for x'6, or equivalently, by the standardized variance x'M x of the optimalestimator x'O.

However, if the experimenter is interested, not just in a single point x'6,but in the regression surface x H+ x'O as x varies over the regression range X,then a global performance measure is called for. The following criterion gconcentrates on the smallest possible information and provides a naturalchoice for a global criterion. It is defined through

A design £ E H is called globally optimal in M(H) when its moment matrix Msatisfies g(M) = sapA£M^g(A). Thus we guard ourselves against the worstcase, by maximizing the smallest information over the entire regressionrange X.

Traditionally one prefers to think in terms of variance rather than infor-mation, as pointed out in Section 6.17. The largest variance over X is

The global criterion thus calls for maximization of g(A), the smallest possibleinformation over X, or minimization of d(A\ the largest possible varianceover X, as A varies over the set M(H) of all moment matrices. A bound onthe optimal value of the global criterion is easily obtained as follows.

9.3. BOUND FOR GLOBAL OPTIMALITY

Lemma. Assume that the regression range X C IR* contains k linearlyindependent vectors. Then every moment matrix M e M (H) satisfies

Proof. If M is singular, then there exists a regression vector x e X whichis not a member of the range of M. Hence we obtain d(M) = oo and g(M) =0, and the bounds are correct. If M is nonsingular and belongs to the design£ e H, then the bounds follow from

Indeed, the upper bound I/k for the minimum information g, and thelower bound k for the maximum variance d are the optimal values. This

Page 245: Pukelsheim Optimal DoE

212 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

comes out of the famous Kiefer-Wolfowitz Theorem. Furthermore, this theo-rem establishes the equivalence of determinant optimality and global optimal-ity if the set of competing moment matrices is as large as can be, M = M (E).The moment matrices and designs that are determinant optimal are the sameas those that are globally optimal, and valuable knowledge about the locationand weights of optimal support points is implied.

9.4. THE KIEFER-WOLFOWITZ THEOREM

Theorem. Assume that the regression range X C Rk contains k linearlyindependent vectors. Then for every moment matrix M e M(E) that is pos-itive definite, the following four statements are equivalent:

a. (Determinant optimality) M is <fr)-optimal for 6 in Af (E).b. (Normality inequality) x'M~lx < k for all x € X.c. (Minimax variance) d(M) = k.d. (Global optimality) M is globally optimal in A/(E).

In the case of optimality, any support point *,- of any design £ e E that is<fo-optimal for 0 in a satisfies

Proof. Theorem 7.20, with p = 0, covers the equivalence of parts (a)and (b), as well as property (1). From (b), we obtain d(M) < k. ByLemma 9.3, we then have d(M) = k, that is, (c). Conversely, (c) plainlyentails (b). Condition (c) says that M attains the global optimality bound ofLemma 9.3, whence (c) implies part (d).

It remains to show that part (d) implies (c). By the Existence Theo-rem 7.13, there is a moment matrix M e M(S) which is <fo-optimal for 6in M(a). We have proved so far that, because of optimality, this matrix Msatisfies d(M) — k and is globally optimal. Now we assume part (d), that is,M is another globally optimal matrix in M(E), besides M. The two matricesmust lead to the same optimal value, d(M) = d(M) = k. Hence part (d)implies part (c). Property (2) simply reiterates Corollary 8.12.

The argument in the proof that leads back from (d) to (c) is of the sort: suf-ficiency of an optimality condition together with the existence of an optimalsolution implies necessity. The proofs of the Gauss-Markov Theorem 1.19and the Elfving Theorem 2.14 are arranged along the same lines.

Page 246: Pukelsheim Optimal DoE

9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS 213

Property (2) entails that if a <ft)-optimal design has a minimum numberof support points, k, then it distributes the weight l/k uniformly to each ofthem, as stated in Corollary 8.12. This phenomenon occurs in polynomial fitmodels.

9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS

The Kiefer-Wolfowitz Theorem 9.4 is now used to determine the <fo-optimaldesigns for the full parameter vector in the class of all designs, for polynomialfit models on the symmetric unit interval [-!;!].

By Section 1.6, the model equation for degree d > I is

with tf e [—!;!]• The regression function / maps t e [—1;1] into the powervector x = f ( t ) = (1, r , . . . , td}' e M.d+l. The full parameter vector 6 comprisesthe k — d + 1 components BQ, &\,..., 6d. Rather than working with designs£ e a on the regression range X C Ud+l, we concentrate on the set T of alldesigns T on the experimental domain T — [-!;!].

In the dth-degree model, the moment matrixof a design T on T is given in Section 1.28:

A design r is feasible for the full parameter vector 0 if and only if its momentmatrix Md(r) is positive definite. The minimum support size of r then is d+l,because r must have at least d + 1 support points r0, f i , . . . , td e [-1; 1] suchthat the corresponding regression vectors jc/ = /(f,-) are linearly independent.

If the design T is <f> -optimal for 0 in T, for an information function <f> onNND(&) which is strictly isotonic on PD(s), then the support size is actuallyequal to d + I. This is so because the Equivalence Theorem 7.17 states thatevery support point r,- of T maximizes the function P ( t ) — f(t)'Nf(t) over/ e [-1; 1] where (f> (Md(r))N is a subgradient of $ at Md(r), by Theorem 7.9.Lemma 7.5 then shows that the matrix N is positive definite. Thus the bottomright entry of N must be positive, whence P ( t ) is a polynomial of degree 2d.Therefore P has at most d - 1 local maxima on the real line R, attained atpoints ? i , . . . ,fd-i. saY- In order to achieve the minimum support size d + 1,these must be distinct points in the interior of the interval [—1;1], and theboundary points t$ — — 1, and td = 1 must also attain the maximum value.

Page 247: Pukelsheim Optimal DoE

214 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

Degree

23456789

10

Legendre Polynomial Pd on [— 1, 1]

(-l + 3r2)/2(-3r + 5r3)/2(3-30?2 + 35*4)/8(15/-70f3+63fs)/8(-5 + 105f2 - 315f4 + 231f6)/16(-35/ + 315f3 - 693/5 + 429f7)/16(35 - 1260/2 + 6930/4 - 12012/6 + 6435f8)/128(315* - 4620f3 + 18018*5 - 25740?7 + 12155f9)/128(-63 + 3465r2 - 30030/4 + 90090/6 - 109395/8 + 46189/10)/256

EXHIBIT 9.1 The Legendre polynomials up to degree 10. We have PQ(t) = 1 and /^(r) = t,the next nine polynomials are shown in the exhibit.

This yields k = d +1 support points of the form

for any design T e T that in the dth-degree model is <£-optimal for 0 in T.We have constructed the matrix N as the subgradient of <f> at Afd(r). How-ever, we may also view N as an optimal solution of the dual problem ofSection 7.11. Hence any one such N works for all <£-optimal design T, asdoes the polynomial P(t) — f(t}'Nf(t). Thus the support points fo>'i»•••>*</are common to all <£-optimal designs for 6 in T.

This applies to the determinant criterion <fo, the theme of this section.Therefore the <fo -optimal support for d in T is of the form -1 = /0 < h <• • • < td_\ < id = 1, where we claim that the interior points ? i , . . . , /d_iare the local extrema of the Legendre polynomial Pd. For the various waysto characterize these classical polynomials, we refer to an appropriate dif-ferential equation. Alternatively they are obtained by orthogonalizing thepowers t,...,td relative to Lebesgue measure on [-!;!]. The first ten poly-nomials Pd are listed in Exhibit 9.1.

Claim. The unique design TQ that in the d th-degree model is <fo-optimalfor 6 in T assigns equal weight \/(d + 1) to the d + 1 points f, that solve theequation

where Pd is the derivative of the d th-degree Legendre polynomial Pd.

Proof. Uniformity of the weights follows from the Kiefer-Wolfowitz The-

Page 248: Pukelsheim Optimal DoE

9.5. D-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS 215

orem 9.4. The location of the support is found by studying the normality in-equality f(t}'(Md(T*)}-lf(t] < d+l. Introducing the (d+l) x (d+l) matrix Xthrough

we get Md(r*) = X'X/(d + 1). Since Md(r*} is nonsingular so is X. Thisreduces the normality inequality to IJA""1/!')!!2 < 1 f°r au" f £ [-!;!]•

The inverse V of X' admits an explicit representation once we changefrom the power basis l,t,...,td to the basis provided by the dth-degreeLagrange polynomials L/ with nodes fo, h > • • • > t<i->

where in the products, k ranges from 0 to d. Evidently L/(f ;) equals 1 or 0according as / = j or / ^ j. The same is true of the dth-degree polynomialP(t) = e!Vf(t) since we have P(tj) = e!Vf(tj) - efVX'ej = etc,. Thus thetwo polynomials are identical,

In other words, the matrix V, which has the power basis coefficient vectorsof the Lagrange polynomials as rows, is the inverse of the matrix X', whichhas the power vectors jc, = /(/,) that come with the nodes ?, as columns.

This yields \\X'~lf(t)\\2 = ||V/(r)||2 = £?=oL?(0- In order to comparethis sum of squares with the constant 1, we use the identity

Indeed, on either side the polynomials are at most of degree 2d+l. At tj, theyshare the same value 1 and the same derivative 2L;-(fy), for j = 0,1, . . . ,d.Hence the two sides are identical.

The polynomial Q(t) = n*=o(' ~'*) nas degree d+\ and satisfies Q(r,-) = 0.Around f,-, it has Taylor expansion

Page 249: Pukelsheim Optimal DoE

216 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

With this, the Lagrange polynomials and their derivatives at r, become

We let c e R be the coefficient of the highest term td+l in (1 -t2)Pd(t). Byassumption this polynomial has zeros tk, giving cQ(t) = (1 — t2)Pd(t). Onlynow do we make use of the particulars of the Legendre polynomial Pd thatit is characterized through the differential equation

In terms of Q this means cQ(t) = -d(d+l)Pd(t), and cQ(t) = -d(d+l)Pd(t).Insertion into (2) yields

For i = 1,... ,d — 1, the points /, are the zeros of Pd, and so the deriva-tives L,-(fj) vanish. This aids in evaluating (1), since it means that the onlynonvanishing terms occur for / = 0 and / = d.

The left hand side in (3) is (l-t2)Pd(t)-2tPd(t). Thus t = ±1 in (3) impliesL0(-l) = -d(d+l)/4 and Lrf(l) = d(d+l)/4 in (4). With Pd(-l) = (-\)d

and Pd(l) = 1, we insert L0 and Ld from (4) into (1) to obtain

This verifies the normality inequality. The proof of our claim is complete.

For d > 1, the optimal value i^(<fo) = <£o(To) obeys the recursion relation

with starting value vi(<fo) = 1. Exhibit 9.2 in Section 9.6 provides a list upto degree d = 10, of the <fo-optimal designs rfi for 0 in T and of their op-timal values vd(<f)o). The initial cases d = 1,2 are easy to verify directly.The line fit model has (fo-optimal design TQ supported by ±1 with uni-

Page 250: Pukelsheim Optimal DoE

9.6. ARCSIN SUPPORT DESIGNS 217

form weight 1/2, and optimal value ui(<fo) = 1. The parabola fit model has^-optimal design TQ supported by —1,0,1 with uniform weight 1/3, and op-timal value i/2(<fo) = 41/3/3 = 0.52913.

Although the optimal support points have an explicit representation, aszeros of derivatives of Legendre polynomials, they are not easy to computenumerically. However, in Section 5.15 we argued that an optimal design isnot an end in itself, but helps to identify good practical designs. To this endwe introduce designs with arcsin support (see Exhibit 9.3). Their supportis constructed easily, and they are nearly optimal in many polynomial fitproblems.

9.6. ARCSIN SUPPORT DESIGNS

The theory of classical polynomials says that the designs TQ which are <fo-optimal for 6 in T converge to the arcsin distribution, as the degree dtends to infinity. The arcsin distribution on [-1;1] has distribution func-tion A and Lebesgue density a, given by A(t) = | + (1/w) arcsin(f) anda(t) = l/(ir(l - r2)1/2). The convergence becomes plainly visible through ahistogram representation of the designs TQ, as in Exhibit 9.4.

Because of the limiting behavior, it seems natural to approximate the opti-mal support points r, by the d th-degree quantiles $/ of the arcsin distribution,

for / = 0,1,... ,d. Symmetry of the arcsin distribution entails symmetry ofthe quantiles, s^ = — sd_i for all / = 0 ,1, . . . ,d, with s0 = —1 and sd = 1.Exhibit 9.5 illustrates the construction.

Designs with arcsin support are very efficient in many polynomial fit prob-lems. They deserve a definition of their own.

DEFINITION. In a polynomial fit model of degree d over the experimentaldomain [—1; 1], an arcsin support design ad is defined by having for its supportthe d th-degree quantiles

of the arcsin distribution on [-!;!].

The set of all arcsin support designs for degree d is denoted by ^d. Themember with uniform weights l/(d + 1) is designated by <TQ. It is <fo-optimalfor 0 if the set of competing designs is restricted to the arcsin support de-signs 2d. This follows from applying the Kiefer-Wolfowitz Theorem 9.4 tothe finite regression range Xd = [f(s0), f(si),..., f ( s d ) } .

Page 251: Pukelsheim Optimal DoE

218 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

EXHIBIT 9.2 Polynomial fits over [-1; 1): ^-optimal designs r* for 6 in T. Left: degree dof the fitted polynomial. Middle: support points and weights of rfi, and a histogram represen-tation. Right: optimal value vd(<f>o) of the determinant criterion.

Page 252: Pukelsheim Optimal DoE

9.6. ARCSIN SUPPORT DESIGNS 219

EXHIBIT 93 Polynomial fits over [-1; 1]: ^-optimal designs a* for G in 2d. Left: degree

d of the fitted polynomial. Middle: arcsin support points, weights of <rft, and a histogram

representation. Right: efficiency of <r£ relative to the optimal value i^(<fo).

Page 253: Pukelsheim Optimal DoE

220 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

EXHIBIT 9.4 Histogram representation of the design TJ}°. Superimposed is the arcsin density

EXHIBIT 9.5 Fifth-degree arcsin support. Top: as quantiles sf - A '(//5). Bottom: asprojections s, = cos(l - i/5)tr of equispaced points on the half circle.

Page 254: Pukelsheim Optimal DoE

9.7. EQUIVALENCE THEOREM FOR A-OPTIMALITY 221

Exhibit 9.2 presents the overall optimal designs TQ. In contrast, the optimalarcsin support designs crfi and their (^-efficiencies are shown in Exhibit 9.3.For degrees d — 1,2, the arcsin support designs aft coincide with the optimaldesigns TQ and hence have efficiency 1. Thereafter the efficiency falls downto 97.902% for degree 9. Numerical evidence suggests that the efficiencyincreases from degree 10 on. The designs are rounded to three digits usingthe efficient design apportionment for sample size n = 1000 of Section 12.12.

The determinant criterion fa is peculiar in that optimal designs with aminimum support size d + 1 must assign uniform weights to their supportpoints. For other criteria, the optimal weights tend to vary. We next turn tothe matrix mean <f>_\, that is, the average-variance criterion.

The quantity that takes the place of the largest variance d of Section 9.2now becomes

for positive definite k x k matrices A, and d_\(A) — oo for nonnegativedefinite k x k matrices A that are singular. Because of

the function d_\ is on A/(a) bounded from below by 1.

9.7. EQUIVALENCE THEOREM FOR A-OPTIMALITY

Theorem. Assume that the regression range X C U.k contains k linearlyindependent vectors. Then for every moment matrix M € M (E) that is pos-itive definite the following four statements are equivalent:

a. (Average-variance optimality) M is <£_!-optimal for 6 in M(E).b. (Normality inequality] x'M~2x < trace M"1 for all x € X.c. (Minimax property) d_i(M) = 1.d. (d ̂ -optimality) M minimizes d_\ in M (E).

In the case of optimality, any support point jc, of any design £ e S that is</>_i-optimal for 6 in S satisfies

Proof. The proof parallels that of the Kiefer-Wolfowitz Theorem 9.4.

Page 255: Pukelsheim Optimal DoE

222 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

The present theorem teaches us a major lesson about the frame that theKiefer-Wolfowitz Theorem 9.4 provides for determinant optimality. It is falsepretense to maintain the frame also for other criteria, in order to force astructural unity where there is none. No interpretation is available that wouldpromise any statistical interest in the function d^\. Therefore, the frame ofthe Kiefer-Wolfowitz Theorem 9.4 is superseded by the one provided by theGeneral Equivalence Theorem 7.14.

9.8. L-CRITERION

An optimality concept not mentioned so far aims at minimizing linear criteriaof the dispersion matrix. It arises when a design £ is evaluated with a viewtowards the average variance on the regression surface x »-> x'0 over X.Assuming M(£) to be positive definite, the evaluation is based on the criterion

where the distribution A on X reflects the experimenter's weighting of theregression vectors x e X. Upon setting W = A/(A), this approach calls for theminimization of trace WM (^)~1. The generalization to parameter subsystemsK'0 is as follows.

Let W be a fixed positive definite s x s matrix. Under a moment matrix Mthat is positive definite, the optimal estimator for K' 6 has a dispersion matrixproportional to K'M~1K (see Section 3.5). The notion of linear optimalityfor K'O calls for the minimization of

This is a linear function of the standardized dispersion matrix K'M 1K.Linear optimality poses no new challenge, being nothing but a particular

case of average-variance optimality. Let H 6 R5X* be a square root of W,that is, W — HH'. Then the criterion takes the form

Hence minimization of <fay is the same as maximization of <£_i o CKH- Fur-thermore, the latter formulation extends to all moment matrices M e M(H),whether they are positive definite, or whether they are merely nonnegativedefinite.

In summary, with weight matrix W = HH' > 0, linear optimality forK'O is the same as <£_i-optimality for H'K'B. The optimality results for theaverage-variance criterion carry over, and also characterize linear optimality.

Page 256: Pukelsheim Optimal DoE

9.9. A-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS 223

9.9. A-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS

Continuing the discussion of polynomial fit models, we now study designsthat for degree d are $_i -optimal for the full parameter vector 6 in the set Tof all designs on the experimental domain T = [—!;!].

Let Md be a moment matrix that is </>_]-optimal for 6 in M(E). Thecriterion <f>_i is strictly isotonic on PD(s), owing to Theorem 6.13. By thesame reasoning as in Section 9.5, this determines the optimal support toconsist of d + 1 points of the form -1 = tQ < t\ < • • • < fd-i < td = 1.

The </>_i-optimal weights w0, w},..., wd for the support points to, t\,..., td

are unique, and are obtained using Corollary 8.8. Namely, let bu be the / th di-agonal element of the matrix B — (XX')~l where the regression vectors *, =/(*/) are the rows of the square and nonsingular matrix X — (XQ,XI, ... ,xd)'.Then the optimal weights w, and the optimal value are

In summary, there is a unique design T^ that is <f>_\-optimal for 6 in T. Onceits support points are found, the weights (1) and the optimal value (2) areeasy to compute.

Numerical computation of the </>_! -optimal designs r^ produces the re-sults shown in Exhibit 9.6. The weights are rounded to three digits usingthe efficient design apportionment for sample size n — 1000 of Section 12.5.The line fit model has </>_i -optimal design rlj supported by ±1 with uni-form weight 0.5 and optimal value v\ (</>_i) = 1. The parabola fit model has</>_i-optimal design r^ supported by -1,0,1 with weights 0.25, 0.5, 0.25, andoptimal value V2(<t>-\) = 0.375.

The class 2d of designs with a d th-degree arcsin support provides an effi-cient alternative. The design a^ that is <j>_\-optimal for 6 in 2d has weightsalso given by (1), except that the matrix B — (XX'}~1 involves the arcsinsupport points st through

Exhibit 9.7 shows that the </>_!-efficiency of a^ is remarkably high.Next we turn to the left extreme matrix mean, the smallest-eigenvalue

criterion <£_oo- In order to determine the 4>_oo-optimal designs for polyno-mial fit models we need to refer to Chebyshev polynomials. We review theirpertinent properties first.

Page 257: Pukelsheim Optimal DoE

224 CHAPTER 9. D-, A-, E-, T-OPTIMALITY

EXHIBIT 9.6 Polynomial fits over [-!;!]: ^-optimal designs T^, for 9 in T. Left: de-gree d of the fitted polynomial. Middle: support points and weights of r^r and a histogramrepresentation. Right: optimal value u,/(<^_i) of the average-variance criterion.

Page 258: Pukelsheim Optimal DoE

9.9. A-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS 225

EXHIBIT 9.7 Polynomial fits over [-!;!]: <A_j-optimal designs o-^ for 6 in Ld. Left: degree

d of the fitted polynomial. Middle: arcsin support points, weights of adr and a histogram

representation. Right: efficiency of cr^ relative to the optimal value vd(<f>^).

Page 259: Pukelsheim Optimal DoE

226 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

Degree

23456789

10

Chebyshev Polynomial Td on [-1

-l + 2f2

-3r + 4t 3

l-&2 + 8r4

5f - 20f3 + 16f5

-! + 18f2-48r4 + 32r6

-7r + 56r3-112r5 + 64f7

1 - 32f2 + 160r4 - 256f6 + 128r8

9t - 120r3 + 432r5 - 576r7 + 256r9

-1 + 50r2 - 400r4 + 1120f6 - 1280r8 +

;i]

512/10

EXHIBIT 9.8 The Chebyshev polynomials up to degree 10. We have T0(t) = 1 and 7\ (/) = t,the next nine polynomials are shown in the exhibit.

9.10. CHEBYSHEV POLYNOMIALS

The Chebyshev polynomial Td of degree d is defined by

The Chebyshev polynomials for degree d < 10 are shown in Exhibit 9.8.It is immediate from the definition that the function Td is bounded by 1,

All extrema of Td have absolute value 1. They are attained at the arcsinsupport points

from Section 9.6:

In order to see that Td(t] is a polynomial in t, substitute cos (p for / andcompare the real parts in the binomial expansion of the complex exponentialfunction, cos(d<p) + i sin(d<p) = ei(fd — (cos(<p) + isin(<p))d. This leads to

the polynomial representation 7X0 = Y^=ocjfj f°r which only the highest

Page 260: Pukelsheim Optimal DoE

9.11. LAGRANGE POLYNOMIALS WITH ARCSIN SUPPORT NODES 227

coefficient and then every second are nonzero,

where [d/2\ is the integer part of d/2,We call c — (CQ,CI, ... ,Q)' e IRd+1 the Chebyshev coefficient vector. With

power vector f ( t ) = (1, t,..., td)', we can thus write Td(r) - c'/(r). Whereasthe vector c pertains to the power basis l , f , . . . , f d , we also need to referto the basis provided by the Lagrange interpolating polynomials, with nodesgiven by the arcsin support points Sj.

9.11. LAGRANGE POLYNOMIALS WITH ARCSIN SUPPORTNODES

The Lagrange polynomials with nodes SQ, s\,..., s^ are

where in the products, k ranges from 0 to d. We find it unambiguous to usethe same symbol L, as in Section 9.5 even though the present node set isdistinct from the one used there, and so are the Lagrange polynomials andall associated quantities.

Again we introduce the (d + 1) x (d + 1) matrix V = (VQ,VI, ... ,vd)that comprises the coefficient vectors v, from the power basis representa-tion Li(t) = i>//(0- F°r degrees d = 1,2,3,4, the coefficient matrices V andthe sign pattern (-l)d~l+j of vijtj-2j are shown in Exhibit 9.9. More precisely,the entries of V satisfy the following.

Claim. For all i = 0 ,1, . . . ,d and ; = 0,1,... , |///2j, we have

Proof. The proof of (1) is elementary, resorting to the definition of L,and exploiting the arcsin support points -1 = SQ < Si < ••• < sd^ < sd = 1,solely their symmetry, sd_( = -5,-. The denominator satisfies

Page 261: Pukelsheim Optimal DoE

228 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

EXHIBIT 9.9 Lagrange polynomials up to degree 4. For degrees d = 1,2,3,4, the (d +1) x(d + 1) coefficient matrix V is shown that determines the Lagrange polynomial L,-(f) =y*._nv;/f'', with nodes given by the arcsin support points 5,-. The bordering signs indicate

the pattern

It remains to multiolv out the numerator oolvnomialand to find the coefficient belonging to td 2>. We distinguish three cases.

I. Case d odd. Except for / - sd_t = t + s,-, the factors in P come in pairs,/ - sk and / + sk. Hence we have P(t) = (t + Si)Q(t), where the polynomial

involves only even powers of t. In order to find the coefficient of the oddpower td~2i in P, we must use the term t of the factor t + st and \(d - 1) - ;terms t2 in the factors of Q. This results in the coefficient

With \(d - 1) = \\(d - 1)J, we see that (3) and (2) establish (1).

n. Case d even and i = \d. The definition of P misses out on the factort — 0. Since the remaining factors come in pairs, we get

Page 262: Pukelsheim Optimal DoE

9.12. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, I 229

The even power td~2i uses \d — j terms t2, and thus has coefficient

With \d-l = \\(d - 1)J now (4) and (2) imply (1).

III. Case d even and i / \d. Here P comprises the factors t - 0 = t andt - sd_i = t + St. We obtain P(t) = t(t + s,-)G(0» with

The even power td 2> is achieved in P only as the product of the leadingfactor r, the term t in the factor t + Si, and \d-l-j terms t2 in the factorsof Q. The associated coefficient is

This and (2) prove (1). Moreover (5) is nonzero unless the summation isempty. This occurs only if ; = \d; then the setcontains \d - 1 numbers and does not include a ^-element subset.

As depicted in Exhibit 9.9, we refer to (1) through the sign pattern(-\Y~l+l• Furthermore both numerator and denominator in (1) are sym-metrical in / and d — i. The discussion of (5), together with (3) and (4) showthat the numerator vanishes if and only if d is even and / = ^d ^ i. All theseproperties are summarized in

This concludes our preparatory remarks on Chebyshev polynomials, andLagrange polynomials with nodes s,. In order to apply Theorem 7.24, we needthe Elfving norm of the vector Kz. For the Chebyshev coefficient vector cand z = K'c, we compute the norm p(KK'c) as the optimal value of thedesign problem for c'KK'0.

9.12. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, I

In a d th-degree polynomial fit model, we consider parameter subsystems ofthe form QX = (Oi)iei, with an ^-element index set I = {ii,...,is}. Using

Page 263: Pukelsheim Optimal DoE

230 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

the Euclidean unit vectors e§,e\,...,ed of Rd+1, we introduce the (d + 1) x 5matrix K = (e,,,... ,eis) and represent the parameter system of interest byK'O = ( f t , , . . . , ft,)'. The matrix K fulfills K'K = Is and KK1 the latter is a diagonal matrix D where da is 1 or 0 according as / belongs tothe set X or not.

Because of the zero pattern of the Chebyshev coefficient vector c, thevector KK 'c depends on X only through the set

More precisely, we have KK'c = Y^j&jcd-2jed-2j- We assume the index set Xto be such that J is nonempty, so that KK 'c ̂ 0. In this section, we considerscalar optimality for c'KK'd and find, as a side result, the Elfving normp(KK'c) to be \\K 'elf. It turns out that the optimal design is an arcsin supportdesign.

Claim. Let c e F8rf+1 be the coefficient vector of the Chebyshev polyno-mial Td(t) = c'f(t) on [—!;!]. Then the unique design TJ that is optimal for

in T has support points

and weights

and optimal variance (p(KK 'c)) = \\K 'c||4, where the coefficients w0, "i, • • • ,ud are determined from

If d is odd or J ^ {\d} then all weights are positive. If d is even andJ — {\d}, then we have c'KK'B = BQ and wd/2 = 1, that is, if d is even,then the one-point design in zero is uniquely optimal for BQ in T.

Proof. The proof is an assembly of earlier results. Since the vector u =(HO, MI , • • • i ud)' solves X'u — KK 'c, the identity X' V = Id+i from Section 9.5yields u = VKK'c, that is, «, = X)yej vi,d-2jCd-2j- An exceptional case occursfor even degree d and one-point set J = {\d}\ then we have wd/2 = 1 andall other weights are 0.

Page 264: Pukelsheim Optimal DoE

9.12. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, I 231

Otherwise we use the sign patterns and andto show that the weights are positive,

The normalization follows from upon using

From (6) in Section 9.11, we see that the weights are symmetric, w, = Wd-i-Let M be the moment matrix of the design TJ. The key relation is the

following:

Premultiplication by c'KK'M~ gives \\K'c\\4 = c'KK'M'KK'c. That is, forthe design TJ the optimality criterion for c'KK'O takes on the value ||AT'c||4.

Now we switch to the dual problem. The matrix N = cc' satisfiesf(t)'Nf(t) = (Td(t)}

2 < 1, for all t e [-!;!]. The dual objective functionhas value c'KK'NKK'c = ||A"'c||4. Therefore the Mutual Boundedness The-orem 2.11 proves the matrices M and N to be optimal solutions of the designproblem and its dual.

The theorem also stipulates that any support point t of an optimal designsatisfies the equation f(t)'Nf(t) = (Td(t)}

2 = 1. By Section 9.10, this singles out the arcsin quantiles SQ, s\,..., sd as the only possible support points.The matrix identity X'V = Id+i shows that X is nonsingular, whence the re-gression vectors f ( s Q ) , f ( s i ) , . . . ,f(sd) are linearly independent. Hence Corol-lary 8.9 applies and provides the unique optimal weights, iThese are the weights given above. Our claim is established.

As a corollary we obtain a representation as in the Elfving Theorem 2.14,

Thus the coefficient vector KK 'c penetrates the ElfVing set 72. through the d-dimensional face generated by (-I)d/(s0), (-l)*"1/^!)* • • • > -/(*d-i)i/(**)•

Page 265: Pukelsheim Optimal DoE

232 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

The full index set J = {0,1,..., d} gives the unique optimal design for c'B,denoted by TC, with optimal variance \\c\\4. The one-point set J = {d — 2j}yields the optimal design Td_2j for the scaled individual parameter cd_2j8d-2j->with optimal variance c^_2.. The same design is optimal also for the unsealedcomponent 0</-2/» with optimal variance c^_2,. These designs satisfy the rela-tion

Therefore rc is a mixture of the designs Td_2y, with mixing weights c^_2 -/||c||2.We now leave scalar optimality and approach our initial quest, of dis-

cussing optimality with respect to the smallest-eigenvalue criterion <f>-oo. Weclaim the following.

Claim. The design TJ has an information matrix C = CK(Md(rj)] forK '6 such that K 'c is an eigenvector corresponding to the eigenvalue ||A" 'c||~2.

Proof. We choose a left inverse LoiK which satisfies LM (Id+i —L'K') =0, where M = Md(rj) is the moment matrix of TJ. As in Section 3.2, we thenhave C — LML', in addition to LML'K1 = LM. Insertion of Me from thekey relation (2) yields

For instance, in a fourth-degree model, the sets {0,1,3,4}, {0,1,4},{0,3,4}, {0,4} all share the same set J = {0,2}, the same vector KK'c =Coeo+c4e4 — (1,0,0,0,8)' e R5, and the same optimal design T{0,2}- The infor-mation matrices for K'8 are of size 4x4 , 3x3, 3x3, 2x2 , with respectiveeigenvectors (1,0,0,8)', (1,0,8)', (1,0,8)', (1,8)', and common eigenvaluel/(cJ + c3) = l/65.

While it is thus fairly easy to see that \\K'c\\~2 is some eigenvalue, <£_oo-optimality boils down to showing that it is the smallest eigenvalue of C.

9.13. E-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS

The designs TJ of the previous section are symmetric whence the odd mo-ments vanish. Therefore their moment matrices decompose into two inter-lacing blocks (see Exhibit 9.10).

In investigating the subsystem K'B = 0j, we place a richness assumption

Page 266: Pukelsheim Optimal DoE

9.13. E-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS 233

EXHIBIT 9.10 E-optimal moment matrices. Because of symmetry of the design TC = T^,its moment matrix splits into two interlacing blocks, as shown for degrees d — 1,2,3,4. Dotsindicate zeros.

on the index set J which ensures that the smallest eigenvalue of the infor-mation matrix comes from the block that is associated with the Chebyshevindices d - 2j. More precisely, we demand that every non-Chebyshev indexd — 1 — 2j in J is accompanied by its immediate successor d — 2/,

for all j — 0,1,..., [\d\. Since the scalar case is taken care of by the previoussection, we further assume that the index set J contains at least two indices.This prevents the set

from degenerating to the one-point set {\d}. We claim the following for adth-degree polynomial fit model with experimental domain [-!;!].

Claim. Let c e Ud+l be the coefficient vector of the Chebyshev polyno-mial Td(t) — c'f(t) on [-1;1], and assume that the index set X satisfies as-sumption (1). Then the design TJ of Section 9.12 is the unique <£_oo-optimaldesign for K'6 — % in T, with optimal value ||A"'c||"2. If d > 1, then thesmallest eigenvalue of the information matrix C = CK(Md(Tj)} has multi-plicity 1.

Proof. From Section 9.12 we have p(KK'c) = ||/C'c||2. The proof restson Theorem 7.24, in verifying the equality

Page 267: Pukelsheim Optimal DoE

234 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

That is, we wish to show that

with equality if and only if z is proportional to K 'c.

I. Let M be the moment matrix of TJ. With a left inverse L of K that isminimizing for M, we can write C = LML'. Hence the assertion is

Because of the interlacing block structure of A/, the quadratic form on theleft hand side is

where P and Q are polynomials of degree d and d — 1 associated with theChebyshev index set and its complement,

The contributions from P2 and Q2 are discussed separately.II. In (1) of Section 9.12, we had Any

d th-degree polynomial satisfies

A comparison of coefficients with yields

Observing sign patterns we apply this to the Chebyshev polynomial Td:

Page 268: Pukelsheim Optimal DoE

9.13. E-OPTIMAL DESIGNS FOR POLYNOMIAL FIT MODELS 235

We also apply it to the dth-degree polynomial P, in thatNow, for each / 6 J, the Cauchy inequality yields

If equality holds in all Cauchy inequalities, then we need exploit only anyone index j € J with j ^ \d to obtain proportionality, for / = 0,1,... , d,

of P(s,-)Kd-2;|1/2 and (-l)d~~i+i\vitd_2j\l/2- Because of i/M_2;- ^ 0, this entails

P(SI) — a(-l)d~', for some a e IR. Hence equality holds if and only ifP = aTd, that is, ad_2y = acd_2y for all j = 0,1,... , \\d\.

III. The argument for Q2 is reduced to that in part II, by introducing thed th-degree polynomial

From sj < 1 and part II, we get the two estimates

If d is even and \d e J, then the last sum involves the coeffcient of t° in Pwhich is taken to be a,\ = 0.

Page 269: Pukelsheim Optimal DoE

236 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

EXHIBIT 9.11 Polynomial fits over [-!;!]: ^-oo-optimal designs vd_^ for 0 in T. Left:degree d of the fitted polynomial. Middle: support points and weights of T^ and a histogramrepresentation. Right: optimal value v</(</>-oo) of the smallest-eigenvalue criterion.

Page 270: Pukelsheim Optimal DoE

9.14. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, II 237

Equality holds in (3) only if, for some /3 e R, we have P = f$Td. Equalityholds in (2) only if Q2(Si) = s2Q2(si); in case d > 1, any one index / =l , . . . ,d - 1 has 5? < 1 and hence entails Q(s,-) = 0 = P(SI) = (-l)d-'p.Thus equality obtains in both (2) and (3) if and only if 0^-1-2; — 0 f°r a^; = 0 , l , . . . ,L j (d - l ) J .

IV. Parts II and III, and the assumption that an index d - 1 - 2; occursin X only in the presence of d - 2j yield

With a = L'z and LK = 7S, we get a'KK'a = ||z||2. Therefore ||tf'c||-2 isthe smallest eigenvalue of C. If d > 1, then a'Ma = \\z\\2/\\K'c\\2 holds ifand only if a — ac, that is, z = aK'c\ hence the eigenvalue ||/£'c||~2 hasmultiplicity 1.

V. If T is another ^^-optimal design for K'O in T, then T is also optimalfor c'KK'O, by Theorem 7.24. Hence the uniqueness statement of that the-orem carries over to the present situation. This completes the proof of ourclaim.

For instance, in a fourth-degree model only the last two of the four sets{0,1,3,4}, {0,1,4}, {0,3,4}, {0,4} meet our assumption (1). Hence T{0,2}is 4>_oo-optimal in T for (0o,03,04)', as we^ as f°r (0o>04)', with commonoptimal value 1/65.

The present result states that many 0_oo-optimal designs are arcsin sup-port designs. The case of greatest import is the full index set I — {0,1,..., d}for which the design TC of the previous section is now seen to be also $-00-optimal for 0 in T. It is in line with our earlier conventions that we employthe alternate notation r^ for TC. Exhibit 9.11 lists the ^-oo-optimal designsr^ up to degree 10, with weights rounded using the efficient design appor-tionment of Section 12.5. The line fit model has <£_oo-optimal designsupported by ±1 with uniform weight 0.5 and optimal value f^-oo) = 1;the (^oo-optimal moment matrix for 0 is /2, its smallest eigenvalue, 1, hasmultiplicity 2. The parabola fit model has </>_oo-optimal design r2^ supportedby -1, 0, 1 with weights 0.2, 0.6, 0.2, and optimal value u2(<£-oo) = 0.2.

From Theorem 7.24, we also know now that the Chebyshev coefficientvector c determines the in-ball radius of the polynomial Elfving set 72., r2 =l/||c||2. The in-ball touches the boundary of ft only at

9.14. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, II

In Section 9.12, we derived the optimal designs rd_2j for those individualparameters 0^-2y that are an even number apart from the top coefficient Bd.

Page 271: Pukelsheim Optimal DoE

238 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

A similar argument leads to the optimal designs rd_i_2j for the coefficientsOd-\-2jthat are an odd number away from the top. Their support falls downon the arcsin support of one degree lower, d — \.

In order that the coefficient vector of the (d — 1) st-degree Chebyshevpolynomial Td_i fits into the discussion of the dth-degree model, we usethe representation Td_i(t) = c'f(t), with dth-degree power vector f(t) =( l , r , . . . ,td)' as before. That is, while ordinarily Td_\ comes with d coeffi-cients c0,ci,... ,Q_I, it is convenient here to append the entry cd = 0. Letagain Ox = (#i)<ez be the parameter system of interest. Because of the zeropattern of the Chebyshev coefficient vector c introduced just now, the vectorKK 'c depends on J only through the set

Assuming the index set I to be such that J is nonempty, we claim thefollowing.

Claim. For degree d > 1, let c e Rd+1 be the coefficient vector of theChebyshev polynomial Td_i(t) = c'f(i) on [-!;!]. Then the unique designTJ that is optimal for in T has support points

and weights

<2

and optimal variance (p(KK 'c)) = \\K'c ||4, where the coefficients w0, MI , . . . ,ud_i are determined from

If d is even or J ^ {\(d — 1)}, then all weights are positive. If d is odd andj = {I(d - 1)}, then we have c'KK'O = QQ and w(d^)/2 = 1, that is, if d isodd then the one-point design in zero is uniquely optimal for OQ in T.

Proof. The support of the design T~ gives rise to only d linearly inde-pendent regression vectors /(?o))/(^i)> • • • >/fo-i) in the d + 1 dimensionalspace Ud+l. Hence the moment matrix Md(r~) is singular, and feasibility of r~for c'KK'O requires proof.

Page 272: Pukelsheim Optimal DoE

9.14. SCALAR OPTIMALITY IN POLYNOMIAL FIT MODELS, II 239

Using the (d - l)st-degree power vector f(t) = (l,t,...,td~1)', equa-tion (1) without the last line reads This yields

where are the power basis coefficients of the Laeraneepolynomials with nodes The last line in (1) is

Using the symmetries andwe get

Thus the last line in (1) is fulfilled and T~ is feasible for c'KK'B. The rest\J

of the proof duplicates the arguments from Section 9.12 and is thereforeomitted.

Again we deduce a representation as in the Elfving Theorem 2.14:

Here the coefficient vector ATAT'? nenetrates the Elfvine set 72. throueh the(d — l)-dimensional face that is generated by

The one-point set I — {d - 1 - 2;} yields the optimal design rd_i_2j for:he component 0</_i_2/, with optimal variance c j_1_2;.

As a comparison, on the rfth-degree arcsin support points SQ,Si,...,sd,;he design 0^-1-2; ̂ at *s optimal f°r Od-i-2j has weights

Page 273: Pukelsheim Optimal DoE

240 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

Degree

345678910

fib

0.36001

0.25281

0.20621

0.17931

0i

10.6141

10.4679

10.3893

10.3393

h

0.56251

0.49731

0.43121

0.38241

03

10.6863

10.5821

10.5085

10.4549

04

I0.5968

10.5404

10.4907

1

0s

10.6462

10.5825

10.5313

06

10.6066

10.5609

1

0j

I0.6331

10.5860

0s

10.6106

1

09

10.6271

#10

1

EXHIBIT 9.12 Arcsin support efficiencies for individual parameters 0y. For degree d < 10and j — 0,1,..., d, the table lists the efficiencies of the optimal arcsin support design ay for 0,

in ?.d, relative to the optimal design TJ for 0, in T.

by Corollary 8.9. This design has efficiency

The efficiency is 1 for degree 1 and 2. From degree 3 on, the discrepancybetween the arcsin support sets of degree d and d - 1 entails a drastic jumpin the efficiencies, as shown in Exhibit 9.12.

Finally, we treat the right extreme matrix mean, the trace criterion fa.Although there is little practical interest in this criterion its discussion isinstructive. Just as the Elfving Theorem 2.14 addresses design optimalitydirectly without a detour via moment matrices, so does the Theorem 9.15.Furthermore, we are thrown back to the concept of formal optimality ofSection 5.15, that a moment matrix can have maximum fa -information forthe full parameter vector 6 without being feasible for 0.

9.15. EQUIVALENCE THEOREM FOR T-OPTIMALITY

Theorem. Assume that the regression range X C Uk contains k linearlyindependent vectors. Let R be the maximum length of all regression vectors,R = max{ ||jr|| : x 6 X}. Then a design £ e H is formally (fo-optimal for 0in H if and only if every support point of £ has maximum length R.

Proof. It is easy to establish the theorem by a direct argument. Everymoment matrix M(£) obeys the bound trace M(g) = £)*esupp f £(*) x'x < R2.The bound is attained if and only if all support points of £ have maximumlength R.

Page 274: Pukelsheim Optimal DoE

9.16. OPTIMAL DESIGNS FOR TRIGONOMETRIC FIT MODELS 241

From Section 2.17, the maximum length R of the regression vectors is theradius of the Euclidean ball circumscribing the regression range X. Usuallythis makes the situation easy to analyse.

For polynomial fit models over the experimental domain T — [—1;1] thesquared length of a regression vector is \\x\\2 = ||/(OII2 — 1 + /2 + • • • + t2d.The maximum R2 = d + 1 is attained only at t = ±1. Hence any ^-optimaldesign rf on [—1;1] has a moment matrix of rank at most 2 and cannot befeasible for 0, except for the line fit model. The optimal value is Vd(4>i) —R2/(d + l) = l.

In our discussions of the polynomial fit models we have managed to studythe optimality properties of, not just moment matrices, but designs proper.The following example, trigonometric fit models, even leads to optimal de-signs for finite samples sizes.

9.16. OPTIMAL DESIGNS FOR TRIGONOMETRIC FIT MODELS

The trigonometric fit model of degree d > 1 has regression function

and a total of k — 2d + 1 components for the parameter vector 6 (see Sec-tion 2.22). The experimental domain is the "unit circle" T = [0;27r).

We call a design rn an equispaced support design when r" assigns uniformweight \/n to each of n equispaced support points on the unit circle [0;27r).The support points of an equispaced support design T" are of the form

where a £ [0;27r) is a constant displacement of the unit roots 2Trj/n. Forthese designs we claim the following optimality properties.

Claim. Every equispaced support design T" with n > 2d+l is $p-optimal,for all p £ [-00; 1], for the full parameter system 0 in the set T of all designson the experimental domain T = [0;27r). The optimal value function strictlyincreases in p,

from over

Page 275: Pukelsheim Optimal DoE

242 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

Proof. First we show that the designs r" all share the same momentmatrix

The diagonal entries and the off-diagonal entries of M are, with a,Z> =!,. . . ,<*,

Evaluation of the five integrals (2), (3) and (4)-(6) is based on the twoformulas

where m e {!,...,«-!} determines some multiple of ITT. In order to es-tablish (7), we set /3 = 2irm/n e (0;27r). The complex exponential functionprovides the relation e" = cos t + i sin t, and we get

Thus both sums in (7) are linear combinations ot the nnite geometric series, with quotients q The two series sum to

and hence vanish, because of This proves (7).We now return to the integrals (2)-(6). For cos2 (at) I in

we set i , and apply (7) to obtain in (2)

Page 276: Pukelsheim Optimal DoE

9.17. OPTIMAL DESIGNS UNDER VARIATION OF THE MODEL 243

Then sin2 = 1 - cos2 gives fsm2(bt)drn = \ in (3). The integrals (4), (5)evidently are of the form (7) and vanish. In the integral (6), the sin-cosaddition theorem transforms the integrand into cos(at) sm(bt) = \ (sin(at +bt)-sm(at-bt)). Again the integral vanishes because of (7). Therefore everyequispaced support design T" has moment matrix M as given by (1).

Optimality of the designs T" and the moment matrix M is approachedthrough the normality inequalities. With parameter p € (-00; 1] we have, forall t e [0;27r),

Theorem 7.20 proves 0P-optimality of the designs T" if Lemma 8.15 extends optimality to the t^-oo-criterion.

What do we learn from this example? Firstly, there are many designsthat achieve the optimal moment matrix M. Hence uniqueness may fail fordesigns although uniqueness holds true for moment matrices. Secondly, thetheory of designs for infinite sample size occasionally leads to a completesolution for the discrete problem of finding optimal designs for sample sizen > k = 2d + l. Thirdly, every regression vector/(f) has squared length d + 1.Therefore every design is <fo-optimal for 6 in T, illustrating yet another timethe poor performance of <ft. Finally, a single design may be optimal underthe matrix means (j>p, for all parameters p e [-00;!]. This also follows fromthe symmetry properties that the model enjoys under rotations (compareSection 14.5).

The concluding example of this chapter illustrates yet another feature thatmay occur, namely, that one and the same design remains optimal even undervariation of the underlying model.

9.17. OPTIMAL DESIGNS UNDER VARIATION OF THE MODEL

Speaking of designs T for the full parameter system 6 on an experimentaldomain T, we suppress any explicit reference to the regression function /that determines the statistical model. Of course, any notion of optimalityis meaningless unless the model is specified. Optimality of a design usuallyholds in a specific model only. In most cases the underlying model is clearlyunderstood and no ambiguity arises.

There are rare instances where a design remains optimal under a varietyof models. For example, on the experimental domain 1 — [—1;1], considerthe design r which assigns uniform weight 1/3 to the support points —1,0,1.This design is <fo-optimal for 0 in the class T of all designs on [—1;1], withrespect to each of the following three models I, II, and III.

Page 277: Pukelsheim Optimal DoE

244 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

I. The first model has regression function f(t) = ( l , r , f 2 ) ' and hence fitsa parabola. The design £ on the regression range X which is induced by Thas support points, moment matrix, inverse moment matrix, and normalityinequality given by

Hence the design r is ̂ -optimal for 6 in T. The cfo-optimal value is 41//3/3 =0.52913 (see also Section 9.5).

II. The next model has regression function /(?) = (l,sin(|/7rr),cos(5irr))'and fits a trigonometric polynomial of first degree over the half circle. Theinduced design £ has support points, moment matrix, inverse moment matrixand normality inequality given by

Again r is (^-optimal for 0 in T, with value 41/3/3 = 0.52913.

III. The third model has regression function f(t) = (l,e',e~')'. The de-sign £ has support points

With a = 1 + e + e~l = 4.08616 and b = 1 + e2 + e~2 = 8.52439, the moment

Page 278: Pukelsheim Optimal DoE

EXERCISES 245

matrix of £ and its inverse are

where c - 6a2 + 3b2 - 2a2b - 27 = 6.51738 is the determinant of 3M. Thenormality inequality turns out to be

Hence T is <fo-optimal for 0 in T in the present model as well, and has optimalvalue c1/3/3 = 0.62264.

This chapter has focused on the important special cases of <£p-optimalityfor all parameters in the class of all designs. We now return to the Loewnercriterion, and investigate a kind of minimum performance requirement thatany reasonable design should satisfy, that is, admissibility. While Loewneroptimality of a design £ means that £ beats every competitor, admissiblityprohibits any competitor to be better than £. The two notions are distinctbecause the Loewner ordering is only a partial ordering.

EXERCISES

9.1 Show that the global criterion g of Section 9.2 is an information func-tion on NND(&).

9.2 The following line of reasoning of Guest (1958) provides an alternativeto our exposition in Section 9.5 from display (1) onwards.

i. The normality inequality, with equalityat the optimal support points tQ,t\,. ..,td, entails L;(ry) = 0 for alli — 1 // _ 1

ii. The polynomial satis-fies and

Hence t\,..., td_\ are zeros of Q.

Hi. The polynomials Q(t) and (1 — t2)Q(t) have the same zeros. There-fore there exists some c ̂ 0 so that Q solves the polynomial identity

on R.

for a;;l

Page 279: Pukelsheim Optimal DoE

246 CHAPTER 9: D-, A-, E-, T-OPTIMALITY

iv. Represent O in the power basis. withand compare coefficients in obtain

for alla\ = 6fls, and a0 = 2«2- In other words, the polynomial Q whichsolves the differential equation (*) is unique.

v. Show that (1 - t2)Pd(t) solves (*), where Pd is the Legendre poly-nomial.

93 Verify that 2.1e °nd is a reasonable fit to the determinant optimal valueVdM'

9.4 How does the linear dispersion criterion 4>w of Section 9.8 relate tothe weighted matrix mean </>_^?

9.5 Show that relative to the arcsin distribution A of Section 9.6, theChebyshev polynomials satisfy fTdTmdA = 0, 1/2, 1 according asd=£m, d = m ̂ 0, d = m = Q.

9.6 In Section 9.11, show that viid^_2j = SiVitd_2j and |v/fd_i_2/| =\vd-i,d-l-2j\-

9.7 In the line fit model over the interval [-b;b] of radius b > 0, show thatthe uniform design on ±b is ^-oo-optimal for 6 in T, with informationmin{l,Z?2}.

9.8 In the quadratic fit model over T = [-V5;\/2], show that (i) theunique ^>_oo-optimal design for 0 in T assigns weights 1/8, 3/4, 1/8to —\/2, 0, v/2, (ii) its moment matrix has smallest eigenvalue 1/2with multiplicity 2, and eigenvectors z = (-1/V5,0,1/V5)' and z =(0,1,0)', (iii) the only nonnegative definite trace 1 matrix E satisfyingTheorem 7.22 is zz' [Heiligers (1992), p. 33].

9.9 In the dth-degree polynomial fit model over [-1;1], show that thearcsin support design with weights l/(2d),l/d,... ,l/d,l/(2d) is theunique design that is optimal for the highest coefficient 6d [Kiefer andWolfowitz (1959), p. 282].

9.10 In the Jth-degree polynomial fit model over [—1;1], show that (i)Amax(Md(r)) < d + 1 for all T € T, (ii) supT€T fy (Md(r)) = d+ 1for all p e [l;oo].

9.11 In the trigonometic fit model of Section 9.16, are the equispaced sup-port designs the only $p-optimal designs for 0 in T?

Page 280: Pukelsheim Optimal DoE

C H A P T E R 10

Admissibility of Moment andInformation Matrices

Admissibility of a moment matrix is an intrinsic property of the support setof the associated design. Polynomial fit models serve as an example. Never-theless there are various interrelations between admissibility and optimality, aproperty relating to support points and weights of the design. The notion ofadmissibility is then extended to information matrices. In this more generalmeaning, admissibility does involve the design weights, as is illustrated withspecial contrast information matrices in two-way classification models.

10.1. ADMISSIBLE MOMENT MATRICES

A kind of weakest requirement for a moment matrix M to be worthy ofconsideration is that M is maximal in the Loewner ordering, that is, that Mcannot be improved upon by another moment matrix A. In statistical termi-nology, M has to be admissible. Let M C NND(£) be a set of competingmoment matrices, and H C H be a subset of designs on the regression range•y r~ oA;<a ^_ IW .

DEFINITION. A moment matrix M e M is called admissible in M whenevery competing moment matrix A e M with A > M is actually equal to M.A design £ e H is called admissible in H when its moment matrix M(£) isadmissible in M(H).

In the sequel, we assume the set M to be compact and convex, as in thegeneral design problem reviewed in Section 7.10. A first result on admissibil-ity, in the set H of all designs, is Theorem 8.5. If a design rj e H has a supportpoint that is not an extreme point of the Elfving set 72. = conv(X U (-.#)),then 17 is not admissible in H. Or equivalently, if £ e H is admissible in then its support points are extreme points of 72. Here is another result thatemphasizes the role of the support when it comes to discussing admissibility.

247

Page 281: Pukelsheim Optimal DoE

248 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

10.2. SUPPORT BASED ADMISSIBILITY

Theorem. Let 17 and £ be designs in H. If the support of £ is a subset ofthe support of 17 and 17 is admissible in H, then £ is also admissible in H.

Proof. The proof is akin to that of Corollary 8.9, by introducing theminimum likelihood ratio a = min{ri(x)/^(x): x e supp£}. Because of thesupport inclusion, a is positive. It satisfies i7(jc) - ag(x) > 0 for all x e X. Let£ e H be any competing design satisfying M(£) > M(£). Then a£+ 17 - a£is a design in H, with moment matrix

Here equality must hold because of admissibility of rj. Since a is positive, weget M(£) = M(£). This proves admissibility of £.

In measure theoretic terms the inclusion of the supports, supp £ C supp 17,means that £ is absolutely continuous relative to 17.

A matrix M e M. is said to be inadmissible in M. when it is not admissiblein M, that is, if there exists another competing moment matrix B e M suchthat B ^ M. In this case, B performs at least as well as M, relative to everyisotonic function <f> on NND(£). This calls for substituting B in place of M.The improvement can always be achieved with a matrix A £ M which ispossibly distinct from B but which cannot be improved any further, that is,which is admissible.

10.3. ADMISSIBILITY AND COMPLETENESS

Lemma. Let M € M be a competing moment matrix. If M is inadmis-sible in M, then there exists an admissible matrix A e M which improvesupon M, that is, A § M.

Proof. The subset M - {B e M : B > M} of M is compact. Hencethe trace function attains its maximum over M at A G M, say. The matrix Ais admissible in M. For if C > A then C € M. Now trace C > trace /4 >trace C forces trace(C — A) = 0, and C — A.

By definition of M we have /I > M. Since M is inadmissible there exists amatrix B e M with B^M. Strict monotonicity of the trace yields trace A >trace B > trace M, entailing A^M.

In decision theoretic terms, the lemma says that the admissible designsform a complete class, that is, every inadmissible moment matrix in M may

Page 282: Pukelsheim Optimal DoE

10.4. POSITIVE POLYNOMIALS AS QUADRATIC FORMS 249

be improved upon, A^M, where the moment matrix A e M is admissible.Theoretically, the design problem is simplified by investigating the "smaller"subset Hadm of admissible designs, rather than the "larger" set H of all designs.Practically, the design set Eadm and the moment matrix set M(Hadm) may welllack the desirable property of being convex, besides the fact that the meaningsof "smaller" and "larger" very much depend on the model.

Here are two examples in which every design is admissible, the trigono-metric fit model and the two-way classification model. In both models theregression vectors have constant length,

as mentioned in Section 6.5. But if x'x is constant over X, then any twodesigns £ and 17 that are comparable, M(T]} > M(£), have identical momentmatrices, M (17) = M(£). This follows from

and the strict monotonicity of the trace function. Thus every moment matrixM e M(H) is admissible in M(H), and admissibility leads to no simplifica-tion at all. In plain words, admissibility may hold for lack of comparablecompetitors.

In this respect, polynomial fit models show a more sophisticated structure.We take the space to explicitly characterize all admissible designs in thismodel. The derivation very much relies on the peculiarities of the model.Only fragments of the arguments prevail on a more general level. We beginwith an auxiliary result from calculus.

10.4. POSITIVE POLYNOMIALS AS QUADRATIC FORMS

Lemma. Let P be a polynomial defined on R, of even degree 2d. Then Pis positive, P ( t ) > 0 for all t 6 (R, if and only if there exists a positive definite(d + 1) x (d + 1) matrix A such that, with power vector f ( t ) = (1, t,..., td)',

Proof. For the direct part, we extract the coefficient c e IR of the highestpower t2d in P. Because of P > 0, we have c > 0. We proceed by inductionon d > 1.

If d = 1, then P is a parabola,

Page 283: Pukelsheim Optimal DoE

250 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

with a, )8 e IR and y — a2 + )3. Because of P > 0, we haveThus we get the desired representation P(t) = ( l , t ) A ( l

t ) , with

Assuming the representation holds for all positive polynomials R of degree2(d — 1), we deduce the result for an arbitrary positive polynomial P ofdegree Id. Because of P > 0, the roots of P are complex and come inconjugate pairs, Zj = ay +i/3; and 7; = a; - i/3;. Each pair contributes aparabola to the factorization of P,

with ay, Pj € U. Because of P > 0 we have 0? > 0. Thus the factorized formof P is

say. The first polynomial in (1) is Q(t) bd_itd~l +td = (fr',l)/(0, with Z> = ( f to ,* i i - - - , ^ - i ) ' e ^d- Hence we canwrite

The second polynomial in (1) isIn this sum, each term is nonnegative, the constant term is andthe highest power r2^-1) has coefficient £/<d ftf > 0. Hence R is a positivepolynomial of degree 2(d — l). By assumption there is a positive definite d x dmatrix B such that

Altogether, (1), (2), and (3) provide the representation P(t) = f(t}'Af(t),with the positive definite matrix

This completes the direct part of the proof.The converse part holds because of f(t) ^ 0 for all t 6 R.

Page 284: Pukelsheim Optimal DoE

10.5. LOEWNER COMPARISON IN POLYNOMIAL FIT MODELS 251

10.5. LOEWNER COMPARISON IN POLYNOMIAL FIT MODELS

Again we denote by T the set of all designs on the experimental domain T =[—1; 1]. The model for a polynomial fit of degree d > 1 has regression functionf(t) = (!,*,..., td}. A design T e T has moment matrix

Here the Loewner comparison of two moment matrices reduces to a com-parison of moments. We claim that only the highest moments can differ, theothers must coincide. To this end, we introduce a notation for the initialsection of the first 2d-\ moments,

We can now give our claim a succinct form.

Claim. Two designs cr, T e T satisfy if and only if themoments of a and T fulfill and

Proof. For the proof of the direct part, we evaluate the quadratic form

for vectors z = (z0> z\,. • • , zd)' With just two nonzero entries, Zi — 1and Zi = a, we obtain, tor i — 0,1,..., a — 1,

By induction on i, we show that (1) implies

that is, the consecutive initial sections of the moments of a and r coincide.For i = 0 we have HQ(O) = 1 = /AO(T). Hence (1) forces JJLJ(or) == /A/(T) for all

Page 285: Pukelsheim Optimal DoE

252 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

; = !, . . . ,</, that is, H(d)(<*} = AK<*)(T)- Assuming /i(rf+l--i)(o-) = M(d+i-i)(T),we deduce /i^+oC0") = /*(<*+/) (T)- The restriction i < d-l leads to 2/ < d+i — 1.Hence jt2,-(o-) = M2i(T) and (1) yields /A,+/(O-) = Pi+j(r), for all / = /+!,.. . , d.Adjoining At/+d(cr) = /tl+d(T) to the assumed identity of the initial sections oflength d + i - I, we obtain i*.(d+i)((r} — P-(d+i)(T)- Thus (2) is established.

The case / = d - 1 in (2) gives At(2d-i)(o") = ^(2d-\)(f)- Finally (1) yieldslJL2d(°') > ^2d(T)» where equality is ruled out because of Md(a) ^ Md(r).

The proof of the converse is evident since Md(a) - Md(r) has entries 0except for the bottom right entry fadfa) - M2d(T) which is positive. Thus theproof of our claim is complete.

A first admissibility characterization is now avaifabte. fn a o"tfi-cfegreemodel, a design r 6 T is admissible in T if and only if it maximizes themoment of highest order, 2d, among those designs that have the same lowerorder moments as has T,

This characterization is unwieldy and needs to be worked on. The crucialquestion is how much the 2dth moment At2d(0-) varies subject to the re-striction /t(2d-i)(cr) = /A(2d-i)(T)- If there is no variation, then /u,2d(T) is theunique moment that goes along with iA(2d-i)(T)-> and again admissibility holdsfor lack of comparable competitors. Otherwise /i2d, given the initial sectionM(2</-i)(T)i is nonconstant, and admissibility is a true achievement.

10.6. GEOMETRY OF THE MOMENT SET

The existence of distinct, but comparable moments depends on how the giveninitial section relates to the set of all possible initial sections,

We call ju,(2d-i)(T) the moment set up to order 2d - 1. Its members are inte-grals,

just as the members of the set A/(H) of moment matrices are the integralsfxxx' dg, with £ e H. For this reason the arguments of Lemma 1.26 carryover. The moment set /A(2d-i)(T) is a compact and convex subset of R2d~\

Page 286: Pukelsheim Optimal DoE

10.7. ADMISSIBLE DESIGNS IN POLYNOMIAL FIT MODELS

being the convex hull of the power vectors g(t) — (t,...,t2d~1)' with t e

[-i;i].The set M(2d-i)(T) includes all polytopes of the form conv{0,g(fi),...,

g(hd-\}}- If tne points 0 ^ *i, • • • , *2<*-i £ [-1»1] are pairwise distinct, thenthe Vandermonde determinant proves the vectors g(t\),... ,g(?2d-i) to belinearly independent,

In this case, the above polytopes are of full dimension 2d-l. Therefore themoment set /A(2</-i)(T) has a nonempty interior. We now claim the following,for a given design T € T.

Claim. If /u,(2rf_i)(r) lies in the interior of the moment set /t(2d-i)(T), thenthere exists a design a e T satisfying /A(2d-i)(0") = M(2d-i)(r) and ^2d(a'} /^2d(f}-

Proof. The statement has nothing to do with moments, but follows en-tirely from convex analysis. A change in notation may underline this. LetC — /i(2d)(T) be the moment set up to order Id, a convex set with nonemptyinterior in Rw+1, with m = 2d - 1. Let D C Um be its image under the pro-jection ( y , z ) i-> y. We assume that y — )H(2d-i)(T) nes in tne interior of D.The assertion is that the cut Cy — [z € 1R : (y,z) € C} contains at leasttwo points. Convex analysis provides the fact that the number z lies in theinterior of Cy if and only if the vector (y,z) lies in the interior of C. Thelatter is nonempty, and hence so is Cy. Thus Cy is a nondegenerate intervaland contains another point /u,2</(cr), say, besides /^X1")- This proves the claim(see also Exhibit 10.1).

10.7. ADMISSIBLE DESIGNS IN POLYNOMIAL FIT MODELS

We are now in a position to describe the admissible designs in polynomial fitmodels.

Claim. For a design T G T in a d th-degree polynomial fit model on theexperimental domain [-1;1], the following three statements are equivalent:

a. (Admissibility) r is admissible in T,b. (Support condition) r has at most d - 1 support points in the open

interval (-1;1).

253

Page 287: Pukelsheim Optimal DoE

254 CHAiTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

EXHIBIT 10.1 Cuts of a convex set. If C C Rm+1 is a convex set with nonempty interiorand y e IRm is an interior point of the projection D of C on Rm, then the cut Cy = |z 6 IR :

(y,z) € C\ is a nondegenerate interval.

c. (Normality condition) There exists a positive definite (d + l) x (d + l)matrix N that satisfies

with equality for the support points of T.

Proof. First we prove that part (a) implies part (b). Let r be admis-sible. We distinguish two cases. In the first case, we assume that the vec-tor M(2d-i)(T) lies °n the boundary of the moment set /i(2d-i)(T). Thenthere exists a supporting hyperplane in R2d-1, that is, there exists a vector0 ̂ h e U2d~l such that

with power vector g(t) = (t,..., t2d *)'. Therefore the polynomial

is nonpositive on [-1; 1], and satisfies //*(*) dr = 0. The support points of Tare then necessarily zeros of P. Because P < 0, they actually determine localmaxima of P in [—!;!]. The degree of P is at most 2d — 1. As h ^ 0, the

Page 288: Pukelsheim Optimal DoE

10.7. ADMISSIBLE DESIGNS IN POLYNOMIAL FIT MODELS 255

polynomial P is nonconstant. Hence P possesses at most d - 1 local maximaon R and r has at most d - 1 support points in (—1; 1).

In the second case we assume that H(2d-i)(T) ues m the interior of themoment set M(2d-i)(T). Because of admissibility, the moment /-^(T) maxi-mizes ^2d °ver the designs that have initial section At(2rf-i)(7"), by Section 10.5.Stepping up to order Id, this puts the vector n^d)^} °n the boundary of themoment set /t(2d)(T). Hence there exists a supporting hyperplane in U2d, thatis, there exists a vector 0 ̂ h e R2d such that

with enlarged power vector g(t) — (t,...,t2d I,t2d)'. Therefore the polyno-mial

is nonpositive on [—1; 1], and again the support points of r are local maximaof P in [-!;!]. In order to determine the degree of P, we choose a design crwith distinct but comparable moments. Section 10.6 secures the existence ofsuch designs cr, for the present case. Moreover T is assumed to be admissible,whence //^(o") < M2d(T)> by Section 10.5. From

we obtain h2d > 0. If h2d = 0, then 0 ^ (hi,.. - ,/i2d-i)' e 032d~l defines asupporting hyperplane to /X(2d-i)(T) at n^d-i)^}, contradicting the presentcase that H(2d~i)(T) lies in the interior of the moment set M(2d-i)(T). Thisleaves us with h2d > 0. Any polynomial P of degree 2d with highest coefficientpositive has at most d-1 local maxima on R. Thus r has at most d-1 supportpoints in (—1;1).

Next we show that part (b) implies part (c). Let t\,...,tt be the supportpoints of T in (—1; 1). By assumption, we have i < d - 1. If i < d - 1, thenwe add further distinct points r ^ + i , . . . , td_\ in (-1; 1). Now the polynomial

is nonnegative inside [-1;1], nonpositive on the outside, and vanishes at thesupport points of r. Therefore the polynomial

Page 289: Pukelsheim Optimal DoE

256 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

is positive on R and of degree 2d. From Lemma 10.4, there exists a matrixAT € PD(d +1) such that P(t) = f(t)'Nf(t). In summary, we obtain

with equality for the support points of T.Finally, we establish part (a) from part (c). Let a e T be a compet-

ing design with Md(o~) > Md(r). Since the support points t of T satisfyf(t)'Nf(t) = 1, the normality condition (c) yields

Strict monotonicity of the linear form A H-+ trace AN forces Md(o~) = Md(r}.Hence T is admissible, and the proof of the claim is complete.

This concludes our discussion of admissible designs in polynomial fit mod-els. The result that carries over to greater generality is the interplay ofparts (a) and (c), albeit in the weaker version of Corollary 10.10 only.

We may view part (c) as a special instance of maximizing a strictly isotonicoptimality criterion <f>, defined through 4>(A) — trace AN. This points to thebroader issue of how optimality theory interacts with admissibility. Generallythere are three instances when <£-optimality entails admissibility, dependingon whether <f> is strictly isotonic, or </> is strictly isotonic only on PD(A;), or <£is merely isotonic. The increasing generality of the criteria <f> is compensatedby appropriate conditions on the admissibility candidate M.

10.8. STRICT MONOTONICITY, UNIQUE OPTIMALITY, ANDADMISSIBILITY

Lemma. Let M e M be a competing moment matrix. In order that Mis admissible in M, any one of the following conditions is sufficient:

a. (Strict monotonicity) M is <£-optimal for 6 in M, for some strictlyisotonic optimality criterion <£.

b. (Nonsingularity and strict monotonicity on PD(A:)) M is positive definiteand M is </>-optimal for B in M, for some optimality criterion <£ thatis strictly isotonic on PD(k).

c. (Unique optimality) M is uniquely <f>-optimal for 6 in A4, for someisotonic optimality criterion <£.

Page 290: Pukelsheim Optimal DoE

10.9. E-OPTIMALITY AND ADMISSIBILITY 257

Proof. The argument for part (a) is indirect. If there exists a competi-tor B e M with B ^ M, then strict monotonicity of <f> on NND(fc) implies4>(B) > <£(M), whence M is not <£-optimal. For part (b), the same rea-soning applies to B ^ M > 0, appealing to the strict monotonicity of <£on PD(fc) only. In part (c), every competitor A € M with A > M satis-fies <j>(A) > <f>(M), by monotonicity. Optimality of M entails <j>(A) — <t>(M),and uniqueness forces A — M.

Part (c) leads to the most comprehensive results, with criteria of the form0 = (fr^ o CK. We can even exhibit the coefficient matrices K that bestserve this purpose.

10.9. E-OPTIMALITY AND ADMISSIBILITY

Theorem. Let M e M be a competing moment matrix, with a full rankdecomposition M - KK'. Then M is admissible in M if and only if M isuniquely $-00-optimal for K'd in .A/1.

/Voo/ Let r be the rank of M, so that K eRkxr. Then the informationmatrix of M for K'6 is the identity matrix,

For the direct part, we assume admissibility. Let A e M be <£_oo-optimalfor K'd in M', such matrices exist by the Existence Theorem 7.13. Optimalityof A yields

This means Ir < CK(A). Pre- and postmultiplication by K and K' give

where AK is the generalized information matrix for K'B from Section 3.21.Admissibility forces A = M, showing that M is uniquely 4>_oo-optimal forK'B inM.

The converse follows from part (c) of Lemma 10.8, with

As an application, we consider one-point designs £(*) = 1. Their momentmatrices are of the form M(£) = xx'. Such a design is admissible in H if andonly if it is uniquely optimal for x '6 in H. The Elfving Theorem 2.14 permitsus to check both optimality and uniqueness. Thus the one-point design £(*) =1 is admissible in H if and only if x is an extreme point of the Elfving setK = con\(Xu(-X)).

Page 291: Pukelsheim Optimal DoE

258 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

Another application occurs in the proof of the following theorem, on opti-mality relative to the information functions <f>u(A) = trace AN. These tracecriteria are isotonic if N >Q, and strictly isotonic if N > 0 (see Section 1.11).Without loss of generality, we take N ^ 0 to be scaled in order to satisfytrace M N = 1.

10.10. T-OPTIMALITY AND ADMISSIBILITY

Corollary. Let M e M be a competing moment matrix. For N eNND(fc), let <f>N be the optimality criterion given by <J>N(A) = trace AN.

a. (Necessity) If M is admissible in M, then M is <£#-optimal for 6 in Mfor some nonnegative definite k x k matrix N with trace MN = 1.

b. (Sufficiency) If M is <£#-optimal for 0 in M for some positive definitek x k matrix N with trace MN = 1, then M is admissible in M.

Proof. We deduce part (a) from Theorem 10.9. For any full rank decom-position M = KK', the matrix M is <£_oo-optimal for K'B in Ai, with optimalvalue Amin(C/^(M)) = 1. By Theorem 7.21, there exist a matrix E > 0 withtrace E = 1 and a generalized inverse G of M, such that N = GKEK'G' > 0satisfies

In particular, M is <f>N-optimal for 6 in M. Part (b) follows from part (a) ofLemma 10.8.

The merits of the theorem lie in its transparent geometric meaning. Giventhe matrix 0 ̂ N e Sym(/c), the projection of a matrix A E Sym(fc) onto theone-dimensional subspace £(N) = {aN : a e R} is

Hence if M maximizes A H-» trace AN, then M has a longest projection onto£(N) among all competitors A e M. The necessary condition (a) and thesufficient condition (b) differ in whether the matrix N points in any directionof the cone NND(fc), N > 0, or whether it points into the interior, N > Q(see Exhibit 10.2).

Except for the standardization trace M N = 1, the normality inequality ofSection 7.2 offers yet another disguise of <fov-optimality:

In the terminology of that section, the matrix N is normal to M at M.

Page 292: Pukelsheim Optimal DoE

10.10. T-OPTIMALITY AND ADMISSIBILITY 259

EXHIBIT 10.2 Line projections and admissibility. The matrices M\ and MI have a longestprojection in the direction N^ > 0, but only MI is admissible. The matrices A/2 and MI areadmissible, but only M2 has a longest projection in a direction N > 0.

Relative to the full set M(H), we need to refer to the regression vectorsjc e X only. The design £ is (j>N -optimal for 6 in H if and only if

with equality for the support points of £. This is of the same form as thenormality condition (c) of Section 10.7. The geometry simplifies since we neednot argue in the space Sym(fc) of symmetric matrices, but can make do withthe space Rk of column vectors. Indeed, N induces a cylinder that includesthe regression range X, or equivalently, the Elfving set Tl (see Section 2.10).

Neither condition (a) nor (b) can generally be reversed. Here are twoexamples to this effect. Let HI be the set of all designs in the line fit modelof Section 2.20, with regression range X\ — {1} x [-1; 1]. Choosing N = (QQ),we find x'Nx = 1 for all jc e X\. Hence the one-point design £(Q) =1 is <f>N-optimal, even though it is inadmissible. For instance, the design 17 (_\) =rj(!) = \ has a better moment matrix,

Therefore the converse of condition (a) fails to hold in this example.A variant of this model is appropriate for condition (b). We deform the

regression range X\ by rounding off the second and fourth quadrant,

Page 293: Pukelsheim Optimal DoE

260 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

In the set Ha of all designs on X2, the design £(Q) = 1 is uniquely optimal for

in Ha and hence admissible, by the Elfving Theorem 2.14 and by part (c) ofTheorem 10.7. But the only matrix

satisfying

is the singular matrix N = (J|j). Indeed, we have a = (1,0)N(J) = 1. Non-negative definiteness yields y > /32. Inserting x = (j), we get 1 + 2j8 + y < 1,that is, y < -2/3. Together this means

Now we exploit the curved boundary of From andwe obtain Siihriivisinn hv

eives '. As Jt? tends 1 tends giving 2From (1), we get and Thus N is necessarily singular. Theconverse of condition (b) is false, in this setting (see Exhibit 10.3).

The results of the present corollary extend to all matrix means <f>p with

10.11. MATRIX MEAN OPTIMALrTY AND ADMISSIBILITY

Corollary. Let M e M be a competing moment matrix and let 0 ̂ p €

a. (Necessity) If M is admissible in M and N > 0 satisfies trace AN <1 = trace MN, for all A e M, then M is </>p-optimal for HK'B in Mwhere K is obtained from a full rank decomposition MNM = KK'and where H = (C*(Af ))<1+^/Grt.

b. (Sufficiency) If M is uniquely <jH»p-optimal for K'O in M., for some fullcolumn rank coefficient matrix K, then M is admissible in M.

Page 294: Pukelsheim Optimal DoE

10.11. MATRIX MEAN OPTIMALITY AND ADMISSIBILITY 261

EXHIBIT 10J Cylinders and admissibility. Left: the one-point design £(J) = 1 is ̂ -optimal,but inadmissible over the line fit regression range X\. Right: the same design is admissible overthe deformed regression range X^, but is ̂ -optimal for the singular matrix N = (^Q), only.

Proof. We recall that in part (a), admissibility of M entails the exis-tence of a suitable matrix N', from part (a) of Corollary 10.10. The point isthat N is instrumental in exhibiting the parameter system HK'6 for whichthe matrix M is <£p-optimal. Suppose, then, that the k x r matrix K sat-isfies MNM = KK', where r is the rank of MNM. From range AT =range MNM C range M, we see that M is feasible for K'6. Hence C =CK(M) is a positive definite r x r matrix, and we have

For the positive definite matrix H = C(1+p)/(2p), we obtain

with q conjugate to p. Therefore the primal and dual objective functions takethe value

The Duality Theorem 7.12 shows that M is ^-optimal for HK'6 in M, and

Page 295: Pukelsheim Optimal DoE

262 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

that N is an optimal solution of the dual problem. Part (b) follows from part(c) of Lemma 10.8.

If the hypothesis in part (a) is satisfied by a rank 1 matrix N = hh',then MNM = cc', with c = Mh, and M is optimal for c'6 in M. Closelyrelated conditions appear in Section 2.19 when collecting, for a given momentmatrix M, the coefficient vectors c so that M is optimal for c'6.

10.12. ADMISSIBLE INFORMATION MATRICES

The general optimality theory allows for parameter subsystems K'6, beyondthe full parameter vector 0. Similarly admissibility may concentrate on in-formation matrices, rather than on moment matrices. The requirement isthat the information matrix C^(Af) is maximal, in the Loewner ordering,among whichever competing information matrices are admitted. To this end,we consider a subset C C C/^(M(H)) of information matrices for K'6 where,as usual, the coefficient matrix K is taken to be of full column rank.

An information matrix C#(M) e C is called admissible in C when everycompeting information matrix CK(A) € C with CK(A) > C#(A/) is actuallyequal to C/^(M). A design £ e H is said to be admissible for K'6 in H C Hwhen its information matrix C#(M(£)) is admissible in C/c(A/(E)).

Some of the admissibility results for moment matrices carry over to infor-mation matrices, such as Section 10.8 to Section 10.11, others do not (The-orem 10.2). We do not take the space to elaborate on these distinctions ingreater detail. Instead we discuss the specific case of the two-way classifica-tion model where admissibility of contrast information matrices submits itselfto a direct investigation.

10.13. LOEWNER COMPARISON OF SPECIAL C-MATRICES

In the two-way classification model, let the centered contrasts of the firstfactor be the parameter system of interest. For the centered contrasts to beidentifiable under an a x b block design W, the row sum vector r = Wlb

must be positive. Then the special contrast information matrix Ar — rr' isLoewner optimal in the set T(r) of designs with row sum vector equal to r,by Section 4,8.

The first step is to transcribe the Loewner comparison of two special con-trast information matrices into a comparison of the generating row sum vec-tors. We claim the following.

Claim. In dimension a > 2, two positive stochastic vectors f , r e Ra satisfy

Page 296: Pukelsheim Optimal DoE

10.13. LOEWNER COMPARISON OF SPECIAL C-MATRICES 263

if and only if the components fulfill, for some / < a,

Proof. The major portion of the proof is devoted to showing that, underthe assumption of (1), conditions (2) and (3) are equivalent to

Indeed, with K' = (Ka,Q) and utilizing generalized information matricesas in (1) of Section 3.25, condition (4) means (M(ts'))K > (M(rs'))K. ByLemma 3.23, this is the same as K'M(ts')~K < K'M(rs')~K. Insertion ofthe generalized inverse G of Section 4.8 yields an equivalent form of (4),

The a x (a - 1) matrix //, with iih row -l^-i while the other rowsform the identity matrix 7f l_i, satisfies KaH = H and H(H'H)~1H' = Ka.This turns (5) into //'A'1// < H'k~lH. Upon removing the ith componentfrom t to obtain the shorted vector 7= ( f i , . . . , r,-_i, tM,..., ta)', we compute//'A,"1// = Af~

] -i- (l/ti)la-.\lg__r With f defined similarly, we may rearrangeterms to obtain

Because of (1) the factor 1/f, - 1/r, is positive. It is a lower bound to theelements l/r; — l/r; of the diagonal matrix A^1 — Af~\ whence (6) entailsAr1 > A^1. Moreover, in view of the Schur complement Lemma 3.12, we seethat (6) is equivalent to

Another appeal to Lemma 3.12, with the roles of the blocks in (7) inter-changed, yields

Page 297: Pukelsheim Optimal DoE

264 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

This is merely another way of expressing conditions (2) and (3). Hence (4)is equivalent to (2) and (3).

Only a few more arguments are needed to establish the claim. For thedirect part we conclude from that t ̂ r, whence follows (1),and then (4) leads to (2) and (3).

For the converse, (1), (2), and (3) imply (4). Equality in (4) is ruled outby (1). If we assume A,-f?' = A r—rr' , then we trivially geSince (1) secures / ^ r, there exists some k < a such that tk > rk and

We have more than two subscripts, a > 2, and so conditions (2) and (8) cannothold simultaneously. Hence A, - tt' ^ Ar - rr', and the proof is complete.

Thus we have A, — tt' ^ Ar - rr' if and only if, with a single exception (1),the components of t are larger than those of r (2), and the discrepanciesjointly obey the quantitative restriction (3). This makes it easy to comparetwo special contrast information matrices through their generating row s,umvectors, and to attack the problem whether comparable row sum vectors exist.

10.14. ADMISSIBILITY OF SPECIAL C-MATRICES

For the set of special contrast information matrices

the question is one of admissibility of a particular member Ar - rr' in theclass C.

Claim. For a positive stochastic vector r € Rfl in dimension a > 2, thematrix Ar - rr' is admissible in C if and only if r, < \ for all / = 1,..., a.

'Proof. We prove the negated statement, that ' is inadmissible in Cif and only if r, > \ for some /. If Ar - rr' is inadmissible, then there exists apositive stochastic vector t e Ra such that . Consideration

of the diagonal elements yields t} - tj > r; - r?, for all j. From condition (1)of Section 10.13, we get r, < r/, for some /. This is possible only if r, > \ and>/€ [ l - r , , r ( - ) .

For the converse, we choose /, e [1 - r,,r/) and define

Then t — (t\,...,ta)' is a positive stochastic vector satisfying (1) and (2) of

Page 298: Pukelsheim Optimal DoE

10.15. ADMISSIBILITY, MINIMAXITY, AND BAYES DESIGNS 265

Section 10.13. It also fulfills (3) since f,• > 1 — r, yields

By Section 10.13 then whence Ar - rr' is inadmissible.This completes the proof.h

In view of the results on Loewner optimality of Section 4.8, we may sum-marize as follows. A block design W e T is admissible for the centeredcontrasts of factor A if and only if it is a product design, W = rs', and atmost half of the observations are made on any given level / of factor A,r, < \ for all i < a. It is remarkable that the bound \ does not depend on a,the number of levels of factor A.

The mere presence of the bound \ is even more surprising. Admissibilityof a design for a parameter subsystem K '6 involves the design weights. This isin contrast to Theorem 10.2 on admissibility of a design for the full parametervector 6 which exclusively concentrates on the design support.

10.15. ADMISSIBILITY, MINIMAXITY, AND BAYES DESIGNS

The notion of admissibility has its origin in statistical decision theory. There,a parameter 6 e @ specifies the underlying model, and the performance ofa statistical procedure T is evaluated through a real-valued function 6 H-»R(6,T], called risk function. The terminology suggests that the smaller thefunction R(-,T), the better. This idea is captured by the partial ordering ofsmaller risk which, for two procedures T\ and T2, is defined through pointwisecomparison of the risk functions,

It is this partial ordering to which admissibility refers. A procedure T\ iscalled admissible when every competing procedure T2 with T2 < T\ hasactually the same risk as T\. Otherwise, T\ is inadmissible.

Decision theory then usually relates admissibility to minimax procedures,that is, procedures that minimize the maximum risk,

Alternatively, we study Bayes procedures, that is, procedures that minimizesome average risk,

Page 299: Pukelsheim Optimal DoE

266 CHAPTER 10: ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

where TT is a probability measure on (an appropriate sigmafield of subsetsof) 6.

In experimental design theory, the approach is essentially the same exceptthat the goal of minimizing risk is replaced by one of maximizing information.A design is evaluated through its moment matrix M. The larger the momentmatrix in the Loewner ordering, the better. The Loewner ordering is a partialordering that originates from a pointwise comparison of quadratic forms,

where Sk = {x € Rk : \\x\\ = 1} is the unit sphere in IR*. Therefore, thedifference with decision theoretic admissibility is merely one of orientation,of whether an improvement corresponds to the ordering relation <, or thereverse ordering >.

With this distinct orientation in mind, the usual decision theoretic ap-proach calls for relating admissibility to maximin designs, that is, designs thatmaximize the minimum information,

This is achieved by Theorem 10.9. Alternatively, we may inquire into someaverage information.

where TT is a probability measure on 5^. This is dealt with in Corollary 10.10,with a particular scaling of N. From the analogy with decision theory, a designmay hence be called a Bayes design when it maximizes a linear criterion(j>N(A) = trace AN.

However, there is more to the Bayes approach than its use as a tool of deci-sion theory. In the arguments above, TT conveys the experimenter's weightingof the various directions x e Sk. More generally, the essence is to bring tobear any prior information that may be available. Other ways of how priorinformation may permeate a design problem are conceivable, and each ofthem may legitimately be called Bayes,

EXERCISES

10.1 Show that M is admissible in M if and only if HMH' is admissiblein HMH1, for all H e GL(fc).

10.2 In the parabola fit model over [-1; 1], a design r is optimal for BQ+ fain T if and only if ftdr = 0 and /12 dr = |. Which of the optimal

Page 300: Pukelsheim Optimal DoE

EXERCISES 267

designs with such thatare admissible in T? [Kiefer and Wolfowitz (1965), p.

1652].

10.3 Show that M is admissible in M if and only if M is uniquelyoptimal for (CK(M})l'2K'9 in M, where K provides a full rank de-composition MNM = KK' for some N > 0 such that trace AN < 1 =trace MN for all A e M.

10.4 Show that £ is admissible for K'9 in H if and onlv if £ is admissiblefor for all

10.5 Verify that and fulfill (;i)-(3) of Section 10.13, butviolate

10.6 (continued) Show that H — (Ia-\,-la-i)' satisfies KaH = H andH(H'H)~1H' = Ka.

Page 301: Pukelsheim Optimal DoE

C H A P T E R 11

Bayes Designs andDiscrimination Designs

The chapter deals with how to account for prior knowledge in a design prob-lem. These situations still submit themselves to the General Equivalence The-orem which, in each particular case, takes on a specific form. In the Bayessetting, prior knowledge is available on the distribution of the parameters.This leads to a design problem with a shifted set of competing moment matri-ces. The same problem emerges for designs with bounded weights. In a secondsetting, the experimenter allows for a set of m different models to describe thedata, and considers mixtures of the m information functions from each model.The mixing is carried out using the vector means 3>p on the nonnegative or-thant R™. This embraces as a special case the mixing ofm parameter subsets orofm optimality criteria, in a single model. A third approach is to maximize theinformation in one model, subject to guaranteeing a prescribed efficiency levelin a second model. Examples illustrate the results, with special emphasis ondesigns to discriminate between a second-degree and a third-degree polynomialfit model.

11.1. BAYES LINEAR MODELS WITH MOMENT ASSUMPTIONS

An instance of prior information arises in the case where the mean parametervector 6 and the model variance a2 are not just unknown, but some valuesfor them are deemed more likely than others. This is taken into account byassuming that 6 and o2 follow a distribution, termed the prior distribution,which must be specified by the experimenter at the very beginning of themodeling phase. Thus the parameters 6 and a2 become random variables,in addition to the response vector Y. The underlying distribution P nowdetermines the joint distribution of Y, 6, and a2. Of this joint distribution, weutilize in the present section some conditional moments only, as a counterpartto the classical linear model with moment assumptions of Section 1.3.

268

Page 302: Pukelsheim Optimal DoE

11.1. BAYES LINEAR MODELS WITH MOMENT ASSUMPTIONS 269

The expectation vector and the dispersion matrix of the response vector Y,conditionally with 6 and or2 given, are as in Section 1.3:

The expectation vector and the dispersion matrix of the mean parametervector 0, conditionally with a2 given, are determined by a prior mean vectordo 6 Rk, a prior dispersion matrix RQ € NND(A:), and a prior sample sizen0 > 1, through

Finally, a prior model variance a^ > 0 is the expected value of the modelvariance cr2,

Assumptions (1), (2), and (3) are called the Bayes linear model with momentassumptions.

Assumption (2) is the critical one. It says that the prior estimate for 6, be-fore any sampling evidence is available, is OQ and has uncertainty (o-2/n0)/?o-On the other hand, with the n x k model matrix X being of full columnrank k, the Gauss-Markov Theorem 1.21 yields the sampling estimate 9 =(X'X)~1X'Y, with dispersion matrix (a2/n)M~l, where as usual M =X'X/n. Thus, uncertainty of the prior estimate and variability of the sam-pling estimate are made comparable in that the experimenter, through spec-ifying nQ, assesses the weight of the prior information on the same per ob-servation basis that applies to the sampling information. The scaling issueis of paramount importance because it touches on the essence of the Bayesgoal, to combine prior information and sampling information in an optimalway. If the prior information and the sampling information are measured incomparable units, then the optimal estimator is easily understood and looksgood. Otherwise, it looks bad.

With a full column rank k x s coefficient matrix K, let us turn to a param-eter system of interest K'6, to be estimated by T(Y) where T maps from R"into Rs. We choose a matrix-valued risk function, called mean squared-errormatrix,

Two such matrices are compared in the Loewner ordering, if possible. Itis possible, remarkably enough, to minimize the mean squared-error matrixamong all affine estimators AY + b, where the matrix A e Rsxn and the shift

Page 303: Pukelsheim Optimal DoE

270 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

b e Rs vary freely. The shift b is needed since, even prior to sampling, thereis a bias

unless BQ = 0. Any affine estimator that achieves the minimum mean squared-error matrix is called a Bayes estimator for K'B.

11.2. BAYES ESTIMATORS

Lemma. In the Bayes linear model with moment assumptions, let theprior dispersion matrix jR0 be positive definite. Then the unique Bayes esti-mator for K'6 is K'8, where

and the minimum mean squared-error matrix is

Proof. I. We begin by evaluating the mean squared-error matrix for anarbitrary affine estimator T(Y) = AY + b. We only need to apply twice thefact that the matrix of the uncentered second moments, provided it exists,decomposes into dispersion matrix plus squared-bias matrix. That is, for ageneral random vector Z we have E/>[ZZ'] = D/>[Z] + (E/>[Z])(E/>[Z])'.Firstly, with Z = AY + b — K'B, we determine the conditional expectationof ZZ' given B and a2,

where B — AX-K'. Secondly with Z = BB + b, the conditional expectationgiven a2 is

Integrating over a2, we obtain the mean squared-error matrix of AY + b,

Page 304: Pukelsheim Optimal DoE

11.2. BAYES ESTIMATORS 271

II. The third term, (BBo + b)(B8Q + b)', is minimized by b = -B00 =K'OQ-AXBQ.

III. The key point is to minimize the sum of the first two terms which,except for oj, is S(A) = AA1 + (l/no)(AX -K')R0(AX -K')'. With A =K'(n0RQl +X'X)-1X', we expand S(A + (A-A)). Among the resulting eightterms, there are two pairs summing to 0. The other four terms are rearrangedloyiGldS(A) = S(A) + (A-A)(In+nQlXR0X')(A-AY > 5(1), with equalityonly for A — A.

IV. In summary, with & = K'^-AXOo = K'(nQR-1 + X'X^noR^Oo,the unique Bayes estimator for K'6 is AY + b = K'O, with 6 as in (1) andmean squared-error matrix as in (2).

The Bayes estimator B does not yet look good because the terms which ap-pear in (1) are too inhomogeneous. However, we know that the sampling esti-mate 6 satisfies the normal equations, X'Xd = X'Y, and so we replace X'Yby X'X9. This exhibits 9 as a combination of the prior estimate &Q and thesampling estimate 6. In this combination, the weight matrices can also bebrought closer together in appearance. We define

to be the prior moment matrix which, because of (2) of Section 11.1, is prop-erly scaled. The corresponding scaling for the sampling portion is X'X = nM.With all terms on an equal footing the Bayes estimator looks good:

It is an average of the prior estimate 6fo and the sampling estimate B, weightedby prior and sampling per observation moment matrices M0 and A/, andby prior and experimental samples sizes n0 and n. The weights sum to theidentity, (/ioM) + nAf )~1n0A/o + («o^o + nM)~lnM = Ik. The mean squared-error matrix of the Bayes estimate K'd is

It depends on the Bayes moment matrix Ma which is defined to be a convexcombination of prior and sampling moment matrices,

Page 305: Pukelsheim Optimal DoE

272 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

with a specifying the sampling weight that the experimenter ascribes to theobservational evidence that is to complement the prior information.

The Bayes design problem calls for finding a moment matrix M such thatthe mean squared-error matrix (4) is further minimized. As in the generaldesign problem of Section 5,15, we switch to maximizing the inverse meansquared-error matrix, (K'M~lK)~l = CK(Ma), and recover the familiar in-formation matrix mapping CK. This mapping is defined on the closed coneNND(&), whence we can dispense with the full rank assumption on MO = RQ l.As a final step we let a vary continuously in the interval [0; 1].

11.3. BAYES LINEAR MODELS WITH NORMAL-GAMMA PRIORDISTRIBUTIONS

In this section we specify the joint distribution of Y, 6, and o2 completely,as a counterpart to the classical linear model with normality assumption ofSection 1.4. The joint distribution is built up from conditional distributionsin three steps.

The distribution of Y, conditionally with 6 and tr2 given, is normal as inSection 1.4:

The distribution of 0, conditionally with a2 given, is also normal:

where the prior parameters are identical to those in (2) of Section 11.1. Thedistribution of the inverse of a2 is a gamma distribution:

with prior form parameter «o > 0 and prior precision parameter /3o > 0.Assumptions (1), (2), and (3) are called a Bayes linear model with a normal-gamma prior distribution.

Assumption (2) respects the scaling issue, discussed at some length inSection 11.1. In (3), we parametrize the gamma distribution F^^, in orderto have Lebesgue density

with expectation oo/A) and variance 2a0//3o- These two moments may guidethe experimenter to select the prior distribution (3). For OQ > 2, the moment

Page 306: Pukelsheim Optimal DoE

11.4. NORMAL-GAMMA POSTERIOR DISTRIBUTIONS 273

of order -1 exists:

say. Solving for OQ we obtain «0 = 2 + /3/0-Q. The parametrization F2+ft)/(72.A)

is more in line with assumption (3) of Section 11.1, in that the parametersare prior model variance ofi and prior precision j3o- Note that Fy;1 is the^-distribution with / degrees of freedom.

The statistician cannot but hope that the normal-gamma family givenby (2) and (3) allows the experimenter to model the available prior informa-tion. If so, they profit from the good fortune that the posterior distribution,the conditional distribution of 6 and a2 given the observations Y, is fromthe same family.

11.4. NORMAL-GAMMA POSTERIOR DISTRIBUTIONS

Lemma. In the Bayes linear model with a normal-gamma prior distribu-tion, let the prior dispersion matrix RQ be positive definite. Then the posteriordistribution is the normal-gamma distribution given by

where 6 is the Bayes estimator (1) of Section 11.2, and where the posteriorprecision increase is

with mean

Proof. With z = l/o-2, the posterior distribution has a Lebesgue densityproportional to the product of the densities from assumptions (1), (2), and (3)of Section 11.3,

Page 307: Pukelsheim Optimal DoE

274 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

where the quadratic form Q(B) involves )8i from (3),

With gamma density 'Xa0+/j;#)+0i(z) as given in Section 11.3, (1) and (2) followfrom

Lemma 3.12 entails nonnegativity of ft, as a Schur complement in thenonnegative definite matrix

The determination of the expected value of fii parallels the computation ofthe mean squared-error matrix in part I of the proof of Lemma 11.2,

The posterior distribution testifies to what extent the experimental dataalter the prior opinion as laid down in the prior distribution. More evidenceshould lead to a firmer opinion. Indeed, for the mean parameter vector 0,the posterior distribution (1) is less dispersed than the prior distribution (2)from Section 11.3,

and for the inverse model variance l/<r2, the precision parameter in the pos-terior distribution (2) exceeds that of the prior distribution (3) in Section 11.3,A) + ft > A)-

Page 308: Pukelsheim Optimal DoE

11.5. THE BAYES DESIGN PROBLEM 275

The prior and posterior conditional distributions of 6 are so easily com-pared because the common conditioning variable a2 appears in both disper-sion matrices in the same fashion and cancels, and because (4) is free ofthe conditioning variable Y. The situation is somewhat more involved whenit comes to comparing the prior and posterior distributions of 1 /a2. Theposterior precision increase Pi depends on the observations y, whence nodeterministic statement on the behavior of ft is possible other than that it isnonnegative. As a substitute, we consider its expected value, E/>[ft] = na^,called preposterior precision increase. This quantity is proportional to thesample size «, but is otherwise unaffected by the design and hence providesno guidance towards its choice.

For the design of experiments, (1) suggests minimizing (no/?^1 + X'X)~l.With a transition to moment matrices, R^1 = MQ and X'X = nM, this callsfor maximization of the Bayes moment matrices

In summary, the Bayes design problem remains the same whether we adoptthe moment assumptions of Section 11.1, or the normal-gamma prior distri-butions of Section 11.3.

11.5. THE BAYES DESIGN PROBLEM

Given a prior moment matrix MQ € NND(/c) and a sampling weight a G [0; 1],the matrix Ma(g) = (1 - a)MQ + aM(g) is called the Bayes moment matrixof the design £ e H. The Bayes design problem is the following:

A moment matrix M that attains the maximum is called Bayes <j> -optimalfor K'O in M. If there is no sampling evidence, a = 0, then there is only onecompetitor, MQ, and the optimization problem is trivial. If all the emphasis ison the sampling portion, a = 1, then we recover the general design problemof Section 5.15. Hence we assume a e (0; 1).

The criterion function M \-> <f> o CK((l — a)MQ + aM) fails to be aninformation function for lack of homogeneity. As a remedy, we transformthe set M. of competing moment matrices into

The set Ma inherits compactness and convexity from M. If M intersectsthe feasibility cone A(K), then so does Ma, by Lemma 2.3. Now, in the

Page 309: Pukelsheim Optimal DoE

276 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

formulation

the Bayes design problem is just one manifestation of the general designproblem of Section 5.15. That is, the general design problem is general enoughto comprise the Bayes design problem as a special case.

As an illustration, we consider the question of whether optimal Bayesmoment matrices for K'0 are necessarily feasible. From a Bayes point ofview, the prior moment matrix MQ often is positive definite; then the set Ma

is included in the open cone PD(/c). Hence by part (a) of the ExistenceTheorem 7.13, all formally ^-optimal moment matrices for K'B in Aia areautomatically feasible for K'B. Thus, for the Bayes design problem, the ex-istence issue is much less pronounced. As another example, we present theBayes version of the General Equivalence Theorem 7.14.

11.6. GENERAL EQUIVALENCE THEOREM FOR BAYES DESIGNS

Theorem. Assume that a prior moment matrix M0 € NND(fc) and asampling weight a e (0; 1) are given. Let the matrix M e M be such thatthe Bayes moment matrix B = (1 — a)M0 + aM is feasible for K'0, withinformation matrix C - CK(B). Then M is Bayes ^-optimal for K'0 in Mif and only if there exists a nonnegative definite s x s matrix D that solvesthe polarity equation

and there exists a generalized inverse G of B such that the matrix N =GKCDCK'G' satisfies the normality inequality

In the case of optimality, equality obtains in the normality inequality if for Awe insert any matrix M e M that is Bayes $-optimal for K'0 in M.

Proof. Evidently M is Bayes <£-optimal for K '0 in M if and only if B is</>-optimal for K'0 in Ma = {(1 - a)MQ + aA: Ae M}. To the latter prob-lem, we apply the General Equivalence Theorem 7.14. There, the normalityinequality is

Page 310: Pukelsheim Optimal DoE

11.7. DESIGNS WITH PROTECTED RUNS 277

This is equivalent to a trace AN < 1 - (1 - a) trace MQN, for all A e M.Because of trace BN — 1, the right hand side is trace (B - (1 - a)Mo)N —a trace MN.lna trace AN < a trace MN we cancel a to complete the proof.

Three instances may illustrate the theorem. For a matrix mean <j>p withp e (-00; 1), the matrix M e M is Bayes <f>p-optimal for K'B in M if andonly if some generalized inverse G of B = (1 — a)Mo + aM satisfies

Indeed, in the theorem we have N = GKCp+lK'G'/irace Cp, and the com-mon factor trace Cp cancels on both sides of the inequality.

Second, for scalar optimality, M e M is Bayes optimal for c'B in M ifand only if there exists a generalized inverse G ot B = (l-a)A/o + aM thatsatisfies

Third, we consider the situation that the experimenter wants to minimize amixture of the standardized variances c'Ma(g)~c where c is averaged relativeto some distribution /A. With W — JR* cc'rf/x e NND(fc), this calls for theminimization of a weighted trace, trace WMa(£)~. An optimal solution iscalled Bayes linearly optimal for 6 in M. This is the Bayes version of thecriterion that we discussed in Section 9.8. Again it is easy to see that theproblem is equivalent to finding the <£_i-optimal moment matrix for H'6in Ma, where W = HH' is a full rank decomposition. Therefore M e M isBayes linearly optimal for B in M if and only if some generalized inverse Gof B = (1 - a)M0 + aM fulfills

The optimization problem for Bayes designs also occurs in a non-Bayessetting.

11.7. DESIGNS WITH PROTECTED RUNS

Planned experimentation often proceeds in consecutive steps. It is then de-sirable to select the next, new stage by taking into account which design £0

was used in the previous, old stage. Let us presume that n0 observationswere taken under the old design £0- If the new sample size is n and the newdesign is £, then joint evaluation of all n0 + n observations gives rise to the

Page 311: Pukelsheim Optimal DoE

278 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

per observation moment matrix

where a = n/(n$ + n] designates the fraction of observations of the newexperiment. The complementary portion 1 - a consists of the "protected"experimental runs of the old design &• Therefore an optimal augmentationof the old design £0 is achieved by maximizing a criterion of the form (f> oC/i:(Afa(£)). This optimization problem coincides with that motivated by theBayes approach and Theorem 11.6 applies.

Another way of viewing this problem is that the weights of the old designprovide a lower bound a(x) = &(*) f°r the weights of the new design £.There may also be problems which, in addition, dictate an upper bound b(x).For instance, it may be costly to make too many observations under theregression vector x. For the designs with weights bounded by a and b, theGeneral Equivalence Theorem 7.14 specializes as follows.

11.8. GENERAL EQUIVALENCE THEOREM FOR DESIGNS WITHBOUNDED WEIGHTS

Theorem. Assume a[a\b] = {£ e H : a(x) < £(x) < b(x) for all x e X}is the set of designs with weights bounded by the functions a, b : X —> [0; 1].Let £ e H[a;6] be a design that is feasible for K'B, with information matrixC = C/c(A/(£)). Then £ is ^-optimal for K'O in E[«;fc] if and only if thereexists a nonnegative definite s x s matrix D that solves the polarity equation

and there exists a generalized inverse G of M(g) such that the matrix N =GKCDCK'G' satisfies the normality inequality

Proof. The set M(H [«;&]) of moment matrices is compact and convex,and intersects A(K) by the assumption on £. Hence the General EquivalenceTheorem 7.14 applies and yields the present theorem but for the normalityinequality

We prove that (1) and (2) are equivalent.First we show that (1) entails (2). For any two vectors y, z 6 X with g(y)

and £(z) lying between both bounds, the inequalities in (1) go either way and

Page 312: Pukelsheim Optimal DoE

11.8. GENERAL EQUIVALENCE THEOREM FOR DESIGNS WITH BOUNDED WEIGHTS 279

become an equality. Hence there exists a number c > 0 such that we have,for all x € X,

For a competing design 17 e E[a;b], let x\,... ,xe be the combined supportpoints of 17 and £. We have the two identities

the latter invoking 1 = trace M(£)N. Splitting the sum into three terms andobserving a(x) < 7j(jt) < b(x), we obtain (2) from

Conversely, that (2) implies (1), is proved indirectly. Assuming (1) is falsethere are regression vectors y,z 6 X with £(v) < b(y) and £(z) > a(z)fulfilling e = y 'Ny -z'Nz > 0. We choose 8 > 0 and define 17 in such a waythat

Therefore 17 is a design in H[«; b], with moment matrix M(17) = M(£)+d(yy '-zz'). But trace M(rj)N = l + 8e > 1 violates the normality inequality (2). D

As an example we consider a constraint design for a parabola fit model, inpart II of the following section. We place the example into the broader con-text of model discrimination, to be tackled by> different means in Section 11.18and Section 11.22.

Page 313: Pukelsheim Optimal DoE

280 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

11.9. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIALFIT MODELS, I

With experimental domain [-1;1], suppose the experimenter hopes that asecond-degree polynomial fit model will adequately describe the data. Yet itdeems desirable to guard against the occurrence of a third-degree term. Thiscalls for a test, in a third-degree model, of 03 = 0 versus #3 ̂ 0. If thereis significant evidence for 03 not to vanish, then the third-degree model isadopted, with parameter vector 0(3) = (0o> #i? #2> 03)'- Otherwise a second-degree model will do, with parameter vector 0(2) = (0o, 0i> flz)'-

The experimenter proposes that the determinant criterion fe is appropri-ate for evaluating the designs, in the present situation. The statistician advisesthe experimenter that there are many designs that provide information simul-taneously for 03, 0(3), and 0(2). The efficiencies of the following designs aretabulated in Exhibit 11.1 in Section 11.22, along with others to be discussedlater. The efficiencies are computed relative to the optimal designs, so theseare reviewed first.

i. In a third-degree model, the optimal design for the individual compo-nent 03 = e3'0(3) minimizes T ̂ e^M^(r)e^. The solution is the arcsin supportdesign r from Section 9.12, with weights, third-degree moment matrix, andinverse given by

Dots indicate zeros. The optimal information for 03 is 1/16 = 0.0625.ii. In a third-degree model, the ^-optimal design for the full vector 0(3)

maximizes T i-+ (det M3(T))1/4. The solution T is shown in Exhibit 9.4 inSection 9.6,

Page 314: Pukelsheim Optimal DoE

11.9. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, I 281

The <fo-optimal information for 0(3) is 0.26750.Hi. In a second-degree model, the <£o-optimal design r for the vector 0(2)

maximizes r H-» (det M^r))1/3, and is mentioned in Section 9.5, r(±l) =T(0) = 1/3. The (^-optimal information for 0(2) is 0.52913. Under this design,in a third-degree model, neither the vector 0(3) nor the component 03 areidentifiable.

I. An allocation with some appeal of symmetry and balancedness is theuniform design T on five equispaced points,

It has respective efficiencies of 72%, 94%, and 84% for 03, 0(3), and 0(2). Ofcourse, there is no direct merit in the constant spacing. The high efficienciesof r are explained by the fact that the support set T = (±1, ±1/2,0} is theunion of the second-degree and third-degree arcsin support, as introduced inSection 9.6. We call T the arcsin support set for the discrimination betweena second-degree and a third-degree model.

II. As an alternative, half of the observations are drawn according to theold design T from the previous paragraph while the other half r is adjoinedin a <fo-optimal way for 0(2), that is, f solves the maximization problem r H-»(det M2(\r+ |r))1/3. The resulting design r = \(r + r) is

Page 315: Pukelsheim Optimal DoE

282 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

This design has respective efficiencies of 42%, 89%, and 94% for #3, 0(3),and 0(2).

To see that r is the <fo-optimal augmentation of the old design T, werepresent it as TW = |(T + T) with r(±l) = w, ?(0) = l-2w. That is, the newpart T is a symmetric design supported by the second-degree arcsin supportpoints ±1,0, and placing weight 2w on the boundary {±1}.

In the one-parameter subclass {TW : w € (0; 1/2)} we might use differentialcalculus to maximize the objective function w ̂ <^(M2(TW,)) = \<fa(M2(r} +M2(T)). However, reverting to calculus means ignoring the achievements ofthe General Equivalence Theorem which, after all, is derived from nothingbut (sub)differential calculus.

The design TW has second-degree moment matrix M(w) and its inverse aregiven by

where d = w + 17/80 - (w + 1/4)2. It is a member of the class T[a; 1] ofdesigns with bounded weights, with lower bound function a(t) = 1/10 fort = ±1,±1/2,0 and a(t) = 0 elsewhere. Theorem 11.8 states that z'M(w)~lzis constant for those z = f(t} for which the new weight r(t) is positive. Ourconjecture that one of the designs rw solves the problem requires us to con-sider positivity of r(t) for the points t — ±1,0. That is, the sum of all entriesof (M(vf))~1 must be equal to the top left component. This leads to the equa-tion w2-H>/6 = 11/120. The relevant solution is w = (1+^/71/5)712 = 0.3974.

For the design T = T#, the moment matrix M(w] and its inverse are

The lower bound a(t) is exceeded only at / = ±1,0 and, by constructionof iv, we have f(t) 'M(w)~lf(t) — 3.2. All weights r(t) stay below the constant

Page 316: Pukelsheim Optimal DoE

11.10. MIXTURES OF MODELS 283

upper bound 1. Therefore the normality inequality of Theorem 11.8 becomes

The inequality holds true by the very contruction of w. Namely, the polyno-mial P ( t ) = f ( t ) ' M ( w ) - l f ( t ) = 3.2-5.3f2+5.3f4 satisfies P(0) == P(±l) = 3.2and has a local maximum at 0. Thus P is on [-1; 1] bounded by 3.2, and op-timality of the design r is established. Alternatively we could have utilizedTheorem 11.6.

There are other ways to find designs that perform equally well acrossvarious models. We explore two of them, to evaluate a mixture of the infor-mation obtained from each model and, from Section 11.19 on, to maximizethe information in a first model, within a subclass of designs that achieve aprescribed efficiency in a second model.

11.10. MIXTURES OF MODELS

The problem of maximizing mixed information from different models is rel-evant when the experimenter considers a number of potential models to de-scribe the data, and wants to design the experiment to embrace each modelin an efficient manner. Hence suppose that on a common experimental do-main T, we are given m different regression functions,

Eventually the task is one of finding a design on T, but again we first con-centrate on the optimization problem as it pertains to the moment matrices.With model dimensions equal to k\,... ,km, we introduce the cone

in the Euclidean space Sym(A:i x • • • x km} — Sym(A:i) x • • • x Sym(km). Thescalar product of two members A = (Ai,...,Am) and B = (Bi,...,Bm)in Sym(A;1 x • • • x km} is the sum of the trace scalar products, (A, B) =trace A\B\ + • • • + trace AmBm. In the space Sym(A:i x • • • x km), the coneNND(£i x • • • x km) is closed and convex; for short we write A > 0 whenA e NND(*i x • - . x km).

In the i th model, the moment matrix M, is evaluated through an informa-tion function «//,. Compositions of the form <fr o CK. are but special manifes-tations of \l/i. This gives rise to the compound function

Page 317: Pukelsheim Optimal DoE

284 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

Thus ^(M) = (tj/i(Mi),..., i]/m(Mm))' is the vector of information numbersfrom each of the m models, and takes its values in the nonnegative orthantra? = [0;ooT.

To compress these m numbers into a single one, we apply an informationfunction <J> on R™, that is, a function

which is positively homogeneous, superadditive, nonnnegative, nonconstant,and upper semicontinuous. The prime choices are the vector means 4>p withp e [-00; 1], from Section 6.6. Thus the design problem is the following,

where the subset M^ C NND(fci x • • • x km) is taken to be compact andconvex.

Increasing complexity to formulate the design problem does not necessar-ily mean that the problem itself is any more difficult than before. Here isa striking example. Suppose the regression function / = (/i, ...,/*)' has kcomponents, and the experimenter is uncertain how many of them ought tobe incorporated into the model. Thus the rival models have moment matri-ces MI arising from the initial sections /(,) of /,

In the / th model, the coefficient vector of interest is e, = (0,... ,0,1)' G R',in order to find out whether this model is in fact of degree /. The criterionbecomes a Schur complement in the matrix A — A/, by blocking off AH =Mi-i,

see Lemma 3.12 and its proof. If the information from the k models is aver-aged with the geometric mean 3>0 °

n ^+»then tne criterion turns into

Page 318: Pukelsheim Optimal DoE

11.11. MIXTURES OF INFORMATION FUNCTIONS 285

Thus we simply recover determinant optimality in the model with regressionfunction /, despite the challenging problem formulation.

Generally, for maximizing 4> o tf/ over M^m\ little is changed comparedto the general design problem of Section 5.15. The underlying space hasbecome a Cartesian product and the optimality criterion is a composition ofa slightly different fashion. Other than that, its properties are just the sameas those that we encountered in the general design problem. The mappingty is an Um-valued information function on NND(&i x • • • x km), in that it ispositively homogeneous, superadditive, nonnegative, nonconstant, and uppersemicontinuous, relative to the componentwise partial ordering A > 0 thatthe cone IR™ induces for vectors A in the space Rm. Therefore its propertiesare exactly like those that the mapping CK enjoys relative to the Loewnerordering C > 0 on the space Sym(s') (see Theorem 3.13). With the present,extended terminology, the information matrix mapping CK is an instance ofa Sym(s)-valued information function on NND(/c).

The following lemma achieves for the composition <I> o if/ what Theo-rem 5.14 does for the composition </> o CK, in showing that $ o ty enjoys allthe properties that constitute an information function on NND(A;1 x • • • x km),and in computing its polar.

11.11. MIXTURES OF INFORMATION FUNCTIONS

Lemma. Let <J> be an information function on IR™, and let if/i,...,if/m

be information functions on NND^),... ,NND(/:m), respectively. Set ty —( « A i , . . . , \ltm)' and tf/°° = (i/^00,..., ty™)'. Then 4> o if/ is an information func-tion on NND(A:1 x • • • x km), with polar function (4> o if/)00 — 4>°° o i^00.

Proof. The same steps o-v as in the proof of Theorem 5.14 show that<£ = <J> o i/r is an information function. Also the polarity relation is establishedin a quite similar fashion, based on the representation

With level sets {tf/ > A} - {A > 0 : ^r(A) > A}, for all A € R™, the unitlevel set of the composition <£ o tf/ is

For all B e KNO^ x • • - x km) and A e R?, we get

Page 319: Pukelsheim Optimal DoE

286 CHAPTER 11: BAYES DESIGNS AND DISCRIMINATION DESIGNS

The last line uses $\inf_{A_i\in\{\psi_i\ge\lambda_i\}}\langle A_i,B_i\rangle = \lambda_i\psi_i^\infty(B_i)$. This is so since for $\lambda_i > 0$ we have $\{\psi_i \geq \lambda_i\} = \lambda_i\{\psi_i \geq 1\}$, while for $\lambda_i = 0$ both sides are 0.

Finally we apply (1) to $\Phi\circ\psi$ and then to $\Phi$ to obtain from (2) and (3), for all $B \geq 0$,

The grand assumption of Section 4.1, $\mathcal{M}\cap\mathcal{A}(K) \neq \emptyset$, ensures that for some moment matrix $M \in \mathcal{M}$, the information matrix $C_K(M)$ falls into $\mathrm{PD}(s)$, the interior of the cone $\mathrm{NND}(s)$ where an information function $\phi$ is defined. For the present criterion $\Phi\circ\psi$, the analogous grand assumption requires that for some tuple $M \in \mathcal{M}^{(m)}$, we have $\psi_i(M_i) > 0$ for all $i \leq m$, that is, at least one setting leads to positive information in each model. For such a tuple $M$ the vector $\psi(M)$ is positive, and hence lies in the interior of the nonnegative orthant $\mathbb{R}^m_+$ where $\Phi$ is defined.

11.12. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF MODELS

Theorem. For the problem of Section 11.10, let $M = (M_1,\ldots,M_m) \in \mathcal{M}^{(m)}$ be a tuple of moment matrices with $\psi_i(M_i) > 0$ for all $i \leq m$. Then $M$ maximizes $\Phi\circ\psi(A)$ over $A \in \mathcal{M}^{(m)}$ if and only if for all $i \leq m$ there exist a number $\alpha_i \geq 0$ and a matrix $N_i \in \mathrm{NND}(k_i)$ that solve the polarity equations


and that jointly satisfy the normality inequality

Proof. Just as in the Subgradient Theorem 7.4, we find that $M \in \mathcal{M}^{(m)}$ maximizes the function $\Phi\circ\psi : \mathrm{NND}(k_1\times\cdots\times k_m) \to \mathbb{R}$ over $\mathcal{M}^{(m)}$ if and only if there is a subgradient $B$ of $\Phi\circ\psi$ at $M$ that is normal to $\mathcal{M}^{(m)}$ at $M$.

Theorem 7.9, adapted to the present setting, states that a subgradient $B$ exists if and only if the tuple $N = B/(\Phi\circ\psi)(M)$ satisfies the polarity equation for $\Phi\circ\psi$,

where we have set $N = (\bar N_1,\ldots,\bar N_m)$. The polarity inequalities for $\Phi$ and for the $\psi_i$ yield

Thus (4) splits into the conditions

The numbers $\alpha_i = \operatorname{trace} M_i\bar N_i \geq 0$ sum to 1, because of (5). If $\alpha_i > 0$, then we rescale (7) by $1/\alpha_i$ to obtain (2); conversely, (2) implies (7) with $\bar N_i = \alpha_iN_i$. If $\alpha_i = 0$, then $\psi_i(M_i) > 0$ forces $\psi_i^\infty(\bar N_i) = 0$ in (7), and $N_i = 0$ is (another) particular solution to (5), (6), and (7). With $\bar N_i = \alpha_iN_i$, (5), (6), and (7) are now seen to be equivalent to (1) and (2). The normality of $B$ to $\mathcal{M}^{(m)}$ at $M$ then translates into condition (3).

The result features all the characteristics of a General Equivalence Theorem. The polarity equation (1) corresponds to the component function $\Phi$; it serves to compute the weighting $\alpha_1,\ldots,\alpha_m$. The polarity equations in (2) come with the individual criteria $\psi_i$; they determine the matrices $N_i$. The normality inequality (3) carries through a comparison that is linear in the competing tuples $A \in \mathcal{M}^{(m)}$. It evaluates not the $m$ normality inequalities


$\operatorname{trace} A_iN_i \leq 1$ individually, but their average with respect to the weighting $\alpha_i$. More steps are required than before, but each one alone is conceptually no more involved than those met earlier.

The vector means $\Phi_p$ and the compositions $\psi_i = \phi_i\circ C_{K_i}$ let the result appear yet closer to the General Equivalence Theorem 7.14.

11.13. MIXTURES OF MODELS BASED ON VECTOR MEANS

Theorem. Consider a vector mean $\Phi_p$ of finite order, $p \in (-\infty;1]$, and compositions of the form $\psi_i = \phi_i\circ C_{K_i}$, with information functions $\phi_i$ on $\mathrm{NND}(s_i)$, for all $i \leq m$. Let the tuple $(M_1,\ldots,M_m) \in \mathcal{M}^{(m)}$ be such that for all $i \leq m$, the moment matrix $M_i$ is feasible for $K_i'\theta$, with information matrix $C_i = C_{K_i}(M_i)$.

Then $(M_1,\ldots,M_m)$ maximizes $\Phi_p\bigl(\phi_1\circ C_{K_1}(A_1),\ldots,\phi_m\circ C_{K_m}(A_m)\bigr)$ over $(A_1,\ldots,A_m) \in \mathcal{M}^{(m)}$ if and only if for every $i \leq m$ there exists a matrix $D_i \in \mathrm{NND}(s_i)$ that solves the polarity equation

and there exists a generalized inverse $G_i$ of $M_i$ such that the matrices $N_i = G_iK_iC_iD_iC_iK_i'G_i'$

jointly satisfy the normality inequality

Proof. For a vector mean $\Phi_p$, the solution of the polarity equation is given by the obvious analogue of Lemma 6.16. With this, property (1) of Theorem 11.12 requires $\alpha_i/\psi_i(M_i)$ to be positive and positively proportional to $\bigl(\psi_i(M_i)\bigr)^{p-1}$. This determines the weighting $\alpha_i$. The polarity equations (2) of Theorem 11.12 reduce to the present ones, just as in the proof of the General Equivalence Theorem 7.14.

We demonstrate the theorem with a mixture of scalar criteria, across $m$ different models. Let us assume that in the $i$th model, a scalar parameter system given by a coefficient vector $c_i \in \mathbb{R}^{k_i}$ is of interest. The design $\tau$ in $\mathcal{T}$ is taken to be such that the moment matrix $M_i = \int_T f_i(t)f_i(t)'\,d\tau$ is feasible, $M_i \in \mathcal{A}(c_i)$. For the application of the theorem, we get


Therefore $(M_1,\ldots,M_m) \in \mathcal{M}^{(m)}$ makes $\Phi_p\bigl((c_1'M_1^-c_1)^{-1},\ldots,(c_m'M_m^-c_m)^{-1}\bigr)$ a maximum over $\mathcal{M}^{(m)}$ if and only if for all $i \leq m$ there exists a generalized inverse $G_i$ of $M_i$ such that, for all $(A_1,\ldots,A_m) \in \mathcal{M}^{(m)}$,

The discussion of Bayes designs in Section 11.6 leads to the average of various criteria, not in a set of models, but in one and the same model. This is but a special instance of the present problem.

11.14. MIXTURES OF CRITERIA

We now consider the situation that the experimenter models the experiment with a single regression function

but wishes to implement a design in order to take into account $m$ parameter systems of interest $K_i'\theta$, of dimension $s_i$, evaluating the information matrix $C_{K_i}(M)$ by an information function $\phi_i$ on $\mathrm{NND}(s_i)$. This comprises the case that a single parameter system $K'\theta$ is judged on the grounds of $m$ different criteria $\phi_1,\ldots,\phi_m$.

With the composite information functions $\psi_i = \phi_i\circ C_{K_i}$ on $\mathrm{NND}(k)$, we again form the compound function

Carrying out the joint evaluation with an information function $\Phi$ on $\mathbb{R}^m_+$, the problem now is one of studying a mixture of information functions:

where the set $\mathcal{M} \subseteq \mathrm{NND}(k)$ of competing moment matrices is compact and convex. Upon defining the set $\mathcal{M}^{(m)} \subseteq \mathrm{NND}(k\times\cdots\times k)$ to be $\mathcal{M}^{(m)} = \{(M,\ldots,M) : M \in \mathcal{M}\}$, the problem submits itself to the General Equivalence Theorem 11.12.


11.15. GENERAL EQUIVALENCE THEOREM FOR MIXTURES OF CRITERIA

Theorem. For the problem of Section 11.14, let $M \in \mathcal{M}$ be a moment matrix with $\psi_i(M) > 0$ for all $i \leq m$. Then $M$ maximizes $\Phi\circ\psi(A,\ldots,A)$ over $A \in \mathcal{M}$ if and only if for all $i \leq m$ there exist a number $\alpha_i \geq 0$ and a matrix $N_i \in \mathrm{NND}(k)$ that solve the polarity equations

such that the matrix $N = \sum_{i\le m}\alpha_iN_i$ satisfies the normality inequality

Proof. The result is a direct application of Theorem 11.12.

Again the result becomes more streamlined for the vector means $\Phi_p$, and compositions $\psi_i = \phi_i\circ C_{K_i}$. In particular, it suffices to search for a single generalized inverse $G$ of $M$ to satisfy the normality inequality.

11.16. MIXTURES OF CRITERIA BASED ON VECTOR MEANS

Theorem. Consider a vector mean $\Phi_p$ of finite order, $p \in (-\infty;1]$, and compositions of the form $\psi_i = \phi_i\circ C_{K_i}$, with information functions $\phi_i$ on $\mathrm{NND}(s_i)$, for all $i \leq m$. Let $M \in \mathcal{M}$ be a moment matrix that is feasible for $K_i'\theta$, with information matrix $C_i = C_{K_i}(M)$. Then $M$ maximizes $\Phi_p\bigl(\phi_1\circ C_{K_1}(A),\ldots,\phi_m\circ C_{K_m}(A)\bigr)$ over $A \in \mathcal{M}$ if and only if for all $i \leq m$ there exists a matrix $D_i \in \mathrm{NND}(s_i)$ that solves the polarity equation

and there exists a generalized inverse G of M such that the matrix

satisfies the normality inequality


Proof. Theorem 11.13 provides most of the present statement, except for a set of generalized inverses $G_1,\ldots,G_m$ of $M$. We need to prove that there is a single generalized inverse $G$ of $M$ that works for all $i \leq m$. We define the matrix $N = \sum_{i\le m}\alpha_iN_i$, with $N_i = G_iK_iC_iD_iC_iK_i'G_i'$ as in the proof of Theorem 11.13. Our argument is akin to that of establishing part (a) of Corollary 10.11. Let $r$ be the rank of $MNM$, and choose a full rank decomposition $MNM = KK'$. We have $\operatorname{trace} K'M^-K = \operatorname{trace} MN = 1$.

We show that $M$ is $\phi_{-1}$-optimal for $K'\theta$ in $\mathcal{M}$, with optimal value

Optimality follows from the Mutual Boundedness Theorem 7.11. The matrix $N$ is feasible for the dual problem. From $(K'M^-K)^2 = K'M^-MNMM^-K = K'NK$, we get $(K'NK)^{1/2} = K'M^-K$. Hence the dual criterion takes the value

Now $\phi_{-1}\bigl(C_K(M)\bigr) = 1/\phi_{-1}^\infty(K'NK)$ shows that $M$ is $\phi_{-1}$-optimal for $K'\theta$ in $\mathcal{M}$.

From Theorem 7.19, there exists a generalized inverse $G$ of $M$ such that

This is the normality inequality we wish to establish. On the left hand side we obtain

The right hand side is $\operatorname{trace} K'M^-K = \operatorname{trace} MN = 1$.

As in Section 11.13, we demonstrate the theorem with a mixture of scalar criteria, this time in one and the same model. Again let the design $\tau$ in $\mathcal{T}$ be feasible for the scalar parameter systems $c_i'\theta$ for all $i \leq m$. Then the maximum of $\Phi_p\bigl((c_1'A^-c_1)^{-1},\ldots,(c_m'A^-c_m)^{-1}\bigr)$ over $A \in \mathcal{M}$ is attained at $M \in \mathcal{M}$ if and only if there exists a generalized inverse $G$ of $M$ such that


where $W = \cdots\,$. For the harmonic mean $\Phi_{-1}$, we get $W = \cdots\,$, where $\rho$ is the uniform weighting on the points $c_1,\ldots,c_m$; this is inequality (3) of Section 11.6 with $\alpha = 1$.

11.17. WEIGHTINGS AND SCALINGS

In general we admit an arbitrary information function $\Phi$ on $\mathbb{R}^m_+$, to average the information from $m$ distinct sources. Specifically, with $p = 1, 0, -1$, the vector means $\Phi_p$ comprise the arithmetic mean, the geometric mean, and the harmonic mean.

However, it is always the arithmetic mean that is utilized to average the $m$ individual normality inequalities $\operatorname{trace} A_iN_i \leq 1$, albeit with a weighting $\alpha_i$ which is generally not uniform. The reason is that the normality inequality, pertaining to (sub)gradients and derived from (sub)differential calculus, is intrinsically linear. For averaging subgradients, a method other than a linear one has no place in the theory.

For a vector mean $\Phi_p$ and compositions $\psi_i = \phi_i\circ C_{K_i}$, the weights are
$$\alpha_i = \frac{\bigl(\phi_i\circ C_{K_i}(M_i)\bigr)^p}{\sum_{j\le m}\bigl(\phi_j\circ C_{K_j}(M_j)\bigr)^p}, \qquad i = 1,\ldots,m.$$

Only the geometric mean $\Phi_0$ leads to a constant and uniform weight $\alpha_i = 1/m$. Otherwise, the weights $\alpha_i$ vary with the value of the criterion function. For example, in the case of the harmonic mean, $p = -1$, two models with information $\phi_1(C_1) = 1$ and $\phi_2(C_2) = 9$ yield $\alpha_1 = 0.9$ and $\alpha_2 = 0.1$. That is, the emphasis is to improve upon models with comparatively little information. The weighting $\alpha_i$ reflects a relative scaling among the optimality criteria $\psi_i = \phi_i\circ C_{K_i}$.
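A small numerical sketch of this weighting rule may be helpful; it assumes the proportionality $\alpha_i \propto \psi_i(M_i)^p$ displayed above, and the function name and values are purely illustrative.

```python
import numpy as np

def vector_mean_weights(info_values, p):
    """Weights alpha_i used to average the normality inequalities for the
    vector mean Phi_p, assuming alpha_i proportional to psi_i^p."""
    psi = np.asarray(info_values, dtype=float)
    if p == 0:
        return np.full(psi.shape, 1.0 / psi.size)   # geometric mean: uniform weights
    w = psi ** p
    return w / w.sum()

# Harmonic mean (p = -1) with informations 1 and 9, as in the text:
print(vector_mean_weights([1.0, 9.0], p=-1))   # -> [0.9, 0.1]
print(vector_mean_weights([1.0, 9.0], p=0))    # -> [0.5, 0.5]
```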

Sensible scaling is a challenging issue when it comes to combining information that originates from different sources. We have emphasized this in Section 11.1, in the context of Bayes modeling, and it is also relevant for mixtures of models. The matrix means $\phi_p$ on $\mathrm{NND}(s)$ are standardized, $\phi_p(I_s) = 1$. This is convenient, and unambiguous, as long as a single model is considered. For the purpose of mixing information across various models, this standardization is purely coincidental and likely to be meaningless. It is up to the experimenter to tell which scaling is appropriate for the situation.

One general method is to select the geometric mean $\Phi_0$. This is the only vector mean which is positively homogeneous (of degree $1/m$) separately in each variable. Hence a comparison based on $\Phi_0$ is unaffected by scaling. As seen above, the averaging of the normality inequalities is then uniform, expressing scale invariance in another, dual way.

An alternate solution is to scale each model by its optimal value. In other words, the information criterion $\phi_i\circ C_{K_i}(M_i)$ is substituted by the efficiency


criterion $\phi_i\circ C_{K_i}(M_i)/v_i$, where $v_i = \max_{M_i\in\mathcal{M}_i}\phi_i\circ C_{K_i}(M_i)$. This method is computationally expensive as it requires the $m$ optimal values $v_i$ to be calculated first.

11.18. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, II

We continue the discussion of Section 11.9 of finding efficient designs for discriminating between a second-degree and a third-degree model, for a polynomial fit on $[-1;1]$.

I. The geometric mean $\Phi_0$ of the $\phi_0$-criterion for $\theta_{(2)}$ in the second-degree model and the $\phi_0$-criterion for $\theta_{(3)}$ in the third-degree model is

The design $\tau$ which maximizes this criterion is

Again dots indicate zeros. The respective efficiencies for $\theta_3$, $\theta_{(3)}$, and $\theta_{(2)}$ are 66%, 98%, and 91%. The value of the optimality criterion is 0.35553.

In order to verify optimality, we refer to Theorem 11.13. It is convenient to work with indices $i = 2, 3$, for the second-degree and third-degree models. For $i = 2$, we get $G_2 = M_2(\tau)^{-1}$, $K_2 = I_3$, $C_2 = M_2(\tau)$, $D_2 = M_2(\tau)/3$, and $\alpha_2 = 1/2$. Similar quantities pertain to $i = 3$. Hence the normality inequality turns into $P(t) \leq 1$ for all $t \in [-1;1]$, where $P$ is given by


Since the polynomial $P$ attains the value 1 in $\pm1$ and in $\pm\sqrt{17/117}$, and has local maxima at $\pm\sqrt{17/117}$, it is bounded by 1 on $[-1;1]$. Therefore the design $\tau$ maximizes the geometric mean of $\phi_0\bigl(M_2(\tau)\bigr)$ and $\phi_0\bigl(M_3(\tau)\bigr)$, on the experimental domain $T = [-1;1]$.

II. As an alternative, we propose the design that maximizes the same criterion but on the five-point arcsin support set $T = \{\pm1, \pm1/2, 0\}$ of Section 11.9 (I):

Its respective efficiencies for $\theta_3$, $\theta_{(3)}$, and $\theta_{(2)}$ are 64%, 96%, and 90%. The criterion takes the value 0.34974 which is 98% of the maximum value of part (I). The efficiencies are excellent even though the design is inadmissible, as seen in part (b) of Section 10.7.

To compute the design and verify its optimality, we again utilize Theorem 11.13. We guess the optimality candidate to be symmetric, $\tau(\pm1) = w$, $\tau(\pm1/2) = u$, $\tau(0) = 1 - 2w - 2u$. The inverse moment matrices of second-degree and third-degree are

with $j$th moment $\mu_j = 2w + 2^{1-j}u$, for $j = 2, 4, 6$, and with subblock determinants $d = \mu_4 - \mu_2^2$ and $D = \mu_2\mu_6 - \mu_4^2$. These matrices define the polynomial $P$ on the left hand side of the normality inequality.

If $t = 0$ rightly belongs to the optimal support, then we must have $P(0) = 1$, providing a relation for $u$ in terms of $w$. If $t = 1$ is another optimal support point, then $P(1) = 1$ leads to an equation that implicitly determines $w$. In summary, we obtain


From this, the weights $w = 0.279$ and $u = 0.164$ are computed numerically. The polynomial $P$ becomes $P(t) = 1 + 0.78t^2 - 3.89t^4 + 3.11t^6$. Now $P(0) = P(\pm1) = P(\pm1/2) = 1$ proves optimality, on the arcsin support set $T = \{\pm1, \pm1/2, 0\}$.
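The verification at the five support points can be reproduced numerically. The sketch below is illustrative only and uses the rounded coefficients printed above, so the values agree with 1 only up to rounding.

```python
import numpy as np

# Normality check for the design of part (II), which is optimal on the
# arcsin support set T = {-1, -1/2, 0, 1/2, 1}: with the rounded coefficients
# quoted in the text, P should equal 1 (up to rounding) at every point of T.
P = np.polynomial.Polynomial([1.0, 0.0, 0.78, 0.0, -3.89, 0.0, 3.11])
for t in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(t, round(P(t), 3))    # each value is approximately 1
```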

III. Another option is to stay in the third-degree model and consider there the two parameter systems $\theta_3 = e_4'\theta_{(3)}$ and $\theta_{(2)} = K'\theta_{(3)}$. The geometric mean of the information for $\theta_3$ and of the $\phi_0$-information for $\theta_{(2)}$ is

The design that maximizes this criterion over the five-point support set $T = \{\pm1, \pm1/2, 0\}$ is

The respective efficiencies for $\theta_3$, $\theta_{(3)}$, and $\theta_{(2)}$ are 100%, 94%, and 75%. The full efficiency for $\theta_3$ indicates that the present design is practically the same as the optimal design from part (i) in Section 11.9.

The derivation of the design follows the pattern of part (II), except for employing Theorem 11.16. Again we conjecture the optimal design to be symmetric. Let its moments be $\mu_2, \mu_4, \mu_6$, and again set $d = \mu_4 - \mu_2^2$ and $D = \mu_2\mu_6 - \mu_4^2$. The information for $\theta_3$ is $C_1 = D/\mu_2$, while the information matrix $C_2$ for $\theta_{(2)}$ and the matrix $N$ of Theorem 11.16 are

Let the associated polynomial be $P(t) = (1,t,t^2,t^3)\,N\,(1,t,t^2,t^3)'$. If $t = 0$ is an optimal support point, then we have $P(0) = 1$, entailing a relation to express $u$ in terms of $w$. On the other hand, $P(1) = 1$ yields an


equation that implicitly determines $w$. Thus we get

The resulting values $w = 0.19$ and $u = 0.44$ are not feasible because the sum $2w + 2u$ exceeds 1. Hence $t = 0$ cannot be an optimal support point.

This leaves us with the relation $u = \tfrac12 - w$. From $P(1) = 1$, we determine $w = 0.168$, and hence the design $\tau$. Its moments yield the polynomial $P(t) = 0.50 + 5.04t^2 - 14.73t^4 + 10.19t^6$. Now $P(0) = 0.50 < 1 = P(\pm1/2) = P(\pm1)$ establishes the optimality of $\tau$, on the arcsin support set $T = \{\pm1, \pm1/2, 0\}$.

A different approach is not to average the criteria from the $m$ models, but to optimize only a few, subject to side conditions on the others. A reasonable side condition is to secure some prescribed levels of efficiency.

11.19. DESIGNS WITH GUARANTEED EFFICIENCIES

Finally we maximize information in one model subject to securing some prescribed efficiencies in other models. With the notation of Section 11.10, let $v_i = \max_{M\in\mathcal{M}^{(m)}}\psi_i(M_i)$ be the largest information in the $i$th member of $M = (M_1,\ldots,M_i,\ldots,M_m) \in \mathcal{M}^{(m)}$ under criterion $\psi_i$, and let $\varepsilon_i \in (0;1)$ be the efficiencies that the experimenter wants to see guaranteed. Thus interest is in designs with moment matrices $M_i$ fulfilling $\psi_i(M_i) \geq \varepsilon_iv_i$, for all $i \leq m$.

With $\lambda_i = \varepsilon_iv_i$, we define the vector $\lambda = (\lambda_1,\ldots,\lambda_m)' \in \mathbb{R}^m_+$, and introduce the level set of the function $\psi = (\psi_1,\ldots,\psi_m)' : \mathrm{NND}(k_1\times\cdots\times k_m) \to \mathbb{R}^m$,

Thus the set of competing moment matrices has shrunk to $\mathcal{M}^{(m)}\cap\{\psi \geq \lambda\}$. Of course, we presume that the set still contains some moment matrices that are of statistical interest.

We wish to maximize the information in the $m$th model while at the same time observing the efficiency bounds in the other models:

Since the criterion $\psi_m$ is maximized, we take its efficiency bound to be 0, $\lambda_m = 0$. For $i < m$, the bounds only contribute to the problem provided they are positive and stay below the optimum, $\lambda_i \in (0;v_i)$. Furthermore we


assume that there exists a moment matrix $A \in \mathcal{M}^{(m)}$ that satisfies the strict inequalities $\psi_i(A_i) > \lambda_i$ for all $i < m$. This opens the way to redrafting the General Equivalence Theorem 7.14 using Lagrange multipliers. The ensuing normality inequality refers to the set $\mathcal{M}^{(m)}$ itself, rather than to the unwieldy intersection with the level set $\{\psi \geq \lambda\}$.

11.20. GENERAL EQUIVALENCE THEOREM FOR GUARANTEED EFFICIENCY DESIGNS

Theorem. For the problem of Section 11.19, let $M = (M_1,\ldots,M_m) \in \mathcal{M}^{(m)}\cap\{\psi \geq \lambda\}$ be a tuple of moment matrices that satisfies the efficiency constraints. Then $M$ maximizes $\psi_m(A_m)$ over $A \in \mathcal{M}^{(m)}\cap\{\psi \geq \lambda\}$ if and only if for all $i \leq m$, there exist a number $\alpha_i \geq 0$ and a matrix $N_i \in \mathrm{NND}(k_i)$, with $\alpha_i\psi_i(M_i) = \alpha_i\lambda_i$ for $i < m$ and $\alpha_m = 1 - \sum_{i<m}\alpha_i > 0$, that solve the polarity equations

and that jointly satisfy the normality inequality

Proof. The General Equivalence Theorem 7.14 carries over to the compact and convex set $\mathcal{M}^{(m)}\cap\{\psi \geq \lambda\}$ and the criterion function $\phi(A) = \psi_m(A_m)$. It then states that $M$ is optimal if and only if there exists a tuple $N \in \mathrm{NND}(k_1\times\cdots\times k_m)$ that solves the polarity equation $\phi(M)\phi^\infty(N) = \sum_{i\le m}\operatorname{trace} M_iN_i = 1$ and that satisfies the normality inequality $\sum_{i\le m}\operatorname{trace} A_iN_i \leq 1$ for all $A \in \mathcal{M}^{(m)}\cap\{\psi \geq \lambda\}$. This happens if and only if there is a matrix $N_m \in \mathrm{NND}(k_m)$ with $\psi_m(M_m)\psi_m^\infty(N_m) = \operatorname{trace} M_mN_m = 1$ that satisfies

That (2) implies (3) is seen as follows. For $A \in \mathcal{M}^{(m)}\cap\{\psi \geq \lambda\}$ we have

Then $\sum_{i\le m}\operatorname{trace} A_iN_i = \operatorname{trace} A_mN_m + \sum_{i<m}\operatorname{trace} A_iN_i \leq 1$ and $\alpha_m > 0$ yield


Conversely, (3) leads to the additional polarity equations in (1) by viewing (3) as an optimization problem in its own right. For $i < m$, we introduce for the $i$th level set $\{\psi_i \geq \lambda_i\}$ the concave indicator function $g_i(A)$, with values 0 or $-\infty$ according as $\psi_i(A_i) \geq \lambda_i$ or not. Furthermore we define $g_m(A) = \operatorname{trace} A_mN_m$. Then (3) means that $M$ maximizes the objective function $g(A) = \sum_{i\le m}g_i(A)$ over the full set $\mathcal{M}^{(m)}$. Convex analysis teaches us that this happens only if there exists a subgradient $B$ of $g$ at $M$ that is normal to $\mathcal{M}^{(m)}$ at $M$. The assumptions set out in Section 11.19 ascertain that the subgradient has the form $B = (u_1B_1,\ldots,u_{m-1}B_{m-1},N_m)$, with Lagrange multipliers (Kuhn--Tucker coefficients) $u_i \geq 0$ satisfying $u_i\psi_i(M_i) = u_i\lambda_i$, and where $B_i \in \mathrm{Sym}(k_i)$ is a subgradient of $\psi_i$ at $M_i$ provided $u_i$ is positive. In the latter case, we have $\psi_i(M_i) = \lambda_i > 0$ and Theorem 7.9 provides the representation $B_i = \psi_i(M_i)N_i$ with $N_i$ solving (1).

Finally the Lagrange multipliers $u_i$ are transformed to yield the coefficients $\alpha_i$. With $\langle M,B\rangle = 1 + \sum_{i<m}u_i\psi_i(M_i) > 0$, this is achieved by setting $\alpha_i = u_i\psi_i(M_i)/\langle M,B\rangle$ for $i < m$ and $\alpha_m = 1/\langle M,B\rangle$.

The result is close to Theorem 11.12 both in content and format, for the reason that there exist Lagrange multipliers $u_1,\ldots,u_{m-1}$ that make the present problem equivalent to maximizing $\psi_m(A_m) + \sum_{i<m}u_i\bigl(\psi_i(A_i) - \lambda_i\bigr) = \Phi\circ\psi(A) - \sum_{i<m}u_i\lambda_i$, say, over the unrestricted set $\mathcal{M}^{(m)}$.

A prime application is the discrimination between two rival models. For this task, the theorem specializes as follows.

11.21. MODEL DISCRIMINATION

Theorem. Consider two models $i = 1, 2$ with criteria of the form $\phi_i\circ C_{K_i}$, where $K_i$ is a $k_i\times s_i$ matrix of full column rank and $\phi_i$ is an information function on $\mathrm{NND}(s_i)$. Assume that every member $(A_1,A_2) \in \mathcal{M}^{(2)}$ for which $A_2$ maximizes $\phi_2\circ C_{K_2}$ violates the given efficiency bound $\lambda$ for the first model, $\phi_1\bigl(C_{K_1}(A_1)\bigr) < \lambda$. Let the pair $(M_1,M_2) \in \mathcal{M}^{(2)}$ be such that $M_i$ is a member of the feasibility cone $\mathcal{A}(K_i)$, with information matrix $C_i = C_{K_i}(M_i)$, for $i = 1,2$, of which $C_1$ fulfills $\phi_1(C_1) \geq \lambda$.

Then $(M_1,M_2)$ maximizes $\phi_2\circ C_{K_2}(A_2)$ over those $(A_1,A_2) \in \mathcal{M}^{(2)}$ with $\phi_1\circ C_{K_1}(A_1) \geq \lambda$ if and only if $\phi_1(C_1) = \lambda$ and for $i = 1, 2$ there exists a matrix $D_i \in \mathrm{NND}(s_i)$ that solves the polarity equation

and there exists a generalized inverse $G_i$ of $M_i$ such that, for some $\alpha \in (0;1)$, the matrices $N_i = G_iK_iC_iD_iC_iK_i'G_i'$ jointly satisfy the normality inequality


Proof. In Theorem 11.20 we have $\alpha_1 > 0$. Otherwise the problem is solved by some pair $(A_1,A_2)$ for which $A_2$ maximizes $\phi_2\circ C_{K_2}$ and, by assumption, this is not so. With $\alpha = \alpha_1 > 0$ and $1 - \alpha = \alpha_2 > 0$ the theorem follows.

While these concepts are appealing they are available only at an additional computational expense. The optimality characterization involves a novel equation, $\phi_1(C_1) = \lambda$, but there is also a further parameter, $\alpha$, that enters into the normality inequality.

11.22. SECOND-DEGREE VERSUS THIRD-DEGREE POLYNOMIAL FIT MODELS, III

For the last time, we turn to the setting discussed in Section 11.9 of discriminating between a second-degree and a third-degree polynomial model, for which in Section 11.18 we have found some optimal mixture designs.

I. In order to find the design with largest $\phi_0$-information for $\theta_{(2)}$ in the second-degree model, among the designs that guarantee 50% efficiency for $\theta_3$ in the third-degree model, we must maximize $\phi_0\bigl(M_2(\tau)\bigr)$ subject to $e_4'M_3^-(\tau)e_4 \leq 32$. This is so since 50% of the optimal information for $\theta_3$ is 1/32, by Section 11.9 (i). The solution is the design

As before, dots indicate zeros. The respective efficiencies for $\theta_3$, $\theta_{(3)}$, and $\theta_{(2)}$ are 50%, 93%, and 94%.

We use Theorem 11.21 to verify optimality. The two matrices

enter into the definition of the polynomial P in the normality inequality.


From $P(1) = 1$, we obtain $\alpha = 0.074$, giving $P(t) = 0.98 + 0.52t^2 - 2.88t^4 + 2.38t^6$. Now $P(\pm1) = P(\pm0.3236) = 1$ and $P'(\pm0.3236) = 0$ imply that on the interval $[-1;1]$ the polynomial $P$ is bounded by 1. This establishes the desired optimality property of $\tau$.
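The boundedness of $P$ on the whole interval can be confirmed numerically. This sketch is illustrative only; it uses the rounded coefficients printed above, so the checks hold only up to coefficient rounding.

```python
import numpy as np

# Normality check for the design of part (I), using the rounded coefficients
# quoted in the text; with exact arithmetic P equals 1 at t = +/-1 and at the
# interior maxima +/-0.3236, and stays below 1 elsewhere on [-1;1].
P = np.polynomial.Polynomial([0.98, 0.0, 0.52, 0.0, -2.88, 0.0, 2.38])
dP = P.deriv()

t = np.linspace(-1.0, 1.0, 100001)
print(round(P(t).max(), 3))                         # about 1.006 (rounding of the coefficients)
print(round(P(0.3236), 3), round(dP(0.3236), 3))    # about 1.006 and 0: interior maximum near 1
```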

II. As a last example, we take the same criterion as in part (I), but again restrict attention to the arcsin support set $T$. We obtain the design

The respective efficiencies for $\theta_3$, $\theta_{(3)}$, and $\theta_{(2)}$ are 50%, 92%, and 93%.

In order to prove optimality, we apply Theorem 11.21, essentially repeating the steps of part (II) in Section 11.18. The efficiency constraint, $\mu_2/D = 32$, gives $u$ in terms of $w$. The matrices

together with $\alpha \in (0;1)$, determine the polynomial $P(t)$ on the left hand side of the normality inequality. From $P(0) = 1$, we get a formula for $\alpha$. In summary, we obtain

where $w$ is given implicitly through $P(1) = 1$. With the resulting weights $w = 0.292$ and $u = 0.123$, and $\alpha = 0.086$, we get $P(t) = 1.00 + 0.69t^2 - 3.43t^4 + 2.75t^6$. Thus $P(\pm1) = P(\pm1/2) = P(0) = 1$ establishes optimality on the arcsin support set $T = \{\pm1, \pm1/2, 0\}$. The designs we have encountered are tabulated in Exhibit 11.1.

The apparent multitude of reasonable designs for the purpose of model discrimination again reflects the fact that there is no unique answer when the question is to combine information that arises from various sources.

Section       Design for polynomial fit on [-1;1]                                    Efficiencies for
                                                                                      θ_3    θ_(3)  θ_(2)
11.9 (i)      Optimal for θ_3 (value 0.0625)                                           1     0.93   0.75
11.9 (ii)     φ_0-optimal for θ_(3) (value 0.26750)                                   0.85    1     0.87
11.9 (iii)    φ_0-optimal for θ_(2) (value 0.52913)                                    0      0      1
11.9 (I)      Uniform, on the arcsin support set T = {±1, ±1/2, 0}                    0.72   0.94   0.84
11.9 (II)     Half φ_0-optimally augmented for θ_(2)                                  0.42   0.89   0.94
11.18 (I)     Φ_0-optimal for θ_(2) and θ_(3)                                         0.66   0.98   0.91
11.18 (II)    Φ_0-optimal for θ_(2) and θ_(3), on T                                   0.64   0.96   0.90
11.18 (III)   Φ_0-optimal for θ_(2) and θ_3, on T, in a single third-degree model     1.00   0.94   0.75
11.22 (I)     φ_0-optimal for θ_(2), 50% efficient for θ_3                            0.50   0.93   0.94
11.22 (II)    φ_0-optimal for θ_(2), 50% efficient for θ_3, on T                      0.50   0.92   0.93

EXHIBIT 11.1 Discrimination between a second- and a third-degree model. The design efficiency for the individual component $\theta_3$ is evaluated in the third-degree model, as is the $\phi_0$-efficiency for the full parameter vector $\theta_{(3)} = (\theta_0,\theta_1,\theta_2,\theta_3)'$. The $\phi_0$-efficiency for $\theta_{(2)} = (\theta_0,\theta_1,\theta_2)'$ is computed in the second-degree model.

We have restricted our attention to the criteria that permit an evident statistical interpretation, that is, which are mixtures of the form $\Phi\circ\psi$. The scope of the optimality theory extends far beyond these specific compositions. It would cover all information functions on the product cone $\mathcal{K} = \mathrm{NND}(k_1\times\cdots\times k_m)$. Theorem 5.10 depicts this class geometrically, through a one-to-one correspondence with the nonempty closed convex subsets that are bounded away from the origin and recede in all directions of $\mathcal{K}$. Clearly, $\mathcal{K}$ carries many more information functions than any one of its component cones $\mathrm{NND}(k_i)$. Little use seems to flow from this generality.

In contrast, the generality of the General Equivalence Theorem 7.14 is useful and called for. It covers mixtures $\Phi\circ\psi$, just as it applies to any other information function $\phi$. It allows for quite arbitrary sets of competing moment matrices $\mathcal{M}$, as long as they are compact and convex. We have met the sets $\mathcal{M}_\alpha$ for Bayes designs, $\mathcal{M}\bigl(\Xi[0;b]\bigr)$ for designs with bounded weights, $\mathcal{M}^{(m)}$ for mixtures of models, $\{(M,\ldots,M) : M \in \mathcal{M}\}$ for mixtures of criteria, and $\mathcal{M}^{(m)}\cap\{\psi \geq \lambda\}$ for designs with guaranteed efficiencies.

Of course, growing problem complexity entails increased labor to find the optimal designs. Even when the optimal design is obtained, it is usually valid only for infinite sample size, as pointed out in Section 1.24. That is, its weights


need to be rounded off to conform to a finite sample size $n$. This is the topic of the following chapter.

EXERCISES

11.1 Fill in the details of the proof of £/>[&] = ncr^ in Section 11.4.

11.2 Show that the one-point design in $x = \delta\bigl((1-\alpha)M_0 + \alpha I_k\bigr)^{-1}c$, with $\delta$ such that $\|x\| = 1$, is Bayes optimal for $c'\theta$ in $\Xi$, over the unit ball $\mathcal{X} = \{x \in \mathbb{R}^k : \|x\| \leq 1\}$ [Chaloner (1984), p. 293].

11.3 (continued) Show that if

then the one-point design in $x = (0.98, 0.14, 0.14)'$ is Bayes optimal for $\theta_1$ in $\Xi$, with minimum mean squared-error 1.04. In comparison, the one-point design in $c = (1,0,0)'$ has mean squared-error 1.09 and efficiency 0.95.

11.4 (continued) Show that, with $W > 0$ and $\delta = \bigl(\alpha + (1-\alpha)\operatorname{trace} M_0\bigr)/\operatorname{trace} W^{1/2} > 0$, the matrix $M = (1/\alpha)\bigl(\delta W^{1/2} - (1-\alpha)M_0\bigr)$ is positive definite provided $\alpha$ is close to 1 or $M_0$ is close to 0. Use a spectral decomposition $M = \sum_{i\le k}w_ix_ix_i'$ to define a design $\xi(x_i) = w_i$ in $\Xi$. Show that $\xi$ is Bayes linearly optimal for $\theta$ in $\Xi$, with minimum average mean squared-error equal to the trace of $W^{1/2}/\delta$.

11.5 (continued) Show that if

then the design which assigns weights 0.16, 0.44, 0.40 to

is Bayes linearly optimal for $\theta$ in $\Xi$, with minimum average mean squared-error 21.7. In comparison, the design which assigns to the


Euclidean unit vectors $e_1, e_2, e_3$ the weights 0.2, 0.4, 0.4 has average mean squared-error 21.9 and efficiency 0.99.

11.6 Use $\det M_3 = C_1\det M_2$ in the examples (I) and (II) of Section 11.18 to verify $\Phi_0\bigl(\phi_0(M_2), \phi_0(M_3)\bigr) = C_1^{1/8}\,\phi_0(M_2)^{7/8}$. Comment on how the resulting designs reflect the heavy weighting of $\phi_0(M_2)$.
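A quick numerical spot-check of this identity; the values chosen for $\det M_2$ and $C_1$ below are arbitrary illustrative numbers (any positive values would do).

```python
# Spot-check of the identity in Exercise 11.6 with illustrative values.
det_M2, C1 = 0.37, 2.4
det_M3 = C1 * det_M2
phi0_M2 = det_M2 ** (1 / 3)            # phi_0 on NND(3)
phi0_M3 = det_M3 ** (1 / 4)            # phi_0 on NND(4)
lhs = (phi0_M2 * phi0_M3) ** 0.5       # geometric mean Phi_0
rhs = C1 ** (1 / 8) * phi0_M2 ** (7 / 8)
print(abs(lhs - rhs) < 1e-12)          # True
```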


CHAPTER 12

Efficient Designs for Finite Sample Sizes

Methods are discussed to round the weights of a design for infinite sample size in order to obtain designs for finite sample size $n$. We concentrate on the efficient apportionment method since it has the best bound on the efficiency loss that is due to the discretization. Asymptotically the efficiency loss is seen to be bounded of order $n^{-1}$; in the case of differentiability the order is $n^{-2}$. In polynomial fit models, the efficient apportionment method even yields optimal designs in the discrete class of designs for sample size $n$, under the determinant criterion $\phi_0$ and provided the sample size $n$ is large enough.

12.1. DESIGNS FOR FINITE SAMPLE SIZES

A design for sample size $n$, $\xi_n \in \Xi_n$, specifies for its support points $x_i \in \mathcal{X}$ frequencies $n_i = \xi_n(x_i) \in \{1,2,\ldots,n-1,n\}$, and is standardized through $\sum_{i\le\ell}n_i = n$. This inherent discreteness distinguishes the design problem for sample size $n$:

where $\widetilde\Xi_n \subseteq \Xi_n$ is a subset of designs for sample size $n$ which compete for optimality. The optimal value of this problem is denoted by $v(\phi,n)$.

The discretization does not affect the class of optimality criteria to be considered, information functions. Indeed, the unstandardized moment matrix of $\xi_n$ is


(see Section 1.24). Therefore the optimal design for sample size $n$ generally depends on the unknown model variance $\sigma^2$, unless the optimality criterion $\phi$ is homogeneous. This again singles out the information functions $\phi$ on $\mathrm{NND}(k)$ as the only reasonable criteria, the other defining properties being called for by the same arguments as in the general design problem of Section 5.15. The information function to be used may well be a composition of the form $\phi\circ C_K$, but at this point this is of no importance.

The set $\Xi_n$ of designs for sample size $n$ is embedded in the set $\Xi$ of all designs by a transition from $\xi_n$ to its standardized version, $\xi_n/n$. This gives rise to the set of moment matrices

This set is still compact, being the image of the compact subset $\mathcal{X}^n$ of $(\mathbb{R}^k)^n$ under the continuous mapping $(x_1,\ldots,x_n) \mapsto \sum_{j\le n}x_jx_j'/n$. Hence the existence of optimal designs for sample size $n$ poses no problem and parts (a) and (b) of Lemma 5.16 carry over. The striking distinction is that discreteness prevents the set $M(\Xi_n/n)$ from being convex. Since almost everything in our development is built on convexity properties, none of those results apply. Thus it is generally beyond our approach to find the optimal value $v(\phi,n)$ of the design problem for sample size $n$, or the $\phi$-optimal designs in subsets $\widetilde\Xi_n$

of $\Xi_n$.

Instead we propose a specific apportionment method which takes any design $\xi$ for infinite sample size (preferably an optimal one) and efficiently rounds it to a design $\xi_n$ for sample size $n$. The rounding is carried out irrespective of the particular criterion $\phi$, but with due attention to the general principles that underlie the design of experiments. Section 12.2 to Section 12.5 introduce the efficient design apportionment. Its optimality property of enjoying the best efficiency bound among all designs for sample size $n$ is studied in Section 12.6 to Section 12.11. From Section 12.13 on, we present a particular setting where rounding leads to designs that actually are $\phi_0$-optimal in the discrete set of designs for sample size $n$.

12.2. SAMPLE SIZE MONOTONICITY

Suppose a given design $\xi$ is rounded to a design $\xi_n$ for sample size $n$. The least complications arise if the support sets are the same (whence $\xi$ and $\xi_n/n$ are mutually absolutely continuous),

in that the two designs then share identical identifiability properties for whatever parameter system $K'\theta$ is under investigation. We pursue this case only.


Support point         1         2         3         4
Weight             0.02557   0.03224   0.06234   0.87985

Quota at 299         7.645     9.640    18.640   263.075
Apportionment           8        10        18       263

Quota at 300         7.671     9.672    18.702   263.955
Apportionment           7        10        19       264

EXHIBIT 12.1 Quota method under growing sample size. The first support point loses an observation as the sample size increases from 299 to 300.

The rationale is that if $\xi$ is optimal, then it is presumably wise to maintain its support points when rounding. This requires at least as many observations as $\xi$ has support points, $n \geq \#\mathrm{supp}\,\xi = \ell$, say.

Thus the problem is the following. For a fixed design $\xi$ that is supported by the regression vectors $x_1,\ldots,x_\ell$ in $\mathcal{X}$, we wish to discretize the weights $w_i = \xi(x_i)$ into frequencies $n_i$ summing to $n$. Then $\xi_n(x_i) = n_i$ defines a design for sample size $n$. We want to find a procedure for which the standardized and discretized design $\xi_n/n$ is "close" to the original design $\xi$. The question is, what does "close" mean for the design of experiments.

A naive approach is to focus on the quota $nw_i$ as the fair share for the support point $x_i$. Of course, the numbers $nw_1,\ldots,nw_\ell$ generally fail to be integers. The usual numerical rounding, of rounding $nw_i$ up or down to the nearest integer, has little chance of preserving the side condition that the frequencies $n_i$ sum to $n$.

A less naive, but equally futile approach is to minimize the total variation distance $\max_{i\le\ell}|n_i/n - w_i|$. This results in the quota method which operates in two phases. First it assigns to $x_i$ the integer part $\lfloor nw_i\rfloor$ of the quota $nw_i$. This leaves $n - \sum_{i\le\ell}\lfloor nw_i\rfloor$ observations to be taken care of in the second phase. They are allocated, one by one, to those support points that happen to possess the largest fractional part $nw_i - \lfloor nw_i\rfloor$.

The total variation distance is rarely a good statistical measure of closeness of two distributions. For design purposes, it suffices to note that the quota method is not sample size monotone. An apportionment method is called sample size monotone when, for growing sample size $n$, the frequencies $n_i$ do not decrease, for all $i \leq \ell$. If a method is not sample size monotone, then a sequential application may lead to the fatal situation that for sample size $n+1$ an observation must be removed which was already realized as part of the apportionment of sample size $n$. That the quota method suffers from this deficiency is shown in Exhibit 12.1.
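A small sketch of the largest-remainder quota method just described; the function name is ours, the weights are those of Exhibit 12.1, and the output reproduces the drop of the first frequency when $n$ grows from 299 to 300.

```python
from math import floor

def quota_method(weights, n):
    """Largest-remainder (quota) apportionment of n observations."""
    quotas = [n * w for w in weights]
    freqs = [floor(q) for q in quotas]
    leftover = n - sum(freqs)
    # hand out the remaining observations to the largest fractional parts
    order = sorted(range(len(weights)), key=lambda i: quotas[i] - freqs[i], reverse=True)
    for i in order[:leftover]:
        freqs[i] += 1
    return freqs

w = [0.02557, 0.03224, 0.06234, 0.87985]   # weights of Exhibit 12.1
print(quota_method(w, 299))   # [8, 10, 18, 263]
print(quota_method(w, 300))   # [7, 10, 19, 264] -- the first point loses an observation
```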


12.3. MULTIPLIER METHODS OF APPORTIONMENT

An excessive concern with the quotas $nw_i$ begs the question of treating each support point equally. The true problems are caused by the procedure used to round $nw_i$ to $n_i$, and to do so for each $i = 1,\ldots,\ell$ in the same manner. Multiplier methods of apportionment pay due credit to these complications and reverse the emphasis.

Multiplier methods are in a one-to-one correspondence with rounding procedures. A rounding function $R$ is an isotonic function on $\mathbb{R}$, $z \geq \tilde z \Rightarrow R(z) \geq R(\tilde z)$, which maps a real number into one of the neighboring integers. The corresponding multiplier method seeks a multiplier $\nu \in [0;\infty)$ to create pseudoquotas $\nu w_i$, so that the rounded numbers $n_i = R(\nu w_i)$ sum to $n$. The point is that every pseudoquota $\nu w_i$ gets rounded using the same function $R$. The multiplier $\nu$ takes the place of the sample size $n$, but is degraded to be nothing but a technical tool.

For growing sample size $n = \sum_{i\le\ell}R(\nu w_i)$, the multiplier $\nu$ cannot decrease. This proves that every multiplier method is sample size monotone.

There is another advantage. On a computer, the weights $w_i$ may not sum to 1, because of limited machine precision. This calls for a standardization $w_i/\mu$, with $\mu = \sum_{i\le\ell}w_i \neq 1$. But a multiplier method yields the same result, whether based on $w_i$ (with multiplier $\nu$) or on $w_i/\mu$ (with multiplier $\mu\nu$). Therefore multiplier methods are stable against machine inaccuracies.

In the ensemble of all multiplier methods, the task becomes one of finding a rounding function $R$ which is appropriate for the design of an experiment. However, we need to accommodate the following type of situation. Consider the function $R(z) = \lceil z\rceil$ which rounds $z$ to the smallest integer greater than or equal to $z$. The $\phi_0$-optimal designs of Section 9.5 have uniform weights, $w_i = 1/k$ for $i \leq k$. Any multiplier $\nu \in ((m-1)k; mk]$ yields the apportionment $\lceil\nu w_i\rceil = m$ which sums to $mk$. Hence only the sample sizes $n = k, 2k, \ldots$ are realizable, the others are not. For this reason the definition of a rounding function $R$ is changed at those points $z$ where $R$ has jumps, and offers the choice of the two limits of $R(z \pm \delta)$ as $\delta$ tends to 0.

12.4. EFFICIENT ROUNDING PROCEDURE

The apportionment of designs will be based on the rounding where fractional numbers $z$ always get rounded up to the next integer, while integers $z$ may be rounded up or not. Because of the latter ambiguity, the rounding procedure $\llbracket\cdot\rrbracket$ maps a number $z$ into a one-element set or a two-element set, $\llbracket z\rrbracket = \{m+1\}$ for $z \in (m; m+1)$ and $\llbracket m\rrbracket = \{m, m+1\}$, for all integers $m$. In particular, we have $\lceil z\rceil \in \llbracket z\rrbracket$ for all $z \in \mathbb{R}$.


For a design $\xi$ with weights $w_1,\ldots,w_\ell$, the corresponding multiplier method results in an apportionment set $E(\xi,n)$ consisting of those discretizations $n_1,\ldots,n_\ell$ summing to $n$ such that, for some multiplier $\nu \geq 0$ and for all $i \leq \ell$, the frequency $n_i$ lies in $\llbracket\nu w_i\rrbracket$. We call $\llbracket\cdot\rrbracket$ the efficient rounding procedure, and $E(\xi,n)$ the efficient design apportionment, for reasons to be seen in Theorem 12.7.

The fact that the efficient apportionment may yield a set $E(\xi,n)$ rather than a singleton is again illustrated by the $\phi_0$-optimal designs of Section 9.5. For sample size $n = mk + r$ with remainder $r \in \{0,1,\ldots,k-1\}$, there are $\binom{k}{r}$ ways to discretize the uniform weights $w_i = 1/k$, by assigning frequency $m+1$ to $r$ support points and frequency $m$ to the other $k-r$ points. All of these assignments appear equally persuasive. The reason is that equality of the weights prevents the apportionment method from discriminating between the $\binom{k}{r}$ possible discretizations. The same phenomenon occurs when two or more weights are too close together rather than being equal.

The efficient design apportionment achieves the goal of Section 12.2 that the discretizations $\xi_n \in E(\xi,n)$ are designs for sample size $n$ that have the same support as $\xi$, as soon as $n \geq \#\mathrm{supp}\,\xi = \ell$. Namely, for $n = \ell$ the unique member of $E(\xi,\ell)$ is the uniform design $(1,\ldots,1)$ for sample size $\ell$. For $n > \ell$, the frequencies then satisfy $n_i \geq 1$, by sample size monotonicity.

Given the sample size $n$ and the support size $\ell$ of the design $\xi$, rough bounds for a multiplier $\nu$ are $n - \ell \leq \nu \leq n$. The multiplier $n - \ell$ is generally too small since $n_i \in \llbracket(n-\ell)w_i\rrbracket$ implies $\sum_{i\le\ell}n_i \leq \sum_{i\le\ell}\bigl((n-\ell)w_i + 1\bigr) = n$. The multiplier $n$ is generally too large since $n_i \in \llbracket nw_i\rrbracket$ entails $\sum_{i\le\ell}n_i \geq \sum_{i\le\ell}nw_i = n$. The average of the two extremes is a good first choice, $\nu = n - \tfrac{\ell}{2}$.

We concentrate on the specific frequencies $n_i = \lceil(n - \tfrac{\ell}{2})w_i\rceil$ and introduce the discrepancy $d = \bigl(\sum_{i\le\ell}n_i\bigr) - n$.

The design $\xi_{n+d}(x_i) = \lceil(n - \tfrac{\ell}{2})w_i\rceil$ lies in the efficient design apportionment $E(\xi,n+d)$ for sample size $n+d$. If $d = 0$, then $\xi_n \in E(\xi,n)$ is a solution to the problem. If $d$ is negative or positive then $|d|$ observations need to be added or removed. This is achieved by the following theorem which characterizes the efficient design apportionment without taking recourse to multipliers.

12.5. EFFICIENT DESIGN APPORTIONMENT

Theorem. Let $\xi \in \Xi$ be a design with weights $w_1,\ldots,w_\ell > 0$, and let $n_1,\ldots,n_\ell \geq 0$ be integers which sum to $n \geq 1$.


a. (Characterization) Then $(n_1,\ldots,n_\ell)$ lies in the efficient design apportionment $E(\xi,n)$ if and only if
$$\max_{i\le\ell}\frac{n_i - 1}{w_i} \;\leq\; \min_{i\le\ell}\frac{n_i}{w_i}.$$

b. (Augmentation, reduction) Suppose $(n_1,\ldots,n_\ell) \in E(\xi,n)$. If $j$ and $k$ are such that $n_j/w_j = \min_{i\le\ell}n_i/w_i$ and $(n_k - 1)/w_k = \max_{i\le\ell}(n_i - 1)/w_i$, then $(n_1,\ldots,n_j+1,\ldots,n_\ell) \in E(\xi,n+1)$ and $(n_1,\ldots,n_k-1,\ldots,n_\ell) \in E(\xi,n-1)$.

Proof. By definition, any member $(n_1,\ldots,n_\ell)$ of $E(\xi,n)$ is obtained from a multiplier $\nu \geq 0$ such that for all $i \leq \ell$, we have $n_i \in \llbracket\nu w_i\rrbracket$. This entails $n_i - 1 \leq \nu w_i \leq n_i$, and division by $w_i$ establishes the inequality in part (a). Conversely, for $\nu \in [\max_{i\le\ell}(n_i-1)/w_i;\ \min_{i\le\ell}n_i/w_i]$, we get $n_i - 1 \leq \nu w_i \leq n_i$ for all $i \leq \ell$. This yields $n_i \in \llbracket\nu w_i\rrbracket$, and proves $(n_1,\ldots,n_\ell) \in E(\xi,n)$.

For part (b), we define $n_i^+ = n_i$ for $i \neq j$ and $n_j^+ = n_j + 1$, and $n_i^- = n_i$ for $i \neq k$ and $n_k^- = n_k - 1$. Hence $(n_1^+,\ldots,n_\ell^+)$ and $(n_1^-,\ldots,n_\ell^-)$ are designs for sample size $n+1$ and $n-1$, respectively. Furthermore we introduce
$$M^\pm = \max_{i\le\ell}\frac{n_i^\pm - 1}{w_i},\qquad m^\pm = \min_{i\le\ell}\frac{n_i^\pm}{w_i},$$
besides $M = \max_{i\le\ell}(n_i-1)/w_i$ and $m = \min_{i\le\ell}n_i/w_i$.

Part (a) states that $M \leq m$. In the case of $(n_1^+,\ldots,n_\ell^+)$, we wish to establish $M^+ \leq m^+$. From $n_i \leq n_i^+$, we infer $m \leq m^+$. Either $M^+$ is attained by some $i \neq j$; then we obtain $M^+ = (n_i - 1)/w_i \leq M \leq m \leq m^+$. Or otherwise $M^+ = (n_j^+ - 1)/w_j = n_j/w_j = m$ gives $M^+ = m \leq m^+$. Thus part (a) yields $(n_1^+,\ldots,n_\ell^+) \in E(\xi,n+1)$. In the case of $(n_1^-,\ldots,n_\ell^-)$, we obtain similarly $m^- = n_i^-/w_i = n_i/w_i \geq m \geq M \geq M^-$ for some $i \neq k$, or $m^- = n_k^-/w_k = (n_k - 1)/w_k = M \geq M^-$.

Thus a fast implementation of the efficient design apportionment has two phases. First use the multiplier $n - \tfrac{\ell}{2}$ to calculate the frequencies $n_i = \lceil(n - \tfrac{\ell}{2})w_i\rceil$. The second phase loops until the discrepancy $(\sum_{i\le\ell}n_i) - n$ is 0, either increasing a frequency $n_j$ which attains $n_j/w_j = \min_{i\le\ell}n_i/w_i$ to $n_j + 1$, or decreasing some $n_k$ with $(n_k - 1)/w_k = \max_{i\le\ell}(n_i - 1)/w_i$ to $n_k - 1$. An example is shown in Exhibit 12.2.
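A minimal sketch of this two-phase procedure, assuming all weights are positive; ties are broken arbitrarily, so it returns just one member of $E(\xi,n)$.

```python
import math

def efficient_apportionment(weights, n):
    """One member of the efficient design apportionment E(xi, n).

    Phase 1 rounds the pseudoquotas (n - l/2) * w_i up to the next integer;
    phase 2 adds or removes observations one at a time as in Theorem 12.5(b).
    """
    l = len(weights)
    freqs = [math.ceil((n - l / 2) * w) for w in weights]
    while sum(freqs) < n:       # add to a point attaining min n_i / w_i
        j = min(range(l), key=lambda i: freqs[i] / weights[i])
        freqs[j] += 1
    while sum(freqs) > n:       # remove from a point attaining max (n_i - 1) / w_i
        k = max(range(l), key=lambda i: (freqs[i] - 1) / weights[i])
        freqs[k] -= 1
    return freqs

# Weights of Exhibit 12.2:
w = [1/6, 1/3, 1/2]
for n in (3, 7, 8, 12):
    print(n, efficient_apportionment(w, n))
# e.g. n = 3 -> [1, 1, 1]; n = 12 -> [2, 4, 6]; for n = 7 and n = 8 the result
# is one of the three members listed in Exhibit 12.2.
```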

Next we turn to the excellent global efficiency properties of the efficient design apportionment, thereby justifying the name. The approach is based on


 n    E(ξ, n)
 1    (1,0,0), (0,1,0), (0,0,1)
 2    (1,1,0), (1,0,1), (0,1,1)
 3    (1,1,1)
 4    (1,1,2)
 5    (1,2,2)
 6    (1,2,3)
 7    (2,2,3), (1,3,3), (1,2,4)
 8    (2,3,3), (2,2,4), (1,3,4)
 9    (2,3,4)
10    (2,3,5)
11    (2,4,5)
12    (2,4,6)
13    (3,4,6), (2,5,6), (2,4,7)
14    (3,5,6), (3,4,7), (2,5,7)
15    (3,5,7)
16    (3,5,8)
17    (3,6,8)
18    (3,6,9)

EXHIBIT 12.2 Efficient design apportionment. For a design $\xi$ with weights $w_1 = 1/6$, $w_2 = 1/3$, $w_3 = 1/2$, the efficient apportionment $E(\xi,n)$ offers three discretizations for $n = 1, 2, 7, 8, 13, 14, \ldots$ and is a singleton otherwise. At each $n$, a support point attaining the minimum of $n_i/w_i$ receives the next observation.

the likelihood ratio $\eta(x)/\xi(x)$ of any two designs $\eta, \xi \in \Xi$, as are the proofs of Theorem 10.2 and Corollary 8.9. We introduce the minimum likelihood ratio of $\eta$ relative to $\xi$:
$$e_{\eta,\xi} = \min_{x\in\mathrm{supp}\,\xi}\frac{\eta(x)}{\xi(x)}.$$

The definition entails $e_{\eta,\xi} \in [0;1]$. The following lemma explicates the role of $e_{\eta,\xi}$ as an efficiency bound of $\eta$ relative to $\xi$.

12.6. PAIRWISE EFFICIENCY BOUND

Lemma. Any two designs $\eta, \xi \in \Xi$ satisfy, for all information functions $\phi$ on $\mathrm{NND}(k)$,

Proof. Monotonicity and homogeneity of an information function $\phi$ make (2) an immediate consequence of (1). Inequality (1) follows from

The merits of the efficiency bound $e_{\eta,\xi}$ are that it is easy to evaluate, and that it holds simultaneously over the set $\Phi$ of all information functions on $\mathrm{NND}(k)$:


Also it helps to formulate the apportionment issue as a succinct optimization problem. Given a design $\xi \in \Xi$, achieve the largest possible efficiency bound $e_{\eta,\xi}$ among all standardized designs $\eta \in \Xi_n/n$ for sample size $n$. The problem is solved by the efficient design apportionment $E(\xi,n)$.

12.7. OPTIMAL EFFICIENCY BOUND

Theorem. For every design $\xi \in \Xi$, the best efficiency bound among all standardized designs for sample size $n$,
$$e_\xi(n) = \max_{\eta\in\Xi_n/n}e_{\eta,\xi},$$
is attained by any member of the efficient design apportionment $E(\xi,n)$.

Proof. Let $\xi$ assign weights $w_i > 0$ to $\ell$ support points $x_i \in \mathcal{X}$, and let $\eta \in \Xi_n/n$ be an arbitrary standardized design for sample size $n$. If for some $i$ we have $\eta(x_i) = 0$, then $e_{\eta,\xi}$ vanishes and $\eta$ is of no interest for the maximization problem. If the support of $\eta$ contains additional points $x_{\ell+1},\ldots,x_k$, then the efficiency bound $e_{\eta,\xi}$ does not decrease by shifting the mass $\sum_{i=\ell+1}^k\eta(x_i)$ into the point $x_1$, say. Hence we need only consider designs $\eta$ of the form $\eta(x_i) = n_i/n$ for all $i \leq \ell$, with $\sum_{i\le\ell}n_i = n$.

In the main part of the proof, we start from a set $(n_1,\ldots,n_\ell)$ with $\sum_{i\le\ell}n_i = n$ which attains the best efficiency bound, $\min_{i\le\ell}n_i/(nw_i) = e_\xi(n)$. This set need not belong to the efficient design apportionment $E(\xi,n)$, so we modify it into another optimal set which lies in $E(\xi,n)$. Again we put $M = \max_{i\le\ell}(n_i-1)/w_i$ and $m = \min_{i\le\ell}n_i/w_i$.

While $m < M$, we loop through a construction based on some subscripts $j$ and $k$ satisfying $m = n_j/w_j$ and $M = (n_k-1)/w_k$. The assumption $m < M$ forces $j \neq k$ and $n_k \geq 2$. We transfer an observation from the $k$th to the $j$th point, $\tilde n_j = n_j + 1$, $\tilde n_k = n_k - 1$, and $\tilde n_i = n_i$ otherwise.

This yields $\tilde n_j/w_j > m$ and $\tilde n_k/w_k = M > m$. But $\tilde m = \min_{i\le\ell}\tilde n_i/w_i$ satisfies $\tilde m \leq ne_\xi(n) = m$. Hence there exists a third subscript $i \neq j,k$ where $\tilde n_i = n_i$, such that $\tilde n_i/w_i = \tilde m \leq m \leq n_i/w_i$. Therefore $\tilde m = m$, and $(\tilde n_1,\ldots,\tilde n_\ell)$ is just another optimal frequency set, but the optimal efficiency bound is attained by


fewer $n_i$ than before. So we copy $(\tilde n_1,\ldots,\tilde n_\ell)$ into $(n_1,\ldots,n_\ell)$ and compare the new values of $m$ and $M$.

The loop terminates with $M \leq m$. Part (a) of Theorem 12.5 proves the terminal frequency set to be a member of $E(\xi,n)$.

Finally we prove that any two members $(n_1,\ldots,n_\ell) \neq (\tilde n_1,\ldots,\tilde n_\ell)$ in the efficient design apportionment $E(\xi,n)$ have the same efficiency bound. To this end it suffices to establish $m = \tilde m$, where $m = \min_{i\le\ell}n_i/w_i$ and $\tilde m = \min_{i\le\ell}\tilde n_i/w_i$.

Let $\nu$ and $\tilde\nu$ be two multipliers that produce $(n_1,\ldots,n_\ell)$ and $(\tilde n_1,\ldots,\tilde n_\ell)$. We have $\nu = \tilde\nu$; otherwise $\nu < \tilde\nu$ entails $n_i \leq \tilde n_i$ for all $i \leq \ell$, and $\sum_{i\le\ell}n_i = n = \sum_{i\le\ell}\tilde n_i$ forces equality, $n_i = \tilde n_i$, contrary to the assumption. Also we have $\nu \leq m$; otherwise we get $n_i < \nu w_i$ and $n_i \notin \llbracket\nu w_i\rrbracket$ for some $i$. The same argument yields $\nu \leq \tilde m$.

Thus there is a common multiplier $\nu$ satisfying $n_i, \tilde n_i \in \llbracket\nu w_i\rrbracket$ for all $i \leq \ell$. Since the two sets of frequencies are distinct, there exists a subscript $j$ with $n_j = \nu w_j$ and $\tilde n_j = \nu w_j + 1$. As both sum to $n$ there is another subscript $k$ with $n_k = \nu w_k + 1$ and $\tilde n_k = \nu w_k$. We obtain $n_j/w_j = \nu \leq m \leq n_j/w_j$ and $\tilde n_k/w_k = \nu \leq \tilde m \leq \tilde n_k/w_k$. This proves $m = \nu = \tilde m$, whence the efficiency bound is constant over $E(\xi,n)$.

The construction in the proof is called for since designs other than those in the efficient apportionment $E(\xi,n)$ may also achieve the best efficiency bound $e_\xi(n)$. For instance, in Exhibit 12.2, the optimal bound $e_\xi(8) = 6/8$ is attained by the apportionment $(3,2,3)$ which fails to belong to $E(\xi,8) = \{(2,3,3), (2,2,4), (1,3,4)\}$.

The optimal bound $e_\xi(n)$ varies with the underlying design $\xi$. There are coarser efficiency bounds which hold uniformly over sets of designs, $\widetilde\Xi \subseteq \Xi$.

12.8. UNIFORM EFFICIENCY BOUNDS

Lemma. For a design $\xi \in \Xi$ and sample size $n$, let $e_\xi(n)$ be the optimal efficiency bound of Theorem 12.7.

a. (Support size) For all designs $\xi \in \Xi$ that have support size $\ell$, we have
$$e_\xi(n) \;\geq\; 1 - \frac{\ell}{n}.$$

b. (Rational weights) For all designs $\xi \in \Xi$ that have rational weights with


common denominator $N$, we have
$$e_\xi(n) \;\geq\; \frac{\lfloor n/N\rfloor}{n/N}.$$

Proof. Again we set $\xi(x_i) = w_i$ for the $\ell$ support points $x_i$ of $\xi$. For all $i,j = 1,\ldots,\ell$, part (a) of Theorem 12.5 entails $(n_j-1)/w_j \leq n_i/w_i$, that is, $w_i(n_j - 1) \leq n_iw_j$. Summation over $j$ yields $w_i(n - \ell) \leq n_i$, and $(n_i/n)/w_i \geq 1 - \ell/n$ for all $i = 1,\ldots,\ell$. This proves part (a).

For part (b), let $N$ be a common denominator, whence the $Nw_i$ are integers. If the sample size is an integer multiple $mN$ of $N$, then the efficient design apportionment is uniquely given by $n_i = mNw_i$. Generally, for sample size $n = mN + r$ with a remainder $r \in \{0,\ldots,N-1\}$, sample size monotonicity yields $n_i \geq mNw_i$ for all $i \leq \ell$. This gives $e_\xi(n) \geq mN/n = \lfloor n/N\rfloor/(n/N)$.

The first bound increases with $n$. The second bound periodically jumps to 1 and decreases in between. We use the example of Exhibit 12.2 in Section 12.6 to illustrate these bounds:

  n    e_ξ(n)   1 − 3/n   ⌊n/6⌋/(n/6)
  1      0         0           0
  2      0         0           0
  3     2/3        0           0
  4     3/4       1/4          0
  5     4/5       2/5          0
  6      1        3/6          1
  7     6/7       4/7         6/7
  8     6/8       5/8         6/8
  9     8/9       6/9         6/9
 10    9/10      7/10        6/10
 11    10/11     8/11        6/11
 12      1       9/12          1
 13    12/13    10/13       12/13
 14    12/14    11/14       12/14
 15    14/15    12/15       12/15
 16    15/16    13/16       12/16
 17    16/17    14/17       12/17
 18      1      15/18          1
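These values can be reproduced with a short computation. The sketch below repeats the two-phase apportionment in exact rational arithmetic and is illustrative only (ties are broken arbitrarily, which does not affect the bound, by Theorem 12.7).

```python
from fractions import Fraction
from math import ceil

def e_bound(weights, n):
    """Optimal efficiency bound e_xi(n) = min_i n_i / (n w_i), evaluated at one
    member of the efficient apportionment."""
    l = len(weights)
    freqs = [ceil((n - Fraction(l, 2)) * w) for w in weights]
    while sum(freqs) < n:
        j = min(range(l), key=lambda i: freqs[i] / weights[i])
        freqs[j] += 1
    while sum(freqs) > n:
        k = max(range(l), key=lambda i: (freqs[i] - 1) / weights[i])
        freqs[k] -= 1
    return min(Fraction(f, n) / w for f, w in zip(freqs, weights))

w = [Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)]
for n in (3, 8, 11, 16):
    print(n, e_bound(w, n), 1 - Fraction(3, n), Fraction(n // 6) / Fraction(n, 6))
# e.g. n = 11 -> 10/11, 8/11, 6/11, in agreement with the table above.
```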

The asymptotic order $n^{-1}$ cannot be improved without further assumptions on the optimality criterion $\phi$. This is demonstrated by a line fit model with experimental domain $[-1;1]$ using the global criterion $g$ of Section 9.2.

The globally optimal design for $\theta$ is the $\phi_0$-optimal design $\tau(\pm1) = \tfrac12$, with moment matrix $M_1(\tau) = I_2$. This follows from the Kiefer--Wolfowitz Theorem 9.4 and the discussion in Section 9.5. Hence the optimal value is $v(g) = g(I_2) = \tfrac12$.

Now let us turn to designs for sample size $n$. For even sample size $n = 2m$, the obvious apportionment $\tau_n(\pm1) = m$ achieves efficiency 1. For an odd sample size, $n = 2m+1$, the efficient apportionment assigns $m$ observations to $-1$ and $m+1$ observations to $+1$, or vice versa. The two corresponding moment matrices are


The common criterion value is found to be $\tfrac12(1 - 1/n)$. For either design $\tau_n$ in $E(\tau,n)$, the $g$-efficiency is $1 - 1/n$, for odd sample size $n$. Therefore a tighter efficiency bound than $e_\tau(n)$ is generally not available.
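A numerical illustration of this example; the normalization assumed here, $g(M) = 1/\max_{t\in[-1;1]}f(t)'M^{-1}f(t)$, is our reading of the global criterion of Section 9.2 and enters only through a constant factor that cancels in the efficiency.

```python
import numpy as np

def g_value(weight_plus, weight_minus):
    """Global criterion value for a line-fit design on {-1, +1}, assuming
    g(M) = 1 / max_{t in [-1,1]} f(t)' M^{-1} f(t) with f(t) = (1, t)'."""
    d = weight_plus - weight_minus
    M = np.array([[1.0, d], [d, 1.0]])
    t = np.linspace(-1.0, 1.0, 2001)
    F = np.column_stack([np.ones_like(t), t])
    return 1.0 / np.einsum('ij,jk,ik->i', F, np.linalg.inv(M), F).max()

v_opt = g_value(0.5, 0.5)                     # optimal value (design tau(+-1) = 1/2)
for n in (5, 9, 21):                          # odd sample sizes
    m = n // 2
    eff = g_value((m + 1) / n, m / n) / v_opt
    print(n, round(eff, 4), 1 - 1 / n)        # the g-efficiency equals 1 - 1/n
```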

As $n$ tends to $\infty$, either bound of the lemma converges to 1. The asymptotic statements come in a somewhat smoother form by switching to the complementary bound on the efficiency loss, $1 - e_\xi(n)$. Part (a) states that the loss is bounded by $\ell/n$, for every sample size $n$. De-emphasizing the constant $\ell$, this is paraphrased by saying that the efficiency loss is asymptotically bounded of order $n^{-1}$.

Inherent in any rounding method is the idea that for growing sample size $n$ the discretized weights $n_i/n$ converge to the true weight $w_i$. This is established in the following theorem. Furthermore, it is shown that with $\xi_n \in E(\xi,n)$, the criterion values $\phi\bigl(M(\xi_n/n)\bigr)$ converge to $\phi\bigl(M(\xi)\bigr)$. This is not entirely trivial since we need to rule out discontinuities such as in Section 3.16.

12.9. ASYMPTOTIC ORDER $O(n^{-1})$

Theorem. Let $\xi \in \Xi$ be an arbitrary design.

a. (Efficiency) The efficiency loss $1 - e_\xi(n)$ that is associated with the efficient design apportionment for rounding $\xi$ into a design for sample size $n$ is asymptotically bounded of order $n^{-1}$.

b. (Weights) The Euclidean distance between $\xi$ and the standardized members of the efficient apportionment $E(\xi,n)$ is asymptotically bounded of order $n^{-1}$.

c. (Convergence) For every design $\xi_n \in E(\xi,n)$ for sample size $n$, we have, for all information functions $\phi$ on $\mathrm{NND}(k)$,
$$\lim_{n\to\infty}\phi\bigl(M(\xi_n/n)\bigr) = \phi\bigl(M(\xi)\bigr).$$

Proof. Part (a) is a consequence of Lemma 12.8: $1 - e_\xi(n) \leq \ell/n = O(n^{-1})$.

For part (b), let $\xi$ assign weight $w_i$ to its support points $x_i$. Any member $(n_1,\ldots,n_\ell)$ in $E(\xi,n)$ satisfies $w_i(n_j - 1) \leq n_iw_j$, as in the proof of Lemma 12.8. Summation over $i$ yields $n_j - 1 \leq nw_j$. In the case $n_j \geq nw_j$, we get $|n_j - nw_j| = n_j - nw_j \leq 1$. In the case $n_j < nw_j$, we employ the efficiency


bound $e_\xi(n)$ and part (a) to obtain $nw_j - n_j \leq nw_j\bigl(1 - e_\xi(n)\bigr) \leq w_j\ell$. Altogether this yields $|w_j - n_j/n| \leq (1 + w_j\ell)/n$, and hence part (b).

In part (c), convergence of $M(\xi_n/n)$ to $M(\xi)$ follows from the convergence of the weights, since the support is common to all designs $\xi_n$ and $\xi$. Then the matrices $A_n = \bigl(1/e_\xi(n)\bigr)M(\xi_n/n)$ converge to $M(\xi)$ and fulfill $A_n \geq M(\xi)$ for all $n$, by Lemma 12.6. For an information function $\phi$ on $\mathrm{NND}(k)$, Lemma 5.7 now entails $\lim_{n\to\infty}\phi(A_n) = \phi\bigl(M(\xi)\bigr)$ and $\lim_{n\to\infty}\phi\bigl(M(\xi_n/n)\bigr) = \phi\bigl(M(\xi)\bigr)$.

In the general design problem the optimality criterion is given by an information function $\phi$ on $\mathrm{NND}(k)$. Then the asymptotic order $O(n^{-1})$ carries over to the $\phi$-efficiency when discretizing a $\phi$-optimal design $\xi$ in some subclass $\widetilde\Xi \subseteq \Xi$. For a design $\xi_n$ in the efficient apportionment $E(\xi,n)$, we get $\phi\text{-eff}(\xi_n/n) \geq e_\xi(n) \geq 1 - \ell/n$, for every sample size $n$, by Lemma 12.6. The asymptotic order becomes $O(n^{-2})$ as soon as the optimality criterion $\phi$ is twice continuously differentiable in the optimal design $\xi$. The differentiability requirement pertains to the weight vector $(w_1,\ldots,w_\ell)' \in \mathbb{R}^\ell$ of $\xi$.

12.10. ASYMPTOTIC ORDER $O(n^{-2})$

Theorem. Suppose the set $\widetilde\Xi \subseteq \Xi$ of competing designs is convex. For an information function $\phi$ on $\mathrm{NND}(k)$, let the design $\xi \in \widetilde\Xi$ be $\phi$-optimal for $\theta$ in $\widetilde\Xi$, assigning weight $w_i$ to $\ell$ support points $x_i \in \mathcal{X}$. Assume that the mapping $(u_1,\ldots,u_\ell) \mapsto \phi\bigl(\sum_{i\le\ell}u_ix_ix_i'\bigr)$ is twice continuously differentiable at $(w_1,\ldots,w_\ell)' \in \mathbb{R}^\ell$. Then, for some $c > 0$ and for all sample sizes $n$, the designs $\xi_n$ in the efficient design apportionments $E(\xi,n)$ satisfy


Proof. Since the efficient design apportionment leaves the support of $\xi$ fixed, we need only consider designs on the finite regression range $\mathcal{X} = \mathrm{supp}\,\xi$ where, of course, $\xi$ continues to be $\phi$-optimal. The set of weights $W = \{(\eta(x_1),\ldots,\eta(x_\ell))' \in \mathbb{R}^\ell : \eta \in \widetilde\Xi,\ \mathrm{supp}\,\eta \subseteq \mathcal{X}\}$ inherits convexity from $\widetilde\Xi$. The vector $w = (w_1,\ldots,w_\ell)' \in W$ maximizes the concave function $f(u_1,\ldots,u_\ell) = \phi\bigl(\sum_{i\le\ell}u_ix_ix_i'\bigr)$ over $W$. In the presence of differentiability, this happens if and only if the gradient $\nabla f(w)$ is normal to $W$ at $w$,

If for any one Euclidean unit vector $e_i \in W$ strict inequality holds in (1), then this entails the contradiction $w'\nabla f(w) = \sum_{i\le\ell}w_ie_i'\nabla f(w) < w'\nabla f(w)$. But equality for all Euclidean unit vectors extends to equality throughout (1),

The Taylor theorem states that for every $u$ in a compact ball $\mathcal{K}$ around $w$ there exists some $\alpha \in [0;1]$ such that

where $H_f(u)$ denotes the Hesse matrix of $f$ at $u$. For $u \in W$, the gradient term vanishes because of (2). Hence for $u \in \mathcal{K}\cap W$, we obtain the estimate

where $\tilde c = \max_{u\in\mathcal{K}}\lambda_{\max}\bigl(-H_f(u)\bigr)/f(w)$ is finite since the Hesse matrix $H_f(u)$ depends continuously on $u$ and $\mathcal{K}$ is compact. We have $\tilde c \geq 0$ since otherwise, for $u \neq w$, we get $1 - f(u)/f(w) < 0$, contradicting the optimality of $w$.

Any assignment $(n_1,\ldots,n_\ell) \in E(\xi,n)$ leads to a vector $u = (n_1/n,\ldots,n_\ell/n)'$ with squared Euclidean distance to $w$ bounded by $\|u - w\|^2 \leq 2\ell^2/n^2$

(see the proof of Theorem 12.9). Hence there exists some $n_0$ beyond which $u$ lies in $\mathcal{K}$ where (3) applies, $1 - \phi\text{-eff}(\xi_n/n) \leq \tilde c\,\ell^2/n^2 = O(n^{-2})$. Now the assertion follows where $c$ is the maximum of $\tilde c$ and the numbers $(n^2/\ell^2)\bigl(1 - \phi\text{-eff}(\xi_n/n)\bigr)$ for $\xi_n \in E(\xi,n)$ and $n \leq n_0$.

An example is shown in Exhibit 12.3, for the efficient apportionment of the $\phi_{-\infty}$-optimal design $\tau_{-\infty}$ for the full parameter vector $\theta$ in a tenth-degree polynomial fit model over $[-1;1]$. In Section 9.13, it is shown that the


EXHIBIT 12.3 Asymptotic order of the E-efficiency loss. The scaled efficiency loss $(n^2/121)\bigl(1 - \phi_{-\infty}\text{-eff}(\tau_n/n)\bigr)$ has maximum $c = 0.422$ at $n = 257$, where $\tau_n$ is the efficient apportionment of the $\phi_{-\infty}$-optimal design $\tau_{-\infty}$ for $\theta$ in a polynomial fit model of degree 10 on $[-1;1]$.

smallest eigenvalue of the associated moment matrix has multiplicity 1, whence the criterion function is differentiable at $M(\tau_{-\infty})$ and the present theorem applies. The exhibit displays the scaled efficiency loss $(n^2/121)\bigl(1 - \phi_{-\infty}\text{-eff}(\tau_n/n)\bigr)$, which for sample size $n \leq 1000$ stays below $c = 0.422$. Exhibit 12.4 in Section 12.12 illustrates the situation for the determinant criterion $\phi_0$.

12.11. SUBGRADIENT EFFICIENCY BOUNDS

On our way to exploiting the assumption that the criterion is twice differentiable in the optimum, we may wonder what we can infer from simple differentiability. We briefly digress to provide some evidence that no new insight is obtained.

Similar to Lemma 12.6, another pairwise comparison of two designs $\eta$ and $\xi$ in $\Xi$ is afforded by the subgradient inequality of Section 7.1,

As before, we assume $\xi$ to be $\phi$-optimal, with moment matrix $M = M(\xi)$ and with $\ell$ support points $x_i$ and weights $w_i$. The design $\eta = \xi_n/n$ is taken


to be close to $\xi$, with weights $n_i/n$ on the same support points $x_i$. Dividing by the optimal value $v(\phi) = \phi(M)$, we obtain a bound for the efficiency loss,

The first factor is close to 1 provided $\phi$ is continuously differentiable in $M$. Then $\xi_n/n$ eventually lies in the neighborhood of $\xi$ where differentiability obtains, whence the subdifferential of $\phi$ at $M_n = M(\xi_n/n)$ uniquely consists of the gradient $\nabla\phi(M_n)$. These gradients converge to $\nabla\phi(M)$ because of continuous differentiability. By Theorem 7.9, the matrix $N = \nabla\phi(M)/\phi(M)$ solves the polarity equation $\phi(M)\phi^\infty(N) = \langle M,N\rangle = 1$. The General Equivalence Theorem 7.14 now translates the optimality of $\xi$ into the normality (in)equality $\max_{i\le\ell}x_i'Nx_i = 1$. Therefore, given $\delta > 0$, there exists some $n_0$ such that for $n \geq n_0$, we have $\max_{i\le\ell}x_i'\nabla\phi(M_n)x_i/\phi(M) \leq 1 + \delta$.

The second factor in (1) is bounded by $\sum_{i\le\ell}|w_i - n_i/n| \leq \sum_{i\le\ell}(1 + w_i\ell)/n = 2\ell/n$, for the members $(n_1,\ldots,n_\ell)$ in the efficient design apportionment $E(\xi,n)$. To rid us of the factor 2, we may use the quota method instead and obtain $\sum_{i\le\ell}|w_i - n_i/n| \leq \ell/n$. In summary, (1) leads at best to the bound

This result is inferior to Theorem 12.9 in every detail. It excludes an initialsection up to some unknown sample size «o» and it features a constant 1 + 5larger than 1. It demands continuous differentiability, and it guides us towardsthe quota method with its appalling monotonicity behavior.

However, an objection is in order. Theorem 7.9 says, amongst others, thatevery subgradient B e d^(M(t])) satisfies (M(T/),5)/0(M(^)) = 1. Thushidden in the subgradient inequality is the term <£(Af(i7)) - (M(i)),B} = 0,and estimating a term that is 0 anyway is unlikely to produce a tight bound.A more promising starting point is

With 17 = gn/n close to £, again B becomes the gradient (Mn) at Mn =

Page 352: Pukelsheim Optimal DoE

12.11. SUBGRADIENT EFFICIENCY BOUNDS 319

And again continuous differentiability is generally needed to make (2) con-verge to 0. Hence against initial hope, the objection does not lead to weakerassumptions than those called for by (1).

The merits of (2) are that it comes for free with the General EquivalenceTheorem 7.14. By Theorem 7.9, the matrix Nn = V(f>(Mn)/<J>(Mn) solvesthe polarity equation <f>(Mn)<J>°°(Nn) = (Mn,Nn) = 1. In order to check theoptimality of TJ = &/«, the General Equivalence Theorem 7.14 directs usto invest some effort to compute max,<£ */Wn*,-. Although optimality fails tohold if the maximum exceeds 1, inequality (2) still rewards us the efficiencybound

In the absence of differentiability, the bounds (2) and (3) may be quite bad.The following example builds on Section 3.16 and uses the singular momentmatrix

where the criterion function is not continuous, let alone differentiable. Thematrix belongs to the one-point design in zero, TO, which is optimal for theintercept in the line fit model with experimental domain [-!;!]. That is, theparameter of interest is

We converge to TO along a straight line, ra = (1 - «)TO + «TI, as a tends to0. Our starting point is the designthe moment matrix Ma of T« and the matrixfor the normality inequality are

Page 353: Pukelsheim Optimal DoE

320 CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

Hence the maximum of (l ,f)W0(l ,r) ' over t = ±1,0, or even over t e [—1;1],is 9/(4 - a) and converges to 9/4 as a tends to 0. The right hand sides in (2)and (3) converge 5/9 and 4/9, and leave a gap to the left hand sides 0 and1, respectively.

12.12. APPORTIONMENT OF D-OPTIMAL DESIGNS ENPOLYNOMIAL FIT MODELS

In the remainder of the chapter, we highlight a rare case where efficient de-sign apportionment preserves optimality: if £ is <£ -optimal in E, then E(g,n)is (f> -optimal in En. Thus simple rounding reconciles two quite distinct opti-mization problems, a convex one and a discrete one. We illustrate this verypleasing property with a polynomial fit model of degree d. As in Section 1.28,we shift attention from the designs £ e E on the regression range -X to de-signs T € T on the experimental domain T — [—!;!]. On T, the set of alldesigns for sample size n is denoted by Tw.

In a d th-degree model, the parameter vector 6 has d + 1 components. Itis convenient to use the abbreviation

At least k observations are needed to secure nonsingularity of the momentmatrix. The sample sizes n > k are represented as an integer multiple m of kplus a remainder r e {0,..., k — 1},

In terms of n we have m = \n/k\ and r = n — mk.The <fo-optimal design for 0 is TQ and has a support of smallest possi-

ble size, k, derived in detail in Section 9.5. Moreover TQ assigns uniformweights 11k to its k support points //. The efficient design apportionment£(£, ri) salvages from this uniformity as much as possible. Every reasonableapportionment method does the same, whence the following results are notso much indicative of any one apportionment method. Rather, they underlinethe peculiar properties of the determinant criterion <fo, and of designs whichhave a minimal support.

For sample size n = mk+r > k, the members of the efficient design appor-tionment £(£, n) have in common that they assign at least m observations toeach of the k support points f,-. They differ in where the remaining r obser-vations are placed. Hence E(£,ri) contains (*) designs for sample size n. Itfollows from Theorem 12.7 that all of them share the same efficiency bound.In the present setting, we also find that the ^-efficiencies themselves areconstant.

Page 354: Pukelsheim Optimal DoE

12.12. APPORTIONMENT OF D-OPTIMAL DESIGNS IN POLYNOMIAL FIT MODELS 321

Claim. In a dth-degree polynomial fit model and for sample size n > k,the criterion value <f>o(Md(Tn/n)) is constant over the efficient design appor-tionment E(TQ,ri). With n = mk + r as above, the efficiency loss is boundedby

Proof. To see this, we represent the moment matrix as Md(Tn/n) =X'kuX, where the model matrix X is as in Section 9.5 and Au is the di-agonal matrix with weights M, = rn(ti)/n on the diagonal. That is, r weightsare equal to (m + l)/n, while the remaining k - r weights are m/n. Hencethe optimality criterion takes the value

and does not depend on the specific assignments of the frequencies m + Iand m. The overall optimal value is f</((&)) = (te\.X)2/k/k, whence followsthe equality in (1).

In order to establish the bound, we introduce the relative remainder a =r/k e [0;1] and multiply (1) by n2/k2 to obtain the scaled efficiency loss

The biggest efficiency loss is encountered in the first period, m = 1, whenthe discretization effect is felt most. Hence we get a) (1 + a - 2a) < 0.135. Thus the proof is complete.h

For later periods, m —» oo, the functions (m + a) (m + a - m (1 + l/m)a)converge uniformly for a e [0; 1] to the parabola |a(l-a), which at a = \ hasmaximum 1/8. Therefore the limiting bound for large sample size n tightensonly slightly, from 0.135 to 0.125. Note that the same bounds c = 0.135 and0.125 hold for all degrees d > I .

For degree 10, the scaled efficiency loss is shown in Exhibit 12.4. In the first period, n = 11,..., 22, the maximum is0.134 while we use the bound 0.135. The fourth period, n = 44,... ,55, hasmaximum 0.126 which is getting close to the limiting maximum 0.125.

Page 355: Pukelsheim Optimal DoE

322 CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

EXHIBIT 12.4 Asymptotic order of the D-efficiency loss. The scaled efficiency loss AQ(«) =(n2/121)(l - <^o-eff(Tn/n)) is bounded by c = 0.135, where rn is an efficient apportionmentof the ^o-optimal design T*° for 0 in a polynomial fit model of degree 10 on [-1; Ij.

12.13. MINIMAL SUPPORT AND FINITE SAMPLE SIZEOPTIMALFTY

With little effort, we can establish a first discrete optimality property of theefficient design apportionment E(TQ,II), in a subclass Tn of designs for samplesize n. Let Tn be the set of such designs for sample size n that have a smallestpossible support size for investigating the full parameter vector 0,

In this class, the best way to allocate the k support points is the one realizedby the efficient design apportionment £(T^,/I), according to the following.

Claim. In the d th-degree polynomial fit model, the designs in the efficientdesign apportionment £(TQ,H) are the only (fo-optimal designs for 6 in Tw.

Proof. To prove this, let rn 6 Tn be a competing design for sample size «,with k = d + 1 support points 7t and with frequencies n,-. We denote theassociated model matrix by X = (/(/o),. - . ,/(£/)) '• The optimality criterionhappens to separate the contribution of the support points from that of theweights,

Page 356: Pukelsheim Optimal DoE

12.13. MINIMAL SUPPORT AND FINITE SAMPLE SIZE OPTIMALITY 323

Degree

45678

±1,±1,±1,±1,±1,

Support points of r*+2

±0.6629, ±0.1154±0.7722, ±0.3541, 0

±0.8328, ±0.4920, ±0.1239±0.8734, ±0.6053, ±0.2739, 0

±0.9006, ±0.6837, ±0.3919, ±0.1083

<fo-eff(T,*+2)

0.95850.96370.97180.97520.9795

<fo-eff(Td+2)

0.95720.96200.96620.96910.9720

EXHIBIT 12.5 Nonoptimality of the efficient design apportionment. In a d th-degree poly-nomial fit model over [—1; 1], the designs r*+2 for sample size n = d+2 have larger ^-efficiency

for d than the designs rd+2 m the efficient design apportionment E(rft,d + 2).

In order to estimate the first factor, we introduce the design T € T whichassigns uniform weight l//c to the support points f, of rn. As before, welet X be the matrix that belongs to the support points r, of TQ . The unique<£o-optimality of TQ entails the determinant inequality

with equality if and only if the two support sets match, {ftf}. The second factor only depends on the weights. If there are frequencies nt

other than m or m + 1, then some pair «; and nk, say, satisfies We transfer an observation from / to k to define the competitor

Then the optimality criterion is increased because Qiiijnk < njnk+fij~nk-l =rijnk. In summary, every design rn e E(^,n) and any competing designTn G f „ fulfill

with equality if and only if ?„ e E(r^n). Thus the proof is complete.

From Theorem 10.7, designs with more than k support points are inadmis-sible in T. But there is nothing in the development of Chapter 10 to implythat this will succeed for the discrete design set Tn. Indeed, there are designsfor sample size n with more than k support points that are superior to the ef-ficient design apportionment £(T^,/I), if only slightly so. Examples are givenin Exhibit 12.5.

This appears to be an exception rather than the rule. In fact, the excep-tional cases only start with degree d > 4 and, for a fixed degree d, arerestrained to finitely many sample sizes n < nd. In all other cases, n > nd, it

Page 357: Pukelsheim Optimal DoE

324 CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

is the efficient design apportionment E(TQ , n) which yields the <fo-optimal de-signs for 6 in TM. To deduce this result and to calculate nd (in Section 12.16),we start with a lemma which permits a nonconvex set of competing momentmatrices M. Based on the subgradient inequality, it provides a sufficient crite-rion for a finite subset £ C M to be a complete class relative to the optimalitycriterion <f>, in the sense that every moment matrix A € M is matched by amatrix M e £ which under <j> is at least as good as A.

12.14. A SUFFICIENT CONDITION FOR COMPLETENESS

Lemma. Assume that the optimality criterion <f> is an information func-tion on NND(&). Let M. C M(3) be an arbitrary subset of moment matriceswhich includes a finite subset £ of positive definite matrices. If for everymoment matrix A e M there exists a matrix M e £, such that some non-negative definite k x k matrix N solves the polarity equation 0(M)<^°°(N) =trace MN — 1 and satisfies trace AN < 1, then at least one of the momentmatrices in 8 is <£-optimal for 6 in M. If, in addition, </> is strictly isotonicon PD(£) and the polar function <£°° is differentiable on PD(£), then every<£ -optimal moment matrix for 0 in M is a member of £.

Proof. Let the moment matrices A € .M and M € £ be such that theequation <f>(M)<f>°°(N) = trace MN — 1 has a nonnegative definite solu-tion N satisfying trace AN < 1. Then <f>(M)N is a subgradient of <£ at M,by Theorem 7.9. Because of <£(M) - (M,4>(M)N} = 0, the subgradient in-equality simplifies

where the second inequality exploits the assumption trace AN < 1. From thefiniteness of £, we get supA€M 4>(A) = maxW€f <£(M), thus establishing thefirst part of the assertion.

If, in addition, $ is strictly isotonic on PD(/c), then the matrix N is pos-itive definite, by Lemma 7.5. Now let A e M be <f>-optimal for 6 in M.Then equality holds throughout (1), entailing <f>(A)<f>°°(N) = trace AN = 1.Part (c) of Theorem 7.9 yields that <t>°°(N)A is a subgradient of the polarfunction 0°° at N, as is (f>°°(N)M. Differentiability of $°° leaves the gradientas the only possible subgradient, we obtain A = M e £.

For the determinant criterion <fo, we apply the lemma to the momentmatrices of the designs in the efficient apportionment £(£,«) that comeswith a 0o-optimal design £ for 0 in H. The Kiefer-Wolfowitz Theorem 9.4provides the necessary and sufficient normality inequality,

Hence

Page 358: Pukelsheim Optimal DoE

12.15. A SUFFICIENT CONDITION FOR FINITE SAMPLE SIZE D-OPTIMALITY 325

Let us assume that the optimal design £ has a smallest possible support size, k.Then the weights of £ are uniform, I/A;, and the support points x\,..., xk e Xof £ assemble to form the nonsingular k x k matrix X1 = (x\,... ,xk). WithM(£) = X'X/k, the left hand side in (1) becomes kx'X'lX'~lx . Upondefining g(jc) = X'~lx, the normality inequality (1) turns into

The following theorem proposes a strengthened version of (2) by replacingthe constant upper bound 1 by a variable upper bound. The tighter, andhence only sufficient, new bound is a convex combination of 1 and of thefunction max(</tgf(*) < 1, the weighting depending on the sample size nthrough the multiple m = [n/k\. The goal is to start out from optimalitystatements relative to the class H of all designs, and to deduce optimalityproperties relative to the discrete class En of designs for sample size n.

12.15. A SUFFICIENT CONDITION FOR FINITE SAMPLE SIZED-OPTIMALITY

Theorem. Let the design £ e H be (^-optimal for 6 in H, and have ksupport points j t i , . . . , x^ e X. With the nonsingular k x k matrix X' —( jc j , . . . ,xk), we define the function g(x) = X'~lx : Uk —> Rk. If, for some m,we have

then, for all n > mk, the designs £, for sample size n that constitute theefficient design apportionment E(£,ri) are <fo-optimal for 6 in E«. If, in addi-tion, Y^i<kS2i(x} < 1 for all * e #\ {*!,...,jty} then every (^-optimal designfor 6 in EM is a member of E(£,n).

Proof. The proof is accomplished in three steps, showing that condi-tion (1) entails the sufficient condition of Lemma 12.14, with M = M(s,n/n}and£ = {M(&/n): £, e E(£,n)}.

I. First we identify a design £„ in the efficient design apportionment £(£, n)with an /--element subset I of {!,...,£}, through

To indicate this one-to-one correspondence, we denote the members ofE(£,ri) by &. Thus gi/n assigns to :c, the weight *v« = (ra+l)/n for / e J, and

Page 359: Pukelsheim Optimal DoE

326 CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

Wi — m/n for / 0 J. Upon introducing the vector w(J) = (w\,..., wk)', thesolution to the polarity equation for the moment matrixbecomes NI = X~l^~l^X'~l/k, by Lemma 6.16. Therefore Lemma 12.14 de-mands verification of

We represent the moment matrices inregression vectors _ y i , . . . ,yn in X. Setting

we obtain

Thus the hypothesis of Lemma 12.14 turns into

n. In the second step, we prove that assumption (1) implies (2). For n =mk + r with remainder r e {0,..., k - 1}, we choose arbitrary regressionvectors y\,...,yn in X. For each j = !,...,«, there exists some index /; €{!,...,*} such that g?(yy) = max/<*g?(yy).

Let J be some r-element subset of {1,...,/:}. In any case, we have

since optimality of £ entailsmaximum is attained over J, ij € J, leads to

with

The particular case that the

Page 360: Pukelsheim Optimal DoE

12.15. A SUFFICIENT CONDITION FOR FINITE SAMPLE SIZE D-OPTIMALITY 327

This and assumption (1) yield an estimate that improves upon (3):

Introducing the counts c, = #{; '< n : i-} — i}, we find that there areparticular cases with improved estimates (4). Summation of (3) and

(4) over ; < n leads to Now we form the minimum over the r-element subsets J,

where among C i , . . . , ck. This sum attains the smallest value, r(m +1), if the sum ofthe k — r smallest counts is largest, (k — r)m. Insertion ofyields (2).

Hence (1) implies (2). Lemma 12.14 now states that the efficient designapportionment £(£, n) contains a design for sample size n which is <fo-optimalfor 6 in Hn. Using the same arguments as in Section 12.12, we see that thedeterminant criterion <fo is constant over E(£,n). Thus optimality of onemember in E(g, n) entails optimality of all the others.

III. Finally, let £„ be any other design for sample size n which is <fo-optimal for 0 in H/,. From the second part of Lemma 12.14, the momentmatrix M(£n/ri) coincides with one of the matrices With this, we find

Now if, in addition to must be supported by jci , . . . ,xk. Denoting the weights by ^,(xi)/n = w/, weobtain the equality u = w from Therefore £, is a member of E(£,n), and the proof is complete.

is the sum of the r largest counts

then for all

Page 361: Pukelsheim Optimal DoE

328 CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

12.16. FINITE SAMPLE SIZE D-OPTIMAL DESIGNS INPOLYNOMIAL FIT MODELS

We now return to the assertion of Section 12.13 that the efficient designapportionment £(T^,AI) yields <fo-optimal designs in the discrete class Tn,for polynomial fit models on the experimental domain T = [—1;1], for largeenough sample size n. We claim the following.

Claim. For a d th-degree polynomial fit model there exists an integer nd

such that for all n > nd, the designs for sample size n in the efficient designapportionment E(TQ,H) are the only <fo-optimal designs for 0 in Tn.

Proof. From Section 9.5, we recall that the support points of the <fo-optimal design TQ for 6 in T are /o>' i> • • • i td, and that the matrix V = X'~l

is the coefficient matrix of the Lagrange polynomials L, with nodes f/. In thenotation of the previous section, we have k = d + 1 and L,-(f) = gl+1 (f(t)).

We write n = m(d + 1) + r as in Section 12.12. Condition (1) of Theo-rem 12.15 requires, for all t e [-1; 1],

To isolate m we rearrange terms, From this, we see that the key object is the function

where #,(r) = (1 - L?(0) /P(t) are rational functions, for all i = 0,... ,d,with common denominator P(t) = 1 - £)f=o Lj(t). The behavior of the poly-nomial P is studied in Section 9.5. It is of degree 2d, with a negative highestcoefficient, and has d -1 local minima at t\,..., td_\. Thus at fo and td the firstderivative is nonzero, while at t\,..., td_\ the first derivative vanishes but thesecond derivative is nonzero.

At //, the singularity of Rj is removable and Rj is continuous, as followsby applying the 1'Hospital rule once at the endpoints i — 0, d, and twice inthe interior / — 1,... ,d - 1. For j ^ /, the function Rj has a pole at f/. Asa consequence R, being the minimum of RQ,RI, ... ,Rd, coincides around r,with Ri. Therefore R is continuous at fo, f i , . . . , td- Of course, R is continuousanywhere else. Thus it has a finite maximum over [-1;1],

Page 362: Pukelsheim Optimal DoE

EXERCISES 329

dVdrid

112

.521.873

33.2512

43.9315

55.1230

65.8835

76.9948

87.7963

98.8780

109.6999

EXHIBIT 12.6 Optimality of the efficient design apportionment. For sample size n > nd =(d + l)[/jt<f — 1], the efficient design apportionment E(rfi,n) yields (fo-optimal designs for 6 inthe discrete design set Tw, in a dth-degree polynomial fit model. The bounds nd are not bestpossible.

We define md to be the smallest integer w fulfilling fjid < m + 1, that is,md — \^d — l"l- Now Theorem 12.15 applies for n > nd, with nd = (d + l)md.Thus our claim is established.

The lower bounds nd that emerge during the proof are tabulated in Ex-hibit 12.6. These numbers are not best possible. For instance, for degreed = 3, we obtain n>\2 while it is known that for every n > 4 the designs inE(rQ,n) are optimal in T«.

Preservation of <fo-optimality under the efficient apportionment methodcannot be expected to hold in general. Surprisingly enough it breaks thesymmetry that carries much of the intuitive appeal of the design rfi. For ex-ample, in a third-degree polynomial fit model the <fo-optimal design TQ for6 — (0o, #i, #2, #3)' places uniform weights 1/4 on the four points -1, -l/\/5,l/v/5,1. For sample size five, the efficient apportionment £(^,5) consists ofthe four permutations of the assignment 2,1,1,1. Each of these leads to thesame <fo-efficiency 0.951 for 9. The apportionment 2,1,1,1 and its permuta-tions violate our intuitive feeling that symmetry is a necessity, for optimality inpolynomial fit models on the experimental domain [—!;!]. However, intuitionerrs for finite sample sizes n. The (^-optimal symmetric design for sample sizefive assigns one observation to each of the five points -1, —0.511,0,0.511,1,but has for 6 an inferior ^-efficiency of 0.937.

This provides yet further evidence that discrete design problems are verypeculiar, in that often a reduction by symmetry fails to go through. On theother hand, our main problem of interest, of finding optimal designs forinfinite sample size, may very well afford a reduction by invariance. This isthe topic of the last three chapters.

EXERCISES

12.1 Show that the efficient design apportionment turns a design withweights 0.27744, 0.25178, 0.19951, 0.14610, 0.09225, 0.03292 into a de-

Page 363: Pukelsheim Optimal DoE

330 CHAPTER 12: EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

sign for sample size 36 with frequencies 10, 9, 7, 5, 3, 2 [Balinski andYoung (1982), p. 96].

12.2 Which of the weights of the <fo-optimal designs TQ in Exhibit 9.4 gotrounded and how?

12.3 In a polynomial fit model of degree 8, the <f>_i -optimal design rf j for 6in T assigns weights 0.045253, 0.098316, 0.121461, 0.151682 to points±t i=- 0 and 0.166576 to 0. Show that the quota method is not samplesize monotone from n = 1005 to n — 1006.

12.4 In a polynomial fit model of degree 10, the $_j -optimal design T™ for 6in T assigns weights 0.037259, 0.078409, 0.089871, 0.106787, 0.122870 topoints ±t / 0 and 0.129609 to 0. Compare the efficient apportionmentfor sample size n = 1000 with the numerical rounding.

12.5 In a polynomial fit model of degree 10, the $_oo-optimal design ri^for 0 in T assigns weights 0.036022, 0.075929, 0.087833, 0.106714,0.126145 to points ±t ^ 0 and 0.134714 to 0. Use the efficient designapportionment to obtain a design for sample sizes n = 11,..., 1000.Compare the results with Exhibit 9.11 and Exhibit 12.3.

12.6 Show that if «/ = [(n - ^)vv,] sum to n, then £(£,«) is the singleton{(HI, . . . ,«#)}, where wi, . . . , wf are the weights of £ € H.

12.7 Show that the designs £„ € £(£,«) satisfy lim,,-^ C/c(Af(£n//i)) =CK(M(£)}, for every £ e E and for every full column rank matrixK e Rkxs.

12.8 Show that the total variation distance between £ and £n/n is boundedby max

12.9 In a line fit model over [— 1;1], show that the globally optimal designfor 6 among the designs for odd sample size n = 2m + l is r*(±l) = m,r;(0) = 1. Compare the efficiency loss with that of the efficient apportionment of the globally optimal designr(±l) = 5 in T, and with the uniform bound of Theorem 12.8.

and for all

Page 364: Pukelsheim Optimal DoE

C H A P T E R 13

Invariant Design Problems

Many design problems enjoy symmetry properties, in that they remain invariantunder a group of linear transformations. The different levels of the problemactually have different groups associated with them, and the interrelation ofthese groups is developed. This leads to invariant design problems and invari-ant information functions. The ensuing optimality concept studies simultaneousoptimally relative to all invariant information functions.

13.1. DESIGN PROBLEMS WITH SYMMETRY

The general design problem, as set out in Section 5.15 and reviewed in Sec-tion 7.10, is an intrinsically multivariate problem. It is formulated in terms ofk x k moment matrices, whence we may have to handle up to \k(k +1) realvariables. Relying on an optimality criterion such as an information functionis just one way to reduce the dimensionality of the problem. Another ap-proach, with much intuitive appeal and at times very effective, is to exploitsymmetries that may be inherent in a problem. The appropriate frame is tostudy invariance properties which come with groups of transformations thatact "on the problem". However, we need to be more specific what "on theproblem" means.

The design problem has various levels, such as moment matrices, informa-tion matrices, or optimality criteria, and each level has its particular manifes-tation of symmetry when it comes to transformation groups and invarianceproperties. In the present chapter we single out these particulars, a task thatis sometimes laborious but clarifies the technicalities of invariant design prob-lems.

We find it instructive to first work an example. Let us consider a parabolafit model, with experimental variables ti,...,te in the symmetric unit intervalT = [—1; 1]. The underlying design T lives on the experimental domain [—1; 1],and assigns weight r(r/) € (0;1) to the support point r,.

331

Page 365: Pukelsheim Optimal DoE

332 CHAPTER 13: INVARIANT DESIGN PROBLEMS

The transformation to be considered is the reflection t H-» R(t) = -t.Whether this transformation is practically meaningful depends on the exper-iment under study. If the variable t stands for a development over time, fromyesterday (t = — 1) over today (/ = 0) to tomorrow (t = 1), then reflectionmeans time reversal and would hardly be applicable. But if t is a technicalvariable, indicating deviation to either side (t — ±1) of standard operatingconditions (t = 0), then —t may describe the real experiment just as well asdoes t. If this is so then reflection in variance is a symmetry requirement thatis practically meaningful.

Then, the first step is to consider the reflected design TR, given by rR(t) =r(—t) for all t e [—!;!]. The designs T and rR share the same even mo-ments pi and pi, while the odd moments of TR carry a reversed sign,

That is, the second-degree moment matrix A/2(r/?) is obtained from A/2(r)by a congruence transformation,

Hence the symmetrized design T — |(T + rR) has moment matrix

This averaging operation evidently simplifies the moment matrices, by lettingall odd moments vanish.

Furthermore, let us consider an information function (f> on NND(3) whichis invariant under the action of Q, that is, which fulfills <f>(QAQ) — <i>(A) forall A € NND(3). Concavity and invariance of <f> imply

Page 366: Pukelsheim Optimal DoE

13.1. DESIGN PROBLEMS WITH SYMMETRY 333

Thus the transition from r to the symmetrized design f improves the crite-rion <f>, or at least maintains the same value.

In a second step, we maximize the fourth moment as a function of thesecond moment. On [—1;1], we have /i4 = /14 dr < /12 dr = ^2, withequality if and only if the only support points of T are ±1 or 0. Hence weintroduce the symmetric three-point design ra which, by definition, distributesmass a e [0;1] uniformly on the boundary {±1},

A Loewner comparison now improves upon (1) for the choice OL — ^LI-,

By monotonicity this carries over to an information function

No further improvement in the Loewner ordering is possible since the three-point design ra is admissible, by Section 10.7.

In a third step, we check the eigenvalues of the moment matrix M2(ra):

All of them are increasing for a e [0;2/5] (see Exhibit 13.1).Thus the eigenvalue vector A(2/5) is a componentwise improvement over

any other eigenvalue vector A (a) with a < 2/5,

This eigenvalue monotonicity motivates a restriction to orthogonally invari-ant information functions </>, that is, those information functions that satisfy<I>(QAQ') = <f>(A) for all A e NND(3) and Q e Orth(3). This and Loewnermonotonicity for the diagonal matrices AA(a) yield a final improvement,

Page 367: Pukelsheim Optimal DoE

334 CHAPTER 13: INVARIANT DESIGN PROBLEMS

EXHIBIT 13.1 Eigenvalues of moment matrices of symmetric three-point designs. Theeigenvalues A,(a) of the moment matrices of the symmetric three-point designs ra are in-creasing for a € [0;2/5], for / = 1,2,3.

In summary, we have achieved a substantial reduction. For every orthog-onally invariant information function <f> on NND(3) and for every designT e T, there exists a weight a e [2/5; 1] such that under </> the symmetricthree-point design r« is no worse than T,

Instead of seeking the optimum as r varies over T, we may therefore solvea problem of calculus, of maximizing the real function g(a) = <f> (A/2(Ta)), asthe real variable a varies in the interval [2/5; 1]. In other words the symmetricthree-point designs ra, with a € [2/5; 1], form a complete class relative toany orthogonally invariant information function <£ on NND(3). The class isas small as possible since its members are uniquely optimal under the matrixmeans (f>p, from a = 2/5 for p = —oo (Section 9.13), over a = 1/2 forp = -1 (Section 9.9), and a = 2/3 for p = 0 (Section 9.5), to a = 1 for p = 1(Section 9.15).

This type of reasoning generalizes to quadratic fit models with m > 1factors (see Section 15.20). In Section 8.6, we put the same approach to workin a linear fit model over the unit square, to be extended to unit cubes ink > 2 dimensions in Section 14.10.

Three steps come together to culminate in the final reduction (4). Eigen-value monotonicity (3) is a technical postlude which helps to elucidate some,but not all models. The averaging operation from (1) and the Loewner com-parison in (2) are the conceptual building blocks, and are the theme of thischapter. We begin our study with an inquiry into invariance properties rela-tive to appropriate transformation groups.

Page 368: Pukelsheim Optimal DoE

13.2. INVARIANCE OF THE EXPERIMENTAL DOMAIN 335

13.2. INVARIANCE OF THE EXPERIMENTAL DOMAIN

The starting point for invariant design problems is a group K of transfor-mations (that is, bijections) of the experimental domain T. In other words,the set 72. consists of mappings R : T -» T which are one-to-one (that is,injective) and onto (that is, surjective), such that any composition RiR^1 isa member of 71 for all R\,R2 £ 11. In particular, the image of T under eachtransformation R is again equal to T,

The precise statement, that 72. is a transformation group on T, is often looselyparaphrased by (1), that the experimental domain T is invariant under eachtransformation R in 72..

In the parabola fit example of the previous section, the experimental do-main is T = [-1; 1], and the group 72. considered there contains two transfor-mations only, the identity t H-» t and the reflection t H-> R(t) = -t.

More generally, let us take a look at the m-way polynomial fit mod-els introduced in Section 1.6. Then the experimental domain T is a sub-set of m -dimensional Euclidean space Rm. In the case where T is a Eu-clidean ball of radius r > 0, Tr = {t € Um : \\t\\ < r}, the transforma-tion group 71 may be the full orthogonal group Orth(ra) or any subgroupthereof. Indeed if R is an orthogonal m x m matrix, R'R = Im, then thelinear transformation t H-> Rt is well known to preserve length, \\Rt\\2 —t'R'Rt =^t't — \\t\\2. Hence any Euclidean ball is mapped into itself. In otherwords, Euclidean balls are orthogonally invariant or, as we prefer to say,rotatable.

Another case of interest emerges when T is an m-dimensional cube ofsidelength r > 0, T = [Q;r]m, where ei denotes the ith Euclidean unit vectorin Um. This particular domain is mapped into itself provided R is a permu-tation matrix. Hence the transformation group 72. could be any subgroupof the group of permutation matrices Perm(m). In other words, cubes arepermutationally invariant.

Yet other cases are easy to conceive. If T is a symmetric rectangle withhalf sidelengths r\,... ,rm > 0, T = [—r\;ri] x • • • x [—rm;rm], then any sub-group 72. of the sign-change group Sign(m) keeps T invariant. Or else, if Tis a symmetric cube of half sidelength r > 0, T — [-r;r]m, then R might bea permutation matrix or a sign-change matrix to leave T invariant. Hence 72could be any subgroup of the group that is generated jointly by the permu-tations Perm(m) and the sign-changes Sign(m).

Exactly which transformations R of the experimental domain T make upthe group 72. depends on which implications the transformation R has onthe practical problem under study, and cannot be determined on an abstractlevel.

Page 369: Pukelsheim Optimal DoE

336 CHAPTER 13: INVARIANT DESIGN PROBLEMS

13.3. INDUCED MATRIX GROUP ON THE REGRESSION RANGE

The most essential assumption is that the group Tl which acts on the experi-mental domain T induces, on the regression range X C R*, transformationsthat are linear. Linear mappings on Rk are given by k x k matrices, as pointedout in Section 1.12. Any such mapping is one-to-one and onto if and only ifthe associated matrix is nonsingular. The largest group of linear transforma-tions on Rk is the general linear group GL(/c) which, by definition, comprisesall nonsingular k x k matrices.

Just as well we may therefore demand that H induces a matrix group Q, asubgroup of GL(/c). Effectively we think of Q as a construction so that anytransformation R e H and the regression function / : T —> Uk "commute", inthe very specific sense that given R eK, there exists a matrix Q € Q fulfilling

The function / is then said to be Tl-Q-equivariant. If the regression range X =f(T) contains k vectors f(t\),... , f ( t k ) that are linearly independent, then (1)

uniquely determines Q through Q = (/(#'i), • . • ,/(***))(/Ci), • • • Jfo))'1.However, we generally do not require the regression range X to span the

full space R*. In order to circumvent any ambiguity for the rank deficientcase, the precise definition of equivariance requires the existence of a grouphomomorphism h from K into GL(fc) so that (1) holds for the matrix Q =h(R) in the image group Q = h(K). Then / is termed equivariant under h.

Of course, the identity transformation in K always generates the identitymatrix Ik in Q. In Section 13.1, the reflection R(t) = —t leads to the 3 x 3matrix Q which reverses the sign of the linear component t in the powervector f(t] — (l,t,t2)'. Generally, determination of the matrix group Q is aspecific task in any given setting.

Many results on optimal experimental designs do not explicitly refer tothe experimental domain T and the regression function /. When we onlyrely on the regression range X C IR*, property (1) calls for a matrix groupQ C GL(fc) with transformations Q fulfilling

The same is stipulated by the set inclusion Q(X) C X. Since the inclusionholds for every matrix Q in the group Q it also applies to Q~l, that is,X C Q(X). So yet another way of expressing (2) is to demand

In other words, the group Q C GL(fc) is such that all of its transforma-tions Q leave the regression range X invariant. Invariance property (3) very

Page 370: Pukelsheim Optimal DoE

13.4. CONGRUENCE TRANSFORMATIONS OF MOMENT MATRICES 337

much resembles property (1) of Section 13.2. However, here we admit lineartransformations, x H-> Qx, only.

13.4. CONGRUENCE TRANSFORMATIONS OF MOMENTMATRICES

The linear action x H-> Qx on the regression range X induces a congruenceaction on the moment matrix of a design £ on X,

The congruence transformation A H-» QAQ' is linear on Sym(fc), as is theaction x H-> Qx on IR*. The sole distinction is that the way Q enters into Qxlooks simpler than with QAQ', if at all.

As a consequence of the in variance of X under Q and (1), a transformedmoment matrix again lies in A/(E), the set of all moment matrices. That is,for every transformation Q 6 Q, we have

As above, this is equivalent to the set M(H) being invariant under the con-gruence transformation given by each matrix Q e Q.

In the setting of the General Equivalence Theorem we start out from anarbitrary compact and convex set M C NND(fc). The analogue of (2) thenamounts to the requirement that M. is invariant under each transformationQeQ,

This is the invariance property to be repeatedly met in the sequel. The twoprevious sections demonstrate that it emerges quite naturally from the as-sumptions that underlie the general design problem.

Invariance property (3) entails that the transformation group Q C GL(k)usually is going to be compact. Namely, if the set of competing moment matri-ces M contains the identity matrix, then (3) gives QQ' e M, and bounded-ness of M implies boundedness of Q,

If the set M contains a positive definite matrix N other than the identitymatrix, then the same argument applies to the norm ||<2||^ = traceQNQ' <supMeM trace M < oo. Closedness of Q is added as a more technical supple-ment.

Page 371: Pukelsheim Optimal DoE

338 CHAPTER 13: INVARIANT DESIGN PROBLEMS

For a parameter system of interest K'O, with a k x s coefficient matrix Kof full column rank s, the performance of a design is evaluated throughthe information matrix mapping CK of Section 3.13. In an invariant designproblem, this mapping is taken to be equivariant. In Section 13.3 equivarianceis required of the regression function /, and the arbitrariness of / precludesmore special conclusions. However, as for the well-studied information matrixmapping C%, a necessary and sufficient condition for equivariance of CK isthat the range of K is invariant under each transformation Q e Q.

13.5. CONGRUENCE TRANSFORMATIONS OF INFORMATIONMATRICES

Lemma. Let CK : NND(A:) —> Sym(s) be the information matrix mappingcorresponding to a k x 5 coefficient matrix K that has full column rank s.Assume Q to be a subgroup of the general linear group GL(k).

a. (Equivariance) There exists a group homomorphism H : Q -» GL(s)such that CK is equivariant under //,

if and only if the range of K is invariant under each transformationQtQ,

b. (Uniqueness) Suppose CK is equivariant under the group homomor-phism H : Q -> GL(s). Then H(Q) or -H(Q) is the unique nonsingulars xs matrix H that satisfies QK = KH, for all Q e Q.

c. (Orthogonal transformations) Suppose CK is equivariant under thegroup homomorphism H : Q —> GL(s). If the matrix K fulfills K'K = Is

and Q e Q is an orthogonal k x k matrix, then H(Q) = ±K'QK is anorthogonal s x s matrix.

Proof. For the direct part of (a), we assume (1) for a given homomor-phism //. Choosing A = QKK'Q' and using CK(KK') = Is, we get

KCK(QKK'Q')K' = KH(Q)CK(KK')H(Q)'K' = KH(Q)H(Q)'K'.

From the range formula of Theorem 3.24 and the discussion of square roots

Page 372: Pukelsheim Optimal DoE

13.5. CONGRUENCE TRANSFORMATIONS OF INFORMATION MATRICES 339

in Section 1.14, nonsingularity of H(Q) leads to

From range QK n range K — range K, we infer that the range of QK in-cludes that of K. Because of nonsingularity of Q both ranges have the samedimension, s. Hence they are equal, Q(rangeK) = ranged.

For the converse of (a), we assume (2) and construct a suitable homomor-phism H. From Lemma 1.17, the range equality (2) entails QK = KLQK forevery left inverse L of K. Any other left inverse L of K then satisfies

Therefore the definition H(Q) — LQK does not depend on the choiceof L € K~. The s x s matrix H(Q) is nonsingular, because of rank//((2) >rank KLQK = rank (2 AT = s where the last equality follows from (2). Forany two transformations Q\, Q2 € Q, we obtain

This proves that the mapping Q i-» H(Q) is a group homomorphism from Qinto GL(s).

It remains to show the equivariance of CK under H. By definition of H,we have KH(Q) = QK, and Q~1K = KH(Q)~\ For positive definite kxkmatrices A equivariance is now deduced by straightforward calculation,

By part (c) of Theorem 3.13, regularization extends this to any nonnegative

Page 373: Pukelsheim Optimal DoE

340 CHAPTER 13: INVARIANT DESIGN PROBLEMS

definite k x k matrix A, singular or not,

This establishes the equivalence of (1) and (2).Part (b) claims uniqueness (up to the sign) of the homomorphism Q i->

LQK. To verify uniqueness, we start out from an arbitrary group homomor-phism H : Q —> GL(s) which fulfills (1), and fix a transformation Q e Q.The matrix QKH(Q}'1 then has the same range as has QK.

Property (2) implies a representation QKH(Q}~1 = KS, for some s x smatrix 5. For any matrix C e NND(s), we have CK(KSCS'K') = SCS'.On the other hand, we can write KSCS'K' = QAQ' with A = KH(Q)~1CH(QYl 'K'. Hence (1) and CK(A) = H(Q)'lCH(QYl' lead to CK(QAQ')= H(Q)CK(A)H(Q)' = C. Thus the matrix S satisfies SCS' = C for allC e NND(s). This forces S = Is or S = -/,. Indeed, insertion of C = Is

shows that 5 is an orthogonal matrix. Denoting the / th row of 5 by z,,insertion of C = z/z/ gives e,e/ = z/z/. That is, we have z, = e,e, with8i e {±1}, whence S = &$ is a sign-change matrix. Insertion of C = lsls'entails e\ = • • • = es = 1 or e\ — • • • = es — — 1.

This yields QKH(Q)'1 = K or QKH(QYl = -K, that is, QK = KH(Q)or QK = -KH(Q). The proof of uniqueness is complete.

Part (c) follows from part (b). If AT' is a left inverse of K, then or-thogonality of Q implies orthogonality of ±H(Q), through H(Q)'H(Q) =K'Q'KK'QK = K'Q'QK = IS.

The invariance property of the range of K also appears in hypothesis test-ing. Because the expected yield is x'0, the action x H-+ Qx on the regressionvectors induces the transformed expected yield x'Q'0, whence the groupaction on the parameter vectors is 6 H-» Q'6. Therefore the linear hypothe-sis K'8 = 0 is invariant under the transformation Q, {0 e Rk : K'6 = 0} ={6 e Rk : K'Q'S = 0}, if and only if the nullspaces of K' and K'Q' areequal. By Lemma 1.13, this means that K and QK have the same range.

Since the homomorphism H is (up to the signs) unique if it exists, allemphasis is on the transformations H(Q) that it determines. In order toforce uniqueness of the signs, we adhere to the choice H(Q) = LQK, anddefine the image group H = {LQK : Q € Q} C GL(s), relying on thefact that the choice of the left inverse L of K does not matter provided therange invariance property (2) holds true. Accordingly, the information matrixmapping CK is then said to be Q-'H-equivariant.

A simple example is provided by the parabola fit model of Section 13.1

Page 374: Pukelsheim Optimal DoE

13.5. CONGRUENCE TRANSFORMATIONS OF INFORMATION MATRICES 341

where the group Q has order 2:

Let K'd designate a subset of the three individual parameters OQ, 6\, 62- Thereare two distinct classes of coefficient matrices K, depending on whether K'Bincludes the coefficient B\ of the linear term, or not. In the first case, K isone of the 3 x s matrices

with s = 3,2,2,1, respectively. Here H is of order 2 as is Q, containing theidentity matrix as well as, respectively,

In the second case, K is one of the matrices

whence K'O misses out on the linear term BI. Here H degenerates to thetrivial group which only consists of the identity transformation. In either case,the information matrix mapping CK is equivariant.

The example also illustrates the implications of our convention to elim-inate the sign ambiguity, by adhering to the choice H = LQK. If the lin-ear coefficient is of interest, K' = (0,1,0), then our convention induces thesubgroup H = {!,-!} of GL(1) = R\{0}, of order 2. However, Hl = 1and H2 — -1 induce the same congruence transformation on Sym(l) = R,namely the identity. Evidently, the trivial homomorphism Q i-» 1 serves thesame purpose.

The same feature occurs in higher dimensions, s > 1. For instance, letQ C GL(3) be the group that reflects the jc-axis and the v-axis in three-dimensional space. This group is of order 4, generated by

Page 375: Pukelsheim Optimal DoE

342 CHAPTER 13: INVARIANT DESIGN PROBLEMS

and also containing the identity and the product Q\ Q2. Let us consider the fullparameter vector 0, that is, K = Ij. According to our convention, we computeH\ = Q\ and H2 = Q2, and hence work with the group H.\ = Q. The signpatterns H\ = —Q\ and H2 = Q2, or H\ = Q\ and H2 = -Q2, or HI = —Q\and HI = —Q2 define alternative groups 7i2, HI, H*. The four groups HI,for i = 1,2,3,4, are distinct (as are the associated homomorphism), but thecongruence transformations that they induce are identical. Hence the identityCi3(A) = A is Q-Hi-invariant, for all i = 1,2,3,4. Our convention singles outthe group H\ as the canonical choice.

Thus, for all practical purposes, we concentrate on s x s matrices H, ratherthan on a homomorphism H. We must check whether, for any given trans-formation Q € Q, there is a nonsingular s xs matrix H satisfying QK — KH.This secures equivariance of the information matrix mapping CK,

The matrices H in question form the subgroup H of GL(s),

where the specific choice of the left inverse L of K does not matter. Since His a homomorphic image of the group Q, the order of H does not exceed theorder of Q. Moreover, if the coefficient matrix K satisfies K'K = Is and Q isa subgroup of orthogonal k x k matrices, then H is a subgroup of orthogonals x s matrices.

Of course, for the full parameter vector 0, the transformations in Q and Hcoincide, Q = H, just as there is no need to distinguish between momentmatrices and information matrices, A = CIk(A).

13.6. INVARIANT DESIGN PROBLEMS

We are now in a position to precisely specify when a design problem is calledinvariant. While C# is the information matrix mapping for a full columnrank kx s coefficient matrix K, no assumption needs to be placed on the setM C NND(&) of competing moment matrices.

DEFINITION. We say that the design problem for K'6 in M is Q-invariantwhen Q is a subgroup of the general linear group GL(k) and all transforma-tions Q € Q fulfill

Page 376: Pukelsheim Optimal DoE

13.7. INVARIANCE OF MATRIX MEANS 343

From the range in variance (2), there exists a nonsingular s x s matrix Hsatisfying QK = KH, for any given transformation Q e Q. In Section 13.5,we have seen that these matrices H compose the set

which is a subgroup of GL(s) such that the information matrix mapping CK

is Q-7i-equivariant. We call H in (3) the equivariance group that is inducedby the Q-invariance of the design problem for K'O in M.

There are two distinct ways to exploit invariance when it comes to dis-cuss optimality in invariant design problems. One approach builds on theKiefer ordering, an order relation that captures increasing information interms of information matrices and of moment matrices, to be introduced inSection 14.2.

The other approach studies simultaneous optimality relative to a wide classof information functions delimited by invariance.

DEFINITION. An information function (f> on NND(s) is called H-invariantwhen H is a subgroup of the general linear group GL(s) and all transforma-tions H € H fulfill

The set of all ^-invariant information functions on NND($) is denoted by$>(H).

Before showing that the two resulting optimality concepts coincide, in Sec-tion 14.6, we study the basics of the two approaches. We start out by devel-oping some feeling for invariant information functions. The most prominentcriteria are the matrix means <f>p. There is a surprising discontinuity in theirbehavior, depending on whether the parameter p vanishes or not.

13.7. INVARIANCE OF MATRIX MEANS

Lemma. Let H be a subgroup of GL(s). For p e [-00; 0) U (0;oo], thematrix mean $p is ^-invariant if and only if *H is a subgroup of orthogonalmatrices, H C Orth(s).

Proof. For the direct part, we first treat p = -co. For a fixed transforma-tion H e H, invariance applies to C = Is and yields </>_oo(HH1) = <£_oo(/j),that is, \min(HH') = 1. Invariance also applies to C — (H'H)~l and gives*-«>(/,) = <f>-oo((H'H)~l), that is, 1 = \min((H'Hrl). Together we getAmin(//'H) = 1 = Amax(//'//). Hence H is orthogonal, H'H = Is. An ana-loguous argument holds for p = oo.

Page 377: Pukelsheim Optimal DoE

344 CHAPTER 13: INVARIANT DESIGN PROBLEMS

Secondly, we treat p e (-oo;0) U (0;oo). With C = Is, invariance yields<f>p(HH') = 4>P(IS), that is, trace(///f 'Y = s. The choice C = (H'H)~l giveslrace(H'H)-p = s. Thus for D = (H'H)P, we get

Hence D = Is and H is orthogonal.In the converse, we assume H to be a subgroup of orthogonal matrices.

But <f>p depends on C only through its eigenvalues (see Section 6.7). Sincthe eigenvalues of C and of HCH' are identical, invariance follows.

Invariance of the determinant criterion fa holds relative to the group ofunimodular transformations,

Unimodular transformations preserve Lebesgue volume, while orthogonaltransformations preserve the Euclidean scalar product. The group of uni-modular transformations is unbounded and hence of little interest in designproblems, except for actually characterizing the determinant criterion <fo.

13.8. INVARIANCE OF THE D-CRTTERION

Lemma. Let H be a subgroup of GL(s), and let </> be an informationfunction on NND(^).

a. (Invariance) The determinant criterion <fo is Ti-invariant if and onlyif H is a subgroup of unimodular matrices, H C Unim(.s).

b. (Uniqueness) The information function <j> is Unim(s)-invariant if andonly if <f> is positively proportional to the determinant criterion <fo-

Proof. For the direct part in (a), invariance implies <fo(////') = 1 forevery transformation H € H. This is the same as |det//| = 1. Conversely,for H e Unim(s), we have invariance, fa(HCH') = ((det//)2detC)1/s =0o(C).

In order to establish part (b), we take an arbitrary information function <f>on NND(s). We fix a positive definite 5 x 5 matrix C = //AA//' where thematrix H is orthogonal and AA is the diagonal matrix with the eigenvaluevector A = ( A t , . . . , A^) ' of C on the diagonal. Since unimodular invarianceembraces orthogonal invariance, we initially get $(C) = <f>(&\)- Defining thenumbers /*,- = A71/2f[y<5AJ / (2s) for i = l , . . . , j we find thatHence the diagonal matrix AM with the vector n = (jti,..., ju,s)' on the di-agonal is unimodular. From invariance and homogeneity, we finally conclude

Page 378: Pukelsheim Optimal DoE

13.9. INVARIANT SYMMETRIC MATRICES 345

Semicontinuity

extends the identity <£ = <f>(Is)<f>o from the open cone PD(s) to its clo-sure NND(j).

The present result emphasizes the prominent role that the determinantcriterion <£o plays in the design of experiments. Another outstanding propertyis the self-polarity of <fo, mentioned in Section 6.14.

An important tool for studying invariant information functions and invari-ant design problems are invariant symmetric matrices, and the subspaces thatthev form. To this we turn next.

13.9. INVARIANT SYMMETRIC MATRICES

For an arbitrary nonempty subset H of s x s matrices, we define a symmetrics x s matrix C to be H-invariant when

The set of all ^-invariant symmetric s xs matrices is denoted by Sym(s,H).Given a particular set H, we usually face the task of computing Sym(s, H).

Here are three important examples,

In other words, the 7i-invariant matrices are the diagonal matrices if H is thesign-change group Sign (s). They are the completely symmetric matrices, thatis, they have identical on-diagonal entries and identical off-diagonal entries,for the permutation group Perm(.s). They are multiples of the identity matrixunder the full orthogonal group Orth(.s).

For verification of these examples, we plainly evaluate C = HCH'. Inthe first case, let H be the diagonal matrix with ith entry -1 and all otherentries 1, that is, H reverses the sign of the ith coordinate. Then the off-diagonal elements of HCH' in the i th row and the i th column are of oppositesign as in C, whence they vanish. It follows that C is diagonal, C — Ag(C),with S(C) = (cn,... ,css)'. Conversely every diagonal matrix is sign-changeinvariant

In the second case, we take any permutation matrix Weget

Page 379: Pukelsheim Optimal DoE

346 CHAPTER 13: INVARIANT DESIGN PROBLEMS

whence the entry c,-y is moved from the / th row and ;th column into their(i) th row and ir(j) th column. As TT varies, the on-diagonal entries becomethe same, as do the off-diagonal entries. Therefore permutational invarianceof C implies complete symmetry. Conversely, every completely symmetricmatrix is permutationally invariant.

The third case profits from the previous two since Sign(s) and Perm(j) aresubgroups of Orth(.y). Hence an orthogonally invariant matrix is diagonal,and completely symmetric. This leaves only multiples of the identity matrix.Conversely, every such multiple is orthogonally invariant.

In each case, Sym(s,H) is a linear space. Also, invariance of C relative tothe infinite set Orth(s) is determined through finitely many of the equationsC = HCH', namely where H is a sign-change matrix or a permutationmatrix. Both features carry over to greater generality.

13.10. SUBSPACES OF INVARIANT SYMMETRIC MATRICES

Lemma. Let Sym(.s, H) C Sym(s) be the subset of H-invariant symmetric5 X 5 matrices, under a subset H. of s x s matrices. Then we have:

a. (Subspace) Sym(s,H) is a subspace of Sym(5).b. (Finite generator) There exists a finite subset HCH that determines

the 7Y-invariant matrices, that is, Sym(s,H) — Sym(5,'H).c. (Powers, inverses) If H consists of orthogonal matrices, H C Orth(s),

then the subspace Sym(5, H) is closed under formation of any power Cp

with p = 2,3,... and the Moore-Penrose inverse C+.

Proof. For H e H, we define the linear operator TH(C) = C - HCH'from Sym(5) into Sym($). Then Syn^s,?^) = f]//eH nullspace TH proves thesubspace property in part (a).

Now we form the linear space £(TH : H e H) that contains all linearcombinations of the operators TH with H G H. Since Sym(.s) is of finitedimension, so is the space of all linear operators from Sym(s) into Sym(5).Therefore, its subspace C(TH : H e H) has a finite dimension d, say, and thegenerator {TH : H € H} contains a basis {TH. : i <d}, where {//i,.. . ,Hd} isa subset of H. Given a matrix C e Sym(.y), the following lines are equivalent:

This proves part (b), with H - {//lt..., Hd}.As for part (c), we first note that for C e Sym(s,H), we get HC2H' =

HCHH'CH' = (HCH')2 = C2, using the orthogonality of H. Hence C2,

Page 380: Pukelsheim Optimal DoE

13.10. SUBSPACES OF INVARIANT SYMMETRIC MATRICES 347

C3,... are also invariant and lie in Sym(s, H}. Now we consider an eigenvaluedecomposition C = X^<r^'^" w^ r distinct nonzero eigenvalues A/. Ther x r matrix A with entries a(y = \{ then has Vandermonde determinant

and is invertible. For fixed k < r, thecoefficient vectorprojectors

Hence the

all lie in Sym(s,n). This yields C+ = £,-<,. P/A,- e Sym(s,W)- It also em-braces nonsingular inverses C"1 and fractional powers Cp where applica-ble.

The larger the sets H, the smaller are the subspaces Sym(5,7i). In thethree examples of the previous section, the groups Sign(.s), Perm(s), Orth(.s)have orders 2s < s\ < oo, while the invariant matrices form subspaces ofdimensions 5 > 2 > 1.

An orthonormal basis for the diagonal matrices Sym(s,Sign(.s)) is givenby the rank 1 matrices e\e[,.. .,ese's, where et is the i th Euclidean unit vec-tor in Rs. Hence for an arbitrary matrix C € Sym(.s), its projection on thesubspace Sym^Sign^)) is

where 5(C) = (cn,...,css)' is the vector of diagonal elements of C.An orthogonal basis for the completely symmetric matrices Sym(5,

Perm(^)) is formed by the averaging matrix Js = lsls'/s and the center-ing matrix Ks = Is - Js from Section 3.20. Any matrix C 6 Sym(^) thus hasprojection

with • and ,The one-dimensional space Sym(s,Orth(.s)) is spanned by Is and C has

projection

There is an alternative way to deal with the projection C of a matrix Conto a subspace of ^-invariant matrices, as the outcome of averaging HCH'

Page 381: Pukelsheim Optimal DoE

348 CHAPTER 13: INVARIANT DESIGN PROBLEMS

over the group H. It is at this point that the group structure of H gainsimportance.

13.11. THE BALANCING OPERATOR

The set <J>(7Y) of Ti-invariant information functions on NND(s) has the samestructure as the set 4> of all information functions on NND(.s). It is closedunder the formation of nonnegative combinations, pointwise infima, and leastupper bounds, as discussed in Section 5.11. The set <J>(ft) is not affected bythe sign ambiguity which surfaces in the uniqueness part (b) of Lemma 13.8,since we evidently have (ft(HCH') = <f>((-H)C(-H)'). The larger thegroup n, the smaller is the set &(H). The largest subgroup for which theclass 4>(?i) is nonempty are the unimodular transformations, H = Unim(s).By Lemma 13.8, the set 4>(Unim(s)) consists of the determinant criterion0o and positive multiples thereof. Of course, the trivial group H = {!„} issmallest and makes every information function invariant.

Under an orthogonal subgroup, H C. Orth(s), all matrix means are invari-ant, <f>p e 3>(H} for all p € [-00; 1], owing to Lemma 13.7 and Lemma 13.8.Furthermore, 4>(?i) then contains a sizeable subset 4>(7i) of 7i-invariant in-formation functions that are linear. While the matrix means are prime criteriafor practical applications, the linear invariant criteria help in understandingthe implications of in variance. Theorem 13.12 shows that the subset 4>(7i)has the same descriptional power as has the full set <&(h).

To this end let us first consider a finite group H ⊆ GL(s) of order #H. We define an averaging operation C ↦ C̄ for symmetric s × s matrices C by

(1)    C̄ = (1/#H) Σ_{H∈H} HCH'.

If the group H ⊆ GL(s) is compact, then the definition of C̄ is

(2)    C̄ = ∫_H HCH' dH,

where the integral is taken with respect to the unique invariant probability measure on the group H. In any case the mapping C ↦ C̄ : Sym(s) → Sym(s) is linear, and we call it the balancing operator.

The balancing operator results in matrices C̄ that are H-invariant, C̄ = HC̄H' for all H ∈ H. Namely, since H is a group, the set {HG : G ∈ H} coincides with H, and the average HC̄H' = Σ_{G∈H} HGCG'H'/#H is the same as C̄ from (1). Thus the balancing operator is idempotent, (C̄)‾ = C̄, and its image space is the subspace Sym(s,H) of H-invariant symmetric matrices studied in Section 13.10. In other words, the balancing operator projects the space Sym(s) onto the subspace Sym(s,H).

For a closed subgroup of orthogonal matrices, H ⊆ Orth(s), we have H' = H⁻¹ ∈ H and {H' : H ∈ H} = H. It follows that the projector C ↦ C̄ is orthogonal,

trace C̄D = trace CD̄    for all C, D ∈ Sym(s).

Given an orthonormal basis V_1,...,V_d of Sym(s,H), we may then calculate the projection C̄ from

(3)    C̄ = Σ_{i≤d} (trace CV_i) V_i.

For an infinite compact group H this is a great improvement over the definition C̄ = ∫_H HCH' dH from (2), since the invariant probability measure dH is usually difficult to handle. The example of Section 13.10, for H = Orth(s), illustrates the method. The cases H = Sign(s) or H = Perm(s) show that formula (3) is useful even if the group H is finite.
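For a small finite group the balancing operator can be evaluated directly from definition (1) and compared with the projection formula (3). The following sketch is my own illustration under my own choice of group and basis (H = Perm(s), with the orthonormal basis J_s, K_s/√(s−1)); it is not taken from the text.

```python
import itertools
import numpy as np

s = 4
rng = np.random.default_rng(1)
A = rng.standard_normal((s, s))
C = (A + A.T) / 2

# Definition (1): average HCH' over all permutation matrices H
perms = list(itertools.permutations(range(s)))
C_bar = np.zeros((s, s))
for p in perms:
    H = np.eye(s)[list(p)]               # permutation matrix
    C_bar += H @ C @ H.T
C_bar /= len(perms)

# Formula (3): projection onto the orthonormal basis V1 = J_s, V2 = K_s/sqrt(s-1)
J = np.ones((s, s)) / s
V1 = J
V2 = (np.eye(s) - J) / np.sqrt(s - 1)
C_proj = np.trace(C @ V1) * V1 + np.trace(C @ V2) * V2

assert np.allclose(C_bar, C_proj)        # the two routes agree
```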

The balancing operator serves two distinct and important purposes. First it leads to an improvement of a given information matrix C,

φ(C̄) ≥ φ(C)    for all φ ∈ Φ(H),

utilizing concavity and H-invariance of an information function φ ∈ Φ(H). In search of optimality we may therefore restrict attention to the information matrices which are H-invariant, C_K(M) ∩ Sym(s,H).

Secondly the balancing operator gives rise to a wide class of H-invariant information functions that are linear,

Φ̄(H) = {C ↦ z'C̄z : z ∈ R^s}.

The following theorem shows that a simultaneous comparison over the large criterion class Φ(H) may be achieved by considering the smaller and more transparent subclass Φ̄(H).

13.12. SIMULTANEOUS MATRIX IMPROVEMENT

Theorem. Assume that H ⊆ Orth(s) is a closed subgroup of orthogonal matrices. Then for every H-invariant matrix C ∈ Sym(s,H) and for every matrix D ∈ Sym(s), the following four statements are equivalent:

a. (Simultaneous improvement) φ(C) ≥ φ(D) for all H-invariant information functions φ.

b. (Linear criteria) φ(C) ≥ φ(D) for all H-invariant information functions φ that are linear.

c. (Invariant Loewner comparison) C ≥ D̄.

d. (Kiefer ordering) There exists a matrix E ∈ Sym(s) satisfying

C ≥ E ∈ conv{HDH' : H ∈ H}.

Proof. Part (a) implies (b) since the latter comprises fewer criteria. Insertion of the specific functions from Φ̄(H) = {C ↦ z'C̄z : z ∈ R^s} in part (b) and invariance of C entail z'Cz = z'C̄z ≥ z'D̄z for all z ∈ R^s. Hence (c) follows.

Next we assume part (c). Since H is compact and the mapping H ↦ HDH' is continuous, the orbit {HDH' : H ∈ H} is compact and so is its convex hull. Therefore the average D̄ = ∫_H HDH' dH, as the limit of finite averages, lies in conv{HDH' : H ∈ H}. This is part (d), with the particular choice E = D̄.

That (d) leads back to (a) is a consequence of the basic properties of H-invariant information functions. For since E can be written as a convex combination Σ_{i≤ℓ} α_i H_iDH_i', say, we obtain the inequalities

φ(C) ≥ φ(E) ≥ Σ_{i≤ℓ} α_i φ(H_iDH_i') = Σ_{i≤ℓ} α_i φ(D) = φ(D),

by monotonicity, concavity, and H-invariance of φ.

The simplest case is the trivial group H = {I_s}. Then part (a) refers to all information functions and part (b) to those that are linear, while parts (c) and (d) both reduce to the Loewner ordering, C ≥ D.

However, for nontrivial compact subgroups H ⊆ Orth(s), parts (c) and (d) concentrate on distinct aspects. Part (c) says that a simultaneous comparison relative to all H-invariant information functions is equivalently achieved by restricting the Loewner ordering ≥ to the H-invariant subspace Sym(s,H). In contrast, part (d) augments the Loewner ordering ≥ with another order relation, matrix majorization. The resulting new order relation, the Kiefer ordering, is the topic of the next chapter.
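For the sign-change group part (c) is particularly concrete: D̄ is the diagonal part of D, so the simultaneous comparison in (a) reduces to a single nonnegative-definiteness check. A small sketch of my own (the numbers are arbitrary, not from the text):

```python
import numpy as np

s = 3
C = np.diag([2.0, 2.0, 2.0])                 # Sign(s)-invariant, i.e. diagonal
D = np.array([[1.5, 0.9, 0.0],
              [0.9, 1.8, 0.4],
              [0.0, 0.4, 1.0]])

D_bar = np.diag(np.diag(D))                  # balancing over Sign(s) keeps the diagonal
# Part (c): C >= D_bar in the Loewner ordering
c_holds = np.all(np.linalg.eigvalsh(C - D_bar) >= -1e-12)
print(c_holds)   # True here, so phi(C) >= phi(D) for every Sign(s)-invariant phi
```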

EXERCISES

13.1 In an m-way dth-degree polynomial fit model, let R(t) = At + b be a bijective affine transformation of T. Show that there exists a matrix Q ∈ GL(k) which, for all designs τ ∈ T, satisfies M_d(τ^R) = QM_d(τ)Q' [Heiligers (1988), p. 82].

13.2 Show that a Q-invariant design problem for K'θ has a Q-invariant feasibility cone, QA(K)Q' = A(K) for all Q ∈ Q.

13.3 Show that the information function φ(C) = min_{i≤s} c_ii is invariant under permutations and sign changes, but not under all orthogonal transformations [Kiefer (1960), p. 383].

13.4 For a subgroup H ⊆ GL(s), define the transposed subgroup H' = {H' : H ∈ H} ⊆ GL(s). Show that an information function φ is H-invariant if and only if the polar function φ^∞ is H'-invariant. Find some subgroups with H = H'.


C H A P T E R 14

Kiefer Optimality

A powerful concept is the Kiefer ordering of moment matrices, combining an improvement in the Loewner ordering with increasing symmetry relative to the group involved. This is illustrated with two-way classification models, by establishing Kiefer optimality for the centered treatment contrasts, as well as for a maximal parameter system, of the uniform design, and of balanced incomplete block designs. As another example, uniform vertex designs in a first-degree model over the multidimensional cube are shown to be Kiefer optimal.

14.1. MATRIX MAJORIZATION

Suppose we are given a matrix group H ⊆ GL(s). For any two matrices C, D ∈ Sym(s), we say that C is majorized by D, denoted by C ≺ D, when C lies in the convex hull of the orbit of D under the congruence action of the group H,

(1)    C ∈ conv{HDH' : H ∈ H}.

This terminology is somewhat negligent of the group H, and of the fact that it acts on the underlying space Sym(s) by congruence. Both are essential and must be understood from the context.

Other specifications reproduce the vector majorization of Section 6.9. There the underlying space is R^k on which the permutation group Perm(k) acts by left multiplication, x ↦ Qx. In this setting majorization means, for any two vectors x, y ∈ R^k,

x ≺ y  ⟺  x ∈ conv{Qy : Q ∈ Perm(k)}.

In other words, we have x = Sy, for a matrix S = Σ_{Q∈Perm(k)} α_Q Q which is a convex combination of permutation matrices. Any such matrix S is doubly stochastic, and vice versa. This is the content of the Birkhoff theorem (which we have circumvented in Section 6.9). Therefore vector majorization and matrix majorization are close relatives of each other, except for referring to distinct groups and different actions.

Matrix majorization ≺ as in (1) defines a preordering on Sym(s), in that it is a reflexive and transitive relation. Indeed, reflexivity C ≺ C is evident. Transitivity follows because, if C₁ ≺ C₂ and C₂ ≺ C₃, then C₁ is a convex combination of matrices HGC₃G'H' with H, G ∈ H, and each product HG stays in H because of the group property.

For an orthogonal subgroup H ⊆ Orth(s), the preordering ≺ on Sym(s) is antisymmetric modulo H,

C ≺ D and D ≺ C  ⟹  C = HDH' for some H ∈ H.

This follows from the strict convexity and the orthogonal invariance of the matrix norm ||C|| = (trace C²)^{1/2}. If we have C = Σ_{i≤ℓ} α_i H_iDH_i' ≺ D, then we get

(2)    ||C|| ≤ Σ_{i≤ℓ} α_i ||H_iDH_i'|| = ||D||.

Hence C ≺ D ≺ C entails equality in (2), and strict convexity of the norm forces C = H_iDH_i' for some i. In any case, only subgroups of the group of unimodular transformations are of interest, H ⊆ Unim(s), by Lemma 13.8. Then antisymmetry modulo H prevails provided we restrict the preordering ≺ to the open cone PD(s). This follows from replacing the matrix norm || · || by the determinant criterion φ₀ and reversing the inequality in (2), since φ₀ is strictly concave on PD(s) besides being unimodularly invariant.

For any two matrices C, D ∈ NND(s), matrix majorization C ≺ D relative to a group H implies an improvement in terms of every H-invariant information function φ,

φ(C) ≥ φ(D).

Unfortunately the terminology that C is majorized by D is not indicative of C being an improvement over D. For the purposes of the design of experiments, it would be more telling to call C more balanced than D and to use the reversed symbol ≻, but we refrain from this new terminology. The suggestive orientation comes to bear as soon as we combine matrix majorization with the Loewner ordering, to generate the Kiefer ordering.
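For a finite group, whether C ≺ D holds is a linear feasibility question: C must be a convex combination of the finitely many congruence images HDH'. Here is a sketch of such a check (my own construction, using scipy's linear programming routine; the group is restricted to Perm(s) only for brevity).

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def majorized(C, D, group):
    """Check C in conv{H D H' : H in group} by solving a feasibility LP."""
    orbit = [H @ D @ H.T for H in group]
    # Unknowns: weights alpha_i >= 0 with sum 1 and sum_i alpha_i * orbit_i == C
    A_eq = np.column_stack([M.ravel() for M in orbit])
    A_eq = np.vstack([A_eq, np.ones(len(orbit))])
    b_eq = np.concatenate([C.ravel(), [1.0]])
    res = linprog(c=np.zeros(len(orbit)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * len(orbit), method="highs")
    return res.success

s = 3
group = [np.eye(s)[list(p)] for p in itertools.permutations(range(s))]
D = np.diag([3.0, 1.0, 0.0])
C = np.eye(s) * (4.0 / 3)            # the balanced (completely symmetric) version of D
print(majorized(C, D, group))        # True
```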


14.2. THE KIEFER ORDERING OF SYMMETRIC MATRICES

All of the earlier chapters stress the importance of the Loewner ordering ≥ of symmetric matrices, for the design of experiments. The previous section has shown that matrix majorization ≺ provides another meaningful comparison, in the presence of a group H ⊆ GL(s) acting by congruence on Sym(s).

The two order relations capture complementary properties, at least for orthogonal subgroups H ⊆ Orth(s). In this case C ≥ D and C ≺ D imply C = D, for then the matrix C − D is nonnegative definite with trace(C − D) = trace(Σ_{i} α_i H_iDH_i') − trace D = 0. The Loewner ordering C ≥ D fails to pick up the improvement that is captured by matrix majorization, C ≺ D.

Part (d) of Theorem 13.12 suggests a way of merging the Loewner ordering and matrix majorization into a single relation. Given two matrices C, D ∈ Sym(s), we call C more informative than D and write C ≫ D when C is better in the Loewner ordering than some intermediate matrix E which is majorized by D,

(1)    C ≫ D  ⟺  C ≥ E ≺ D    for some E ∈ Sym(s).

We call the relation ≫ the Kiefer ordering on Sym(s) relative to the group H. The notation ≫ is reminiscent of the fact that two stages C ≥ E ≺ D enter into definition (1), the Loewner improvement ≥, and the improved balancedness as expressed by the majorization relation ≺ (see Exhibit 14.1).

Exhibit 14.1 utilizes the isometry

from the cone NND(2) to the ice-cream cone from Section 2.5. The group H consists of permutations and sign-changes,

Then the matrix


EXHIBIT 14.1 The Kiefer ordering. The Kiefer ordering is a two-stage ordering, C ≫ D ⟺ C ≥ E ≺ D for some E, combining a Loewner improvement C ≥ E with matrix majorization E ∈ conv{HDH' : H ∈ H}, relative to a group H acting by congruence on the matrix space Sym(s).

travels through the orbit


These matrices together with D generate the convex hull, a quadrilateral, which contains the points E that are majorized by D. Any matrix C ≥ E then improves upon D in the Kiefer ordering, C ≫ D.

The Kiefer ordering on Sym(s) is reflexive, C ≫ C. It is also transitive,

C ≫ D and D ≫ F  ⟹  C ≫ F.

If we have C ≥ E₁ = Σ_{i} α_i H_iDH_i' ≺ D and D ≥ E₂ ≺ F, then we may choose E = Σ_{i} α_i H_iE₂H_i' ≺ F to obtain C ≥ E₁ ≥ E ≺ F; that is, ≫ is a preordering.

The Kiefer ordering is antisymmetric modulo H, under the same provisos that are required for matrix majorization. First we consider an orthogonal subgroup H ⊆ Orth(s). Then C ≫ D and D ≫ C entail

trace C ≥ trace E = trace Σ_{i} α_i H_iDH_i' = trace D ≥ trace C.

Equality holds throughout since the final sum has the same trace as C; hence C equals E. Invoking strict convexity of the norm || · || as in (2) of Section 14.1, we obtain C = H_iDH_i' for some i.

Secondly we admit larger groups H ⊆ Unim(s), but restrict the antisymmetry property to the open cone PD(s). Here it suffices to appeal to the determinant criterion φ₀, since for C ≫ D monotonicity, concavity, and unimodular invariance of φ₀ imply

φ₀(C) ≥ φ₀(E) ≥ Σ_{i} α_i φ₀(H_iDH_i') = φ₀(D).

In the case of C ≫ D ≫ C, we get equality throughout. Then strict concavity of φ₀ on PD(s) establishes antisymmetry modulo H, of the relation ≫ on the cone PD(s).

The latter argument involves a monotonicity property of φ₀ relative to the Kiefer ordering which actually extends to every H-invariant information function. Also the information matrix mapping C_K turns out to be monotonic if we equip the space Sym(k) with its Kiefer ordering ≫ relative to the underlying group Q ⊆ GL(k),

A ≫ B  ⟹  C_K(A) ≫ C_K(B)    for all A, B ∈ Sym(k).

We leave the notation the same, even though something like A ≫_Q B and C ≫_H D would better highlight the particulars of the Kiefer ordering on Sym(k) relative to the group Q, and on Sym(s) relative to the group H.

The following theorem details the implications between these order relations, and thus expands on the implication from part (d) to part (a) of Theorem 13.12. The underlying assumptions are formulated with a view towards invariant design problems as outlined in Section 13.6.

14.3. MONOTONIC MATRIX FUNCTIONS

Theorem. Let the k × s matrix K be of rank s. Assume that the transformations in the matrix group Q ⊆ GL(k) leave the range of K invariant, Q(range K) = range K for all Q ∈ Q, with equivariance group H ⊆ GL(s). Then the information matrix mapping C_K is isotonic relative to the Kiefer orderings on Sym(k) and on Sym(s), as is every H-invariant information function φ on NND(s):

A ≫ B  ⟹  C_K(A) ≫ C_K(B)    for all A, B ∈ NND(k),

and

A ≫ B  ⟹  φ(C_K(A)) ≥ φ(C_K(B)).

Proof. The mapping C_K is isotonic relative to the Loewner ordering and matrix concave, by Theorem 3.13. Hence A ≥ E = Σ_{i} α_i Q_iBQ_i' ≺ B implies

C_K(A) ≥ C_K(E) ≥ Σ_{i} α_i C_K(Q_iBQ_i').

Equivariance yields C_K(Q_iBQ_i') = H_iC_K(B)H_i', say, by Lemma 13.5. Together we get

C_K(A) ≥ Σ_{i} α_i H_iC_K(B)H_i' ∈ conv{HC_K(B)H' : H ∈ H},

that is, C_K(A) ≫ C_K(B). Similarly we obtain φ(C_K(A)) ≥ Σ_{i} α_i φ(H_iC_K(B)H_i'). In the presence of H-invariance of φ the last sum becomes φ(C_K(B)).

For the trivial group, H = {I_s}, the Kiefer ordering ≫ coincides with the Loewner ordering ≥, and Kiefer optimality is the same as Loewner optimality (compare Section 4.4). Otherwise the two orderings are distinct, with the Kiefer ordering having the potential of comparing more pairs C, D ∈ Sym(s) than the Loewner ordering, C ≥ D ⟹ C ≫ D. Therefore there is a greater chance of finding a matrix C that is optimal relative to the Kiefer ordering ≫.

14.4. KIEFER OPTIMALITY

Let H be a subgroup of nonsingular s × s matrices. No assumption is placed on the set M ⊆ NND(k) of competing moment matrices.

DEFINITION. A moment matrix M ∈ M is called Kiefer optimal for K'θ in M relative to the group H ⊆ GL(s) when the information matrix C_K(M) is H-invariant and satisfies

C_K(M) ≫ C_K(A)    for all A ∈ M,

where ≫ is the Kiefer ordering on Sym(s) relative to the group H.

Given a subclass of designs, a design in the subclass is called Kiefer optimal for K'θ in that subclass when its moment matrix is Kiefer optimal for K'θ in the corresponding set of moment matrices.

We can now be more specific about the two distinct ways to discuss optimality in invariant design problems, as announced earlier in Section 13.6. There we built on simultaneous optimality under all invariant information functions, now we refer to the Kiefer ordering. In all practical applications the equivariance group is a closed subgroup of orthogonal matrices, H ⊆ Orth(s). It is an immediate consequence of Theorem 13.12 that a moment matrix M ∈ M is then Kiefer optimal for K'θ in M if and only if, for all H-invariant information functions φ, the matrix M is φ-optimal for K'θ in M. Therefore the two different avenues towards optimality in invariant design problems lead to the same goal, as anticipated in Section 13.6.

An effective method of finding Kiefer optimal moment matrices rests on part (c) of Theorem 13.12, in achieving a smaller problem dimensionality by a transition from the space of symmetric matrices to the subspace of invariant symmetric matrices. However, for invariant design problems, we have meanwhile created more than a single invariance concept: H-invariance of information matrices in Sym(s), and Q-invariance of moment matrices in Sym(k); we may also consider R-invariance of designs in T. It is a manifestation of the coherence of the approach that invariance is handed down from one level to the next, as we work through the design problem.

14.5. HERITABILITY OF INVARIANCE

The full ramifications of an invariant design problem have a group R acting on the experimental domain T (Section 13.2), a matrix group Q acting by left multiplication on the regression range X (Section 13.3) as well as acting by congruence on the set of moment matrices M (Section 13.4), and a matrix group H acting by congruence on the set of information matrices C_K(M). The assumptions are such that the regression function f is R-Q-equivariant (Section 13.2), and that the information matrix mapping C_K is Q-H-equivariant (Section 13.5).

On the design level, a transformation R ∈ R "rotates" a design τ ∈ T into the design τ^R defined by


A design τ ∈ T is called R-invariant when it satisfies τ = τ^R for all R ∈ R. We claim that every R-invariant design τ has a Q-invariant moment matrix M(τ),

(1)    QM(τ)Q' = M(τ)    for all Q ∈ Q.

Indeed, since Q is the homomorphic image of R that makes f equivariant, every matrix Q ∈ Q originates from a transformation R ∈ R such that Qf(t) = f(R(t)) for all t ∈ T. We obtain

QM(τ)Q' = ∫_T f(R(t))f(R(t))' dτ = M(τ^R).

For an R-invariant design τ, the latter is equal to M(τ), and this proves (1).

The converse implication in (1) is generally false. For instance, in the trigonometric fit model of Section 9.16, the experimental domain T = [0;2π) is invariant under rotations by a fixed angle r. That is, the group is R = [0;2π) and the action is addition modulo 2π, (r,t) ↦ r + t. The sin-cos addition theorem gives

Hence we get f(r + t) = Qf(t), where the (2d + 1) × (2d + 1) matrix Q is block diagonal, with top left entry 1 followed by the 2 × 2 rotation matrices S_a, for a = 1,...,d. In this setting, the equispaced support designs from Section 9.16 fail to be R-invariant, while their common moment matrix

is evidently Q-invariant. In fact, the only R-invariant probability measure is proportional to Lebesgue measure on the circle. Since in our terminology of Section 1.24 a design requires a finite support, we can make the stronger statement that no design is R-invariant, in this example.

On the moment matrix level, we claim that every Q-invariant matrix A ∈ Sym(k,Q) leads to an H-invariant information matrix C_K(A) ∈ Sym(s,H),

(2)    HC_K(A)H' = C_K(A)    for all H ∈ H.

Again we appeal to the fact that H is a homomorphic image of Q which makes C_K equivariant, by Lemma 13.5. Hence if H ∈ H stems from Q ∈ Q, then equivariance yields HC_K(A)H' = C_K(QAQ'). For a Q-invariant matrix A the latter is equal to C_K(A), and this proves (2).

The converse implication in (2) is generally false. For instance, in the parabola fit model of Section 13.5, the equivariance group becomes trivial if the parameters K'θ = (θ₀, θ₂)' are of interest, H = {I₂}. Hence all information matrices for K'θ are H-invariant. On the other hand, Q-invariance of the moment matrix requires the odd moments to vanish. Hence for a nonsymmetric design τ on [−1;1], the moment matrix M(τ) is not Q-invariant, but the information matrix C_K(M(τ)) is H-invariant, in this example.

In summary, this establishes a persuasive heritability chain,

τ is R-invariant  ⟹  M(τ) is Q-invariant  ⟹  C_K(M(τ)) is H-invariant,

of which neither implication can generally be reversed.

Heritability of invariance has repercussions on Kiefer optimality, as follows. The definition of Kiefer optimality refers to the subgroup H ⊆ GL(s) only. Usually H is the equivariance group that arises from the Q-invariance of the design problem for K'θ in M. If we add our prevalent earlier assumption that the set M is compact and convex, then it suffices to seek the optimum in the subset of Q-invariant moment matrices.

14.6. KIEFER OPTIMALITY AND INVARIANT LOEWNER OPTIMALITY

Theorem. Assume that Q ⊆ Orth(k) is a closed subgroup of orthogonal matrices such that the design problem for K'θ in M is Q-invariant, with equivariance group H ⊆ GL(s). Let the set M ⊆ NND(k) of competing moment matrices be compact and convex. Then for every Q-invariant moment matrix M ∈ M ∩ Sym(k,Q) the following statements are equivalent:

a. (Kiefer optimality) M is Kiefer optimal for K'θ in M.

b. (Invariant Loewner optimality) M is Loewner optimal for K'θ in M ∩ Sym(k,Q).

Proof. The balancing operator maps M into M, that is, Ā = ∫_Q QAQ' dQ ∈ M for all A ∈ M, because of convexity and closedness of M. The image of M under the balancing operator is the set of invariant moment matrices, M̄ = M ∩ Sym(k,Q).

First we assume (a). For a matrix A ∈ M ∩ Sym(k,Q), Kiefer optimality of M in M yields C_K(M) ≫ C_K(A), that is, C_K(M) ≥ E ≺ C_K(A) for some E ∈ Sym(s). But Q-invariance of A entails H-invariance of C_K(A), so the orbit of C_K(A) reduces to the single point C_K(A). Hence we get C_K(M) ≥ E = C_K(A), that is, Loewner optimality of M for K'θ in M ∩ Sym(k,Q).

Now we assume (b). Given a matrix A ∈ M, we use the Caratheodory theorem to represent the average Ā ∈ conv{QAQ' : Q ∈ Q} as a finite convex combination, Ā = Σ_{i} α_i Q_iAQ_i'. Concavity of C_K gives C_K(Ā) ≥ Σ_{i} α_i C_K(Q_iAQ_i') = E, say, where E = Σ_{i} α_i H_iC_K(A)H_i' lies in the convex hull of the orbit of C_K(A) under H. Loewner optimality of M in M ∩ Sym(k,Q) supplies C_K(M) ≥ C_K(Ā). Altogether we get C_K(M) ≥ E ≺ C_K(A), that is, Kiefer optimality of M for K'θ in M.


The theorem has a simple analogue for φ-optimality, assuming that φ is an H-invariant information function on NND(s).

14.7. OPTIMALITY UNDER INVARIANT INFORMATION FUNCTIONS

Theorem. Under the assumptions of Theorem 14.6, the following statements are equivalent for every Q-invariant moment matrix M ∈ M ∩ Sym(k,Q) and every H-invariant information function φ on NND(s):

a. (All moment matrices) M is φ-optimal for K'θ in M.

b. (Invariant moment matrices) M is φ-optimal for K'θ in M ∩ Sym(k,Q).

Proof. The direct part is immediate. The converse exploits the monotonicity of φ from Section 14.3: for all A ∈ M,

φ(C_K(A)) ≤ φ(C_K(Ā)) ≤ φ(C_K(M)),

since Ā lies in M ∩ Sym(k,Q) and M(Ā) is an improvement of A in the Kiefer ordering.

The point that comes to bear is that the set of invariant moment matrices, M̄ = M ∩ Sym(k,Q), lies in the space Sym(k,Q) which usually is of much lower dimension than Sym(k). The reduction in dimensionality is demonstrated in Section 13.10, where the invariant matrices form subspaces of dimension as small as k, 2, or 1, as opposed to the dimension k(k+1)/2 of Sym(k) itself.

The dimensionality reduction by invariance also has repercussions on the General Equivalence Theorem. In Chapter 7 we place the optimal design problem in the "large" Euclidean space Sym(k), with inner product ⟨A,B⟩ = trace AB, and with dual space Sym(k). For invariant design problems, the present theorem embeds the optimization problem in the "small" Euclidean space Sym(k,Q) equipped with the restriction of the inner product ⟨·,·⟩, and with dual space Sym(k,Q') where Q' consists of the transposed matrices from Q. Therefore all the tools from duality theory (polar information functions, subgradients, normal vectors, and the like) carry a dual invariance structure of reduced dimensionality.

The greater achievement, however, is the concept of Kiefer optimality. In terms of simultaneous optimality relative to classes of information functions, it mediates between all information functions and the determinant criterion, Φ ⊇ Φ(H) ⊇ {δφ₀ : δ > 0}, reflecting the inclusions {I_s} ⊆ H ⊆ Unim(s). More important, it gives rise to the Kiefer ordering ≫ which focuses on the comparison of any two moment matrices, including nonoptimal ones, rather than being overwhelmed by a narrow desire to achieve optimality.


If a noninvariant moment matrix M ∈ M \ M̄ is Kiefer optimal then so is its projection M̄ ∈ M̄, because M̄ ≫ M is an improvement in the Kiefer ordering. On the other hand, the potential of noninvariant moment matrices and designs to be Kiefer optimal is a source of great economy. Invariant designs tend to evenly spread out the weights over many, if not all, support points. A noninvariant design that has an invariant information matrix offers the chance to reduce the support size while maintaining optimality. This is precisely what balanced incomplete block designs are about, see Section 14.9.

14.8. KIEFER OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS

In two-way classification models, the concept of Kiefer optimality unifies and extends the results on Loewner optimality from Section 4.8, and on rank deficient matrix mean optimality from Section 8.19.

The experimental domain T = {1,...,a} × {1,...,b} comprises all treatment-block combinations (i,j) (see Section 1.5). The labeling of the treatments and the labeling of the blocks should not influence the design and its analysis. Hence we are aiming at invariance relative to the transformations (i,j) ↦ (ρ(i),σ(j)), where ρ is a permutation of the rows {1,...,a} and σ is a permutation of the columns {1,...,b}. In a more descriptive terminology, ρ represents a relabeling of treatments and σ a relabeling of blocks.

The regression function f maps (i,j) into the vector

f(i,j) = (e_i', d_j')',

where e_i is the i-th Euclidean unit vector in R^a and d_j is the j-th Euclidean unit vector in R^b. With the permutation matrices R ∈ Perm(a) and S ∈ Perm(b) from Section 6.8 that represent ρ and σ, we find

f(ρ(i),σ(j)) = ((Re_i)', (Sd_j)')'.

Therefore the regression function f is equivariant under the treatment-block relabeling group Q which is defined by

Q = {diag(R,S) : R ∈ Perm(a), S ∈ Perm(b)}.

With the conventions of Section 1.27, a design on T is an a × b block design W, with row sum vector r = W1_b and column sum vector s = W'1_a. The


congruence action on the moment matrices

M(W) = ( Δ_r   W
         W'    Δ_s )

turns out to be

QM(W)Q' = M(RWS')    for Q = diag(R,S) ∈ Q.

(Evidently there is also an action on the block designs W themselves, namely left multiplication by R and right multiplication by S'.)

In this setting we discuss the design problem (I) for the centered treatment contrasts in the class T(r) of designs with given row sum vector r, (II) for the same parameters in the full class T, and (III) for a maximal parameter system in T. Problem I is invariant under the block relabeling subgroup of Q; problems II and III are invariant under the group Q itself.

I. The results on Loewner optimality in Section 4.8 pertain to the set T(r) of block designs with a fixed positive row sum vector r. The set M = {M(W) : W ∈ T(r)} of competing moment matrices is compact and convex, and invariant under each transformation in the block relabeling group

{diag(I_a, S) : S ∈ Perm(b)} ⊆ Q.

The parameter system of interest is the centered treatment contrasts K'θ = K_aα, with coefficient matrix K = (K_a, 0)' and with centering matrix K_a = I_a − J_a as defined in Section 3.20. The transformations in the block relabeling group fulfill

diag(I_a, S) K = K    for all S ∈ Perm(b),

whence the equivariance group becomes trivial, H = {I_a}.

We may summarize as follows: The design problem for the centered treatment contrasts in the set T(r) of designs with fixed row sum vector r is invariant under relabeling of the blocks; the equivariance group is trivial, and the new concept of Kiefer optimality falls back on the old concept of Loewner optimality.

To find an optimal block design we may start from any weight matrix W ∈ T(r) and need only average it over the block relabeling group,

W̄ = (1/b!) Σ_{S∈Perm(b)} WS' = r1_b'/b.

We have used (1/b!) Σ_{S∈Perm(b)} Sd_j = 1_b/b, with Euclidean unit vectors d_j of R^b. This also yields the average of the moment matrix,

M(W)‾ = M(W̄) = M(r1_b'/b).

The improvement of the moment matrices entails an improvement of the contrast information matrices,

C_K(M(W̄)) ≥ C_K(M(W)),

by Theorem 14.3. In fact, since the equivariance group is trivial, we experience the remarkable constellation that pure matrix majorization of the moment matrices, M(W̄) ≺ M(W), translates into a pure Loewner comparison of the information matrices, C_K(M(W̄)) ≥ C_K(M(W)). Equality holds if and only if W is of the form rs', using the same argument as in Section 4.8.

Hence we have rederived the result from that section: The product designs rs' with arbitrary column sum vector s are the only Loewner optimal designs for the centered contrasts of factor A in the set T(r) of designs with row sum vector equal to r; the optimal contrast information matrix is Δ_r − rr'.

The invariance approach would seem to reveal more of the structure of the present design problem, whereas in Section 4.8 less theory was available and we had to rely on more technicalities.
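To make these objects concrete, here is a small numerical sketch of my own. It assumes the contrast information matrix of a block design W with row sums r and column sums s is C_K(M(W)) = Δ_r − WΔ_s⁻W' (the form used throughout this section), averages W over the block relabeling group, and compares the two contrast information matrices.

```python
import numpy as np

def contrast_info(W):
    """Contrast information matrix C_K(M(W)) = Delta_r - W Delta_s^- W'."""
    r = W.sum(axis=1)                      # row sums (treatment weights)
    s = W.sum(axis=0)                      # column sums (block weights)
    s_inv = np.where(s > 0, 1.0 / s, 0.0)  # generalized inverse of Delta_s
    return np.diag(r) - W @ np.diag(s_inv) @ W.T

a, b = 3, 4
W = np.array([[0.10, 0.15, 0.05, 0.00],   # an arbitrary block design, weights sum to 1
              [0.05, 0.05, 0.20, 0.10],
              [0.10, 0.00, 0.10, 0.10]])
r = W.sum(axis=1)

# Averaging W over all block relabelings yields the product design r 1_b'/b
W_bar = np.outer(r, np.ones(b)) / b

C, C_bar = contrast_info(W), contrast_info(W_bar)
# The product design attains the optimal contrast information Delta_r - r r'
assert np.allclose(C_bar, np.diag(r) - np.outer(r, r))
print(np.linalg.eigvalsh(C_bar - C))       # nonnegative: a Loewner improvement
```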

II. In Section 8.19, we refer optimality to the full set T rather than to some subset. Now we not only rederive that result, on simultaneous optimality relative to all rank deficient matrix means φ_p', but strengthen it to Kiefer optimality.

The set of competing moment matrices is maximal, M = M(T). It is invariant under each transformation in the treatment-block relabeling group Q introduced in the preamble to this section. For the centered treatment contrasts K'θ = K_aα, the transformations in Q fulfill

diag(R,S) K = KR    for all R ∈ Perm(a) and S ∈ Perm(b).

Hence here the equivariance group is the treatment relabeling group, H = Perm(a).

We may summarize as follows: The design problem for the centered treatment contrasts in the set of all designs is invariant under relabeling of treatments and relabeling of blocks; the equivariance group is the permutation group Perm(a), and Kiefer optimality is the same as simultaneous optimality relative to the permutationally invariant information functions on NND(a).

This class of criteria contains the rank deficient matrix means that appeared in Section 8.19, as well as many other criteria.


Centering of an arbitrary block design W ∈ T leads to the uniform design,

W̄ = 1_a1_b'/(ab).

The contrast information matrix becomes C_K(M(W̄)) = K_a/a. A weight matrix W achieves the contrast information matrix K_a/a if and only if it is an equireplicated product design, W = (1/a)1_as'.

This improves upon the first result of Section 8.19: The equireplicated product designs (1/a)1_as' with arbitrary column sum vector s are the unique Kiefer optimal designs for the centered treatment contrasts in the set T of all block designs; the optimal contrast information matrix is K_a/a.

III. In just the same fashion, we treat the maximal parameter system

relative to the full moment matrix set M = M(T) and the full treatment-block relabeling group Q. From

the equivariance group

is seen to be isomorphic to the treatment-block relabeling group Q. This generalizes the second result of Section 8.19: The uniform design 1_a1_b'/(ab) is the unique Kiefer optimal design for the maximal parameter system (1) in the set T of all block designs.

These examples have in common that the design problems are invariant on each of the levels discussed in Section 14.5. The balancing operation may be carried out on the top level of moment matrices, or even for the weight matrices.

This is not so when restrictions are placed on the support. Yet we profit from invariance by switching from the top-down approach to a bottom-up approach, backtracking from the anticipated "balanced solution" and exploiting invariance as long as is feasible.


14.9. BALANCED INCOMPLETE BLOCK DESIGNS

In a two-way classification model the best block design is the uniform design 1_a1_b'/(ab). It is optimal in the three situations of the previous section, and also in others. For it to be realizable, the sample size n must be a multiple of ab, that is, fairly large.

For small sample size, n < ab, a block design for sample size n necessarily has an incomplete support, that is, some of the treatment-block combinations (i,j) are not observed. Other than that, it seems persuasive to salvage as many properties of the uniform design as is possible. This is achieved by a balanced incomplete block design.

DEFINITION. An a × b weight matrix W is called a balanced incomplete block design when (a) W is an incomplete uniform design, that is, it has n < ab support points and assigns uniform weight 1/n to each of them, (b) W is balanced for the centered treatment contrasts, in the sense that its contrast information matrix is completely symmetric and nonzero, and (c) W is equiblocksized, W'1_a = (1/b)1_b.

With this definition a balanced incomplete block design W is in fact an a × b block design, W ∈ T, and it is realizable for sample size n = #supp W and multiples thereof. The limiting case n = ab would reproduce the uniform design 1_a1_b'/(ab), but is excluded by the incompleteness requirement (a).

For a balanced incomplete block design W, the incidence matrix N = nW has entries 0 or 1. It may be interpreted as the indicator function of the support of W, or as the frequency matrix of a balanced incomplete block design for sample size n. The focus on support set and sample size comes to bear better by quoting the weight matrix in the form N/n, with n = 1_a'N1_b.

The balancedness property (b) can be specified further. We have

C_K(N/n) = Δ_r − (b/n²) NN',

since (c) yields column sum vector s = (1/b)1_b, while (a) implies n_ij ∈ {0,1} and n_ij² = n_ij. From Section 13.10, complete symmetry means that C_K(N/n) is a linear combination of the averaging matrix J_a and the centering matrix K_a. We get

C_K(N/n) = ((n − b)/(n(a−1))) K_a,

since C_K(N/n) has row sums 0, whence the coefficient of J_a vanishes, 1_a'C1_a = 0, while the trace equals 1 − b/n. Therefore the balancedness requirement (b) is equivalently expressed through the formula C_K(N/n) = ((n − b)/(n(a−1)))K_a, with n > b.

In part I, we list more properties that the parameters of a balanced incomplete block design necessarily satisfy. Then we establish (II) Kiefer optimality for the centered treatment contrasts, and (III) matrix mean optimality for a maximal parameter system.

I. Whether a balanced incomplete block design exists is a combinatorial problem, of (a) arranging n ones and ab − n zeros into an a × b incidence matrix N such that N is (b) balanced and (c) equiblocksized. This implies a few necessary conditions on the parameters a, b, and n, as follows.

Claim. If an a × b balanced incomplete block design for sample size n exists, then its incidence matrix N has constant column sums as well as constant row sums, and the treatment concurrence matrix NN' fulfills

(1)    NN' = ((n/a) − λ) I_a + λa J_a,

with positive integers n/a, n/b, and

λ = (n/a)((n/b) − 1)/(a − 1) = n(n − b)/(ab(a − 1)).

This necessitates a ≤ b, and n ≥ a + b − 1.

Proof. That N has constant column sums follows from the equiblocksize property (c). The balancedness property (b) yields constant row sums r_i = 1/a of N/n, since the diagonal entries of C_K(N/n) = Δ_r − (b/n²)NN' are

r_i − (b/n²) Σ_{j≤b} n_ij² = r_i (1 − b/n)

and complete symmetry forces them to be constant, observing n > b. Thus (b) becomes

(1/a) I_a − (b/n²) NN' = ((n − b)/(n(a−1))) K_a,

from which we calculate NN' in (1). The off-diagonal element λ is the inner product of two rows of N, and hence counts how often two treatments concur in the same blocks. From n > b we get λ ≠ 0. Thus λ is a positive integer, as are the common treatment replication number n/a and the common blocksize n/b of N. The incompleteness condition n < ab secures n/a > λ,


whence NN' is positive definite. This implies a ≤ b, and

n − b = b((n/b) − 1) ≥ (n/a)((n/b) − 1) = λ(a − 1) ≥ a − 1.

The latter entails n ≥ a + b − 1. The proof is complete.
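The combinatorial conditions are easy to check for a concrete incidence matrix. The sketch below is my own illustration; it uses the 7 × 7 Fano-plane design (a = b = 7, block size 3, n = 21) and verifies the concurrence structure (1) as well as the complete symmetry of the contrast information matrix.

```python
import numpy as np

# Incidence matrix of the Fano plane: 7 treatments, 7 blocks of size 3
blocks = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
a, b = 7, 7
N = np.zeros((a, b))
for j, blk in enumerate(blocks):
    N[list(blk), j] = 1
n = int(N.sum())                                   # sample size, here 21

# Concurrence matrix NN' = (n/a - lambda) I + lambda 11'
NN = N @ N.T
lam = NN[0, 1]                                     # common concurrence number, here 1
assert np.allclose(NN, (n / a - lam) * np.eye(a) + lam * np.ones((a, a)))

# Contrast information matrix of the weight matrix W = N/n
W = N / n
r, s = W.sum(axis=1), W.sum(axis=0)
C = np.diag(r) - W @ np.diag(1.0 / s) @ W.T
K_a = np.eye(a) - np.ones((a, a)) / a
gamma = (n - b) / (n * (a - 1))                    # = 1/9 for the Fano plane
assert np.allclose(C, gamma * K_a)                 # completely symmetric, as required
```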

The optimality statements refer to the set T(N) of designs with support included in the support of N (and which are thus absolutely continuous relative to N/n),

T(N) = {W ∈ T : supp W ⊆ supp N}.

This set is convex and also closed, since the support condition is an inclusion and not an equality. The design set T(N) is not invariant under relabeling treatments or blocks, nor is the corresponding moment matrix set M(T(N)) invariant. Hence we can exploit invariance only on the lowest level, that of information matrices.

II. For the centered treatment contrasts K_aα, a balanced incomplete block design is Kiefer optimal relative to the treatment relabeling group H = Perm(a).

Claim. A balanced incomplete block design N/n is Kiefer optimal for the centered treatment contrasts in the set T(N) of designs for which the support is included in the support of N; their contrast information matrix is

C_K(N/n) = ((n − b)/(n(a−1))) K_a.

Proof. To see this, let W ∈ T(N) be a competing block design. Balancing the information matrix relative to Perm(a) entails matrix majorization, C_K(M(W))‾ ≺ C_K(M(W)), with

C_K(M(W))‾ = ((1 − Σ_{i,j} w_ij²/s_j)/(a−1)) K_a,

where s_j are the column sums of W. Hence the Loewner comparison of C_K(N/n) and C_K(M(W))‾ boils down to comparing traces.

Because of the support inclusion, any competitor W ∈ T(N) satisfies w_ij = n_ij w_ij. The Cauchy inequality now yields, for every block j,

Σ_{i≤a} w_ij² = Σ_{i≤a} n_ij w_ij² ≥ (Σ_{i≤a} n_ij w_ij)² / Σ_{i≤a} n_ij = s_j² b/n.

In terms of traces this means trace C_K(M(W)) = 1 − Σ_{i,j} w_ij²/s_j ≤ 1 − (b/n) Σ_{j≤b} s_j = 1 − b/n = trace C_K(N/n). Altogether we obtain C_K(N/n) ≥ C_K(M(W))‾ ≺ C_K(M(W)). Therefore N/n is Kiefer optimal for K_aα in T(N), and the proof is complete.

In retrospect, the proof has two steps. The first is symmetrization using the balancing operator, without asking whether C_K(M(W))‾ is an information matrix originating from some design in T or not. The second is maximization of the trace, which enters because the trace determines the projection onto the invariant subspace formed by the completely symmetric matrices. The trace is not primarily used as the optimality criterion. Nevertheless the situation could be paraphrased by saying that N/n is optimal under the trace criterion for K_aα in T(N). This is what we have alluded to in Section 6.5, that trace optimality has its place in the theory provided it is accompanied by some other strong property such as invariance.

For fixed support size n, any balanced incomplete block design Ñ/n, with a possibly different support than N/n, has the same contrast information matrix

C_K(Ñ/n) = ((n − b)/(n(a−1))) K_a = C_K(N/n).

Hence the optimal contrast information matrix is the same for the design sets T(Ñ) and T(N). As a common superset we define

T(n/b) = {W ∈ T : each column of W has at most n/b nonzero entries}.

This set comprises those block designs for which each block contains at most n/b support points. For competing designs from this set, W ∈ T(n/b), the Cauchy inequality continues to yield

Σ_{i≤a} w_ij² ≥ s_j² b/n    for every block j.

Thus optimality of any balanced incomplete block design N/n extends from the support restricted set T(N) to the larger, parameter dependent set T(n/b). This set may well fail to be convex.

Another example of a nonconvex set over which optimality extends is the discrete set T_{n,b} of those block designs for sample size n for which all b blocksizes are positive. This is established by a direct argument. If Ñ ∈ T_{n,b} assigns frequency ñ_ij ∈ {0,1,2,...} to treatment-block combination (i,j),

then ñ_ij² ≥ ñ_ij yields the estimate

trace C_K(Ñ/n) = 1 − Σ_{j≤b} Σ_{i≤a} ñ_ij²/(n ñ_{·j}) ≤ 1 − b/n,

with equality if and only if all frequencies ñ_ij are 0 or 1. Hence the incidence matrix N of a balanced incomplete block design is Kiefer optimal for K_aα in the discrete set T_{n,b}.

EXHIBIT 14.2 Some 3 × 6 block designs for n = 12 observations. The equireplicated product design N₁/n is Kiefer optimal in T. The balanced incomplete block design N₂/n is Kiefer optimal in T(n/b), and in T_{12,6}. The designs N₃/n ∈ T(n/b) and N₄ ∈ T_{12,6} perform just as well, as does N₅.

Thus we may summarize as follows: An a × b balanced incomplete block design for sample size n is Kiefer optimal relative to the treatment permutation group Perm(a) for the centered treatment contrasts, in the set T(n/b) of those block designs for which each block contains at most n/b support points, as well as in the set T_{n,b}/n where T_{n,b} are the designs for sample size n with at least one observation in each of the b blocks.

Exhibit 14.2 illustrates that the dominating role that balanced incomplete block designs play in the discrete theory is not supported by their optimality properties for the centered treatment contrasts. Design N₁ says that it is better not to make observations in more than one block. This is plainly intelligible but somewhat beside the point. Two-way classification models are employed because it is thought that blocking cannot be avoided. Hence the class T_{n,b} with all b blocksizes positive is of much greater practical interest. But even then there are designs such as N₄ which fail to be balanced incomplete block designs but perform just as well.

III. A balanced incomplete block design is distinguished by its optimality properties for an appropriate maximal parameter system. In Section 14.8 (III) we studied a maximal parameter set which treats the treatment effects α and the block effects β in an entirely symmetric fashion, E[Y_ij] = (ᾱ + β̄) + (α_i − ᾱ) + (β_j − β̄).

However, a balanced incomplete block design concentrates on treatment contrasts and handles the set of all parameters in an unsymmetric way,

(2)    K'θ = ((α_i − ᾱ)_{i≤a}, (β_j + ᾱ)_{j≤b})'.

In other words, expected yield decomposes according to E[Y_ij] = (α_i − ᾱ) + (β_j + ᾱ). This system yields K = MG, where M is the moment matrix of a balanced incomplete block design N/n, while the specific generalized inverse G of M is given by

In particular, M is feasible for K'θ since MGK = MGMG = MG = K shows that the range of M includes the range of K. The dispersion matrix D = K'GK becomes D = GMGMG = GMG = G, as M is readily verified to be a generalized inverse of G. Optimality then holds with respect to the rank deficient matrix means φ_p' of Section 8.18.

Claim. A balanced incomplete block design N/n is the unique φ_p'-optimal design for the maximal parameter system (2) in the set T(N) of designs for which the support is included in the support of N, for every p ∈ [−∞;1].

Proof. The proof does not use invariance directly since we do not have a group acting appropriately on the problem. Yet the techniques of Lemma 13.10 prove useful. In the present setting the normality inequality of Section 8.18 becomes

in case p ∈ (−∞;0). For p ∈ (0;1], we replace G^{−p} by (G⁺)^p. The task is one of handling the powers G^{−p} and (G⁺)^p.

To this end we introduce the four-dimensional matrix space L ⊆ Sym(a+b)


generated by the symmetric matrices

The space L contains the squares (V_i − V_j)², for all i,j = 1,2,3,4. Hence L is a quadratic subspace of symmetric matrices, that is, for any member C ∈ L also the powers C², C³,... lie in L. Arguing as in the proof of part (c) of Lemma 13.10, we find that L also contains the Moore–Penrose inverse C⁺, as well as C^p and (C⁺)^p for C ∈ L with C ≥ 0. This applies to G, which is a nonnegative definite member of L. For p ∈ (−∞;0) therefore G^{−p} is a linear combination of the form

(4)    G^{−p} = α_p V₁ + β_p V₂ + γ_p V₃ + δ_p V₄,

for some α_p, β_p, γ_p, δ_p ∈ R. For p ∈ (0;1] the same is true of (G⁺)^p.

Now we insert (4) into (3), and use w_ij = n_ij w_ij for W ∈ T(N). Some computation proves both sides in (3) to be equal. This establishes φ_p'-optimality for p ∈ (−∞;0) ∪ (0;1], and Lemma 8.15 extends optimality to p = −∞, 0. Uniqueness follows from Corollary 8.14, since in M(W)G = K, the left bottom block yields W = (b/n)NΔ_s while the right bottom block necessitates bΔ_s = I_b. Together this yields W = N/n, thus completing the proof.

Results of a similar nature hold for multiway classification models, using essentially the same tools. Instead we find it instructive to discuss an example of a different type.

14.10. OPTIMAL DESIGNS FOR A LINEAR FIT OVER THE UNIT CUBE

A reduction by invariance was tacitly executed in Section 8.6 where we computed φ_p-optimal designs for a linear fit over the unit square X = [0;1]^k, with k = 2. We now tackle the general dimension k, as an invariant design problem. We start out from a k-way first-degree model without a constant term,

E[Y_i] = x_i'θ = θ₁x_{i1} + ··· + θ_kx_{ik},    for i = 1,...,n,

also called a multiple linear regression model. The n regression vectors x_i are assumed to lie in the unit cube X = [0;1]^k. From Theorem 8.5, we concentrate on the extreme points of the Elfving set R = conv(X ∪ (−X)). We define x ∈ [0;1]^k to be a j-vertex when j entries of x are 1 while the other k−j entries are 0, for j = 0,...,k. The set of extreme points of R comprises those j-vertices with j ≥ ⌈(k−1)/2⌉.

EXHIBIT 14.3 Uniform vertex designs. A j-vertex has j entries equal to 1 and k−j entries equal to 0, shown for k = 3 and j = 0,1,2,3. The j-vertex design ξ_j assigns uniform weight 1/(k choose j) to all j-vertices of [0;1]^k.

A further reduction exploits invariance. The regression range X is permutationally invariant, Qx ∈ [0;1]^k for all x ∈ [0;1]^k and Q ∈ Perm(k). Also the design problem for the full parameter vector θ in the set of all moment matrices M(Ξ) is permutationally invariant, QM(ξ)Q' ∈ M(Ξ) for all ξ ∈ Ξ and Q ∈ Perm(k). The j-vertex design ξ_j is defined to be the design which assigns uniform weight 1/(k choose j) to the (k choose j) j-vertices of the cube [0;1]^k. The j-vertex designs ξ_j for j = 0,1,2,3 in dimension k = 3 are shown in Exhibit 14.3.

The j-vertex design ξ_j has a completely symmetric moment matrix,

M(ξ_j) = (j(k−j)/(k(k−1))) I_k + (j(j−1)/(k(k−1))) 1_k1_k',

with diagonal entries j/k and off-diagonal entries j(j−1)/(k(k−1)). If k is even, then we have ⌈(k−1)/2⌉ = ⌊(k+1)/2⌋. If k is odd, then we get M(ξ_{⌈(k−1)/2⌉}) ≤ M(ξ_{⌊(k+1)/2⌋}). Hence we need to study the j-vertex designs ξ_j with j ≥ ⌊(k+1)/2⌋ only.
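A brute-force enumeration of the j-vertices confirms this moment matrix; the sketch below is my own illustration, and the closed form it checks is the one just stated.

```python
import itertools
import numpy as np

def vertex_design_moment(k, j):
    """Moment matrix of the j-vertex design: uniform weight on all j-vertices of [0,1]^k."""
    vertices = []
    for ones in itertools.combinations(range(k), j):
        x = np.zeros(k)
        x[list(ones)] = 1.0
        vertices.append(x)
    return sum(np.outer(x, x) for x in vertices) / len(vertices)

k, j = 5, 3
M = vertex_design_moment(k, j)

# Completely symmetric closed form: alpha*I + beta*11'
alpha = j * (k - j) / (k * (k - 1))
beta = j * (j - 1) / (k * (k - 1))
assert np.allclose(M, alpha * np.eye(k) + beta * np.ones((k, k)))

# For odd k, the (k+1)/2-vertex design dominates the (k-1)/2-vertex design (Loewner)
D = vertex_design_moment(k, (k + 1) // 2) - vertex_design_moment(k, (k - 1) // 2)
assert np.all(np.linalg.eigvalsh(D) >= -1e-12)
```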

It turns out that mixtures of neighboring j-vertex designs suffice. To this end we introduce a continuous one-parameter design family,

ξ_s = (1 − (s − j)) ξ_j + (s − j) ξ_{j+1}    for s ∈ [j; j+1],

with integer j running from ⌊(k+1)/2⌋ to k−1. The support parameter s varies through the closed interval from ⌊(k+1)/2⌋ to k. If s = j is an integer, then ξ_s is the j-vertex design. Otherwise the two integers j and j+1 closest to s specify the vertices supporting ξ_s, and the fractional part s − j determines the weighting of ξ_j and ξ_{j+1}. As s grows, the weights shift towards the furthest point, the k-vertex. We call the designs in the class

{ξ_s : s ∈ [⌊(k+1)/2⌋; k]}

the neighbor-vertex designs. For example, we have ξ_{j+0.11} = 0.89ξ_j + 0.11ξ_{j+1}, or ξ_{j+0.4} = 0.6ξ_j + 0.4ξ_{j+1}.

We show that the neighbor-vertex designs (I) form an essentially complete class under the Kiefer ordering relative to the permutation group Perm(k), (II) contain φ_p-optimal designs for θ in Ξ, and (III) have an excessive support size that is improved upon by other optimal designs such as balanced incomplete block designs.

I. Our first result pertains to the Kiefer ordering of moment matrices.

Claim. For every design η ∈ Ξ, there exists a neighbor-vertex design ξ_s ∈ Ξ so that M(ξ_s) is more informative, M(ξ_s) ≫ M(η), in the Kiefer ordering relative to the permutation group Perm(k).

Proof. Let η ∈ Ξ be any competing design. From Theorem 8.5, there exists a design ξ ∈ Ξ which has only j-vertices for its support points such that M(ξ) ≥ M(η). A transformation Q carries ξ into ξ^Q(x) = ξ(Q⁻¹x) (compare Section 14.5). The average design ξ̄ = Σ_{Q∈Perm(k)} ξ^Q/k! has moment matrix

M(ξ̄) = Σ_{Q∈Perm(k)} QM(ξ)Q'/k! ≺ M(ξ).

Being invariant and supported by j-vertices, the design ξ̄ is a mixture of j-vertex designs, ξ̄ = Σ_{j≤k} γ_j ξ_j, with min_j γ_j ≥ 0 and Σ_j γ_j = 1. Its moment matrix is

M(ξ̄) = Σ_{j≤k} γ_j M(ξ_j) = ᾱ 1_k1_k' + β̄ (k/(k−1)) K_k,

and

M(ξ_j) = α_j 1_k1_k' + β_j (k/(k−1)) K_k,    with α_j = (j/k)² and β_j = (j/k)(1 − j/k).

EXHIBIT 14.4 Admissible eigenvalues. In a linear fit model over [0;1]^k, the eigenvalues of invariant moment matrices correspond to the points (α, β) in a polytope with vertices on the curve x(t) = t², y(t) = t(1 − t). The points on the solid line are the admissible ones. Left: for k = 5; right: for k = 6.

The coefficients ᾱ and β̄ fulfill

(ᾱ, β̄) = Σ_{j≤k} γ_j (α_j, β_j) ∈ conv{(α_j, β_j) : j = 0,...,k},

where the convex hull is a polytope formed from the k + 1 points (α_j, β_j) which lie on the curve with coordinates x(t) = t² and y(t) = t(1 − t) for t ∈ [0;1] (see Exhibit 14.4).

The geometry of this curve visibly exhibits that we can enlarge (ᾱ, β̄) in the componentwise ordering to a point on the solid boundary. That is, for some j ∈ {⌊(k+1)/2⌋,...,k} and some δ ∈ [0;1] we have

(ᾱ, β̄) ≤ (1 − δ)(α_j, β_j) + δ(α_{j+1}, β_{j+1}).

With s = j + δ, the neighbor-vertex design ξ_s then fulfills M(ξ_s) ≥ M(ξ̄) ≥ Σ_{Q∈Perm(k)} QM(η)Q'/k! ≺ M(η). Thus we get M(ξ_s) ≫ M(η), and the proof is complete.

We may reinterpret the result in terms of moment matrices. Let M̄ be the set of moment matrices obtained from invariant designs that are supported by j-vertices, with j ≥ ⌊(k+1)/2⌋. Then M̄ consists of the completely symmetric matrices that are analysed in the above proof:

M̄ = { ᾱ 1_k1_k' + β̄ (k/(k−1)) K_k : (ᾱ, β̄) ∈ conv{(α_j, β_j) : j = ⌊(k+1)/2⌋,...,k} }.

That is, M̄ is a two-dimensional polytope in matrix space. The Loewner ordering of the matrices

ᾱ 1_k1_k' + β̄ (k/(k−1)) K_k

in M̄ coincides with the componentwise ordering of the coefficient pair (ᾱ, β̄). In this ordering, the maximal elements are those which lie on the solid boundary, on the far side from the origin. This leads to the moment matrices M(ξ_s) of the neighbor-vertex designs ξ_s, with s ∈ [⌊(k+1)/2⌋; k].

It follows from Theorem 14.3 that the neighbor-vertex designs also perform better under every information function φ on NND(k) that is permutationally invariant,

φ(M(ξ_s)) ≥ φ(M(η)).

Therefore some member of the one-parameter family of neighbor-vertex designs is φ-optimal for θ in Ξ.

II. For the matrix means φ_p with parameter p ∈ [−∞;1], the interval [−∞;1] is subdivided using two interlacing sequences a(j) and b(j), defined by

The φ_p-optimal support parameter turns out to be s_k(p) = j in case p ∈ [a(j); b(j)], and is otherwise given as follows.

Claim. The unique φ_p-optimal design for θ in Ξ is the j-vertex design ξ_j in case p ∈ [a(j); b(j)]. In case p ∈ (b(j); a(j+1)), it is the neighbor-vertex design ξ_{s_k(p)} with s_k(p) given by

Proof. The proof identifies the optimal support points by maximizing the quadratic form Q_p(x) = x'M^{p−1}x which appears as the left hand side in the normality inequality of Theorem 7.20. The evaluation simplifies drastically since we know that the optimality candidate M is completely symmetric. Furthermore Q_p(x) is convex in x ∈ [0;1]^k, whence it suffices to maximize over the j-vertices x only. With this, the left hand side of the normality inequality depends on x only through j,

h_p(j) = Q_p(x)    for an arbitrary j-vertex x,

say. For j-vertex designs and neighbor-vertex designs, we have β/(k−1) < α, while p − 1 < 0. Therefore the parabola h_p opens downwards, and attains its maximum over the entire real line at some point j_p ∈ R. Let j = ⌊j_p⌋ be the integer with j_p ∈ [j; j+1). The maximum of h_p over integer arguments is attained either at j, or else at j+1, or else at j and j+1 simultaneously.

In case j_p < j + 1/2 the integer maximum of h_p occurs at j. The φ_p-optimal support points are the j-vertices, and the j-vertex design ξ_j is φ_p-optimal. To see when this happens we notice that the double inequality h_p(j−1) ≤ h_p(j) ≥ h_p(j+1) is equivalent to p ∈ [a(j); b(j)]. In case j_p > j + 1/2, the φ_p-optimal design is ξ_{j+1}.

In case j_p = j + 1/2, the integer maximum of h_p occurs at j and at j+1. A neighbor-vertex design ξ_s is φ_p-optimal, with s ∈ [j; j+1]. This happens if and only if h_p(j) = h_p(j+1), or equivalently, s = s_k(p). The property s_k(p) ∈ (j; j+1) translates into p ∈ (b(j); a(j+1)). The proof is complete.

As a function of p, the support parameter s_k is continuous, with constant value j on [a(j); b(j)], and strictly increasing from j to j+1 on (b(j); a(j+1)).

The j-vertex design for j = ⌊(k+1)/2⌋ is φ_p-optimal over an unbounded interval p ∈ [−∞; c_k], with c_k = b(⌊(k+1)/2⌋). Some values of c_k are as follows.

k      3      4      5      6      7      8      9      10
c_k    0.34   -0.46  0.30   -0.21  0.28   -0.13  0.26   -0.09

For odd k the ½(k+1)-vertex design is φ_p-optimal for all p ∈ [−∞;0]. For even k the ½k-vertex design is φ_p-optimal for all p ∈ [−∞;−1/2]. For even k, the φ₀-optimal design ξ_{s_k(0)} turns out to place uniform weight on all of the ½k-vertices and (½k + 1)-vertices.

Thus for most parameters p ∈ [−∞;0], the φ_p-optimal design ξ_{s_k(p)} is supported by those vertices which have half of their entries equal to 1 and the other half equal to 0. This is also the limiting design for large dimension k,

To see this, we introduce j_k = ⌊s_k(p)⌋ ≤ k. We have s_k(p) ∈ [j_k; j_k + 1), or equivalently, p ∈ [a(j_k); a(j_k + 1)). It suffices to show that j_k/k tends to 1/2. From j_k ≥ ⌊½(k+1)⌋, we get liminf_{k→∞} j_k/k ≥ 1/2. The assumption limsup_{k→∞} j_k/k = a ∈ (1/2; 1] leads to

if necessary along a subsequence (k_m)_{m≥1}, and contradicts p < 1.

III. While the φ_p-optimal moment matrices are unique, the designs are not. In fact, the j-vertex design with j = ⌊(k+1)/2⌋ has a support size (k choose j) that grows exponentially with k, and violates the quadratic bound k(k+1)/2 from Section 8.3. We cannot reduce the number of candidate support points since equality in the normality inequality holds for all j-vertices. But we are not forced to use all of them, nor must we weigh them uniformly.

More economical designs exist, and may be obtained from balanced incomplete block designs. The relation is easily recognized once we contemplate the model matrix X that belongs to a j-vertex design ξ_j. Its ℓ = (k choose j) rows consist of the j-vertices x ∈ [0;1]^k, and X'X is completely symmetric. In other words, the transpose N = X' is the incidence matrix of a k × ℓ balanced incomplete block design for jℓ observations. It is readily checked that the identity

(1/b) NN' = M(ξ_j)

is valid for every b as long as N is a k × b balanced incomplete block design for sample size jb. Therefore we can replace the j-vertex design ξ_j by any design that places uniform weight on the columns of the incidence matrix of an arbitrary balanced incomplete block design for k treatments and blocksize j.
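A quick numerical check of this identity, reusing the Fano-plane incidence matrix from Section 14.9 (again a sketch of my own):

```python
import itertools
import math
import numpy as np

k, j = 7, 3
blocks = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
b = len(blocks)
N = np.zeros((k, b))
for col, blk in enumerate(blocks):
    N[list(blk), col] = 1

# Moment matrix of the j-vertex design, by enumerating all (k choose j) vertices
M = np.zeros((k, k))
for ones in itertools.combinations(range(k), j):
    x = np.zeros(k)
    x[list(ones)] = 1.0
    M += np.outer(x, x)
M /= math.comb(k, j)

# Uniform weight on the b = 7 columns of N reproduces the same moment matrix
assert np.allclose(N @ N.T / b, M)
```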


The example conveys some impression of the power of exploiting invariance considerations in a design problem, and of the labor to do so. Another class of models that submit themselves to a reduction by invariance are polynomial fit models, under the generic heading of rotatability.

EXERCISES

14.1 In a quadratic fit model over [—1; 1], show that C for

14.2 Show that the support of τ^R is R(supp τ), for all τ ∈ T and all transformations R of T.

14.3 Assume that the group R acting on T is finite, and induces the transformations Q ∈ Q ⊆ GL(k) such that Qf(t) = f(R(t)) for all t ∈ T. Show that a moment matrix M ∈ M is Q-invariant if and only if there exists an R-invariant design τ ∈ T with M = M(τ) [Gaffke (1987a), p. 946].

14.4 An a × b block design for sample size n may be balanced, binary, and equireplicated, without being equiblocksized [Rao (1958), p. 294].

14.5 An a × b block design for sample size n may be balanced and binary, without being equireplicated nor equiblocksized [John (1964), p. 899; Tyagi (1979), p. 335].

14.6 An a × b block design for sample size n may be balanced and equiblocksized, without being equireplicated nor binary [John (1964), p. 898].

14.7 Consider an m-way classification model with additive main effects and no interaction, treatments i = 1,...,a and blocking factors k = 1,...,m with levels j_k = 1,...,b_k. Show that the moment matrix of a design τ on the experimental domain T = {1,...,a} × {1,...,b_1} × ··· × {1,...,b_m} is

for


say, where W_k and W_{kℓ} are the two-dimensional marginals of τ between treatments and blocking factor k, and blocking factors k and ℓ, respectively, with one-dimensional marginals r = W_k1_{b_k} and s_k = W_k'1_a [Pukelsheim (1986), p. 340].

14.8 (continued) Show that the centered treatment contrast information matrix is C(τ) = Δ_r − WE⁻W'.

14.9 (continued) A design τ is called a treatment-factor product design when W_k = rs_k' for all k ≤ m. For such designs τ, show that C(τ) = Δ_r − rr'.

14.10 (continued) A design τ is called a factor-factor product design when W_{kℓ} = s_ks_ℓ' for all k ≠ ℓ ≤ m. For such designs τ, show that C(τ) =


C H A P T E R 15

Rotatability and Response Surface Designs

In m-way dth-degree polynomial fit models, invariance comes under the heading of rotatability. The object of study is the information surface, that is, the information that a design contains for the estimated response surface. Orthogonal invariance is then called rotatability. The implications on moment matrices and design construction are discussed, in first-degree and second-degree models. For practical purposes, near rotatability is often sufficient and leaves room to accommodate other, operational aspects of empirical model-building.

15.1. RESPONSE SURFACE METHODOLOGY

Response surface methodology concentrates on the relationship between the experimental conditions t as input variable, and the expected response E_P[Y] as output variable. The precise functional dependence of E_P[Y] on t is generally unknown. Thus the issue becomes one of selecting an approximating model, in agreement with the response data that are obtained in an experiment or that are available otherwise. The model must be sufficiently complex to approximate the true, though unknown, relationship, while at the same time being simple enough to be well understood. To strike the right balance between model complexity and inferential simplicity is a delicate decision that usually evolves by repeatedly looping through a learning process:

CONJECTURE → DESIGN → EXPERIMENT → ANALYSIS → CONJECTURE → ···

There is plenty of empirical evidence that a valuable tool in this process is the classical linear model of Section 1.3.

In the class of classical linear models, the task is one of finding a regression function f to approximate the expected response, E_P[Y] = f(t)'θ. For the


most part, we study the repercussions of the choice of f on an experimental design τ, in the class T of all designs on the experimental domain T. In Section 15.21 we comment on the ultimate challenge of model-building.

15.2. RESPONSE SURFACES

We assume the classical linear model of Section 1.3,

with regression function f : T → R^k on some experimental domain T. The model response surface is defined to be a function on T,

t ↦ f(t)'θ,

and depicts the dependence of the expected yield f(t)'θ on the experimental conditions t ∈ T. The parameter vector θ is unknown, and so is the model response surface. Therefore the object of study becomes the estimated response surface,

t ↦ f(t)'θ̂,

based on an experiment with n × k model matrix X and n × 1 observation vector Y. If f(t)'θ is estimable, that is, if f(t) lies in the range of X', then it is estimated by f(t)'θ̂ = f(t)'(X'X)⁻X'Y (see Section 3.5). The variance of the estimate is σ²f(t)'(X'X)⁻f(t) in case f(t) ∈ range X', and otherwise it is taken to be ∞.

The statistical properties of the estimated response surface are determinedby the moment matrix M = X'X/n, and are captured by the standardizedinformation surface iM : T —> (R which for t € T is given by

and iM (t) — 0 otherwise. In terms of the information matrices CK(M) of Sec-tion 3.2, we have iM(t) = C/(,)(Af). That is, 1^(1) represents the informationthat a design with moment matrix M contains for the model response surface/(r)'0, and emphasis is on the dependence on /. The domain T where theexperimenter is interested in the behavior of IM need not actually coincidewith the experimental domain T, but could be a subset or a superset of T.

Equivalently, we may study the standardized variance surface v_M = 1/i_M. But lack of identifiability of f(t)'θ entails v_M(t) = ∞, which makes the study of v_M less convenient than that of i_M. Anyway, the notion of an information surface is more in line with our information-oriented approach.


The following lemma states that the moment matrix M is uniquely determined by its information surface i_M. In the earlier chapters, we have concentrated on designs ξ ∈ Ξ over the regression range X. Since interest now shifts to designs τ ∈ T over the experimental domain T, we change the notation and denote the set of all moment matrices by

This is in line with the notation M_d(T) for d th-degree polynomial fit models in Section 1.28.

15.3. INFORMATION SURFACES AND MOMENT MATRICES

Lemma. Let M, A ∈ M_f(T) be two moment matrices. Then we have i_M = i_A if and only if M = A.

Proof. Only the direct part needs to be proved. The two information surfaces i_M and i_A vanish simultaneously or not. Hence the matrices M and A have the same range of dimension r, say. We choose a full rank decomposition A = KK', with K ∈ R^{k×r}, and some left inverse L of K. Then M and K have the same range and fulfill KLM = M and MM⁻K = K, by Lemma 1.17.

The positive definite r × r matrix D = K'M⁻K has the same trace as AM⁻. With a design τ ∈ T that has moment matrix A, equality of the information surfaces implies

This yields trace D = r. With a design that achieves M, we similarly see that the inverse D⁻¹ = LML' satisfies trace D⁻¹ = trace MA⁻ = trace MM⁻ = r. From

we get D⁻¹ = I_r, and conclude that M = KLML'K' = KK' = A.

In other words, there is a unique correspondence between information surfaces i_M and moment matrices M. While it is instructive to visualize an information surface i_M, the technical discussion usually takes recourse to the corresponding moment matrix M. This becomes immediately apparent when studying invariant design problems as introduced in Section 13.6.

To this end, let ℛ be a group that acts on the experimental domain T in such a way that the regression function f is equivariant, f(R(t)) = Q_R f(t) for all t ∈ T and R ∈ ℛ, where Q_R is a member of an appropriate matrix group


Q ⊆ GL(k). We show that invariance of an information surface i_M under ℛ is the same as invariance of the moment matrix M under Q. By definition, an information surface i_M is called ℛ-invariant when

In the sequel, T is a subset of R^m and ℛ is the group Orth(m) of orthogonal m × m matrices; we then call i_M rotatable. The precise relation between ℛ-invariant information surfaces and Q-invariant moment matrices is the following.

15.4. ROTATABLE INFORMATION SURFACES AND INVARIANT MOMENT MATRICES

Theorem. Assume that ℛ is a group acting on the experimental domain T, and that Q ⊆ GL(k) is a homomorphic image of ℛ such that the regression function f : T → R^k is ℛ-Q-equivariant. Let M ∈ M_f(T) be a moment matrix. Then the information surface i_M is ℛ-invariant if and only if M is Q-invariant.

Proof. Given a transformation R ∈ ℛ, let Q ∈ Q be such that f(R⁻¹(t)) = Q⁻¹f(t) for all t ∈ T. We have f(R⁻¹(t)) ∈ range M if and only if f(t) ∈ range QMQ'. Since Q⁻¹'M⁻Q⁻¹ is a generalized inverse of QMQ', we get

and i_M(R⁻¹(t)) = i_{QMQ'}(t) for all t ∈ T.

In the direct part, we have i_M(t) = i_M(R⁻¹(t)) = i_{QMQ'}(t), and Lemma 15.3 yields M = QMQ'. Varying R through ℛ, we reach every Q in the image group Q. Hence M is Q-invariant. The converse follows similarly.

Thus rotatable information surfaces come with moment matrices that, due to their invariance properties, enjoy a specific structure. Some examples of invariant subspaces of symmetric matrices are given in Section 13.9. Others now emerge when we discuss rotatable designs for polynomial fit models.

15.5. ROTATABILITY IN MULTIWAY POLYNOMIAL FIT MODELS

The m-way d th-degree polynomial fit model, as introduced in Section 1.6, has as experimental conditions an m × 1 vector t = (t₁, ..., t_m)', with entry t_i


representing the level of the i th out of m factors. The case of a single factor, m = 1, is extensively discussed in Chapter 9. In the sequel, we assume m ≥ 2.

The regression function f(t) consists of the (m+d choose d) distinct monomials of degree 0, ..., d in the variables t₁, ..., t_m.

For the rotatability discussion, we assume the experimental domain to be a Euclidean ball of radius r > 0, T_r = {t ∈ R^m : ||t|| ≤ r}. Any such experimental domain T_r is invariant under the orthogonal group ℛ = Orth(m), acting by left multiplication t ↦ Rt as mentioned in Section 13.2. We follow standard customs in preferring the more vivid notion of rotatability over the more systematic terminology of orthogonal invariance.

From Section 14.5, the strongest statements emerge if invariance were to pertain to a design τ itself. Rotatability of a measure τ means that it distributes its mass uniformly over spheres. However, our definition of a design in Section 1.24 stipulates finiteness of the support. Therefore, in our terminology, if m ≥ 2 then no design τ on T_r is rotatable. Nevertheless, for a given degree d there are plenty of designs on T_r that have a rotatable information surface. These designs are often called rotatable d th-degree designs. We refrain from this irritating terminology since, as just pointed out, our notion of a design precludes it from being rotatable. Rotatability is a property pertaining to the d th-degree information surface, or moment matrix.

In the sequel, we only discuss the special radius r = √m, that is, the experimental domain T_√m = {t ∈ R^m : ||t|| ≤ √m}.

With this choice the vertices of the symmetrized unit cube, [−1;1]^m, come to lie on the boundary sphere of T_√m. This is the appropriate generalization of the experimental domain [−1;1] for the situation of a single factor. For a given degree d ≥ 1, the program thus is the following. We compute a matrix group Q_d ⊆ GL(k) under which the regression function f : T_√m → R^k is equivariant, as discussed in Section 13.3. We then strive to single out a finite subset of Q_d which determines invariance of symmetric matrices. Of course, the associated set ℛ ⊆ Orth(m) is also finite. Invariance under selected transformations in ℛ then points to designs τ which have a rotatable d th-degree moment matrix.

15.6. ROTATABILITY DETERMINING CLASSES OF TRANSFORMATIONS

Part (b) of Lemma 13.10 secures the existence of a finite subset ℛ ⊆ Orth(m) of orthogonal matrices such that for any moment matrix, invariance relative


to the set induced by ℛ implies invariance relative to the underlying group Q_d. The issue is to find a set ℛ which is small and easy to handle.

Our choice is based on the permutation group Perm(m). Beyond this it suffices to include the rotation R_π/4, defined by

which rotates the (t₁,t₂)-plane by 45° and leaves the other coordinates fixed. This plane exists because of our general dimensionality assumption m ≥ 2. We show that the (m! + 1)-element set ℛ = Perm(m) ∪ {R_π/4}

is a rotatability determining class for both the first-degree model (Lemma 15.8) and the second-degree model (Lemma 15.15).

Invariance relative to ℛ entails invariance under all finite products PQR··· with factors P, Q, R, ... in ℛ. This covers the rotation by 45° of any other (t_i,t_j)-plane, i ≠ j, since this transformation can be written in the form P'R_π/4 P with P ∈ Perm(m). Furthermore every sign-change matrix can be written as a finite product of permutations and 45° rotations of the coordinate axes. Therefore we would not create any additional invariance conditions by adjoining the sign-change group Sign(m) to ℛ.
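The claim that sign changes are generated by ℛ is easy to check numerically. The following sketch (an illustration, not from the book; it assumes NumPy, and m = 3 is an arbitrary choice) builds R_π/4 and verifies that a swap of the first two coordinates followed by two 45° rotations produces the sign change diag(1, −1, 1, ..., 1):

```python
import numpy as np

m = 3  # arbitrary illustration; the chapter assumes m >= 2

# 45-degree rotation of the (t1, t2)-plane, identity on the remaining axes.
R45 = np.eye(m)
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R45[:2, :2] = [[c, -s], [s, c]]

# Permutation matrix swapping the first two coordinates.
P = np.eye(m)[[1, 0] + list(range(2, m))]

# Two 45-degree rotations followed by the swap give a sign change.
sign_change = P @ R45 @ R45
print(np.round(sign_change, 12))
print(np.allclose(sign_change, np.diag([1.0, -1.0] + [1.0] * (m - 2))))  # True
```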

15.7. FIRST-DEGREE ROTATABILITY

In an m-way first-degree model, the regression function is

with T_√m the ball of radius √m in R^m and k = 1 + m. The moment matrix of a design τ ∈ T is denoted by M₁(τ).

For a rotation R ∈ Orth(m), the identity


suggests the definition of the (1 + m) × (1 + m) matrix group Q₁,

With this, the regression function f becomes ℛ-Q₁-equivariant. The rotatability determining class ℛ of Section 15.6 then induces the finite subset

The rotatable symmetric matrices have a simple pattern, as follows.

15.8. ROTATABLE FIRST-DEGREE SYMMETRIC MATRICES

Lemma. For every matrix A ∈ Sym(1+m), the following three statements are equivalent:

a. (Q₁-invariance) A is Q₁-invariant.

b. (Invariance under the induced subset) A is invariant under the finite subset of Q₁ induced by ℛ.

c. (Parametrization) For some α, β ∈ R we have A = (α 0'; 0 βI_m), that is, the off-diagonal block vanishes and the lower right block is a multiple of the identity.

Proof. Partitioning A to conform with Q ∈ Q₁, invariance A = QAQ' means

That (c) implies (a) is plainly verified by inserting a = 0 and B = βI_m. It is clear that (a) implies (b) since the latter involves fewer transformations. Now assume (b).

First we average over permutations and obtain


for some δ, β, γ ∈ R (see Section 13.10). Then, with vector s = R_π/4 1_m = (0, √2, 1, ..., 1)', invariance under the rotation R_π/4 yields

It follows that δ = 0 and γ = 0, whence (c) is established.

Part (c) says that the subspace Sym(1 + m, Q₁) of invariant symmetric matrices has dimension 2, whatever the value of m. An orthogonal basis is given by

The projection of A ∈ Sym(1 + m) onto Sym(1 + m, Q₁) is A̅ = ⟨A, V₁⟩V₁ + (⟨A, V₂⟩/m)V₂.

We call a first-degree moment matrix rotatable when it is Q₁-invariant. The set of rotatable first-degree moment matrices,

is compact and convex. It turns out to be a line segment in the two-dimensional space Sym(m + 1, Q₁).

15.9. ROTATABLE FIRST-DEGREE MOMENT MATRICES

Theorem. Let M be a symmetric (1+m) × (1+m) matrix. Then M is a rotatable first-degree moment matrix on the experimental domain T_√m if and only if for some μ₂ ∈ [0; 1], we have

The moment matrix in (1) is attained by a design τ ∈ T if and only if τ has all moments of order 2 equal to μ₂,

for all i ≤ m, while the other moments up to order 2 vanish.


Proof. For the direct part, let τ be a design on T_√m with a rotatable first-degree moment matrix M. Lemma 15.8 entails

Calculating the moments of τ, we find

That β is the second moment under τ common to the components t_i is expressed through a change in notation, β = μ₂. Hence M has form (1), that is, the moments of τ fulfill (2). Clearly μ₂ ≥ 0. As an upper bound, we obtain

For the converse, we notice that the rotatable matrix

is achieved by the one-point design in 0, while

is attained by the uniform distribution on the sphere of radius √m, or by the designs to be discussed in Section 15.11. Then every matrix on the line connecting A and B is also a moment matrix, and is rotatable.

A design τ with a rotatable first-degree moment matrix induces the rotatable first-degree information surface i_τ(t) = 1/(1 + t't/μ₂) for all t ∈ R^m, provided the common second moment μ₂ of τ is positive. If by choice of τ we enlarge μ₂, then the surface i_τ is uniformly raised. This type of improvement leads to Kiefer optimality relative to the group Q₁, as discussed in Theorem 14.6.

15.10. KIEFER OPTIMAL FIRST-DEGREE MOMENT MATRICES

Corollary. The unique Kiefer optimal moment matrix for θ in M₁(T) is


with associated information surface

Proof. Among all Q₁-invariant moment matrices, M is Loewner optimal, by Theorem 15.9. Theorem 14.6 then yields Kiefer optimality of M in M₁(T).

How do we find designs that have a Kiefer optimal moment matrix? Consider a design τ which assigns uniform weight 1/ℓ to ℓ vectors t_i ∈ R^m of length √m, with model matrix

If the moment matrix of τ is Kiefer optimal,

then X has orthogonal columns of squared lengths ℓ. Such designs are called orthogonal because of the orthogonality structure of X. A model matrix X with X'X = ℓI_{1+m} is achieved by certain two-level factorial designs.

15.11. TWO-LEVEL FACTORIAL DESIGNS

One way to construct a Kiefer optimal first-degree design on the experimental domain T_√m is to vary each of the m factors on the two levels ±1 only. Hence the vectors of experimental conditions are t_i ∈ {±1}^m, the 2^m vertices of the symmetrized cube [−1;1]^m included in the rotatable experimental domain T_√m.

The design that assigns uniform weight 1/ℓ to each of the ℓ = 2^m vertices of [−1;1]^m is called the complete factorial design 2^m. It has a model matrix X ∈ R^{ℓ×(1+m)} satisfying X'X = ℓI_{1+m}. For instance, in the two-way or three-way first-degree model, X is given by


up to permutation of rows. The support size 2^m of the complete factorial design quickly outgrows the quadratic bound ½k(k+1) = ½(m+1)(m+2) of Corollary 8.3.
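The orthogonality property of the complete factorial design is quickly verified numerically. The sketch below is illustrative (not from the book) and assumes NumPy; it builds the 2^m vertices of the cube and checks X'X = 2^m I_{1+m}, so the standardized moment matrix is the identity:

```python
import numpy as np
from itertools import product

m = 3  # illustrative number of factors

# Complete factorial 2^m: all vertices of the cube [-1, 1]^m.
cube = np.array(list(product([-1.0, 1.0], repeat=m)))

# First-degree model matrix X with rows (1, t') for each vertex t.
X = np.column_stack([np.ones(len(cube)), cube])

# Kiefer optimality in the first-degree model: X'X = 2^m * I_{1+m}.
print(np.allclose(X.T @ X, 2**m * np.eye(1 + m)))   # True
print(X.T @ X / 2**m)                                # identity moment matrix
```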

The support size is somewhat less excessive for a 2^{m−p} fractional factorial design which, by definition, comprises a 2^{−p} fraction of the complete factorial design 2^m in such a way that the associated model matrix X has orthogonal columns. For instance, the following 4 × 4 model matrix X belongs to a one-half fraction of the complete factorial design 2³,

It satisfies X'X = 4I₄ and hence is Kiefer optimal, with only 4 = ½·2³ runs. There are also optimal designs with a minimum support size ℓ = 1 + m = k, conveniently characterized by their geometric shape as a regular simplex.

15.12. REGULAR SIMPLEX DESIGNS

Since in an m-way first-degree model, the mean parameter vector θ = (θ₀, θ₁, ..., θ_m)' has 1 + m components, the smallest possible support size is 1 + m if a design is to be feasible for θ. Indeed, there exist Kiefer optimal designs with this minimal support size.

To prove this for any number m ≥ 2 of factors, we need to go beyond two-level designs and permit support points t_i ∈ T_√m other than the vertices {±1}^m. For ℓ = 1 + m runs the model matrix X is square. Hence X'X = (m+1)I_{m+1} is the same as XX' = (m+1)I_{m+1}. Thus the vectors t_i in the rows of X fulfill, for all i ≠ j ≤ m + 1,

In other words, the convex hull of the vectors t₁, ..., t_{m+1} in R^m is a polytope which has edges t_i − t_j of common squared length 2(m+1). Such a convex body is called a regular simplex.

A design that assigns uniform weight 1/(m+1) to the vertices t₁, ..., t_{m+1} of a regular simplex in R^m is called a regular simplex design. For two factors, the three support points span an equilateral triangle in R². For three factors, the four support points generate an equilateral tetrahedron in R³, and so on. The diameter of the simplex is such that the vertices t_i belong to the boundary sphere of the ball T_√m which figures as experimental domain.
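One explicit construction is sketched below (an illustration under my own construction choice, not the book's recipe; it assumes NumPy): take an orthogonal (m+1) × (m+1) matrix whose first column is proportional to the vector of ones, scale it by √(m+1), and read the support points from the rows behind the leading ones. The checks confirm X'X = (m+1)I, support points of norm √m, and common squared edge length 2(m+1):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4  # illustrative number of factors

# Orthogonal (m+1) x (m+1) matrix whose first column is ones/sqrt(m+1).
A = np.column_stack([np.ones(m + 1), rng.standard_normal((m + 1, m))])
Q, _ = np.linalg.qr(A)
Q[:, 0] *= np.sign(Q[0, 0])            # fix the sign of the first column

X = np.sqrt(m + 1) * Q                 # square model matrix, first column = 1
T = X[:, 1:]                           # the m+1 support points t_1, ..., t_{m+1}

print(np.allclose(X[:, 0], 1.0))                           # leading ones
print(np.allclose(X.T @ X, (m + 1) * np.eye(m + 1)))       # X'X = (m+1) I
print(np.allclose(np.sum(T**2, axis=1), m))                # ||t_i||^2 = m
d2 = [np.sum((T[i] - T[j])**2) for i in range(m + 1) for j in range(i)]
print(np.allclose(d2, 2 * (m + 1)))                        # regular simplex
```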

If a regular simplex design can be realized using the two levels ±1 only, t_i ∈ {±1}^m, then the model matrix X has entries ±1 besides satisfying X'X = (m+1)I_{m+1}. Such matrices X are known as Hadamard matrices. We may


summarize as follows, for an m-way first-degree model on the experimental domain T_√m = {t ∈ R^m : ||t|| ≤ √m}.

A regular simplex design and any rotation thereof is Kiefer optimal for θ in T; it has smallest possible support size m + 1 and always exists. A two-level regular simplex design exists if and only if there is a Hadamard matrix of order m + 1. The complete factorial design 2^m is a two-level Kiefer optimal design for θ in T with a large support size 2^m. Any 2^{m−p} fractional factorial design reduces the support size while maintaining optimality.

These results rely on the moment conditions (2) in Theorem 15.9. For second-degree models, moments up to order 4 are needed. An efficient bookkeeping of higher order moments relies on Kronecker products and vectorization of matrices.

15.13. KRONECKER PRODUCTS AND VECTORIZATION OPERATOR

The Kronecker product of two matrices A ∈ R^{k×m} and B ∈ R^{ℓ×n} is defined as the kℓ × mn block matrix

For two vectors s ∈ R^m and t ∈ R^n, this simplifies to the mn × 1 block vector

The key property of the Kronecker product is how it conforms with matrix multiplication,

This is easily seen as follows. For Euclidean unit vectors e_i ∈ R^m and d_j ∈ R^n, the identity (A ⊗ B)(e_i ⊗ d_j) = (Ae_i) ⊗ (Bd_j) is verified from the definition. For arbitrary vectors s ∈ R^m and t ∈ R^n, bilinearity of the Kronecker product entails (1),


Formula (1) extends to matrices C ∈ R^{m×p} and D ∈ R^{n×q},

Again this is first verified for Euclidean unit matrices E_ij = e_i d_j' ∈ R^{m×p} and E_kℓ ∈ R^{n×q}, by referring to (1) three times,

Then an appeal to bilinearity justifies the extension to C = Σ_{i,j} c_ij E_ij and D = Σ_{k,ℓ} d_kℓ E_kℓ.

As a consequence of (2), a generalized inverse, the Moore-Penrose inverse, the inverse, and the transpose of a Kronecker product are equal to the Kronecker product of the generalized inverses, Moore-Penrose inverses, inverses, and transposes, respectively.

The grand vector s⊗t assembles the same cross products s_i t_j that appear in the rank 1 matrix st'. For an easy transition between the two arrangements, we define

In other words, the matrix st' is converted into a column vector by a concatenation of rows, (s₁t', ..., s_mt'), followed by a transposition. This is extended to matrices A = Σ_{i,j} a_ij e_i d_j' by linearity, vec A = Σ_{i,j} a_ij vec(e_i d_j') = Σ_{i,j} a_ij (e_i ⊗ d_j). The construction results in a linear mapping called the vectorization operator on R^{m×n},

which maps any rectangular m × n matrix A into a column vector vec A based on the lexicographic order of the subscripts, vec A = (a₁₁, ..., a₁ₙ, a₂₁, ..., a₂ₙ, ..., a_m1, ..., a_mn)'.

The matrix version of formula (3) is

provided dimensions match appropriately. This follows from vec(A e_i d_j' C) = (Ae_i) ⊗ (C'd_j), and linearity. The


vectorization operator is also scalar product preserving. That is, for all A, B ∈ R^{m×n}, we have

In second-degree moment matrices, the representation of moments of order 4 is based on the m² × m² matrices

where E_ij = e_i e_j' are the Euclidean unit matrices in R^{m×m}. Identity (5) is immediate from I_m = Σ_i E_ii and bilinearity of the Kronecker product. Identity (7) additionally uses (vec E_ii)(vec E_jj)' = (e_i ⊗ e_i)(e_j ⊗ e_j)' = E_ij ⊗ E_ij. Identity (6) serves as the definition of the matrix I_{m,m}, called the vec-permutation matrix. The reason is that I_{m,m} is understood best through its operation on vectorized matrices,

This follows from I_{m,m}(vec E_kℓ) = vec E_ℓk, extending to (8) by linearity.
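The identities of this section are easy to check numerically. The following sketch is illustrative (not from the book) and assumes NumPy; note that the row-wise vec used in the text matches NumPy's default row-major reshape. It verifies the multiplication rule, the relation s⊗t = vec(st'), and the action of the vec-permutation matrix on vectorized matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, q = 3, 4, 2, 5   # arbitrary dimensions for illustration

def vec(A):
    """Row-wise vectorization, as used in the text."""
    return A.reshape(-1)

A, C = rng.standard_normal((m, p)), rng.standard_normal((p, q))
B, D = rng.standard_normal((n, q)), rng.standard_normal((q, n))

# Multiplication rule (A x B)(C x D) = (AC) x (BD).
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

# s x t equals vec(st') for vectors.
s, t = rng.standard_normal(m), rng.standard_normal(n)
print(np.allclose(np.kron(s, t), vec(np.outer(s, t))))

# Vec-permutation matrix I_{m,m}: it maps vec(E) to vec(E') for all E.
I_mm = sum(np.kron(np.outer(ei, ej), np.outer(ej, ei))
           for ei in np.eye(m) for ej in np.eye(m))
E = rng.standard_normal((m, m))
print(np.allclose(I_mm @ vec(E), vec(E.T)))
```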

This provides the technical basis to discuss rotatability in second-degree models.

15.14. SECOND-DEGREE ROTATABILITY

In an m-way second-degree model, m ≥ 2, we take the regression function to be

with T_√m the ball of radius √m in R^m and k = 1 + m + m². The moment matrix of a design τ ∈ T is denoted by M₂(τ).

The m² × 1 bottom portion t ⊗ t represents the mixed products for i ≠ j twice, as t_i t_j and as t_j t_i. Thus the representation of second-degree terms in f(t)


is redundant. Yet the powerful properties of the Kronecker product make f superior to any other form of parametrizing the second-degree model. To appreciate the issue, we refer to the vectorization operator, t⊗t = vec(tt'), and reiterate our point in terms of the rank 1 matrix tt'. This matrix is symmetric whence, of the m² entries, only ½m(m+1) are functionally independent. Nevertheless the powerful rules of matrix algebra make the arrangement as an m × m matrix superior to any other form of handling the ½m(m+1) different components. The very same point is familiar from treating dispersion matrices, moment matrices, and information matrices as matrices, and not as arrays of a minimal number of functionally independent terms.

Because of the redundancy in t ⊗ t, the function f satisfies the side conditions

for the ½m(m−1) choices of distinct subscripts i and j. Therefore any moment matrix M₂(τ) = ∫ f(t)f(t)' dτ has nullity at least equal to ½m(m−1), or equivalently,

for all τ ∈ T. Rank deficiency of moment matrices poses no obstacle since our development does not presuppose nonsingularity. The introduction of the generalized information matrices in Section 3.21 was guided by a similar motivation.

A rotation R ∈ ℛ = Orth(m) leaves the experimental domain T_√m invariant, and commutes with the regression function f according to

Therefore f is ℛ-Q₂-equivariant relative to the (1 + m + m²) × (1 + m + m²) matrix group

The rotatability determining class ℛ of Section 15.6 induces the subset


Rotatable symmetric matrices now achieve a slightly more sophisticated pattern than those of Section 15.8.

15.15. ROTATABLE SECOND-DEGREE SYMMETRIC MATRICES

Lemma. For every matrix A ∈ Sym(1 + m + m²), the following three statements are equivalent:

a. (Q₂-invariance) A is Q₂-invariant.

b. (Invariance under the induced subset) A is invariant under the finite subset of Q₂ induced by ℛ.

c. (Parametrization) For some α, β, γ, δ₁, δ₂, δ₃ ∈ R we have

where

Proof. We partition A to conform with Q ∈ Q₂. Invariance then means

for all R ∈ Orth(m). That (c) implies (a) follows from R(βI_m)R' = βI_m, and

The last line claims that the left-hand side is the vec-permutation matrix of Section 15.13. It suffices to verify (8) of that section for all A ∈ R^{m×m}:


Hence F(δ₁, δ₂, δ₃) is invariant under R⊗R, and A in part (c) is Q₂-invariant.

Clearly part (a) implies (b) since the latter comprises fewer transformations. It remains to establish the implication from part (b) to (c). We recall

from Section 15.6 that invariance relative to ℛ = Perm(m) ∪ {R_π/4} implies invariance also relative to the sign-change group Sign(m). In a sequence of steps, we use sign changes, permutations, and the 45° rotation to disclose the pattern of A as asserted in part (c).

I. A feasible transformation is the reflection R = −I_m ∈ Sign(m). With this, we obtain a = −a and C = −C, whence a = 0 and C = 0.

II. From B = RBR' for R ∈ Perm(m) ∪ {R_π/4}, we infer B = βI_m for some β ∈ R, as in the proof of Lemma 15.8. The vector b has m² entries and hence may be written as b = vec E for some square matrix E ∈ R^{m×m}. Then b = (R⊗R)b translates into E = RER'. Permutational invariance of E implies complete symmetry even though E may not be symmetric (compare Section 13.9). Invariance under R_π/4 then necessitates E = γI_m. This yields b = γ vec I_m.

III. Now we investigate invariance of the bottom right block, D = (R⊗R)D(R'⊗R'). It is convenient to display the entries of D according to d_{ij,kℓ} = (e_i⊗e_j)'D(e_k⊗e_ℓ).

a. For sign-change matrices R ∈ Sign(m) with diagonal entries ±1, we get

for all i, j, k, ℓ ≤ m. Hence d_{ij,kℓ} vanishes provided four of the subscripts i, j, k, ℓ are distinct, or three are distinct, or two are distinct with multiplicities 1 and 3. This leaves two identical subscripts of multiplicity 2 each, or four identical subscripts,

with 3m(m−1) + m = 3m² − 2m coefficients to be investigated.

b. A permutation matrix R_σ = Σ_i e_σ(i) e_i' ∈ Perm(m) yields d_{ij,kℓ} =

d_{σ(i)σ(j),σ(k)σ(ℓ)}. Whatever the value of m, this reduces the number of coefficients to four, δ₁, δ₂, δ₃, δ₄ ∈ R, say, with


for all i ≠ j ≤ m. Hence D attains the form

This is part (c) provided we show that the last term vanishes.

c. This is achieved by invoking the 45° rotation R_π/4 from Section 15.6,

giving

Thus the last term vanishes, and the lemma is proved.

For second-degree rotatability, the subspace Sym(1 + m + m², Q₂) of invariant symmetric matrices has dimension 6, whatever the value of m.

We call a second-degree moment matrix M₂(τ) rotatable when it is Q₂-invariant. The set of rotatable second-degree moment matrices,

is parametrized by the moment of order 2, μ₂(τ) = ∫(e_i't)² dτ, and the moment of order 2-2, μ₂₂(τ) = ∫(e_i't)²(t'e_j)² dτ for i ≠ j, as follows.

15.16. ROTATABLE SECOND-DEGREE MOMENT MATRICES

Theorem. Let M be a symmetric (1 + m + m²) × (1 + m + m²) matrix. Then M is a rotatable second-degree moment matrix on the experimental domain T_√m if and only if for some

we have

where


The moment matrix in (1) is attained by a design τ ∈ T if and only if τ has all moments of order 2 equal to μ₂, all moments of order 2-2 equal to μ₂₂, and all moments of order 4 equal to 3μ₂₂,

for all i ≠ j ≤ m, while the other moments up to order 4 vanish.

Proof. For the direct part, let τ be a design on T_√m = {t ∈ R^m : ||t|| ≤ √m} with a rotatable second-degree moment matrix M. Then M is of the form given in Lemma 15.15. Calculating the moments of τ, we find α = 1 and

Hence M has form (1), that is, the moments of τ fulfill (2). The bounds on μ₂ are copied from Theorem 15.9. For μ₂₂, the upper bound

follows from fixing j = m and summing over i ≠ m,

The lower bound,

is obtained from the variance of t't = (vec I_m)'(t ⊗ t) under τ,


For the converse, we need to construct a design τ which for given parameters μ₂ and μ₂₂ has M of (1) for its moment matrix. We utilize the uniform distribution τ_r on the sphere of radius r, {t ∈ R^m : ||t|| = r}; the central composite designs in Section 15.18 have the same moments up to order 4 and could also be used. Clearly the measure τ_r is rotatable, with moments

that is, μ₂₂(τ_r) = r⁴/(m(m+2)). Now let the numbers

be given. We define

The measure (1 − α)τ₀ + ατ_r then places mass α on the sphere of radius r and puts the remaining mass 1 − α into the point 0. It attains the given moments,

Thus there exists a design of which the matrix M in (1) is the moment matrix.

Rotatable second-degree information surfaces take an explicit form.

15.17. ROTATABLE SECOND-DEGREE INFORMATION SURFACES

Corollary. If the moments μ₂ and μ₂₂ satisfy


then the moment matrix M in (1) of Theorem 15.16 has maximal rank ½(m+1)(m+2), and induces the rotatable information surface, which for t ∈ R^m is given by

Proof. The rotatable moment matrix M in (1) of Theorem 15.16 has positive eigenvalues μ₂ and 2μ₂₂, with associated projectors

where G_m is given by the display above. Hence μ₂ has multiplicity m, while 2μ₂₂ has multiplicity trace G_m = ½m(m+1) − 1. This accounts for all but two degrees of freedom, leaving M − μ₂P₁ − 2μ₂₂P₂ = STS' with

The nonvanishing eigenvalues of STS' are the same as those of the 2 × 2 matrix S'ST. The latter has determinant

say. By our moment assumption, d is positive. Hence the rank of M is maximal.

The Moore-Penrose inverse then is

From part (c) of Lemma 13.10, this matrix is necessarily invariant. Indeed, a parametric representation as in Lemma 15.15 holds, with

Finally, straightforward evaluation of f(t)'M⁺f(t) yields i_M(t).


15.18. CENTRAL COMPOSITE DESIGNS

In the proof of Theorem 15.16, we established the existence of a design with prescribed moments μ₂ and μ₂₂ by taking recourse to the uniform distribution τ_r on the sphere of radius r. For lack of a finite support, this is not a design in the terminology of Section 1.24.

Designs with the same lower order moments as the uniform distribution are those that place equal weight on the vertices of a regular polyhedron. In the first-degree model of Section 15.12, moments up to order 2 matter and call for regular simplices. For a second-degree model, moments up to order 4 must be matched. With m = 3 factors, at least ½(m+1)(m+2) = 10 support points are needed to achieve a maximal rank. The twelve vertices of an icosahedron can be used, or the twenty vertices of a dodecahedron.

The class of central composite designs serves the same purpose, as well as being quite versatile in many other respects. These designs are mixtures of three building blocks: cubes, stars, and center points. The cube portion τ_c is a 2^{m−p} fractional factorial design. If it is replicated n_c times, then n_cτ_c is a design for sample size 2^{m−p}n_c. The star portion τ_s takes one observation at each of the vectors ±re_i for i ≤ m, for some star radius r > 0. With n_s replications, n_sτ_s is a design for sample size 2mn_s. The center point portion τ₀ is the one-point design in 0; it is replicated n₀ times. The design n_cτ_c + n_sτ_s + n₀τ₀ is then a central composite design for sample size n = 2^{m−p}n_c + 2mn_s + n₀, with star radius r > 0. The only nonvanishing moments of the standardized design τ = (n_cτ_c + n_sτ_s + n₀τ₀)/n ∈ T are

Rotatability imposes the condition μ₄(τ) = 3μ₂₂(τ), that is, r⁴ = 2^{m−p}n_c/n_s. Specifically, we choose n_c = m², n_s = 2^{m−p}, and n₀ = 0 to obtain a central

composite design with no center points, for sample size n = 2^{m−p}m(m+2). The rotatability condition r⁴ = 2^{m−p}n_c/n_s forces r = √m. Hence the star points ±√m·e_i lie on the sphere of radius √m, as do the cube points {±1}^m. The resulting design is

Up to order 4, it has the same moments as the uniform distribution τ_√m,

The corresponding design with cube and star points scaled to lie on the sphere of radius r > 0 has moments up to order 4 that match


those of the uniform distribution τ_r. In retrospect, we can achieve any given moments μ₂ and μ₂₂

in Theorem 15.16 by a central composite design that places weight α on the cube-plus-star design scaled to radius r and weight 1 − α on the one-point design in 0, with

This design has a finite support size, at most equal to 2^{m−p} + 2m + 1, and is a legitimate member of the design set T.
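As a numerical illustration (not from the book; it assumes NumPy, and the replication numbers are arbitrary), the sketch below assembles a central composite design from its cube, star, and center portions, computes the moments μ₂, μ₂₂, and μ₄ of the standardized design, and checks the rotatability condition μ₄ = 3μ₂₂, equivalently r⁴ = 2^{m−p}n_c/n_s:

```python
import numpy as np
from itertools import product

m, p = 3, 0                 # illustrative: three factors, full cube
n_c, n_s, n_0 = 4, 2, 6     # replications of cube, star, and center portions
r = (2**(m - p) * n_c / n_s) ** 0.25   # star radius forcing rotatability

cube = np.array(list(product([-1.0, 1.0], repeat=m)))      # 2^{m-p} points
star = np.vstack([r * np.eye(m), -r * np.eye(m)])           # 2m star points
center = np.zeros((1, m))

points = np.vstack([np.repeat(cube, n_c, axis=0),
                    np.repeat(star, n_s, axis=0),
                    np.repeat(center, n_0, axis=0)])
w = np.full(len(points), 1.0 / len(points))                  # standardized design

mu2 = np.sum(w * points[:, 0]**2)                            # moment of order 2
mu22 = np.sum(w * points[:, 0]**2 * points[:, 1]**2)         # moment of order 2-2
mu4 = np.sum(w * points[:, 0]**4)                            # moment of order 4

print(mu2, mu22, mu4)
print(np.isclose(mu4, 3 * mu22))                             # rotatability holds
```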

Of the two remaining parameters μ₂ and μ₂₂, we now eliminate μ₂₂ by a Loewner improvement, and calculate a threshold for μ₂ by studying the eigenvalues of M. This leads to admissibility results and complete classes in second-degree models.

15.19. SECOND-DEGREE COMPLETE CLASSES OF DESIGNS

Theorem. For α ∈ [0;1], let τ_α be the central composite design which places mass α on the cube-plus-star design of radius √m from the previous section while putting weight 1 − α into 0.

a. (Kiefer completeness) For every design τ ∈ T, there is some α ∈ [0;1] such that the central composite design τ_α improves upon τ in the Kiefer ordering, M₂(τ_α) ≥ M₂(τ), relative to the group Q₂ of Section 15.14.

b. (Q₂-invariant φ) Let φ be a Q₂-invariant information function on NND(1 + m + m²). Then for some α ∈ [0;1], the central composite design τ_α is φ-optimal for θ in T.

c. (Orthogonally invariant φ) Let φ be an orthogonally invariant information function on NND(1 + m + m²). Then for some α ∈ [2/(m+4); 1], the central composite design τ_α is φ-optimal for θ in T.

Proof. In part (a), we use for Sym(1 + m + m², Q₂) the orthogonal basis

where F is as in Lemma 15.15. For τ ∈ T, we calculate

μ₂(τ), and ⟨M₂(τ), V₄⟩/⟨V₄, V₄⟩ = μ₂₂(τ). Hence the projection of M₂(τ) on Sym(1 + m + m², Q₂) is determined by the moments of τ,


Since this is a rotatable second-degree moment matrix, the coefficients fulfill μ₂₂(τ) ≤ (m/(m+2))μ₂(τ), by Theorem 15.16. Because of nonnegative definiteness of V₄, this permits the estimate

With α = μ₂(τ), the latter coincides with the moment matrix of the central composite design τ_α, which mixes the one-point design τ₀ in 0 and the cube-plus-star design of radius √m with weights 1 − α and α, as introduced in the preceding section. In summary we have, with α = μ₂(τ),

Therefore τ_α is an improvement over τ in the Kiefer ordering, M₂(τ_α) ≥ M₂(τ).

For part (b), we use the monotonicity of Q₂-invariant information functions φ from Theorem 14.3 to obtain max_{τ∈T} φ(M₂(τ)) ≤ max_{α∈[0;1]} φ(M₂(τ_α)).

In part (c), orthogonal invariance implies that φ depends on M₂(τ_α) only through the eigenvalues,

The respective multiplicities are 1, ½m(m+1) − 1, m, and 1. All eigenvalues are increasing for α ∈ [0; 2/(m+4)]. Hence for α ≤ 2/(m+4) the eigenvalue vector admits a componentwise improvement, λ(α) ≤ λ(2/(m+4)). Thus we get φ(M₂(τ_α)) ≤ φ(M₂(τ_{2/(m+4)})) for all α ∈ [0; 2/(m+4)]. Hence there exists a φ-optimal design τ_α with α ∈ [2/(m+4); 1].

The theorem generalizes the special case of a parabola fit model with a single factor of Section 13.1 to an arbitrary number of factors m ≥ 2. For the trace criterion φ₁, again the largest value of the parameter is optimal, α = 1, at the expense of giving away feasibility for θ, rank M₂(τ₁) = ½(m+1)(m+2) − 1.

However, if m ≥ 2, then for the rank deficient smallest-eigenvalue criterion φ₋∞ it is not the smallest value α = 2/(m+4) which is optimal, but α = (m−1)/(2m−1). This is a consequence of the ordering of the eigenvalues, in that the smallest positive eigenvalue is λ₃(α) for α ∈ [0; (m−1)/(2m−1)],


EXHIBIT 15.1 Eigenvalues of moment matrices of central composite designs. For m = 4 factors, the positive eigenvalues λ_i(α) of the moment matrices of the central composite designs τ_α, mixing the one-point design in 0 (weight 1 − α) with the cube-plus-star design of radius 2 (weight α), are increasing on [0; 1/4], for i = 1, 2, 3, 4. The minimum of the eigenvalues is maximized at α = 3/7.

and λ₄(α) for α ∈ [(m−1)/(2m−1); 1] (see Exhibit 15.1). As m tends to ∞ we get 0 ← 2/(m+4) < (m−1)/(2m−1) → 1/2.
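The eigenvalue behavior summarized in Exhibit 15.1 can be reproduced numerically. The sketch below is illustrative (not from the book) and assumes NumPy; it builds the second-degree moment matrices M₂(τ_α) for m = 4 with the Kronecker regression function f(t) = (1, t', (t⊗t)')', scans α over a grid, and locates the weight that maximizes the smallest positive eigenvalue, which lands near α = (m−1)/(2m−1) = 3/7:

```python
import numpy as np
from itertools import product

m = 4
r = np.sqrt(m)

def f(t):
    """Second-degree Kronecker regression function f(t) = (1, t', (t x t)')'."""
    return np.concatenate(([1.0], t, np.kron(t, t)))

# Cube-plus-star design on the sphere of radius sqrt(m): 2^m cube vertices and
# 2m star points with equal weight (Section 15.18 with p = 0, n_0 = 0, m = 4).
cube = np.array(list(product([-1.0, 1.0], repeat=m)))
star = np.vstack([r * np.eye(m), -r * np.eye(m)])
support = np.vstack([cube, star])
weights = np.full(len(support), 1.0 / len(support))

M_sphere = sum(w * np.outer(f(t), f(t)) for w, t in zip(weights, support))
M_origin = np.outer(f(np.zeros(m)), f(np.zeros(m)))

def smallest_positive_eig(alpha, tol=1e-9):
    M = (1 - alpha) * M_origin + alpha * M_sphere
    eig = np.linalg.eigvalsh(M)
    return eig[eig > tol].min()

alphas = np.linspace(0.01, 0.99, 981)
best = max(alphas, key=smallest_positive_eig)
print(best, (m - 1) / (2 * m - 1))     # both close to 3/7 = 0.4286
```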

The bottom line is that rotatability generates a complete class of designs with a single parameter α no matter how many factors m are being investigated.

15.20. MEASURES OF ROTATABILITY

In Section 5.15, we made a point that design optimality opens the way to the practically more relevant class of efficient designs, substituting optimality by near-optimality. In the same vein, we may replace rotatability by near-rotatability.

In fact, given an arbitrary moment matrix M, the question is how it relates to M̂, the orthogonal projection onto the subspace Sym(1 + m + m², Q₂) of rotatable matrices. However, an orthogonal projection is nothing but a least-squares fit for regressing M on the matrices V₀, V₂, and V₄ which span Sym(1 + m + m², Q₂). Hence relative to the squared matrix norm ||A||² = trace A'A the following statements are equivalent:

M is second-degree rotatable,


Deviations from rotatability can thus be measured by the distance between M and M̂, ||M − M̂||² ≥ 0. However, it is difficult to assess the numerical value of this measure in order to decide which numbers are too big to be acceptable. Instead we recommend the R²-type measure ||M̂ − V₀||²/||M − V₀||², which lies between 0 and 1 besides being a familiar diagnostic number in linear model inference.
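One way to approximate the projection M̂ without writing down an explicit basis is to average Q M Q' over Haar-distributed rotations, since that averaging is the orthogonal projection onto the invariant subspace. The Monte Carlo sketch below is an illustration under my own implementation choices (it assumes NumPy and is not the book's least-squares recipe); it estimates M̂ for a non-rotatable design and reports the squared distance ||M − M̂||² as a deviation measure:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 3

def f(t):
    return np.concatenate(([1.0], t, np.kron(t, t)))

def haar_orthogonal(m):
    """Approximately Haar-distributed orthogonal matrix (QR with sign fix)."""
    Q, R = np.linalg.qr(rng.standard_normal((m, m)))
    return Q * np.sign(np.diag(R))

def Q2(R):
    """Congruence transformation induced on second-degree moment matrices."""
    k = 1 + m + m * m
    Q = np.zeros((k, k))
    Q[0, 0] = 1.0
    Q[1:1 + m, 1:1 + m] = R
    Q[1 + m:, 1 + m:] = np.kron(R, R)
    return Q

# Moment matrix of some (non-rotatable) design on the ball of radius sqrt(m).
support = np.array([[1.0, 1.0, 1.0], [-1.0, 1.0, 0.5], [0.0, -1.2, 0.8],
                    [1.5, 0.0, -0.5], [0.0, 0.0, 0.0]])
M = sum(np.outer(f(t), f(t)) for t in support) / len(support)

# Monte Carlo estimate of the rotatable projection E_R[ Q2(R) M Q2(R)' ].
samples = [Q2(R) @ M @ Q2(R).T for R in (haar_orthogonal(m) for _ in range(2000))]
M_hat = np.mean(samples, axis=0)

print(np.linalg.norm(M - M_hat)**2)    # approximate squared deviation from rotatability
```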

15.21. EMPIRICAL MODEL-BUILDING

Empirical model-building deals with the issue of finding a design that is able to resolve the fine structure in a model sequence of growing sophistication. The option to pick a good design, rather than the best one, from a class of reasonable, efficient, near-rotatable designs is of great value when it comes to deciding which of various models to fit.

Optimality in a single model aids the experimenter in assessing the performance of a design. If more than one model is contemplated, then the results of Chapter 11 on discrimination designs become relevant, even though they are computationally expensive. Whatever tools are being considered, they should not obscure the insight that a detailed understanding of the practical problem is needed. Such an understanding may lead the experimenter to a design choice which, although founded on pragmatic grounds, is of high efficiency. Generally, the theory of optimal experimental designs provides but one guide for a rational design selection, albeit an important one.

EXERCISES

15.1 Show that the central composite design τ_α with α = m(m+3)/((m+1)(m+2)) is φ₀′-optimal for θ in T, where φ₀′ is the rank deficient determinant criterion of Section 8.18.

15.2 Show that for four factors, m = 4, the central composite designs τ_α of Section 15.19 have rank deficient matrix mean information

with p ≠ 0, −∞.

15.3 (continued) Show that φ₋∞′-, φ₋₁′-, φ₀′-optimal designs for θ in T are obtained with α = 3/7, 0.748, 14/15, respectively.

15.4 Let the information functions φ on NND(s₁s₂) and φ_i on NND(s_i) for i = 1, 2 satisfy φ(C₁ ⊗ C₂) = φ₁(C₁)φ₂(C₂). Show that if M_i ∈ M_i is φ_i-optimal for K_i'θ_i, for i = 1, 2, then M₁ ⊗ M₂ is φ-optimal for (K₁ ⊗ K₂)'(θ₁ ⊗ θ₂) in M = conv{A₁ ⊗ A₂ : A₁ ∈ M₁, A₂ ∈ M₂} [Hoel (1965), p. 1099; Krafft (1978), p. 286; Pukelsheim (1983a), p. 196].


15.5 (continued) Show that the matrix means on NND(s₁s₂), and on NND(s₁) and NND(s₂), satisfy φ_p(C₁ ⊗ C₂) = φ_p(C₁)φ_p(C₂) for all C₁ ∈ NND(s₁) and C₂ ∈ NND(s₂). Deduce formulas for trace(C₁ ⊗ C₂) and for det(C₁ ⊗ C₂).

15.6 (continued) The two-way second-degree unsaturated regression function f(t₁, t₂) = (1, t₁, t₂, t₁t₂)' is the Kronecker product of f_i(t_i) = (1, t_i)' for i = 1, 2. Find optimal designs by separately considering the one-way first-degree models for i = 1, 2.



Comments and References

The pertinent literature is discussed chapter by chapter and further develop-ments are mentioned.

1. EXPERIMENTAL DESIGNS IN LINEAR MODELS

The material in Chapter 1 is standard. Our linear model is often called alinear regression model in the wide sense, whereas a linear regression modelin the narrow sense is our multiple line fit model. Multiway classificationmodels and polynomial fit models are discussed in such textbooks as Searle(1971), and Box and Draper (1987).

The study of monotonic matrix functions is initiated by Loewner (1934),and is developed in detail by Donoghue (1974). The terminology of aLoewner ordering follows Marshall and Olkin (1979). The Gauss-MarkovTheorem is the key result of the theory of estimation in linear models.In terms of estimation theory, the minimum representation in the Gauss-Markov Theorem 1.19 is interpreted as a covariance adjustment by Rao(1967).

Optimal design theory calls for another, equally important application ofthe Gauss-Markov Theorem when, in Section 3.2, the information matrix fora parameter subsystem is defined. In order that the Gauss-Markov Theoremprovides a basis for the definition of information matrices, it is imperativeto present it without any assumptions on ranks and ranges, as does Theo-rem 1.19. This is closely related to general linear models and generalizedinverses of matrices, see the monograph by Rao and Mitra (1971). Of thevarious classes of generalized inverses, we make do with the simplest one,that of Section 1.16.

Our derivation of the Gauss-Markov Theorem emphasizes matrix algebrarather than the method of least squares. That the Gauss-Markov Theoremand the method of least squares are synonymous is discussed in many statis-tical textbooks. Krafft (1983) derives the Gauss-Markov Theorem by means


of a duality approach, exhibiting an appropriate dual problem. The historyof the method of least squares, and hence of the Gauss-Markov Theorem, isintriguing, see Plackett (1949, 1972), Farebrother (1985), and Stigler (1986).A brief review follows.

Legendre (1806) was the first to publish the Méthode des Moindres Quarrés; he proposed the name and recommended it because of its simplicity. Gauss, according to an 1812 letter to Laplace [see Gauss (Werke), Band X.1, p. 373], had known the method already in 1795, and had found a maximum likelihood type justification in 1798. However, he did not publish his findings until 1809, in Section 179 of his work Theoria Motus Corporum Coelestium [see Gauss (Werke), Band VII, p. 245]. A minimum variance argument that comes closest to what today we call the Gauss-Markov Theorem appeared in 1823, as Section 21 of the paper Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Pars Prior [see Gauss (Werke), Band IV, p. 24]. Markov (1912) contributed to the dissemination of the method by including it as the seventh chapter in his textbook Wahrscheinlichkeitsrechnung; see also Section 4.1 of Sheynin (1989).

The standardization of a design ξ_n for finite sample size n to a discrete probability distribution ξ_n/n anticipates the generalization to infinite sample size. From a practical point of view, a design for finite sample size n is a list (x₁, ..., x_n) of regression vectors x_i which determine experimental run i. The transition to designs for infinite sample size goes back to Elfving (1952), see also Kiefer (1959, p. 281). Ever since, there has been an attempt to discriminate between designs for finite sample size and designs for infinite sample size by appropriate terminology: discrete versus continuous designs; exact versus approximate designs; concrete designs versus design measures. We believe that the inherent distinction is best brought to bear by speaking of designs for finite sample size, and designs for infinite sample size. In doing so, we adopt the philosophy of Pazman (1986), with a slightly different wording.

In the same vein we find it useful to distinguish between moment matrices of a design and information matrices for a parameter subsystem. Many authors simply speak in both cases of information matrices. Kempthorne (1980) makes a point of distinguishing between a design matrix and a model matrix; Box and Hunter (1957) use the terms design matrix and matrix of independent variables. We reserve the term precision matrix for the inverse of a dispersion matrix, because it is our understanding that precision ought to be maximized while variability ought to be minimized. In contrast, the precision matrix of Box and Hunter (1957, p. 199) is the inverse moment matrix n(X'X)⁻¹.

In Lemma 1.26, we use the fact that in Euclidean space, the convex hull of a compact set is compact; this result can be found in most books on convex analysis, see for instance Rockafellar (1970, p. 158). An important implication of Lemma 1.26 is that no more moment matrices beyond the set M(Ξ) evolve if the notion of a design is extended to mean an arbitrary probability distribution P on the Borel sigma-algebra B of the compact regression range X. Namely, every such distribution P is the limit of a sequence (ξ_m)_{m≥1}


of probability measures with finite support, see Korollar 30.5 in Bauer (1990).The limit is in the sense of vague convergence and entails convergence ofthe moment matrices,

In our terminology, the ξ_m are designs in the set Ξ, whence the moment matrices M(ξ_m) lie in M(Ξ). Since the set M(Ξ) is closed, it contains the limit matrix, ∫_X xx' dP ∈ M(Ξ). Hence the moment matrix of P is attained also by some design ξ ∈ Ξ, M(P) = M(ξ).

2. OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS

In the literature, designs that are optimal for c'θ are mostly called c-optimal. The main result of the chapter, Theorem 2.14, is due to Elfving (1952). Studden (1971) generalizes the result to average-variance optimal designs for K'θ, with applications to polynomial extrapolation and linear spline fitting. Other important elaborations of Elfving's result are given in Chernoff (1972), and Fellman (1974). Our presentation is from Pukelsheim (1981) and is set up in order to ease the passage to multidimensional parameter subsystems.

That the cone of nonnegative definite matrices NND(k) is the same as the ice-cream cone, as established in Section 2.5, holds true for dimensionality k = 2 only. A general classification of self-dual cones is given by Bellissard, Iochum and Lima (1978). The separating hyperplane theorem is a standard result from convex analysis; see for instance Rockafellar (1970, p. 95), Bazaraa and Shetty (1979, p. 45), and Witting (1985, p. 71).

The Equivalence Theorem 2.16 on scalar optimality of moment matrices is due to Pukelsheim (1980) where it is obtained as a corollary from the General Equivalence Theorem. The necessity part of the proof puts early emphasis on a type of argument that is needed again in Section 7.7. Following work of Kiefer and Wolfowitz (1959), Hoel and Levine (1964) reduce scalar optimality to Chebyshev approximation problems; see also Kiefer and Wolfowitz (1965), and Karlin and Studden (1966b). For a d th-degree polynomial fit model on [−1;1], Studden (1968) determines optimal designs for individual coefficients θ_j. We include these results in Section 9.12 and Section 9.14. The geometry that underlies the Elfving Theorem is exploited for more sophisticated criteria than scalar optimality by Dette (1991b).

3. INFORMATION MATRICES

Throughout the book, we determine the s × 1 parameter system of interest K'θ through the choice of the k × s coefficient matrix K, not through the


underlying model parametrization that determines the k × 1 parameter vector θ. For instance, the two-way classification model is always parametrized as E[Y_ij] = α_i + β_j, with parameters α_i and β_j forming the components of the vector θ. The grand mean, or the contrasts (α₁ − ᾱ, ..., α_a − ᾱ)', or some other parameter subsystem are then extracted by a proper choice of the coefficient matrix K. There is no need to reparametrize the model!

The central role of information matrices for the design of experiments isevident from work as early as Chernoff (1953). The definition of the infor-mation matrix mapping as a Loewner minimum of linear functions is dueto Gaffke (1987a), see also Gaffke (1985a, p. 378), Fedorov and Khabarov(1986, p. 185), and Pukelsheim (1990). It is made possible through the Gauss-Markov Theorem in the form of Theorem 1.21. The close relation with theGauss-Markov Theorem is anticipated by Pukelsheim and Styan (1983).

Feasibility cones, as domains of optimization for the design problem, werefirst singled out by Pukelsheim (1980,1981). That the range inclusion condi-tion which appears in the definition of feasibility cones is essential for estima-bility, testability, and identifiability is folklore of statistical theory, compareBunke and Bunke (1974). Alternative characterizations based on rank areavailable in Alalouf and Styan (1979). Fisher information matrices are dis-cussed in many textbooks on mathematical statistics. The dispersion formulaof Section 3.10 also appears in the differential geometrical analysis of moregeneral models, as in Barndorff-Nielssen and Jupp (1988). The role of the mo-ment matrix in general parametric modeling is emphasized by Pazman (1990).Parameter orthogonality is investigated in greater generality by Cox and Reid(1987). Lemma 3.12 is due to Albert (1969). For a history of Schur comple-ments and their diverse uses in statistics, see Ouellette (1981), Styan (1985),and Carlson (1986).

The derivation of the properties of information matrices C_K(A) or of generalized information matrices A_K is based on Anderson (1971), and also appears in Krein (1947, p. 492). Anderson (1971) calls A_K a shorted operator, for the reason that

To see this, we recall that by Lemma 3.14, the matrix A_K satisfies A_K ≤ A and range A_K ⊆ range K. Now let B ∈ NND(k) be another matrix with B ≤ A and range B ⊆ range K. From B ≤ A, we get range B ⊆ range A, whence range B ⊆ range A_K by the proof of Theorem 3.15. From Lemma 3.22 and Lemma 1.17, we obtain B = A_K A⁻B. Symmetry of B and the assumption B ≤ A yield B = A_K A⁻B = A_K A⁻BA⁻A_K ≤ A_K A⁻AA⁻A_K = A_K, since A⁻AA⁻ is a generalized inverse of A and hence of A_K, again by Lemma 3.22. Thus A_K is a maximum relative to the Loewner ordering, among all nonnegative definite k × k matrices B ≤ A of which the range is included in the range of K.

Anderson and Trapp (1975, p. 65) establish the minimum property of


shorted operators which we have chosen as the definition. It emphasizes thefundamental role that the Gauss-Markov Theorem plays in linear model the-ory. Mitra and Puri (1979) and Golier (1986) develop a wealth of alternativerepresentations of shorted operators, based on various types of generalizedinverse matrices and generalized Schur complements. Alternative ways toestablish the functional properties of the information matrix mapping are of-fered by Silvey (1980, p. 69), Pukelsheim and Styan (1983), and Hedayat andMajumdar (1985). The method of regularization has more general applica-tions in statistics, see Cox (1988). The line fit example illustrating the discon-tinuity behavior of the information matrix mapping is adapted from Pazman(1986, p. 67).

The C-matrix in a two-way classification model is a well-understood object,but its origin is uncertain. Reference to a C-matrix is made implicitly by Bose(1948, p. (12)), and explicitly by Chakrabarti (1963). Anyway, the name is con-venient in that the C-matrix is the coefficient matrix of the reduced system ofnormal equations, as well as the contrast information matrix. Christof (1987)applies iterated information matrices to simple block designs in two-way clas-sification models. Theorem 3.19 is from Gaffke and Pukelsheim (1988). Forthe occurrence of the C-matrix in models with correlated observations see,for instance, Kunert and Martin (1987), and Kunert (1991).

4. LOEWNER OPTIMALITY

Many authors speak of uniform optimality rather than Loewner optimality. We believe that the latter makes the reference to the Loewner ordering more visible. The Loewner ordering is the same as the uniform ordering of Pazman (1986, p. 48), in view of Lemma 2 of Stępniak, Wang and Wu (1984). It fits in with other desirable notions of information oriented orderings of experiments, see Kiefer (1959, Theorem 3.1), Hansen and Torgersen (1974), and Stępniak (1989).

The first result on Loewner optimal designs appears in Kurotschka (1971),in the setting of the two-way classification models of Section 4.8; see alsoGaffke and Krafft (1977), Kurotschka (1978), Giovagnoli and Wynn (1981,1985a), and Pukelsheim (1983a,c). The first part of Lemma 4.2 is due toLaMotte (1977). Section 4.5 to Section 4.7 follow Pukelsheim (1980). Thenonexistence of Loewner optimal designs discussed in Corollary 4.7 is im-plicit in the paper of Wald (1943, p. 136). The derivation of the GeneralEquivalence Theorem for scalar optimality from Section 4.9 onwards is new.

5. REAL OPTIMALITY CRITERIA

The first systematic attempt to cover more general optimality criteria thanthe classical ones is put forward by Kiefer (1974a). From the broad class


of criteria discussed in that paper, Pukelsheim (1980) singles out informa-tion functions as an appropriate subclass for a general duality theory. Closelyrelated classes of functions are investigated by Rockafellar (1967), and Mc-Fadden (1978). Other classes may be of interest; for instance, Cheng (1978a)distinguishes between what he calls type I criteria and type II criteria. Nal-imov (1974) and Hedayat (1981) provide an overview of various common andsome not so common optimality criteria. Shewry and Wynn (1987) proposea criterion based on entropy. Related design aspects in dynamic systems arereviewed by Titterington (1980b).

It is worth emphasizing that the defining properties of information functions, namely monotonicity, concavity, and homogeneity, are motivated not by technical convenience, but by statistical aspects; see also Pukelsheim (1987a). The analysis of information functions parallels to a considerable extent the general discussion of norms, compare Rockafellar (1970, p. 131), and Pukelsheim (1983b). The Hölder inequality dates back to Hölder (1889) and Rogers (1888), as mentioned in Hardy, Littlewood and Pólya (1934, p. 25). Beckenbach and Bellman (1965, p. 28) call it the Minkowski-Mahler inequality, and emphasize the method of quasi-linearization in defining polar functions. Our formulation of the general design problem in Section 5.15 is anticipated in essence by Elfving (1959). The Existence Lemma 5.16 is a combination of Theorem 1 in Pukelsheim (1980), and Corollary 5.1 in Müller-Funk, Pukelsheim and Witting (1985).

6. MATRIX MEANS

The concepts of determinant optimality, average-variance optimality, andsmallest-eigenvalue optimality are classical, to which we add trace optimal-ity. Determinant optimality and smallest-eigenvalue optimality were first pro-posed by Wald (1943). Krafft (1978) discusses the relation of the determinantcriterion with the Gauss curvature of the power function of the F-test andwith the concentration ellipsoid, see also Nordstrom (1991). Gaffke (1981,p. 894) proves that the determinant is the unique criterion which inducesan ordering that is invariant under nonsingular reparametrization. That traceoptimality acquires its importance only in combination with some other prop-erties has already been pointed out by Kiefer (1960, p. 385). Our notion oftrace optimality is distinct from the T-optimality concept of Atkinson andFedorov (1975a,b) who use their criterion in order to test which of severalmodels is the true one.

Vector means are discussed, for example, by Beckenbach and Bellman(1965). An in-depth study of majorization is given by Hardy, Littlewood andPolya (1934), and Marshall and Olkin (1979). Two alternative proofs of theBirkhoff (1946) theorem and refinements are given in Marshall and Olkin(1979, pp. 34-38). Our method of proof in Section 6.8, of projecting ontodiagonal matrices through averaging with respect to the sign-change group,


is adapted from Andersson and Perlman (1988). The proof of the Hölder inequality follows Beckenbach and Bellman (1965, p. 70), Gaffke and Krafft (1979a), and Magnus (1987). Matrix norms of the form φ(C) = Φ(λ(C)), with Φ a symmetric gauge function, are studied by von Neumann (1937). Marshall and Olkin (1969), based on a result of Schatten (1950, p. 85), show that these norms are monotone. For a general exposition of matrix norms see Horn and Johnson (1985).

According to Wussing and Arnold (1975, p. 230), l'Hospital communicated his rule in 1696 in the textbook Analyse des Infiniment Petits. Under the seal of secrecy, l'Hospital had bought much of the contents of the book from his private tutor, Johann Bernoulli.

Insistence on formulating a convex minimization problem, only because this is the preferred problem type in optimization theory, is detrimental to the general design problem; compare Hoang and Seeger (1991). Theoretically, there is a deep and perfect analogy between convexity and concavity, see Part VII of Rockafellar (1970). Practically, the optimal design problem is an instance testifying to the difference of the two concepts, in that maximization of a concave information function describes the problem much more comprehensively than does minimization of a convex risk function.

Logarithmic concavity is an accidental byproduct of the theory, with nointrinsic value. We know of no instance where the logarithm is of any neces-sity in solving a design problem. Rather, use of the logarithm signals that theproblem under study is not yet fully understood. Lindley (1956) and Stone(1959) access the design problem through entropy which, of course, involvesthe logarithm. But before long, when it comes to the optimization problem,the logarithm disappears and the determinant criterion takes over.

7. THE GENERAL EQUIVALENCE THEOREM

In the literature there is only a sporadic discussion of the existence of optimal designs, as settled by Theorem 7.13. That the existence issue may be crucial is evidenced by Theorem 1 of Whittle (1973, p. 125), whose proof as it stands is incomplete. Existence of determinant optimal designs has never been doubted; see Kiefer (1961, p. 306) and Atwood (1973, p. 343). For scalar optimality, existence follows from the geometric approach of Chapter 2 which is due to Elfving (1952), see also Chernoff (1972, p. 12). For the average-variance criterion, existence of an optimal design is established by Fellman (1974, Theorem 4.1.3). Pazman (1980, Propositions 2 and 4) solves the problem for the sequence of matrix means φ₋₁, φ₋₂, ...

The term subgradient originates with convex functions g since there any subgradient defines an affine function that bounds g from below. For a concave function g, a subgradient yields an affine function that bounds g from above. Hence the term supergradient would be more appropriate in the concave case.


Our formulation of the General Equivalence Theorem 7.14 appears to come as close as possible to the classical results of Kiefer and Wolfowitz (1960), and Kiefer (1974a). The original proof of Pukelsheim (1980) is based on Fenchel duality; Pukelsheim and Titterington (1983) outline the approach based on subgradients. The computation of the set of all subgradients is carried out by Gaffke (1985a,b).

It is worthwhile contemplating the key role of the subgradient inequality. It demands the existence of a generalized inverse G of M such that

The prevailing property is linearity in the competing moment matrices A ∈ M. It is due to this linearity that, for the full set M = M(Ξ) = conv{xx' : x ∈ X}, we need to verify (1) only for the generating rank 1 matrices A = xx', with x ∈ X.

There are alternatives to the subgradient inequality that lack this linearity property. For instance, a derivation based on directional derivatives leads to the requirement

where A_M = min over all Q ∈ ℝ^(k×k) with QM = M of QAQ' is the generalized information matrix from Section 3.21. The advantage of (2) is that the product K'M⁻A_M M⁻K is invariant to the choice of a generalized inverse of M. This follows from Theorem 3.24 and Lemma 1.17, see also the preamble in the proof of Theorem 4.6. The disadvantage of (2) is that the dependence of A_M on A is in general nonlinear, and for M = M(Ξ) it does not suffice to verify (2) for A = xx', with x ∈ X, only. That (1) implies (2) is a direct consequence of A_M ≤ A and the monotonicity properties from Section 1.11. The gist of the equivalence comes to bear in the converse implication.

To see that (2) implies (1), we use the alternative representation

following from A_M ≤ min over G ∈ M⁻ of MG'AGM ≤ MG̃'AG̃M = A_M. The first inequality holds because there are more matrices Q satisfying QM = M than those with the form Q = MG' with G ∈ M⁻. The second inequality holds for any particular member G̃ ∈ M⁻. With arbitrary matrix G ∈ M⁻, residual projector R = I_k − MG, and generalized inverse H ∈ (RAR')⁻, we pick the particular version G̃ = G − R'HRAG to obtain MG̃'M = M and MG̃'AG̃M = A − AR'HRA. The latter coincides with A_M, see (2) in
Section 3.23. Now the left hand side in (2) turns into

As a whole, (2) becomes max over A ∈ M of min over G ∈ M⁻ of trace K'G'AGKCDC ≤ 1. Appealing to one of the standard minimax theorems such as Corollary 37.3.2 in Rockafellar (1970), we finally obtain

This is the same as (1). The proof of the equivalence of (1) and (2) is complete.
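
The invariance of K'M⁻A_M M⁻K to the choice of generalized inverse is easy to check numerically. The sketch below is our own illustration, not taken from the text: the matrices M, A, K and the two generalized inverses are arbitrary choices, and A_M is computed from the formula A − AR'(RAR')⁻RA of the preceding paragraph with R = I_k − MM⁺.

    import numpy as np

    # Hypothetical example: a singular moment matrix M, a feasible K, and a
    # competing nonnegative definite matrix A (all values chosen arbitrarily).
    x1, x2 = np.array([1.0, 1.0, 0.0]), np.array([1.0, -1.0, 0.0])
    M = 0.5 * (np.outer(x1, x1) + np.outer(x2, x2))      # rank 2
    K = np.array([1.0, 1.0, 0.0])                        # range(K) inside range(M)
    A = np.array([[2.0, 1.0, 1.0],
                  [1.0, 3.0, 0.0],
                  [1.0, 0.0, 4.0]])

    # Generalized information matrix A_M = A - A R'(R A R')^- R A, with R = I - M M^+.
    R = np.eye(3) - M @ np.linalg.pinv(M)
    A_M = A - A @ R.T @ np.linalg.pinv(R @ A @ R.T) @ R @ A

    # K'M^- A_M M^- K takes the same value for every generalized inverse of M.
    G1 = np.linalg.pinv(M)
    G2 = G1 + 5.0 * np.outer([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
    for G in (G1, G2):
        assert np.allclose(M @ G @ M, M)                 # both are generalized inverses
        print(K @ G @ A_M @ G @ K)                       # identical values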

A closely related question is which generalized inverses G of M are such that they appear in the General Equivalence Theorem. It is not always the Moore-Penrose inverse M⁺, as asserted by Fedorov and Malyutov (1972), nor is it an arbitrary generalized inverse, as claimed by Bandemer (1977, Section 5.6.3). Counterexamples are given by Pukelsheim (1981). Silvey (1978, p. 557) proposes a construction of permissible generalized inverses based on subspaces that are complementary to the range of M, see also Pukelsheim (1980, p. 348), and Pukelsheim and Titterington (1983, p. 1064). However, the choice of the complementary subspace depends on the optimal solution of the dual problem. Hence the savings in computational (and theoretical) complexity seem to be small.

There are also various versions of the General Equivalence Theorem which emphasize the geometry in the space L2(ξ), see Kiefer and Wolfowitz (1960, p. 364), Kiefer (1962), Karlin and Studden (1966a, Theorem 6.2), and Pukelsheim (1980, p. 354).

In our terminology, an Equivalence Theorem for a general information function φ seeks to exhibit necessary and sufficient conditions for φ-optimality which are easy to verify. The original intent, of reconciling two independent criteria, no longer prevails. We use the qualifying attribute General Equivalence Theorems to indicate such results that allow for arbitrary convex compact sets M of competing moment matrices, rather than insisting on the largest possible set M(Ξ).

If the set of competing moment matrices is maximal, M = M(Ξ), then for the full parameter vector θ the variables of the dual problem lend themselves to an appealing interpretation. In fact, for every matrix N > 0, we have that

In other words, N induces a cylinder that includes the regression range X, as described in Section 2.10. This dual analysis dates back to Elfving (1952,
p. 260); see also Wynn (1972, p. 174), Silvey and Titterington (1973, p. 25), and Sibson (1974, p. 684).

The dual problem then calls for minimizing the "size" of the cylinder N as measured by the polar function φ∞. For the determinant criterion, the topic is known among convex geometers and is associated with the name of Loewner. The ellipsoid of smallest volume that includes a compact set of points is called the Loewner ellipsoid, see Busemann (1955, p. 414), Danzer, Laugwitz and Lenz (1957), Danzer, Grünbaum and Klee (1963, p. 139), Krafft (1981, p. 101), and Gruber (1988); Loewner (1939) investigates closely related problems. Silverman and Titterington (1980) develop an exact terminating algorithm for finding the ellipsoid of smallest area covering a plane set of regression vectors.
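
The connection between the determinant criterion and smallest enclosing ellipsoids is also used computationally. The following sketch is our own generic illustration, not the exact terminating algorithm of Silverman and Titterington: it runs the classical multiplicative weight update for the determinant criterion on the rows of a design matrix; at convergence the ellipsoid {v : v'M(w)⁻¹v ≤ k} is the smallest origin-centred ellipsoid containing the symmetrized point set ±x_i.

    import numpy as np

    def d_optimal_weights(X, iters=1000):
        """Multiplicative update w_i <- w_i * (x_i' M(w)^{-1} x_i) / k on the rows of X."""
        n, k = X.shape
        w = np.full(n, 1.0 / n)
        for _ in range(iters):
            Minv = np.linalg.inv(X.T @ (w[:, None] * X))
            w = w * np.einsum('ij,jk,ik->i', X, Minv, X) / k   # stays on the simplex
        return w

    # Parabola fit on a grid of [-1, 1]: the weights concentrate near -1, 0, 1.
    t = np.linspace(-1.0, 1.0, 21)
    w = d_optimal_weights(np.vander(t, 3, increasing=True))
    print(np.round(w, 3))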

The results in Section 7.23 and Section 7.24 on the relation between smallest-eigenvalue optimality and scalar optimality are from Pukelsheim and Studden (1993).

8. OPTIMAL MOMENT MATRICES AND OPTIMAL DESIGNS

The upper bound (2) on the support size in Theorem 8.2 is the Carathéodory theorem which says that for a bounded set S ⊆ ℝⁿ, every point in its convex hull can be represented as a convex combination of n + 1 points in S. For a proof see, for instance, Rockafellar (1970, p. 155), and Bazaraa and Shetty (1979, p. 37). Bound (5) of Theorem 8.2 originates with Fellman (1974, p. 62) for scalar optimality, and generalizes earlier results of Elfving (1952, p. 260) and Chernoff (1953, p. 590). It is shown to extend to the general design problem by Pukelsheim (1980, p. 351) and Chaloner (1984). Theoretically, in a Baire category sense, most optimal designs are supported by at most ½k(k + 1) many points, as established by Gruber (1988, p. 58). Practically, designs with more support points may be preferable because of their greater balancedness; see Section 14.9.

The usefulness of Lemma 8.4 has been apparent to early writers such as Kiefer (1959, p. 290). Theorem 8.5 appears in Ehrenfeld (1956, p. 62), in a study of complete classes and admissibility. The example in Section 8.6 is a particular case of a line fit model over the k-dimensional cube for k = 2, discussed for general dimension k in Section 14.10. A similarly complete discussion of a parabola fit model is given by Preitschopf and Pukelsheim (1987). Gaffke (1987a) calculates the φ_p-optimal designs for the parameters of the two highest coefficients in a polynomial fit model of arbitrary degree. Studden (1989) rederives these results using canonical moments.

Theorem 8.7 on linearly independent support vectors is from Pukelsheim and Torsney (1991), and has forerunners in Studden (1971, Theorem 3.1), Torsney (1981), and Kitsos, Titterington and Torsney (1988, Section 6.1). The fixed point equation for the optimal weights (A ∗ A)w = 1_s in Theorem 8.11 is from Pukelsheim (1980, p. 353). The resulting bound 1/s for
determinant optimality first appears in Atwood (1973, Theorem 4). The auxiliary Lemma 8.10 on Hadamard products dates back to Schur (1911, p. 14); see also Styan (1973). The proof of Theorem 8.7 employs an idea of Sibson and Kenny (1975), to expand the quadratic form until the matrix M(ξ) appears in the middle.
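
In the simplest case, where the number of linearly independent support vectors equals the number of parameters, the determinant optimal weights are uniform, because det M(ξ) then factorizes into (det X)² times the product of the weights. A minimal numerical illustration of this special case (our own sketch; the parabola fit at the hypothetical support points -1, 0, 1 is an arbitrary choice):

    import numpy as np
    rng = np.random.default_rng(1)

    # s = 3 linearly independent regression vectors: a parabola fit at t = -1, 0, 1.
    X = np.array([[1.0, -1.0, 1.0],
                  [1.0,  0.0, 0.0],
                  [1.0,  1.0, 1.0]])

    def logdet(w):
        return np.linalg.slogdet(X.T @ np.diag(w) @ X)[1]

    # log det M(w) = 2 log|det X| + sum(log w_i), so uniform weights 1/s are best.
    uniform = logdet(np.full(3, 1.0 / 3.0))
    trials = (np.diff(np.sort(np.r_[0.0, rng.random(2), 1.0])) for _ in range(10000))
    print(all(logdet(w) <= uniform + 1e-12 for w in trials))    # True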

The material in Section 8.11 to Section 8.16 is taken from Pukelsheim (1980). Section 8.19 follows Pukelsheim (1983a). Corollary 8.16 has a history of its own, see Kiefer (1961, Theorem 2), Karlin and Studden (1966a, Theorem 6.1), Atwood (1969, Theorem 3.2), Silvey and Titterington (1973, p. 25), and Sibson (1974, p. 685).

9. D-, A-, E-, T-OPTIMALITY

Farrell, Kiefer and Walbran (1967, p. 113) introduce the name global criterion for the G-criterion. The first reference to work on globally optimal designs is Smith (1918). The Equivalence Theorem 9.4 establishes the equivalence of determinant optimality and global optimality, whence its name. This is a famous result due to Kiefer and Wolfowitz (1960), announced as a footnote in Kiefer and Wolfowitz (1959, p. 292). Earlier Guest (1958) had found the globally optimal designs for polynomial fit models, and Hoel (1958) the determinant optimal designs. That the two, apparently distinct, criteria lead to the same class of optimal designs came as a surprise to the people working in the field. As Kiefer (1974b) informally reports:

In fact the startling coincidence is that these two people have the same first two initials (P.G.) and you can compute the odds of that!!

That the determinant optimal support points are the local extrema of the Legendre polynomials follows from a result of calculus of Schur (1918), see Szegő (1939, Section VI.6.7), and Karlin and Studden (1966b, p. 330). In verifying optimality by solving the dual problem, we follow Fejér (1932). The equivalence of the two problems is also stressed in Schoenberg (1959, p. 289) who, on p. 284, has a formula for the optimal determinant value which yields the recursion formula for u_{d+1}(ξ₀) in Section 9.5.
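
The coincidence is easy to check numerically. The sketch below (our own check; degree d = 5 is an arbitrary choice) builds the determinant optimal design for a degree-d polynomial fit on [-1, 1], with equal weights 1/(d + 1) on ±1 and the local extrema of the Legendre polynomial P_d, and confirms the global criterion: the variance function f(t)'M⁻¹f(t) never exceeds the number of parameters d + 1.

    import numpy as np
    from numpy.polynomial import legendre

    d = 5
    # Support: -1, +1 and the d-1 roots of P_d' (the local extrema of P_d).
    support = np.r_[-1.0, legendre.Legendre.basis(d).deriv().roots().real, 1.0]
    F = np.vander(support, d + 1, increasing=True)
    M = F.T @ F / (d + 1)                               # equal weights 1/(d+1)

    # Kiefer-Wolfowitz check: max over t of f(t)' M^{-1} f(t) equals d + 1.
    t = np.linspace(-1.0, 1.0, 5001)
    Ft = np.vander(t, d + 1, increasing=True)
    variance = np.einsum('ij,jk,ik->i', Ft, np.linalg.inv(M), Ft)
    print(round(float(variance.max()), 4))              # 6.0, i.e. d + 1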

Bandemer and Näther (1980, p. 299), list the determinant optimal designs for polynomial fit models up to degree d = 6. That monograph also contains a wealth of tabulated designs for other criteria and for other models. Karlin and Studden (1966a) allow for different variance functions; the determinant optimal support points are then associated with other classical polynomials; see also, for instance, Fedorov (1972, p. 88), Humak (1977, p. 457), Krafft (1978, p. 282), Ermakov (1983), and Pázman (1986, p. 176). St. John and Draper (1975) provide a review of early work on determinant optimality and a bibliography. Applications to multivariate problems are studied by Krafft and Schaefer (1992). Bischoff (1992, 1993) gives conditions on the dispersion structure so
that the determinant optimal design in the homoscedastic model remains optimal in the presence of correlated observations.

Kiefer and Studden (1976) study in greater detail the consequences of the fact that for increasing degree d, the determinant optimal designs converge to a limiting arcsin distribution. Dette and Studden (1992) establish the limiting arcsin distribution using canonical moments. Arcsin support designs are also emphasized by Fedorov (1972, p. 91). Arcsin support points, under the heading of Chebyshev points, first appeared while treating scalar optimality problems as a Chebyshev approximation scheme; see Kiefer and Wolfowitz (1959, 1965), Hoel and Levine (1964), and Studden (1968). As the degree d tends to ∞, limiting distributions other than the arcsin distribution do occur. Studden (1978) shows that the optimal designs for the quantile component θ_⌈dq⌉, with q ∈ (0;1), have limiting Lebesgue density

Kiefer and Studden (1976) provide the limiting Lebesgue density of the optimal designs for f(t₀)'θ with |t₀| > 1 (extrapolation design),

The classical Kiefer and Wolfowitz Theorem 9.4 investigates determinant optimality of a design ξ in the set Ξ of all designs, or equivalently, of a moment matrix M in the set M(Ξ) of all moment matrices. Our theory is more general, by admitting any subset M ⊆ M(Ξ) of competing moment matrices, as long as M is convex and compact. For instance, if the regression range is a Cartesian product, X = X₁ × X₂, and τ is a given design on the marginal set X₁, then the set of moment matrices originating from designs ξ with first marginal distribution equal to τ,

is convex and compact, and Theorem 9.4 applies. For this situation, determinant optimality is characterized for the full parameter system by Cook and Thibodeau (1980), and for parameter subsystems by Nachtsheim (1989).

The concept of average-variance optimality is the prime alternative to determinant optimality. Fedorov (1972, Section 2.9), discusses linear optimality criteria, in the sense of Section 9.8. Studden (1977) arrives at the criterion by integrating the variance surface x'M⁻¹x, and calls it I-optimality. The average-variance criterion also arises as a natural choice from the Bayes point of view as proposed, for instance, by Chaloner (1984). An example where the
experimenters favor weighted average-variance optimality over determinant optimality is Conlisk and Watts (1979, p. 37).

Ehrenfeld (1955) introduces the smallest-eigenvalue criterion. The criterion becomes differentiable at a matrix M > 0 for which the smallest eigenvalue has multiplicity 1; see Kiefer (1974a, Section 4E). The results in Section 9.13 on smallest-eigenvalue optimality are drawn from Pukelsheim and Studden (1993), and are obtained independently by Heiligers (1991c). Dette and Studden (1993) present a detailed inquiry into the geometric aspects of smallest-eigenvalue optimality, interwoven with other results from classical analysis.

If the full parameter θ is of interest, K = I_k, then the eigenvalue property in part I of the proof in Section 9.13 is

In terms of the polynomials P(t) = a'f(t)/‖a‖, standardized to have coefficient vector a/‖a‖ of Euclidean norm 1, we obtain a least squares property of the standardized Chebyshev polynomial T_d(t) = c'f(t)/‖c‖ relative to the smallest-eigenvalue optimal design τ_c for θ,

This complements the usual least squares property of the Chebyshev polynomials which pertains to the arcsin distribution, see Rivlin (1990, p. 42). The eigenvalue property (1) can also be written as 1 ≥ ‖a‖²/‖c‖² for all vectors a ∈ ℝ^(d+1) satisfying a'Ma ≤ 1. That is, for every vector a ∈ ℝ^(d+1), we have

A weaker statement, with max over i = 0, 1, ..., d of a'f(x_i) ≤ 1 in place of the integral, is a corollary to results of Erdős (1947, pp. 1175-1176).

Either way, we may deduce that the Elfving set R for a polynomial fit over [-1;1] has in-ball radius ‖c‖⁻¹. Indeed, the supporting hyperplanes to R are given by the vectors 0 ≠ a ∈ ℝ^(d+1) such that max over t ∈ [-1;1] of |a'f(t)| = 1. As mentioned in the previous paragraph, this entails ‖a‖² ≤ ‖c‖². The hyperplane {v ∈ ℝ^(d+1) : a'v = 1} has distance 1/‖a‖ to the origin. Therefore the supporting hyperplane closest to the origin is given by the Chebyshev coefficient vector c, and has distance r = 1/‖c‖.
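
A quick numerical check of this geometric fact (our own sketch; degree d = 3 and the random search are arbitrary choices): the Chebyshev coefficient vector c has sup-norm one on [-1, 1], and no coefficient vector scaled to sup-norm one is found with a larger Euclidean norm, so the in-ball radius 1/‖c‖ is attained at the hyperplane determined by c.

    import numpy as np
    rng = np.random.default_rng(0)

    d = 3
    t = np.linspace(-1.0, 1.0, 2001)
    F = np.vander(t, d + 1, increasing=True)               # rows f(t) = (1, t, ..., t^d)

    # Coefficient vector c of the Chebyshev polynomial T_d; here T_3 = 4t^3 - 3t.
    c = np.polynomial.chebyshev.cheb2poly([0.0] * d + [1.0])
    print(np.max(np.abs(F @ c)))                           # 1.0: sup-norm of T_d

    # Random search: vectors a scaled to sup-norm 1 never exceed ||c|| in length.
    best = 0.0
    for _ in range(20000):
        a = rng.standard_normal(d + 1)
        a = a / np.max(np.abs(F @ a))
        best = max(best, float(np.linalg.norm(a)))
    print(best <= np.linalg.norm(c), 1.0 / np.linalg.norm(c))  # True, in-ball radius 0.2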

Our approach includes, in Section 9.12 and Section 9.14, an alternate derivation of the optimal designs for the individual components θ_j originally due to Studden (1968), see also Murty (1971). At the same time, we rederive
and extend the classical extremum property of Chebyshev polynomials. With c_{KK'c}(M) = min over a ∈ ℝ^(d+1) with a'KK'c = 1 of a'Ma as in Section 3.2, we have

Hence among all polynomials P(t) = a'f(t) that satisfy a'KK'c = 1, the sup-norm ‖P‖ = max over t ∈ [-1;1] of |P(t)| has minimum value ‖K'c‖⁻², and this minimum is attained only by the standardized Chebyshev polynomial T_d/‖K'c‖². Section 9.14 provides the generalization that stems from the coefficients θ_{d-1-2j}. Among all polynomials P(t) = a'f(t) that satisfy a'K̄K̄'c̄ = 1, the sup-norm ‖P‖ has minimum value ‖K̄'c̄‖⁻², and this minimum is attained only by the standardized Chebyshev polynomial T_{d-1}/‖K̄'c̄‖². For the highest index d, this result is due to Chebyshev (1859), for the other indices to Markoff (1916), see Natanson (1955, pp. 36, 50) and Rivlin (1990, pp. 67, 112). Our formulation allows for combinations of the individual components, as given by KK'c and K̄K̄'c̄.

Trace optimality is discussed implicitly in Silvey and Titterington (1974, p. 301), Kiefer (1975, p. 338), and Titterington (1975, 1980a). Trace optimality often appears in conjunction with Kiefer optimality, for reasons discussed in Section 14.9 (II).

Optimal designs for the trigonometric fit model are given by Hoel (1965, p. 1100) and Karlin and Studden (1966b, p. 347); see also Fedorov (1972, Section 2.4) and Krafft (1978, Section 19(c)). The example in Section 9.17 of a design that remains optimal under variation of the model is from Hill (1978a).

The Legendre polynomials in Exhibit 9.1 and the Chebyshev polynomials in Exhibit 9.8 are taken from Abramowitz and Stegun (1970, pp. 798, 795). The numerical results in Exhibit 9.2 to Exhibit 9.11 were obtained with the Fortran program PolyPlan of Preitschopf (1989).

10. ADMISSIBILITY OF MOMENT AND INFORMATION MATRICES

The notion of design admissibility is introduced by Ehrenfeld (1956) in the context of complete class theorems. Theorem 10.2 is due to Elfving (1959, p. 71); see also Karlin and Studden (1966a, p. 809). Pilz (1979) develops a decision theoretic framework. Lemma 10.3 is from Gaffke (1982, p. 9).

Design admissibility in polynomial fit models is resolved rather early by de la Garza (1954) and Kiefer (1959, p. 291). Our development in Section 10.4
to Section 10.7 follows the presentations in Gaffke (1982, pp. 90-92) and Heiligers (1988, pp. 84-86). The result from convex analysis that is pictured in Exhibit 10.1 is Theorem 6.8 in Rockafellar (1970). Extensions to T-systems and polynomial spline regression are put forward by Karlin and Studden (1966a, 1966b), and Studden and VanArman (1969). Wierich (1986) obtains complete class theorems for product designs in models with continuous and discrete factors.

The way admissibility relates to trace optimality is outlined by Elfving (1959), Karlin and Studden (1966a), Gaffke (1987a), and Heiligers (1991a). That it is preferable to first study its relations to smallest-eigenvalue optimality is pointed out in Gaffke and Pukelsheim (1988); see also Pukelsheim (1980, p. 359).

Elfving (1959) calls admissibility of moment matrices total admissibility, in contrast to partial admissibility of information matrices. He makes a point that the latter is not only a property of the support points but also involves the design weights. Admissibility of special C-matrices is tackled by Christof and Pukelsheim (1985), and Baksalary and Pukelsheim (1985). Constantine, Lim and Studden (1987) discuss admissibility of designs for finite sample sizes in polynomial fit models.

11. BAYES DESIGNS AND DISCRIMINATION DESIGNS

There is a rich and diverse literature on incorporating prior knowledge into a design problem. Our development emphasizes that the General Equivalence Theorem still provides the basis for most of the results.

Bayes designs have found the broadest coverage. The underlying distributional assumptions are discussed in detail by Sinha (1970), Lindley and Smith (1972), and Guttman (1971); see also the overviews of Herzberg and Cox (1969), Atkinson (1982), Steinberg and Hunter (1984), and Bandemer, Näther and Pilz (1986). The monograph of Pilz (1991) provides a comprehensive presentation of the subject and gives many references. Another detailed review of the literature is included in the authoritative exposition of Chaloner (1984). El-Krunz and Studden (1991) analyse Bayes scalar optimal designs by geometric arguments akin to the Elfving Theorem 2.14. Mixtures of Bayes models are studied by DasGupta and Studden (1991). Chaloner and Larntz (1989) apply the Bayes approach to logistic regression experiments.

That Bayes designs and designs with protected experimental runs lead to the same optimization problem has already been pointed out by Covey-Crump and Silvey (1970). Theorem 11.8 on designs with bounded weights generalizes the result of Wynn (1977, p. 474). Wynn (1982, p. 494) applies the approach to finite population sampling, Welch (1982) uses it as a basis to derive a branch-and-bound algorithm.

The setting in Section 11.10, of designing the experiment simultaneously for several models, comes under various headings such as model robust design, designs for model discrimination, or multipurpose designs. The issue arises naturally from applications as in Hunter and Reiner (1965), or Cook and Nachtsheim (1982). Läuter (1974, 1976) formulates and solves the problem as an optimization problem, see also Humak (1977, Kapitel 8), and Bunke and Bunke (1986). Atkinson and Cox (1974) use the approach to guard against misspecification of the degree of a polynomial fit model. Atkinson and Fedorov (1975a,b) treat the testing problem of whether one of two or more models is the correct one, with a special view of nonlinear models and sequential experimental designs. Fedorov and Malyutov (1972) emphasize the aspects of model discrimination. Hill (1978b) reviews procedures and methods for model discrimination designs. Our presentation follows Pukelsheim and Rosenberger (1993).

Most of the literature deals with geometric means of determinant criteria. In this case canonical moments provide another powerful tool for the analysis. For mixture designs, deep results are presented by Studden (1980), Lau and Studden (1985), Lim and Studden (1988), and Dette (1990, 1993a,b). Dette (1991a) shows that in a polynomial fit model every admissible symmetric design on [-1;1] becomes optimal relative to a weighted geometric mean of determinant criteria. Example I of Section 11.18 is due to Dette (1990, p. 1791). Other examples of mixture determinant optimal designs for weighted polynomial fit models are given by Dette (1992a,b). Lemma 11.11 is taken from Gutmair (1990) who calculates polars and subgradients for mixtures of information functions. A related calculus is in use in electric network theory, as in Anderson and Trapp (1976).

The weighting and scaling issue which we address in Section 11.17 is immanent in all of the work on the subject. It can be further elucidated by considering weighted vector means φ_p(λ) = (Σ over i ≤ m of w_i λ_i^p)^(1/p), with arbitrary weights w_i > 0, rather than restricting attention to the uniform weighting w_i = 1/m as we do.

Interest in designs with guaranteed efficiencies was sparked off by the seminal paper of Stigler (1971). Atkinson (1972) broadens the setting by testing the adequacy of an extended model which reduces to the given model for particular parameter values. For the determinant criteria, again the canonical moment technique proves to be very powerful, leading to the detailed results of Studden (1982) and Lau (1988). DasGupta, Mukhopadhyay and Studden (1992) call designs with guaranteed efficiencies compromise designs, and study classical and Bayes settings in heteroscedastic linear models.

Our Theorem 11.20 is a consequence of the Kuhn-Tucker theorem; see, for instance, Rockafellar (1970, p. 283). Example I in Section 11.22 is taken from Studden (1982). Lee (1987, 1988) considers the same problem when the criteria are differentiable, using directional derivatives. Similarly, results from constrained optimization theory are used by Vila (1991) to investigate determinant optimality of designs for finite sample size n.

12. EFFICIENT DESIGNS FOR FINITE SAMPLE SIZES

A proper treatment of the discrete optimization problems that come with the set Ξ_n of designs for sample size n requires a combinatorial theory, as in the monographs of Raghavarao (1971), Raktoe, Hedayat and Federer (1981), Constantine (1987), John (1987), and Shah and Sinha (1989). The usual numerical rounding of the quotas nw_i has little chance of preserving the side condition that the rounded numbers sum to n. For instance, the probability for rounded percentages to add to 100% is √(6/(πℓ)) as the support size ℓ becomes large; see Mosteller, Youtz and Zahn (1967), and Diaconis and Freedman (1979). A systematic apportionment method provides an efficient alternative to convert an optimal design ξ ∈ Ξ into a design ξ_n for sample size n, see Pukelsheim and Rieder (1992).

The discretization problem for experimental designs has much in common with the apportionment methods for electoral bodies, as studied in the lively treatise of Balinski and Young (1982). The monotonicity properties that have shaped the political debate have their counterparts in the design of experiments. Of these, sample size monotonicity is the simplest. Exhibit 12.1 shows the data of the historical Alabama paradox, see Balinski and Young (1982, p. 39). Those authors prove that the monotonicity requirements automatically lead to divisor methods, that is, multiplier methods in our terminology.

Other than in the political sciences, the design of experiments provides numerical criteria to compare various apportionment methods, resulting in the efficient apportionment method of Section 12.6. Lemma 12.5 is in the spirit of Fedorov (1972, Chapter 3.1). Theorem 12.7 is from Balinski and Young (1982, p. 105). Exhibit 12.2 takes up an example of Bandemer and Näther (1980, p. 267). The efficient design apportionment is called the method of John Quincy Adams by Balinski and Young (1982, p. 28), and is also known as the method of smallest divisors. For us, the latter translates into the method of largest multipliers since it uses multipliers ν as large as possible such that the frequencies ⌈νw_i⌉ sum to n. Kiefer (1971, p. 116), advocates the goal to minimize the total variation distance max over i ≤ ℓ of |n_i/n - w_i|. This is achieved uniquely by the method of Hamilton, see Balinski and Young (1982, pp. 17, 104).
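
A minimal sketch of such a divisor method (our own illustration; the sequential quotient rule, the tie-break by first index, and the requirement that n be at least the number of support points are implementation choices): rounding up with a common multiplier can be realized by handing out the n observations one at a time to the support point with the largest quotient w_i/n_i, where n_i is its current frequency.

    import numpy as np

    def smallest_divisors(w, n):
        """Apportion sample size n to weights w by the divisor method with rounding up
        (Adams / smallest divisors / largest multipliers): the next observation goes
        to the point with the largest quotient w_i / n_i, infinite while n_i = 0."""
        w = np.asarray(w, dtype=float)
        counts = np.zeros(len(w), dtype=int)
        for _ in range(n):
            quotients = np.where(counts == 0, np.inf, w / np.maximum(counts, 1))
            counts[int(np.argmax(quotients))] += 1
        return counts

    print(smallest_divisors([0.5, 0.3, 0.2], 10))     # [5 3 2]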

The asymptotic order of the efficiency loss due to rounding has already been mentioned by Kiefer (1959, p. 281) and Kiefer (1960, p. 383). Fellman (1980) and Rieder (1990) address the differentiability assumptions of Theorem 12.10. The subgradient efficiency bound (3) in Section 12.11 is from Pukelsheim and Titterington (1983, p. 1067). It generalizes an earlier result of Atwood (1969, p. 1596), and is related to the approach of Gribik and Kortanek (1977).

From Section 12.12 on, the exposition closely follows the work of Gaffke (1987b), building on Gaffke and Krafft (1982). It greatly improves upon the original paper by Salaevskii (1966) who conjectured that in polynomial fit
models rounding always leads to determinant optimal designs for sample size n. Gaffke (1987b) reviews the successes and failures in settling this conjecture, and proposes a necessary condition for discrete optimality which disproves the conjecture for degree d ≥ 4 and small sample sizes n. The explicit counterexamples in Exhibit 12.5 are taken from Constantine, Lim and Studden (1987, p. 25). Those authors also prove that the Salaevskii conjecture holds true for degree d = 3. This provides some indication that the sample sizes n_d of Exhibit 12.6 are very much on the safe side, since there we find n₃ = 12 rather than n₃ = 4. Further examples are discussed by Gaffke (1987b). A numerical study is reported in Cook and Nachtsheim (1980).

The role of symmetry is an ongoing challenge in the design of experiments. The example at the end of Section 12.16 is taken from Kiefer (1959, p. 281), who refers back to it in Kiefer (1971, p. 117).

13. INVARIANT DESIGN PROBLEMS

Many classical designs were first proposed because of their intuitive appeal, on the grounds of symmetry properties such as some sort of "balancedness", as pointed out by Elfving (1959) and Kiefer (1981). The mathematical expression of symmetry is invariance, relative to a group of transformations which suitably acts on the problem. Invariance considerations have permeated the design literature from the very beginning; see Kiefer (1958), and Kiefer and Wolfowitz (1959, p. 279).

The material in Section 13.1 to Section 13.4 is basic. Nevertheless there is an occasional irritation as if the matrix transformations were to act on the moment matrices by similarity, A ↦ QAQ⁻¹, rather than by congruence, A ↦ QAQ'. Lemma 13.5, on the homomorphisms under which the information matrix mapping C_K is equivariant, is new. Sinha (1982) provides a similar discussion of invariance, restricted to the settings of block designs. Lemma 13.7 on invariance of the matrix means extends a technique of Lemma 7.4 in Draper, Gaffke and Pukelsheim (1991); Lemma 13.10 compiles the results of Section 2 of that paper. Part (c) of Lemma 13.10 says that Sym(s, H) forms a quadratic subspace of symmetric matrices as introduced by Seely (1971). This property is also central to multivariate statistical analysis, compare Andersson (1975) and Jensen (1988).

For simultaneous optimality under permutationally invariant criteria in the context of block designs, Kiefer (1975, p. 336) coined the notion of universal optimality. This was followed by a series of papers elaborating on the interplay of averaging over the transformation group (Section 13.11), simultaneous optimality relative to sets of invariant optimality criteria (Section 13.12), and matrix majorization (Section 14.1), see Giovagnoli and Wynn (1981, 1985a,b), Bondar (1983), and Giovagnoli, Pukelsheim and Wynn (1987).

14. KIEFER OPTIMALITY

Universal optimality generalizes to the Kiefer ordering as introduced in Section 14.2. The semantic distinction between uniform optimality and universal optimality is weak, which is why we prefer to speak of Loewner optimality and Kiefer optimality, and of the Loewner ordering and the Kiefer ordering. It is the same as the upper weak majorization ordering of Giovagnoli and Wynn (1981, 1985b), and the information increasing ordering of Pukelsheim (1987a,b,c).

Shah and Sinha (1989) provide a fairly complete overview of results on universally optimal block designs. In the discrete theory, feasibility of a block design for the centered treatment contrasts comes under the heading of connectedness of the incidence matrix N, see Krafft (1978, p. 195) and Heiligers (1991b). Eccleston and Hedayat (1974) distinguish various notions of connectedness and relate them to optimality properties of block designs. Our notion of balancedness is often called variance-balancedness, see, for example, Kageyama and Tsuji (1980). Various other meanings of balancedness are discussed by Caliński (1977).

An a × b balanced incomplete block design for sample size n is known in expert jargon as a BIBD(r, k, λ), with treatment replication number r = n/a, block size k = n/b, and treatment concurrence number λ = r(k − 1)/(a − 1). Our terminology is more verbose for the reason that it treats block designs as just one instance of the general design problem. Our Section 14.8 follows Giovagnoli and Wynn (1981, p. 414) and Pukelsheim (1983a, 1987a). Section 14.9 on balanced incomplete block designs is mostly standard, compare Raghavarao (1971, Section 4.3) and Krafft (1978, Section 21). Rasch and Herrendörfer (1982) and Nigam, Puri and Gupta (1988) present an exposition with a view towards applications. Beth, Jungnickel and Lenz (1985) investigate for which parameter settings block designs exist and how they are constructed. The inequality a ≤ b, calling for at least as many blocks as there are treatments, is due to Fisher (1940, p. 54); see also Bose (1949). A particular parameter system is given by the contrasts between a treatment and a control, see the review by Hedayat, Jacroux and Majumdar (1988). Feasibility and optimality of multiway block designs are treated in Raghavarao and Federer (1975), Cheng (1978b, 1981), Gaffke and Krafft (1979b), Eccleston and Kiefer (1981), Pukelsheim (1983c, 1986), and Pukelsheim and Titterington (1986, 1987).
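
In this notation the parameter relations are simple arithmetic. A small sketch (our own, with the Fano-plane design as the illustrative input) checks the standard necessary conditions, including the Fisher inequality:

    def bibd_check(a, b, k, r):
        """Necessary conditions for an a-treatment, b-block balanced incomplete
        block design with block size k and replication number r."""
        n = a * r
        assert n == b * k, "sample size: n = ar = bk"
        lam, rem = divmod(r * (k - 1), a - 1)
        assert rem == 0, "lambda = r(k-1)/(a-1) must be an integer"
        assert a <= b, "Fisher's inequality: at least as many blocks as treatments"
        return n, lam

    print(bibd_check(a=7, b=7, k=3, r=3))   # (21, 1): the Fano plane design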

In Section 14.10, the discussion of designs for linear regression on the cube follows Cheng (1987) and Pukelsheim (1989), but see also Kiefer (1960, p. 402). An application of the Kiefer ordering to experiments with mixtures is given by Mikaeili (1988).

15. ROTATABILITY AND RESPONSE SURFACE DESIGNS

Response surface methodology originates with Box and Wilson (1951), Box and Hunter (1957), and Box and Draper (1959). The monograph of Box and
Draper (1987) is the core text on the subject, and includes a large bibliography. Other expositions of the subject are Khuri and Cornell (1987), and Myers (1971); see also Myers, Khuri and Carter (1989).

The definition of an information surface i_M makes sense for an arbitrary nonnegative definite matrix M, but the conclusion of Lemma 15.3 that i_M = i_A implies M = A then ceases to hold true, see Draper, Gaffke and Pukelsheim (1991, p. 154; 1993). This is in contrast to the information matrix mapping C_K for which it is quite legitimate to extend the domain of definition from the set of moment matrices M(Ξ) to the entire cone NND(k).

First-degree rotatability and regular simplex designs are studied by Box (1952). Two-level fractional factorial designs are investigated in detail by Box and Hunter (1961). Their usefulness was recognized already in the seminal paper of Plackett and Burman (1946). Those authors also point out that there are close interrelations with balanced incomplete block designs. The use of Hadamard matrices for the design of weighing experiments is worked out by Hotelling (1944); the same result is given much earlier in the textbook by Helmert (1872, pp. 48-49) who attributes it to Gauss. See also the review paper by Hedayat and Wallis (1978).
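
The gain from a Hadamard weighing design is easy to reproduce: with a Hadamard matrix as the design matrix, the least squares covariance matrix is σ²/n times the identity, instead of σ² per object for one-at-a-time weighings. A minimal sketch using the Sylvester doubling construction (our own illustration; order 8 is an arbitrary choice):

    import numpy as np

    def sylvester_hadamard(m):
        """Hadamard matrix of order 2**m by the Sylvester doubling construction."""
        H = np.array([[1.0]])
        for _ in range(m):
            H = np.block([[H, H], [H, -H]])
        return H

    H = sylvester_hadamard(3)                       # order n = 8
    # Weighing design X = H: covariance sigma^2 (X'X)^{-1} = (sigma^2 / 8) I.
    print(np.allclose(H.T @ H, 8 * np.eye(8)))      # True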

The origin of the Kronecker product dates back beyond Kronecker to Zehfuss (1858), according to Henderson, Pukelsheim and Searle (1983). In modern algebra, the Kronecker product is an instance of a tensor product, as is the outer product (s, t) ↦ st', see Greub (1967). Henderson and Searle (1981) review the use of the vectorization operator in statistics and elsewhere. These tools are introduced to second-degree models by Draper, Gaffke and Pukelsheim (1991).

Central composite designs are proposed in Box and Wilson (1951, p. 16), and Box and Hunter (1957, p. 224). Rotatable second-degree moment matrices also arise with the simplex-sum designs of Box and Behnken (1960) whose construction is based on the regular simplex designs for first-degree models. Optimality properties of rotatable second-degree designs are derived by Kiefer (1960, p. 398), and Galil and Kiefer (1977). Other ways to adjust the single parameter that remains in the complete class of Theorem 15.18 are dictated by more practical considerations, see Box (1982), Box and Draper (1987, p. 486), and Myers, Vining, Giovannitti-Jensen and Myers (1992). Draper and Pukelsheim (1990) study measures of rotatability which can be used to indicate a deviation from rotatability, following work of Draper and Guttman (1988), and Khuri (1988).

Biographies

1. CHARLES LOEWNER 1893-1968

Karl Löwner was born on May 29, 1893, near Prague, into a large Jewish family. He wrote his dissertation under Georg Pick at the Charles University in Prague. After some years as an Assistent at the German Technical University in Prague, Privatdozent at the Friedrich-Wilhelm-Universität in Berlin, and außerordentlicher Professor at Cologne University, he returned to the Charles University where he was promoted to an ordentlicher Professor. The German occupation of Czechoslovakia in 1939 caused him to emigrate to the United States and to change his name to Charles Loewner. Loewner taught at Louisville University, Brown University, and Syracuse University prior to his appointment in 1951 as Professor of Mathematics at Stanford University, where he remained beyond his retirement in 1963 until his death on January 8, 1968. Loewner's success as a teacher was outstanding. Even during the last year of his life he directed more doctoral dissertations than any other department member. Volume 14 (1965) of the Journal d'Analyse Mathématique is dedicated to him, and Stefan Bergman and Gábor Szegő. Lipman Bers tells of the man and scientist in the Introduction of the Collected Papers Loewner (CP).

Loewner's work covers wide areas of complex analysis and differential geometry. The research on conformal mappings and their iterations led Loewner to the general study of semi-groups of transformations. In this vein, he axiomatized and characterized monotone matrix functions. There is a large body of Loewner's work which will not be found in his formal publications. One example is what is now called the Loewner ellipsoid, the ellipsoid of smallest volume circumscribing a compact set in Euclidean space.


Top: Loewner. Middle: Elfving. Bottom: Kiefer.

2. GUSTAV ELFVING 1908-1984

Erik Gustav Elfving was born on June 25, 1908, in Helsinki. His father was a Professor of Botany at the University of Helsinki. Elfving learnt his calculus of probability from J.W. Lindeberg, but wrote his doctoral thesis (1934) under Rolf Nevanlinna. As a mathematician member of a 1935 Danish cartographic expedition to Western Greenland, when incessant rain forced the group to stay in their tents for three solid days, Elfving began to think about least squares problems and thereafter turned to statistics and probability theory. He started his academic career in 1932 as a lecturer at the Åbo Akademi in Turku, and six years later became a docent in the Mathematics Department at the University of Helsinki. In 1948 Elfving was appointed to the chair that became vacant after Lars Ahlfors moved to Harvard University. He retired from this position in 1975, and died on March 25, 1984. An appreciation of the life and work of Gustav Elfving is given by Mäkeläinen (1990).

Elfving's publications are, amongst others, on continuous time Markov chains, Markovian two-person nonzero-sum games, decision theory, and counting processes. The 1952 paper on Optimum allocation in linear regression theory marks the beginning of the optimality theory of experimental design. Elfving's work in this area is reviewed by Fellman (1991). During his retirement, Elfving wrote a history of mathematics in Finland 1828-1918, including Lindelöf, Mellin, and Lindeberg, see also Elfving (1985). The photograph shows Elfving giving a Bayes talk at his seminar in Helsinki right after his 1966 visit to Stanford University.

3. JACK KIEFER 1924-1981

Jack Carl Kiefer was born on January 25, 1924, in Cincinnati, Ohio. He attended the Massachusetts Institute of Technology and, interrupted by military service in World War II, earned a master's degree in electrical engineering and economics. In 1948 he enrolled in the Department of Mathematical Statistics at Columbia University and wrote a doctoral thesis (1952) on decision theory under Jacob Wolfowitz. In 1951, Wolfowitz left Columbia and joined the Department of Mathematics at Cornell University. Kiefer went along as an instructor in the same department, and eventually became a full professor in 1959. After 28 years at Cornell, which showed Kiefer as a prolific researcher in almost all parts of modern statistics, he took early retirement in 1979 only to join the Statistics Department of the University of California at Berkeley. He died in Berkeley on August 10, 1981.

Kiefer was one of the leading statisticians of the century. Circling around the decision theoretic foundations laid down by the work of Abraham Wald at Columbia, Kiefer's more than 100 publications span an amazing range of interests: stochastic approximations, sequential procedures, nonparametric statistics, multivariate analysis, inventory models, stochastic processes, design
of experiments. More than 45 papers are on optimal design, highlights being the 1959 discussion paper before the Royal Statistical Society (met by some discussants with rejection and ridicule), the 1974a review paper on the approximate design theory in the Annals of Statistics, and his last paper (1984, with H.P. Wynn) on optimal designs in the presence of correlated errors. A set of commentaries in the Collected Papers Kiefer (CP) aids in focusing on the contributions of the scientist and elucidating the warm personality of the man Jack Kiefer; see also Bechhofer (1982).

Bibliography

Abramowitz, M. and Stegun, I.A. (1970). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York.

Alalouf, I.S. and Styan, G.P.H. (1979). "Characterizations of estimability in the general linear model." Annals of Statistics 7, 194-200.

Albert, A. (1969). "Conditions for positive and nonnegative definiteness in terms of pseudoinverses." SIAM Journal on Applied Mathematics 17, 434-440.

Anderson, W.N., Jr. (1971). "Shorted operators." SIAM Journal on Applied Mathematics 20, 520-525.

Anderson, W.N., Jr. and Trapp, G.E. (1975). "Shorted operators II." SIAM Journal on Applied Mathematics 28, 60-71.

Anderson, W.N., Jr. and Trapp, G.E. (1976). "A class of monotone operator functions related to electrical network theory." Linear Algebra and Its Applications 15, 53-67.

Andersson, S. (1975). "Invariant normal models." Annals of Statistics 3, 132-154.

Andersson, S.A. and Perlman, M.D. (1988). "Group-invariant analogues of Hadamard's inequality." Linear Algebra and Its Applications 110, 91-116.

Atkinson, A.C. (1972). "Planning experiments to detect inadequate regression models." Biometrika 59, 275-293.

Atkinson, A.C. (1982). "Developments in the design of experiments." International Statistical Review 50, 161-177.

Atkinson, A.C. and Cox, D.R. (1974). "Planning experiments for discriminating between models." Journal of the Royal Statistical Society Series B 36, 321-334. "Discussion on the paper by Dr Atkinson and Professor Cox." Ibidem, 335-348.

Atkinson, A.C. and Fedorov, V.V. (1975a). "The design of experiments for discriminating between two rival models." Biometrika 62, 57-70.

Atkinson, A.C. and Fedorov, V.V. (1975b). "Optimal design: Experiments for discriminating between several models." Biometrika 62, 289-303.

Atwood, C.L. (1969). "Optimal and efficient designs of experiments." Annals of Mathematical Statistics 40, 1570-1602.

Atwood, C.L. (1973). "Sequences converging to D-optimal designs of experiments." Annals of Statistics 1, 342-352.

Baksalary, J.K. and Pukelsheim, F. (1985). "A note on the matrix ordering of special C-matrices." Linear Algebra and Its Applications 70, 263-267.

Balinski, M.L. and Young, H.P. (1982). Fair Representation. Meeting the Ideal of One Man, One Vote. Yale University Press, New Haven, CT.

Bandemer, H. (Ed.) (1977). Theorie und Anwendung der optimalen Versuchsplanung I. Handbuch zur Theorie. Akademie-Verlag, Berlin.

Bandemer, H. and Näther, W. (1980). Theorie und Anwendung der optimalen Versuchsplanung II. Handbuch zur Anwendung. Akademie-Verlag, Berlin.

Bandemer, H., Näther, W. and Pilz, J. (1986). "Once more: Optimal experimental design for regression models." Statistics 18, 171-198. "Discussion." Ibidem, 199-217.

Barndorff-Nielsen, O.E. and Jupp, P.E. (1988). "Differential geometry, profile likelihood, L-sufficiency and composite transformation models." Annals of Statistics 16, 1009-1043.

Bauer, H. (1990). Maß- und Integrationstheorie. De Gruyter, Berlin.

Bazaraa, M.S. and Shetty, C.M. (1979). Nonlinear Programming. Theory and Algorithms. Wiley, New York.

Bechhofer, R. (1982). "Jack Carl Kiefer 1924-1981." American Statistician 36, 356-357.

Beckenbach, E.F. and Bellman, R. (1965). Inequalities. Springer, Berlin.

Bellissard, J., Iochum, B. and Lima, R. (1978). "Homogeneous and facially homogeneous self-dual cones." Linear Algebra and Its Applications 19, 1-16.

Beth, T., Jungnickel, D. and Lenz, H. (1985). Design Theory. Bibliographisches Institut, Mannheim.

Birkhoff, G. (1946). "Tres observaciones sobre el álgebra lineal." Universidad Nacional de Tucumán, Facultad de Ciencias Exactas, Puras y Aplicadas, Revista, Serie A, Matemáticas y Física Teórica 5, 147-151.

Bischoff, W. (1992). "On exact D-optimal designs for regression models with correlated observations." Annals of the Institute of Statistical Mathematics 44, 229-238.

Bischoff, W. (1993). "On D-optimal designs for linear models under correlated observations with an application to a linear model with multiple response." Journal of Statistical Planning and Inference 37, 69-80.

Bondar, J.V. (1983). "Universal optimality of experimental designs: definitions and a criterion." Canadian Journal of Statistics 11, 325-331.

Bose, R.C. (1948). "The design of experiments." In Proceedings of the Thirty-Fourth Indian Science Congress, Delhi 1947. Indian Science Congress Association, Calcutta, (1)-(25).

Bose, R.C. (1949). "A note on Fisher's inequality for balanced incomplete block designs." Annals of Mathematical Statistics 20, 619-620.

Box, G.E.P. (1952). "Multi-factor designs of first order." Biometrika 39, 49-57.

Box, G.E.P. (1982). "Choice of response surface design and alphabetic optimality." Utilitas Mathematica 21B, 11-55.

Box, G.E.P. (CW). The Collected Works of George E.P. Box (Eds. G.C. Tiao, C.W.J. Granger, I. Guttman, B.H. Margolin, R.D. Snee, S.M. Stigler). Wadsworth, Belmont, CA 1985.

Box, G.E.P. and Behnken, D.W. (1960). "Simplex-sum designs: A class of second order rotatable designs derivable from those of first order." Annals of Mathematical Statistics 31, 838-864.

Box, G.E.P. and Draper, N.R. (1959). "A basis for the selection of a response surface design." Journal of the American Statistical Association 54, 622-654.

Box, G.E.P. and Draper, N.R. (1987). Empirical Model-Building and Response Surfaces. Wiley, New York.

Box, G.E.P. and Hunter, J.S. (1957). "Multi-factor experimental designs for exploring response surfaces." Annals of Mathematical Statistics 28, 195-241.

Box, G.E.P. and Hunter, J.S. (1961). "The 2^{k-p} fractional factorial designs. Part I." Technometrics 3, 311-351. "Part II." Ibidem, 449-458.

Box, G.E.P. and Wilson, K.B. (1951). "On the experimental attainment of optimum conditions." Journal of the Royal Statistical Society Series B 13, 1-38. "Discussion on paper by Mr. Box and Dr. Wilson." Ibidem, 38-45.

Bunke, H. and Bunke, O. (1974). "Identifiability and estimability." Mathematische Operationsforschung und Statistik 5, 223-233.

Bunke, H. and Bunke, O. (Eds.) (1986). Statistical Inference in Linear Models. Statistical Methods of Model Building, Volume 1. Wiley, Chichester.

Busemann, H. (1955). The Geometry of Geodesics. Academic Press, New York.

Caliński, T. (1977). "On the notion of balance in block designs." In Recent Developments in Statistics. Proceedings of the European Meeting of Statisticians, Grenoble 1976 (Eds. J.R. Barra, F. Brodeau, G. Romier, B. van Cutsem). North-Holland, Amsterdam, 365-374.

Carlson, D. (1986). "What are Schur complements, anyway?" Linear Algebra and Its Applications 74, 257-275.

Chakrabarti, M.C. (1963). "On the C-matrix in design of experiments." Journal of the Indian Statistical Association 1, 8-23.

Chaloner, K. (1984). "Optimal Bayesian experimental design for linear models." Annals of Statistics 12, 283-300. "Correction." Ibidem 13, 836.

Chaloner, K. and Larntz, K. (1989). "Optimal Bayesian design applied to logistic regression experiments." Journal of Statistical Planning and Inference 21, 191-208.

Chebyshev, P.L. [Tchebychef, P.L.] (1859). "Sur les questions de minima qui se rattachent à la représentation approximative des fonctions." Mémoires de l'Académie Impériale des Sciences de St.-Pétersbourg. Sixième Série. Sciences Mathématiques et Physiques 7, 199-291.

Chebyshev, P.L. [Tchebychef, P.L.] (Œuvres). Œuvres de P.L. Tchebychef. Reprint, Chelsea, New York 1961.

Cheng, C.-S. (1978a). "Optimality of certain asymmetrical experimental designs." Annals of Statistics 6, 1239-1261.

Cheng, C.-S. (1978b). "Optimal designs for the elimination of multi-way heterogeneity." Annals of Statistics 6, 1262-1272.

Cheng, C.-S. (1981). "Optimality and construction of pseudo-Youden designs." Annals of Statistics 9, 201-205.

Cheng, C.-S. (1987). "An application of the Kiefer-Wolfowitz equivalence theorem to a problem in Hadamard transform optics." Annals of Statistics 15, 1593-1603.

Chernoff, H. (1953). "Locally optimal designs for estimating parameters." Annals of Mathematical Statistics 24, 586-602.

Chernoff, H. (1972). Sequential Analysis and Optimal Design. Society for Industrial and Applied Mathematics, Philadelphia, PA.

Christof, K. (1987). Optimale Blockpläne zum Vergleich von Kontroll- und Testbehandlungen. Dissertation, Universität Augsburg, 99 pages.

Christof, K. and Pukelsheim, F. (1985). "Approximate design theory for a simple block design with random block effects." In Linear Statistical Inference. Proceedings of the International Conference, Poznań 1984 (Eds. T. Caliński, W. Klonecki). Lecture Notes in Statistics 35, Springer, Berlin, 20-28.

Conlisk, J. and Watts, H. (1979). "A model for optimizing experimental designs for estimating response surfaces." Journal of Econometrics 11, 27-42.

Constantine, G.M. (1987). Combinatorial Theory and Statistical Design. Wiley, New York.

Constantine, K.B., Lim, Y.B. and Studden, W.J. (1987). "Admissible and optimal exact designs for polynomial regression." Journal of Statistical Planning and Inference 16, 15-32.

Cook, R.D. and Nachtsheim, C.J. (1980). "A comparison of algorithms for constructing exact D-optimal designs." Technometrics 22, 315-324.

Cook, R.D. and Nachtsheim, C.J. (1982). "Model robust, linear-optimal designs." Technometrics 24, 49-54.

Cook, R.D. and Thibodeau, L.A. (1980). "Marginally restricted D-optimal designs." Journal of the American Statistical Association 75, 366-371.

Covey-Crump, P.A.K. and Silvey, S.D. (1970). "Optimal regression designs with previous observations." Biometrika 57, 551-566.

Cox, D.D. (1988). "Approximation of method of regularization estimators." Annals of Statistics 16, 694-712.

Cox, D.R. and Reid, N. (1987). "Parameter orthogonality and approximate conditional inference." Journal of the Royal Statistical Society Series B 49, 1-18.

Danzer, L., Grünbaum, B. and Klee, V. (1963). "Helly's theorem and its relatives." In Convexity. Proceedings of Symposia in Pure Mathematics 7 (Ed. V. Klee). American Mathematical Society, Providence, RI, 101-180.

Danzer, L., Laugwitz, D. and Lenz, H. (1957). "Über das LÖWNERsche Ellipsoid und sein Analogon unter den einem Eikörper einbeschriebenen Ellipsoiden." Archiv der Mathematik 7, 214-219.

DasGupta, A. and Studden, W.J. (1991). "Robust Bayesian experimental designs in normal linear models." Annals of Statistics 19, 1244-1256.

DasGupta, A., Mukhopadhyay, S. and Studden, W.J. (1992). "Compromise designs in heteroscedastic linear models." Journal of Statistical Planning and Inference 32, 363-384.

de la Garza, A. (1954). "Spacing of information in polynomial regression." Annals of Mathematical Statistics 25, 123-130.

Dette, H. (1990). "A generalization of D- and D_1-optimal designs in polynomial regression." Annals of Statistics 18, 1784-1804.

Dette, H. (1991a). "A note on robust designs for polynomial regression." Journal of Statistical Planning and Inference 28, 223-232.

Dette, H. (1991b). Geometric Characterizations of Model Robust Designs. Habilitationsschrift, Georg-August-Universität Göttingen, 111 pages.

Dette, H. (1992a). "Experimental designs for a class of weighted polynomial regression models." Computational Statistics and Data Analysis 14, 359-373.

Dette, H. (1992b). "Optimal designs for a class of polynomials of odd or even degree." Annals of Statistics 20, 238-259.

Dette, H. (1993a). "Elfving's theorem for D-optimality." Annals of Statistics 21, 753-766.

Dette, H. (1993b). "A mixture of the D- and D_1-optimality criterion in polynomial regression." Journal of Statistical Planning and Inference 35, 233-249.

Dette, H. and Studden, W.J. (1992). "On a new characterization of the classical orthogonal polynomials." Journal of Approximation Theory 71, 3-17.

Dette, H. and Studden, W.J. (1993). "Geometry of E-optimality." Annals of Statistics 21, 416-433.

Diaconis, P. and Freedman, D. (1979). "On rounding percentages." Journal of the American Statistical Association 74, 359-364.

Donoghue, W.F., Jr. (1974). Monotone Matrix Functions and Analytic Continuation. Springer, Berlin.

Draper, N.R. and Guttman, I. (1988). "An index of rotatability." Technometrics 30, 105-111.

Draper, N.R. and Pukelsheim, F. (1990). "Another look at rotatability." Technometrics 32, 195-202.

Draper, N.R., Gaffke, N. and Pukelsheim, F. (1991). "First and second order rotatability of experimental designs, moment matrices, and information surfaces." Metrika 38, 129-161.

Draper, N.R., Gaffke, N. and Pukelsheim, F. (1993). "Rotatability of variance surfaces and moment matrices." Journal of Statistical Planning and Inference 36, 347-356.

Eccleston, J.A. and Hedayat, A. (1974). "On the theory of connected designs: Characterization and optimality." Annals of Statistics 2, 1238-1255.

Eccleston, J.A. and Kiefer, J. (1981). "Relationships of optimality for individual factors of a design." Journal of Statistical Planning and Inference 5, 213-219.

Ehrenfeld, S. (1955). "On the efficiency of experimental designs." Annals of Mathematical Statistics 26, 247-255.

Ehrenfeld, S. (1956). "Complete class theorems in experimental designs." In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1954 and 1955, Volume 1 (Ed. J. Neyman). University of California, Berkeley, CA, 57-67.

El-Krunz, S.M. and Studden, W.J. (1991). "Bayesian optimal designs for linear regression models." Annals of Statistics 19, 2183-2208.

Elfving, G. (1952). "Optimum allocation in linear regression theory." Annals of Mathematical Statistics 23, 255-262.

Elfving, G. (1959). "Design of linear experiments." In Probability and Statistics. The Harald Cramér Volume (Ed. U. Grenander). Almquist and Wiksell, Stockholm, 58-74.

Elfving, G. (1985). "Finnish Mathematical Statistics in the past." In Proceedings of the First International Tampere Seminar on Linear Statistical Models and their Applications, Tampere 1983 (Eds. T. Pukkila, S. Puntanen). University of Tampere, Tampere, 3-8.

Erdős, P. (1947). "Some remarks on polynomials." Bulletin of the American Mathematical Society 53, 1169-1176.

Ermakov, S.M. (Ed.) (1983). Mathematical Theory of Experimental Planning. Nauka, Moscow (in Russian).

Farebrother, R.W. (1985). "The statistical estimation of the standard linear model, 1756-1853." In Proceedings of the First International Tampere Seminar on Linear Statistical Models and their Applications, Tampere 1983 (Eds. T. Pukkila, S. Puntanen). University of Tampere, Tampere, 77-99.

Farrell, R.H., Kiefer, J. and Walbran, A. (1967). "Optimum multivariate designs." In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1965 and 1966, Volume 1 (Eds. L.M. Le Cam, J. Neyman). University of California, Berkeley, CA, 113-138.

Fedorov, V.V. (1972). Theory of Optimal Experiments. Academic Press, New York.

Fedorov, V. and Khabarov, V. (1986). "Duality of optimal designs for model discrimination and parameter estimation." Biometrika 73, 183-190.

Fedorov, V.V. and Malyutov, M.B. (1972). "Optimal designs in regression problems." Mathematische Operationsforschung und Statistik 3, 281-308.

Fejér, L. (1932). "Bestimmung derjenigen Abszissen eines Intervalles, für welche die Quadratsumme der Grundfunktionen der Lagrangeschen Interpolation im Intervalle ein möglichst kleines Maximum besitzt." Annali della R. Scuola Normale Superiore di Pisa, Serie II, Scienze Fisiche e Matematiche 1, 263-276.

Fellman, J. (1974). "On the allocation of linear observations." Societas Scientiarum Fennica, Commentationes Physico-Mathematicae 44, 27-78.

Fellman, J. (1980). "On the behavior of the optimality criterion in the neighborhood of the optimal point." Working Paper 49, Swedish School of Economics and Business Administration, Helsinki, 15 pages.

Fellman, J. (1991). "Gustav Elfving and the emergence of the optimal design theory." Working Paper 218, Swedish School of Economics and Business Administration, Helsinki, 7 pages.

Fisher, R.A. (1940). "An examination of the different possible solutions of a problem in incomplete blocks." Annals of Eugenics 10, 52-75.

Gaffke, N. (1981). "Some classes of optimality criteria and optimal designs for complete two-way layouts." Annals of Statistics 9, 893-898.

Gaffke, N. (1982). Optimalitätskriterien und optimale Versuchspläne für lineare Regressionsmodelle. Habilitationsschrift, Rheinisch-Westfälische Technische Hochschule Aachen, 127 pages.

Gaffke, N. (1985a). "Directional derivatives of optimality criteria at singular matrices in convex design theory." Statistics 16, 373-388.

Gaffke, N. (1985b). "Singular information matrices, directional derivatives, and subgradients in optimal design theory." In Linear Statistical Inference. Proceedings of the International Conference on Linear Inference, Poznań 1984 (Eds. T. Caliński, W. Klonecki). Lecture Notes in Statistics 35, Springer, Berlin, 61-77.

Gaffke, N. (1987a). "Further characterizations of design optimality and admissibility for partial parameter estimation in linear regression." Annals of Statistics 15, 942-957.

Gaffke, N. (1987b). "On D-optimality of exact linear regression designs with minimum support." Journal of Statistical Planning and Inference 15, 189-204.

Gaffke, N. and Krafft, O. (1977). "Optimum properties of Latin square designs and a matrix inequality." Mathematische Operationsforschung und Statistik Series Statistics 8, 345-350.

Gaffke, N. and Krafft, O. (1979a). "Matrix inequalities in the Löwner ordering." In Modern Applied Mathematics: Optimization and Operations Research (Ed. B. Korte). North-Holland, Amsterdam, 595-622.

Gaffke, N. and Krafft, O. (1979b). "Optimum designs in complete two-way layouts." Journal of Statistical Planning and Inference 3, 119-126.

Gaffke, N. and Krafft, O. (1982). "Exact D-optimum designs for quadratic regression." Journal of the Royal Statistical Society Series B 44, 394-397.

Gaffke, N. and Pukelsheim, F. (1988). "Admissibility and optimality of experimental designs." In Model-Oriented Data Analysis. Proceedings of an International Institute for Applied Systems Analysis Workshop on Data Analysis, Eisenach 1987 (Eds. V. Fedorov, H. Läuter). Lecture Notes in Economics and Mathematical Systems 297, Springer, Berlin, 37-43.

Galil, Z. and Kiefer, J. (1977). "Comparison of rotatable designs for regression on balls, I (quadratic)." Journal of Statistical Planning and Inference 1, 27-40.

Gauss, C.F. (Werke). Werke (Ed. Königliche Gesellschaft der Wissenschaften zu Göttingen). Band IV (Göttingen 1873), Band VII (Leipzig 1906), Band X.1 (Leipzig 1917).

Giovagnoli, A. and Wynn, H.P. (1981). "Optimum continuous block designs." Proceedings of the Royal Society London Series A 377, 405-416.

Giovagnoli, A. and Wynn, H.P. (1985a). "Schur-optimal continuous block designs for treatments with a control." In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Volume 2 (Eds. L.M. Le Cam, R.A. Olshen). Wadsworth, Belmont, CA, 651-666. [412, 425]

Giovagnoli, A. and Wynn, H.P. (1985b). "G-majorization with applications to matrix orderings." Linear Algebra and Its Applications 67, 111-135. [425, 426]

Giovagnoli, A., Pukelsheim, F. and Wynn, H.P. (1987). "Group invariant orderings and experimental designs." Journal of Statistical Planning and Inference 17, 159-171. [425]

Goller, H. (1986). "Shorted operators and rank decomposition matrices." Linear Algebra and Its Applications 81, 207-236. [412]

Greub, W.H. (1967). Multilinear Algebra. Springer, Berlin. [427]

Gribik, P.R. and Kortanek, K.O. (1977). "Equivalence theorems and cutting plane algorithms for a class of experimental design problems." SIAM Journal on Applied Mathematics 32, 232-259. [424]

Gruber, P.M. (1988). "Minimal ellipsoids and their duals." Rendiconti del Circolo Matematico di Palermo Serie II 37, 35-64. [417]

Guest, P.G. (1958). "The spacing of observations in polynomial regression." Annals of Mathematical Statistics 29, 294-299. [245, 418]

Gutmair, S. (1990). Mischungen von Informationsfunktionen: Optimalitätstheorie und Anwendungen in der klassischen und Bayes'schen Versuchsplanung. Dissertation, Universität Augsburg, 82 pages. [157, 423]

Guttman, I. (1971). "A remark on the optimal regression designs with previous observations of Covey-Crump and Silvey." Biometrika 58, 683-685. [422]

Hansen, O.H. and Torgersen, E.N. (1974). "Comparison of linear normal experiments." Annals of Statistics 2, 367-373. [412]

Hardy, G.H., Littlewood, J.E. and Pólya, G. (1934). Inequalities. Cambridge University Press, Cambridge, UK. [413]

Hedayat, A. (1981). "Study of optimality criteria in design of experiments." In Statistics and Related Topics. Proceedings of the Symposium, Ottawa 1980 (Eds. M. Csörgő, D.A. Dawson, J.N.K. Rao, A.K.Md.E. Saleh). North-Holland, Amsterdam, 39-56. [413]

Hedayat, A.S. and Majumdar, D. (1985). "Combining experiments under Gauss-Markov models." Journal of the American Statistical Association 80, 698-703. [412]

Hedayat, A. and Wallis, W.D. (1978). "Hadamard matrices and their applications." Annals of Statistics 6, 1184-1238. [427]

Hedayat, A.S., Jacroux, M. and Majumdar, D. (1988). "Optimal designs for comparing test treatments with a control." Statistical Science 3, 462-476. "Discussion." Ibidem, 477-491. [426]

Heiligers, B. (1988). Zulässige Versuchspläne in linearen Regressionsmodellen. Dissertation, Rheinisch-Westfälische Technische Hochschule Aachen, 194 pages. [350, 422]

Heiligers, B. (1991a). "Admissibility of experimental designs in linear regression with constant term." Journal of Statistical Planning and Inference 28, 107-123. [422]

Heiligers, B. (1991b). "A note on connectedness of block designs." Metrika 38, 377-381. [426]

Heiligers, B. (1991c). E-optimal Polynomial Regression Designs. Habilitationsschrift, Rheinisch-Westfälische Technische Hochschule Aachen, 88 pages. [96, 246, 420]

Helmert, F.R. (1872). Die Ausgleichungsrechnung nach der Methode der kleinsten Quadrate, mit Anwendungen auf die Geodäsie und die Theorie der Messinstrumente. Teubner, Leipzig. [427]

Henderson, H.V. and Searle, S.R. (1981). "The vec-permutation matrix, the vec operator and Kronecker products: A review." Linear and Multilinear Algebra 9, 271-288. [427]

Henderson, H.V., Pukelsheim, F. and Searle, S.R. (1983). "On the history of the Kronecker product." Linear and Multilinear Algebra 14, 113-120. [427]

Herzberg, A.M. and Cox, D.R. (1969). "Recent work on the design of experiments: A bibliography and a review." Journal of the Royal Statistical Society Series A 132, 29-61. [422]

Hill, P.D.H. (1978a). "A note on the equivalence of D-optimal design measures for three rival linear models." Biometrika 65, 666-667. [421]

Hill, P.D.H. (1978b). "A review of experimental design procedures for regression model discrimination." Technometrics 20, 15-21. [423]

Hoang, T. and Seeger, A. (1991). "On conjugate functions, subgradients, and directional derivatives of a class of optimality criteria in experimental design." Statistics 22, 349-368. [413]

Hoel, P.G. (1958). "Efficiency problems in polynomial estimation." Annals of Mathematical Statistics 29, 1134-1145. [418]

Hoel, P.G. (1965). "Minimax designs in two dimensional regression." Annals of Mathematical Statistics 36, 1097-1106. [406, 421]

Hoel, P.G. and Levine, A. (1964). "Optimal spacing and weighting in polynomial prediction." Annals of Mathematical Statistics 35, 1553-1560. [209, 410, 419]

Hölder, O. (1889). "Ueber einen Mittelwerthssatz." Nachrichten von der Königlichen Gesellschaft der Wissenschaften und der Georg-Augusts-Universität zu Göttingen 2, 38-47. [413]

Horn, R.A. and Johnson, C.R. (1985). Matrix Analysis. Cambridge University Press, Cambridge, UK. [33, 414]

Hotelling, H. (1944). "Some improvements in weighing and other experimental techniques." Annals of Mathematical Statistics 15, 297-306. [427]

Humak, K.M.S. (1977). Statistische Methoden der Modellbildung, Band I. Statistische Inferenz für lineare Parameter. Akademie-Verlag, Berlin. [415, 423]

Hunter, W.G. and Reiner, A.M. (1965). "Designs for discriminating between two rival models." Technometrics 7, 307-323. [423]

Jensen, S.T. (1988). "Covariance hypotheses which are linear in both the covariance and the inverse covariance." Annals of Statistics 16, 302-322. [425]

John, J.A. (1987). Cyclic Designs. Chapman and Hall, London. [97, 424]

John, P.W.M. (1964). "Balanced designs with unequal numbers of replicates." Annals of Mathematical Statistics 35, 897-899. [379]

Kageyama, S. and Tsuji, T. (1980). "Characterization of equireplicated variance-balanced block designs." Annals of the Institute of Statistical Mathematics 32, 263-273. [426]

Karlin, S. and Studden, W.J. (1966a). "Optimal experimental designs." Annals of Mathematical Statistics 37, 783-815. [416, 418, 421, 422]

Karlin, S. and Studden, W.J. (1966b). Tchebycheff Systems: With Applications in Analysis and Statistics. Interscience, New York. [410, 418, 421, 422]

Kempthorne, O. (1980). "The term design matrix." American Statistician 34, 249. [409]

Khuri, A.I. (1988). "A measure of rotatability for response-surface designs." Technometrics 30, 95-104. [427]

Khuri, A.I. and Cornell, J.A. (1987). Response Surfaces. Designs and Analyses. Dekker, New York. [427]

Kiefer, J.C. (1958). "On the nonrandomized optimality and randomized nonoptimality of symmetrical designs." Annals of Mathematical Statistics 29, 675-699. [425]

Kiefer, J.C. (1959). "Optimum experimental designs." Journal of the Royal Statistical Society Series B 21, 272-304. "Discussion on Dr Kiefer's paper." Ibidem, 304-319. [409, 412, 417, 421, 424, 425, 431]

Kiefer, J.C. (1960). "Optimum experimental designs V, with applications to systematic and rotatable designs." In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1960, Volume 1 (Ed. J. Neyman). University of California, Berkeley, CA, 381-405. [351, 413, 424, 426, 427]

Kiefer, J.C. (1961). "Optimum designs in regression problems, II." Annals of Mathematical Statistics 32, 298-325. [418, 419]

Kiefer, J.C. (1962). "An extremum result." Canadian Journal of Mathematics 14, 597-601. [416]

Kiefer, J.C. (1971). "The role of symmetry and approximation in exact design optimality." In Statistical Decision Theory and Related Topics. Proceedings of a Symposium, Purdue University 1970 (Eds. S.S. Gupta, J. Yackel). Academic Press, New York, 109-118. [424, 425]

Kiefer, J.C. (1974a). "General equivalence theory for optimum designs (approximate theory)." Annals of Statistics 2, 849-879. [412, 415, 420, 431]

Kiefer, J.C. (1974b). "Lectures on design theory." Mimeograph Series No. 397, Department of Statistics, Purdue University, 52 pages. [415]

Kiefer, J.C. (1975). "Construction and optimality of generalized Youden designs." In A Survey of Statistical Design and Linear Models (Ed. J.N. Srivastava). North-Holland, Amsterdam, 333-353. [421, 425]

Kiefer, J.C. (1981). "The interplay of optimality and combinatorics in experimental design." Canadian Journal of Statistics 9, 1-10. [425]

Kiefer, J.C. (CP). Collected Papers (Eds. L.D. Brown, I. Olkin, J. Sacks, H.P. Wynn). Springer, New York 1985. [431]

Kiefer, J.C. and Studden, W.J. (1976). "Optimal designs for large degree polynomial regression." Annals of Statistics 4, 1113-1123. [419]

Kiefer, J.C. and Wolfowitz, J. (1959). "Optimum designs in regression problems." Annals of Mathematical Statistics 30, 271-294. [246, 410, 415, 419, 425]

Kiefer, J.C. and Wolfowitz, J. (1960). "The equivalence of two extremum problems." Canadian Journal of Mathematics 12, 363-366. [415, 416, 418]

Kiefer, J.C. and Wolfowitz, J. (1965). "On a theorem of Hoel and Levine on extrapolation designs." Annals of Mathematical Statistics 36, 1627-1655. [267, 410, 419]

Kiefer, J.C. and Wynn, H.P. (1984). "Optimum and minimax exact treatment designs for one-dimensional autoregressive error processes." Annals of Statistics 12, 431-450. [431]

Kitsos, C.P., Titterington, D.M. and Torsney, B. (1988). "An optimal design problem in rhythmometry." Biometrics 44, 657-671. [417]

Krafft, O. (1978). Lineare statistische Modelle und optimale Versuchspläne. Vandenhoeck und Ruprecht, Göttingen. [406, 413, 418, 421, 426]

Krafft, O. (1981). "Dual optimization problems in stochastics." Jahresbericht der Deutschen Mathematiker-Vereinigung 83, 97-105. [417]

Krafft, O. (1983). "A matrix optimization problem." Linear Algebra and Its Applications 51, 137-142. [408]

Krafft, O. (1990). "Some matrix representations occurring in linear two-factor models." In Probability, Statistics and Design of Experiments. Proceedings of the R.C. Bose Memorial Conference, Delhi 1988 (Ed. R.R. Bahadur). Wiley Eastern, New Delhi, 461-470. [113]

Krafft, O. and Schaefer, M. (1992). "D-optimal designs for a multivariate regression model." Journal of Multivariate Analysis 42, 130-140. [419]

Krein, M. (1947). "The theory of self-adjoint extensions of semibounded Hermitian transformations and its applications. I." Matematicheskii Sbornik 20(62), 431-495 (in Russian). [411]

Kunert, J. (1991). "Cross-over designs for two treatments and correlated errors." Biometrika 78, 315-324. [412]

Kunert, J. and Martin, R.J. (1987). "On the optimality of finite Williams II(a) designs." Annals of Statistics 15, 1604-1628. [412]

Kurotschka, V. (1971). "Optimale Versuchspläne bei zweifach klassifizierten Beobachtungsmodellen." Metrika 17, 215-232. [412]

Kurotschka, V. (1978). "Optimal design of complex experiments with qualitative factors of influence." Communications in Statistics, Theory and Methods A7, 1363-1378. [412]

LaMotte, L.R. (1977). "A canonical form for the general linear model." Annals of Statistics 5, 787-789. [412]

Lau, T.-S. (1988). "D-optimal designs on the unit q-ball." Journal of Statistical Planning and Inference 19, 299-315. [423]

Lau, T.-S. and Studden, W.J. (1985). "Optimal designs for trigonometric and polynomial regression using canonical moments." Annals of Statistics 13, 383-394. [423]

Läuter, E. (1974). "Experimental design in a class of models." Mathematische Operationsforschung und Statistik 5, 379-398. [423]

Läuter, E. (1976). "Optimal multipurpose designs for regression models." Mathematische Operationsforschung und Statistik 7, 51-68. [423]

Lee, C.M.-S. (1987). "Constrained optimal designs for regression models." Communications in Statistics, Theory and Methods 16, 765-783. [423]

Lee, C.M.-S. (1988). "Constrained optimal designs." Journal of Statistical Planning and Inference 18, 377-389. [423]

Legendre, A.M. (1806). Nouvelles Méthodes pour la Détermination des Orbites des Comètes. Courcier, Paris. English translation of the appendix "Sur la méthode des moindres quarrés" in: D.E. Smith, A Source Book in Mathematics, Volume 2. Dover, New York 1959, 576-579. [409]

Lim, Y.B. and Studden, W.J. (1988). "Efficient Ds-optimal design for multivariate polynomial regression on the q-cube." Annals of Statistics 16, 1225-1240. [423]

Lindley, D.V. (1956). "On a measure of the information provided by an experiment." Annals of Mathematical Statistics 27, 986-1005. [414]

Lindley, D.V. and Smith, A.F.M. (1972). "Bayes estimates for the linear model." Journal of the Royal Statistical Society Series B 34, 1-18. "Discussion on the paper by Professor Lindley and Dr Smith." Ibidem, 18-41. [422]

Loewner, C. [Löwner, K.] (1934). "Über monotone Matrixfunktionen." Mathematische Zeitschrift 38, 177-216. [408]

Loewner, C. [Löwner, K.] (1939). "Grundzüge einer Inhaltslehre im Hilbertschen Raume." Annals of Mathematics 40, 816-833. [417]

Loewner, C. [Löwner, K.] (CP). Collected Papers (Ed. L. Bers). Birkhäuser, Basel 1988. [428]

Magnus, J.R. (1987). "A representation theorem for (tr A^p)^(1/p)." Linear Algebra and Its Applications 95, 127-134. [414]

Mäkeläinen, T. (1990). "Gustav Elfving 1908-1984." Address presented to the International Workshop on Linear Models, Experimental Design, and Related Matrix Theory, Tampere 1990. [430]

Markoff, W. (1916). "Über Polynome, die in einem gegebenen Intervalle möglichst wenig von Null abweichen." Vorwort von Serge Bernstein in Charkow. Mathematische Annalen 77, 213-258. [421]

Markov, A.A. (1912). Wahrscheinlichkeitsrechnung. Teubner, Leipzig. [409]

Marshall, A.W. and Olkin, I. (1969). "Norms and inequalities for condition numbers, II." Linear Algebra and Its Applications 2, 167-172. [134, 414]

Marshall, A.W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, New York. [408, 413]

Mathias, R. (1990). "Concavity of monotone matrix functions of finite order." Linear and Multilinear Algebra 27, 129-138. [156]

McFadden, D. (1978). "Cost, revenue, and profit functions." In Production Economics: A Dual Approach to Theory and Applications, Volume I (Eds. M. Fuss, D. McFadden). North-Holland, Amsterdam, 3-109. [413]

Mikaeili, F. (1988). "Allocation of measurements in experiments with mixtures." Keio Science and Technology Reports 41, 25-37. [426]

Milliken, G.A. and Akdeniz, F. (1977). "A theorem on the difference of the generalized inverses of two nonnegative matrices." Communications in Statistics, Theory and Methods A6, 73-79. [209]

Mitra, S.K. and Puri, M.L. (1979). "Shorted operators and generalized inverses of matrices." Linear Algebra and Its Applications 25, 45-56. [412]

Mosteller, F., Youtz, C. and Zahn, D. (1967). "The distribution of sums of rounded percentages." Demography 4, 850-858. [424]

Müller-Funk, U., Pukelsheim, F. and Witting, H. (1985). "On the duality between locally optimal tests and optimal experimental designs." Linear Algebra and Its Applications 67, 19-34. [413]

Murty, V.N. (1971). "Optimal designs with a Tchebycheffian spline regression function." Annals of Mathematical Statistics 42, 643-649. [420]

Myers, R.H. (1971). Response Surface Methodology. Allyn and Bacon, Boston, MA. [427]

Myers, R.H., Khuri, A.I. and Carter, W.H., Jr. (1989). "Response surface methodology: 1966-1988." Technometrics 31, 137-157. [427]

Myers, R.H., Vining, G.G., Giovannitti-Jensen, A. and Myers, S.L. (1992). "Variance dispersion properties of second-order response surface designs." Journal of Quality Technology 24, 1-11. [427]

Nachtsheim, C.J. (1989). "On the design of experiments in the presence of fixed covariates." Journal of Statistical Planning and Inference 22, 203-212. [419]

Nalimov, V.V. (1974). "Systematization and codification of the experimental designs. The survey of the works of Soviet statisticians." In Progress in Statistics. European Meeting of Statisticians, Budapest 1972, Volume 2 (Eds. J. Gani, K. Sarkadi, I. Vincze). Colloquia Mathematica Societatis Janos Bolyai 9, North-Holland, Amsterdam, 565-581. [417]

Natanson, I.P. (1955). Konstruktive Funktionentheorie. Akademie-Verlag, Berlin. [421]

Nigam, A.K., Puri, P.D. and Gupta, V.K. (1988). Characterizations and Analysis of Block Designs. Wiley Eastern, New Delhi. [426]

Nordström, K. (1991). "The concentration ellipsoid of a random vector revisited." Econometric Theory 7, 397-403. [413]

Ouellette, D.V. (1981). "Schur complements and statistics." Linear Algebra and Its Applications 36, 187-295. [411]

Pázman, A. (1980). "Singular experimental design (standard and Hilbert-space approaches)." Mathematische Operationsforschung und Statistik Series Statistics 11, 137-149. [411]

Pázman, A. (1986). Foundations of Optimum Experimental Design. Reidel, Dordrecht. [409, 412, 418]

Pázman, A. (1990). "Small-sample distributional properties of nonlinear regression estimators (a geometric approach)." Statistics 21, 323-346. "Discussion." Ibidem, 346-367. [411]

Pilz, J. (1979). "Optimalitätskriterien, Zulässigkeit und Vollständigkeit im Planungsproblem für eine bayessche Schätzung im linearen Regressionsmodell." Freiberger Forschungshefte D117, 67-94. [421]

Pilz, J. (1991). Bayesian Estimation and Experimental Design in Linear Regression Models. Wiley, New York. [422]

Plackett, R.L. (1949). "A historical note on the method of least squares." Biometrika 36, 458-460. [409]

Plackett, R.L. (1972). "Studies in the history of probability and statistics XXIX: The discovery of the method of least squares." Biometrika 59, 239-251. [409]

Plackett, R.L. and Burman, J.P. (1946). "The design of optimum multifactorial experiments." Biometrika 33, 305-325. [427]

Preece, D.A. (1982). "Balance and designs: Another terminological tangle." Utilitas Mathematica 21C, 85-186. [426]

Preitschopf, F. (1989). Bestimmung optimaler Versuchspläne in der polynomialen Regression. Dissertation, Universität Augsburg, 152 pages. [421]

Preitschopf, F. and Pukelsheim, F. (1987). "Optimal designs for quadratic regression." Journal of Statistical Planning and Inference 16, 213-218. [417]

Pukelsheim, F. (1980). "On linear regression designs which maximize information." Journal of Statistical Planning and Inference 4, 339-364. [157, 410, 411, 415-418, 422]

Pukelsheim, F. (1981). "On c-optimal design measures." Mathematische Operationsforschung und Statistik Series Statistics 12, 13-20. [186, 410, 411, 416]

Pukelsheim, F. (1983a). "On optimality properties of simple block designs in the approximate design theory." Journal of Statistical Planning and Inference 8, 193-208. [406, 412, 418, 426]

Pukelsheim, F. (1983b). "On information functions and their polars." Journal of Optimization Theory and Applications 41, 533-546. [413]

Pukelsheim, F. (1983c). "Optimal designs for linear regression." In Recent Trends in Statistics. Proceedings of the Anglo-German Statistical Meeting, Dortmund 1982 (Ed. S. Heiler). Allgemeines Statistisches Archiv, Sonderheft 21, Vandenhoeck und Ruprecht, Göttingen, 32-39. [412, 426]

Pukelsheim, F. (1986). "Approximate theory of multiway block designs." Canadian Journal of Statistics 14, 339-346. [350, 426]

Pukelsheim, F. (1987a). "Information increasing orderings in experimental design theory." International Statistical Review 55, 203-219. [413, 426]

Pukelsheim, F. (1987b). "Ordering experimental designs." In Proceedings of the First World Congress of the Bernoulli Society, Tashkent 1986, Volume 2 (Eds. Yu.A. Prohorov, V.V. Sazonov). VNU Science Press, Utrecht, 157-165. [426]

Pukelsheim, F. (1987c). "Majorization orderings for linear regression designs." In Proceedings of the Second International Tampere Conference in Statistics, Tampere 1987 (Eds. T. Pukkila, S. Puntanen). Department of Mathematical Sciences, Tampere, 261-274. [426]

Pukelsheim, F. (1989). "Complete class results for linear regression designs over the multi-dimensional cube." In Contributions to Probability and Statistics. Essays in Honor of Ingram Olkin (Eds. L.J. Gleser, M.D. Perlman, S.J. Press, A.R. Sampson). Springer, New York, 349-356. [426]

Pukelsheim, F. (1990). "Information matrices in experimental design theory." In Probability, Statistics and Design of Experiments. Proceedings of the R.C. Bose Memorial Conference, Delhi 1988 (Ed. R.R. Bahadur). Wiley Eastern, New Delhi, 607-618. [411]

Pukelsheim, F. and Rieder, S. (1992). "Efficient rounding of approximate designs." Biometrika 79, 763-770. [424]

Pukelsheim, F. and Rosenberger, J.L. (1993). "Experimental Designs for Model Discrimination." Journal of the American Statistical Association 88, 642-649. [423]

Pukelsheim, F. and Studden, W.J. (1993). "E-optimal designs for polynomial regression." Annals of Statistics 21, 402-415. [417, 420]

Pukelsheim, F. and Styan, G.P.H. (1983). "Convexity and monotonicity properties of dispersion matrices of estimators in linear models." Scandinavian Journal of Statistics 10, 145-149. [411, 412]

Pukelsheim, F. and Titterington, D.M. (1983). "General differential and Lagrangian theory for optimal experimental design." Annals of Statistics 11, 1060-1068. [415, 416, 424]

Pukelsheim, F. and Titterington, D.M. (1986). "Improving multi-way block designs at the cost of nuisance parameters." Statistics & Probability Letters 4, 261-264. [426]

Pukelsheim, F. and Titterington, D.M. (1987). "On the construction of approximate multi-factor designs from given marginals using the Iterative Proportional Fitting Procedure." Metrika 34, 201-210. [426]

Pukelsheim, F. and Torsney, B. (1991). "Optimal weights for experimental designs on linearly independent support points." Annals of Statistics 19, 1614-1625. [417]

Raghavarao, D. (1971). Constructions and Combinatorial Problems in Design of Experiments. Wiley, New York. [424, 426]

Raghavarao, D. and Federer, W.T. (1975). "On connectedness in two-way elimination of heterogeneity designs." Annals of Statistics 3, 730-735. [426]

Raktoe, B.L., Hedayat, A. and Federer, W.T. (1981). Factorial Designs. Wiley, New York. [424]

Rao, C.R. (1967). "Least squares theory using an estimated dispersion matrix and its application to measurement of signals." In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA 1965 and 1966, Volume 1 (Eds. L.M. Le Cam, J. Neyman). University of California, Berkeley, CA, 355-372. [96, 408]

Rao, C.R. and Mitra, S.K. (1971). Generalized Inverse of Matrices and Its Applications. Wiley, New York. [408]

Rao, V.R. (1958). "A note on balanced designs." Annals of Mathematical Statistics 29, 290-294. [379]

Rasch, D. and Herrendörfer, G. (1982). Statistische Versuchsplanung. VEB Deutscher Verlag der Wissenschaften, Berlin. [426]

Rieder, S. (1990). Versuchspläne mit vorgegebenen Trägerpunkten. Diplomarbeit, Universität Augsburg, 73 pages. [424]

Rivlin, T.J. (1990). Chebyshev Polynomials. From Approximation Theory to Algebra and Number Theory, Second Edition. Wiley-Interscience, New York. [420, 421]

Rockafellar, R.T. (1967). "Monotone processes of convex and concave type." Memoirs of the American Mathematical Society 77, 1-74. [413]

Rockafellar, R.T. (1970). Convex Analysis. Princeton University Press, Princeton, NJ. [409, 410, 413, 414, 416, 417, 422, 423]

Rogers, L.J. (1888). "An extension of a certain theorem in inequalities." Messenger of Mathematics 17, 145-150. [413]

Salaevskii, O.V. (1966). "The problem of the distribution of observations in polynomial regression." Proceedings of the Steklov Institute of Mathematics 79, 147-166. [424]

Schatten, R. (1950). A Theory of Cross-Spaces. Princeton University Press, Princeton, NJ. [414]

Schoenberg, I.J. (1959). "On the maxima of certain Hankel determinants and the zeros of the classical orthogonal polynomials." Indagationes Mathematicae 21, 282-290. [418]

Schur, I. (1911). "Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen." Journal für die reine und angewandte Mathematik 140, 1-28. [418]

Schur, I. (1918). "Über die Verteilung der Wurzeln bei gewissen algebraischen Gleichungen mit ganzzahligen Koeffizienten." Mathematische Zeitschrift 1, 377-402. [418]

Schur, I. (GA). Issai Schur Gesammelte Abhandlungen (Eds. A. Brauer, H. Rohrbach). Springer, Berlin 1973.

Searle, S.R. (1971). Linear Models. Wiley, New York. [409]

Seely, J.F. (1971). "Quadratic subspaces and completeness." Annals of Mathematical Statistics 42, 710-721. [425]

Shah, K.R. and Sinha, B[ikas].K. (1989). Theory of Optimal Designs. Lecture Notes in Statistics 54, Springer, New York. [424, 426]

Shewry, M.C. and Wynn, H.P. (1987). "Maximum entropy sampling." Journal of Applied Statistics 14, 165-170. [415]

Sheynin, O.B. (1989). "A.A. Markov's work on probability." Archive for History of Exact Sciences 39, 337-377. [409]

Sibson, R. (1974). "DA-optimality and duality." In Progress in Statistics. European Meeting of Statisticians, Budapest 1972, Volume 2 (Eds. J. Gani, K. Sarkadi, I. Vincze). Colloquia Mathematica Societatis Janos Bolyai 9, North-Holland, Amsterdam, 677-692. [417, 418]

Sibson, R. and Kenny, A. (1975). "Coefficients in D-optimal experimental design." Journal of the Royal Statistical Society Series B 37, 288-292. [418]

Silverman, B.W. and Titterington, D.M. (1980). "Minimum covering ellipses." SIAM Journal on Scientific and Statistical Computing 1, 401-409. [417]

Silvey, S.D. (1978). "Optimal design measures with singular information matrices." Biometrika 65, 553-559. [59, 411]

Silvey, S.D. (1980). Optimal Design. Chapman and Hall, London. [411]

Silvey, S.D. and Titterington, D.M. (1973). "A geometric approach to optimal design theory." Biometrika 60, 21-32. [417, 418]

Silvey, S.D. and Titterington, D.M. (1974). "A Lagrangian approach to optimal design." Biometrika 61, 299-302. [421]

Sinha, B[ikas].K. (1982). "On complete classes of experiments for certain invariant problems of linear inference." Journal of Statistical Planning and Inference 7, 171-180. [425]

Sinha, B[imal].K. (1970). "A Bayesian approach to optimum allocation in regression problems." Calcutta Statistical Association Bulletin 19, 45-52. [422]

Smith, K. (1918). "On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations." Biometrika 12, 1-85. [418]

St.John, R.C. and Draper, N.R. (1975). "D-optimality for regression designs: A review." Technometrics 17, 15-23. [418]

Steinberg, D.M. and Hunter, W.G. (1984). "Experimental design: Review and comment." Technometrics 26, 71-97. [422]

Stępniak, C. (1989). "Stochastic ordering and Schur-convex functions in comparison of linear experiments." Metrika 36, 291-298. [412]

Stępniak, C., Wang, S.-G. and Wu, C.F.J. (1984). "Comparison of linear experiments with known covariances." Annals of Statistics 12, 358-365. [412]

Stigler, S.M. (1971). "Optimal experimental design for polynomial regression." Journal of the American Statistical Association 66, 311-318. [423]

Stigler, S.M. (1986). The History of Statistics. The Measurement of Uncertainty before 1900. Belknap Press, Cambridge, MA. [409]

Stone, M. (1959). "Application of a measure of information to the design and comparison of regression experiments." Annals of Mathematical Statistics 30, 55-70. [414]

Studden, W.J. (1968). "Optimal designs on Tchebycheff points." Annals of Mathematical Statistics 39, 1435-1447. [59, 410, 419, 420]

Studden, W.J. (1971). "Elfving's Theorem and optimal designs for quadratic loss." Annals of Mathematical Statistics 42, 1613-1621. [410, 417]

Studden, W.J. (1977). "Optimal designs for integrated variance in polynomial regression." In Statistical Decision Theory and Related Topics II. Proceedings of a Symposium, Purdue University 1976 (Eds. S.S. Gupta, D.S. Moore). Academic Press, New York, 411-420. [419]

Studden, W.J. (1978). "Designs for large degree polynomial regression." Communications in Statistics, Theory and Methods A7, 1391-1397. [419]

Studden, W.J. (1980). "Ds-optimal designs for polynomial regression using continued fractions." Annals of Statistics 8, 1132-1141. [423]

Studden, W.J. (1982). "Some robust-type D-optimal designs in polynomial regression." Journal of the American Statistical Association 77, 916-921. [423]

Studden, W.J. (1989). "Note on some φp-optimal designs for polynomial regression." Annals of Statistics 17, 618-623. [417]

Studden, W.J. and VanArman, D.J. (1969). "Admissible designs for polynomial spline regression." Annals of Mathematical Statistics 40, 1557-1569. [422]

Styan, G.P.H. (1973). "Hadamard products and multivariate statistical analysis." Linear Algebra and Its Applications 6, 217-240. [418]

Styan, G.P.H. (1985). "Schur complements and linear statistical models." In Proceedings of the First International Tampere Seminar on Linear Statistical Models and their Applications, Tampere 1983 (Eds. T. Pukkila, S. Puntanen). University of Tampere, Tampere, 37-75. [411]

Szegő, G. (1939). Orthogonal Polynomials, Fourth Edition 1975. American Mathematical Society, Providence, RI. [418]

Titterington, D.M. (1975). "Optimal design: Some geometrical aspects of D-optimality." Biometrika 62, 313-320. [421]

Titterington, D.M. (1980a). "Geometric approaches to design of experiment." Mathematische Operationsforschung und Statistik Series Statistics 11, 151-163. [421]

Titterington, D.M. (1980b). "Aspects of optimal design in dynamic systems." Technometrics 22, 287-299. [413]

Torsney, B. (1981). Algorithms for a Constrained Optimization Problem with Applications in Statistics and Optimal Design. Dissertation, University of Glasgow, 336 pages. [417]

Tyagi, B.N. (1979). "On a class of variance balanced block designs." Journal of Statistical Planning and Inference 3, 333-336. [379]

Vila, J.P. (1991). "Local optimality of replications from a minimal D-optimal design in regression: A sufficient and a quasi-necessary condition." Journal of Statistical Planning and Inference 29, 261-277. [423]

Voltaire [Arouet, F.M.] (Œuvres). Œuvres Complètes de Voltaire. Tome 17, Dictionnaire Philosophique I (Paris 1878). Tome 21, Romans (Paris 1879). Reprint, Kraus, Nendeln, Liechtenstein 1967.

von Neumann, J. (1937). "Some matrix-inequalities and metrization of matric-space." Mitteilungen des Forschungsinstituts für Mathematik und Mechanik an der Kujbyschew-Universität Tomsk 1, 286-300. [414]

von Neumann, J. (CW). John von Neumann Collected Works (Ed. A.H. Taub). Pergamon Press, Oxford 1962.

Wald, A. (1943). "On the efficient design of statistical investigations." Annals of Mathematical Statistics 14, 134-140. [412, 413]

Welch, W.J. (1982). "Branch-and-bound search for experimental designs based on D optimality and other criteria." Technometrics 24, 41-48. [422]

Whittle, P. (1973). "Some general points in the theory of optimal experimental design." Journal of the Royal Statistical Society Series B 35, 123-130. [59, 414]

Wierich, W. (1986). "On optimal designs and complete class theorems for experiments with continuous and discrete factors of influence." Journal of Statistical Planning and Inference 15, 19-27. [422]

Witting, H. (1985). Mathematische Statistik I. Parametrische Verfahren bei festem Stichprobenumfang. Teubner, Stuttgart. [410]

Wussing, H. and Arnold, W. (1975). Biographien bedeutender Mathematiker. Volk und Wissen, Berlin. [414]

Wynn, H.P. (1972). "Results in the theory and construction of D-optimum experimental designs." Journal of the Royal Statistical Society Series B 34, 133-147. "Discussion of Dr Wynn's and of Dr Laycock's papers." Ibidem, 170-186. [417]

Wynn, H.P. (1977). "Optimum designs for finite population sampling." In Statistical Decision Theory and Related Topics II. Proceedings of a Symposium, Purdue University 1976 (Eds. S.S. Gupta, D.S. Moore). Academic Press, New York, 471-478. [422]

Wynn, H.P. (1982). "Optimum submeasures with application to finite population sampling." In Statistical Decision Theory and Related Topics III. Proceedings of the Third Purdue Symposium, Purdue University 1981, Volume 2 (Eds. S.S. Gupta, J.O. Berger). Academic Press, New York, 485-495. [422]

Zehfuss, G. (1858). "Über eine gewisse Determinante." Zeitschrift für Mathematik und Physik 3, 298-301. [427]

Zyskind, G. (1967). "On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models." Annals of Mathematical Statistics 38, 1092-1109. [209]

Subject Index

A bold face number refers to the page where the item is introduced.

A-criterion, see trace criterion
absolute continuity, 248, 305, 368
admissibility, 57, 252, 262, 265, 403, 417
  of a design, 247, 253, 421
  of a moment matrix, 247, 256, 422
  of an information matrix, 262, 264, 422
antisymmetry, 12, 145, 353, 356
antitonicity, 13, 89
arcsin distribution on [-1;1], 217, 246, 419
arcsin support design, 209, 217, 223, 230, 238, 241, 281, 419
arcsin support set T = {±1, ±1/2, 0}, 281, 294, 300
arithmetic mean φ1, 140, 292
average-variance criterion φ-1, 135, 137, 140, 153, 197, 221, 223, 241, 413, 419
averaging matrix Ja = 1a1a'/a, 88, 347, 366
balanced incomplete block design W = N/n, 138, 366, 378, 426
balancedness, 353, 417, 425
balancing operator A ↦ Ā, 193, 348, 369
Bayes design problem, 275, 278, 422
Bayes estimator, 270, 273
Bayes linear model, 269, 272
Bayes moment matrix Mα = (1 - α)M0 + αM, 271, 275
bijective (i.e. one-to-one and onto), 124, 335
Birkhoff theorem, 144, 413
block design W, 30, 94, 100, 362, 365
block matrix, 9, 75, 392
blocksize vector s, 31, 94, 426
bounded away from the origin, 120
C-matrix, see contrast information matrix
canonical moments, 417, 419, 423
Caratheodory theorem, 188, 360, 417
Cauchy inequality, 56, 235
centered contrasts Kaα, 88, 93, 97, 105, 113, 206, 363, 366, 411
centering matrix Ka, 88, 347, 366
central composite design, 400, 402, 427
Chebyshev coefficient vector c, 227, 233, 237, 420
Chebyshev indices d, d-2, ..., d-2⌊d/2⌋, 233
Chebyshev points, see arcsin support design
Chebyshev polynomial Td(t), 226, 238, 246, 420
classical linear model E[Y] = Xθ, D[Y] = σ²In, 4, 16, 24, 36, 62, 72, 382
  with normality assumption Y ~ N(Xθ, σ²In), 4, 67, 72
coefficient matrix K, 19, 47, 61, 206, 410
  rank deficient, 88, 205, 364, 371, 404
coefficient vector c, 36, 47, 54
column sum vector, see blocksize vector
complete class, 324, 334, 374, 403, 417
completely symmetric matrix, 34, 345, 347
complexity,
  of computing an information matrix, 73, 76, 82
  of model versus inferential simplicity, 381
  of the design problem, 284
componentwise partial ordering ≥, 285, 333, 375, 404
concavity, 115, 119
concentration ellipsoid, 413
concurrence matrix of treatments NN', 367, 426
confidence ellipsoid, 96
congruence action (Q, A) ↦ QAQ', 337, 425
conjugate numbers p + q = pq, 147, 261
connectedness, 426
contrast information matrix C, 94, 105, 262, 412, 422
convex hull conv S, 29, 43, 253, 352
convex set, 44, 191, 252
covariance adjustment, 96, 408
cylinder, 44, 50, 57, 259, 417
D-criterion, see determinant criterion
design ξ ∈ Ξ, τ ∈ T, 5, 26, 304, 390
  for extrapolation, 410, 419
  for model discrimination, 279
  for sample size n, 25, 304
  optimal for c'θ in Ξ, 50, 197
  φ-optimal for K'θ in Ξ, 131, 187
  standardized by sample size, 305
  with a guaranteed efficiency, 296, 423
  with bounded weights, 278, 422
  with protected runs, 277, 422
design matrix, 29, 31, 409
design problem, 131, 152, 170, 275, 284, 331, 342, 413, 425
  for a scalar parameter system c'θ, 41, 63, 82, 108
  for sample size n, 25, 304
design sequence, 409, 419
design set,
  T, on the experimental domain T, 27, 32, 213
  T(r) ⊆ T, cross section, 105, 363
  Ξ, on the regression range X, 26, 32, 410
  Ξn, for sample size n, 26, 304, 311
  Ξn/n ⊆ Ξ, standardized by sample size n, 305, 311
determinant criterion φ0, 119, 136, 140, 153, 195, 213, 217, 285, 293, 320, 344, 353, 356, 413, 418, 423, 425
  finite sample size optimality, 322, 325, 328
diagonal operator Δ, 8, 31, 94, 142, 146, 345, 413
diagonality of a square matrix, 142
direct sum decomposition, 23, 52
directional derivative, 415, 423
discrepancy, 308
discrete optimization problem, 26, 320
dispersion formula, 73, 411
dispersion matrix, 25, 102, 205, 395
dispersion maximum d(A), 211
dispersion ordering and information ordering, 91
doubly stochastic matrix, 144
dual problem, 47, 171, 231
duality gap, 47, 184
duality theorem, 172, 415, 417
E-criterion, see smallest-eigenvalue criterion
efficiency, 113, 132, 221, 223, 240, 292, 296
efficiency bound, 310, 312, 320, 424
  optimal, 311
efficiency loss, 314, 317, 321
efficient design apportionment, 27, 221, 223, 237, 308, 312, 320, 328, 424
efficient rounding, 308
eigenvalue, 8, 24, 56, 140, 146, 182, 204, 333, 375, 405
eigenvalue decomposition, 8, 24, 375, 401
Elfving norm ρ, 47, 59, 121, 183
Elfving set conv(X ∪ (-X)), 43, 107, 134, 191, 231, 420
Elfving theorem, 50, 107, 182, 190, 198, 212, 231, 239, 259, 410, 422
equispaced support, 217, 223, 230, 238, 241, 281, 419
equivalence theorem [over M(Ξ)], 176
  for the parameter vector θ, 177
  of Kiefer-Wolfowitz, 212, 222, 313, 324, 418
  under a matrix mean φp, 180
  under a scalar criterion c'θ, 52
  under the smallest-eigenvalue criterion φ-∞, 181
equivariance, 336, 340, 387, 395, 425
equivariance group H = {LQK : Q ∈ Q}, 343, 357, 363
estimability, 19, 36, 41, 64, 72
estimated response surface, 382
estimation and testing problems, 4, 340
Euclidean scalar product, 107, 394
  of column vectors x'y, 2, 72, 141, 344
  of rectangular matrices trace A'B, 8, 125, 141
Euclidean space,
  of column vectors Rk, 2
  of rectangular matrices Rn×k, 8
  of symmetric matrices Sym(k), 8
exchangeable dispersion structure, 34
existence theorem, 104, 109, 174
experimental conditions t ∈ T, 1
experimental domain T, 1, 335
experimental run, 3, 27, 278
extrapolation design, 410, 419
F-Test, 67, 89, 96, 413
factorial design, 390, 402
feasibility and formal optimality, 132, 138
feasibility cone A(c), A(K), 36, 63, 67, 82, 351, 411
  and rank of information matrices, 81
Fenchel duality, 415
first-degree model, 6, 192
Fisher inequality a < b, 367, 426
Fisher information, 72, 411
formal optimality, 131, 174
frequency count, 25, 304
full rank reduction, 160, 188
G-criterion, see global criterion
Gauss-Markov theorem, 13, 20, 34, 51, 62, 66, 89, 408
  for the parameter vector θ, 22
  under a range inclusion condition, 21
general equivalence theorem [over M], 17, 52, 140, 175, 177, 282, 415
  differentiability proof, 179
  for a mixture of models or criteria, 286, 290
  for Bayes designs, 276
  for designs with bounded weights, 278
  for guaranteed efficiency designs, 297
  for the parameter vector θ, 176
  under a matrix mean φp, 178
  under a scalar criterion c'θ, 111, 412
  under the Loewner ordering ≥, 103
  under the smallest-eigenvalue criterion φ-∞, 180
general linear group GL(k), 336
general linear model E[Y] = Xθ, D[Y] = σ²V, 18, 72
generalized information matrix AK, 89, 94, 165, 411, 415
  four formulas, 92
generalized information matrix mapping A ↦ AK, 92
generalized matrix inverse AGA = A, 16, 88, 408
  set A⁻ = {G : AGA = A}, 16, 159
geometric mean φ0, 140, 292
geometry,
  of a feasibility cone A(K), 40
  of the closed cone of nonnegative definite matrices NND(k), 10
  of the moment set, 252
  of the penumbra P, 108
  of the set M(Ξ) of all moment matrices, 29
  of the set of admissible moment matrices, 258
global criterion g, 211, 245, 313, 418
grand assumption M ∩ PD(k) ≠ ∅, 98, 108, 160, 170
  for mixture models, 286
group of unimodular matrices Unim(s), 344, 353, 361
Hadamard matrix, 391, 427
Hadamard product A ∗ B, 199, 418
harmonic mean φ-1, 140, 292
heritability of invariance, 358
homomorphism h : Q → GL(s), 338, 342
Hölder inequality, 126, 147, 413
I-criterion, 419
ice-cream cone NND(2), 38, 354, 410
idempotency, 17, 127
identifiability, 72, 81, 89, 305
inadmissibility, 190, 248, 294
incidence matrix N, 366, 426
information for c'θ, 63
information function,
  φ : NND(s) → R, 119, 126, 134, 209, 413
  ψ : NND(k1) × ... × NND(km) → Rm, 285
  Φ : Rm → R, 284
  functional operations, 124
  matrix means φp with p ∈ [-∞; 1], 140
  reconstruction from unit level set {φ ≥ 1}, 122
information increasing ordering, 426
information matrix CK(A) for K'θ, 62, 86, 411
  four formulas, 76
information matrix mapping CK, 76, 92, 129
  discontinuity, 79, 82, 319
  upper semicontinuity and regularization, 77, 99
information ordering and dispersion ordering, 91
information surface, 382, 389, 427
injective (i.e. one-to-one), 122, 335
integer part ⌊z⌋, 227, 306
invariance,
  heritability, 358
  of a design problem, 331
  of a matrix mean φp, 343
  of a symmetric matrix, 345, 388, 398
  of an information function φ, 343, 349
  of the determinant criterion φ0, 137, 344
  of the experimental domain T, 335
  of the regression range X, 336
  of the set of competing matrices M, 337
  under choice of generalized inverses G ∈ A⁻, 16
  under orthogonal transformations, 335, 351, 359
  under permutation matrices, 335, 351, 373
  under reparametrization, 137
  under rotations, 359
  under sign-changes, 351
isotonicity, 12, 114, 119
iterated information matrix,
  CKH(A) = CH(CK(A)), 111, 412
Kiefer optimality, 357, 360, 364, 368, 389, 421, 426
Kiefer ordering ≥, 354, 374
Kiefer-Wolfowitz theorem, 212, 214, 313, 324, 418
  inappropriate frame for generalization, 222
Kronecker product s ⊗ t, 392, 427
Kuhn-Tucker coefficients, 298, 423
L-criterion, see linear dispersion criterion
Lagrange polynomials L0(t), ..., Ld(t),
  for Chebyshev polynomial extrema, 227
  for Legendre polynomial extrema, 215, 328
Lagrange multipliers, 297
left identity QA = A, 19
left inverse LK = Is, 22, 62
  minimizing for A, CK(A) = LAL', 62
Legendre polynomial Pd(t), 214, 246, 418, 421
level set {φ ≥ 1} = {C ≥ 0 : φ(C) ≥ 1}, 77, 118, 120, 296
l'Hospital rule, 140, 328, 414
likelihood ratio, 198, 248, 310
limit of a design sequence, 409, 419
line fit model, 6, 32, 57, 83, 198, 246, 259
linear dispersion criterion, 222, 246, 277, 419
linear matrix equation, 16, 85, 201
linearly independent regression vectors, 42, 195, 213, 223, 417
Loewner comparison, 101, 251, 262, 350
Loewner ellipsoid, 417
Loewner optimality, 101, 206, 262, 357, 360, 363, 412, 426
  nonexistence, 104
Loewner ordering ≥, 12, 19, 25, 62, 90, 107, 114, 190, 262, 269, 354, 375, 408, 412
logarithmic concavity, 155, 414
majorization ordering ≺,
  of matrices, 352, 425
  of vectors, 144, 352
matrix inverse, 13, 16, 159
matrix mean φp, 135, 140, 178, 200, 206, 246, 343, 413
  as information function or norm, 151
  gradient ∇φp(C), 179
  polar function φp∞ = s·φq, 149
  simultaneous optimality, 203
  weighted, 157
matrix mean optimality, 178, 196, 241, 260, 376
  for a component subset (θ1, ..., θs)', 203
  for a rank deficient subsystem, 88, 205, 364, 371, 404
matrix modulus |C|, 134, 141, 150, 156
matrix outer product, 190
matrix power, 140, 371
maximization of information, 63, 131
  versus minimization of variance, 41, 155
mean squared-error matrix, 65, 269
measures of rotatability, 405, 427
minimum variance unbiased linear estimator, see optimal estimator
mixture,
  of information functions, 285, 289
  of models, 283, 288
model-building, 62, 72, 382, 406
model discrimination, 279, 423
model matrix X, 3, 27, 409
model response surface t ↦ f(t)'θ, 382
model robust design, 423
model variance σ², 3, 69
moment matrix M(ξ), 26, 131, 232, 383, 395, 409
  as information matrix for θ, 63
  classical m.m. Md(τ) for dth-degree model, 32, 213, 251
  eigenvectors of an optimal m.m., 56
  formal φ-optimality for K'θ in M, 131, 174
  Loewner comparison, 101, 251
  Loewner optimality for K'θ in M, 101
  maximum range and rank, 99, 105
  optimality for c'θ in M(Ξ), 41
  reduced by reparametrization, 111
moment matrix set,
  M(Ξ), of all designs, 29, 102, 177, 252, 383
  M(Ξn), of designs for sample size n, 29
  M, of competing moment matrices, 98, 102, 107, 177
  Mf(T), induced by regression function f, 383
moment set, 251
monotonicity, 13, 357
Moore-Penrose inverse A⁺, 186, 204, 372, 401, 416
multiple linear regression over [0;1]k, 192, 372, 426
multiplicity of optimal moment matrices, 201, 207, 372, 378
multiplier method of apportionment, 307, 424
multipurpose design, 423
multiway classification model, 5, 372, 379, 426
mutual boundedness theorem,
  for a scalar criterion, 45, 231
  for an information function, 171, 174, 261
negative part of a symmetric matrix C-, 141, 150
nonexistence examples, 104, 174, 412
nonnegative definite matrix A ≥ 0, 9, 52, 141
  closed cone NND(k), 10
nonnegativity, 116
norm, 13, 54, 124, 126, 134, 151, 353, 356
normal equations, 66, 412
normal vector to a convex set, 159, 258
normal-gamma distribution, 272
normality inequality, 160, 175, 192, 258
  under a rank deficient matrix mean, 205
notational conventions θ, σ²; Y, y; A, aij; t, T; τ, T; x, X; ξ, Ξ, 27
nullspace and range, 13
numerical rounding, 306
one-point design, 55, 257
one-way classification model, 5, 30
optimal estimator, 24, 36, 41, 65
optimal variance, 41, 131, 233, 241
optimal weights,
  on arbitrary support points, 199
  on linearly independent regression vectors, 195
optimality criterion φ, 114, 256, 292, 412
order preservation or reversal, 13, 65, 91
orthodiagonal projector, see centering matrix
orthogonal design, 390
orthogonal group Orth(m), 335, 342, 351, 384
orthogonal projector P = P² = P', 24, 71, 349
orthogonality,
  of two nonnegative definite matrices, 153
  of two subspaces, 14, 154
parabola fit model, 6, 32, 138, 173, 184, 246, 266, 331, 340, 359, 417
parameter domain Θ, 1, 19
parameter orthogonality, 76, 411
parameter system of interest K'θ, 19, 35, 61
  component subset (θ1, ..., θs)', 36, 73, 82, 203
  maximal, 207, 371
  rank deficient, 88, 205, 364, 371, 404
  scalar c'θ, 47, 137, 170, 410, 419, 420
parameter vector θ ∈ Θ, 1, 3
partial ordering, 12, 144, 265
penumbra P, 107, 134
permutation group Perm(k), 143, 335, 351, 352, 362, 373
polar function φ∞, 126, 149, 285, 351, 413
polarity equation, 132, 168, 175, 276, 324
  for a matrix mean φp, 154
  for a vector mean Φp, 157
polarity theorem, 127
polynomial, 213, 249, 328
polynomial fit model, 6, 32, 251
  average-variance optimal design, 223, 241
  determinant optimal design, 213, 217, 243, 320
  formal trace optimal design, 174
  scalar optimal design, 229, 237
  smallest-eigenvalue optimal design, 232, 316
positive definite matrix A > 0, 9
  open cone PD(k), 9
positive homogeneity, 115, 119, 292
positive part of a symmetric matrix C+, 141, 150
positive vector λ > 0, 139, 262
positivity, 100, 117
posterior precision increase β1(y), 273
power vector f(t) = (1, t, ..., td)', 213, 249, 336
precision matrix, 25, 65, 409
preordering, 144, 353, 356
  criterion induced, 117, 136
prior distribution, 268, 272
prior moment matrix M0, 271, 275
prior precision β0, 272
prior sample size n0, 269
product design rs', 100, 105, 206, 364
projector P = P², 17, 23, 52, 156
  onto invariant symmetric matrices, 349
  orthogonal P = P² = P', 33, 71
proof arrangement for the theorems of Elfving, Gauss-Markov, Kiefer-Wolfowitz, 51, 212
proportional frequency design, see product design
protected runs, 277, 422
pseudoquotas, 307
quadratic subspace of symmetric matrices, 372, 425
quasi-linearization, 76, 129, 413
quota nwi, 306
quota method of apportionment, 306, 318, 424
range and nullspace, 13, 15
range inclusion condition, 16, 17, 21, 411
range summation lemma, 37
rank and nullity, 13
rank deficient matrix mean, 88, 205, 364, 371, 404
rational weights wi = ni/n, 312
rearrangement, 145
recession to infinity, 44, 110, 120
reduced normal equations, 66, 412
reflection, 332, 341, 397
reflexivity, 12, 144, 353, 356
regression function f : T → X, 2, 381
regression range X ⊆ Rk, 2
  symmetrized r.r. X ∪ (-X), 43, 191
regression space span(X) ⊆ Rk, 42, 47
regression surface x ↦ x'θ, 211, 222
regression vectors x ∈ X, 2, 25
  linearly independent, 195, 213, 223, 417
regular simplex design, 391, 402
regularization, 99, 119, 412
relative interior, 44
reparametrization, 88, 137, 160, 205
  inappropriateness, 411
residual projector R = In - P, 23
residual sum of squares, 69
response surface, 382
response vector Y, 3
risk function, 265
rotatability, 335, 384, 394, 426
  bad terminology for designs, 385
  determining class, 386, 395
  measures of r., 405, 427
  of first-degree model, 386
  of second-degree model, 394, 400
rounding function R(z), 307
row sum vector, see treatment replication vector
sample size monotonicity, 306, 424
sampling weight α = n/(n0 + n), 272, 275
saturated models, 7
scalar criterion c'θ, 133, 170, 182, 410, 420
  in polynomial fit models, 229, 237
  on linearly independent regression vectors, 197, 230
Schur complement A11 - A12A22⁻A21, 75, 82, 92, 263, 274, 284, 411
Schur inequality δ(C) ≺ λ(C), 146
second-degree model, 280, 293, 299, 395, 403
separating hyperplane theorem, 128, 162, 410
shorted operator, 411
sign-change group Sign(s), 142, 335, 345, 351, 386, 413
simplex design, 391, 402
simultaneous optimality,
  under all invariant information functions, 349
  under all matrix means φp, p ∈ [-∞; 1], 203
  under some scalar criteria, 102
smallest-eigenvalue criterion φ-∞, 119, 135, 153, 158, 180, 183, 232, 257, 316, 404, 413, 420
square root decomposition of a nonnegative definite matrix V = UU', 15, 153
standardization,
  of a design for sample size n, 305
  of an optimality criterion, φ(Is) = 1, 117
stochastic vector, 262
strict concavity, 116, 201, 353
strict isotonicity, 117
strict superadditivity, 116
subdifferential ∂φ(M), 159, 167
subgradient, 158, 163, 287
subgradient inequality, 158, 164, 317, 414
subgradient theorem, 162, 268
superadditivity, 77, 115
support point x ∈ supp ξ, 26, 191
support set supp ξ, 26, 265, 312
  bound for the size, 188, 417
  excessive size, 378, 391
  minimal size, 195, 322, 391
supporting hyperplane theorem, 49, 107, 110
surjective (i.e. onto), 123, 335
symmetric design problem, 331
symmetric matrix C ∈ Sym(s), 13, 345, 372
  diagonality, 142
  modulus |C|, 150, 156
  positive and negative parts C+, C-, 141, 150
symmetric three-point design, 334
T-criterion, see trace criterion, 119
Taylor theorem, 61, 215
testability, 67, 72
testing and estimation problems, 340
third-degree model, 7, 209, 280, 293, 299
total variation distance, 306, 424
trace criterion φ1, 118, 135, 138, 140, 153, 173, 240, 258, 369, 404, 413, 422
  geometric meaning, 258
trace operator, 7
transitivity, 12, 144, 353, 356
transposition ', 13, 185, 351
treatment concurrence matrix NN', 367, 426
treatment relabeling group, 362, 364, 368
treatment replication vector r, 31, 94, 264, 426
trigonometric fit model, 58, 138, 241, 249, 359
trivial group {Is}, 350, 357
two-point design, 83
two-sample problem, 4, 30
two-way classification model, 5, 30, 88, 93, 97, 138, 206, 249, 262, 362, 411
unbiasedness, 19, 36
uniform distribution on a sphere, 389, 400
uniform optimality, see Loewner optimality
uniform weights wi = 1/ℓ, 94, 100, 201, 213, 320
unit level set, 120
unity vector 1a, 30, 36, 88, 94, 140
universal optimality, see Kiefer optimality
upper level set, 78, 118
upper semicontinuity, 77, 118, 412
Vandermonde matrix, 33, 253, 347
variance surface t ↦ f(t)'M⁻f(t), 382
vec-permutation matrix, 394
vector mean Φp, 121, 139, 146, 157, 284, 413, 423
vector of totals T, 66
vectorization operator vec(ts') = s ⊗ t, 393
vertex design, 373