
Statistical and Econometric Methods for Transportation Data Analysis

Simon P. Washington, Matthew G. Karlaftis, Fred L. Mannering

CHAPMAN & HALL/CRC
A CRC Press Company

Boca Raton   London   New York   Washington, D.C.


Cover Images: Left, 7th and Marquette during Rush Hour, photo copyright 2002 Chris Gregerson, www.phototour.minneapolis.mn.us. Center, Route 66, and right, Central Albuquerque, copyright Marble Street Studio, Inc., Albuquerque, NM.

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.

Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2003 by Chapman & Hall/CRC

No claim to original U.S. Government works
International Standard Book Number 1-58488-030-9
Library of Congress Card Number 2003046163
Printed in the United States of America  1 2 3 4 5 6 7 8 9 0

Printed on acid-free paper

Library of Congress Cataloging-in-Publication Data

Washington, Simon P.
Statistical and econometric methods for transportation data analysis / Simon P. Washington, Matthew G. Karlaftis, Fred L. Mannering.
p. cm.
Includes bibliographical references and index.
ISBN 1-58488-030-9 (alk. paper)
1. Transportation--Statistical methods. 2. Transportation--Econometric models. I. Karlaftis, Matthew G. II. Mannering, Fred L. III. Title.
HE191.5.W37 2003
388'.01'5195--dc21          2003046163


Dedication

To Tracy, Samantha, and Devon — S.P.W.

To Amy, George, and John — M.G.K.

To Jill, Willa, and Freyda — F.L.M.


Preface

Transportation plays an essential role in developed and developing societies. Transportation is responsible for personal mobility; provides access to services, jobs, and leisure activities; and is integral to the delivery of consumer goods. Regional, state, national, and world economies depend on the efficient and safe functioning of transportation facilities and infrastructure.

Because of the sweeping influence transportation has on economic and social aspects of modern society, transportation issues pose challenges to professionals across a wide range of disciplines, including transportation engineering, urban and regional planning, economics, logistics, systems and safety engineering, social science, law enforcement and security, and consumer theory. Where to place and expand transportation infrastructure, how to operate and maintain infrastructure safely and efficiently, and how to spend valuable resources to improve mobility, access to goods, services, and health care are among the decisions made routinely by transportation-related professionals.

Many transportation-related problems and challenges involve stochastic processes, which are influenced by observed and unobserved factors in unknown ways. The stochastic nature of transportation problems is largely a result of the role that people play in transportation. Transportation system users routinely face decisions in transportation contexts, such as which transportation mode to use, which vehicle to purchase, whether or not to participate in a vanpool or to telecommute, where to relocate a residence or business, whether to support a proposed light-rail project, and whether or not to utilize traveler information before or during a trip. These decisions involve various degrees of uncertainty. Transportation system managers and governmental agencies face similar stochastic problems in determining how to measure and compare system performance, where to invest in safety improvements, how to operate transportation systems efficiently, and how to estimate transportation demand.

The complexity, diversity, and stochastic nature of transportation problems require that transportation analysts have an extensive set of analytical tools in their toolbox. Statistical and Econometric Methods for Transportation Data Analysis describes and illustrates some of these tools commonly used for transportation data analysis.

Every book must strike an appropriate balance between depth and breadth of theory and applications, given the intended audience. Statistical and Econometric Methods for Transportation Data Analysis targets two general audiences. First, it serves as a textbook for advanced undergraduate, master's, and Ph.D. students in transportation-related disciplines, including engineering, economics, urban and regional planning, and sociology.


There is sufficient material to cover two 3-unit semester courses in analytical methods. Alternatively, a one-semester course could consist of a subset of the topics covered in this book. The publisher's Web site, www.crcpress.com, contains the data sets used to develop this book so that applied modeling problems will reinforce the modeling techniques discussed throughout the text. Second, the book serves as a technical reference for researchers and practitioners wishing to examine and understand a broad range of analytical tools required to solve transportation problems. It provides a wide breadth of transportation examples and case studies, covering applications in various aspects of transportation planning, engineering, safety, and economics. Sufficient analytical rigor is provided in each chapter so that fundamental concepts and principles are clear, and numerous references are provided for those seeking additional technical details and applications.

The first section of the book provides statistical fundamentals (Chapters 1 and 2). This section is useful for refreshing readers regarding the fundamentals and for sufficiently preparing them for the sections that follow.

The second section focuses on continuous dependent variable models. The chapter on linear regression (Chapter 3) devotes a few extra pages to introducing common modeling practice (examining residuals, creating indicator variables, and building statistical models) and thus serves as a logical starting chapter for readers new to statistical modeling. Chapter 4 discusses the impacts of failing to meet linear regression assumptions and presents corresponding solutions. Chapter 5 deals with simultaneous equation models and presents modeling methods appropriate when studying two or more interrelated dependent variables. Chapter 6 presents methods for analyzing panel data, that is, data obtained from repeated observations on sampling units over time, such as household surveys conducted several times on a sample of households. When data are collected continuously over time, such as hourly, daily, weekly, or yearly, time-series methods and models are often applied (Chapter 7). Latent variable models, presented in Chapter 8, are used when the dependent variable is not directly observable and is approximated with one or more surrogate variables. The final chapter in this section presents duration models, which are used to model time-until-event data such as survival, hazard, and decay processes.

The third section presents count and discrete dependent variable models. Count models (Chapter 10) arise when the data of interest are non-negative integers. Examples of such data include vehicles in a queue and the number of vehicular crashes per unit time. Discrete outcome models, which are extremely useful in many transportation applications, are described in Chapter 11. A unique feature of the book is that discrete outcome models are first derived statistically and then related to economic theories of consumer choice. Discrete/continuous models, presented in Chapter 12, demonstrate that interrelated discrete and continuous data need to be modeled as a system rather than individually, such as the choice of which vehicle to drive and how far it will be driven.


The appendices are complementary to the text of the book. Appendix A presents the fundamental concepts in statistics that support the analytical methods discussed. Appendix B is an alphabetical glossary of commonly used statistical terms, serving as a quick and easy reference. Appendix C provides tables of the probability distributions used in the book. Finally, Appendix D describes typical uses of data transformations common to many statistical methods.

Although the book covers a wide variety of analytical tools for improving the quality of research, it does not attempt to teach all elements of the research process. Specifically, the development and selection of useful research hypotheses, alternative experimental design methodologies, the virtues and drawbacks of experimental vs. observational studies, and some technical issues involved with the collection of data, such as sample size calculations, are not discussed. These issues are crucial elements in the conduct of research and can have a drastic impact on the overall results and quality of the research endeavor. It is considered a prerequisite that readers of this book be educated and informed on these critical research elements so that they can appropriately apply the analytical tools presented here.


Contents

Part I   Fundamentals

1  Statistical Inference I: Descriptive Statistics
   1.1  Measures of Relative Standing
   1.2  Measures of Central Tendency
   1.3  Measures of Variability
   1.4  Skewness and Kurtosis
   1.5  Measures of Association
   1.6  Properties of Estimators
        1.6.1  Unbiasedness
        1.6.2  Efficiency
        1.6.3  Consistency
        1.6.4  Sufficiency
   1.7  Methods of Displaying Data
        1.7.1  Histograms
        1.7.2  Ogives
        1.7.3  Box Plots
        1.7.4  Scatter Diagrams
        1.7.5  Bar and Line Charts

2  Statistical Inference II: Interval Estimation, Hypothesis Testing, and Population Comparisons
   2.1  Confidence Intervals
        2.1.1  Confidence Interval for μ with Known σ²
        2.1.2  Confidence Interval for the Mean with Unknown Variance
        2.1.3  Confidence Interval for a Population Proportion
        2.1.4  Confidence Interval for the Population Variance
   2.2  Hypothesis Testing
        2.2.1  Mechanics of Hypothesis Testing
        2.2.2  Formulating One- and Two-Tailed Hypothesis Tests
        2.2.3  The p-Value of a Hypothesis Test
   2.3  Inferences Regarding a Single Population
        2.3.1  Testing the Population Mean with Unknown Variance
        2.3.2  Testing the Population Variance
        2.3.3  Testing for a Population Proportion
   2.4  Comparing Two Populations
        2.4.1  Testing Differences between Two Means: Independent Samples
        2.4.2  Testing Differences between Two Means: Paired Observations
        2.4.3  Testing Differences between Two Population Proportions
        2.4.4  Testing the Equality of Two Population Variances
   2.5  Nonparametric Methods
        2.5.1  The Sign Test
        2.5.2  The Median Test
        2.5.3  The Mann–Whitney U Test
        2.5.4  The Wilcoxon Signed-Rank Test for Matched Pairs
        2.5.5  The Kruskal–Wallis Test
        2.5.6  The Chi-Square Goodness-of-Fit Test

Part II   Continuous Dependent Variable Models

3  Linear Regression
   3.1  Assumptions of the Linear Regression Model
        3.1.1  Continuous Dependent Variable Y
        3.1.2  Linear-in-Parameters Relationship between Y and X
        3.1.3  Observations Independently and Randomly Sampled
        3.1.4  Uncertain Relationship between Variables
        3.1.5  Disturbance Term Independent of X and Expected Value Zero
        3.1.6  Disturbance Terms Not Autocorrelated
        3.1.7  Regressors and Disturbances Uncorrelated
        3.1.8  Disturbances Approximately Normally Distributed
        3.1.9  Summary
   3.2  Regression Fundamentals
        3.2.1  Least Squares Estimation
        3.2.2  Maximum Likelihood Estimation
        3.2.3  Properties of OLS and MLE Estimators
        3.2.4  Inference in Regression Analysis
   3.3  Manipulating Variables in Regression
        3.3.1  Standardized Regression Models
        3.3.2  Transformations
        3.3.3  Indicator Variables
               3.3.3.1  Estimate a Single Beta Parameter
               3.3.3.2  Estimate Beta Parameter for Ranges of the Variable
               3.3.3.3  Estimate a Single Beta Parameter for m – 1 of the m Levels of the Variable
        3.3.4  Interactions in Regression Models
   3.4  Checking Regression Assumptions
        3.4.1  Linearity
        3.4.2  Homoscedastic Disturbances
        3.4.3  Uncorrelated Disturbances
        3.4.4  Exogenous Independent Variables
        3.4.5  Normally Distributed Disturbances
   3.5  Regression Outliers
        3.5.1  The Hat Matrix for Identifying Outlying Observations
        3.5.2  Standard Measures for Quantifying Outlier Influence
        3.5.3  Removing Influential Data Points from the Regression
   3.6  Regression Model Goodness-of-Fit Measures
   3.7  Multicollinearity in the Regression
   3.8  Regression Model-Building Strategies
        3.8.1  Stepwise Regression
        3.8.2  Best Subsets Regression
        3.8.3  Iteratively Specified Tree-Based Regression
   3.9  Logistic Regression
   3.10 Lags and Lag Structure
   3.11 Investigating Causality in the Regression
   3.12 Limited Dependent Variable Models
   3.13 Box–Cox Regression
   3.14 Estimating Elasticities

4  Violations of Regression Assumptions
   4.1  Zero Mean of the Disturbances Assumption
   4.2  Normality of the Disturbances Assumption
   4.3  Uncorrelatedness of Regressors and Disturbances Assumption
   4.4  Homoscedasticity of the Disturbances Assumption
        4.4.1  Detecting Heteroscedasticity
        4.4.2  Correcting for Heteroscedasticity
   4.5  No Serial Correlation in the Disturbances Assumption
        4.5.1  Detecting Serial Correlation
        4.5.2  Correcting for Serial Correlation
   4.6  Model Specification Errors

5  Simultaneous Equation Models
   5.1  Overview of the Simultaneous Equations Problem
   5.2  Reduced Form and the Identification Problem
   5.3  Simultaneous Equation Estimation
        5.3.1  Single-Equation Methods
        5.3.2  System Equation Methods
   5.4  Seemingly Unrelated Equations
   5.5  Applications of Simultaneous Equations to Transportation Data
   Appendix 5A: A Note on Generalized Least Squares Estimation

6  Panel Data Analysis
   6.1  Issues in Panel Data Analysis
   6.2  One-Way Error Component Models
        6.2.1  Heteroscedasticity and Serial Correlation
   6.3  Two-Way Error Component Models
   6.4  Variable Coefficient Models
   6.5  Additional Topics and Extensions

7  Time-Series Analysis
   7.1  Characteristics of Time Series
        7.1.1  Long-Term Movements
        7.1.2  Seasonal Movements
        7.1.3  Cyclic Movements
        7.1.4  Irregular or Random Movements
   7.2  Smoothing Methodologies
        7.2.1  Simple Moving Averages
        7.2.2  Exponential Smoothing
   7.3  The ARIMA Family of Models
        7.3.1  The ARIMA Models
        7.3.2  Estimating ARIMA Models
   7.4  Nonlinear Time-Series Models
        7.4.1  Conditional Mean Models
        7.4.2  Conditional Variance Models
        7.4.3  Mixed Models
        7.4.4  Regime Models
   7.5  Multivariate Time-Series Models
   7.6  Measures of Forecasting Accuracy

8  Latent Variable Models
   8.1  Principal Components Analysis
   8.2  Factor Analysis
   8.3  Structural Equation Modeling
        8.3.1  Basic Concepts in Structural Equation Modeling
        8.3.2  The Structural Equation Model
        8.3.3  Non-Ideal Conditions in the Structural Equation Model
        8.3.4  Model Goodness-of-Fit Measures
        8.3.5  Guidelines for Structural Equation Modeling

9  Duration Models
   9.1  Hazard-Based Duration Models
   9.2  Characteristics of Duration Data
   9.3  Nonparametric Models
   9.4  Semiparametric Models
   9.5  Fully Parametric Models
   9.6  Comparisons of Nonparametric, Semiparametric, and Fully Parametric Models
   9.7  Heterogeneity
   9.8  State Dependence
   9.9  Time-Varying Covariates
   9.10 Discrete-Time Hazard Models
   9.11 Competing Risk Models

Part III   Count and Discrete Dependent Variable Models

10  Count Data Models
    10.1  Poisson Regression Model
    10.2  Poisson Regression Model Goodness-of-Fit Measures
    10.3  Truncated Poisson Regression Model
    10.4  Negative Binomial Regression Model
    10.5  Zero-Inflated Poisson and Negative Binomial Regression Models
    10.6  Panel Data and Count Models

11  Discrete Outcome Models
    11.1  Models of Discrete Data
    11.2  Binary and Multinomial Probit Models
    11.3  Multinomial Logit Model
    11.4  Discrete Data and Utility Theory
    11.5  Properties and Estimation of Multinomial Logit Models
          11.5.1  Statistical Evaluation
          11.5.2  Interpretation of Findings
          11.5.3  Specification Errors
          11.5.4  Data Sampling
          11.5.5  Forecasting and Aggregation Bias
          11.5.6  Transferability
    11.6  Nested Logit Model (Generalized Extreme Value Model)
    11.7  Special Properties of Logit Models
    11.8  Mixed MNL Models
    11.9  Models of Ordered Discrete Data

12  Discrete/Continuous Models
    12.1  Overview of the Discrete/Continuous Modeling Problem
    12.2  Econometric Corrections: Instrumental Variables and Expected Value Method
    12.3  Econometric Corrections: Selectivity-Bias Correction Term
    12.4  Discrete/Continuous Model Structures

Appendix A: Statistical Fundamentals
Appendix B: Glossary of Terms
Appendix C: Statistical Tables
Appendix D: Variable Transformations

References


Part I

Fundamentals


1 Statistical Inference I: Descriptive Statistics

This chapter examines methods and techniques for summarizing and interpreting data. The discussion begins by examining numerical descriptive measures. These measures, commonly known as point estimators, enable inferences about a population by estimating the value of an unknown population parameter using a single value (or point). This chapter also overviews graphical representations of data. Relative to graphical methods, numerical methods provide precise and objectively determined values that can easily be manipulated, interpreted, and compared. They permit a more careful analysis of the data than the more general impressions conveyed by graphical summaries. This is important when the data represent a sample from which inferences must be made concerning the entire population.

Although this chapter concentrates on the most basic and fundamental issues of statistical analysis, there are countless thorough introductory statistical textbooks that can provide the interested reader with greater detail. For example, Aczel (1993) and Keller and Warrack (1997) provide detailed descriptions and examples of descriptive statistics and graphical techniques. Tukey (1977) is the classical reference on exploratory data analysis and graphical techniques. For readers interested in the properties of estimators (Section 1.6), the books by Gujarati (1992) and Baltagi (1998) are excellent and fairly mathematically rigorous.

1.1 Measures of Relative Standing

A set of numerical observations can be ordered from smallest to largest magnitude. This ordering allows the boundaries of the data to be defined and allows for comparisons of the relative position of specific observations. If an observation is in the 90th percentile, for example, then 90% of the observations have a lower magnitude. Consider the usefulness of percentile rank in terms of a nationally administered test such as the Scholastic Aptitude Test (SAT) or Graduate Record Exam (GRE). An individual's score on the test is compared with the scores of all people who took the test at the same time, and the relative position within the group is defined in terms of a percentile. If, for example, the 80th percentile of GRE scores is 660, this means that 80% of the sample of individuals who took the test scored below 660 and 20% scored 660 or better. A percentile is defined as that value below which lies P% of the numbers in the sample. For sufficiently large samples, the position of the Pth percentile is given by (n + 1)P/100, where n is the sample size.

Quartiles are the percentage points that separate the data into quarters: the first quartile, below which lies one quarter of the data, is the 25th percentile; the second quartile, or 50th percentile, is the point below which lies half of the data; and the third quartile is the 75th percentile point. The 25th percentile is often referred to as the lower or first quartile, the 50th percentile as the median or middle quartile, and the 75th percentile as the upper or third quartile. Finally, the interquartile range, a measure of the spread of the data, is defined as the difference between the first and third quartiles.
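These quantities are one-liners in NumPy. The sketch below uses a small hypothetical speed sample; note that np.percentile interpolates percentile positions slightly differently from the (n + 1)P/100 rule above, so results on small samples can differ at the margins:

```python
import numpy as np

# Hypothetical sample of spot speeds (mph); any numeric sample works.
speeds = np.array([52.1, 55.3, 56.4, 57.0, 58.2, 58.5, 59.1, 60.7, 61.5, 64.8])

q1, median, q3 = np.percentile(speeds, [25, 50, 75])  # the three quartiles
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
```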

1.2 Measures of Central Tendency

Quartiles and percentiles are measures of the relative positions of points within a given data set. The median constitutes a useful point because it lies in the center of the data, with half of the data points lying above it and half below. Thus, the median constitutes a measure of the centrality of the observations.

Despite the existence of the median, by far the most popular and useful measure of central tendency is the arithmetic mean, or, more succinctly, the mean. The sample mean or expectation is a statistical term that describes the central tendency, or average, of a sample of observations, and it varies across samples. The mean of a sample of measurements $x_1, x_2, \ldots, x_n$ is defined as

$$MEAN(X) = E(X) = \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n},  \qquad (1.1)$$

where n is the size of the sample.

When an entire population constitutes the set to be examined, the sample mean $\bar{X}$ is replaced by $\mu$, the population mean. Unlike the sample mean, the population mean is constant. The formula for the population mean is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N},  \qquad (1.2)$$

where N is the number of observations in the entire population.


The mode (or modes, because it is possible to have more than one) of a set of observations is the value that occurs most frequently, or the most commonly occurring outcome, and strictly applies to discrete variables (nominal and ordinal scale variables) as well as count data. Probabilistically, it is the most likely outcome in the sample; it has occurred more than any other value.

It is useful to examine the advantages and disadvantages of each of the three measures of central tendency. The mean uses and summarizes all of the information in the data, is a single numerical measure, and has some desirable mathematical properties that make it useful in many statistical inference and modeling applications. The median, in contrast, is the central-most (center) point of ranked data. When computing the median, the exact locations of data points on the number line are not considered; only their relative standing with respect to the central observation is required. Herein lies the major advantage of the median: it is resistant to extreme observations or outliers in the data. The mean is, overall, the most frequently used measure of central tendency; in cases where the data contain numerous outlying observations, however, the median may serve as a more reliable measure of central tendency.

If the sample data are measured on the interval or ratio scale, then all three measures of centrality (mean, median, and mode) make sense, provided that the level of measurement precision does not preclude the determination of a mode. If the data are symmetric and the distribution of the observations has only one mode, then the mode, the median, and the mean are all approximately equal (the relative positions of the three measures in cases of asymmetric distributions are discussed in Section 1.4). Finally, if the data are qualitative (measured on the nominal or ordinal scales), using the mean or median is senseless, and the mode must be used. For nominal data, the mode is the category that contains the largest number of observations.
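For interval- or ratio-scale data, all three measures are available in Python's standard library. A minimal sketch on hypothetical data:

```python
import statistics

speeds = [52.1, 55.3, 56.4, 57.0, 58.2, 58.5, 58.5, 60.7, 61.5, 64.8]  # hypothetical

mean = statistics.mean(speeds)      # uses all the information in the data
median = statistics.median(speeds)  # resistant to extreme observations
mode = statistics.mode(speeds)      # most frequently occurring value (58.5 here)

print(f"mean = {mean:.2f}, median = {median:.2f}, mode = {mode}")
```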

1.3 Measures of Variability

Variability is a statistical term used to describe and quantify the spread or dispersion of data around the center, usually the mean. In most practical situations, knowing the average or expected value of a sample is not sufficient to obtain an adequate understanding of the data. Sample variability provides a measure of how dispersed the data are with respect to the mean (or other measures of central tendency). Figure 1.1 illustrates two distributions of data, one that is highly dispersed and another that is more tightly packed around the mean. There are several useful measures of variability, or dispersion. One measure previously discussed is the interquartile range. Another measure is the range, which is equal to the difference between the largest and the smallest observations in the data. The range and the interquartile range are measures of the dispersion of a set of observations, with the interquartile range more resistant to outlying observations. The two most frequently used measures of dispersion are the variance and its square root, the standard deviation.

The variance and the standard deviation are more useful than the range because, like the mean, they use the information contained in all the observations. The variance of a set of observations, or sample variance, is the average squared deviation of the individual observations from the mean, and it varies across samples. The sample variance is commonly used as an estimate of the population variance and is given by

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}.  \qquad (1.3)$$

When a collection of observations constitutes an entire population, the variance is denoted by $\sigma^2$. Unlike the sample variance, the population variance is constant and is given by

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N},  \qquad (1.4)$$

where $\bar{X}$ in Equation 1.3 is replaced by $\mu$.

Because calculation of the variance involves squaring the original measurements, the measurement units of the variance are the square of the original measurement units. While the variance is a useful measure of the relative variability of two sets of measurements, it is often preferable to express variability in the same units as the original measurements. Such a measure is obtained by taking the square root of the variance, yielding the standard deviation. The formulas for the sample and population standard deviations are given, respectively, as

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}  \qquad (1.5)$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}.  \qquad (1.6)$$

FIGURE 1.1 Examples of high and low variability data. [figure: two distributions, one with low variability and one with high variability]

Consistent with previous results, the sample standard deviation s is a random variable, whereas the population standard deviation $\sigma$ is a constant.

A mathematical theorem attributed to Chebyshev establishes a general rule, which states that at least $1 - 1/k^2$ of all observations in a sample or population will lie within k standard deviations of the mean, where k is not necessarily an integer. For the approximately bell-shaped normal distribution of observations, an empirical rule of thumb suggests the approximate percentage of measurements that will fall within 1, 2, or 3 standard deviations of the mean. These intervals are given as

$(\bar{X} - s,\ \bar{X} + s)$, which contains approximately 68% of the measurements;

$(\bar{X} - 2s,\ \bar{X} + 2s)$, which contains approximately 95% of the measurements; and

$(\bar{X} - 3s,\ \bar{X} + 3s)$, which contains approximately 99% of the measurements.

The standard deviation is an absolute measure of dispersion; it does not take into consideration the magnitude of the values in the population or sample. On some occasions, a measure of dispersion that accounts for the magnitudes of the observations (a relative measure of dispersion) is needed. The coefficient of variation is such a measure. It provides a relative measure of dispersion, where dispersion is given as a proportion of the mean. For a sample, the coefficient of variation (CV) is given as

$$CV = \frac{s}{\bar{X}}.  \qquad (1.7)$$

If, for example, on a certain highway section vehicle speeds were observed with mean $\bar{X}$ = 45 mph and standard deviation s = 15, then the CV is $s/\bar{X}$ = 15/45 = 0.33. If, on another highway section, the average vehicle speed is $\bar{X}$ = 65 mph with standard deviation s = 15, then the CV is $s/\bar{X}$ = 15/65 = 0.23, which is smaller and conveys the information that, relative to average vehicle speeds, the data in the first sample are more variable.
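The sample formulas above translate directly into code. A minimal Python sketch (the speed data are hypothetical) computing Equations 1.3, 1.5, and 1.7 from first principles:

```python
import math

def sample_variance(x):
    # Equation 1.3: average squared deviation from the mean, n - 1 divisor.
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

def sample_std(x):
    # Equation 1.5: square root of the sample variance.
    return math.sqrt(sample_variance(x))

def coefficient_of_variation(x):
    # Equation 1.7: dispersion expressed as a proportion of the mean.
    return sample_std(x) / (sum(x) / len(x))

speeds = [52.1, 55.3, 56.4, 57.0, 58.2, 58.5, 59.1, 60.7, 61.5, 64.8]  # mph, hypothetical
print(sample_variance(speeds), sample_std(speeds), coefficient_of_variation(speeds))
```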


Example 1.1

By using the speed data contained in the "speed data" file that can be downloaded from the publisher's Web site (www.crcpress.com), the basic descriptive statistics are sought for the speed data, regardless of the season, type of road, highway class, and year of observation. Any commercially available software with statistical capabilities can accommodate this type of exercise. Table 1.1 provides descriptive statistics for the speed variable.

The descriptive statistics indicate that the mean speed in the sample collected is 58.86 mph, with little variability in the speed observations (s is low at 4.41, while the CV is 0.075). The mean and median are almost equal, indicating that the distribution of the sample of speeds is fairly symmetric. The data set contains more information, such as the year of observation, the season (quarter), the highway class, and whether the observation was in an urban or rural area, which could give a more complete picture of the speed characteristics in this sample. For example, Table 1.2 examines the descriptive statistics for urban vs. rural roads.

Interestingly, although some of the descriptive statistics may seem to differ from those of the pooled sample examined in Table 1.1, it does not appear that the differences between mean speeds and speed variation on urban vs. rural Indiana roads are important. Similar types of descriptive statistics could be computed for other categorizations of average vehicle speed.

TABLE 1.1
Descriptive Statistics for Speeds on Indiana Roads

Statistic                      Value
N (number of observations)     1296
Mean                           58.86
Std. deviation                  4.41
Variance                       19.51
CV                              0.075
Maximum                        72.5
Minimum                        32.6
Upper quartile                 61.5
Median                         58.5
Lower quartile                 56.4

TABLE 1.2
Descriptive Statistics for Speeds on Rural vs. Urban Indiana Roads

Statistic                      Rural Roads    Urban Roads
N (number of observations)     888            408
Mean                           58.79          59.0
Std. deviation                  4.60           3.98
Variance                       21.19          15.87
CV                              0.078          0.067
Maximum                        72.5           68.2
Minimum                        32.6           44.2
Upper quartile                 60.7           62.2
Median                         58.2           59.2
Lower quartile                 56.4           56.15

1.4 Skewness and Kurtosis

Two additional attributes of a frequency distribution that are useful are skewness and kurtosis. Skewness is a measure of the degree of asymmetry of a frequency distribution. It is given as the average value of $(x_i - \mu)^3$ over the entire population (this is often called the third moment around the mean, or third central moment, with variance the second moment). In general, when the distribution stretches to the right more than it does to the left, the distribution is right-skewed, or positively skewed. Similarly, a left-skewed (negatively skewed) distribution is one that stretches asymmetrically to the left (Figure 1.2). When a distribution is right-skewed, the mean is to the right of the median, which in turn is to the right of the mode. The opposite is true for left-skewed distributions. To make the measure $(x_i - \mu)^3$ independent of the units of measurement of the variable, it is divided by $\sigma^3$. This results in the population skewness parameter, often symbolized as $\gamma_1$. The sample estimate of this parameter, $g_1$, is given as

$$g_1 = \frac{m_3}{(m_2)^{3/2}},  \qquad (1.8)$$

where

$$m_3 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^3}{n}, \qquad m_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n}.$$

FIGURE 1.2 Skewness of a distribution. [figure: a left-skewed distribution (mean < median < mode), a symmetric distribution (mean = median = mode), and a right-skewed distribution (mode < median < mean)]

If a sample comes from a population that is normally distributed, then the parameter $g_1$ is normally distributed with mean 0 and standard deviation $\sqrt{6/n}$.

Kurtosis is a measure of the "flatness" (vs. peakedness) of a frequency distribution and is shown in Figure 1.3; it is the average value of $(x_i - \mu)^4$ divided by the fourth power of the standard deviation over the entire population or sample. Kurtosis ($\gamma_2$) is often called the fourth moment around the mean or fourth central moment. For the normal distribution the parameter $\gamma_2$ has a value of 3. If the parameter is larger than 3 there is usually a clustering of points around the mean (leptokurtic distribution), whereas if the parameter is lower than 3 the curve demonstrates a "flatter" peak than the normal distribution (platykurtic).

The sample kurtosis parameter, $g_2$, is often reported as standard output of many statistical software packages and is given as

$$g_2 = \frac{m_4}{(m_2)^2},  \qquad (1.9)$$

where

$$m_4 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^4}{n}.$$

For most practical purposes, a value of 3 is subtracted from the sample kurtosis parameter so that leptokurtic sample distributions have positive kurtosis parameters, whereas platykurtic sample distributions have negative kurtosis parameters.

FIGURE 1.3 Kurtosis of a distribution. [figure: a peaked (leptokurtic) and a flat (platykurtic) distribution]


Example 1.2

Revisiting the speed data from Example 1.1, there is interest in determining the shape of the distributions of speeds on rural and urban Indiana roads. Results indicate that when all roads are examined together, the skewness parameter is –0.05, whereas for rural roads the parameter has the value of 0.056 and for urban roads the value of –0.37. It appears that, at least on rural roads, the distribution of speeds is symmetric, whereas for urban roads the distribution is left-skewed.

Although the skewness parameter is similar for the two types of roads, the kurtosis parameter varies more widely. For rural roads the parameter has a value of 2.51, indicating a distribution close to normal, whereas for urban roads the parameter has a value of 0.26, indicating a relatively flat (platykurtic) distribution.
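Sample skewness (Equation 1.8) and the excess-kurtosis convention noted above (g2 with 3 subtracted) are available in SciPy. A minimal sketch on simulated data, not the book's speed file:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)  # a strongly right-skewed sample

g1 = stats.skew(sample)                   # Equation 1.8: m3 / m2**1.5
g2 = stats.kurtosis(sample, fisher=True)  # m4 / m2**2 with 3 already subtracted

# For an exponential distribution the theoretical values are 2 and 6.
print(f"skewness = {g1:.2f}, excess kurtosis = {g2:.2f}")
```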

1.5 Measures of Association

To this point the discussion has focused on measures that summarize a set of raw data. These measures are effective for providing information regarding individual variables. The mean and the standard deviation of a variable, for example, convey useful information regarding the nature of the measurements related to that variable. However, the measures reviewed thus far do not provide information regarding possible relationships between variables. The correlation between two random variables is a measure of the linear relationship between them. The correlation parameter is a commonly used measure of linear association and gives a quantitative measure of how well two variables move together.

The correlation parameter $\rho$ is always in the interval [–1, 1]. When $\rho = 0$ there is no linear association, meaning that a linear relationship does not exist between the two variables examined. It is possible, however, for two variables to be nonlinearly related and yet have $\rho = 0$. When $\rho > 0$, there is a positive linear relationship between the variables examined, such that when one of the variables increases the other variable also increases, at a rate given by the value of $\rho$ (Figure 1.4). In the case when $\rho = 1$, there is a "perfect" positively sloped straight-line relationship between two variables. When $\rho < 0$ there is a negative linear relationship between the two variables examined, such that an increase in one variable is associated with a decrease in the value of the other variable, at a rate given by the value of $\rho$. In the case when $\rho = -1$ there is a perfect negatively sloped straight-line relationship between two variables.

The concept of correlation stems directly from another measure of association, the covariance. Consider two random variables, X and Y, both normally distributed with population means $\mu_X$ and $\mu_Y$, and population standard deviations $\sigma_X$ and $\sigma_Y$, respectively. The population and sample covariances between X and Y are defined, respectively, as follows:

$$COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N}  \qquad (1.10)$$

$$COV(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}.  \qquad (1.11)$$

As can be seen from Equations 1.10 and 1.11, the covariance of X and Y is the expected value of the product of the deviation of X from its mean and the deviation of Y from its mean. The covariance is positive when the two variables move in the same direction, negative when the two variables move in opposite directions, and zero when the two variables are not linearly related.

As a measure of association, the covariance suffers from a major drawback. It is usually difficult to interpret the degree of linear association between two variables from the covariance because its magnitude depends on the magnitudes of the standard deviations of X and Y. For example, suppose that the covariance between two variables is 175. What does this say regarding the relationship between the two variables? The sign, which is positive, indicates that as one variable increases, the other also generally increases. However, the degree to which the two variables move together cannot be ascertained.

FIGURE 1.4 Positive (top) and negative (bottom) correlations between two variables. [figure: scatter plots illustrating $\rho > 0$ and $\rho < 0$]


But, if the covariance is divided by the standard deviations, a measure that is constrained to the range [–1, 1], as previously discussed, is obtained. This measure, called the Pearson product-moment correlation parameter, or correlation parameter for short, conveys clear information about the strength of the linear relationship between the two variables. The population correlation parameter $\rho$ and the sample correlation parameter r of X and Y are defined, respectively, as

$$\rho = \frac{COV(X, Y)}{\sigma_X \sigma_Y}  \qquad (1.12)$$

$$r = \frac{COV(X, Y)}{s_X s_Y},  \qquad (1.13)$$

where $s_X$ and $s_Y$ are the sample standard deviations.
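As a quick numerical illustration (the data below are hypothetical), the sample covariance (Equation 1.11) and the Pearson correlation (Equation 1.13) can be computed directly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
# Equation 1.11: mean product of deviations from the means, n - 1 divisor.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Equation 1.13: dividing by the standard deviations rescales the
# covariance into [-1, 1], giving the Pearson correlation r.
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, r)                # r is close to 1 for this sample
print(np.corrcoef(x, y)[0, 1])  # cross-check with NumPy's built-in
```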

Example 1.3

By using data contained in the "aviation 1" file, the correlations between annual U.S. revenue passenger enplanements, per-capita U.S. gross domestic product (GDP), and the price per gallon of aviation fuel are examined. After deflating the monetary values by the Consumer Price Index (CPI) to 1977 values, the correlation between enplanements and per-capita GDP is 0.94, and the correlation between enplanements and fuel price is –0.72.

These two correlation parameters are not surprising. One would expect that enplanements and economic growth go hand in hand, while enplanements and aviation fuel price (often reflected in changes in fare price) move in opposite directions. However, a word of caution is necessary. The existence of a correlation between two variables does not necessarily mean that one of the variables causes the other. The determination of causality is a difficult question that cannot be directly answered by looking at correlation parameters. To this end, consider the correlation parameter between annual U.S. revenue passenger enplanements and annual ridership of the Tacoma–Pierce Transit System in Washington State. The correlation parameter is considerably high in magnitude (–0.90), indicating that the two variables move in opposite directions in nearly straight-line fashion. Nevertheless, it is safe to say that neither of the variables causes the other and that the two variables are not even remotely related. In short, it needs to be stressed that correlation does not imply causation.

To this point, the discussion on correlation has focused solely on continuous variables measured on the interval or ratio scale.


In some situations, however, one or both of the variables may be measured on the ordinal scale. Alternatively, two continuous variables may not satisfy the requirement of approximate normality assumed when using the Pearson product-moment correlation parameter. In such cases the Spearman rank correlation parameter, a nonparametric alternative, should be used to determine whether a (monotonic) relationship exists between the two variables.

The Spearman correlation parameter is computed by first ranking the observations of each variable from smallest to largest. Then, the Pearson correlation parameter is applied to the ranks; that is, the Spearman correlation parameter is the usual Pearson correlation parameter applied to the ranks of two variables. The equation for the Spearman rank correlation parameter is given as

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},  \qquad (1.14)$$

where the $d_i$, i = 1, …, n, are the differences in the ranks of $x_i$ and $y_i$: $d_i = R(x_i) - R(y_i)$.

There are additional nonparametric measures of correlation between variables, including Kendall's tau. Its estimation complexity, at least when compared with Spearman's rank correlation parameter, makes it less popular in practice.
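A small sketch of Equation 1.14 on hypothetical data (the closed form shown applies when there are no ties in the ranks):

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 12.0, 15.0, 19.0, 26.0, 40.0])
y = x ** 2  # monotonic but nonlinear in x

# Rank each variable, difference the ranks, and apply Equation 1.14.
d = stats.rankdata(x) - stats.rankdata(y)
n = len(x)
r_s = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print(r_s)                       # 1.0: a perfect monotonic association
print(stats.spearmanr(x, y)[0])  # cross-check with SciPy's implementation
```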

1.6 Properties of Estimators

The sample statistics computed in previous sections, such as the sample average $\bar{X}$, the variance $s^2$, and the standard deviation s, are used as estimators of population parameters. In practice, population parameters (often simply called parameters) such as the population mean and variance are unknown constants. In practical applications, the sample average $\bar{X}$ is used as an estimator for the population mean $\mu_X$, the sample variance $s^2$ for the population variance $\sigma_X^2$, and so on. These statistics, however, are random variables and, as such, are dependent on the sample. "Good" statistical estimators of true population parameters satisfy four important properties: unbiasedness, efficiency, consistency, and sufficiency.

1.6.1 Unbiasedness

If there are several estimators of a population parameter, and if one of these estimators coincides with the true value of the unknown parameter, then this estimator is called an unbiased estimator. An estimator is said to be unbiased if its expected value is equal to the true population parameter it is meant to estimate. That is, an estimator, say, the sample average $\bar{X}$, is an unbiased estimator of $\mu_X$ if

$$E(\bar{X}) = \mu_X.  \qquad (1.15)$$

The principle of unbiasedness is illustrated in Figure 1.5. Any systematic deviation of the estimator away from the population parameter is called a bias, and the estimator is called a biased estimator. In general, unbiased estimators are preferred to biased estimators.

1.6.2 Efficiency

The property of unbiasedness is not, by itself, adequate, because there can be situations in which two or more parameter estimates are unbiased. In these situations, interest is focused on which of several unbiased estimators is superior. A second desirable property of estimators is efficiency. Efficiency is a relative property in that an estimator is efficient relative to another if it has a smaller variance than the alternative estimator. An estimator with the smaller variance is more efficient. As can be seen from Figure 1.6, both $\bar{X}$ and $X_1$ are unbiased estimators of the population mean; however, $VAR(\bar{X}) = \sigma^2/n$ while $VAR(X_1) = \sigma^2$, yielding a relative efficiency (variance ratio) of $1/n$, where n is the sample size.

In general, the unbiased estimator with minimum variance is preferred to alternative estimators. A lower bound for the variance of any unbiased estimator $\hat{\theta}$ of $\theta$ is given by the Cramer–Rao lower bound, which can be written as (Gujarati, 1992)

$$VAR(\hat{\theta}) \geq \frac{1}{n E\left[\left(\frac{\partial \ln f(X; \theta)}{\partial \theta}\right)^2\right]} = \frac{1}{-n E\left[\frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2}\right]}.  \qquad (1.16)$$

The Cramer–Rao lower bound is only a sufficient condition for efficiency. Failing to satisfy this condition does not necessarily imply that the estimator is not efficient. Finally, unbiasedness and efficiency hold true for any finite sample size n and, as $n \rightarrow \infty$, become asymptotic properties.

FIGURE 1.5 Biased and unbiased estimators of the mean value of a population x. [figure: sampling distributions of an unbiased estimator, with $E(\bar{X}) = \mu_x$, and a biased estimator, with $E(X^*) \neq \mu_x$; the distance between $E(X^*)$ and $\mu_x$ is the bias]

1.6.3 Consistency

A third asymptotic property is that of consistency. An estimator $\hat{\theta}$ is said to be consistent if the probability of its being close to the true value of the parameter it estimates ($\theta$) increases with increasing sample size. Formally, this says that $\lim_{n \rightarrow \infty} P[|\hat{\theta} - \theta| > c] = 0$ for any arbitrary constant c. For example, this property indicates that $\bar{X}$ will not differ from $\mu_X$ as $n \rightarrow \infty$. Figure 1.7 graphically depicts the property of consistency, showing the behavior of an estimator of the population mean with increasing sample size.

It is important to note that a statistical estimator may be a biased estimator and yet be a consistent one. In addition, a sufficient condition for an estimator to be consistent is that it is asymptotically unbiased and that its variance tends to zero as $n \rightarrow \infty$ (Hogg and Craig, 1994).

FIGURE 1.6 Comparing efficiencies. [figure: sampling distributions of $\bar{X} \sim N(\mu, \sigma^2/n)$ and $X_1 \sim N(\mu, \sigma^2)$, both centered on $\mu_x$]

FIGURE 1.7 The property of consistency. [figure: probability density functions $f(X^*)$ of an estimator of $\mu_x$ for increasing sample sizes $n_1$ through $n_4$, becoming more concentrated around $\mu_x$ as the sample size grows]


1.6.4 Sufficiency

An estimator is said to be sufficient if it contains all the information in the data about the parameter it estimates. In other words, $\hat{\theta}$ is sufficient for $\theta$ if $\hat{\theta}$ contains all the information in the sample pertaining to $\theta$.
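The first three properties are easy to visualize by simulation. A hedged sketch with synthetic normal data and assumed parameter values: it checks that the sample mean averages out to $\mu$ (unbiasedness) and that its variance tracks $\sigma^2/n$ and shrinks with n (the behavior underlying efficiency and consistency):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 60.0, 5.0  # assumed population mean and standard deviation

for n in (10, 100, 1000):
    # 5000 replications: draw a sample of size n, estimate mu by the sample mean.
    means = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    # Unbiasedness: the estimates average out to mu.
    # Efficiency/consistency: their variance is close to sigma**2 / n and shrinks.
    print(n, round(means.mean(), 2), round(means.var(), 3), sigma**2 / n)
```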

1.7 Methods of Displaying Data

Although the different measures described in the previous sections often provide much of the information necessary to describe the nature of the data set being examined, it is often useful to utilize graphical techniques for examining data. These techniques provide ways of inspecting data to determine relationships and trends, identify outliers and influential observations, and quickly describe or summarize data sets. Pioneering methods frequently used in graphical and exploratory data analysis stem from the work of Tukey (1977).

1.7.1 Histograms

Histograms are most frequently used when data are either naturally grouped (gender is a natural grouping, for example) or when small subgroups may be defined to help uncover useful information contained in the data. A histogram is a chart consisting of bars of various heights. The height of each bar is proportional to the frequency of values in the class represented by the bar. As can be seen in Figure 1.8, a histogram is a convenient way of plotting the frequencies of grouped data. In the figure, frequencies on the first (left) Y-axis are absolute frequencies, or counts of the number of city transit buses in the State of Indiana belonging to each age group (data were taken from Karlaftis and Sinha, 1997). Data on the second Y-axis are relative frequencies, which are simply the count of data points in the class (age group) divided by the total number of data points.

Histograms are useful for uncovering asymmetries in data and, as such,skewness and kurtosis are easily identified using histograms.
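A histogram with both absolute and relative frequencies, in the spirit of Figure 1.8, takes a few lines of matplotlib (the bus ages below are simulated, not the Indiana data):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
ages = rng.integers(0, 17, size=200)  # simulated bus ages in years

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
# Left: absolute frequencies (counts per age class).
ax1.hist(ages, bins=range(0, 18))
ax1.set_xlabel("Age"); ax1.set_ylabel("Frequency")
# Right: relative frequencies (each count divided by the number of observations).
ax2.hist(ages, bins=range(0, 18), weights=np.full(len(ages), 1 / len(ages)))
ax2.set_xlabel("Age"); ax2.set_ylabel("Relative frequency")
plt.show()
```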

1.7.2 Ogives

A natural extension of histograms is the ogive. Ogives are cumulative relative frequency graphs. Once an ogive such as the one shown in Figure 1.9 is constructed, the approximate proportion of observations that are less than any given value on the horizontal axis can be read directly from the graph. Thus, for example, it can be estimated from Figure 1.9 that the proportion of buses less than 6 years old is approximately 60%, and the proportion less than 12 years old is 85%.


FIGURE 1.8 Histogram for bus ages in the State of Indiana (1996 data). [figure: absolute frequency (left axis) and relative frequency (right axis) of city transit buses by age group, 0 through 16 years]

FIGURE 1.9 Ogive for bus ages in the State of Indiana. [figure: cumulative relative frequency of bus ages, rising from 0 to 1 across ages 1 through 16]

1.7.3 Box Plots

When faced with the problem of summarizing the essential information of a data set, a box plot (or box-and-whisker plot) is an extremely useful pictorial display. A box plot illustrates how widely dispersed observations are and where the data are centered. This is accomplished by providing, graphically, five summary measures of the distribution of the data: the largest observation, the upper quartile, the median, the lower quartile, and the smallest observation (Figure 1.10).

FIGURE 1.10 The box plot. [figure: a box spanning the lower to the upper quartile (the IQR) with the median inside; whiskers extend to the smallest and largest observations within 1.5(IQR)]

Box plots can be very useful for identifying the central tendency of the data (through the median), identifying the spread of the data (through the interquartile range (IQR) and the length of the whiskers), identifying possible skewness of the data (through the position of the median in the box), identifying possible outliers (points beyond the 1.5(IQR) mark), and for comparing data sets.
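Comparing two samples side by side, as the last point suggests, is where box plots shine. A minimal matplotlib sketch with simulated rural and urban speed samples (the means and spreads loosely echo Table 1.2):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
rural = rng.normal(58.8, 4.6, size=888)  # simulated speeds (mph)
urban = rng.normal(59.0, 4.0, size=408)

# Each box spans the IQR with the median inside; whiskers reach the most
# extreme points within 1.5(IQR), and anything beyond is drawn as an outlier.
plt.boxplot([rural, urban], whis=1.5)
plt.xticks([1, 2], ["Rural", "Urban"])
plt.ylabel("Speed (mph)")
plt.show()
```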

1.7.4 Scatter Diagrams

Scatter diagrams are most useful for examining the relationship between two continuous variables. As examples, assume that transportation researchers are interested in the relationship between economic growth and enplanements, or the effect of a fare increase on travel demand. In some cases, when one variable depends (to some degree) on the value of the other, the dependent variable is plotted on the vertical axis. The pattern of the scatter diagram provides information about the relationship between two variables. A linear relationship is one that can be approximately graphed by a straight line (see Figure 1.4). A scatter plot can show a positive correlation, no correlation, or a negative correlation between two variables (Section 1.5 and Figure 1.4 analyzed this issue in greater depth). Nonlinear relationships between two variables can also be seen in a scatter diagram and typically will be revealed as curvilinear. Scatter diagrams are typically used to uncover underlying relationships between variables, which can then be explored in greater depth with more quantitative statistical methods.

1.7.5 Bar and Line Charts

A common graphical method for examining nominal data is the pie chart. The Bureau of Economic Analysis of the U.S. Department of Commerce, in its 1996 Survey of Current Business, reported the percentages of the U.S. GDP accounted for by various social functions. As shown in Figure 1.11, transportation is a major component of the economy, accounting for nearly 11% of GDP in the United States. The data are nominal since the "values" of the variable, major social function, include six categories: transportation, housing, food, education, health care, and other. The pie graph illustrates the proportion of expenditures in each category of major social function.

The U.S. Federal Highway Administration (FHWA, 1997) completed a report for Congress that provided information on highway and transit assets, trends in system condition, performance, and finance, and estimated investment requirements from all sources to meet the anticipated demands in both highway travel and transit ridership. One of the interesting findings of the report was the pavement ride quality of the nation's urban highways as measured by the International Roughness Index (Figure 1.12). The data are ordinal because the "values" of the variable, pavement roughness, include five categories: very good, good, fair, mediocre, and poor. This scale, although it resembles the nominal categorization of the previous example, possesses the additional property of natural ordering between the categories (without uniform increments between the categories). A reasonable way to describe these data is to count the number of occurrences of each value and then to convert these counts into proportions.

FIGURE 1.11 U.S. GDP by major social function (1995). (From U.S. DOT, 1997.) [pie chart: housing 24%, other 30%, health care 15%, food 13%, transportation 11%, education 7%]

FIGURE 1.12 Percent miles of urban interstate by pavement roughness category. (From FHWA, 1997.) [pie chart: very good 12%, good 27%, fair 24%, mediocre 27%, poor 10%]

Bar charts are a common alternative to pie charts. They graphically represent the frequency (or relative frequency) of each category as a bar rising from the horizontal axis; the height of each bar is proportional to the frequency (or relative frequency) of the corresponding category. Figure 1.13, for example, presents motor vehicle fatal accidents by posted speed limit for 1985 and 1995, and Figure 1.14 presents the percent of on-time arrivals for some U.S. airlines for December 1997.

FIGURE 1.13 Motor vehicle fatal accidents by posted speed limit. (From U.S. DOT, 1997.) [bar chart: fatal accidents in 1985 and 1995 by posted speed limit category, from 0–25 mph through 65 mph]

FIGURE 1.14 Percent of on-time arrivals for December 1997. (From Bureau of Transportation Statistics, www.bts.gov.) [bar chart: percentage of arrivals on time, roughly 60 to 80%, for nine U.S. airlines]

The final graphical technique considered in this section is the line chart. A line chart is obtained by plotting the frequency of a category above the point on the horizontal axis representing that category and then joining the points with straight lines. A line chart is most often used when the categories are points in time (time-series data). Line charts are excellent for uncovering trends of a variable over time. For example, consider Figure 1.15, which represents the evolution of the U.S. air-travel market. A line chart is useful for showing the growth in the market over time. Two points of particular interest to the air-travel market, the deregulation of the market and the Gulf War, are marked on the graph.

FIGURE 1.15 U.S. revenue passenger enplanements, 1954 through 1999. (From Bureau of Transportation Statistics, www.bts.gov.) [line chart: annual enplanements (in thousands) from 1954 to 1999, with airline deregulation and the Gulf War marked]


2 Statistical Inference II: Interval Estimation, Hypothesis Testing, and Population Comparisons

Scientific decisions should be based on sound analysis and accurate information. This chapter provides the theory and interpretation of confidence intervals, hypothesis tests, and population comparisons, which are statistical constructs (tools) used to ask and answer questions about the transportation phenomena under study. Despite their enormous utility, confidence intervals are often ignored in transportation practice, and hypothesis tests and population comparisons are frequently misused and misinterpreted. The techniques discussed in this chapter can be used to formulate, test, and make informed decisions regarding a large number of hypotheses. Questions such as the following serve as examples. Does crash occurrence at a particular intersection support the notion that it is a hazardous location? Do traffic-calming measures reduce traffic speeds? Does route guidance information implemented via a variable message sign system successfully divert motorists from congested areas? Did the deregulation of the air-transport market increase the market share for business travel? Does altering the levels of operating subsidies to transit systems change their operating performance? To address these and similar types of questions, transportation researchers and professionals can apply the techniques presented in this chapter.

2.1 Confidence Intervals

In practice, the statistics calculated from samples, such as the sample average $\bar{X}$, the variance $s^2$, the standard deviation s, and others reviewed in the previous chapter, are used to estimate population parameters. For example, the sample average $\bar{X}$ is used as an estimator for the population mean $\mu_x$, the sample variance $s^2$ is an estimate of the population variance $\sigma^2$, and so on. Recall from Section 1.6 that desirable or "good" estimators satisfy four important properties: unbiasedness, efficiency, consistency, and sufficiency. However, regardless of the properties an estimator satisfies, estimates will vary across samples and there is at least some probability that an estimate will differ from the population parameter it is meant to estimate. Unlike the point estimators reviewed in the previous chapter, the focus here is on interval estimates. Interval estimates allow inferences to be drawn about a population by providing an interval, a lower and upper boundary, within which an unknown parameter will lie with a prespecified level of confidence. The logic behind an interval estimate is that an interval calculated using sample data contains the true population parameter with some level of confidence (the long-run proportion of times that the true population parameter is contained in the interval). Such intervals are called confidence intervals (CIs) and can be constructed for an array of levels of confidence. The lower value is called the lower confidence limit (LCL) and the upper value the upper confidence limit (UCL). The wider a confidence interval, the more confident the researcher is that it contains the population parameter (overall confidence is relatively high). In contrast, a relatively narrow confidence interval is less likely to contain the population parameter (overall confidence is relatively low).

All the parametric methods presented in the first four sections of this chapter make specific assumptions about the probability distributions of sample estimators, or make assumptions about the nature of the sampled populations. In particular, the assumption of an approximately normally distributed population (and sample) is usually made. As such, it is imperative that these assumptions, or requirements, be checked prior to applying the methods. When the assumptions are not met, the nonparametric statistical methods provided in Section 2.5 are more appropriate.

2.1.1 Confidence Interval for μ with Known σ²

The central limit theorem (CLT) suggests that whenever a sufficiently large random sample is drawn from any population with mean μ and standard deviation σ, the sample mean X̄ is approximately normally distributed with mean μ and standard deviation σ/√n. It can easily be verified that the standard normal random variable Z = (X̄ − μ)/(σ/√n) has a 0.95 probability of being between the range of values [−1.96, 1.96] (see Table C.1 in Appendix C). A probability statement regarding Z is given as

P(−1.96 ≤ (X̄ − μ)/(σ/√n) ≤ 1.96) = 0.95.    (2.1)

With some basic algebraic manipulation, the probability statement of Equation 2.1 can be written in a different, yet equivalent, form:


P(X̄ − 1.96σ/√n ≤ μ ≤ X̄ + 1.96σ/√n) = 0.95.    (2.2)

Equation 2.2 reveals that, with a large number of intervals computed from different random samples drawn from the population, the proportion of values of X̄ for which the interval X̄ ± 1.96σ/√n captures μ is 0.95. This interval is called the 95% confidence interval estimator of μ. A shortcut notation for this interval is

X̄ ± 1.96σ/√n, or (X̄ − 1.96σ/√n, X̄ + 1.96σ/√n).    (2.3)

Obviously, probabilities other than 95% can be used. For example, a 90% confidence interval is

X̄ ± 1.645σ/√n.

In general, any confidence level can be used in estimating confidence intervals. The confidence level is (1 − α)100%, and Z_{α/2} is the value of Z such that the area in each of the tails under the standard normal curve is α/2. Using this notation, the confidence interval estimator of μ is written as

X̄ ± Z_{α/2} σ/√n.    (2.4)

Because the confidence level is inversely proportional to the risk that the confidence interval fails to include the actual value of μ, it generally ranges between 0.90 and 0.99, reflecting 10% and 1% levels of risk of not including the true population parameter, respectively.

Example 2.1

A 95% confidence interval is desired for the mean vehicular speed on Indiana roads (see Example 1.1 for more details). First, the assumption of normality is checked; if this assumption is satisfied, the analysis can proceed. The sample size is n = 1296, and the sample mean is X̄ = 58.86. Suppose a long history of prior studies has shown the population standard deviation to be σ = 5.5. Using Equation 2.4, the confidence interval is obtained:


X̄ ± 1.96σ/√n = 58.86 ± 1.96(5.5/√1296) = 58.86 ± 0.30 = (58.56, 59.16).

The result indicates that the 95% confidence interval for the unknown population parameter consists of lower and upper bounds of 58.56 and 59.16. This suggests that the true and unknown population parameter would lie somewhere in this interval about 95 times out of 100, on average. The confidence interval is rather "tight," meaning that the range of possible values is relatively small. This is a result of the low assumed standard deviation (or variability in the data) of the population examined.

The 90% confidence interval, using the same standard deviation, is [58.60, 59.11], and the 99% confidence interval is [58.46, 59.25]. As the confidence interval becomes wider, there is greater and greater confidence that the interval contains the true unknown population parameter.
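Intervals like these are easily scripted. The following is a minimal sketch in Python, assuming scipy is available (the function name is illustrative, and the quantile call stands in for Table C.1):

```python
from scipy.stats import norm

def ci_mean_known_sigma(xbar, sigma, n, conf=0.95):
    """(1 - alpha)100% confidence interval for mu with known sigma (Equation 2.4)."""
    z = norm.ppf(1 - (1 - conf) / 2)   # Z_{alpha/2}
    half = z * sigma / n ** 0.5        # half-width of the interval
    return xbar - half, xbar + half

# Example 2.1: n = 1296, sample mean 58.86, sigma = 5.5
print(ci_mean_known_sigma(58.86, 5.5, 1296))        # ~(58.56, 59.16)
print(ci_mean_known_sigma(58.86, 5.5, 1296, 0.90))  # narrower 90% interval
```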

2.1.2 Confidence Interval for the Mean with Unknown Variance

In the previous section, a procedure was discussed for constructing confidence intervals around the mean of a normal population when the variance of the population is known. In the majority of practical sampling situations, however, the population variance is rarely known and is instead estimated from the data. When the population variance is unknown and the population is normally distributed, a (1 − α)100% confidence interval for μ is given by

X̄ ± t_{α/2} s/√n,    (2.5)

where s is the square root of the estimated variance (s²) and t_{α/2} is the value of the t distribution with n − 1 degrees of freedom (for a discussion of the t distribution, see Appendix A).

Example 2.2

Continuing with the previous example, a 95% confidence interval for the mean speed on Indiana roads is computed, assuming that the population variance is not known; instead, an estimate is obtained from the data, with s = 4.41. The sample size is n = 1296, and the sample mean is X̄ = 58.86. Using Equation 2.5, the confidence interval is obtained as

X̄ ± t_{α/2} s/√n = 58.86 ± 1.96(4.41/√1296) = (58.61, 59.10).

Interestingly, inspection of probabilities associated with the t distribution (see Table C.2 in Appendix C) shows that the t distribution converges to the standard normal distribution as n → ∞. Although the t distribution is the correct distribution to use whenever the population variance is unknown, when the sample size is sufficiently large the standard normal distribution can be used as an adequate approximation to the t distribution.
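The t-based interval of Equation 2.5 differs only in the quantile used. A sketch under the same assumptions, reproducing Example 2.2 and illustrating the convergence just noted:

```python
from scipy.stats import t

def ci_mean_unknown_sigma(xbar, s, n, conf=0.95):
    """(1 - alpha)100% confidence interval for mu, sigma estimated by s (Equation 2.5)."""
    tc = t.ppf(1 - (1 - conf) / 2, df=n - 1)  # t_{alpha/2} with n - 1 degrees of freedom
    half = tc * s / n ** 0.5
    return xbar - half, xbar + half

# Example 2.2: n = 1296, sample mean 58.86, s = 4.41
print(ci_mean_unknown_sigma(58.86, 4.41, 1296))  # ~(58.62, 59.10); t ~ Z at this df
```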

2.1.3 Confidence Interval for a Population Proportion

Sometimes, interest centers on a qualitative (nominal scale) variable rather than a quantitative (interval or ratio scale) variable. There might be interest in the relative frequency of some characteristic in a population, such as, for example, the proportion of people in a population who are transit users. In such cases, the estimator p̂ of the population proportion p has an approximately normal distribution, provided that n is sufficiently large (np ≥ 5 and nq ≥ 5, where q = 1 − p). The mean of the sampling distribution of p̂ is the population proportion p, and its standard deviation is √(pq/n).

A large-sample (1 − α)100% confidence interval for the population proportion p is given by

p̂ ± Z_{α/2} √(p̂q̂/n),    (2.6)

where the estimated sample proportion p̂ is equal to the number of "successes" in the sample divided by the sample size n, and q̂ = 1 − p̂.

Example 2.3

A transit planning agency wants to estimate, at a 95% confidence level, the share of transit users in the daily commute "market" (that is, the percentage of commuters using transit). A random sample of 100 commuters is obtained, and it is found that 28 people in the sample are transit users. By using Equation 2.6, a 95% confidence interval for p is calculated as

p̂ ± Z_{α/2} √(p̂q̂/n) = 0.28 ± 1.96√(0.28(0.72)/100) = 0.28 ± 0.088 = (0.192, 0.368).

Thus, the agency is 95% confident that the share of transit users in the daily commute is between 19.2 and 36.8%.
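A sketch of Equation 2.6 in Python (illustrative names; scipy supplies the normal quantile):

```python
from scipy.stats import norm

def ci_proportion(successes, n, conf=0.95):
    """Large-sample (1 - alpha)100% confidence interval for a proportion (Equation 2.6)."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - conf) / 2)
    half = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half, p_hat + half

# Example 2.3: 28 transit users among 100 sampled commuters
print(ci_proportion(28, 100))  # ~(0.192, 0.368)
```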

2.1.4 Confidence Interval for the Population Variance

In many situations, in traffic safety research for example, interest centers on the population variance (or a related measure such as the population standard deviation). As a specific example, vehicle speeds contribute to crash probability, with an important factor being the variability in speeds on the roadway. Speed variance, measured as differences in travel speeds on a roadway, relates to crash frequency in that a larger variance in speed between vehicles correlates with a larger frequency of crashes, especially for crashes involving two or more vehicles (Garber, 1991). Large differences in speeds result in an increase in the frequency with which motorists pass one another, increasing the number of opportunities for multivehicle crashes. Clearly, vehicles traveling the same speed in the same direction do not overtake one another; therefore, they cannot collide as long as the same speed is maintained (for additional literature on the topic of speeding and crash probabilities, covering both the United States and abroad, the interested reader should consult FHWA, 1995, 1998, and TRB, 1998).

A (1 − α)100% confidence interval for σ², assuming the population is normally distributed, is given by

(n − 1)s²/χ²_{α/2} ≤ σ² ≤ (n − 1)s²/χ²_{1−α/2},    (2.7)

where χ²_{α/2} and χ²_{1−α/2} are values of the χ² distribution with n − 1 degrees of freedom; the area in the right-hand tail of the distribution beyond χ²_{α/2} is α/2, while the area in the left-hand tail below χ²_{1−α/2} is α/2. The chi-square distribution is described in Appendix A, and the table of probabilities associated with the chi-square distribution is provided in Table C.3 of Appendix C.

Example 2.4

A 95% confidence interval for the variance of speeds on Indiana roads is desired. With a sample size of 100 and a variance of 19.51 mph², and using the values from the χ² table (Appendix C, Table C.3), one obtains χ²_{α/2} = 129.56 and χ²_{1−α/2} = 74.22. Thus, the 95% confidence interval is given as

99(19.51)/129.56 ≤ σ² ≤ 99(19.51)/74.22, or 15.05 ≤ σ² ≤ 26.02.

The speed variance is, with 95% confidence, between 15.05 and 26.02. Again, the units of the variance in speed are mph².
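A sketch of Equation 2.7 in Python; note that scipy's exact chi-square quantiles differ slightly from the tabled values quoted above, so the endpoints shift by a small amount:

```python
from scipy.stats import chi2

def ci_variance(s2, n, conf=0.95):
    """(1 - alpha)100% confidence interval for a population variance (Equation 2.7)."""
    alpha = 1 - conf
    lo = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)  # divide by chi2_{alpha/2}
    hi = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)      # divide by chi2_{1-alpha/2}
    return lo, hi

# Example 2.4: n = 100, s^2 = 19.51 mph^2
print(ci_variance(19.51, 100))  # ~(15.0, 26.3)
```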

2.2 Hypothesis Testing

Hypothesis tests are used to assess the evidence on whether a difference in a population parameter (a mean, variance, proportion, etc.) between two or more groups is likely to have arisen by chance or whether some other factor is responsible for the difference. Statistical distributions are employed in hypothesis testing to estimate probabilities of observing the sample data, given an assumption about what "should have" occurred. When observed results are extremely unlikely to have occurred under assumed conditions, then the assumed conditions are considered unlikely. In statistical terms, the hypothesis test provides the following probability:

P(data | true null hypothesis),    (2.8)

which is the probability of obtaining or observing the sample data conditional upon a true null hypothesis. It is not, however, the probability of the null hypothesis being true, despite the number of times it is mistakenly interpreted this way. This and other common misinterpretations of hypothesis tests are described in Washington (2000a).

Consider an example. Suppose there is interest in quantifying the impact of the repeal of the National Maximum Speed Limit (NMSL) on average speeds on U.S. roads. Representative speed data are collected in time periods before and after the repeal. Using a simple before-after study design, the mean speeds before and after repeal of the NMSL are compared.

In statistical terms, the data are used to assess whether the observed difference in mean speeds between the periods can be explained by natural sampling variability, or whether the observed difference is attributable to the repeal of the NMSL. Using hypothesis testing, the researcher calculates the probability of observing the increase (or decrease) in mean speeds (as reflected in the sample data) from the period before to the period after repeal, under the assumption that the repeal of the NMSL had no effect. If an observed increase in speed is extremely unlikely to have been produced by random sampling variability, then the researcher concludes that the repeal was instrumental in bringing about the change. Conversely, if the observed speed differences are not all that unusual and can rather easily be explained by random sampling variability, then it is difficult to attribute the observed differences to the effect of the repeal. At the conclusion of the hypothesis test, the probability of observing the actual data is obtained, given that the repeal of the NMSL had no effect on speeds.

Because many traffic investigations are observational studies, it is not easy to control the many other factors that could influence speeds in the after period (besides the repeal). Such factors include changes in vehicle miles traveled (VMT), driver population, roadside changes, adjacent land-use changes, weather, and so on. It is imperative to account for, or control, these other factors to the extent possible, because lack of control could result in attributing a change in speeds to the repeal of the NMSL when in fact other factors were responsible for the change.


2.2.1 Mechanics of Hypothesis Testing

To formulate questions about transportation phenomena, a researcher must pose two competing statistical hypotheses: a null hypothesis (the hypothesis to be nullified) and an alternative. The null hypothesis, typically denoted by H0, is an assertion about one or more population parameters that are assumed to be true until there is sufficient statistical evidence to conclude otherwise. The alternative hypothesis, typically denoted by Ha, is the assertion of all situations not covered by the null hypothesis. Together, the null and the alternative constitute a set of hypotheses that covers all possible values of the parameter or parameters in question. Considering the NMSL repeal previously discussed, the following pair of competing hypotheses could be formulated:

Null Hypothesis (H0): There has not been a change in mean speeds as a result of the repeal of the NMSL.

Alternative Hypothesis (Ha): There has been a change in mean speedsas a result of the repeal of the NMSL.

The purpose of a hypothesis test is to determine whether it is appropriate to reject or not to reject the null hypothesis. The test statistic is the sample statistic upon which the decision to reject, or fail to reject, the null hypothesis is based. The nature of the hypothesis test is determined by the question being asked. For example, if an engineering intervention is expected to change the mean of a sample (the mean of vehicle speeds), then a null hypothesis of no difference in means is appropriate. If an intervention is expected to change the spread or variability of data, then a null hypothesis of no difference in variances should be used. There are many different types of hypothesis tests that can be conducted. Regardless of the type of hypothesis test, the process is the same: the empirical evidence is assessed and will either refute or fail to refute the null hypothesis based on a prespecified level of confidence. Test statistics used in many parametric hypothesis testing applications rely upon the Z, t, F, and χ² distributions.

The decision to reject or fail to reject the null hypothesis may or may not be based on the rejection region, which is the range of values such that, if the test statistic falls into the range, the null hypothesis is rejected. Recall that, upon calculation of a test statistic, there is evidence either to reject or to fail to reject the null hypothesis. The phrases reject and fail to reject have been chosen carefully. When a null hypothesis is rejected, the information in the sample does not support the null hypothesis, and it is concluded that it is unlikely to be true, a definitive statement. On the other hand, when a null hypothesis is not rejected, the sample evidence is consistent with the null hypothesis. This does not mean that the null hypothesis is true; it simply means that it cannot be ruled out using the observed data. It can never be proved that a statistical hypothesis is true using the results of a statistical test. In the language of hypothesis testing, any particular result is evidence as to the degree of certainty, ranging from almost uncertain to almost certain. No matter how close to the two extremes a statistical result may be, there is always a non-zero probability to the contrary.

Whenever a decision is based on the result of a hypothesis test, there is a chance that it will be incorrect. Consider Table 2.1. In the classical Neyman–Pearson methodology, the sample space is partitioned into two regions. If the observed data, reflected through the test statistic, fall into the rejection or critical region, the null hypothesis is rejected. If the test statistic falls into the acceptance region, the null hypothesis cannot be rejected. When the null hypothesis is true, there is an α percent chance of rejecting it (Type I error). When the null hypothesis is false, there is still a β percent chance of accepting it (Type II error). The probability of a Type I error is the size of the test; it is conventionally denoted by α and called the significance level. The power of a test is the probability that it will correctly lead to rejection of a false null hypothesis, and is given as 1 − β.

Because the probabilities α and β both reflect probabilities of making errors, they should be kept as small as possible. There is, however, a trade-off between the two. For several reasons, the probability of making a Type II error is often ignored. Also, the smaller the α, the larger the β. Thus, if α is made to be really small, the "cost" is a higher probability of making a Type II error, all else being equal. The determination of which statistical error is least desirable depends on the research question asked and the subsequent consequences of making the respective errors. Both error types are undesirable, so attention to proper experimental design prior to data collection and sufficiently large sample sizes will help to minimize the probability of making these two statistical errors. In practice, the probability of making a Type I error is usually set in the range from 0.01 to 0.10 (1 and 10% error rates, respectively). The selection of an appropriate α level is based on the consequences of making a Type I error. For example, if human lives are at stake when an error is made (accident investigations, medical studies), then an α of 0.01 or 0.005 may be most appropriate. In contrast, if an error results in monies being spent for improvements (congestion relief, travel time, etc.) that might not bring about improvements, then perhaps a less stringent α is appropriate.

TABLE 2.1

Results of a Test of Hypothesis

                          Reality
Test Result        H0 is true                  H0 is false
Reject             Type I error,               Correct decision
                   P(Type I error) = α
Do not reject      Correct decision            Type II error,
                                               P(Type II error) = β

© 2003 by CRC Press LLC

Page 43: Data Analysis Book

2.2.2 Formulating One- and Two-Tailed Hypothesis Tests

As discussed previously, the decision of whether the null hypothesis is rejected (or not) is based on the rejection region. To illustrate a two-tailed rejection region, suppose a hypothesis test is conducted to determine whether the mean speed on U.S. highways is 60 mph. The null and alternative hypotheses are formulated as follows:

H0: μ = 60
Ha: μ ≠ 60.

If the sample mean X̄ (the test statistic in this case) is significantly different from 60 and falls in the rejection region, the null hypothesis is rejected. On the other hand, if X̄ is sufficiently close to 60, the null hypothesis cannot be rejected. The rejection region provides a range of values below or above which the null hypothesis is rejected. In practice, however, a standardized normal test statistic is employed. A standardized normal variable is constructed (Equation 2.1) based on a true null hypothesis such that

Z* = (X̄ − μ)/(σ/√n).    (2.9)

The random variable Z* is approximately standard normally distributed, N(0, 1), under a true null hypothesis. Critical values of Z, or Zc, are defined such that P(Z* > Zc) = P(Z* < −Zc) = α/2. The values of Zc that correspond to different values of α are provided in Table C.1 in Appendix C; some commonly used values are shown in Table 2.2. For example, if α = 0.05, then Zc = 1.96. Using Equation 2.9, the test statistic Z* is calculated; for this statistic the following rules apply:

1. If |Z*| ≥ Zc, then the probability of observing this value (or one more extreme) if the null hypothesis is true is α or less. In this case the null hypothesis is rejected in favor of the alternative.

2. If |Z*| < Zc, then the probability of observing this value (or one smaller) under a true null hypothesis is at least 1 − α. In this case the null hypothesis cannot be rejected.

TABLE 2.2

Critical Points of Z for Selected Levels of Significance

Level of Significance α     0.10       0.05       0.01
One-tailed test             ±1.28      ±1.645     ±2.326
Two-tailed test             ±1.645     ±1.96      ±2.576


The conclusions corresponding with the null and alternative hypotheses are "the population mean is equal to 60 mph" and "the population mean is not equal to 60 mph," respectively. This example illustrates a two-sided hypothesis test.

There are many situations when, instead of testing whether a population parameter is equal to a specific value, interest centers on testing whether a parameter is greater than (or smaller than) a specific value. For example, consider a hypothesis test of whether a decrease in domestic airfare ticket prices for a certain airline has a positive impact on the market share p of the airline, currently at 7%. The null and alternative hypotheses are stated as

H0: p ≤ 0.07
Ha: p > 0.07.

In these hypotheses there is an a priori belief that the direction of the anticipated effect is an increase in market share (and the hypotheses are constructed to test this specific directional effect). There could also have been an a priori belief of a decrease in market share, which would yield the following hypotheses:

H0: p ≥ 0.07
Ha: p < 0.07.

Finally, nondirectional hypotheses, motivated by lack of an a priori belief regarding the direction of the effect, would yield

H0: p = 0.07
Ha: p ≠ 0.07.

One- and two-tailed tests have different critical points on the standard normal distribution, since α is allocated to a single tail for directional tests, whereas it is divided between the two tails for nondirectional tests. Table 2.2 provides critical points for the Z distribution for one-tailed and two-tailed hypothesis tests for commonly used values of α. Critical values for other levels of α are found in Table C.1.

Example 2.5

Using the data from Example 2.1, a test is conducted to assess whether the mean speed on Indiana roads is 59.5 mph at the 5% significance level. The sample size is n = 1296, and the sample mean is X̄ = 58.86. Suppose that numerous past studies have revealed the population standard deviation to be σ = 5.5. The parameter of interest is the population mean, and the hypotheses to be tested are

H0: μ = 59.5
Ha: μ ≠ 59.5.

From Example 2.1, a 95% confidence interval for mean speeds is [58.56, 59.16]. Because the value 59.5 mph is not within the confidence interval, the null hypothesis is rejected. Thus, there is sufficient evidence to infer that the mean speed is not equal to 59.5 mph.

An alternative method for obtaining this result is to use the standardized test statistic presented in Equation 2.9 and follow the appropriate decision rules. The test statistic is

Z* = (58.86 − 59.5)/(5.5/√1296) = −4.19.

Since the test statistic in absolute value, |−4.19| = 4.19, is greater than 1.96, the critical value for a two-tailed test at the 5% level of significance, the null hypothesis is rejected. As expected, the confidence interval and the standardized test statistic lead to identical conclusions.

2.2.3 The p-Value of a Hypothesis Test

An increasingly common practice in reporting the outcome of a statistical test is to state the value of the test statistic along with its "probability value" or "p-value." The p-value is the smallest level of significance that leads to rejection of the null hypothesis. It is an important value because it quantifies the amount of statistical evidence that supports the alternative hypothesis. In general, the more evidence there is to reject the null hypothesis in favor of the alternative hypothesis, the larger the test statistic and the smaller the p-value. The p-value provides a convenient way to determine the outcome of a statistical test based on any specified Type I error rate α: if the p-value is less than or equal to α, then the null hypothesis is rejected. For example, a p-value of 0.031 suggests that the null hypothesis will be rejected at α = 0.05, but will not be rejected at α = 0.01. Using p-values will always lead to the same conclusions as the usual test procedure, given a level of significance α.

The p-value of Z* = −4.19 is calculated as follows:

p-value = P(Z ≤ −4.19) + P(Z ≥ 4.19) = 2P(Z ≥ 4.19) = 2(1 − 0.999986) ≈ 0.00003.


As a result, the null hypothesis H0: μ = 59.5 is rejected at both the 0.05 and 0.01 levels, because the p-value of the test is approximately 0.00003.
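The full test and its p-value can be scripted along the same lines; a minimal sketch reproducing Example 2.5 (assuming scipy; the cdf call replaces Table C.1):

```python
from scipy.stats import norm

def z_test_mean(xbar, mu0, sigma, n):
    """Two-tailed Z test of H0: mu = mu0 with known sigma (Equation 2.9)."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed p-value
    return z, p_value

# Example 2.5: n = 1296, sample mean 58.86, sigma = 5.5, H0: mu = 59.5
z_star, p = z_test_mean(58.86, 59.5, 5.5, 1296)
print(z_star, p, p <= 0.05)  # large |Z*|, tiny p-value: reject H0
```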

2.3 Inferences Regarding a Single Population

The previous section discussed some of the essential concepts regarding statistical inference and hypothesis tests concerning the mean of a population when the variance is known. That test is seldom applied in practice, because the population standard deviation is rarely known. The section was useful, however, because it provided the framework and mechanics that are universally applied to all types of hypothesis testing.

In this section, techniques for testing a single population are presented. The techniques introduced in this section are robust, meaning that they remain valid provided that the samples are not extremely or significantly nonnormal. Under extreme nonnormality, nonparametric methods should be used to conduct the equivalent hypothesis tests (see Section 2.5 for a discussion of nonparametric tests).

2.3.1 Testing the Population Mean with Unknown Variance

Applying the same logic used to develop the test statistic in Equation 2.9, a test statistic for testing hypotheses about μ, given that σ² is not known and the population is normally distributed, is

t* = (X̄ − μ)/(s/√n),    (2.10)

which has a t distribution with n – 1 degrees of freedom.

Example 2.6

Similar to the previous example, a test of whether the mean speed on Indiana roads is 60 mph is conducted at the 5% significance level. The sample size is n = 1296, and the sample mean is X̄ = 58.86. As is most often the case in practice, σ² is unknown, and s = 4.41, the estimated standard deviation, is used instead. The statistical hypotheses to be tested are

H0: μ = 60
Ha: μ ≠ 60.


Using Equation 2.10, the value of the test statistic is

t* = (58.86 − 60)/(4.41/√1296) = −9.31.

The test statistic is very large in absolute value, much larger than any reasonable critical value of t (Table C.2), leading again to a rejection of the null hypothesis. That is, the evidence suggests that the probability of observing these data, if in fact the mean speed is 60 mph, is extremely small, and the null hypothesis is rejected.

Why is the test statistic so large, providing strong objective evidence to reject the null hypothesis? If the sample size were 36 instead of 1296, the test statistic would be −1.55, leading to the inability to reject the null hypothesis. It is the square root of n in Equation 2.10 that dominates this calculation, suggesting that larger samples yield more reliable results. This emphasizes an earlier point: lack of evidence to reject the null hypothesis does not mean that the null hypothesis is true; it may mean that the effect (X̄ − μ) is too small to detect, the sample size is too small, or the variability in the data is too large relative to the effect.
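A sketch of the one-sample t test of Equation 2.10, also showing the sample-size effect discussed above:

```python
from scipy.stats import t

def t_test_mean(xbar, mu0, s, n):
    """Two-tailed one-sample t test of H0: mu = mu0 (Equation 2.10)."""
    t_star = (xbar - mu0) / (s / n ** 0.5)
    p_value = 2 * (1 - t.cdf(abs(t_star), df=n - 1))
    return t_star, p_value

# Example 2.6 at the full sample size and at n = 36
print(t_test_mean(58.86, 60, 4.41, 1296))  # large |t*|: reject H0
print(t_test_mean(58.86, 60, 4.41, 36))    # small |t*|: cannot reject H0
```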

2.3.2 Testing the Population Variance

With the same logic used to develop the confidence interval for σ² in Equation 2.7, the test statistic used for testing hypotheses about σ² (with s² the estimated variance) is

X² = (n − 1)s²/σ²,    (2.11)

which is χ² distributed with n − 1 degrees of freedom when the population is approximately normally distributed with variance equal to σ².

Example 2.7

A test of whether the variance of speeds on Indiana roads is larger than 20 is conducted at the 5% level of significance, assuming a sample size of 100. The parameter of interest is the population variance, and the hypotheses to be tested are

H0: σ² ≤ 20
Ha: σ² > 20.


Using the results from Example 2.4, the null hypothesis cannot be rejected, since the value 20 lies within the 95% confidence interval for the variance. To verify, using Equation 2.11 the test statistic is

X² = (n − 1)s²/σ² = 99(19.51)/20 = 96.57.

From Table C.3 in Appendix C, the critical value for a chi-squared random variable with 99 degrees of freedom, α = 0.05, and a right-tailed test is 129.561. As expected, the null hypothesis cannot be rejected at the 0.05 level of significance.
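A sketch of the right-tailed variance test of Equation 2.11 (scipy's quantile stands in for Table C.3 and may differ from the tabled value; the conclusion is unchanged):

```python
from scipy.stats import chi2

def chi2_variance_test(s2, n, sigma2_0, alpha=0.05):
    """Right-tailed chi-square test of H0: sigma^2 <= sigma2_0 (Equation 2.11)."""
    x2 = (n - 1) * s2 / sigma2_0              # test statistic
    critical = chi2.ppf(1 - alpha, df=n - 1)  # right-tail critical value
    return x2, critical, x2 > critical        # True means reject H0

# Example 2.7: n = 100, s^2 = 19.51, hypothesized variance 20
print(chi2_variance_test(19.51, 100, 20))  # statistic ~96.57: cannot reject H0
```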

2.3.3 Testing for a Population Proportion

The null and alternative hypotheses for proportions tests are set up in the same manner as done previously: the null is constructed so that p is equal to, greater than, or less than a specific value, while the alternative hypothesis covers all remaining possible outcomes. The test statistic is derived from the sampling distribution of p̂ and is given by

Z* = (p̂ − p)/√(pq/n),    (2.12)

where the estimated sample proportion p̂ is equal to the number of "successes" observed in the sample divided by the sample size n, and q = 1 − p.

Example 2.8

The transit planning agency mentioned in Example 2.3 tests whether the transit market share in the daily commute is over 20%, at the 5% significance level. The competing hypotheses are

H0: p ≤ 0.20
Ha: p > 0.20.

The test statistic is:

Z* = (0.28 − 0.20)/√(0.28(0.72)/100) = 1.78.

The test statistic is not sufficiently large to warrant the rejection of thenull hypothesis.
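A sketch of Equation 2.12 as written, with the hypothesized proportion in the standard error (Example 2.8's arithmetic uses the sample estimate p̂ there instead, which is why its value, 1.78, is slightly smaller):

```python
from scipy.stats import norm

def z_test_proportion(p_hat, p0, n):
    """One-tailed large-sample test of H0: p <= p0 (Equation 2.12)."""
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5
    p_value = 1 - norm.cdf(z)  # upper-tail p-value
    return z, p_value

# Example 2.8: p_hat = 0.28, p0 = 0.20, n = 100
print(z_test_proportion(0.28, 0.20, 100))
```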


2.4 Comparing Two Populations

An extremely useful application of statistics is in comparing different samples or groups. Frequently in transportation research there is a need to compare quantities such as speeds, accident rates, pavement performance, travel times, bridge paint life, and so on. This section presents methods for conducting comparisons between groups; that is, testing for statistically significant differences between two populations. As before, the methods presented here are for interval and ratio scale variables and are robust. For ordinal scale variables and for extreme nonnormality, nonparametric alternatives should be used.

2.4.1 Testing Differences between Two Means: Independent Samples

Random samples drawn from two populations are used to test the difference between two population means. This section presents the analysis of independent samples, and the following section presents the analysis of paired observations (or matched-pairs experiments). It is assumed that large samples are used to test for the difference between two population means, because when sample sizes are sufficiently large the distribution of the sample means is approximately normal by the central limit theorem. A general rule of thumb suggests that sample sizes are large if both n₁ ≥ 25 and n₂ ≥ 25.

There are several hypothesis testing options. The first is a test of whether the mean of one population is greater than the mean of another population (a one-tailed test). A second is a test of whether two population means are equal, assuming lack of a prior intention to prove that one mean is greater than the other. The most common test for the difference between two population means, μ₁ and μ₂, is the one presented below, where the null hypothesis states that the two means are equal:

H0: μ₁ − μ₂ = 0
Ha: μ₁ − μ₂ ≠ 0.

The competing hypotheses for a directional test, that one population meanis larger than another, are

H0: μ₁ − μ₂ ≤ 0
Ha: μ₁ − μ₂ > 0.

The test statistic in both hypothesis testing situations is the same. As aresult of the assumption about sample sizes and approximate normality ofthe populations, the test statistic is Z*, such that


Z* = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(s₁²/n₁ + s₂²/n₂),    (2.13)

where (μ₁ − μ₂) is the difference between μ₁ and μ₂ under the null hypothesis. The expression in the denominator is the standard error of the difference between the two sample means and requires two independent samples. Recall that hypothesis tests and confidence intervals are closely related. A large-sample (1 − α)100% confidence interval for the difference between two population means (μ₁ − μ₂), using independent random samples, is

(X̄₁ − X̄₂) ± Z_{α/2} √(s₁²/n₁ + s₂²/n₂).    (2.14)

When sample sizes are small (n₁ < 25 and n₂ < 25) and both populations are approximately normally distributed, the test statistic in Equation 2.13 has approximately a t distribution with degrees of freedom given by

df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1)].    (2.15)

In Equations 2.13 and 2.14 it is assumed that σ₁² and σ₂² are not equal. When σ₁² and σ₂² are equal, there is an alternative test for the difference between two population means. This test is especially useful for small samples, as it allows a test for the difference between two population means without having to use the complicated expression for the degrees of freedom of the approximate t distribution (Equation 2.15). When the two population variances σ₁² and σ₂² are equal, the sample variances are pooled to obtain a common estimate. This pooled variance, s_p², is based on the sample variance s₁² obtained from a sample of size n₁ and the sample variance s₂² obtained from a sample of size n₂, and is given by

s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2).    (2.16)

A test statistic for a difference between two population means with equalpopulation variances is given by


t* = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(s_p²(1/n₁ + 1/n₂)),    (2.17)

where the term (μ₁ − μ₂) is the difference between μ₁ and μ₂ under the null hypothesis. The degrees of freedom of the test statistic in Equation 2.17 are n₁ + n₂ − 2, which are the degrees of freedom associated with the pooled estimate of the population variance s_p². The confidence interval for a difference in population means is based on the t distribution with n₁ + n₂ − 2 degrees of freedom, or on the Z distribution when the degrees of freedom are sufficiently large. A (1 − α)100% confidence interval for the difference between two population means (μ₁ − μ₂), assuming equal population variances, is

(X̄₁ − X̄₂) ± t_{α/2} √(s_p²(1/n₁ + 1/n₂)).    (2.18)

Example 2.9

Interest is focused on whether the repeal of the NMSL had an effect on mean speeds on Indiana roads. To test this hypothesis, 744 observations in the before period and 552 observations in the after period are used, at the 5% significance level. Descriptive statistics show that the average speeds in the before and after periods are X̄_b = 57.65 and X̄_a = 60.48, respectively. Further, the variances for the before and after periods are s_b² = 16.4 and s_a² = 19.1, respectively. The competing hypotheses are

H0: μ_a − μ_b = 0
Ha: μ_a − μ_b ≠ 0.

Using Equation 2.13, the test statistic is

Z* = (X̄_a − X̄_b − 0)/√(s_a²/n_a + s_b²/n_b) = (60.48 − 57.65)/√(19.1/552 + 16.4/744) = 11.89.

The test statistic is much larger than 1.96, the critical value for a two-tailed test at the 5% significance level, and so the null hypothesis is rejected. This result indicates that the mean speed increased after the repeal of the NMSL and that this increase is not likely to have arisen by random chance. Using Equation 2.14, a confidence interval is obtained:


(X̄_a − X̄_b) ± Z_{α/2} √(s_a²/n_a + s_b²/n_b) = 2.83 ± 1.96√(19.1/552 + 16.4/744) = (2.36, 3.29).

Thus, with 95% confidence, the increase in speeds from the before to the after period is between 2.36 and 3.29 mph.
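A sketch of Equations 2.13 and 2.14 together, reproducing Example 2.9 (function and argument names are illustrative):

```python
from scipy.stats import norm

def two_sample_z(x1, x2, s1sq, s2sq, n1, n2, conf=0.95):
    """Two-sample Z test of H0: mu1 - mu2 = 0 (Eq. 2.13) and its CI (Eq. 2.14)."""
    se = (s1sq / n1 + s2sq / n2) ** 0.5  # standard error of the difference
    z = (x1 - x2) / se
    zc = norm.ppf(1 - (1 - conf) / 2)
    diff = x1 - x2
    return z, (diff - zc * se, diff + zc * se)

# Example 2.9: after (n = 552) vs. before (n = 744) the NMSL repeal
print(two_sample_z(60.48, 57.65, 19.1, 16.4, 552, 744))  # Z* ~ 11.89, CI ~ (2.36, 3.30)
```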

2.4.2 Testing Differences between Two Means: Paired Observations

Presented in this section are methods for conducting hypothesis tests and for constructing confidence intervals for paired observations obtained from two populations. The advantage of pairing observations is perhaps best illustrated by example. Suppose that a tire manufacturer is interested in whether a new steel-belted tire lasts longer than the company's current model. An experiment could be designed such that two new-design tires are installed on the rear wheels of 20 randomly selected cars and existing-design tires are installed on the rear wheels of another 20 cars. All drivers are asked to drive in their usual way until their tires wear out. The number of miles driven by each driver is recorded so that a comparison of tire life can be made. An improved experiment is possible, however. On 20 randomly selected cars, one tire of each type is installed on the rear wheels and, as in the previous experiment, the cars are driven until the tires wear out.

The first experiment results in independent samples, with no relationship between the observations in one sample and the observations in the second sample. The statistical tests described previously are appropriate for these data. In the second experiment, an observation in one sample is paired with an observation in the other sample because each pair of "competing" tires shares the same vehicle and driver. This experiment is called a matched-pairs design. From a statistical standpoint, two tires from the same vehicle are paired to remove the variation in the measurements due to driving styles, braking habits, driving surface, and so on. The net effect of this design is that variability in tire wear caused by differences other than tire type is removed within pairs, making the design more efficient with respect to detecting differences due to tire type.

The parameter of interest in this test is the difference in means between the two populations, denoted by μ_d and defined as μ_d = μ₁ − μ₂. The null and alternative hypotheses for the two-tailed test case are

H0: μ_d = 0
Ha: μ_d ≠ 0.

The test statistic for paired observations is


t* = (X̄_d − μ_d)/(s_d/√n_d),    (2.19)

where X̄_d is the average sample difference between each pair of observations, s_d is the sample standard deviation of these differences, and the sample size, n_d, is the number of paired observations. When the null hypothesis is true and the population mean difference is μ_d, the statistic has a t distribution with n_d − 1 degrees of freedom. Finally, a (1 − α)100% confidence interval for the mean difference μ_d is

X̄_d ± t_{α/2} s_d/√n_d.    (2.20)
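A sketch of the paired test of Equations 2.19 and 2.20; the tire-life mileages below are hypothetical, invented only to make the snippet runnable:

```python
import numpy as np
from scipy.stats import t

def paired_t(x, y, conf=0.95):
    """Paired t test and confidence interval for the mean difference (Eqs. 2.19-2.20)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    n, dbar, sd = d.size, d.mean(), d.std(ddof=1)
    t_star = dbar / (sd / n ** 0.5)
    half = t.ppf(1 - (1 - conf) / 2, df=n - 1) * sd / n ** 0.5
    return t_star, (dbar - half, dbar + half)

# hypothetical tire life (thousands of miles) for each of 8 matched pairs
new_design = [62, 58, 65, 60, 59, 63, 61, 64]
old_design = [59, 57, 61, 59, 58, 60, 60, 61]
print(paired_t(new_design, old_design))
```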

2.4.3 Testing Differences between Two Population Proportions

In this section, a method is described for testing for differences between two population proportions and drawing inferences. The method pertains to data measured on a qualitative (nominal) scale rather than a quantitative scale. When sample sizes are sufficiently large, the sampling distributions of the sample proportions p̂₁ and p̂₂ and their difference p̂₁ − p̂₂ are approximately normally distributed, giving rise to the test statistic and confidence interval computations presented.

Assuming that sample sizes are sufficiently large and the two populationsare randomly sampled, the competing hypotheses for the difference betweenpopulation proportions are

H0: p₁ − p₂ = 0
Ha: p₁ − p₂ ≠ 0.

As before, one-tailed tests for population proportions could be constructed.The test statistic for the difference between two population proportions whenthe null-hypothesized difference is 0 is

Z* = (p̂₁ − p̂₂) / √(p̂(1 − p̂)(1/n₁ + 1/n₂)),    (2.21)

where p̂₁ = x₁/n₁ is the sample proportion for sample 1, p̂₂ = x₂/n₂ is the sample proportion for sample 2, and p̂ symbolizes the combined proportion in both samples, computed as follows:


p̂ = (x₁ + x₂)/(n₁ + n₂).

When the hypothesized difference between the two proportions is someconstant c, the competing hypotheses become

H0: p₁ − p₂ = c
Ha: p₁ − p₂ ≠ c.

Equation 2.21 is revised such that the test statistic becomes

Z* = (p̂₁ − p̂₂ − c) / √(p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂).    (2.22)

The two equations reflect a fundamental difference in the two null hypotheses. For the zero-difference null hypothesis, it is assumed that p̂₁ and p̂₂ are sample proportions drawn from one population, so a pooled estimate of the common proportion is used in the standard error. For the non-zero-difference null hypothesis, it is assumed that p̂₁ and p̂₂ are sample proportions drawn from two populations, and the standard error of the difference between the two sample proportions is estimated from each sample separately.

When constructing confidence intervals for the difference between two population proportions, the pooled estimate is not used, because the two proportions are not assumed to be equal. A large-sample (1 − α)100% confidence interval for the difference between two population proportions is

(p̂₁ − p̂₂) ± Z_{α/2} √(p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂).    (2.23)
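A sketch combining the zero-difference test (Equation 2.21, pooled standard error) with the confidence interval (Equation 2.23, unpooled); the counts are hypothetical:

```python
from scipy.stats import norm

def two_proportion_z(x1, n1, x2, n2, conf=0.95):
    """Test of H0: p1 - p2 = 0 (Eq. 2.21) and CI for p1 - p2 (Eq. 2.23)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # the combined proportion p-hat
    z = (p1 - p2) / (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5  # unpooled, for the CI
    zc = norm.ppf(1 - (1 - conf) / 2)
    return z, (p1 - p2 - zc * se, p1 - p2 + zc * se)

# hypothetical counts: 28 transit users of 100 in one city, 44 of 200 in another
print(two_proportion_z(28, 100, 44, 200))
```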

2.4.4 Testing the Equality of Two Population Variances

Suppose there is interest in the competing hypotheses

H0: σ₁² = σ₂²
Ha: σ₁² ≠ σ₂²

or

H0: σ₁² ≤ σ₂²
Ha: σ₁² > σ₂².


Consider first a test of whether σ₁² is greater than σ₂². Two independent samples are collected from populations 1 and 2, and the following test statistic is computed:

F*_{n₁−1, n₂−1} = s₁²/s₂²,    (2.24)

where F*_{n₁−1, n₂−1} is an F-distributed random variable with n₁ − 1 degrees of freedom in the numerator and n₂ − 1 degrees of freedom in the denominator (for additional information on the F distribution, see Appendix A). It is important to remember to place s₁² in the numerator because, in a one-tailed test, rejection may occur in the right tail of the distribution only. If s₁² is smaller than s₂², the null hypothesis cannot be rejected because the statistic value will be less than 1.00.

For a two-tailed test, a practical approach is to insert the larger sample variance in the numerator. Then, if the test statistic is greater than a critical value associated with a specific α, the null hypothesis is rejected at the 2α significance level (double the level of significance obtained from Table C.4 in Appendix C). Similarly, if the p-value corresponding with one tail of the F distribution is obtained, it is doubled to get the actual p-value. An equivalent F test that does not require insertion of the larger sample variance in the numerator can be found in Aczel (1993).

Example 2.10

There is interest in testing whether the repeal of the NMSL had an effect on the speed variances rather than the mean speeds. The test is conducted at the 10% significance level. Descriptive statistics show that s_b² and s_a², the sample variances for the before and after periods, are 16.4 and 19.1, respectively. The parameter of interest is the variance, and the competing hypotheses are

H0: σ_a² − σ_b² = 0
Ha: σ_a² − σ_b² ≠ 0.

Using a two-tailed test, the larger sample variance is placed in the numerator. The test statistic value is obtained from Equation 2.24 as

F*_{n_a−1, n_b−1} = s_a²/s_b² = 19.1/16.4 = 1.16.

The critical value for α = 0.05 (2α = 0.1), with 551 degrees of freedom in the numerator and 743 degrees of freedom in the denominator, is close to 1.0. As a result, the analyst rejects the null hypothesis of equality of variances in the before and after periods of the repeal of the NMSL. A correct interpretation is that random sampling alone would not likely have produced the observed difference in sample variances if in fact the variances were equal.
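A sketch of the two-tailed F test of Equation 2.24, reproducing Example 2.10 with exact quantiles in place of Table C.4:

```python
from scipy.stats import f

def f_variance_test(s1sq, n1, s2sq, n2, alpha2=0.10):
    """Two-tailed F test of equal variances; larger variance in the numerator."""
    if s2sq > s1sq:  # place the larger sample variance on top (Equation 2.24)
        s1sq, n1, s2sq, n2 = s2sq, n2, s1sq, n1
    f_star = s1sq / s2sq
    critical = f.ppf(1 - alpha2 / 2, dfn=n1 - 1, dfd=n2 - 1)  # alpha = alpha2 / 2
    return f_star, critical, f_star > critical  # True means reject H0

# Example 2.10: variances 19.1 (n = 552, after) and 16.4 (n = 744, before)
print(f_variance_test(19.1, 552, 16.4, 744))  # F* ~ 1.16 exceeds the critical value
```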

2.5 Nonparametric Methods

Statistical procedures discussed previously in this chapter have focused on making inferences about specific population parameters and have relied upon specific assumptions about the data being satisfied. One assumption is that the samples examined are approximately normally distributed. For the means and variances tests discussed, data are required to be measured on a ratio or interval scale. Finally, sample sizes are required to be sufficiently large. The statistical procedures discussed in the remainder of this chapter make no assumptions regarding the underlying population parameters and are thus called nonparametric. Some tests do not even require assumptions to be made about the distributions of the population of interest (as such, nonparametric methods are often called distribution-free methods).

Nonparametric methods typically require less stringent assumptions than do their parametric alternatives, but they use less of the information contained in the data. If nonparametric techniques are applied to data that meet conditions suitable for parametric tests, then the likelihood of committing a Type II error increases. Thus, parametric methods are more powerful and will be more likely to lead correctly to rejection of a false null hypothesis. The explanation for this is straightforward. When observations measured on interval or ratio scales are ranked, that is, when they are converted to the ordinal scale, information about the relative distance between observations is lost. Because the probability of a Type I error is fixed at some value α, and because statistical tests are generally concerned with the size of an effect (difference in means, variances, etc.), using only ranked data requires more evidence to demonstrate a sufficiently large effect, and so (1 − β), or statistical power, is reduced. Because a nonparametric test usually requires fewer assumptions and uses less information in the data, it is often said that a parametric procedure is an exact solution to an approximate problem, while a nonparametric procedure is an approximate solution to an exact problem (Conover, 1980).

Given the trade-offs between parametric and nonparametric methods, someguidance on when to use nonparametric methods is appropriate. A nonpara-metric technique should be considered under the following conditions:

1. The sample data are frequency counts and a parametric test is notavailable.

2. The sample data are measured on the ordinal scale.


3. The research hypotheses are not concerned with specific population parameters such as μ and σ².

4. Requirements of parametric tests such as approximate normality, largesample sizes, and interval or ratio scale data, are grossly violated.

5. There is moderate violation of parametric test requirements, as wellas a test result of marginal statistical significance.

The remainder of this section presents nonparametric techniques commonly encountered in practice. There are many available nonparametric tests that are not discussed here; a more comprehensive list of the available techniques is presented in Table 2.3. Readers interested in comprehensive textbooks on the subject should consult Daniel (1978), Conover (1980), and Gibbons (1985a, b).

2.5.1 The Sign Test

The sign test can be used in a large number of situations, including cases that test for central location (mean, median) differences in ordinal data or for correlation in ratio data. Its most common application, however, is to identify the most preferred alternative among a set of alternatives when a sample of n individuals (questionnaires, for example) is used. In this case the data are nominal, because the expressions of interest for the n individuals simply indicate a preference. In essence, the objective of this test is to determine whether a difference in preference exists between the alternatives compared.

Consider, for example, the case where drivers are asked to indicate theirpreference for receiving pre-trip travel time information via the Internet vs.receiving the information on their mobile phones. The purpose of the studyis to determine whether drivers prefer one method over the other. Letting pindicate the proportion of the population of drivers favoring the Internet,the following hypotheses are to be tested

H0: p = 0.50
Ha: p ≠ 0.50.

If H0 cannot be rejected, there is no evidence indicating a difference in preference for the two methods of delivering pre-trip information. However, if H0 can be rejected, it can be concluded that driver preferences are different for the two methods. In that case, the method selected by the greater number of drivers can be considered the most preferred method. Interestingly, the logic of the sign test is quite simple. For each individual, a plus sign is recorded if a preference is expressed for the Internet and a minus sign if the individual expresses a preference for the mobile phone. Because the data are recorded in terms of plus or minus signs, this nonparametric test is called the sign test.


TABLE 2.3
A Survey of Nonparametric Testing Methodologies

Location Parameter Hypotheses

Data on One Sample or Two Related Samples (Paired Samples)
  Sign test: Arbuthnott (1910)
  Wilcoxon signed rank test: Wilcoxon (1945, 1947, 1949); Wilcoxon, Katti, and Wilcox (1972)

Data on Two Mutually Independent Samples
  Mann–Whitney U test: Mann and Whitney (1947)
  Wilcoxon rank sum test: Wilcoxon (1945, 1947, 1949); Wilcoxon, Katti, and Wilcox (1972)
  Median test: Brown and Mood (1951); Westenberg (1948)
  Tukey's quick test: Tukey (1959)
  Normal scores tests: Terry (1952); Hoeffding (1951); van der Waerden (1952, 1953); van der Waerden and Nievergelt (1956)
  Percentile modified rank tests: Gastwirth (1965); Gibbons and Gastwirth (1970)
  Wald–Wolfowitz runs tests: Wald and Wolfowitz (1940)

Data on k Independent Samples (k ≥ 3)
  Kruskal–Wallis one-way analysis of variance test: Kruskal and Wallis (1952); Iman, Quade, and Alexander (1975)
  Steel tests for comparison with a control: Steel (1959a, b, 1960, 1961); Rhyne and Steel (1965)

Data on k Related Samples (k ≥ 3)
  Friedman two-way analysis of variance test: Friedman (1937)
  Durbin test for balanced incomplete block designs: Durbin (1951)

Scale or Dispersion Parameter Hypotheses

Data on Two Mutually Independent Samples
  Siegel–Tukey test: Siegel and Tukey (1960)
  Mood test: Mood (1954); Laubscher, Steffens, and deLange (1968)
  Freund–Ansari test: Freund and Ansari (1957); Ansari and Bradley (1960)
  Barton–David test: David and Barton (1958)
  Normal-scores test: Klotz (1962); Capon (1961)
  Sukhatme test: Sukhatme (1957); Laubscher and Odeh (1976)
  Rosenbaum test: Rosenbaum (1953)
  Kamat test: Kamat (1956)
  Percentile modified rank tests: Gastwirth (1965); Gibbons and Gastwirth (1970)
  Moses ranklike tests: Moses (1963)

Tests of Independence

Data on Two Related Samples
  Spearman rank correlation parameter: Spearman (1904); Kendall (1962); Glasser and Winter (1961)
  Kendall tau parameter: Kendall (1962); Kaarsemaker and van Wijngaarden (1953)

Data on k Related Samples (k ≥ 3)
  Kendall parameter of concordance for complete rankings: Kendall (1962)
  Kendall parameter of concordance for balanced incomplete rankings: Durbin (1951)
  Partial correlation: Moran (1951); Maghsoodloo (1975); Maghsoodloo and Pallos (1981)

Contingency Table Data
  Chi-square test of independence

Tests of Randomness with General Alternatives

Data on One Sample
  Number of runs test: Swed and Eisenhart (1943)
  Runs above and below the median
  Runs up and down test: Olmstead (1946); Edgington (1961)
  Rank von Neumann runs test: Bartels (1982)

Tests of Randomness with Trend Alternatives

Time Series Data
  Daniels test based on rank correlation: Daniels (1950)
  Mann test based on Kendall tau parameter: Mann (1945)
  Cox–Stuart test: Cox and Stuart (1955)

Slope Tests in Linear Regression Models

Data on Two Related Samples
  Theil test based on Kendall's tau: Theil (1950)

Data on Two Independent Samples
  Hollander test for parallelism: Hollander (1970)

Tests of Equality of k Distributions (k ≥ 2)

Data on Two Independent Samples
  Chi-square test
  Kolmogorov–Smirnov test: Kim and Jennrich (1973)
  Wald–Wolfowitz test: Wald and Wolfowitz (1940)

Data on k Independent Samples (k ≥ 3)
  Chi-square test
  Kolmogorov–Smirnov test: Birnbaum and Hall (1960)

Tests of Equality of Proportions

Data on One Sample
  Binomial test

Data on Two Related Samples
  McNemar test: McNemar (1962)

Data on Two Independent Samples
  Fisher's exact test
  Chi-square test

Data on k Independent Samples
  Chi-square test

Data on k Related Samples (k ≥ 3)
  Cochran Q test: Cochran (1950)

Tests of Goodness of Fit
  Chi-square test


Under the assumption that H0 is true (p = 0.50), the number of plus signs follows a binomial distribution with p = 0.50, and one would have to refer to the binomial distribution tables to find the critical values for this test. For large samples (n > 20), however, the number of plus signs (denoted by x) can be approximated by a normal probability distribution with mean and standard deviation given by

E(x) = 0.50n,  σ = √(0.25n),    (2.25)

and the large-sample test statistic is given by

Z* = (x − E(x))/σ.    (2.26)


Example 2.11

Consider the example previously mentioned, in which 200 drivers are asked to indicate their preference for receiving pre-trip travel time information via the Internet or via their mobile phones. Results show that 72 drivers preferred receiving the information via the Internet, 103 via their mobile phones, and 25 indicated no difference between the two methods. Do the responses to the questionnaire indicate a significant difference between the two methods in terms of how pre-trip travel time information should be delivered?

Using the sign test, n = 200 − 25 = 175 individuals were able to indicate their preferred method. Using Equation 2.25, the sampling distribution of the number of plus signs has the following properties:

E(x) = 0.50(175) = 87.5,  σ = √(0.25(175)) = 6.6.

In addition, with n = 175 it can be assumed that the sampling distribution is approximately normal. Using the number of times the Internet was selected as the preferred alternative (x = 72), the following value of the test statistic (Equation 2.26) is obtained:

Z* = (x − E(x))/σ = (72 − 87.5)/6.6 = −2.35.

From the test result it is evident that the null hypothesis of no difference in how pre-trip information is provided should be rejected at the 0.05 level of significance (since |−2.35| = 2.35 > 1.96). It should be noted that although the number of plus signs was used to determine whether to reject the null hypothesis that p = 0.50, the number of minus signs could be used instead; the test results would be the same.
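A sketch of the large-sample sign test of Equations 2.25 and 2.26 (using the unrounded standard deviation √43.75 ≈ 6.614 gives −2.34 rather than the −2.35 above):

```python
from scipy.stats import norm

def sign_test(plus, minus, alpha=0.05):
    """Large-sample sign test of H0: p = 0.50 (Equations 2.25 and 2.26)."""
    n = plus + minus  # ties are discarded beforehand
    mean, sd = 0.50 * n, (0.25 * n) ** 0.5
    z = (plus - mean) / sd
    return z, abs(z) > norm.ppf(1 - alpha / 2)  # True means reject H0

# Example 2.11: 72 prefer the Internet, 103 the mobile phone (25 ties dropped)
print(sign_test(72, 103))  # z ~ -2.34: reject H0
```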

2.5.2 The Median Test

The median test can be used to conduct hypothesis tests about a population median. Recall that the median splits a population in such a way that 50% of the values are at the median or above and 50% are at the median or below. To test for the median, the sign test is applied by simply assigning a plus sign whenever the data in the sample are above the hypothesized value of the median and a minus sign whenever the data in the sample are below the hypothesized value of the median. Any data exactly equal to the hypothesized value of the median are discarded. The computations for the sign test are done in exactly the same way as described in the previous section.


Example 2.12

The following hypotheses have been formulated regarding the medianprice of new sport utility vehicles (SUVs):

H0: median = $35,000
Ha: median ≠ $35,000.

In a sample of 62 new SUVs, 34 have prices above $35,000, 26 have prices below $35,000, and 2 have prices of exactly $35,000. Using Equation 2.25 for the n = 60 SUVs with prices different from $35,000, we obtain E(x) = 0.50(60) = 30 and σ = √(0.25(60)) = 3.87, and, with x = 34 as the number of plus signs, the test statistic becomes

Z* = (x − E(x))/σ = (34 − 30)/3.87 = 1.03.

Using a two-tailed test and a 0.05 level of significance, H0 cannot be rejected, since 1.03 < 1.96. On the basis of these data, the null hypothesis that the median selling price of a new SUV is $35,000 cannot be rejected.

2.5.3 The Mann–Whitney U Test

The Mann–Whitney–Wilcoxon (MWW) test is a nonparametric test of the null hypothesis that the probability distributions of two ordinal scale variables are the same for two independent populations. The test is sensitive to differences in location (mean or median) between the two populations. The test was first proposed for two samples of equal sizes by Wilcoxon in 1945. In 1947, Mann and Whitney introduced an equivalent statistic that could handle unequal sample sizes as well as small samples. The MWW test is most useful for testing the equality of two population means and is an alternative to the two independent samples t test when the assumption of normal population distributions is not satisfied. The null and alternative hypotheses for the Mann–Whitney U test are

H0: The two sample distributions are drawn from the same population.
Ha: The two sample distributions are drawn from two different populations.

The hypotheses can be formulated to conduct both one-tailed and two-tailedtests and can also be formulated so that differences in population medians areexamined. The assumptions of the test are that the samples are randomlyselected from two populations and are independent. To obtain the test statistic,the two samples are combined and all the observations are ranked from smallest

H

Ha

0 35 000

35 000

: $ ,

: $ ,

median

median

E 30 3 87, .

Zx E*

..

234 303 87

1 03

H0

© 2003 by CRC Press LLC

Page 63: Data Analysis Book

to largest (ties are assigned the average rank of the tied observations). The smallest observation is denoted as 1 and the largest observation is n. R1 is defined as the sum of the ranks from sample 1 and R2 as the sum of the ranks from sample 2. Further, if n1 is the sample size of population 1 and n2 is the sample size of population 2, the U statistic is defined as follows:

U = n1n2 + n1(n1 + 1)/2 − R1. (2.27)

The U statistic is a measure of the difference between the ranks of the two samples. Based on the assumption that only location (mean or median) differences exist between two populations, a large or small value of the test statistic provides evidence of a difference in the location of the two populations. For large samples the distribution of the U statistic is approximated by the normal distribution. The convergence to the normal distribution is rapid, such that for n1 ≥ 10 and n2 ≥ 10 there is a satisfactory normal approximation. The mean of U is given by

E[U] = n1n2/2, (2.28)

the standard deviation of U is given by

σU = √[n1n2(n1 + n2 + 1)/12], (2.29)

and the large sample test statistic is given by

Z* = (U − E[U])/σU. (2.30)

For large samples the test is straightforward. For small samples, tables of the exact distribution of the test statistic should be used. Most software packages have the option of choosing between a large and a small sample test.

Example 2.13

Speed data were collected from two highway locations (Table 2.4). The question is whether the speeds at the two locations are identical. The first step in the MWW test is to rank the speed measurements from the lowest to the highest values. Using the combined set of 22 observations in Table 2.4, the lowest data value is 46.9 mph (6th observation of location 2) and is thus assigned the rank of 1. Continuing the ranking yields the results of Table 2.5.


In ranking the combined data, it may be that two or more data values are the same. In that case, the tied values are given the average ranking of their positions in the combined data set. The next step in the test is to sum the ranks for each sample. The sums are given in Table 2.6. The test procedure can be based on the ranks for either sample. For example, the sum of the ranks for sample 1 (location 1) is 169.5. Using R1 = 169.5, n1 = 12, and n2 = 10, the U statistic from Equation 2.27 equals 28.5. Further, the sampling distribution of the U statistic has the following properties:

E[U] = n1n2/2 = 60,  σU = √[n1n2(n1 + n2 + 1)/12] = 15.17.

As previously noted, with n1 = 12 and n2 = 10, it can be assumed that the distribution of the U statistic is approximately normal; the value of the test statistic (Equation 2.30) is obtained as

TABLE 2.4
Speed Measurements for Two Highway Locations

Location 1                       Location 2
Observation   Speed (mph)        Observation   Speed (mph)
1             68.4               1             55.3
2             59.7               2             52.1
3             75.0               3             57.2
4             74.7               4             59.4
5             57.8               5             50.0
6             59.4               6             46.9
7             50.3               7             54.1
8             59.1               8             62.5
9             54.7               9             65.6
10            65.9               10            58.4
11            64.1
12            60.9

TABLE 2.5
Ranking of Speed Measurements

Speed (mph)   Item                              Assigned Rank
46.9          6th observation of location 2     1
50.0          5th observation of location 2     2
50.3          7th observation of location 1     3
...           ...                               ...
74.7          4th observation of location 1     21
75.0          3rd observation of location 1     22


Z* = (U − E[U])/σU = (28.5 − 60)/15.17 = −2.07.

At the 0.05 level of significance (|−2.07| > 1.96), we reject the null hypothesis and conclude that the speeds at the two locations are not the same.
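The full computation is also available in standard software. As an illustration, the SciPy sketch below reproduces the test on the Table 2.4 speeds; note that SciPy reports the U statistic for the first sample, whose complement gives the U of Equation 2.27.

    from scipy.stats import mannwhitneyu

    loc1 = [68.4, 59.7, 75.0, 74.7, 57.8, 59.4, 50.3, 59.1, 54.7, 65.9, 64.1, 60.9]
    loc2 = [55.3, 52.1, 57.2, 59.4, 50.0, 46.9, 54.1, 62.5, 65.6, 58.4]

    u, p = mannwhitneyu(loc1, loc2, alternative='two-sided', method='asymptotic')
    print(len(loc1) * len(loc2) - u, p)  # U = 28.5 and a p-value of roughly 0.04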

2.5.4 The Wilcoxon Signed-Rank Test for Matched Pairs

The MWW test described in the previous section deals with independent samples; the Wilcoxon signed-rank test, presented in this section, is useful for comparing two populations, say x and y, for which the observations are paired. As such, the test is a good alternative to the paired observations t test in cases where the difference between paired observations is not normally distributed. The null hypothesis of the test is that the median difference between the two populations is zero. The test begins by computing the paired differences d1 = x1 − y1, d2 = x2 − y2, ..., dn = xn − yn, where n is the number of pairs of observations. Then, the absolute values of the differences,

|di|, i = 1, ..., n, are ranked from smallest to largest disregarding their sign. Following this, the sums of the ranks of the positive and of the negative differences are formed; that is, the smallest observation is denoted as 1 and the largest observation is n. T(+) is defined as the sum of the ranks for the positive differences and T(−) as the sum of the ranks for the negative differences. The Wilcoxon T statistic is defined as the smaller of the two sums of ranks as follows:

T = MIN[T(+), T(−)], (2.31)

TABLE 2.6
Combined Ranking of the Speed Data in the Two Samples

Location 1                                Location 2
Observation   Speed (mph)   Rank          Observation   Speed (mph)   Rank
1             68.4          20            1             55.3          7
2             59.7          14            2             52.1          4
3             75.0          22            3             57.2          8
4             74.7          21            4             59.4          12.5
5             57.8          9             5             50.0          2
6             59.4          12.5          6             46.9          1
7             50.3          3             7             54.1          5
8             59.1          11            8             62.5          16
9             54.7          6             9             65.6          18
10            65.9          19            10            58.4          10
11            64.1          17            Sum of ranks                83.5
12            60.9          15
Sum of ranks                169.5


As the sample size increases, the distribution of the T statistic approximates the normal distribution; the convergence to the normal distribution is good for n ≥ 25. The mean of the T distribution is given by

E[T] = n(n + 1)/4, (2.32)

the standard deviation of T is given by

σT = √[n(n + 1)(2n + 1)/24], (2.33)

and the large sample test statistic is given by

Z* = (T − E[T])/σT. (2.34)
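As an illustration with hypothetical paired data (the travel-time numbers below are invented for the example), the procedure of Equations 2.31 through 2.34 is available in SciPy; method='approx' requests the large-sample normal approximation and assumes a recent SciPy release.

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(1)
    before = rng.normal(30.0, 5.0, 30)          # hypothetical travel times (minutes)
    after = before - rng.normal(1.0, 2.0, 30)   # paired measurements on the same units

    t_stat, p = wilcoxon(before, after, method='approx')
    print(t_stat, p)  # T = MIN[T(+), T(-)] and its two-sided p-value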

2.5.5 The Kruskal–Wallis Test

The Kruskal–Wallis test is the nonparametric equivalent of the independent samples single-factor analysis of variance. The test is applicable to problems where data are either ranked or quantitative, samples are independent, populations are not normally distributed, and the measurement scale is at least ordinal. The Kruskal–Wallis test extends the MWW test to comparisons of k populations, where k is greater than 2. In all cases the competing hypotheses are as follows:

H0: All k populations have the same location (median or mean).
Ha: At least two of the k population locations differ.

First, all observations are ranked from smallest to largest disregarding their sample of origin. The smallest observation is denoted as 1 and the largest observation is n. R1 is defined as the sum of the ranks from sample 1, R2 the sum of the ranks from sample 2, ...; Rk is the sum of the ranks from sample k. The Kruskal–Wallis test statistic is then defined as follows:

W = [12/(n(n + 1))] Σ_{i=1}^{k} (Ri²/ni) − 3(n + 1). (2.35)

For large samples (ni ≥ 5), the distribution of the test statistic W under the null hypothesis is approximated by the chi-square distribution with k − 1 degrees of freedom. That is, the null hypothesis H0 is rejected for a given level of significance α if W > χ²_{α; k−1}.


If the null hypothesis is rejected an important question arises: which populations differ? To answer this question, suppose two populations i and j are compared. For each pair of data points from the populations the average rank of the sample is computed as

R̄i = Ri/ni and R̄j = Rj/nj, (2.36)

where Ri is the sum of the ranks for sample i, and Rj is the sum of the ranks for sample j. The test statistic D is defined as the absolute difference between R̄i and R̄j such that

D = |R̄i − R̄j|. (2.37)

The test is conducted by comparing the test statistic D with the critical value cKW, which is computed as

cKW = √{ χ²_{α; k−1} [n(n + 1)/12] (1/ni + 1/nj) }, (2.38)

where χ²_{α; k−1} is the critical value of the chi-squared distribution used in the original test. The comparisons of all paired values can be conducted jointly at the specified level of significance α. The null hypothesis is rejected if and only if D > cKW.
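A minimal SciPy sketch of the overall test follows; the three speed samples below are hypothetical, and the reported p-value compares W with a chi-square distribution with k − 1 degrees of freedom.

    from scipy.stats import kruskal

    site1 = [68.4, 59.7, 75.0, 74.7, 57.8, 59.4]   # hypothetical speed samples
    site2 = [55.3, 52.1, 57.2, 59.4, 50.0, 46.9]
    site3 = [62.0, 64.8, 61.3, 66.1, 63.4, 60.2]

    w, p = kruskal(site1, site2, site3)
    print(w, p)  # reject H0 when W exceeds the chi-square critical value, k - 1 = 2 df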

2.5.6 The Chi-Square Goodness-of-Fit Test

The χ² test sees widespread use in a variety of transportation analyses. Its popularity stems from its versatility and its ability to help assess a large number of questions. The data used in χ² tests are either counts or frequencies measured across categories that may be measured on any scale. Examples include the number of accidents by accident type, the number of people who fall into different age and gender categories, the number of speeding tickets by roadway functional class, the number of vehicles purchased per year by household type, etc. The χ² distribution and associated statistical tests are very common and useful. All versions of the χ² test follow a common five-step process (Aczel, 1993):

1. Competing hypotheses for a population are stated (null and alternative).

2. Frequencies of occurrence of the events expected under the null are computed. This provides expected counts or frequencies based on


some “statistical model,” which may be a theoretical distribution, an empirical distribution, an independence model, etc.

3. Observed counts of data falling in the different cells are noted.

4. The differences between the observed and the expected counts are computed and summed. The differences lead to a computed value of the χ² test statistic.

5. The test statistic is compared to the critical points of the χ² distribution and a decision on the null hypothesis is made.

In general, the χ² test statistic is equal to the squared difference between the observed count and the expected count in each cell divided by the expected count and summed over all cells. If the data are grouped into k cells, let the observed count in cell i be Oi and the expected count (expected under H0) be Ei. The summation is over all cells i = 1, 2, ..., k. The test statistic is

X² = Σ_{i=1}^{k} (Oi − Ei)²/Ei. (2.39)

With increasing sample size and for a fixed number of k cells, the distribution of the X² test statistic approaches the χ² distribution with k − 1 degrees of freedom, provided that the expected frequencies are 5 or more for all categories.

A goodness-of-fit test assesses how well the sample distribution supports an assumption about the population distribution. For example, in statistical analyses it is often assumed that samples follow the normal distribution; the χ² test can be used to assess the validity of such an assumption. Other assumed distributions and assumptions can also be tested with the chi-square goodness-of-fit test (for an example in transportation see Washington et al., 1999). Most statistical software packages that compare actually observed data to hypothesized distributions use and report the X² test statistic. Caution should be exercised, however, when applying the χ² test to small sample sizes or where cells are defined such that small expected frequencies are obtained. In these instances the χ² test is inappropriate and exact methods should be applied (see Mehta and Patel, 1983, for details).

The χ² distribution is useful for other applications besides goodness-of-fit tests. Contingency tables can be helpful in determining whether two classification criteria, such as age and satisfaction with transit services, are independent of each other. The technique makes use of tables with cells corresponding to cross-classifications of attributes or events. The null hypothesis that the factors are independent is used to obtain the expected distribution of frequencies in each of the contingency table cells. The competing hypotheses for a contingency table are as follows (the general form of a contingency table is shown in Table 2.7):


H0: The two classification variables are statistically independent.
Ha: The two classification variables are not statistically independent.

The test statistic X² of Equation 2.39 for a two-way contingency table is rewritten as follows:

X² = Σ_{i=1}^{r} Σ_{j=1}^{c} (Oij − Eij)²/Eij, (2.40)

where the differences between observed and expected frequencies are summed over all rows and columns (r and c, respectively). The test statistic in Equation 2.40 is approximately χ² distributed with degrees of freedom df = (r − 1)(c − 1). Finally, the expected count in cell (i, j), where Ri and Cj are the row and column totals, respectively, is

Eij = RiCj/n. (2.41)

The expected counts obtained from Equation 2.41, along with the observed cell counts, are used to compute the value of the X² statistic, which provides the objective information needed to accept or reject the null hypothesis. The X² test statistic can easily be extended to three or more variables, where summation is simply extended to cover all cells in the multiway contingency table. As stated previously, caution must be applied to contingency tables based on small sample sizes or when expected cell frequencies become small; in these instances the X² test statistic is unreliable, and exact methods should be used.

Contingency tables and the X² test statistic are also useful for assessing whether the proportion of some characteristic is equal in several populations. A transit agency, for example, may be interested in knowing whether the proportion of people who are satisfied with transit quality of service is about the same for three age groups: under 25, 25 to 44, and 45 and over. Whether the proportions are equal is of paramount importance in assessing whether the three age populations are homogeneous with respect to satisfaction with

TABLE 2.7
General Layout of a Contingency Table

Second Classification     First Classification Category
Category                  1       ...     j       Total
1                         C11     ...     ...     R1
...                       ...     ...     ...     ...
i                         ...     ...     Cij     Ri
Total                     C1      ...     Cj      n


the quality of service. Therefore, tests of equality of proportions across several populations are called tests of homogeneity.

Homogeneity tests are conducted similarly to previously described tests, but with two important differences. First, the populations of interest are identified prior to the analysis and sampling is done directly from them, unlike contingency table analysis where a sample is drawn from one population and then cross-classified according to some criteria. Second, because the populations are identified and sampled from directly, the sample sizes representing the different populations of interest are fixed. This experimental setup is called a fixed marginal totals χ² analysis and does not affect the analysis procedure in any way.


Part II

Continuous Dependent Variable Models


3
Linear Regression

Linear regression is one of the most widely studied and applied statistical and econometric techniques. There are numerous reasons for this. First, linear regression is suitable for modeling a wide variety of relationships between variables. In addition, the assumptions of linear regression models are often suitably satisfied in many practical applications. Furthermore, regression model outputs are relatively easy to interpret and communicate to others, numerical estimation of regression models is relatively easy, and software for estimating models is readily available in numerous “non-specialty” software packages. Linear regression can also be overused or misused. In some cases the assumptions are not strictly met, and the correct alternatives are not known, understood, or applied. In particular situations, suitable alternative modeling techniques are not known, or the specialized software and knowledge required to estimate such models are not obtained.

It should not be surprising that linear regression serves as a good starting point for illustrating statistical model estimation procedures. Applying linear regression when other methods are more suitable, however, should be avoided at all costs.

This chapter illustrates the estimation of linear regression models, explains when linear regression models are inappropriate, describes how to interpret linear regression model outputs, and discusses how a linear regression model is selected among competing models. Matrix algebra is used throughout the chapter to illustrate the concepts, but only to the extent necessary to illustrate the most important analytical aspects.

3.1 Assumptions of the Linear Regression Model

Linear regression is used to model a linear relationship between a continuous dependent variable and one or more independent variables. Most applications of regression seek to identify a set of independent variables that are thought to covary with the dependent variable. In general, explanatory or “causal” models are based on data obtained from well-controlled experiments


(for example, those conducted in a laboratory), predictive models are based on data obtained from observational studies, and quality control models are based on data obtained from a process or system being controlled. Whether independent variables are causal or merely associative depends on many factors, only one of which is the result of statistical analysis.

There are numerous assumptions of the linear regression model, which should be thought of as requirements. When any of the requirements is not met, remedial actions should be taken, and in some cases, alternative modeling approaches should be adopted. The following are the assumptions of the linear regression model.

3.1.1 Continuous Dependent Variable Y

The assumption in regression is that the response is continuous; that is, it can take on any value within a range of values. A continuous variable is measured on the interval or ratio scale. Although it is often done, regression on ordinal scale response variables is incorrect. For example, count variables (nonnegative integers) should be modeled with Poisson and negative binomial regression (see Chapter 10). Modeling nominal scale dependent variables (discrete variables that are not ordered) requires discrete outcome models, presented in Chapter 11.

3.1.2 Linear-in-Parameters Relationship between Y and X

This requirement can cause some confusion. The form of the regression model requires that the relationship between variables is inherently linear — a straight-line relationship between the dependent variable Y and the independent variables. The simple linear regression model is given by

Yi = β0 + β1X1i + εi. (3.1)

In this algebraic expression of the simple linear regression model, the dependent variable Y is a function of a constant term β0 (the point where the line crosses the Y axis) and a constant β1 times the value x1i of independent variable X1 for observation i, plus a disturbance term εi. The subscript i corresponds to the individual or observation, where i = 1, 2, 3, ..., n. In most applications the response variable Y is a function of many independent variables. In these cases it is more efficient to express the linear regression model in matrix notation

Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}. (3.2)

Equation 3.2 is the regression model in matrix terms, where the subscripts depict the size of the matrices. For instance, the X matrix is an n × p matrix


of observations, with n the number of observations and p the number of variables measured on each observation.

This linearity requirement is not as restrictive as it first sounds. Because the scales of both the dependent and independent variables can be transformed, a suitable linear relationship can often be found. As is discussed later in the chapter and in Appendix D, there are cases when a transformation is inappropriate, requiring a different modeling framework such as logistic regression, Poisson regression, survival analysis, and so on.

3.1.3 Observations Independently and Randomly Sampled

Although this requirement can be relaxed if remedial actions are taken, an assumption necessary to make inferences about the population of interest is that the data are randomly sampled from the population. Independence requires that the probability that an observation is selected is unaffected by other observations selected into the sample. In some cases random assignment can be used in place of random sampling, and other sampling schemes such as stratified and cluster samples can be accommodated in the regression modeling framework with corrective measures.

3.1.4 Uncertain Relationship between Variables

The difference between the equation of a straight line and a linear regression model is the addition of a stochastic, or disturbance, term ε. This disturbance term consists of several elements of the process being modeled. First, it can contain variables that were omitted from the model — assumed to be the sum of many small, individually unimportant effects, some positive and others negative. Second, it contains measurement errors in the dependent variable, or the imprecision in measuring Y, again assumed to be random. Finally, it contains random variation inherent in the underlying data-generating process.

3.1.5 Disturbance Term Independent of X and Expected Value Zero

The requirements of the disturbance term can be written as follows:

E[εi] = 0, (3.3)

and

VAR[εi] = σ². (3.4)

Equation 3.4 shows that the variance of the disturbance term, σ², is constant across observations. This is referred to as the homoscedasticity assumption and


implies that the net effect of model uncertainty, including unobserved effects, measurement errors, and true random variation, is not systematic across observations; instead it is random across observations and across covariates. When disturbances are heteroscedastic (vary systematically across observations), alternative modeling approaches such as weighted least squares or generalized least squares may be required.

3.1.6 Disturbance Terms Not Autocorrelated

This requirement is written as follows:

COV[εi, εj] = 0 if i ≠ j. (3.5)

Equation 3.5 specifies that disturbances are independent across observations. Common violations of this assumption occur when observations are repeated on individuals, so the unobserved heterogeneity portion of the disturbance term is the same across repeated observations. Observations across time often possess autocorrelated disturbances as well. When disturbances are correlated across observations, generalized least squares or other correction methods are required.

3.1.7 Regressors and Disturbances Uncorrelated

This property is known as exogeneity of the regressors. When the regressors are exogenous, they are not correlated with the disturbance term. Exogeneity implies that the values of the regressors are determined by influences “outside of the model.” So Y does not directly influence the value of an exogenous regressor. In mathematical terms, this requirement translates to

COV[Xi, εj] = 0 for all i and j. (3.6)

When an important variable is endogenous (depends on Y), alternative methods are required, such as instrumental variables, two- and three-stage least squares, or structural equations models.

3.1.8 Disturbances Approximately Normally Distributed

Although not a requirement for the estimation of linear regression models, the disturbance terms are required to be approximately normally distributed in order to make inferences about the parameters from the model. In this regard the central limit theorem enables exact inference about the properties of statistical parameters. Combined with the independence assumption, this


assumption results in disturbances that are independently and identically distributed as normal (i.i.d. normal). In notation, this assumption requires

εi ~ N(0, σ²). (3.7)

The approximation holds because the disturbances for any sample are not expected to be exactly normally distributed due to sampling variation. It follows that Yi ~ N(Xiβ, σ²).

3.1.9 Summary

The analytical assumptions of the regression model are summarized in Table 3.1. The table lists the assumptions described previously. It is prudent to consider the assumptions in the table as modeling requirements that need to be met. If the requirements are not met, then remedial measures must be taken. In some cases a remedial action is taken that enables continued modeling of the data in the linear regression modeling framework. In other cases alternative modeling methodologies must be adopted to deal effectively with the non-ideal conditions. Chapter 4 discusses in detail remedial actions when regression assumptions are not met.

3.2 Regression Fundamentals

The objective of linear regression is to model the relationship between a dependent variable Y and one or more independent variables X. The ability to say something about the way X affects Y is through the parameters in the regression model — the betas. Regression seeks to provide information and properties about the parameters in the population model by inspecting properties of the sample-estimated betas, how they behave, and what they can tell us about the sample and thus about the population.

TABLE 3.1
Summary of Ordinary Least Squares Linear Regression Model Assumptions

Statistical Assumption                                   Mathematical Expression
1. Functional form                                       Yi = β0 + β1X1i + εi
2. Zero mean of disturbances                             E[εi] = 0
3. Homoscedasticity of disturbances                      VAR[εi] = σ²
4. Nonautocorrelation of disturbances                    COV[εi, εj] = 0 if i ≠ j
5. Uncorrelatedness of regressors and disturbances       COV[Xi, εj] = 0 for all i and j
6. Normality of disturbances                             εi ~ N(0, σ²)


The linear regression model thought to exist for the entire population of interest is

E[Yi | Xi] = β0 + β1X1i + β2X2i + ... + β_{p−1}X_{p−1,i}. (3.8)

The true population model is formulated from theoretical considerations, past research findings, and postulated theories. The expected value of Yi given covariate vector Xi is a conditional expectation. In some texts the conditional expectation notation is dropped, but it should be understood that the mean or expected value of Yi is conditional on the covariate vector for observation i. The population model represents a theoretically postulated model whose parameter values are unknown, constant, and denoted with betas, as shown in Equation 3.8. The parameters are unknown because Equation 3.8 is based on all members of the population of interest. The parameters (betas) are constant terms that reflect the underlying true relationship between the independent variables X1, X2, ..., X_{p−1} and dependent variable Yi, because the population N is presumably finite at any given time. The true population model contains p parameters, and there are n observations.

The unknown disturbance term for the population regression model (Equation 3.8) is given by

εi = Yi − Ŷi = Yi − E[Yi | Xi] = Yi − (β0 + β1X1i + β2X2i + ... + β_{p−1}X_{p−1,i}). (3.9)

The equivalent expressions for the population models (Equations 3.8 and 3.9) in matrix form (dropping the conditional expectation and replacing expected value notation with the hat symbol) are given by

Ŷ = Xβ, (3.10)

and

ε = Y − Ŷ = Y − Xβ. (3.11)

Regression builds on the notion that information is learned about the unknown and constant parameters (betas) of the population by using information contained in the sample. The sample is used for estimating betas — random variables that fluctuate from sample to sample — and the properties of these estimates are used to make inferences about the true population betas. There are numerous procedures to estimate the parameters of the true population model based on the sample data, including least squares and maximum likelihood.

3.2.1 Least Squares Estimation

Least squares estimation is a commonly employed estimation method for regression applications. Often referred to as “ordinary least squares” or


OLS, it represents a method for estimating regression model parameters using the sample data. OLS estimation using both standard and matrix algebra is shown using Equations 3.8 and 3.10 for the simple regression model case as the starting point.

First consider the algebraic expression of the OLS regression model shown in Equation 3.8. OLS, as one might expect, requires a minimum (least) solution of the squared disturbances. OLS seeks a solution that minimizes the function Q:

Q = min Σ_{i=1}^{n} εi² = min Σ_{i=1}^{n} (Yi − Ŷi)² = min Σ_{i=1}^{n} (Yi − β0 − β1Xi)². (3.12)

Those values of β0 and β1 that minimize the function Q are the least squares estimated parameters. Of course β0 and β1 are parameters of the population and are unknown, so estimators B0 and B1 are obtained, which are random variables that vary from sample to sample. By setting the partial derivatives of Q with respect to β0 and β1 equal to zero, the least squares estimated parameters B0 and B1 are obtained:

∂Q/∂β0 = −2 Σ_{i=1}^{n} (Yi − β0 − β1Xi) = 0 (3.13)

∂Q/∂β1 = −2 Σ_{i=1}^{n} Xi(Yi − β0 − β1Xi) = 0. (3.14)

Solving these equations using B0 and B1 to denote the estimates of β0 and β1, respectively, and rearranging terms yields

Σ_{i=1}^{n} Yi = nB0 + B1 Σ_{i=1}^{n} Xi, (3.15)

and

Σ_{i=1}^{n} XiYi = B0 Σ_{i=1}^{n} Xi + B1 Σ_{i=1}^{n} Xi². (3.16)

One can verify that these equations are minimums by their positive second partial derivatives. As one might expect, there is a least squares normal equation generated for each parameter in the model, and so a linear


regression model with p = 5 would yield five normal equations. Equations 3.15 and 3.16 are the least squares normal equations for the simple linear regression model. Solving simultaneously for the betas in Equations 3.15 and 3.16 yields

B1 = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)², (3.17)

and

B0 = (1/n)[Σ_{i=1}^{n} Yi − B1 Σ_{i=1}^{n} Xi] = Ȳ − B1X̄. (3.18)

The derivation of the matrix algebra equivalent of the least squares normal equations for the simple regression case is straightforward. In the simple regression case, the expression Y = Xβ consists of the following matrices (rows separated by semicolons):

Y = [Y1; Y2; ...; Yn],  X = [1, X1; 1, X2; ...; 1, Xn],  β = [β0; β1]. (3.19)

The following matrix operations are used to solve for the betas:

(step 1): Y = XB
(step 2): X^T Y = X^T X B
(step 3): (X^T X)^{−1} X^T Y = (X^T X)^{−1} X^T X B = B. (3.20)

The beta vector is shown as B, since it is the estimated vector of betas for the true beta vector β. Step 2 of Equation 3.20 is used to demonstrate the equivalence of the normal equations (Equations 3.15 and 3.16) obtained using algebraic and matrix solutions. Step 2 is rewritten as

[1, 1, ..., 1; X1, X2, ..., Xn][Y1; Y2; ...; Yn] = [1, 1, ..., 1; X1, X2, ..., Xn][1, X1; 1, X2; ...; 1, Xn][B0; B1]. (3.21)

The matrix products in Equation 3.21 yield


[Σ_{i=1}^{n} Yi; Σ_{i=1}^{n} XiYi] = [n, Σ_{i=1}^{n} Xi; Σ_{i=1}^{n} Xi, Σ_{i=1}^{n} Xi²][B0; B1]. (3.22)

By writing out the equations implied by Equation 3.22, the least squares normal Equations 3.15 and 3.16 are obtained. It turns out that this is a general result, and step 3 of Equation 3.20 can be used to obtain the least squares estimated parameters for a regression model with p parameters.
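The general result is easy to state in code; the following NumPy sketch (with invented data) applies step 3 of Equation 3.20, solving the normal equations rather than forming an explicit inverse for numerical stability.

    import numpy as np

    def ols(X, y):
        # B = (X'X)^-1 X'Y, computed by solving the normal equations
        return np.linalg.solve(X.T @ X, X.T @ y)

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0.0, 10.0, 50)
    y = 2.0 + 0.5 * x1 + rng.normal(0.0, 1.0, 50)   # hypothetical data
    X = np.column_stack([np.ones(50), x1])          # leading column of ones
    print(ols(X, y))                                # approximately [2.0, 0.5]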

Most standard statistical software packages calculate least squares parameters. The user of software should be able to find in supporting literature and/or user’s manuals the type of estimation method being used to estimate parameters in a model. There are occasions when violations of regression assumptions require that remedial actions be taken, as the properties of the least squares estimators become inefficient, biased, or both (see Chapter 1 for properties of estimators in general).

Example 3.1

A database consisting of 121 observations is available to study annual average daily traffic (AADT) in Minnesota. Variables available in this database, and their abbreviations, are provided in Table 3.2. A simple regression model is estimated using least squares estimated parameters as a starting point. The starter specification is based on a model that has AADT as a function of CNTYPOP, NUMLANES, and FUNCTIONALCLASS: AADT = β0 + β1(CNTYPOP) + β2(NUMLANES) + β3(FUNCTIONALCLASS) + disturbance.

TABLE 3.2
Variables Collected on Minnesota Roadways

Variable No.  Abbreviation: Variable Description
1   AADT: Average annual daily traffic in vehicles per day
2   CNTYPOP: Population of county in which road section is located (proxy for nearby population density)
3   NUMLANES: Number of lanes in road section
4   WIDTHLANES: Width of road section in feet
5   ACCESSCONTROL: 1 for access controlled facility; 2 for no access control
6   FUNCTIONALCLASS: Road section functional classification; 1 = rural interstate, 2 = rural non-interstate, 3 = urban interstate, 4 = urban non-interstate
7   TRUCKROUTE: Truck restriction conditions: 1 = no truck restrictions, 2 = tonnage restrictions, 3 = time of day restrictions, 4 = tonnage and time of day restrictions, 5 = no trucks
8   LOCALE: Land-use designation: 1 = rural, 2 = urban with population ≤ 50,000, 3 = urban with population > 50,000


The model is not a deterministic model because numerous random fluctuations in AADT are thought to exist from day to day as a result of fluctuations in the number of commuters, vacation travelers, out-of-area traffic, etc. In addition, numerous other variables thought to affect AADT were not collected or available, such as the spatial distribution of employment, shopping, recreation, and residential land uses, which are thought to affect local traffic volumes and AADT.

An initial regression model is estimated. The matrices (with observations 4 through 118 not shown) used in the least squares estimation are shown below. Note that independent variable X1 is CNTYPOP, X2 is NUMLANES, and variables X3 through X5 refer to various functional classifications (see the section on indicator variables later in this chapter), with X3 = 1 for rural interstates, X4 = 1 for rural non-interstates, and X5 = 1 for urban interstates; otherwise these variables are zero. For example, the first observation had an average annual daily traffic of 1616 vehicles per day, a county population of 13,404, two lanes, and was a rural non-interstate facility.

Y = [1616; 1329; 3933; ...; 14,905; 15,408; 1266]

X (columns 1, X1, X2, X3, X4, X5):
[1   13,404    2   0   1   0
 1   52,314    2   0   1   0
 1   30,982    2   0   0   0
 ...
 1   459,784   4   0   0   0
 1   459,784   2   0   0   0
 1   43,784    2   0   0   0]

β = [β0; β1; β2; β3; β4; β5].

The estimated least squares regression parameters are shown in Table 3.3. The values of the estimated betas give an indication of the value of the true population parameters if the presumed model is correct. For example, for each additional 1000 people in the local county population, there is an estimated 29 additional AADT. For each lane there is an estimated 9954 additional AADT.

TABLE 3.3
Least Squares Estimated Parameters (Example 3.1)

Parameter     Parameter Estimate
Intercept     −26234.490
CNTYPOP       0.029
NUMLANES      9953.676
FUNCLASS1     885.384
FUNCLASS2     4127.560
FUNCLASS3     35453.679



Urban interstates are associated with an estimated 35,454 AADT more than the functional classification that is not included in the model (urban non-interstates). Similarly, rural non-interstates are associated with an estimated 4128 AADT more, and rural interstates are associated with 886 AADT more on average. Using these values, it can be determined that urban interstates are associated with 31,326 (35,454 minus 4128) more AADT on average than rural non-interstates.

What remains to be examined is whether the estimated regression model is correct. In other words, are the variables entered in the model truly associated with changes in AADT, are they linearly related, and do they do an adequate job of explaining variation in AADT? Additional theory and tools are needed to assess the viability of the entered variables and the goodness of fit of the model to the Minnesota AADT data.

3.2.2 Maximum Likelihood Estimation

The previous section showed the development of the OLS estimators through the minimization of the function Q. Another popular and sometimes useful statistical estimation method is called maximum likelihood estimation, which results in the maximum likelihood estimates, or MLEs. The general form of the likelihood function is described in Appendix A as the joint density of observing the sample data from a statistical distribution with parameter vector θ, such that

L(θ | X) = f(x1, x2, ..., xn | θ) = Π_{i=1}^{n} f(xi | θ).

For the regression model, the likelihood function for a sample of n independent, identically, and normally distributed disturbances is given by

L(β, σ²) = (2πσ²)^{−n/2} EXP[−(1/(2σ²)) Σ_{i=1}^{n} (Yi − Xiβ)²] = (2πσ²)^{−n/2} EXP[−(1/(2σ²))(Y − Xβ)^T (Y − Xβ)]. (3.23)

As is usually the case, the logarithm of Equation 3.23, or the log likelihood, is simpler to solve than the likelihood function itself, so taking the log of L yields

LL = LN(L) = −(n/2) LN(2π) − (n/2) LN(σ²) − (1/(2σ²))(Y − Xβ)^T (Y − Xβ). (3.24)

Maximizing the log likelihood with respect to β and σ² reveals a solution for the estimates of the betas that is equivalent to the OLS estimates, that is,


B = (X^T X)^{−1} X^T Y. Thus, the MLE and OLS estimates are equivalent for the regression model with the assumptions listed in Table 3.1. It turns out that the MLE for the variance σ² is biased toward zero, a result of small sample bias. In other words, MLEs are borne out of asymptotic theory, and as the sample size increases, the MLE estimates are consistent.

A couple of issues are worth mentioning at this stage. First, the selection of MLE or OLS estimators depends on the regression setting. It is shown in later chapters that, in general, when the response variable and disturbances are not normally distributed, MLE is the preferred method of estimation. When regression is being performed under the ideal conditions expressed in Table 3.1, OLS is the preferred method. Second, MLE requires that the distribution family of the response variable be specified a priori, whereas OLS does not.
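The equivalence of the two estimators under the Table 3.1 assumptions can also be demonstrated numerically; the sketch below maximizes the log likelihood of Equation 3.24 with a general-purpose optimizer and compares the result with the OLS solution (all data are simulated).

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(params, X, y):
        # Negative of Equation 3.24; the last parameter is LN(sigma^2)
        beta, log_s2 = params[:-1], params[-1]
        resid = y - X @ beta
        n = len(y)
        ll = -0.5 * n * (np.log(2 * np.pi) + log_s2) \
             - 0.5 * (resid @ resid) / np.exp(log_s2)
        return -ll

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(100), rng.uniform(0.0, 10.0, 100)])
    y = X @ np.array([2.0, 0.5]) + rng.normal(0.0, 1.0, 100)

    mle = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y)).x[:-1]
    ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(mle, ols)  # the MLE betas match the OLS betas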

3.2.3 Properties of OLS and MLE Estimators

As discussed in Chapter 1, statistical estimators are sought that have small (minimum) sampling variance and are unbiased. The previous section showed that the MLE and OLS estimators are equivalent for the regression model with the assumptions listed in Table 3.1.

The expected value of the MLE and OLS estimators is obtained as follows:

E[B] = E[(X^T X)^{−1} X^T Y] = (X^T X)^{−1} X^T E[Y] = (X^T X)^{−1} X^T Xβ = β. (3.25)

Thus, the MLE and OLS estimators are unbiased estimators of the betas. The variance of the MLE and OLS estimator for the betas in the classical linear regression model without the assumption of normality has been derived through the Gauss–Markov theorem (for proofs see Myers, 1990, or Greene, 1990a). The theorem states that under the classical regression assumptions (without normality), the MLE and OLS estimators achieve minimum variance in the class of linear unbiased estimators. This means that only through the use of biased estimators can smaller variances be achieved. If the assumption of normality is imposed (in addition to independence and E[ε] = 0), then the OLS and MLE estimators are minimum variance among all unbiased estimators (the linear restriction is dropped). To compute the variance of the estimator, it is useful first to write the relation between B and β,

B = (X^T X)^{−1} X^T Y = (X^T X)^{−1} X^T (Xβ + ε) = β + (X^T X)^{−1} X^T ε. (3.26)

Then, using the facts that VAR[x] = E[(x − E[x])²], (AB)^T = B^T A^T, (A^{−1})^T = (A^T)^{−1}, and E[εε^T] = σ²I, the variance of the OLS and MLE estimator is


VAR[B] = E[(B − β)(B − β)^T] = E[(X^T X)^{−1} X^T εε^T X (X^T X)^{−1}] = (X^T X)^{−1} X^T (σ²I) X (X^T X)^{−1} = σ²(X^T X)^{−1}. (3.27)

Both the MLE and OLS estimates, which are equivalent for the normal theory regression model, are the best linear unbiased estimators of the underlying population parameters β.

3.2.4 Inference in Regression Analysis

To illustrate how inferences are made with the beta parameters in classical linear regression, consider the sampling distribution of the estimated parameter B1. The point estimator B1 in algebraic terms is given in Equation 3.17, and the matrix equivalent is provided in Equation 3.20. The sampling distribution of B1 is the distribution of values that would result from repeated samples drawn from the population with levels of the independent variables held constant. To see that this sampling distribution is approximately normal, such that

B1 ~ N(β1, σ²/Σ_{i=1}^{n} (Xi − X̄)²), (3.28)

one must first show, using simple algebra (see Neter et al., 1996), that B1 is a linear combination of the observations Y, such that B1 = Σ kiYi. Since the Yi

terms are independently and normally distributed random variables, and given that a linear combination of independent normal random variables is normally distributed, Equation 3.28 follows.

The variance of the OLS estimator for some parameter Bk was shown in Equation 3.27 to be minimum among unbiased estimators. However, the true


population variance is typically unknown, and instead it is estimated with an unbiased estimate called mean squared error (error is an equivalent term for disturbance), or MSE. MSE is an estimate of the variance σ² in the regression model and is given as

MSE = Σ_{i=1}^{n} (Yi − Ŷi)²/(n − p), (3.29)

where n is the sample size and p is the number of estimated model parameters. It can be shown that MSE is an unbiased estimator of σ², or that E[MSE] = σ².

Because Bk is normally distributed, and βk is a constant and is the expected value of Bk, the quantity

Z* = (Bk − βk)/σ(Bk)

is a standard normal variable. In practice the true variance in the denominator is not known and is estimated using MSE. When σ(Bk) is estimated using MSE, the following is obtained:

t* = (Bk − βk)/s(Bk) ~ t(n − p), where s(Bk) = √[MSE/Σ_{i=1}^{n} (Xi − X̄)²], (3.30)

where α is the level of significance and n − p is the associated degrees of freedom. This is an important result; it enables a statistical test of the probabilistic evidence in favor of specific values of βk.

Perhaps the most common use of Equation 3.30 is in the development of a confidence interval around the parameter βk. A confidence interval for the parameter βk is given by

Bk ± t(1 − α/2; n − p) s(Bk). (3.31)

The correct interpretation of a confidence interval is as follows. In a (1 − α)100% confidence interval (CI) on βk, the true value of βk will lie within the CI (1 − α) × 100 times out of 100 on average for repeated samples drawn from the population with levels of the independent variables held constant from sample to sample. Thus, the confidence interval provides the long-run probability that the true value of βk lies in the computed


confidence interval, conditioned on the same levels of X being sampled. Incorrect interpretations of confidence intervals include statements that assign a probability to the particular experimental or observational outcome obtained and statements that exclude the conditional dependence on repeated samples at levels of the independent variables.

Often the statistic t* in Equation 3.30 is employed to conduct hypothesis tests. A two-sided hypothesis test is set up as

H0: βk = 0
Ha: βk ≠ 0.

The decision rule given a level of significance α is

If |t*| ≤ t(1 − α/2; n − p), conclude H0;
if |t*| > t(1 − α/2; n − p), conclude Ha, (3.32)

where tcrit = t(1 − α/2; n − p) is the critical value of t corresponding with level of significance α and degrees of freedom n − p.

A one-sided hypothesis test may be set up as follows. The null and alternative hypotheses are

H0: βk ≤ 0
Ha: βk > 0.

The decision rules for the one-sided test are

If t* ≤ t(1 − α; n − p), conclude H0;
if t* > t(1 − α; n − p), conclude Ha, (3.33)

where again tcrit is the critical value of t corresponding with level of significance α and degrees of freedom n − p. Note that in the one-sided test the absolute value brackets are removed from the decision rule, making the decision rule directional.
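The quantities in Equations 3.29 through 3.33 are straightforward to compute directly; the NumPy/SciPy sketch below is illustrative, returning estimates, standard errors, t-statistics, and (1 − α) confidence intervals for each parameter.

    import numpy as np
    from scipy.stats import t

    def ols_inference(X, y, alpha=0.05):
        n, p = X.shape
        b = np.linalg.solve(X.T @ X, X.T @ y)            # OLS estimates
        resid = y - X @ b
        mse = (resid @ resid) / (n - p)                  # Equation 3.29
        se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
        t_stat = b / se                                  # t* under H0: beta_k = 0
        t_crit = t.ppf(1.0 - alpha / 2.0, n - p)
        ci = (b - t_crit * se, b + t_crit * se)          # Equation 3.31
        return b, se, t_stat, ci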

Example 3.2

Consider again the study of AADT in Minnesota. Interest centers on the development of confidence intervals and hypothesis tests on some of the


parameters estimated in Example 3.1. Specifically, it is of interest to assess whether the observed data support the originally proposed model. Shown in Table 3.4 is expanded statistical model output.

Shown in the table are the least squares regression parameter estimates, the estimated standard errors of the estimates s(Bk), the t-values corresponding with each of the estimated parameters, and finally the p-value — the probability associated with rejection of the null hypothesis that βk = 0. In many standard statistical packages the default null hypothesis is the two-tailed test of βk = 0. For example, consider the hypothesis test of whether or not county population (CNTYPOP) has a statistically significant effect on AADT. The null and alternative hypotheses are

H0: βCNTYPOP = 0
Ha: βCNTYPOP ≠ 0.

Using the two-sided decision rule (see Equation 3.32 with an α of 0.01, so α/2 = 0.005) results in the following:

|t*| = BCNTYPOP/s(BCNTYPOP) = 0.0288/0.0048 ≈ 6 > t(1 − 0.01/2; 115) = tcrit.

The critical value tcrit for 1 − α/2 = 0.995 and 115 degrees of freedom is 2.617. Because 6 is greater than 2.617, the null hypothesis is rejected. The standard S-Plus output also gives this information in slightly different form (as do most other statistical packages). The output shows that t* = 5.9948 ≈ 6 is associated with a p-value of less than 0.0001. Stated another way, the probability of obtaining the parameter value of 0.0288 conditional on the null hypothesis being true (the parameter is equal to zero) is almost 0. There is always some small probability greater than zero, which means there is always some probability of making a Type I error.

The range of plausible values of the estimated effect of FUNCLASS2 on AADT is of interest. A 99% confidence interval is computed as

TABLE 3.4
Least Squares Estimated Parameters (Example 3.2)

Parameter    Parameter Estimate   Standard Error of Estimate   t-value   P(> |t|)
Intercept    −26234.490           4935.656                     −5.314    <0.0001
CNTYPOP      0.029                0.005                        5.994     <0.0001
NUMLANES     9953.676             1375.433                     7.231     <0.0001
FUNCLASS1    885.384              5829.987                     0.152     0.879
FUNCLASS2    4127.560             3345.418                     1.233     0.220
FUNCLASS3    35453.679            4530.652                     7.825     <0.0001


BFUNCLASS2 ± t(1 − α/2; n − p) s(BFUNCLASS2) = 4127.6 ± 2.617(3346.4) = 4127.6 ± 8757.5 = (−4629.9, 12885.1).

The linear relationship between AADT and FUNCLASS2 (facility is a rural non-interstate) is between −4629.9 and 12,885.1 with 99% confidence. In particular, the confidence interval would contain the true value of the parameter in 99 out of 100 repeated samples. Thus, the true parameter value could be positive, negative, or zero. At this confidence level there is insufficient evidence to support the parameter FUNCLASS2 as a statistically significant or meaningful variable for explaining variation in AADT.

Assessment of the proposed model suggests that the intercept term and the variables CNTYPOP, NUMLANES, and FUNCLASS3 are statistically significant, while FUNCLASS2 and FUNCLASS1 are not statistically significant at the 95% confidence level.

3.3 Manipulating Variables in Regression

There are numerous motivations and techniques for manipulating variables in regression. Standardized regression models allow for direct comparison of the relative importance of independent variables. Transformations allow regression assumptions to be met, indicator variables allow appropriate expression of nominal and ordinal scale variables, and interactions enable synergistic effects among variables to be modeled.

3.3.1 Standardized Regression Models

Often interest is focused on the relative impacts of independent variables on the response variable Y. Using the original measurement units of the variables will not provide an indication of which variables have the largest relative impact on Y. For example, if two independent variables in a model describing the expected number of daily trip-chains were number of children and household income, one could not directly compare the estimated parameters for these two variables, because the former is in units of marginal change in the number of daily trip chains per unit child and the latter is marginal change in the number of daily trip chains per unit of income. Number of children and household income are not directly comparable. Households may have children ranging from 0 to 8, while income may range from $5000 to $500,000, making it difficult to make a useful comparison between the relative impacts of these variables.


The standardized regression model is obtained by standardizing all independent variables. Standardization strictly works on continuous variables, those measured on interval or ratio scales. New variables are created by applying the standardization formula

X1* = (X1 − X̄1)/s(X1)

to all continuous independent variables, where standardized variables are created with expected values equal to 0 and variances equal to 1. Thus, all continuous independent variables will be on the same scale, so comparisons across standardized units can be made. The estimated regression parameters in a standardized regression model are interpreted as a change in the response variable per unit change of one standard deviation of the independent variable. Most statistical software packages will generate standardized regression model output as an option in ordinary least squares regression. If this is not a direct option, most packages will allow for the standardization of variables, which then can be used in the regression.
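A minimal sketch of the standardization step is shown below; the two covariates are hypothetical and are chosen to be on very different scales.

    import numpy as np

    def standardize(x):
        # rescale to mean 0 and sample variance 1
        return (x - x.mean()) / x.std(ddof=1)

    children = np.array([0, 1, 2, 3, 1, 0, 4, 2], dtype=float)   # hypothetical
    income = np.array([32e3, 58e3, 41e3, 120e3, 75e3, 28e3, 95e3, 67e3])
    Z = np.column_stack([standardize(children), standardize(income)])
    # betas estimated on the columns of Z are directly comparable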

3.3.2 Transformations

The application of linear regression requires relationships that are linear (with some rare exceptions such as polynomial regression models). This is a direct result of the linear-in-parameters restriction. It is not always expected nor defensible, however, that physical, engineering, or social phenomena are best represented by linear relationships, and nonlinear relationships are often more appropriate. Fortunately, nonlinear relationships can be accommodated within the linear regression framework. Being able to accommodate nonlinear relationships provides a great deal of flexibility for finding defensible and suitable relationships between two or more variables. Because variable transformations are useful in the application of many statistical methods, Appendix D presents a comprehensive list of transformation options, the nature of different transformations, and their consequences. In this section two example transformations, as they might be used in regression, as well as their consequences and impacts, are presented and discussed.

Some ground rules for using transformations in the linear regression modeling framework are worthy of note:

1. Linear relationships between the dependent and all independent variables are a requirement of the regression modeling framework, with rare exceptions being polynomial expressions of variables such as X, X², X³, etc.

2. Research interest is typically focused on Y in its original measurement units, and not transformed units.


3. Transformations on Y, X, or both are merely a way to rescale the observed data so that a linear relationship can be identified.

Suppose the relationship between variables Y and X is modeled using linear regression. To obtain a linear relationship required by linear regression, the reciprocals of both Y and X are computed such that the estimated model takes the form:

1/Ŷ = B0 + B1(1/X). (3.34)

To obtain original Y measurement units and determine the nature of the underlying true model assumed by this transformation, the reciprocal of both sides is computed to obtain the following:

Y = 1/[B0 + B1(1/X) + e] = X/(B0X + B1 + eX). (3.35)

By taking the reciprocal of both sides of Equation 3.34, the unknown true relationship is assumed to take the form shown in Equation 3.35, where β1 is estimated by B1, β0 by B0, and the disturbance term is a function of X. Important to note is that the underlying model implied by this transformation is more complex than the linear regression model assumes: linear additive parameters and an additive disturbance term.

Consider another example. Suppose that the logarithm of Y is a linear function of the inverse of X. That is, an estimated regression function takes the form:

LN(Y) = B0 + B1(1/X) + e. (3.36)

What underlying model form do these transformations imply about the relationship between Y and X? Again, the original Y units are obtained by performing the following transformations:

Y = EXP[LN(Y)] = EXP[B0 + B1(1/X) + e] = EXP[B0] EXP[B1/X] EXP[e]. (3.37)


In this function the constant term enters as EXP[B0], the effect of X enters through EXP[B1/X], and the disturbance term EXP[e] is a function of e and X and is multiplicative.
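The back-transformation is easy to verify numerically; the sketch below simulates data from the multiplicative model of Equation 3.37, fits the transformed linear model, and recovers predictions in the original Y units (all numbers are invented).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(1.0, 10.0, 200)
    y = np.exp(1.5) * np.exp(-2.0 / x) * np.exp(rng.normal(0.0, 0.1, 200))

    # fit LN(y) = B0 + B1 (1/x) by least squares
    X = np.column_stack([np.ones_like(x), 1.0 / x])
    b0, b1 = np.linalg.solve(X.T @ X, X.T @ np.log(y))
    y_hat = np.exp(b0) * np.exp(b1 / x)   # back to original units, per Equation 3.37
    print(b0, b1)                         # approximately 1.5 and -2.0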

Similar derivations can be carried out for a wide range of transformations that are often applied to dependent and independent variables. Appendix D describes a variety of typical transformations and some of their permutations. The use of transformations implies alternative relationships about the underlying true and unknown model, leading to the following points of caution:

1. The implied underlying model should be consistent with the theoretical expectations of the phenomenon being studied.

2. An additive disturbance term may be converted to a multiplicative disturbance term through the application of transformations.

3. If the disturbances do not behave well in the linear regression framework, then it is likely that the correct specification has not yet been found.

4. Sometimes it is easier and more appealing to estimate directly the underlying true model form using a nonlinear regression modeling technique than to apply a set of transformations on the raw data.

3.3.3 Indicator Variables

Often there is interest in modeling the effects of ordinal and nominal scale variables. As examples, the effects of qualitative variables such as roadway functional class, gender, attitude toward transit, and trip purpose are often sought. The interpretation of nominal and ordinal scale variables in regression models is different from that for continuous variables.

For nominal scale variables, m − 1 indicator variables must be created to represent all m levels of the variable in the regression model. These m − 1 indicator variables represent different categories of the response, with the omitted level captured in the slope intercept term of the regression. Theoretically meaningless and statistically insignificant indicator variables should be removed or omitted from the regression, leaving only those levels of the nominal scale variable that are important and relegating other levels to the “base” condition (slope intercept term).
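In practice the m − 1 indicators are rarely built by hand; the pandas sketch below (with hypothetical data) drops one level so that it becomes the base condition absorbed by the intercept.

    import pandas as pd

    df = pd.DataFrame({'FUNCTIONALCLASS': [1, 2, 3, 4, 2, 3, 1, 4]})  # hypothetical
    indicators = pd.get_dummies(df['FUNCTIONALCLASS'], prefix='FUNCLASS',
                                drop_first=True)
    print(indicators)  # three 0/1 columns; level 1 is the omitted base level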

Example 3.3

Consider again the study of AADT in Minnesota. In Example 3.2 three indicator variables representing three different functional classes of roadways were used. The variable FUNCLASS3 represents the subset of observations belonging to the class of facilities that are urban interstates. The example shows that urban interstates have an estimated marginal effect of 35,453.7 AADT; that is, urban interstates are associated with 35,453.7 more AADT on average than urban non-interstates (FUNCLASS4). The effect of this indicator variable is


statistically significant, whereas the effects for FUNCLASS1 and FUNCLASS2 are not. In practice these findings would be contrasted with theoretical expectations and other empirical findings. The effect of urban non-interstates (FUNCLASS4) is captured in the intercept term. Specifically, for urban non-interstate facilities the county population (CNTYPOP) is multiplied by 0.0288, then 9953.7 is added for each lane of the facility, and then 26,234.5 (the Y-intercept term B0) is subtracted to obtain an estimate of AADT.

Ordinal scale variables, unlike nominal scale variables, are ranked, and can complicate matters in the regression. Several methods for dealing with ordinal scale variables in the regression model, along with a description of when these methods are most suitable, are provided in the next section.

Estimate a Single Beta Parameter

The assumption in this approach is that the marginal effect is equivalent across increasing levels of the variable. For example, consider a variable that reflects the response to the survey question: Do you support congestion pricing on the I-10 expressway? The responses to the question include: 1 = do not support; 2 = not likely to support; 3 = neutral; 4 = likely to support; and 5 = strongly support. This variable reflects an ordered response with ordered categories that do not possess even intervals across individuals or within individuals. If a single beta parameter for this variable is estimated, the assumption is that each unit increase (or decrease) of the variable has an equivalent effect on the response. This is unlikely to be a valid assumption. Thus, disaggregate treatment of ordinal variables is more appropriate.

Estimate Beta Parameter for Ranges of the Variable

Suppose that an ordinal variable had two separate effects, one across one portion of the variable’s range of values, and another over the remainder of the variable. In this case two indicator variables can be created, one for each range of the variable. Consider the variable NUMLANES used in previous chapter examples. Although the intervals between levels of this variable are equivalent, it may be believed that a fundamentally different effect on AADT exists for different levels of NUMLANES. Thus, two indicator variables could be created such that

$$Ind_{NUMLANES1} = \begin{cases} NUMLANES & \text{if } NUMLANES \leq 2 \\ 0 & \text{otherwise,} \end{cases} \qquad Ind_{NUMLANES2} = \begin{cases} NUMLANES & \text{if } NUMLANES > 2 \\ 0 & \text{otherwise.} \end{cases}$$


These two indicator variables would allow the estimation of two parameters, one for the lower range of the variable NUMLANES and one for the upper range. If there was an a priori reason to believe that these effects would be different given theoretical, behavioral, or empirical considerations, then the regression would provide evidence on whether these separate ranges of NUMLANES supported separate regression parameters. Note that a linear dependency exists between these two indicator variables and the variable NUMLANES.
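A minimal sketch of this range coding in Python (the break between ranges is illustrative, and the data are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"NUMLANES": [2, 4, 6, 2, 8]})  # hypothetical data

    # piecewise coding: each indicator carries NUMLANES over one range only
    df["IndNUMLANES1"] = df["NUMLANES"].where(df["NUMLANES"] <= 2, 0)
    df["IndNUMLANES2"] = df["NUMLANES"].where(df["NUMLANES"] > 2, 0)
    # IndNUMLANES1 + IndNUMLANES2 == NUMLANES, the linear dependency noted above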

Estimate a Single Beta Parameter for m – 1 of the m Levels of the Variable

The third, most complex treatment of an ordinal variable is equivalent to the treatment of a nominal scale variable. In this approach m – 1 indicator variables are created for the m levels of the ordinal scale variable. The justification for this approach is that each level of the variable has a unique marginal effect on the response. This approach is generally applied to an ordinal scale variable with few responses and with theoretical justification. For the variable NUMLANES, for example, three indicator variables could be created (assuming the range of NUMLANES is from 1 to 4) as follows:

$$Ind_{NUMLANES1} = \begin{cases} 1 & \text{if } NUMLANES = 1 \\ 0 & \text{otherwise,} \end{cases} \quad Ind_{NUMLANES2} = \begin{cases} 1 & \text{if } NUMLANES = 2 \\ 0 & \text{otherwise,} \end{cases} \quad Ind_{NUMLANES3} = \begin{cases} 1 & \text{if } NUMLANES = 3 \\ 0 & \text{otherwise.} \end{cases}$$

This variable is now expressed in the regression as three indicator variables, one for each of the first three levels of the variable. As before, linear regression is used to assess the evidence whether each level of the variable NUMLANES deserves a separate beta parameter. This can be statistically evaluated using an F test described later in this chapter.

3.3.4 Interactions in Regression Models

Interactions in regression models represent a combined or synergistic effect of two or more variables. That is, the response variable depends on the joint values of two or more variables. A second-order effect is an interaction between two variables, typically denoted as X1X2. A third-order effect is an interaction between three variables, and so on. As a general rule, the lower


the order of effect (first order being a non-interacted variable), the more influence the variable has on the response variable in the regression. In addition, higher-order effects are generally included in the regression only when their lower-order counterparts are also included. For example, the second-order interaction X1X2 in the regression would accompany the first-order effects X1 and X2.

To arrive at some understanding of the impact of including indicator variables and interaction variables, assume that X1 is a continuous variable in the model. Also suppose that there are three indicator variables representing three levels of a nominal scale variable, denoted X2 through X4 in the model. The following estimated regression function is obtained (the subscript i denoting observations is dropped from the equation for convenience):

$$\hat{Y} = B_0 + B_1X_1 + B_2X_2 + B_3X_3 + B_4X_4. \qquad (3.38)$$

Careful inspection of the estimated model shows that, for any given observation, at most one of the new indicator variables remains in the model. This is a result of the 0–1 coding: when X2 = 1 then X3 and X4 are equal to 0 by definition. Thus, the regression model is expanded to four different models, which represent the four different possible levels of the indicator variable, such that

level 1 ($X_2 = X_3 = X_4 = 0$): $\hat{Y} = B_0 + B_1X_1$

level 2 ($X_2 = 1$): $\hat{Y} = B_0 + B_1X_1 + B_2 = (B_0 + B_2) + B_1X_1$

level 3 ($X_3 = 1$): $\hat{Y} = B_0 + B_1X_1 + B_3 = (B_0 + B_3) + B_1X_1$

level 4 ($X_4 = 1$): $\hat{Y} = B_0 + B_1X_1 + B_4 = (B_0 + B_4) + B_1X_1.$

Inspection of these regression models reveals an interesting result. First, depending on which of the indicator variables is coded as 1, the slope of the regression line with respect to X1 remains fixed, while the Y-intercept parameter changes by the amount of the parameter of the indicator variable. Thus, indicator variables, when entered into the regression model in this simple form, represent marginal adjustments to the Y-intercept term.


Suppose that each indicator variable is thought to interact with the variable X1. That is, each level of the indicator variable has a unique effect on Y when interacted with X1. To include these interactions in the model, the former model is revised to obtain

$$\hat{Y} = B_0 + B_1X_1 + B_2X_2 + B_3X_3 + B_4X_4 + B_5X_2X_1 + B_6X_3X_1 + B_7X_4X_1. \qquad (3.39)$$

The difference between this regression model and Equation 3.38 is that each indicator is entered in the model twice: as a stand-alone variable and interacted with the continuous variable X1. This more complex model is expanded to the following set of regression equations:

level 1 ($X_2 = X_3 = X_4 = 0$): $\hat{Y} = B_0 + B_1X_1$

level 2 ($X_2 = 1$): $\hat{Y} = B_0 + B_1X_1 + B_2 + B_5X_1 = (B_0 + B_2) + (B_1 + B_5)X_1$

level 3 ($X_3 = 1$): $\hat{Y} = B_0 + B_1X_1 + B_3 + B_6X_1 = (B_0 + B_3) + (B_1 + B_6)X_1$

level 4 ($X_4 = 1$): $\hat{Y} = B_0 + B_1X_1 + B_4 + B_7X_1 = (B_0 + B_4) + (B_1 + B_7)X_1.$

Each level of the indicator variable now has an effect on both the Y-intercept and slope of the regression function with respect to the variable X1. The net effect of this model is that each level of the nominal scale variable corresponds with a separate regression function with both a unique slope and Y-intercept. Figure 3.1 shows the graphical equivalent of the regression function for the model previously described with three levels of an indicator coded, allowing both individual slopes and Y-intercepts for the three levels and the one base condition to vary. The estimated regression parameters B2 through B7 can be positive or negative, and the signs and magnitudes of the regression functions depend on their magnitudes relative to the parameters B0 and B1. The parameters B2 through B4 represent estimated marginal effects of the variables X2 through X4 on the Y-intercept, while the parameters B5 through B7 represent estimated interaction effects of the variables X2 through X4 with the continuous variable X1.

When interactions are between two or more continuous variables the regression equation becomes more complicated. Consider the following regression model, where variables X1 and X2 are continuous:

$$Y = B_0 + B_1X_1 + B_2X_2 + B_3X_1X_2.$$


Because the variables X1 and X2 do not represent a 0 or 1 as in the previous case, the model cannot be simplified as done previously. Notice that the response variable in this model is dependent not only on the individual values X1 and X2, but also on the joint value X1 times X2. This model contains an interaction term, which accounts for the fact that X1 and X2 do not act independently on Y. The regression function indicates that the relationship between X1 and Y is dependent on the value of X2, and conversely, that the relation between X2 and Y is dependent on the value of X1. The nature of the dependency is captured in the interaction term, X1X2. There are numerous cases when the effect of one variable on Y depends on the value of one or more independent variables. Interaction terms often represent important aspects of a relationship and should be included in the regression. Although they may not explain a great deal of the variability in the response, they may represent important theoretical aspects of the relation and may also represent highly significant (statistically) effects.
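A minimal sketch of estimating such an interaction with a model formula (the data frame and coefficient values are hypothetical; the `*` operator expands to the main effects plus their product):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    df = pd.DataFrame({"X1": rng.normal(size=200), "X2": rng.normal(size=200)})
    df["Y"] = 1 + 2 * df.X1 + 3 * df.X2 + 0.5 * df.X1 * df.X2 + rng.normal(size=200)

    # Y ~ X1 * X2 expands to the main effects plus the interaction X1:X2
    fit = smf.ols("Y ~ X1 * X2", data=df).fit()
    print(fit.params)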

3.4 Checking Regression Assumptions

Regression assumptions should be thought of as requirements. Refer to Table 3.1 to recall the requirements of the ordinary least squares regression model.

FIGURE 3.1 Regression function with four levels of indicator variable represented. [Figure: four fitted lines of Y against X1, with Y-intercepts B0, B0 + B2, B0 + B3, and B0 + B4 and slopes B1, B1 + B5, B1 + B6, and B1 + B7.]

© 2003 by CRC Press LLC

Page 97: Data Analysis Book

Assumption one requires a linear-in-parameters relationship between the response and predictor variables. Assumption two is met as a consequence of ordinary least squares estimation. Assumption three requires that the disturbances have the same variance across observations. Assumption four requires that there is not a temporal trend over observations. Assumption five requires that the X terms are exogenous, or determined by factors outside the regression model. Assumption six requires that the distribution of disturbances is approximately normal.

The majority of regression assumptions can be checked using informal graphical plots. It is stressed that informal plots are simply methods for checking for gross, extreme, or severe violations of regression assumptions. In most cases, moderate violations of regression assumptions have a negligible effect, and so informal methods are adequate and appropriate. When cases are ambiguous, more formal statistical tests should be conducted. Each of the informal and some of the more formal tests are discussed in the following sections.

3.4.1 Linearity

Linearity is checked informally using several plots. These include plots of each of the independent variables on the X-axis vs. the disturbances (residuals) on the Y-axis, and plots of model predicted (fitted) values on the X-axis vs. disturbances on the Y-axis. If the regression model is specified correctly, and relationships between variables are linear, then these plots will lack curvilinear trends in the disturbances. Gross, severe, and extreme violations are being sought, so curvilinear trends should be clearly evident or obvious.
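A minimal sketch of such a diagnostic plot (the data and variable names are hypothetical):

    import matplotlib.pyplot as plt
    import numpy as np
    import statsmodels.api as sm

    # hypothetical design matrix X and response y
    X = sm.add_constant(np.random.default_rng(1).normal(size=(100, 2)))
    y = X @ np.array([1.0, 2.0, -1.0]) + np.random.default_rng(2).normal(size=100)

    fit = sm.OLS(y, X).fit()
    plt.scatter(fit.fittedvalues, fit.resid)   # fitted values vs. disturbances
    plt.axhline(0, linestyle="--")
    plt.xlabel("Fitted values"); plt.ylabel("Residuals")
    plt.show()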

Example 3.4

Consider the regression model from Example 3.2. The model depicted is clearly not the most desirable model. Two of the independent variables are statistically insignificant, and linearity may be in question. Figure 3.2 shows the model fitted values vs. the model disturbances.

Inspection of the plot shows an evident curvilinear pattern in the disturbances. All disturbances at low predicted AADT are positive, the bulk of disturbances between 10,000 and 70,000 AADT are negative, and the disturbances after 80,000 AADT are positive. This “U-shape” of disturbances is a clear indicator of one or more nonlinear effects in the regression.

To determine which independent variable(s) is contributing to the nonlinearity, plots of individual independent variables vs. the model disturbances are constructed and examined. After inspection of numerous plots the culprit responsible for the nonlinearity is determined to be the variable NUMLANES. Figure 3.3 shows the variable NUMLANES vs. model disturbances for the model described in Example 3.2.


The nonlinearity between NUMLANES and model disturbances suggests that treating the variable NUMLANES as a continuous variable may not be the best specification of this variable. Clearly, the variable NUMLANES has an effect on AADT; however, this effect is probably nonlinear.

Indicator variables for categories of NUMLANES are created, and insignificant levels of FUNCLASS are removed from the regression. The resulting model is shown in Table 3.5. The disturbances for the improved model are shown in Figure 3.4. Although there is still a slight U-shaped pattern to the disturbances, the effect is not as pronounced and both positive and negative

FIGURE 3.2 Fitted values vs. disturbances (Example 3.3). [Figure: residuals vs. fitted values from the model CNTYPOP + NUMLANES + FUNCLASS1 + FUNCLASS2 + FUNCLASS3; fitted values run from roughly 20,000 to 100,000 and residuals from about –40,000 to 60,000; observations 71, 65, and 64 are flagged.]

FIGURE 3.3 NUMLANES vs. disturbances (Example 3.3). [Figure: residuals vs. NUMLANES (2 through 8) for the same model.]


TABLE 3.5
Least Squares Estimated Parameters (Example 3.2)

Parameter     Parameter Estimate   Standard Error of Estimate   t-value   P(>|t|)
Intercept     58698.910            5099.605                     11.510    <0.0001
CNTYPOP       0.025                0.004                        5.859     <0.0001
NUMLANES2     58718.141            5134.858                     11.435    <0.0001
NUMLANES4     48867.728            4685.006                     10.431    <0.0001
FUNCLASS3     31349.211            3689.281                     8.497     <0.0001

FIGURE 3.4 Fitted values vs. disturbances (Example 3.5). [Figure: residuals vs. fitted values from the model CNTYPOP + NUMLANES2 + NUMLANES4 + FUNCLASS3; observations 91, 71, and 64 are flagged.]

FIGURE 3.5 NUMLANES vs. disturbances (Example 3.5). [Figure: residuals vs. NUMLANES (2 through 8) for the same model.]


disturbances can be found along the entire fitted regression line. Several observations that are outlying with respect to the bulk of the data are noted: observations 71, 91, and 64. A plot of NUMLANES vs. the new model disturbances, shown in Figure 3.5, reveals a more acceptable pattern of disturbances, which are distributed around 0 rather randomly and evenly.

Gross departures from the linearity assumption are detected through inspection of disturbance plots. A nonlinear relationship between the response Y and independent variables will be revealed through curvilinear trends in the plot. Nonlinear shapes that appear in these plots include “U,” inverted “U,” “waves,” and arcs. Remedial measures to improve nonlinearities through transformation of variable measurement scales are described in Appendix D.

3.4.2 Homoscedastic Disturbances

Constancy of disturbances is called homoscedasticity. When disturbances are not homoscedastic, they are said to be heteroscedastic. This requirement is derived from the variance term in the regression model, which is assumed to be constant over the entire regression. It is possible, for example, to have variance that is an increasing function of one of the independent variables, but this is a violation of the ordinary least squares regression assumption and requires special treatment. The consequence of a heteroscedastic regression is reduced precision of beta parameter estimates. That is, regression parameters will be less efficient under heteroscedastic compared to homoscedastic disturbances. This is intuitive. Since MSE is used to estimate $\sigma^2$, MSE will be larger for a heteroscedastic regression. It follows that MSE used in the calculation of the precision of beta parameters will also result in a larger estimate.

Scatter plots are used to assess homoscedasticity. A plot of model fitted values vs. disturbances is typically inspected first. If heteroscedasticity is detected, then plots of the disturbances vs. independent variables should be constructed to identify the culprit.

Example 3.5

The AADT model estimated in Example 3.4 was an improvement over the original model presented in Example 3.2. The apparent nonlinearity is improved, and all model variables are statistically significant and theoretically meaningful. The assumption of homoscedasticity is checked by inspecting the disturbances. The plot in Figure 3.6 shows the plot of disturbances vs. model fitted values for the model estimated in Example 3.4. The horizontal lines shown on the plot show a fairly constant band of equidistant disturbances around the regression function. In other words, the disturbances do not become systematically larger or smaller across fitted values of Y, as required by the regression. With the exception


of observation numbers 64 and 71, most disturbances fall within these bands. Inspection of the plot reveals lack of a severe, extreme, or significant violation of homoscedasticity.

Often, heteroscedasticity is easily detected. In many applications, disturbances that are an increasing function of fitted values of Y are often encountered. This simply means that there is greater variability for higher values of Y than for smaller values of Y. Consider a hypothetical model of household-level daily trip generation as a function of many sociodemographic independent variables. The variability in trip-making for households that make many trips, on average, is much larger than for households that make very few daily trips. Some wealthy households, for example, may make extra shopping and recreational trips, whereas similar households may be observed on days when few of these additional non-essential trips were made. This results in large variability for those households. Households that depend on transit, in contrast, may not have the financial means or the mobility to exhibit the same variability in day-to-day trip making.

Remedial measures for dealing with heteroscedasticity include transformations on the response variable, Y, weighted least squares (WLS), ridge regression, and generalized least squares. Only the first of these, transforming Y, can be done within the ordinary least squares regression framework. Care must be taken not to improve one situation (heteroscedasticity) at the expense of creating another, such as nonlinearity. Fortunately, fixing heteroscedasticity in many applications also tends to improve nonlinearity. WLS regression is a method used to increase the precision of beta parameter estimates and requires a slight modification to ordinary least squares regression. Ridge regression is a technique used to produce biased but efficient estimates of beta parameters. Generalized least squares is presented in Chapter 5.
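A minimal sketch of the WLS idea, assuming the disturbance standard deviation grows with a variable x so that weights proportional to 1/x² are appropriate (the data are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(1, 10, 200)
    y = 2.0 + 3.0 * x + rng.normal(scale=x)   # disturbance scale grows with x

    X = sm.add_constant(x)
    ols = sm.OLS(y, X).fit()
    wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights = 1 / variance
    print(ols.bse, wls.bse)  # WLS typically yields smaller standard errors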

FIGURE 3.6 Fitted values vs. disturbances (Example 3.5). [Figure: residuals vs. fitted values from the model CNTYPOP + NUMLANES2 + NUMLANES4 + FUNCLASS3, with horizontal bands around zero; observations 64, 91, and 71 are flagged.]


3.4.3 Uncorrelated Disturbances

Correlated disturbances can result when observations are dependent across individuals, time, or space. For example, traffic volumes recorded every 15 min are typically correlated across observations. Traffic volumes recorded at the same time each day would also be correlated. Other examples include observations on people made over time and a survey response to a question posed each year to the same household. Observations can also be correlated across space. For example, grade measurements recorded every 20 ft from a moving vehicle would be correlated.

Correlation of disturbances across time is called serial correlation. The standard plot for detecting serial correlation is a plot of disturbances vs. time, or a plot of disturbances vs. ordered observations (over space). Approximately normal, homoscedastic, and nonserially correlated disturbances will not reveal any patterns in this plot. Serially correlated disturbances will reveal a trend over time, with peaks and valleys in the disturbances that typically repeat themselves over fixed intervals. A familiar example is the plot of the voltage generated by a person’s heart over time. Each beat of the heart results in high voltage output, followed by low voltage readings between heartbeats.

A formal test for autocorrelation is based on the Durbin–Watson statistic. This statistic is provided by most regression software and is calculated from the disturbances of an OLS regression. When no autocorrelation is present the Durbin–Watson statistic will yield a value of 2.0. As the statistic gets farther away from 2.0 one becomes less confident that no autocorrelation is present. Additional details of this test are provided in Greene (2000) and in Chapter 4.
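A minimal sketch of computing the statistic from OLS residuals (the model and data here are placeholders):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(3)
    X = sm.add_constant(rng.normal(size=(100, 1)))
    y = X @ np.array([1.0, 0.5]) + rng.normal(size=100)

    resid = sm.OLS(y, X).fit().resid
    print(durbin_watson(resid))   # values near 2.0 suggest no autocorrelation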

Standard remedial measures for dealing with serial correlation are generalized least squares or time-series analysis, both of which explicitly account for the correlation among disturbances. Chapter 5 presents a discussion of generalized least squares. Chapter 7 presents details of time-series analysis, provides examples of disturbance plots, and describes when time-series analysis is an appropriate substitute for ordinary least squares regression.

3.4.4 Exogenous Independent Variables

Assumption number 5 in Table 3.1 is often called the exogeneity assumption, meaning that all independent variables in the regression are exogenous. Exogeneity may be one of the most difficult assumptions in the regression to describe and understand. The phrase “a variable is exogenous” means that the variable can be expected to vary “autonomously,” or independently of other variables in the model. The value of an exogenous variable is determined by factors outside the model. Alternatively, an endogenous variable is a variable whose variation is caused by other exogenous or endogenous variables in the model (one or more of the independent variables, or the dependent variable).

© 2003 by CRC Press LLC

Page 103: Data Analysis Book

The reader should note that exogeneity does not mean that two independent variables cannot covary in the regression; many variables do covary, especially when the regression is estimated on observational data. In the absence of a formal theory between them, correlated variables in the regression are assumed to be merely associated with one another and not causally related. If a causal relationship exists between independent variables in the model, then an alternative model structure is needed, such as a multiequation model (see Chapter 5) or a structural equation model (see Chapter 8). By ignoring the “feedback effect” caused by endogeneity, statistical inferences can be strongly biased.

As an example of endogeneity, consider the following hypothetical linear regression model:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon,$$

where

Y = current military expenditures, $ (millions)
X1 = number of past military conflicts
X2 = gross domestic product, $ (millions).

Recall that Y is said to be influenced by variables X1 and X2, and under the exogeneity assumption X1 and X2 are influenced by factors outside of the regression. Past military conflicts, X1, presumably would directly influence current military expenditures because a history of military conflicts would encourage a future military preparedness. However, the direction of causality might also be in the reverse direction: being prepared militarily might enable a country to become involved in more conflicts, since, after all, it is prepared. In this example the variable X1 in this model is said to be endogenous. In the case of endogenous variable X1, the variation in Y not explained by X1 is $\varepsilon$. Under endogeneity, however, $\varepsilon$ also contains variation in X1 not explained by Y, that is, the feedback of Y on X1.

To see the problem caused by endogeneity, consider a case where the ordinary least squares regression model and equations are given, respectively, by

$$\hat{Y} = B_0 + B_1 X \quad \text{and} \quad Y = B_0 + B_1 X + e.$$

Assume that X is endogenous, and is a function of Y; thus $X = \gamma Y + e^*$. The variable X consists of a systematic component, which is a function of Y, and a stochastic component $e^*$. Clearly, the variable X is correlated with the disturbance term $e^*$. However, the OLS regression equation does not contain a separate disturbance term $e^*$, and instead the term e contains both the unexplained variation of X on Y and Y on X. If the effect of endogeneity is ignored, the OLS regression parameter is estimated as usual (see Equation 3.17), where the variables X and Y are standardized:


$$\hat{B}_1 = \frac{\sum_{i=1}^{n} X_i y_i}{\sum_{i=1}^{n} X_i^2}.$$

Evaluating the estimator for bias, and using the fact that the covariance of two variables X and $\varepsilon$ is given as

$$COV(X, \varepsilon) = E\big[(X - E(X))(\varepsilon - E(\varepsilon))\big] = E\left[\sum_{i=1}^{n} X_i \varepsilon_i\right],$$

and the variance of standardized X is

$$VAR(X) = \sum_{i=1}^{n} X_i^2,$$

then

$$E\big(\hat{B}_1\big) = E\left[\frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}\right] = E\left[\frac{\sum_{i=1}^{n} X_i(\beta_1 X_i + \varepsilon_i)}{\sum_{i=1}^{n} X_i^2}\right] = \beta_1 + \underbrace{E\left[\frac{COV(X, \varepsilon)}{VAR(X)}\right]}_{\text{Bias}}. \qquad (3.40)$$

As depicted in Equation 3.40, when the covariance between X and $\varepsilon$ is zero the bias term disappears, which is the case under exogeneity. When endogeneity is present, however, the covariance between X and $\varepsilon$ is nonzero and the least squares estimate is biased. The direction of bias depends on the covariance between X and $\varepsilon$. Because the variance of X is always positive, a negative covariance will result in a negative bias, whereas a positive covariance will result in positive bias. In rare cases where the variance of X dwarfs the magnitude of the covariance, the bias term can be neglected.


3.4.5 Normally Distributed Disturbances

An assumption imposed simply to allow regression parameter inferences to be drawn is that disturbances are approximately normally distributed. In addition, the disturbances, or residuals, have an expected value of zero. Recall that the assumption of normality led to regression parameters that are approximately t distributed. It is the t distribution that allows probabilistic inferences about the underlying true parameter values. If making inferences is not important, then the assumption of normality can be dropped without consequence. For example, if the prediction of future mean responses is the only modeling purpose, then ignoring normality will not hinder the ability to make predictions.

In addition to future predictions, however, interest is often also focused on inferences about the values of beta parameters, and so the normality assumption must be checked. There are nongraphical, graphical, and nonparametric methods for assessing normality. The degrees to which these tools are used to assess normality depend on the degree of satisfaction obtained using more informal methods. If graphical methods suggest serious departures from normality, more sophisticated methods might be used to test the assumption further. The various methods include the following:

1. Summary statistics of the disturbances, including minimum, first and third quartiles, median, and maximum values of the disturbances. Recall that the normal distribution is symmetric and the mean and median should be equivalent and approximately equal to zero. Thus, the summary statistics can be inspected for symmetry about zero. Most statistical software packages either provide summary statistics of the disturbances by default or can generate the statistics easily.

2. Histograms of the disturbances. A histogram of the disturbances should reveal the familiar bell-shaped normal curve. There should be an approximately equivalent number of observations that fall above and below zero, and they should be distributed such that disturbances above and below zero are mirror images of each other. Most statistical software packages can produce histograms of the disturbances rather easily. Histograms show symmetry but do not provide information about the appropriate allocation of probability in the tails of the normal distribution.

3. Normal probability quantile-quantile (Q-Q) plots of the disturbances. The normal distribution has approximately 67% of the observations within 1 standard deviation above and below the mean (0 in the case of the disturbances), 95% within 2 standard deviations, and 99% within 3 standard deviations. Normal Q-Q plots are constructed such that normally distributed disturbances will plot on a perfectly straight line. Departures from normality will reveal departures from the


straight-line fit to the Q-Q plot. It is through cumulative experience plotting and examining Q-Q plots that an ability to detect extreme, severe, or significant departures from normality is developed. Taken together with summary statistics and histograms, Q-Q plots provide sufficient evidence on whether to accept or reject approximate normality of the disturbances.

4. Nonparametric methods such as the chi square goodness-of-fit (GOF) test or the Kolmogorov–Smirnov GOF test. In particular when sample sizes are relatively small, these methods might seem to reveal serious departures from normality. Under such circumstances statistical tests that assess the consistency of the disturbances with normality can be performed. In essence, both the chi square GOF test and the Kolmogorov–Smirnov GOF test provide probabilistic evidence of whether or not the data were likely to have been drawn from a normal distribution. Readers wishing to pursue these simple tests should refer to Chapter 2, or consult Neter et al. (1996) or Conover (1980).
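A minimal sketch combining several of these checks (the residuals are hypothetical placeholders):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    resid = rng.normal(scale=5000.0, size=121)   # placeholder disturbances

    # summary statistics: inspect symmetry about zero
    print(np.percentile(resid, [0, 25, 50, 75, 100]))

    # normal Q-Q plot coordinates (pass plot=ax to draw with matplotlib)
    (osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")

    # Kolmogorov-Smirnov GOF test against a fitted normal
    print(stats.kstest(resid, "norm", args=(resid.mean(), resid.std())))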

Sample size directly influences the degree to which disturbances exhibit “normal” behavior. Normally distributed small samples can exhibit apparently non-normal behavior as viewed through diagnostic tools. Disturbances obtained from sample sizes less than 30 should be expected to raise difficulties in assessing the normality assumption. On the other hand, disturbances obtained from large sample sizes (for example, greater than 100) should reveal behavior consistent with normality if indeed the disturbances are normal.

Example 3.6

Consider the regression model in Example 3.4. A check of the homoscedasticity assumption produced no evidence to reject it. Because the data were collected at different points in space and not over time, serial correlation is not a concern. The normality assumption is now checked. Inspection of standard summary statistics provides an initial glimpse at the normality assumption.

Min        1st Quartile   Median   3rd Quartile   Max
–57,878    –4,658         671.5    3,186          42,204

There are 121 observations in the sample, a relatively large sample. The minimum value is –57,878, a fairly large negative number, whereas the maximum is only 42,204. Symmetry of the disturbances may be in question, or perhaps there are a handful of observations that are skewing the “tails” of the disturbances distribution. Similarly, the 1st and 3rd quartiles are of different magnitudes, almost by 25%, and so symmetry is again in question. The median should be zero, but is less than 1% of the maximum disturbance and so does not appear to raise significant doubt


about symmetry. In addition, potential skew appears to be associated with the negative disturbances, in contrast to the median, which exhibits positive skew. In summary, the evidence is mixed and inconclusive, but certainly raises some doubts regarding normality.

A histogram of the disturbances is shown in Figure 3.7. The Y-axis shows the number of observations, and the X-axis shows bins of disturbances. Disturbances both positive and negative appear to be outlying with respect to the bulk of the data. The peak of the distribution does not appear evenly distributed around zero, the expected result under approximate normality. What cannot be determined is whether the departures represent serious, extreme, and significant departures, or whether the data are consistent with the assumption of approximate normality. In aggregate the results are inconclusive.

A normal Q-Q plot is shown in Figure 3.8. When disturbances are normally distributed the Q-Q plot reveals disturbances that lie on the dotted straight line shown in the figure.

Inspection of the figure shows that the tails of the disturbances’ distribution appear to depart from normality. Several observations, specifically observations 71, 91, and 64, appear to seriously depart from normality, as shown in Figure 3.8.

Based on the cumulative evidence from the summary statistics, the histogram, and the normal Q-Q plot, the statistical model can probably be improved. Outlying observations, which should be examined, could be partly responsible for the observed departures from normality.

FIGURE 3.7 Histogram of residuals (model in Example 3.4). [Figure: frequency histogram of residuals binned from –60,000 to 40,000, peaking near zero.]


3.5 Regression Outliers

Identifying influential cases is important because influential observations may have a disproportionate effect on the fit of the regression line. If an influential observation is outlying with respect to the underlying relationship for some known or unknown reason, then this observation would serve to misrepresent the true relationship between variables. As a result, the risk is that an “errant” observation is dominating the fitted regression line and thus the inferences drawn. It is recommended that outlying influential observations be systematically checked to make sure they are legitimate observations.

An observation can be influential with respect to its position on the Y-axis, its position on the X-axis, or both. Recall that the sum of squared disturbances is minimized using ordinary least squares. Thus, an observation two times as far from the fitted regression line as another will receive four times the weight in the fitted regression. In general, observations that are outlying with respect to both $\bar{X}$ and $\bar{Y}$ have the greatest influence on the fit of the regression and need to be scrutinized. Figure 3.9 shows a hypothetical regression line fit to data with three outlying observations. Observation a in the figure is relatively distant from $\bar{Y}$. Similarly, observation b is relatively distant from $\bar{X}$, and observation c is relatively distant from both $\bar{X}$ and $\bar{Y}$.

3.5.1 The Hat Matrix for Identifying Outlying Observations

A matrix called the “hat” matrix is often used to identify outlying observations. Recall that the matrix solution to the least squares regression parameter

FIGURE 3.8 Q-Q plot of residuals (model in Example 3.4). [Figure: residuals plotted against quantiles of the standard normal; observations 64, 91, and 71 depart from the straight line in the tails.]


estimates is $B = (X^T X)^{-1} X^T Y$. Fitted values of Y are given as $\hat{Y} = XB = X(X^T X)^{-1} X^T Y$. The hat matrix H is defined such that $\hat{Y} = HY$. Thus, the $n \times n$ hat matrix is defined as

$$H = X(X^T X)^{-1} X^T. \qquad (3.41)$$

Fitted values of $Y_i$ can be expressed as linear combinations of the observed $Y_i$ through the use of the hat matrix. The disturbances e can also be expressed as linear combinations of the observations Y using the hat matrix:

$$e = (I - H)Y. \qquad (3.42)$$

Because $VAR[Y] = \sigma^2 I$, it can easily be seen that the variance–covariance matrix of the disturbances can be expressed in terms of the hat matrix as

$$VAR[e] = \sigma^2 (I - H). \qquad (3.43)$$

Finally, with subscripting i for observations, the variance of disturbance $e_i$ is computed as

$$VAR[e_i] = \sigma^2 (1 - h_{ii}), \qquad (3.44)$$

where $h_{ii}$ is the ith element on the main diagonal of the hat matrix. This element can be obtained directly from the relation $h_{ii} = X_i^T (X^T X)^{-1} X_i$, where $X_i$ corresponds to the vector derived from the ith observation (row) in X. The diagonal elements $h_{ii}$ possess useful properties. In particular, their values are always between 0 and 1, and their sum is equal to p, the number of regression model parameters.

Some interesting observations can be made about the $h_{ii}$ contained in the hat matrix:

FIGURE 3.9 Regression line with three influential observations a, b, and c. [Figure: scatter of Y against X with $\bar{X}$ and $\bar{Y}$ marked; observation a lies far from $\bar{Y}$, observation b far from $\bar{X}$, and observation c far from both.]


1. The larger is $h_{ii}$, the more important $Y_i$ is in determining the fitted value $\hat{Y}_i$.

2. The larger is $h_{ii}$, the smaller is the variance of the disturbance $e_i$. This can be seen from the relation $VAR[e_i] = \sigma^2[1 - h_{ii}]$. In other words, high leverage values have smaller disturbances.

3. As a rule of thumb, a leverage value is thought to be large when any specific leverage value is greater than twice the average of all leverage values. It can easily be seen that the average value is p/n, where p is the number of parameters and n is the number of observations.
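A minimal sketch of computing the $h_{ii}$ and applying the 2p/n rule of thumb (the design matrix here is hypothetical):

    import numpy as np

    def leverage(X):
        """Diagonal of the hat matrix H = X (X'X)^{-1} X' for design matrix X."""
        XtX_inv = np.linalg.inv(X.T @ X)
        return np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # h_ii = x_i'(X'X)^{-1}x_i

    # intercept plus one predictor with one extreme x value
    X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 20.0])])
    h = leverage(X)
    n, p = X.shape
    print(h, h > 2 * p / n)   # the point at x = 20 shows up as high leverage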

3.5.2 Standard Measures for Quantifying Outlier Influence

Perhaps the most common measure used to assess the influence of an observation is Cook’s distance, D. Cook’s distance quantifies the impact of removal of each observation from the fitted regression function on estimated parameters in the regression function. If the effect is large, then $D_i$ is also large. Thus, Cook’s distance provides a relative measure of influence for observations in the regression.

In algebraic terms, Cook’s distance measure is given by (see Neter et al., 1996)

$$D_i = \frac{\sum_{j=1}^{n} \big(\hat{Y}_j - \hat{Y}_{j(i)}\big)^2}{p \, MSE} = \frac{e_i^2}{p \, MSE}\left[\frac{h_{ii}}{(1 - h_{ii})^2}\right]. \qquad (3.45)$$

The first form in Equation 3.45 considers the sum of squared observed minus fitted values with the ith case deleted from the regression. This computation is burdensome because it requires computing the sum of squares for each of the i observations. The second equivalent form shown in Equation 3.45 makes use of the hat matrix discussed previously and simplifies computation considerably. The diagonals in the hat matrix are computed and used along with the ith disturbance to determine the Cook’s distance value across observations. An equivalent form of Cook’s distance in matrix form is given as

$$D_i = \frac{\big(\hat{Y} - \hat{Y}_{(i)}\big)^T \big(\hat{Y} - \hat{Y}_{(i)}\big)}{p \, MSE}. \qquad (3.46)$$

Cook’s distance depends on two factors: the magnitude of the ith disturbance and the leverage value $h_{ii}$. Larger disturbances, larger leverage values, or both lead to large values of D.

Although Cook’s distance is a commonly applied measure in statistical software packages, there are other measures for assessing the influence of


observations. All methods use similar principles for assessing the influence of outliers (see Neter et al., 1996).
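A minimal sketch of Equation 3.45 in code (a hypothetical helper, not from the text):

    import numpy as np

    def cooks_distance(X, y):
        """Cook's D for each observation, following Equation 3.45."""
        n, p = X.shape
        B = np.linalg.solve(X.T @ X, X.T @ y)      # least squares estimates
        e = y - X @ B                               # disturbances
        mse = (e @ e) / (n - p)
        h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
        return (e**2 / (p * mse)) * h / (1 - h)**2

Plotting the returned values by observation index reproduces the kind of diagnostic shown in Figure 3.11.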

3.5.3 Removing Influential Data Points from the Regression

So far, methods for quantifying influential observations have been discussed. There are many possible scenarios that could give rise to outliers. The following describes how these scenarios arise and appropriate actions that should be taken.

1. Misspecification error. A specified model may be inappropriate and fail to account for some important effects, particularly with respect to influential cases. For example, an observation or group of observations could possess a trait (effect) in common that is omitted from the regression model and whose effect, therefore, is not estimated. If the omitted trait is available or subsequently measurable, it should be included in the model. Another misspecification error might be a nonlinearity in the regression; this could be improved by applying transformations or by using a nonlinear regression technique. In cases where an outlying observation(s) is suspected of possessing (sharing) a trait that cannot be identified, a difficult decision remains. Removing the observation(s) from the regression begs for outside criticism, with the accusation that “the data were fit to the model.” Leaving the observation(s) in the model results in an apparently unsatisfactory model that is open to criticism for “lack of fit.” Perhaps the most defensible course of action in these cases is to leave the outlying observations in the model and provide possible explanations for their being anomalous. If the outlying observations are also influential on the fit of the regression, robust regression techniques, such as absolute deviation regression, can be applied to reduce their relative influence.

2. Coding error. An influential data point (or points) was recorded incorrectly during data collection. The interim step between observation of a data point and recording of that observation often presents an opportunity for a recording or coding error. For example, a field observer writing down radar-detected maximum speeds on a field data collection sheet might accidentally record a 56 (mph) when in fact the reading was 65. In cases where it can be documented that a coding error was probable, usually through examination of data on the collection forms, erasures on the forms, illegibility, etc., it is justifiable to remove the data point(s) from the regression.

3. Data collection error. Influential observations were the result of malfunctioning equipment, human error, or other errors that occurred during data collection. For example, an in-vehicle “black box” that records continuous vehicle speed may not have been calibrated


correctly, and thus collects “bad” data during a data collection period. The outlier resulting from this type of error can be difficult to diagnose, as in some cases there is insufficient information to determine whether an error occurred. It is typically more defensible to leave these types of outliers in the model, as removing them from the regression without demonstrable reason reduces the credibility of the research.

4. Calculation error. Often there is a significant amount of data manipulation that occurs prior to analysis. Human error can easily creep into data manipulations, especially when these manipulations are not automated, leading to erroneous data. Typically, the best remedy for this situation is not to discard an outlier but instead fix the calculation and rerun the regression with the corrected observations.

Removing outliers from the regression should raise suspicion among those who critique the regression results. In effect, removal of outliers from a regression analysis results in an improved fit of the data to the model. The objective of regression, in contrast, is to have the model fit the data. Removing outliers without proper justification raises the possibility that data have been manipulated to support a particular hypothesis or model. As a result, complete documentation and justification for removing any and all data from an analysis is good practice.

Example 3.7

Consider the model estimated in Example 3.4 with a focus on outliers. Two plots are of prime interest: disturbances vs. fitted values, and observations vs. Cook’s distance. Figure 3.10 shows a plot of disturbances vs. fitted observations, and Figure 3.11 shows observations vs. Cook’s distance.

Three outliers or influential cases are identified as having disproportionate influence on the fit of the regression function: observations 58, 64, and 71. Having considerable influence on the fitted regression, these observations may mischaracterize the underlying “true” relationship if they are erroneous. An inquiry into the nature of observations 58, 64, and 71 is worthwhile. Table 3.6 shows summary statistics for the three outlying observations. Observations 58 and 64 are observations with relatively large observed AADT and CNTYPOP in comparison to the bulk of the data. All facilities were urban interstates.

The field data collection forms are examined to identify whether there were errors in coding, equipment problems, or any other extenuating circumstances. A comment in the field notes regarding observation 64 suggests that there was some question regarding the validity of the AADT count, as the counting station had not been calibrated correctly. For observation 71 there is a series of erasures and corrections made to the


FIGURE 3.10 Disturbances vs. fitted values (model in Example 3.4). [Figure: residuals vs. fitted values from the model CNTYPOP + NUMLANES2 + NUMLANES4 + NUMLANES8 + FUNCLASS3; observations 64, 65, and 71 are flagged.]

FIGURE 3.11 Observations vs. Cook’s distance, D (model in Example 3.4). [Figure: Cook’s distance (0.0 to 0.8) by observation index (0 to 120); observations 58, 64, and 71 stand out.]

TABLE 3.6
Assessment of Outliers (Example 3.6)

                                     Observation
Variable      Mean      Median     58        64        71
AADT          19,440    8,666      123,665   155,547   36,977
CNTYPOP       263,400   113,600    459,784   941,411   194,279
NUMLANES2     3.099     2.000      8         6         6
FUNCLASS3     2.727     2.000      3         3         3



AADT figure, and it is not clear whether the coded figure is correct. Errors or extenuating circumstances for observation 58 cannot be found. As a result observations 64 and 71 are removed from the analysis, leaving observation 58 as a valid observation.

A reestimation of the model after removing observations 64 and 71 and entering the variable NUMLANES8, which is now statistically significant, is shown in Table 3.7. The model is similar to the previous models, although the disturbances have been improved compared to the previous model. A histogram of the disturbances of the revised model, shown in Figure 3.12, reveals a more balanced bell-shaped distribution.

Although there are outlying observations with respect to the bulk of observations, they cannot be removed without proper justification. Additional variables that might influence the AADT on various facilities, such as

TABLE 3.7
Least Squares Estimated Parameters (Example 3.6)

Parameter     Parameter Estimate   Standard Error of Estimate   t-value    P(>|t|)
Intercept     59771.4183           4569.8595                    13.0795    <0.0001
CNTYPOP       0.0213               0.0029                       7.2198     <0.0001
NUMLANES2     –59274.8788          4569.1291                    –12.9729   <0.0001
NUMLANES4     48875.1655           4269.3498                    11.4482    <0.0001
NUMLANES8     22261.1985           10024.2329                   2.2207     0.0284
FUNCLASS3     31841.7658           2997.3645                    10.6233    <0.0001

R-squared: 0.8947
F-statistic: 192.1 on 5 and 113 degrees of freedom; the p-value is <0.0001

FIGURE 3.12 Histogram of disturbances (Example 3.6). [Figure: frequency histogram of residuals binned from –20,000 to 30,000, roughly symmetric about zero.]


characteristics of the driving population, the nature of cross-traffic such as driveways from industrial, commercial, and residential areas (for non-interstates only), and perhaps roadway geometric and operational characteristics, are missing from the set of explanatory variables. However, a fairly reliable predictive tool for forecasting AADT given predictor variables in the model has been obtained.

3.6 Regression Model Goodness-of-Fit Measures

Goodness-of-fit (GOF) statistics are useful for comparing the results across multiple studies, for comparing competing models within a single study, and for providing feedback on the extent of knowledge about the uncertainty involved with the phenomenon of interest. Three measures of model GOF are discussed: R-squared, adjusted R-squared, and the generalized F test.

To develop the R-squared GOF statistic, some basic notions are required. Sums of squares and mean squares are fundamental in both regression and analysis of variance. The sum of square errors (disturbances) is given by

$$SSE = \sum_{i=1}^{n} \big(Y_i - \hat{Y}_i\big)^2,$$

the regression sum of squares is given by

$$SSR = \sum_{i=1}^{n} \big(\hat{Y}_i - \bar{Y}\big)^2,$$

and the total sum of squares is given by

$$SST = \sum_{i=1}^{n} \big(Y_i - \bar{Y}\big)^2.$$

The SSE is the variation of the fitted regression line around the observations. The SSR is the variation of the fitted regression line around $\bar{Y}$, and SST is the total variation: the variation of each observation around $\bar{Y}$. It also can be shown algebraically that SST = SSR + SSE.

Mean squares are just the sums of squares divided by their degrees of freedom. SST has n – 1 degrees of freedom, because 1 degree of freedom is lost in the estimation of $\bar{Y}$. SSE has n – p degrees of freedom, because p parameters are used to estimate the fitted regression line. Finally, SSR has p – 1 degrees of freedom associated with it. As one would expect, the degrees



of freedom are additive such that n – 1 = (n – p) + (p – 1). The mean squares, then, are MSE = SSE/(n – p) and MSR = SSR/(p – 1). The coefficient of determination, R-squared, is defined as

$$R^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \qquad (3.47)$$

$R^2$ can be thought of as the proportionate reduction of total variation accounted for by the independent variables (X). It is commonly interpreted as the proportion of total variance explained by X. When SSE = 0, $R^2$ = 1, and all of the variance is explained by the model. When SSR = 0, $R^2$ = 0, and there is no association between X and Y.

Because $R^2$ can only increase when variables are added to the regression model (SST stays the same, and SSR can only increase even when statistically insignificant variables are added), an adjusted measure, $R^2_{adjusted}$, is used to account for the degrees of freedom changes as a result of different numbers of model parameters, and allows for a reduction in $R^2_{adjusted}$ as additional, potentially insignificant variables are added. The adjusted measure is considered to be superior for comparing models with different numbers of parameters. The adjusted coefficient of multiple determination is

$$R^2_{adjusted} = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SST}. \qquad (3.48)$$
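A minimal sketch of computing both measures from fitted values (a hypothetical helper, not from the text):

    import numpy as np

    def r_squared(y, y_hat, p):
        """R^2 and adjusted R^2 for n observations and p estimated parameters."""
        n = len(y)
        sse = np.sum((y - y_hat) ** 2)
        sst = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - sse / sst                              # Equation 3.47
        r2_adj = 1.0 - (n - 1) / (n - p) * sse / sst      # Equation 3.48
        return r2, r2_adj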

The following guidelines should be applied:

• The $R^2$ and $R^2_{adjusted}$ measures provide only relevant comparisons with previous models that have been estimated on the phenomenon under investigation. Thus, an $R^2_{adjusted}$ of 0.40 in one study may be considered “good” only if it represents an improvement over similar studies and the model provides new insights into the underlying data-generating process. Thus, it is possible to obtain an improvement in the $R^2$ or $R^2_{adjusted}$ value without gaining a greater understanding of the phenomenon being studied. It is only the combination of a comparable $R^2_{adjusted}$ value and a contribution to the fundamental understanding of the phenomenon that justifies the claim of improved modeling results.

• The absolute values of $R^2$ and $R^2_{adjusted}$ measures are not sufficient measures to judge the quality of a model. Thus, an $R^2$ of 0.20 from a model of a phenomenon with a high proportion of unexplained variation might represent a breakthrough in the current level of understanding, whereas an $R^2$ of 0.90 of another phenomenon might reveal no new insights or contributions. Thus, it is often better to explain a little of a lot of total variance rather than a lot of a little total variance.

• Relatively large values of $R^2$ and $R^2_{adjusted}$ can be caused by data artifacts. Small variation in the independent variables can result in inflated values. This is particularly troublesome if in practice the model is needed for predictions outside the range of the independent variables. Extreme outliers can also inflate $R^2$ and $R^2_{adjusted}$ values.

• The $R^2$ and $R^2_{adjusted}$ measures assume a linear relation between the response and predictor variables, and can give grossly misleading results if the relation is nonlinear. In some cases $R^2$ could be relatively large and suggest a good linear fit, when the true relationships are curvilinear. In other cases $R^2$ could suggest a very poor fit when in fact the relationships are nonlinear. This emphasizes the need to plot, examine, and become familiar with data prior to statistical modeling.

• The $R^2$ and $R^2_{adjusted}$ values are bound by 0 and 1 only when an intercept term is included in the regression model. When the intercept is forced through zero, the $R^2$ and $R^2_{adjusted}$ values can exceed the value 1 and more caution needs to be used when interpreting them.

Another measure for assessing model fit is the generalized F test. This approach is a general and flexible approach for testing the statistical difference between competing models. First, a full or unrestricted model is estimated. This could be a model with ten independent variables, for example, or the “best” model obtained to date. The full model is fit using the method of least squares, and SSE, the sum of square errors for the full model, is obtained. For convenience, the sum of square errors for the full model is denoted as

$$SSE_F = \sum_{i=1}^{n} \big(Y_i - \hat{Y}_{F_i}\big)^2,$$

where the predicted value of Y is based on the full model.

A reduced model is then estimated, which represents a viable competitor to the full model with fewer variables. For example, this could be a model with nine independent variables, or a model with no independent variables, leaving only the Y-intercept term B0. The sum of squared errors is estimated for the competing or reduced model, where

$$SSE_R = \sum_{i=1}^{n} \big(Y_i - \hat{Y}_{R_i}\big)^2.$$

The logic of the F test is to compare the values of $SSE_R$ and $SSE_F$. Recall from the discussion of R-squared that SSE can only be reduced by adding


variables into the model; thus $SSE_R \geq SSE_F$. If these two sums of square errors are the same, then the full model has done nothing to improve the fit of the model; there is just as much “lack of fit” between observed and predicted observations as with the reduced model, so the reduced model is superior. Conversely, if $SSE_F$ is considerably smaller than $SSE_R$, then the additional variables add value to the regression by adding sufficient additional explanatory power. In the generalized F test the null and alternative hypotheses are as follows:

$H_0$: all $\beta_k = 0$

$H_a$: all $\beta_k \neq 0$.

In this test the null hypothesis is that all of the additional parameters in the full model (compared to the reduced model), $\beta_k$, are equal to zero.

When the null hypothesis is true (making the F test a conditional probability), the F* statistic is approximately F distributed, and is given by

$$F^* = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F} \approx F(1 - \alpha;\, df_R - df_F,\, df_F), \qquad (3.49)$$

where $df_F = n - p_F$ and $df_R = n - p_R$ (n is the number of observations and p is the number of parameters). To calculate this test statistic, the sums of square errors for the two models are first computed; then the F* statistic is compared to the F distribution with appropriate numerator and denominator degrees of freedom (see Appendix C, Table C.4 for the F distribution). Specifically,

$$\text{If } F^* \leq F(1 - \alpha;\, df_R - df_F,\, df_F), \text{ then conclude } H_0;$$
$$\text{If } F^* > F(1 - \alpha;\, df_R - df_F,\, df_F), \text{ then conclude } H_a. \qquad (3.50)$$

The generalized F test is very useful for comparing models of different sizes. When the difference in size between two models is one variable (for example, a model with six vs. five independent variables), the F test yields an equivalent result to the t test for that variable. Thus, the F test is most useful for comparing models that differ by more than one independent variable.

A modification of this test, the Chow test, can be used to test temporal stability of model parameters. For two time periods three models must be estimated: a regression model using the complete data; a regression model using data from period 1; and a regression model using data from period 2. Equation 3.49 would then be applied with $SSE_R$ computed from the all-data model (sample size $n_T$) and $SSE_F$ computed as the sum of SSEs from the two


time-period models (sample sizes $n_1$ and $n_2$, where $n_1 + n_2 = n_T$). For this, $df_F = n_1 + n_2 - 2p_F$ (where $p_F$ is the number of variables in each of the time-period models, multiplied by 2 because the number of variables is the same for both time-period models) and $df_R = n_T - p_R$ (where $p_R$ is the number of variables in the all-data model). The F statistic is again $F(1 - \alpha;\, df_R - df_F,\, df_F)$.
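A minimal sketch of the test computation (a generic hypothetical helper, not from the text):

    from scipy import stats

    def generalized_f_test(sse_r, sse_f, df_r, df_f, alpha=0.05):
        """Compare reduced and full models via Equations 3.49 and 3.50."""
        f_star = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
        f_crit = stats.f.ppf(1 - alpha, df_r - df_f, df_f)
        return f_star, f_crit, f_star > f_crit   # True -> conclude H_a

    # e.g., generalized_f_test(sse_r=5.2e9, sse_f=4.1e9, df_r=118, df_f=113)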

Example 3.8

Continuing with Example 3.7, GOF statistics are compared. Table 3.7 shows goodness-of-fit statistics for the model. The R-squared value is 0.8947. This model suggests that if the model specification is correct (for example, linear, homoscedastic, etc.), then the collection of independent variables accounts for about 89% of the uncertainty or variation in AADT. This is considered to be a good result because previous studies have achieved R-squared values of around 70%. It is only with respect to other models on the same phenomenon that R-squared comparisons are meaningful.

The standard F statistic provided by most statistical analysis software is a test of the naïve regression model (the model with the intercept term only) vs. the specified model. This can typically be verified by checking the numerator and denominator degrees of freedom. The numerator degrees of freedom is the reduced minus full model degrees of freedom, or $(n_R - p_R) - (n_F - p_F) = (119 - 1) - (119 - 6) = 5$, and the denominator degrees of freedom is the degrees of freedom for the full model, or $(119 - 6) = 113$. The value of the F statistic, or F*, is 192.1. Although not reported by the statistics package, F* is much larger than the critical value of F, and is in fact at the very tail end of the F distribution, where probability is less than 0.01%. Thus, if the naïve model is correct, then the probability of observing data with all specified model parameters equal to zero is less than 0.01%. This test result provides objective evidence that the specified model is better than a naïve model.

3.7 Multicollinearity in the Regression

Multicollinearity among independent variables is a problem often encountered with observational data. In fact, the only way to ensure that multicollinearity is minimized is to conduct a carefully designed experiment. Although designed experiments should be conducted when possible, often an experiment is impossible due to the lack of ability to control many variables of interest.

In the presence of multicollinearity or intercorrelation between variables, numerous reasonable and well-intentioned research-oriented questions are difficult to answer (Neter et al., 1996), such as the following. What is the relative importance of various factors (predictors) in the regression? What

F df df dfR F F1 ; ,

© 2003 by CRC Press LLC

Page 120: Data Analysis Book

is the magnitude of the effect of a given predictor variable on the response?Can some predictor variables be dropped from the model without predic-tive or explanatory loss? Should omitted predictor variables be includedin the model?

Multicollinearity enters the regression when independent variables are correlated with each other or when independent variables are correlated with omitted variables that are related to the dependent variable. Perfect multicollinearity between predictor variables represents a singularity problem (see Appendix A). In these unusual cases, one variable provides no additional information over and above its perfectly correlated counterpart. Often perfect multicollinearity is a result of a data coding error, such as coding indicator variables for all levels of a nominal scale variable. Perfect multicollinearity, typically resulting in error messages in statistical analysis packages, means that one independent variable is redundant.

Near or high multicollinearity is a more difficult and common problem, and produces the following effects:

1. A consequence of near multicollinearity is high standard errors of the estimated parameters of the collinear variables. Near multicollinearity does not prevent least squares from obtaining a best fit to the data, nor does it affect inferences on mean responses or new observations, as long as inferences are made within the observation limits specified by the joint distribution of the correlated variables. For example, if household income and value of vehicle are strongly positively correlated in a model of total daily vehicle trips, then making an inference for a household with high income and a low-valued vehicle may be difficult.

2. Estimated regression parameters have large sampling variability when predictor variables are highly correlated. As a result, estimated parameters vary widely from one sample to the next, perhaps resulting in counterintuitive signs. In addition, adding or deleting a correlated variable in the regression can change the regression parameters.

3. The standard interpretation of a regression parameter does not apply: one cannot simply adjust the value of an independent variable by one unit to assess its effect on the response, because the value of the correlated independent variable will change also.

The most common remedial measures are now described. The interested reader is directed to other references to explore in greater detail remedial measures for dealing with multicollinearity (e.g., Neter et al., 1996; Greene, 2000; Myers, 1990).

1. Often pairwise correlation between variables is used to detect multicollinearity. It is possible, however, to have serious multicollinearity between groups of independent variables; for example, X3 may be highly correlated with a linear combination of X1 and X2. The variance inflation factor (VIF) is often used to detect serious multicollinearity (a VIF computation is sketched after this list).

2. Ridge regression, a method that produces biased but more efficient estimators, can be used to avoid the inefficient parameter estimates obtained under multicollinearity.

3. Often when variables have serious multicollinearity, one of the offending variables is removed from the model. This is justified when there are theoretical grounds for removing this variable. For example, an independent variable that has an associative rather than causal relationship with the response might be omitted. Another justification might be that only one of the variables can be collected in practice and used for future predictions. Removing a variable, however, also removes any unique effects this variable has on the response.

4. A common response to multicollinearity is to do nothing. In this case the effects of leaving correlated variables in the model and the limitations involved with interpreting the model should be recognized and documented.

5. The ideal solution for multicollinearity is to control the levels of the suspected multicollinear variables during the study design phase. Although this solution requires an understanding of the threats of multicollinearity prior to data collection, it is the best solution. For example, a suspected high correlation between the variables household income and value of vehicle might be remedied by stratifying households to ensure that all combinations of low, medium, and high household income and low, medium, and high value of vehicle are obtained.
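The VIF mentioned in item 1 is available in common software; a minimal sketch, using statsmodels and placeholder data (the design matrix X and its contents are hypothetical):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # X: n-by-k design matrix including the constant term
    X = sm.add_constant(np.random.rand(100, 3))  # placeholder data for illustration
    vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
    # Values above roughly 10 are a common rule-of-thumb sign of serious multicollinearity.
    print(vifs)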

Multicollinearity is perhaps the most common negative consequence of working with observational data in regression. Good planning in the study design is the best remedy for multicollinearity. Other remedies are performed post hoc, and range from doing nothing to applying ridge regression techniques.

Finally, it is important to keep in mind that highly correlated variables may not necessarily cause estimation problems. It is quite possible that two highly correlated variables produce well-defined parameter estimates (estimates with low standard errors). The problem arises when the standard errors of one or both of the correlated variables are high.

3.8 Regression Model-Building Strategies

The following model-building strategies are frequently used in linear regression analysis.

© 2003 by CRC Press LLC

Page 122: Data Analysis Book

3.8.1 Stepwise Regression

Stepwise regression is a procedure that relies on a user-selected criterion, such as R-squared, adjusted R-squared, F-ratio, or other GOF measures, to select a "best" model among competing models generated by the procedure. Stepwise regression procedures can be either backward or forward. Backward stepwise regression starts by comparing models with large numbers of independent variables (as many as, say, 30 or 40) and sequentially removing one independent variable at each step. The variable removed is the one that contributes least to the GOF criterion. The procedure iterates until a regression model is obtained in the final step. The user can then compare "best" models of different sizes in the computer printout. As one would expect, forward stepwise begins with a simple regression model and sequentially grows the regression by adding the variable with the largest contribution to the GOF criterion.

There are several problems with stepwise regression. First, the procedure is mechanical, resulting in perhaps many models that are not appealing theoretically but simply produce superior GOF statistics. Second, the number of model fits is limited by the user-defined independent variables. For example, if two-way interactions are not "entered" into the procedure, they will not be considered. Finally, in some programs once a variable is entered into the regression it cannot be removed in subsequently estimated models, despite the real possibility that an overall improvement in GOF is possible only if it is removed.

3.8.2 Best Subsets Regression

Best subsets regression is a procedure that fits all combinations (subsets) of the p candidate independent variables, and can grow or shrink models similar to stepwise procedures. Best subsets procedures take longer than stepwise methods for a given set of variables. However, computing capability is rarely a problem except for extremely large data sets with many variables. So, in the majority of applications best subsets procedures are preferred to stepwise ones. In all other respects best subsets and stepwise procedures are similar. The same criticisms apply to best subsets — it is a mechanical process that provides little insight in the way of new independent variables and may produce many models that lack theoretical appeal.
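A minimal best-subsets sketch, assuming a hypothetical predictor matrix X and response y and using adjusted R-squared as the GOF criterion:

    import itertools
    import numpy as np
    import statsmodels.api as sm

    def best_subsets(y, X):
        # Fit every nonempty subset of the columns of X; return the best
        # subset by adjusted R-squared.
        p = X.shape[1]
        best = None
        for k in range(1, p + 1):
            for cols in itertools.combinations(range(p), k):
                res = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
                if best is None or res.rsquared_adj > best[0]:
                    best = (res.rsquared_adj, cols)
        return best  # (adjusted R-squared, column indices of the best subset)

Because the number of subsets grows as 2^p, this brute-force search is practical only for a modest number of candidate variables.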

3.8.3 Iteratively Specified Tree-Based Regression

Iteratively specified tree-based regression is a procedure that enables the development of statistical models in an iterative way to help identify functional forms of independent variables (see Washington, 2000b). Because the procedure is not offered in neatly packaged statistical analysis software, it is more difficult to apply. However, some of the limitations of stepwise and best subsets procedures are overcome. The procedure involves iterating back and forth between regression trees and ordinary least squares to obtain a theoretically appealing model specification. It is particularly useful when there are many noncontinuous independent variables that could potentially enter the regression model. For the theory of regression trees and transportation applications of this methodology, the reader should consult Breiman et al. (1984), Washington (2000b), Wolf et al. (1998), Washington and Wolf (1997), and Washington et al. (1997).

3.9 Logistic Regression

The regression models described in this chapter have been derived with the assumption that the dependent variable is continuous. However, there are cases when the dependent variable is discrete — count data, a series of qualitative rankings, and categorical choices are examples. For these cases, a wide variety of techniques are used and are the subject of later chapters. Logistic regression, which is useful for predicting a binary (or dichotomous) dependent variable as a function of predictor variables, is presented in this section. An example of logistic regression applied to grouped data is found in White and Washington (2001).

The goal of logistic regression, much like linear regression, is to identify the best-fitting model that describes the relationship between a binary dependent variable and a set of independent or explanatory variables. In contrast to linear regression, the dependent variable is the population proportion or probability (P) that the resulting outcome is equal to 1. It is important to note that the parameters obtained for the independent variables can be used to estimate odds ratios for each of the independent variables in the model.

As an example, consider a model of a driver's stated propensity to exceed the speed limit on a highway as a function of various exogenous factors (drivers have responded to a questionnaire where 0 indicates that they do not exceed the speed limit and 1 that they do). Given a sample of drivers who have also provided socioeconomic information, prior accident involvement, annual vehicle-miles driven, and perceived police presence, a model with these variables is estimated. The resulting model could be used to derive estimates of the odds ratios for each factor, for example, to assess whether males are more likely to exceed the speed limit.

The logit is the LN (to base e) of the odds, or likelihood ratio, that the dependent variable is 1, such that

Y = logit(P) = LN[P/(1 − P)] = B_0 + Σ_i B_i X_i, (3.51)

© 2003 by CRC Press LLC

Page 124: Data Analysis Book

where B_0 is the model constant and the B_i are the parameter estimates for the independent variables X_i, i = 1, ..., n (the set of independent variables).

The probability P ranges from 0 to 1, while the natural logarithm LN[P/(1 − P)] ranges from negative infinity to positive infinity. The logistic regression model accounts for a curvilinear relationship between the binary choice Y and the predictor variables X_i, which can be continuous or discrete. The logistic regression curve is approximately linear in the middle range and logarithmic at extreme values. A simple transformation of Equation 3.51 yields

P_i/(1 − P_i) = EXP(B_0 + Σ_i B_i X_i). (3.52)

The fundamental equation for the logistic regression shows that when the value of an independent variable increases by one unit, and all other variables are held constant, the new probability ratio [P*_i/(1 − P*_i)] is given as

P*_i/(1 − P*_i) = EXP[B_0 + Σ_i B_i(X_i + 1)] = [P_i/(1 − P_i)] · EXP(B_i). (3.53)

Thus when independent variable X_i increases by one unit, with all other factors remaining constant, the odds [P_i/(1 − P_i)] increase by a factor EXP(B_i). The factor EXP(B_i) is called the odds ratio (OR) and ranges from zero to positive infinity. It indicates the relative amount by which the odds of the outcome increase (OR > 1) or decrease (OR < 1) when the value of the corresponding independent variable increases by 1 unit.
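A minimal sketch of fitting such a model and recovering odds ratios, assuming hypothetical 0/1 responses y and a predictor matrix X; statsmodels' Logit is used here as one common implementation:

    import numpy as np
    import statsmodels.api as sm

    # y: 0/1 indicator (e.g., stated propensity to exceed the speed limit)
    # X: explanatory variables (socioeconomic factors, exposure, etc.); placeholder data below
    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(500, 2)))
    p_true = 1.0 / (1.0 + np.exp(-(X @ [0.5, 0.8, -0.4])))
    y = (rng.random(500) < p_true).astype(int)

    res = sm.Logit(y, X).fit(disp=0)
    odds_ratios = np.exp(res.params)  # EXP(B_i): odds ratio per one-unit increase
    print(odds_ratios)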

3.10 Lags and Lag Structure

Regression models estimated using time-series data frequently include a time lag variable in the regression. For example, assume that a substantial period of time elapses between a change in the price of airline tickets and the effect on passenger demand. If the time between changes in the independent variables and their effects on the dependent variable is sufficiently long, then a lagged explanatory variable may be included in the model. A general form of a regression model with lagged independent variables is given as



Y_t = a + B_0 X_t + B_1 X_{t−1} + B_2 X_{t−2} + ... + ε_t. (3.54)

When the lagged variables have long-term effects, then infinite lag models, whose effects fade slowly over time, are used. When changes in the independent variables tend to have short-term effects on the dependent variable, then finite lag models are developed. An example of an infinite lag model is the geometric lag model, where the weights of the lagged variables are all positive and decline geometrically over time. This model implies that the most recent observations receive the largest weight, and the influence of observations fades uniformly with the passage of time. The general form of this model is

Y_t = a + B(X_t + wX_{t−1} + w²X_{t−2} + ...) + ε_t = a + B Σ_{s=0}^{∞} w^s X_{t−s} + ε_t, 0 < w < 1. (3.55)

The weights of the explanatory variables in the geometric lag model never become zero, but decrease in such a way that after a reasonable passage of time the effects become negligible. The form described in Equation 3.55 yields some difficulties regarding the large number of regressors. To simplify the model, if lagged observations are limited to one period (t − 1), then the model becomes

Y_t = a(1 − w) + wY_{t−1} + BX_t + u_t, (3.56)

where u_t = ε_t − wε_{t−1}. The adaptive expectations model, another popular specification, assumes that changes in the dependent variable Y are related to changes in the expected level of the explanatory independent variable X. The form of this model is

Y_t = a + B X*_t + ε_t, (3.57)

where X*_t denotes the desired or expected level of X and is given by

X*_t = λX_t + (1 − λ)X*_{t−1}, 0 < λ ≤ 1, (3.58)

which indicates that the expected level of X is a weighted average of its current and past observations. Chapter 7 presents a more detailed discussion of time-series models.
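Equation 3.56 can be estimated by regressing Y_t on its own lag and the current X_t; a minimal sketch, assuming hypothetical series y and x stored as NumPy arrays:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = np.cumsum(rng.normal(size=200)) * 0.01 + x  # placeholder series

    # Regress Y_t on (1, Y_{t-1}, X_t), as in Equation 3.56
    Y_t, Y_lag, X_t = y[1:], y[:-1], x[1:]
    res = sm.OLS(Y_t, sm.add_constant(np.column_stack([Y_lag, X_t]))).fit()
    w_hat = res.params[1]                 # coefficient on Y_{t-1} estimates w
    b_hat = res.params[2]                 # coefficient on X_t estimates B
    a_hat = res.params[0] / (1 - w_hat)   # since the intercept is a(1 - w)
    print(a_hat, w_hat, b_hat)

Note that, as discussed in Section 4.3, the transformed disturbance u_t is correlated with Y_{t−1}, so OLS on this form is not consistent in general; the sketch only illustrates the algebraic structure.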



3.11 Investigating Causality in the Regression

Causality is a common concern in many statistical investigations. For example, an issue of considerable concern in public transit is evaluating the effect of public transit subsidies on the performance of a transit system. There may be an issue with reverse causality, which suggests that system performance is not only affected by public transit subsidies but also affects the level of public transit subsidies. If true and left unaccounted for, then endogeneity bias may influence estimated parameters in the resulting statistical model.

To investigate causality — whether evidence supports that changes in one variable cause changes in another — the procedure of Granger (1969) can be followed. Consider two time-series variables X_t and Y_t. Granger causality tests are conducted by running two sets of regressions on these variables. First, Y_t is regressed on lagged values of X_t and lagged values of Y_t in an unrestricted regression model such that

Y_t = a + Σ_{i=1}^{m} b_i Y_{t−i} + Σ_{i=1}^{m} c_i X_{t−i} + ε_t, (3.59)

where m is the number of lagged variables. Next, Y_t is regressed only on lagged values of Y_t in the restricted regression model such that

Y_t = a + Σ_{i=1}^{m} b_i Y_{t−i} + ε_t. (3.60)

The two previous regression models are then estimated with the dependent and explanatory variables reversed. The entire procedure is based on the simple hypothesis that, if X_t causes Y_t, then changes in X_t should precede changes in Y_t. But to claim that X_t indeed causes Y_t, two conditions must be met. First, it should be possible to predict Y_t if X_t is known. This implies that in a regression model of Y_t against past values of Y_t, the addition of past values of X_t as independent explanatory variables should significantly improve the model. Second, the reverse cannot be true: Y_t should not help predict the values of X_t, because if both X_t and Y_t help predict one another, then it is probable that omitted variables "cause" both X and Y.

To summarize, X_t is said to "Granger cause Y_t" if two conditions hold: (1) in the regression of Y_t on lagged values of Y_t and X_t, the null hypothesis that the parameter estimates on the lagged values of X_t are jointly 0 is rejected; and (2) in the regression of X_t on lagged values of X_t and Y_t, the null hypothesis that the parameter estimates on the lagged values of Y_t are jointly 0 is not rejected. It is important to note that, to ensure objectivity of the results, tests have to be run using different values of m (the number of lags in the regressions).
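statsmodels provides a packaged version of this procedure; a minimal sketch, assuming two hypothetical series x and y and testing several lag orders m, per the advice above:

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(2)
    x = rng.normal(size=300)
    y = np.r_[0.0, 0.6 * x[:-1]] + 0.5 * rng.normal(size=300)  # y depends on lagged x

    # Column order matters: the function tests whether the SECOND column
    # Granger-causes the FIRST. maxlag=4 reports results for m = 1..4.
    results = grangercausalitytests(np.column_stack([y, x]), maxlag=4)

To apply both conditions, the test should also be run with the columns reversed.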

3.12 Limited Dependent Variable Models

Limited dependent variables pose problems in the usual linear regression estimation framework. Limited variables are measured on a continuous scale, such as interval or ratio, but are characterized by the fact that their observed values do not cover their entire range of possible values. There are three types of limited outcome variables.

Censored variables are observations where measurements are clustered at a lower threshold (left censored), an upper threshold (right censored), or both. Censored dependent variables can be analyzed using tobit regression, named in recognition of James Tobin, the model's developer (Tobin, 1958). Cohen (1959, 1961) offers a maximum likelihood method for estimating the mean of censored samples, and Dhrymes (1986) offers an extensive exposition and analysis of limited dependent variable models and estimation processes. The left-censored regression model is given as

Y_i = a + βX_i + ε_i, if a + βX_i + ε_i > 0
Y_i = 0, otherwise, (3.61)

where ε_i ~ N(0, σ²).

For this model the standard assumptions for linear regression listed in Table 3.1 also apply. Careful examination of the model structure reveals that estimation of this model using ordinary least squares leads to serious specification errors, because in Equation 3.61 the zero observations are not generated by the same process that generates the positive observations. Ordinary least squares estimation yields biased and inconsistent parameter estimates, because unbiasedness and consistency require that E(ε_i) = 0. In Equation 3.61, however, the cases with zero-valued observations yield ε_i = −a − βX_i, which violates the requirement that E(ε_i) = 0.

There are three main reasons for not simply conducting an analysis on all nonzero observations: (1) it is apparent that by focusing solely on the nonzero observations some potentially useful information is ignored; (2) ignoring some sample elements would affect the degrees of freedom and the t and F statistics; and (3) a simple procedure for obtaining efficient and/or asymptotically consistent estimators by confining the analysis to the positive subsample is lacking. As such, the most widely used approach is to analyze the entire sample by using maximum likelihood estimation (for details refer to Dhrymes, 1986). Similar to censored variables, truncated dependent variables can be analyzed using regression techniques structured specifically for truncated regression problems.
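The bias from applying OLS to left-censored data is easy to demonstrate by simulation; a minimal sketch, with all names and parameter values hypothetical:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 2000
    x = rng.normal(size=n)
    y_star = 1.0 + 2.0 * x + rng.normal(size=n)  # latent (uncensored) outcome
    y = np.maximum(y_star, 0.0)                  # left-censor at zero, as in Eq. 3.61

    ols_censored = sm.OLS(y, sm.add_constant(x)).fit()
    ols_latent = sm.OLS(y_star, sm.add_constant(x)).fit()
    # The slope from the censored regression is attenuated relative to the true 2.0.
    print(ols_censored.params, ols_latent.params)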



3.13 Box–Cox Regression

The Box–Cox transformation is a method for generalizing the linear regression model and for selecting an appropriate specification form for a given model (see Appendix D for a general discussion of variable transformations). The procedure involves transforming a variable from X_i to X*_i such that

X*_i = (X_i^λ − 1)/λ, (3.62)

where λ is an estimated parameter. The basic form of this transformation is

Y_i = β_0 + β_1 X*_i + ε_i. (3.63)

For a given value of λ, a linear regression can be estimated using ordinary least squares estimation. Table 3.8 lists some of the common parameter values and the corresponding functional specifications used in ordinary least squares regression. The equation becomes

Y_i = a + Σ_{k=1}^{K} β_k X_{ki}^(λ) + ε_i. (3.64)

The estimated parameter λ need not be restricted to values in the [0,1] interval, and it is possible to obtain a wide range of values, implying that the Box–Cox procedure is not limited to the set of linear, double-log, and semilogarithmic functional forms shown in Table 3.8. Common values for the parameter λ range between −2 and 2. Further, λ is generally assumed to be the same for all variables in the model because differences become computationally cumbersome. Finally, it should be noted that if λ is not known, the regression model in Equation 3.64 becomes nonlinear in parameters and nonlinear least squares estimation must be used.
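scipy provides a maximum-likelihood estimate of λ for a single positive-valued variable; a minimal sketch with hypothetical data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.lognormal(mean=0.0, sigma=0.7, size=500)  # positive-valued variable

    x_star, lam = stats.boxcox(x)  # transformed values and the MLE of lambda
    print(lam)                     # lambda near 0 suggests a log form, near 1 linear

Estimating λ jointly with the regression parameters, as in Equation 3.64, requires nonlinear least squares or maximum likelihood over a grid of λ values.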

TABLE 3.8

Common Parameter Values and Corresponding Functional Specifications

λ_y, λ_x Values | Functional Form

λ_y = λ_x = 1 | Linear regression (Y on X)
λ_y = λ_x = 0 | Double-log regression (LN(Y) on LN(X))
λ_y = 0, λ_x = 1 | Log-linear regression (LN(Y) on X)
λ_y = 1, λ_x = 0 | Linear-log regression (Y on LN(X))



3.14 Estimating Elasticities

It is frequently useful to estimate the responsiveness and sensitivity of a dependent variable Y with respect to changes in one or more independent variables X. Although the magnitudes of the regression parameters yield this information in terms of measurement units, it is sometimes more convenient to express the sensitivity in terms of the percentage change in the dependent variable resulting from a 1% change in an independent variable. This measure is the elasticity, and it is useful because it is dimensionless, unlike an estimated regression parameter, which depends on the units of measurement. The elasticity of the dependent variable Y with respect to an independent variable X_i is given as

e_{X_i} = (∂Y/∂X_i)(X_i/Y). (3.65)

Roughly speaking, a value e_i = 2.7 suggests that a 1% increase in X_i results in a 2.7% increase in Y. Table 3.9 summarizes the elasticity estimates for four commonly used functional forms in regression models.

TABLE 3.9

Elasticity Estimates for Various Functional Forms

Model | Elasticity

Linear | e_i = β_i (X_i/Y_i)
Log-linear | e_i = β_i X_i
Linear-log | e_i = β_i (1/Y_i)
Double log | e_i = β_i

Adapted from McCarthy, P.S., Transportation Economics Theory and Practice: A Case Study Approach. Blackwell, Boston, 2001.
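For the linear model the elasticity varies with the point of evaluation, so it is commonly reported at the sample means. A minimal sketch, with hypothetical data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    x = rng.uniform(1, 10, size=200)
    y = 3.0 + 1.5 * x + rng.normal(size=200)

    res = sm.OLS(y, sm.add_constant(x)).fit()
    beta = res.params[1]
    elasticity = beta * x.mean() / y.mean()  # e = beta * (X-bar / Y-bar), linear form
    print(elasticity)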



4 Violations of Regression Assumptions

A number of assumptions or requirements must be substantively met in order for the linear regression model parameters to be best linear unbiased estimators (BLUE); that is, parameters must be unbiased, asymptotically efficient, and consistent. The six main regression model assumptions are as follows: the disturbance terms have zero mean; the disturbance terms are normally distributed; the regressors and disturbance terms are not correlated; the disturbance terms are homoscedastic; the disturbance terms are not serially correlated (nonautocorrelated); and a linear-in-parameters relationship exists between the dependent and independent variable(s).

This chapter extends the practical understanding of linear regression model development and answers some questions that arise as a result of violating the previously listed assumptions. It answers questions such as the following. Why does it matter that a particular assumption has been violated? How can regression violations be detected? What are the substantive interpretations of violations of the assumptions? What remedial actions should be taken to improve the model specification?

4.1 Zero Mean of the Disturbances Assumption

The assumption that the disturbance term has a mean of zero is the least restrictive and "damaging" of the violations examined. A violation of this assumption can result from consistent positive or negative errors of measurement in the dependent variable (Y) of a regression model. Two distinct forms of this violation can be identified. The first and most commonly encountered violation occurs when E(ε_i) = μ ≠ 0. In this case μ can be subtracted from the ε_i to obtain the new disturbances ε*_i = ε_i − μ, which have zero mean. The regression model then becomes

Y_i = β*_0 + β_1 X_i + ε*_i, (4.1)

© 2003 by CRC Press LLC

Page 131: Data Analysis Book

where β*_0 = β_0 + μ. It is clear that only β*_0 and β_1 can be estimated, and not β_0 or μ, which suggests that β_0 and μ cannot be retrieved from β*_0 without additional assumptions. Despite this, Equation 4.1 satisfies the remaining assumptions, and as a result ordinary least squares (OLS) yields BLUE estimators for β*_0 and β_1. In practical terms this means that a nonzero mean for the disturbance terms affects only the intercept term and not the slope (β_1). If the regression model did not contain an intercept term, the estimated parameters would be biased and inconsistent.

The second form of the violation occurs when E(ε_i) = μ_i, suggesting that the means of the disturbances vary with each i. In this case the regression model of Equation 4.1 could be transformed by adding and subtracting μ_i. However, with this approach the intercept β*_0i = β_0 + μ_i would vary with each observation, resulting in more parameters than observations (there are n values of μ_i and one β_0 to be estimated using n observations). This model cannot be estimated unless the data consist of repeated observations, such as when analyzing panel data (discussed in Chapter 6).

4.2 Normality of the Disturbances Assumption

The violation of the assumption that the disturbance terms are approximately normally distributed is commonly the result of the existence of measurement errors in the variables and/or unobserved parameter variations. The hypothesis of normally distributed disturbance terms can be straightforwardly tested using many statistical software packages, employing either quantile-quantile plots of the disturbances or the chi-square and/or Kolmogorov–Smirnov tests, which provide inferential statistics on normality.

The property of normality results in OLS estimators that are minimum variance unbiased (MVU) and estimators that are identical to the ones obtained using maximum likelihood estimation (MLE). Furthermore, normality permits the derivation of the distribution of these estimators, which, in turn, permits t and F tests of hypotheses to be performed. A violation of the normality assumption therefore results in hypothesis testing problems and OLS estimates that cannot be shown to be efficient, since the Cramer–Rao bound cannot be determined without a specified distribution. The parameter estimates remain consistent, however, so the main consequence is the difficulty with hypothesis testing. To correct the problem of non-normally distributed disturbances, two alternative approaches can be used. First, if the sample size used for model estimation is large, then Theil (1978) suggests that the hypothesis of normality can be inferred asymptotically for the OLS estimates by relying on the central limit theorem (Theil's proof applies for cases of fixed X in panel data samples, zero mean, and constant variance). Second, alternative functional forms can be used for the regression model, such as the gamma regression model (Greene, 1990b) and the stochastic frontier regression model (Aigner et al., 1977).
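A minimal sketch of checking residual normality with the Kolmogorov–Smirnov test on standardized residuals; the fitted model and data below are hypothetical placeholders:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(6)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = X @ [1.0, 0.5, -0.3] + rng.normal(size=200)
    res = sm.OLS(y, X).fit()

    z = (res.resid - res.resid.mean()) / res.resid.std()  # standardized residuals
    ks_stat, p_value = stats.kstest(z, "norm")            # H0: residuals are normal
    print(ks_stat, p_value)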



4.3 Uncorrelatedness of Regressors and Disturbances Assumption

The assumption that the random regressors are contemporaneously uncorrelated with the disturbances is the most important assumption of all to meet. If this assumption is not met, not only are the parameter estimates biased and inconsistent, but the methods for detecting and dealing with violations of other assumptions will fail. To illustrate this, consider the slope estimator for the usual simple linear regression model:

β̂_OLS = β + Σ_{i=1}^{n} (X_i − X̄)ε_i / Σ_{i=1}^{n} (X_i − X̄)². (4.2)

When β̂_OLS is an unbiased estimator of β, the expectation of the ratio on the right-hand side of Equation 4.2 is equal to zero. When there is no correlation between X_i and ε_i, then

E[Σ_{i=1}^{n} (X_i − X̄)ε_i] = Σ_{i=1}^{n} (X_i − X̄)E(ε_i) = 0

(the first equality holds because COV(X_i, ε_i) = 0 and the second because E(ε_i) = 0). Thus, when COV(X_i, ε_i) ≠ 0, β̂_OLS is not an unbiased estimator of β.

A violation of this assumption usually arises in four ways. The first is when the explanatory variables of the regression model include a lagged dependent variable and its disturbances are serially correlated (autocorrelated). For example, consider the model:

Y_i = β_0 + β_1 X_i + β_2 Y_{i−1} + ε_i, ε_i = ρε_{i−1} + u_i, (4.3)

where the u_i are independent and identically distributed as N(0, σ²). The disturbance ε_i is correlated with the regressor Y_{i−1}; this happens because the disturbance has two components (ρε_{i−1} and u_i) and Y_{i−1} is dependent on one of them (ε_{i−1}), as shown in Equation 4.3. This model is frequently called a correlated error model, as both ε_i and Y_{i−1} contain the component ε_{i−1}. When this occurs, the OLS estimator is inconsistent in most cases (possible corrections for this problem are discussed in Section 4.5).

The second source is when the random regressors are nonstationary. The concept of stationarity is discussed in more detail in Chapter 7; however, with a nonstationary random regressor, the assumption that the sample variance of an independent variable X approaches a finite limit is violated. Under these circumstances, the OLS estimator may no longer be consistent and the usual large-sample approximate distribution is no longer appropriate. As a result, confidence intervals and hypothesis tests are no longer valid. To solve the problem, either the process generating the observations on the random regressor needs to be modeled or the regressor needs to be transformed to obtain stationarity (usually by differencing).

The third source is when the values of a number of variables are jointly determined (simultaneous equations model). Consider the following system:

Y_i = β_0 + β_1 X_i + ε_i
X_i = γ_0 + γ_1 Y_i + γ_2 Z_i + δ_i. (4.4)

In this model Y_i and X_i are jointly determined by the two equations, whereas Z_i is exogenously determined. The OLS estimation of the first equation is inconsistent because X_i and ε_i are correlated (Y_i depends on ε_i through the first equation, X_i depends on Y_i through the second equation, and from this it follows that X_i depends on ε_i; hence, they are correlated). Modeling systems of simultaneous equations is discussed in Chapter 5.

The fourth source is when one or more of the variables are measured with error (measurement error). Consider the variables X*_i and Y*_i that are measured with error, with X_i and Y_i the true, yet unobservable, variables. Then,

X*_i = X_i + u_i
Y*_i = Y_i + v_i, (4.5)

where u_i and v_i are random disturbances capturing the measurement errors for the ith observation. For these two disturbance terms the usual assumptions are valid:

E(u_i) = E(v_i) = 0
VAR(u_i) = E(u_i²) = σ_u², VAR(v_i) = E(v_i²) = σ_v²
COV(u_i, u_j) = E(u_i u_j) = 0, COV(v_i, v_j) = E(v_i v_j) = 0, for i ≠ j
COV(u_i, v_j) = E(u_i v_j) = 0.

The problem with measuring the variables with error arises when trying to estimate a model of the following form:



Y_i = β_0 + β_1 X_i + ε_i. (4.6)

When X_i and Y_i are unobservable, the following model is estimated:

Y*_i = β_0 + β_1 X*_i + ε*_i, where ε*_i = ε_i + v_i − β_1 u_i. (4.7)

As Griffiths et al. (1993) show, E(ε*_i) = 0 and VAR(ε*_i) = σ² + σ_v² + β_1²σ_u², which suggests that ε*_i has zero mean and constant variance. However, ε*_i and X*_i are correlated, resulting in the OLS estimators being biased and inconsistent. To show this, the covariance between ε*_i and X*_i is estimated as

COV(X*_i, ε*_i) = E(X*_i ε*_i) − E(X*_i)E(ε*_i) = E[(X_i + u_i)(ε_i + v_i − β_1 u_i)] = −β_1 σ_u² ≠ 0.

Further, Griffiths et al. (1993) show that the inconsistency of the OLS estimator for the slope is

plim β̂_OLS = β_1 / (1 + σ_u²/σ_x²). (4.8)

This finding suggests that the inconsistency of the estimator depends on the magnitude of the variance of the measurement error relative to the variance of X_i. In addition to the case when both X_i and Y_i are measured with error, it is possible that only one of the two variables (X_i or Y_i) is measured with error. In the first case, measurement error in X_i only, the parameters are once again biased and inconsistent. In the second case, measurement error in Y_i only, the slope parameter will be unbiased and consistent but with an increased error variance. The most common technique for dealing with the problem of measurement error is to use instrumental variables estimation. This is a common technique for estimating systems of simultaneous equations and is discussed in Chapter 5.

4.4 Homoscedasticity of the Disturbances Assumption

The assumption that the disturbances are homoscedastic — constant variance across observations — arises in many practical applications, particularly when cross-sectional data are analyzed. (As a grammatical note, the correct spelling of this violation is heteroskedasticity, not heteroscedasticity; but the latter is so well established in the literature that the term rarely appears with a k.)



For example, when analyzing household mobility patterns, there is often greater variation in mobility among high-income families than low-income families, possibly due to the greater flexibility in traveling allowed by higher incomes. Figure 4.1 (left) shows that as a variable X (income) increases, the mean level of Y (mobility) also increases, but the variance around the mean value remains the same at all levels of X (homoscedasticity). On the other hand, as Figure 4.1 (right) shows, although the average level of Y increases as X increases, its variance does not remain the same for all levels of X (heteroscedasticity).

Heteroscedasticity, or unequal variance across observations, does not typically occur in time-series data because changes in the variables of the model are likely to be of similar magnitude over time. This is not always true, however. Heteroscedasticity can be observed in time-series data and is analyzed using the so-called ARCH (autoregressive conditionally heteroscedastic) models that are briefly examined in Chapter 7 (see Maddala, 1988, for more details on these models).

To formalize, heteroscedasticity implies that the disturbances have a nonconstant variance, that is, E(ε_i²) = σ_i², i = 1, 2, ..., n. For the linear regression model, β̂_OLS as given in Equation 4.2 is still unbiased and consistent because these properties do not depend on the assumption of constant variance. However, the variance of β̂_OLS is now different since

VAR(β̂_OLS) = VAR[Σ_{i=1}^{n}(X_i − X̄)ε_i / Σ_{i=1}^{n}(X_i − X̄)²] = Σ_{i=1}^{n}(X_i − X̄)²σ_i² / [Σ_{i=1}^{n}(X_i − X̄)²]², (4.9)

where the second equality holds because the VAR(ε_i) is now σ_i². Note that if σ_i² = σ², this reverts back to

VAR(β̂_OLS) = σ² / Σ_{i=1}^{n}(X_i − X̄)²,

the usual formula for VAR(β̂_OLS) under homoscedasticity. Further, it can be shown that E(s²) will involve all of the σ_i² and not a single common σ²

FIGURE 4.1 Homoscedasticity (left); heteroscedasticity (right).



(Baltagi, 1993). This implies that when a software package reports the VAR(β̂_OLS) as s²/Σ_{i=1}^{n}(X_i − X̄)² for a regression problem, it is committing two errors: it is not using the correct formula for the variance (Equation 4.9), and it is using s² to estimate a common σ² when the σ_i² are different. The magnitude of the bias resulting from this error depends on the heteroscedasticity and the regressor. Under heteroscedasticity, OLS estimates lose efficiency and are no longer BLUE. However, they are still unbiased and consistent. In addition, the standard error of the estimated parameters is biased, and any inference based on the estimated variances, including the reported t and F statistics, is misleading.

4.4.1 Detecting Heteroscedasticity

When the disturbances ε_i in a regression model have constant variance σ² (homoscedastic), the residual plot of ε̂_i vs. any one of the explanatory variables X should yield disturbances that appear scattered randomly around the zero line, with no differences in the amount of variation in the residuals regardless of the value of X (Figure 4.2, left). If the residuals are more spread out for large values of X than for small ones, then the assumption of constant variance may be violated (Figure 4.2, right). Conversely, variance could be smaller for high values of X under heteroscedasticity.

FIGURE 4.2 Homoscedastic disturbances (left); heteroscedastic disturbances (right).

The intuitive and rather straightforward graphical approach for detecting heteroscedasticity has been formalized into a statistical test by Park (1966). The Park test suggests that if heteroscedasticity is present, the heteroscedastic variance σ_i² may be systematically related to one or more of the explanatory variables. Consider, for example, a two-variable regression model of Y on X; after the model is fitted, the following regression is run:

LN(ε̂_i²) = β_0 + β_1 LN(X_i) + u_i, (4.10)

where u_i is a disturbance term (the ε̂_i are obtained from the original regression). The Park test involves testing the null hypothesis of homoscedasticity (β_1 = 0). If a statistically significant relationship exists between LN(ε̂_i²) and LN(X_i), the null hypothesis is rejected and remedial measures need to be taken. If the null hypothesis is not rejected, then the constant (β_0) from Equation 4.10 can be interpreted as giving the value of LN(σ²), the log of the homoscedastic variance.

Another common test for heteroscedasticity is the Goldfeld–Quandt test (or split-sample regressions of Goldfeld and Quandt, 1966). This test investigates the null hypothesis of homoscedasticity against the alternative hypothesis that σ_i² = cX_i². The Goldfeld–Quandt test proceeds by estimating two OLS regressions. One is estimated using data hypothesized to be associated with low-variance errors, and the other is estimated using data from high-variance errors (typically the middle 20% of the data set is excluded). The null hypothesis of homoscedasticity cannot be rejected if the residual variances for the two regressions are approximately equal, whereas, if they are not equal, the null hypothesis is rejected. The Goldfeld–Quandt test statistic is

GQ = SSE_high variance / SSE_low variance ~ F[(n − d − 2k)/2, (n − d − 2k)/2],

where SSE is the sum of squared errors, n is the total sample size, d is the number of observations excluded, and k is the number of independent variables in the model.

There are numerous other tests for investigating heteroscedasticity, including the Breusch and Pagan (1979) test, which is an alternative to the Goldfeld–Quandt test when an ordering of the observations according to increasing variance is not simple. A further option is the White (1980) test, which does not depend on the normality assumption for the disturbances, as do the Goldfeld–Quandt and Breusch–Pagan tests.
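Several of these tests are packaged in statsmodels; a minimal sketch on hypothetical data in which the error spread grows with x:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan, het_goldfeldquandt

    rng = np.random.default_rng(7)
    x = rng.uniform(1, 10, size=300)
    y = 2.0 + 0.5 * x + x * rng.normal(size=300)
    X = sm.add_constant(x)
    res = sm.OLS(y, X).fit()

    lm_stat, lm_p, f_stat, f_p = het_breuschpagan(res.resid, X)
    gq_f, gq_p, _ = het_goldfeldquandt(y, X, drop=0.2)  # drop middle 20% of sample
    print(lm_p, gq_p)  # small p-values point to heteroscedasticity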

4.4.2 Correcting for Heteroscedasticity

When heteroscedasticity is detected, two different approaches can be taken to make the σ_i² terms approximately equal: the weighted least squares approach, and the variance stabilizing transformations approach, which transforms the dependent variable in a way that removes heteroscedasticity. This second approach is applicable when the variance of Y_i is a function of its mean (see Appendix D for more information on variable transformations).



The weighted least squares (WLS) approach to addressing heteroscedasticity in a regression model is a common procedure. Suppose, for example, that VAR(ε_i) = σ_i² = c_i²σ², where the c_i are known constants. Such heteroscedasticity models, in which the variance of the disturbances is assumed to be proportional to one of the regressors raised to a power, are extensively reviewed by Geary (1966), Goldfeld and Quandt (1972), Kmenta (1971), Lancaster (1968), Park (1966), and Harvey (1976). Constancy of variance is achieved by dividing both sides of the regression model

Y_i = β_0 + β_1 X_{i1} + ... + β_{k−1} X_{i,k−1} + ε_i, i = 1, ..., n, (4.11)

by c_i, yielding

Y_i/c_i = β_0(1/c_i) + β_1(X_{i1}/c_i) + ... + β_{k−1}(X_{i,k−1}/c_i) + ε_i/c_i, i = 1, ..., n. (4.12)

Each 1/c_i² is called a weight, and the WLS estimator is the result of minimizing the weighted sum of squares,

Σ_{i=1}^{n} (1/c_i²)(Y_i − β_0 − β_1 X_{i1} − ... − β_{k−1} X_{i,k−1})². (4.13)

When the σ_i², or a proportional quantity, are known, the weights are not difficult to compute. Estimates for the β of Equation 4.12 are obtained by minimizing Equation 4.13 and are called the WLS estimates, because each Y and X observation is weighted (divided) by its own (heteroscedastic) standard deviation, σ_i.
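A minimal WLS sketch with statsmodels, assuming the disturbance standard deviation is proportional to x (so c_i = x_i); names and data are hypothetical:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    x = rng.uniform(1, 10, size=300)
    y = 2.0 + 0.5 * x + x * rng.normal(size=300)  # sigma_i proportional to x_i

    X = sm.add_constant(x)
    wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights are 1/c_i^2
    ols = sm.OLS(y, X).fit()
    print(ols.bse, wls.bse)  # compare standard errors under the two estimators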

The variance stabilizing transformations approach has the following basis: for any function f(Y) of Y with a continuous first derivative f′(Y), a finite second derivative f″(Y), and E(Y_i) = μ_i, it follows that

f(Y_i) = f(μ_i) + (Y_i − μ_i)f′(μ_i) + (1/2)(Y_i − μ_i)²f″(ξ_i), (4.14)

where ξ_i lies between Y_i and μ_i. Squaring both sides of Equation 4.14 and, with σ_i² small, taking variances yields

VAR[f(Y_i)] ≈ [f′(μ_i)]²σ_i², (4.15)

where σ_i² is the variance of Y_i with mean μ_i. To find a suitable transformation f of Y_i that would make VAR[f(Y_i)] approximately a constant, the following equation needs to be solved:

f′(μ_i)σ_i = c, (4.16)



where c is a constant. This transformation f is called a variance stabilizing transformation. For example, if σ_i is proportional to μ_i, then Equation 4.16 yields f(μ_i) = LN(μ_i). Frequently used transformations include the square root transformation when the error variance is proportional to an independent variable X_i, and the reciprocal transformation (1/X_i) when the error variance is proportional to X_i².

Example 4.1

A study of the effects of operating subsidies on transit system performance was undertaken for transit systems in the State of Indiana (see Karlaftis and McCarthy, 1998, for additional details and a review of the empirical literature). Data were collected (see Table 4.1 for available variables) for 17 transit systems over a 2-year period (1993 and 1994). Although these are panel data (they contain time-series data for a cross section of transit systems), they are of very "short" duration (only 2 years of data exist) and will be analyzed as if they were cross-sectional. A brief overview of the data set suggests the possible existence of heteroscedasticity, because systems of vastly different sizes and operating characteristics are pooled together. To determine significant factors that affect performance (performance is measured with the use of performance indicators), a linear regression model was estimated for each of these indicators. The estimation results are shown in Tables 4.2 through 4.4. For each of these indicators, both plots of the disturbances and the Goldfeld–Quandt test (at the 5% significance level) indicate the existence of heteroscedasticity. As a result, a series of WLS estimations were undertaken, and the results appear in the same tables.

In the absence of a valid hypothesis regarding the variable(s) causing heteroscedasticity, a "wild goose chase" might ensue, as in this example. The most striking result, which applies to all three estimations and for all weights, is that the slope parameters (β) are (more) significant in

TABLE 4.1

Variables Available for the Analysis

Variable Name | Explanation

SYSTEM | Transit system name
YEAR | Year when data were collected
OE_RVM | Operating expenses (U.S. dollars) per revenue vehicle mile
PAX_RVM | Passengers per revenue vehicle mile
OE_PAX | Operating expenses per passenger in U.S. dollars
POP | Transit system catchment area population
VEH | Number of vehicles available
EMP | Total number of employees
FUEL | Total annual gallons of fuel
C_LOC | Total subsidies from local sources in U.S. dollars
C_STAT | Total subsidies from state sources in U.S. dollars
C_FED | Total subsidies from the federal government in U.S. dollars



TABLE 4.2

Regression of Operating Expenses per Revenue Vehicle Mile on Selected Independent Variables (t statistics in parentheses)

Explanatory Variable | OLS | WLS1 | WLS2 | WLS3

Constant | 1.572 (8.554) | 1.579 (11.597) | 1.578 (9.728) | 1.718 (11.221)
POP | −2.168E-06 (−0.535) | 8.476E-07 (0.562) | −9.407E-07 (−0.582) | −4.698E-07 (−0.319)
VEH | −8.762E-04 (−0.025) | 0.031 (2.242) | 0.012 (1.191) | 0.003 (0.466)
EMP | 3.103E-03 (0.198) | −0.011 (−1.814) | −0.003 (−0.603) | 0.001 (0.308)
FUEL | −1.474E-05 (−3.570) | −1.443E-05 (−9.173) | −1.334E-05 (−7.741) | −1.367E-05 (−8.501)
C_LOC | 1.446E-06 (1.075) | 1.846E-06 (4.818) | 1.350E-06 (3.965) | 1.320E-06 (5.052)
C_STAT | 3.897E-06 (1.305) | 1.352E-06 (1.390) | 2.647E-06 (3.050) | 2.961E-06 (4.183)
C_FED | 2.998E-06 (1.990) | 3.794E-06 (6.777) | 3.221E-06 (5.799) | 2.938E-06 (6.958)
R² | 0.555 | 0.868 | 0.917 | 0.942

Note: WLS1 = weighted least squares (weight: FUEL^1.1); WLS2 = weighted least squares (weight: VEH^1.3); WLS3 = weighted least squares (weight: EMP^1.6).

TABLE 4.3

Regression of Passengers per Revenue Vehicle Mile on Selected Independent Variables (t statistics in parentheses)

Explanatory Variable | OLS | WLS1 | WLS2

Constant | 0.736 (7.888) | 0.771 (9.901) | 0.756 (11.689)
POP | −3.857E-06 (−1.877) | −1.923E-06 (−2.583) | −1.931E-06 (−3.729)
VEH | 5.318E-03 (0.299) | 0.023 (3.336) | 0.025 (13.193)
EMP | −1.578E-03 (−0.198) | −0.009 (−3.162) | −0.010 (−13.026)
FUEL | −2.698E-07 (−0.129) | −7.069E-07 (−1.120) | −4.368E-07 (−1.041)
C_LOC | −3.604E-07 (−0.528) | 4.982E-08 (0.341) | 5.753E-08 (0.807)
C_STAT | 1.164E-06 (0.768) | −2.740E-07 (−0.620) | −4.368E-07 (−2.089)
C_FED | 1.276E-06 (1.669) | 1.691E-06 (5.906) | 1.726E-06 (11.941)
R² | 0.558 | 0.859 | 0.982

Note: WLS1 = weighted least squares (weight: POP^2.1); WLS2 = weighted least squares (weight: VEH^2.1).

TABLE 4.4

Regression of Operating Expenses per Passenger on Selected Independent Variables (t statistics in parentheses)

Explanatory Variable | OLS | WLS1

Constant | 2.362 (12.899) | 1.965 (17.144)
POP | 5.305E-06 (1.315) | 3.532E-06 (3.123)
VEH | −1.480E-02 (−0.425) | −0.007 (−0.677)
EMP | 7.339E-03 (0.470) | 0.005 (1.018)
FUEL | −1.015E-05 (−2.467) | −1.028E-05 (−10.620)
C_LOC | 1.750E-06 (1.307) | 1.273E-06 (5.700)
C_STAT | 5.460E-07 (0.183) | 1.335E-06 (1.982)
C_FED | 4.323E-08 (0.029) | 7.767E-07 (1.794)
R² | 0.460 | 0.894

Note: WLS1 = weighted least squares (weight: POP^2.1).


the WLS estimation than in the OLS estimation. As previously noted, in the presence of heteroscedasticity OLS estimators of standard errors are biased, and one cannot foretell the direction of this bias. In the present case study the bias is upward, implying that it overestimates the standard error. One would have to accept the WLS estimates as more trustworthy because they have explicitly accounted for the heteroscedasticity problem.

From a policy perspective, the OLS results suggest the absence of a direct effect of operating subsidies on performance, whereas the WLS results suggest a generally degrading effect of subsidies on performance (with the exception of state-level subsidies on passengers per revenue vehicle mile). As a final point, the WLS estimation does not completely alleviate heteroscedasticity. As such, it may be appropriate to use variance stabilizing transformations to account fully for this problem.

4.5 No Serial Correlation in the Disturbances Assumption

The assumption that disturbances corresponding to different observations are uncorrelated is frequently violated, particularly when modeling time-series data. When disturbances from different, usually adjacent, time periods are correlated, the problem is called disturbance (error term) serial correlation or autocorrelation. Serial correlation may occur in time-series studies either as a result of correlation over time or as a result of explanatory variables omitted from the model. For example, consider an analysis of monthly time-series data involving the regression of transit output (vehicle-miles) on labor, fuel, and capital inputs (a production function). If there is a labor strike affecting output in one quarter, there may be no reason to believe that this disruption will be carried over to the next quarter (no serial correlation problem). However, the disruption of the strike may affect output in the subsequent month(s), leading to serial correlation. Figure 4.3 shows a variety of patterns that correspond to autocorrelated disturbances.

Similar to serial correlation, spatial correlation may occur when data are taken over contiguous geographical areas sharing common attributes. For example, household incomes of residents on many city blocks are typically not unrelated to household incomes of residents from a neighboring block. Spatial statistics is an area of considerable research interest and potential. Readers interested in this scientific area should refer to Anselin (1988) for an in-depth treatment of the subject.

To formalize the discussion on serial correlation, suppose the following regression model is analyzed:

Y_i = β_0 + β_1 X_i + ε_i, (4.17)

© 2003 by CRC Press LLC

Page 142: Data Analysis Book

where the usual assumption of independence for the disturbance ε_i is violated, and instead of being random and independent, the disturbances are dependent as ε_i = ρε_{i−1} + u_i, where the u_i are independent and normally distributed with constant variance. Furthermore, in cases where disturbances in one time period are correlated with disturbances in only the previous time period, first-order serial correlation is examined. The strength of serial correlation depends on the parameter ρ, which is called the autocorrelation coefficient. When ρ = 0 the ε_i are independent, and when 0 < ρ ≤ 1 (−1 ≤ ρ < 0) the ε_i are positively (negatively) correlated. In the case of positive serial correlation, successive error terms are given by

ε_i = ρε_{i−1} + u_i
ε_{i−1} = ρε_{i−2} + u_{i−1}
...
ε_{i−n+1} = ρε_{i−n} + u_{i−n+1}.

FIGURE 4.3 Patterns of autocorrelated disturbances.



Combining the first and second equations yields ε_i = ρ(ρε_{i−2} + u_{i−1}) + u_i = ρ²ε_{i−2} + ρu_{i−1} + u_i, which implies that the process has "memory," in that it "remembers" past conditions to some extent. The strength of this memory is reflected in ρ. Continuing the substitution and reversing the order of the terms yields

ε_i = u_i + ρu_{i−1} + ρ²u_{i−2} + ... + ρ^{n−1}u_{i−n+1} + ρ^n ε_{i−n}.

And, since E(u_i) = E(u_{i−1}) = ... = E(u_{i−n+1}) = 0 and E(ε_i) = E(ε_{i−1}) = ... = E(ε_{i−n+1}) = 0, then VAR(ε_i) ≠ VAR(u_i). Furthermore, VAR(u_i) = σ_u², and it follows that

VAR(ε_i) = VAR(u_i) + ρ²VAR(u_{i−1}) + ρ⁴VAR(u_{i−2}) + ... = σ_u²(1 + ρ² + ρ⁴ + ...) = σ_u²/(1 − ρ²).

It follows that under serial correlation the estimated disturbance variance σ² is larger than the true variance of the random independent disturbances σ_u² by a factor of 1/(1 − ρ²). As such, the OLS estimates lose efficiency and are no longer BLUE. On the other hand, they are unbiased and consistent, similar to the case of heteroscedasticity.

4.5.1 Detecting Serial Correlation

The disturbances in a regression model may not have apparent correlation over time. In such cases a plot of the disturbances over time should appear scattered randomly around the zero line (Figure 4.2, left). If the disturbances show a trend over time, then serial correlation is obvious (Figure 4.3). Serial correlation is also diagnosed using the well-known and widely used Durbin–Watson (DW) test. The DW d statistic is computed as (Durbin and Watson, 1951)

d = Σ_{i=2}^{n} (ε̂_i − ε̂_{i−1})² / Σ_{i=1}^{n} ε̂_i². (4.18)

When the disturbances are independent, d should be approximately equal to 2. When the disturbances are positively (negatively) correlated, d should be less than (more than) 2. Therefore, this test for serial correlation is based on whether d is equal to 2 or not. Unfortunately, the critical values of d depend on the X_i, which vary across data sets. To overcome this inconvenience, Durbin and Watson established upper (d_U) and lower (d_L) bounds for this critical value (Figure 4.4). It is clear that if d < d_L or d > 4 − d_L, the null hypothesis of no serial correlation is rejected. If d_U < d < 4 − d_U, the null hypothesis cannot be rejected. But what should be done when d falls in the inconclusive region? Although some procedures have been developed to examine these cases further, they have not been incorporated into standard statistical software. A prudent approach is to consider the inconclusive region as suggestive of serial correlation and take corrective measures. If the results after correction are the same as before correction, then the corrective process was probably not necessary. If the results are different, then the process was necessary.

Note here that the DW test is applicable for regression models with an intercept (constant) term and without a lagged dependent variable as a regressor. An alternative test, typically used in the presence of a lagged dependent variable(s), is Durbin's h statistic (Durbin, 1970)

h = (1 − d/2) √[T / (1 − T·VAR(β̂))], (4.19)

where d is the reported DW statistic (so that 1 − d/2 estimates ρ̂, since d ≈ 2(1 − ρ̂)), T is the number of observations, and VAR(β̂) is the square of the standard error of the estimated parameter of the lagged dependent variable (note that the h test is not valid for T·VAR(β̂) > 1). Durbin has shown that the h statistic is approximately normally distributed with unit variance and, as a result, the test for first-order serial correlation can be done directly using the tables for the standard normal distribution.
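The d statistic of Equation 4.18 is one line with statsmodels; a minimal sketch on hypothetical data with AR(1) disturbances:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(9)
    x = rng.normal(size=200)
    e = np.zeros(200)
    for t in range(1, 200):              # build AR(1) disturbances with rho = 0.7
        e[t] = 0.7 * e[t - 1] + rng.normal()
    y = 1.0 + 2.0 * x + e

    res = sm.OLS(y, sm.add_constant(x)).fit()
    print(durbin_watson(res.resid))      # well below 2, signaling positive correlation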

4.5.2 Correcting for Serial Correlation

As noted previously, serial correlation of the disturbances may be the result of omitted variables, of correlation over time, and a consequence of the

FIGURE 4.4 Durbin–Watson critical values (H_0: ρ = 0; reject H_0 for d < d_L or d > 4 − d_L; do not reject for d_U < d < 4 − d_U; inconclusive otherwise).



nature of the phenomenon under study. In the first case, it is best to locate and include the missing variable in the model rather than to attempt other corrections. In the second case, the most common approach is to transform the original time-series variables in the regression so that the transformed model has independent disturbances. Consider again the regression model of Equation 4.17, with disturbances that suffer from first-order positive serial correlation (most of the discussion regarding autocorrelation and possible corrections concentrates on first-order serial correlation; readers interested in the theoretical aspects of higher-order serial correlation should refer to Shumway and Stoffer, 2000),

Y_i = β_0 + β_1 X_i + ε_i, ε_i = ρε_{i−1} + u_i. (4.20)

To remove serial correlation, Y_i and X_i are transformed into Y*_i and X*_i as follows:

Y*_i = Y_i − ρY_{i−1} and X*_i = X_i − ρX_{i−1}, i = 2, ..., n.

Furthermore, the first time-period observations are transformed using Y*_1 = (1 − ρ²)^(1/2) Y_1 and X*_1 = (1 − ρ²)^(1/2) X_1 (many software packages, depending on the procedure used to estimate ρ, ignore the first observation, although recent research has shown this to be inappropriate because it may lead to loss of information and adversely affect the results of the new regression). The new regression model after the transformations is

Y*_i = β*_0 + β_1 X*_i + u_i,

where the disturbances u_i are independent. The critical aspect for the success of this approach is the estimation of ρ, which is not known in advance. If ρ̂, the estimated autocorrelation parameter, is a consistent estimate of ρ, then the corresponding estimates of β*_0 and β_1 are asymptotically efficient. Numerous methods for estimating ρ are described in the literature and are readily available in standard statistical software packages; the most common are the Cochrane–Orcutt (1949) method, the Hildreth–Lu (1960) search procedure, the Durbin (1960) method, and the Beach–MacKinnon (1978) maximum likelihood procedure.
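statsmodels' GLSAR implements an iterative (Cochrane–Orcutt-type) version of this transformation; a minimal sketch on hypothetical series y and x:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(10)
    x = rng.normal(size=300)
    e = np.zeros(300)
    for t in range(1, 300):
        e[t] = 0.6 * e[t - 1] + rng.normal()    # AR(1) disturbances
    y = 1.0 + 2.0 * x + e

    model = sm.GLSAR(y, sm.add_constant(x), rho=1)  # rho=1: one autoregressive lag
    res = model.iterative_fit(maxiter=10)           # alternates rho and beta estimation
    print(model.rho, res.params)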

Example 4.2

Monthly bus demand and socioeconomic data for Athens, Greece, were collected for the period between April 1995 and April 2000 to assess the sensitivity of bus travelers to public transport costs, as well as to other factors that are hypothesized to influence demand (the variables available for this analysis appear in Table 4.5). Public transport demand cannot be directly estimated, so two different measures were used: monthly ticket and travel card sales. Their sum approximates total transit demand. In general, two separate models, one for ticket and one for travel card sales, should be estimated to allow for the possibility of differential price elasticities between these two measures. In this example the latter is examined.

TABLE 4.5

Variables Available for Public Transport Demand Estimation Model

Variable Abbreviation | Explanation

PRICE | Ticket/travel card price in U.S. dollars
VMT | Total monthly vehicle miles traveled
STRIKES | Total monthly hours of strikes
NUMAUTOS | Number of automobiles
UNEMPLOYMENT | Unemployment (%)
FUELPRICE | Auto fuel price in U.S. dollars
INCOMEPERCAP | Income per capita in U.S. dollars per year
TIMETREND | Time trend
MARCH | Dummy variable for March
AUGUST | Dummy variable for August
DEC | Dummy variable for December
METRO | Dummy variable for the introduction of two new Metro lines (0: before, 1: after the introduction)

TABLE 4.6
Regression Model Coefficient Estimates for Travel Card Sales in Athens (t statistics in parentheses)

Independent Variable       No Correction for Autocorrelation    MLEa                MLEb
Constant                   –13.633 (–2.97)                      –2.371 (–0.37)      –15.023 (–4.21)
PRICE                      –0.764 (–6.42)                       –0.771 (–4.96)      –0.765 (–9.84)
VMT                        0.351 (1.54)                         0.382 (2.12)        0.06 (0.35)
NUMAUTOS                   2.346 (3.79)                         0.509 (0.65)        2.713 (5.54)
UNEMPLOYMENT               0.033 (0.36)                         0.165 (1.35)        0.059 (0.70)
FUELPRICE                  –0.007 (–0.05)                       –0.134 (–0.60)      –0.002 (–0.03)
INCOMEPERCAP               2.105 (3.35)                         –0.013 (–0.02)      2.611 (5.51)
MARCH                      0.024 (1.43)                         0.006 (0.57)        0.021 (1.79)
AUGUST                     –0.216 (–10.10)                      –0.161 (–12.51)     –0.287 (–11.73)
DEC                        0.026 (1.56)                         0.004 (0.38)        0.072 (3.83)
METRO                      –0.003 (–0.15)                       0.042 (1.46)        –0.004 (–0.29)
First-order ρc             —                                    0.715 (7.71)        See Table 4.7
Durbin–Watson statistic    1.189                                1.531               2.106
R2                         0.83                                 0.89                0.95

a MLE estimation of ρ for first-order serial correlation.
b MLE estimation of ρ for first- to 12th-order serial correlation.
c ρ = autocorrelation coefficient.

Table 4.6 presents the results using both OLS estimation and estimation with a correction for serial correlation (ρ was estimated using maximum likelihood and the first observation of the series was not discarded). Correction for serial correlation was considered because the data reflect a monthly time series, and the results of the DW d-test suggest that serial correlation is present. Table 4.6 indicates that prior to the correction for first-order serial correlation, some t statistics were biased upward (for example, for the estimated parameters for PRICE, NUMAUTOS, and INCOMEPERCAP), whereas others were downward biased (VMT, UNEMPLOYMENT, METRO). Note that the d statistic for this model nearly doubled after first-order correction.

Finally, because the data collected were monthly observations, besides first-order serial correlation, which is readily available in most statistical packages, serial correlation from first to 12th order was investigated. As shown in Table 4.7, the first- and 12th-order serial correlation parameters are clearly statistically significant and conceptually important. Some others, such as the fifth and seventh order, are also statistically significant, but their practical interpretation is not clear. The results for higher-order correction are interesting, particularly when compared to the first-order correction. For example, in many cases, such as for VMT, NUMAUTOS, and INCOMEPERCAP, the t statistics yield values closer to those that resulted in the absence of first-order correction. As a result, the validity of the results for first-order correction is questioned. However, it must be stressed that for formal policy analyses a more in-depth and careful analysis of the disturbance autoregression parameters should be undertaken, because it is clear that the results are indeed sensitive to the order of the correction for serial correlation.

TABLE 4.7
Estimates of Autoregressive Parameters (correction for serial correlation)

Lag    Coefficient    t Statistic
1      0.344          2.65
2      0.17           1.33
3      0.18           1.42
4      0.22           1.70
5      0.24           1.86
6      0.14           1.12
7      0.27           1.95
8      0.22           1.68
9      0.09           0.70
10     0.01           0.10
11     0.24           1.76
12     0.33           2.54

4.6 Model Specification Errors

In the preceding four sections, violations regarding the disturbance terms of the linear regression model were discussed. It was implicitly assumed that the estimated models represent correctly and capture mathematically the phenomena under investigation. Statistically speaking, it was assumed that specification bias or specification errors were avoided in the previous discussion. A specification error occurs when the functional form of a model is misspecified, and instead of the "correct" model another model is estimated. Specification errors, which may result in severe consequences for the estimated model, usually arise in four ways (readers interested in tests for specification errors should refer to Godfrey, 1988).

The first is when a relevant variable is omitted from the specified model (underfitting a model). Excluding an important variable from a model is frequently called omitted variable bias. To illustrate this, suppose the correct model is

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$  (4.21)

but, instead of this model, the following is estimated:

$Y_i = \beta_0^* + \beta_1^* X_{i1} + \varepsilon_i^*$.  (4.22)

If the omitted variable $X_{i2}$ is correlated with the included variable $X_{i1}$, then both $\hat{\beta}_0^*$ and $\hat{\beta}_1^*$ are biased because it can be shown that

$E(\hat{\beta}_1^*) = \beta_1 + \beta_2 \dfrac{COV(X_{i1}, X_{i2})}{VAR(X_{i1})}$.  (4.23)

Furthermore, since the bias (the second term of Equation 4.23) does not diminish to zero as the sample size becomes larger (approaches infinity), the estimated parameters are inconsistent. However, in cases when $X_{i2}$ is not correlated with the included variable $X_{i1}$ (COV(X2, X1) = 0), the estimated parameter $\hat{\beta}_1^*$ is unbiased and consistent but $\hat{\beta}_0^*$ remains biased. In both cases, with correlated and uncorrelated variables, the estimated error variance of $\hat{\beta}_1^*$ from Equation 4.22 is a biased estimator of the variance of the true estimator $\hat{\beta}_1$, and the usual confidence interval and hypothesis-testing procedures are unreliable (for proofs see Kmenta, 1986).
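The bias expression in Equation 4.23 is easy to verify numerically. A brief sketch (simulated data; all data-generating values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, b0, b1, b2 = 100_000, 1.0, 2.0, 3.0
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)     # omitted X2 is correlated with X1
y = b0 + b1 * x1 + b2 * x2 + rng.normal(size=n)

# Short regression omitting x2
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Slope converges to beta1 + beta2 * COV(x1, x2) / VAR(x1), per eq. 4.23
bias = b2 * np.cov(x1, x2)[0, 1] / np.var(x1)
print(b_short[1], b1 + bias)           # both approximately 3.5
```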

The second source of misspecification error occurs when an irrelevant variable is included in the specified model (overfitting a model). Including an irrelevant variable in a model is frequently called irrelevant variable bias; for example, suppose the "correct" model is

$Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i$,  (4.24)

but instead the following model is estimated

$Y_i = \beta_0^* + \beta_1^* X_{i1} + \beta_2^* X_{i2} + \varepsilon_i^*$.  (4.25)

The inclusion of the irrelevant variable $X_{i2}$ implies that the restriction $\beta_2 = 0$ has not been considered. The OLS estimators for this incorrect model are unbiased (because $E(\hat{\beta}_1^*) = \beta_1$), as well as consistent. However, the estimated parameters are inefficient (because the information that $\beta_2 = 0$ is excluded), suggesting that their variances will generally be higher than the variances of the correctly specified terms, making the OLS estimators LUE (linear unbiased estimators) but not BLUE.

The third source of misspecification error arises when an incorrect functional form for the model is used. Assume, for example, that the correct functional form is the one appearing in Equation 4.21 but, instead, the following is estimated:

$LN(Y_i) = \beta_0 + \beta_1 LN(X_{i1}) + \beta_2 X_{i2} + \varepsilon_i$.  (4.26)

The problem with specifying an incorrect functional form is similar to omitted variable bias: the parameter estimates are biased and inconsistent. When developing a statistical model, it is important not only that all theoretically relevant variables be included, but also that the functional form be correct. Box–Cox transformations, discussed in Chapter 3, are a widely used approach for evaluating the appropriateness of the functional form of a model.

The fourth source of misspecification error arises when the independent variables are highly correlated (multicollinearity). In general, in a regression it is hoped that the explanatory variables are highly correlated with the dependent variable but, at the same time, are not highly correlated with other independent variables. The severity of the problem depends on the degree of multicollinearity. Low correlations among the regressors do not result in any serious deterioration in the quality of the estimated model. However, high correlations may result in highly unstable OLS estimates of the regression parameters. In short, in the presence of high to extreme multicollinearity the OLS estimators remain BLUE, but the standard errors of the estimated parameters are disproportionately large. As a result, the t test values for the parameter estimates are generally small and the null hypothesis that the parameters are zero may not be rejected, even though the associated variables are important in explaining the variation in the dependent variable. Finally, under multicollinearity the parameter estimates are usually unstable and, because of the high standard errors, reliable estimates of the regression parameters may be difficult to obtain. As a result, counterintuitive parameter signs are frequent, and dropping or adding an independent variable to the model causes large changes in the magnitudes and directions of other independent variables included in the model.
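The symptom is easy to reproduce. The short sketch below (simulated data; the values and the common rule-of-thumb VIF threshold of 10 are assumptions for illustration) builds two nearly collinear regressors and computes the inflated standard errors along with the variance inflation factor, 1/(1 − R_j²):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))   # large for x1, x2

# Variance inflation factor for x1: regress x1 on the other regressors
Xo = np.column_stack([np.ones(n), x2])
g = np.linalg.lstsq(Xo, x1, rcond=None)[0]
r2 = 1 - np.sum((x1 - Xo @ g) ** 2) / np.sum((x1 - x1.mean()) ** 2)
print(se, 1.0 / (1.0 - r2))            # VIF far above 10
```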

5
Simultaneous Equation Models

Some transportation data are best modeled by a system of interrelated equations. Examples include the interrelation of utilization of individual vehicles (measured in kilometers driven) in multivehicle households, the interrelation between travel time from home to an activity and the duration of the activity, the interrelation between the temperature at a sensor in a pavement and the temperature at adjacent sensors in the pavement, and the interrelation of average vehicle speeds by lane with the vehicle speeds in adjacent lanes. All of these examples produce interrelated systems of equations where the dependent variable in one equation is an independent variable in another.

Interrelated systems of equations create a potentially serious estimation problem if their interrelated structure is not considered. This problem arises because estimation of equation systems using ordinary least squares (OLS) violates a key OLS assumption in that a correlation between regressors and disturbances will be present, because not all independent variables are fixed in random samples (one or more of the independent variables will be endogenous and OLS estimates will be erroneous). All too often, the general issue of endogeneity, such as that resulting from simultaneous equation models, is ignored in the analysis of transportation data. Ignoring endogeneity results in erroneous conclusions and inferences.

5.1 Overview of the Simultaneous Equations Problem

As an example of a simultaneous equations problem, consider the task of modeling the utilization of vehicles (kilometers driven per year) in households that own two vehicles. It is a classic simultaneous equations problem because the usage of one vehicle is interrelated with the usage of the other (or others). A household's travel allocation is limited, so as one vehicle is used more, the other is used less (see Mannering, 1983, for a detailed study of this problem). To show this problem, consider annual vehicle utilization equations (one for each vehicle) of the following linear form (suppressing the subscripting for each observation, n, to improve exposition),

$u_1 = \beta_1 Z_1 + \alpha_1 X + \lambda_1 u_2 + \varepsilon_1$  (5.1)

$u_2 = \beta_2 Z_2 + \alpha_2 X + \lambda_2 u_1 + \varepsilon_2$,  (5.2)

where u1 is the kilometers per year that vehicle 1 is driven, u2 is the kilometers per year that vehicle 2 is driven, Z1 and Z2 are vectors of vehicle attributes (for vehicles 1 and 2, respectively), X is a vector of household characteristics, the β and α terms are vectors of estimable parameters, the λ are estimable scalars, and the ε are disturbance terms.

If Equations 5.1 and 5.2 are estimated separately using ordinary least squares (OLS), one of the key assumptions giving rise to best linear unbiased estimators (BLUE) is violated; correlation between regressors and disturbances will be present since an independent variable, u2 in Equation 5.1 or u1 in Equation 5.2, would not be fixed in repeated samples. To be fixed in repeated samples, the value of the dependent variable (left-hand side variable) must not influence the value of an independent variable (right-hand side). This is not the case in Equations 5.1 and 5.2 since in Equation 5.1 the independent variable u2 varies as the dependent variable u1 varies, and in Equation 5.2 the independent variable u1 varies as the dependent variable u2 varies. Thus, u2 and u1 are said to be endogenous variables in Equations 5.1 and 5.2, respectively. As a result, estimating Equations 5.1 and 5.2 separately using OLS results in biased and inconsistent estimates of model parameters.
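The bias is easy to see in a small simulation. The sketch below (all coefficient values and variable names are assumptions for illustration) generates data from a two-equation system of the form of Equations 5.1 and 5.2 and shows that single-equation OLS does not recover λ1:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z1, z2, x = rng.normal(size=(3, n))
e1, e2 = rng.normal(size=(2, n))
b1, a1, l1 = 1.0, 0.5, -0.4            # structural parameters of eq. 5.1
b2, a2, l2 = 1.0, 0.5, -0.4            # structural parameters of eq. 5.2

# Generate utilizations from the reduced form of the system
u1 = (b1 * z1 + a1 * x + l1 * (b2 * z2 + a2 * x + e2) + e1) / (1 - l1 * l2)
u2 = b2 * z2 + a2 * x + l2 * u1 + e2

# Single-equation OLS for eq. 5.1, treating u2 as if it were exogenous
X = np.column_stack([np.ones(n), z1, x, u2])
ols = np.linalg.lstsq(X, u1, rcond=None)[0]
print(ols[3], l1)                      # OLS estimate of lambda1 is biased
```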

5.2 Reduced Form and the Identification Problem

To understand the issues involved in estimating simultaneous equations model parameters, a natural starting point is to consider a reduced form solution. In Equations 5.1 and 5.2 the problem becomes a simple one of solving two equations and two unknowns to arrive at reduced forms. Substituting Equation 5.2 for u2 in Equation 5.1 gives

$u_1 = \beta_1 Z_1 + \alpha_1 X + \lambda_1 (\beta_2 Z_2 + \alpha_2 X + \lambda_2 u_1 + \varepsilon_2) + \varepsilon_1$.  (5.3)

Rearranging,

$u_1 = \dfrac{\beta_1}{1 - \lambda_1 \lambda_2} Z_1 + \dfrac{\alpha_1 + \lambda_1 \alpha_2}{1 - \lambda_1 \lambda_2} X + \dfrac{\lambda_1 \beta_2}{1 - \lambda_1 \lambda_2} Z_2 + \dfrac{\varepsilon_1 + \lambda_1 \varepsilon_2}{1 - \lambda_1 \lambda_2}$  (5.4)

and similarly substituting Equation 5.1 for u1 in Equation 5.2 gives

$u_2 = \dfrac{\beta_2}{1 - \lambda_2 \lambda_1} Z_2 + \dfrac{\alpha_2 + \lambda_2 \alpha_1}{1 - \lambda_2 \lambda_1} X + \dfrac{\lambda_2 \beta_1}{1 - \lambda_2 \lambda_1} Z_1 + \dfrac{\varepsilon_2 + \lambda_2 \varepsilon_1}{1 - \lambda_2 \lambda_1}$.  (5.5)

Because the endogenous variables u1 and u2 are replaced by their exogenous determinants, Equations 5.4 and 5.5 are estimated using OLS as

$u_1 = a_1 Z_1 + b_1 X + c_1 Z_2 + \xi_1$,  (5.6)

and

$u_2 = a_2 Z_2 + b_2 X + c_2 Z_1 + \xi_2$,  (5.7)

where

$a_1 = \dfrac{\beta_1}{1 - \lambda_1 \lambda_2}; \quad b_1 = \dfrac{\alpha_1 + \lambda_1 \alpha_2}{1 - \lambda_1 \lambda_2}; \quad c_1 = \dfrac{\lambda_1 \beta_2}{1 - \lambda_1 \lambda_2}; \quad \xi_1 = \dfrac{\varepsilon_1 + \lambda_1 \varepsilon_2}{1 - \lambda_1 \lambda_2}$  (5.8)

$a_2 = \dfrac{\beta_2}{1 - \lambda_2 \lambda_1}; \quad b_2 = \dfrac{\alpha_2 + \lambda_2 \alpha_1}{1 - \lambda_2 \lambda_1}; \quad c_2 = \dfrac{\lambda_2 \beta_1}{1 - \lambda_2 \lambda_1}; \quad \xi_2 = \dfrac{\varepsilon_2 + \lambda_2 \varepsilon_1}{1 - \lambda_2 \lambda_1}$.  (5.9)

OLS estimation of the reduced form models (Equations 5.6 and 5.7) is called indirect least squares (ILS) and is discussed later in this chapter. Although estimated reduced form models are readily used for forecasting purposes, if inferences are to be drawn from the model system, the underlying parameters need to be determined. Unfortunately, uncovering the underlying parameters (the β, α, and λ terms) in reduced form models is problematic because either too little or too much information is often available. For example, note that Equations 5.8 and 5.9 provide two possible solutions for λ1:

$\lambda_1 = \dfrac{c_{11}}{a_{21}}$ and $\lambda_1 = \dfrac{c_{12}}{a_{22}}$,  (5.10)

where the second subscript denotes the element of the vectors c1 and a2 corresponding to each variable in Z2.

This reveals an important problem with the estimation of simultaneous equations — model identification. In some instances, it may be impossible to determine the underlying parameters. In these cases, the modeling system is said to be unidentified. In cases where exactly one equation solves the underlying parameters, the model system is said to be exactly identified. When more than one equation solves the underlying parameters (as shown in Equation 5.10), the model system is said to be over-identified.

A commonly applied rule to determine if an equation system is identified (exactly or over-identified) is the order condition. The order condition

determines an equation to be identified if the number of all variables excluded from an equation in an equation system is greater than or equal to the number of endogenous variables in the equation system minus 1. For example, in Equation 5.1, the number of elements in the vector Z2, which is an exogenous vector excluded from the equation, must be greater than or equal to 1 because there are two endogenous variables in the equation system (u1 and u2).

5.3 Simultaneous Equation Estimation

There are two broad classes of simultaneous equation techniques — single-equation estimation methods and systems estimation methods. The distinction between the two is that systems methods consider all of the parameter restrictions (caused by over-identification) in the entire equation system and account for possible contemporaneous (cross-equation) correlation of disturbance terms. Contemporaneous disturbance term correlation is an important consideration in estimation. For example, in the vehicle utilization equation system (Equations 5.1 and 5.2), one would expect ε1 and ε2 to be correlated because vehicles operated by the same household will share unobserved effects (common to that household) that influence vehicle utilization. Because system estimation approaches are able to utilize more information (parameter restrictions and contemporaneous correlation), they produce variance–covariance matrices that are at worst equal to, and in most cases smaller than, those produced by single-equation methods (resulting in lower standard errors and higher t statistics for estimated model parameters).

Single-equation methods include instrumental variables (IV), indirect least squares (ILS), two-stage least squares (2SLS), and limited information maximum likelihood (LIML). Systems estimation methods include three-stage least squares (3SLS) and full information maximum likelihood (FIML). These different estimation techniques are discussed in the following section.

5.3.1 Single-Equation Methods

ILS involves OLS estimation of reduced form equations. In the case of vehicle utilization in two-vehicle households, this would be OLS estimation as depicted in Equations 5.6 and 5.7. The estimates of these reduced form parameters are used to determine the underlying model parameters by solving such equations as 5.8 and 5.9. Because solving for underlying model parameters tends to result in nonlinear equations (such as Equation 5.10), the unbiased estimates of the reduced form parameters (from Equations 5.6 and 5.7) do not produce unbiased estimates of the underlying model parameters. This is because the ratio of unbiased parameters is biased. There is also the

problem of having multiple estimates of underlying model parameters if the equation system is over-identified (as in Equation 5.10).

An IV approach is the simplest approach to solving the simultaneous equations estimation problem. This approach simply replaces the endogenous variables on the right-hand side of the equations in the equation system with an instrumental variable — a variable that is highly correlated with the endogenous variable it replaces and is not correlated with the disturbance term. For example, in Equations 5.1 and 5.2, the IV approach would be to replace u2 and u1, respectively, with appropriate instrumental variables and to estimate Equations 5.1 and 5.2 using OLS. This approach yields consistent parameter estimates. The problem, however, is one of finding suitable instruments, which is difficult to near impossible in many cases.

2SLS is an extension of instrumental variables in that it seeks the best instrument for endogenous variables in the equation system. Stage 1 regresses each endogenous variable on all exogenous variables. Stage 2 uses regression-estimated values from stage 1 as instruments and estimates each equation using OLS. The resulting parameter estimates are consistent, and studies have shown that most small-sample properties of 2SLS are superior to ILS and IV.
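A compact numpy sketch of the two stages, reusing the assumed two-equation system from the earlier simulation (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z1, z2, x = rng.normal(size=(3, n))
e1, e2 = rng.normal(size=(2, n))
b1, a1, l1, b2, a2, l2 = 1.0, 0.5, -0.4, 1.0, 0.5, -0.4
u1 = (b1 * z1 + a1 * x + l1 * (b2 * z2 + a2 * x + e2) + e1) / (1 - l1 * l2)
u2 = b2 * z2 + a2 * x + l2 * u1 + e2

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: regress the endogenous u2 on all exogenous variables
Z = np.column_stack([np.ones(n), z1, z2, x])
u2_hat = Z @ ols(Z, u2)

# Stage 2: OLS for eq. 5.1 with u2 replaced by its fitted values
X2 = np.column_stack([np.ones(n), z1, x, u2_hat])
print(ols(X2, u1)[3])                  # close to the true lambda1 = -0.4
```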

LIML maximizes the likelihood function of the reduced form models, generally assuming normally distributed error terms. Unlike ILS, the likelihood function is written to account for parameter restrictions (critical for over-identified models). This alleviates the ILS problem of having multiple estimates of underlying model parameters in over-identified equations.

In selecting among the single-equation estimation methods, 2SLS and LIML have obvious advantages when equations are over-identified. When the equations are exactly identified and the disturbances are normally distributed, ILS, IV, 2SLS, and LIML produce the same results. In over-identified cases, 2SLS and LIML have the same asymptotic variance–covariance matrix, so the choice becomes one of minimizing computational costs. A summary of single-equation estimation methods is provided in Table 5.1.

5.3.2 System Equation Methods

System equation methods are typically preferred to single-equation methods because they account for restrictions in over-identified equations and contemporaneous (cross-equation) disturbance term correlation (the correlation of disturbance terms across the equation system). 3SLS is the most popular of the system equation estimation methods. In 3SLS, stage 1 is to obtain the 2SLS estimates of the model system. In stage 2, the 2SLS estimates are used to compute residuals from which cross-equation disturbance term correlations are calculated. In stage 3, generalized least squares (GLS) is used to compute parameter estimates. Appendix 5A at the end of this chapter provides an overview of the GLS estimation procedure.

Because of the additional information considered (contemporaneous correlation of disturbances), 3SLS produces more efficient parameter estimates

than single-equation estimation methods. An exception is when there is no contemporaneous disturbance term correlation. In this case, 2SLS and 3SLS parameter estimates are identical.

FIML extends LIML by accounting for contemporaneous correlation of disturbances. The assumption typically made for estimation is that the disturbances are multivariate normally distributed. Accounting for contemporaneous error correlation complicates the likelihood function considerably. As a result, FIML is seldom used in simultaneous equation estimation. And, because under the assumption of multivariate normally distributed disturbances FIML and 3SLS share the same asymptotic variance–covariance matrix, there is no real incentive to choose FIML over 3SLS in most applications. A summary of system equation estimation methods is provided in Table 5.2.

Example 5.1

To demonstrate the application of a simultaneous equations model, consider the problem of studying mean vehicle speeds by lane on a multilane freeway. Because of the natural interaction of traffic in adjacent lanes, a simultaneous equations problem arises because lane mean speeds are determined, in part, by the lane mean speeds in adjacent lanes. This problem was first studied by Shankar and Mannering (1998) and their data and approach are used here.

TABLE 5.1
Summary of Single-Equation Estimation Methods for Simultaneous Equations

Method: Indirect least squares (ILS)
Procedure: Applies ordinary least squares to the reduced form models.
Resulting parameter estimates: Consistent but not unbiased.

Method: Instrumental variables (IV)
Procedure: Uses an instrument (a variable that is highly correlated with the endogenous variable it replaces, but is not correlated with the disturbance term) to estimate individual equations.
Resulting parameter estimates: Consistent but not unbiased.

Method: Two-stage least squares (2SLS)
Procedure: Finds the best instrument for endogenous variables: stage 1 regresses each endogenous variable on all exogenous variables; stage 2 uses regression-estimated values from stage 1 as instruments and estimates equations with OLS.
Resulting parameter estimates: Consistent but not unbiased; generally better small-sample properties than ILS or IV.

Method: Limited information maximum likelihood (LIML)
Procedure: Uses maximum likelihood to estimate reduced form models; can incorporate parameter restrictions in over-identified equations.
Resulting parameter estimates: Consistent but not unbiased; has the same asymptotic variance–covariance matrix as 2SLS.

For this example, data represent speeds obtained from a six-lane freeway with three lanes in each direction separated by a large median (each direction is considered separately). A summary of the available data is shown in Table 5.3. At the point where the data were gathered, highly variable seasonal weather conditions were present. As a consequence, seasonal factors are expected to play a role. The data were collected over a period of a year, and the mean speeds, by lane, were the mean of the spot speeds gathered over 1-h periods. The equation system is written as (see Shankar and Mannering, 1998)

TABLE 5.2
Summary of System Estimation Methods for Simultaneous Equations

Method: Three-stage least squares (3SLS)
Procedure: Stage 1 obtains 2SLS estimates of the model system; stage 2 uses the 2SLS estimates to compute residuals to determine cross-equation correlations; stage 3 uses GLS to estimate model parameters.
Resulting parameter estimates: Consistent and more efficient than single-equation estimation methods.

Method: Full information maximum likelihood (FIML)
Procedure: Similar to LIML but accounts for contemporaneous correlation of disturbances in the likelihood function.
Resulting parameter estimates: Consistent and more efficient than single-equation estimation methods; has the same asymptotic variance–covariance matrix as 3SLS.

TABLE 5.3
Lane Mean-Speed Model Variables

Variable No.    Variable Description
1     Mean speed in the right lane in kilometers per hour (gathered over a 1-h period)
2     Mean speed in the center lane in kilometers per hour (gathered over a 1-h period)
3     Mean speed in the left lane in kilometers per hour (gathered over a 1-h period)
4     Traffic flow in right lane (vehicles per hour)
5     Traffic flow in center lane (vehicles per hour)
6     Traffic flow in left lane (vehicles per hour)
7     Proportion of passenger cars (including pickup trucks and minivans) in the right lane
8     Proportion of passenger cars (including pickup trucks and minivans) in the center lane
9     Proportion of passenger cars (including pickup trucks and minivans) in the left lane
10    Month that speed data were collected (1 = January, 2 = February, etc.)
11    Hour in which data were collected (the beginning hour of the 1-h data collection period)

$s_R = \beta_R Z_R + \lambda_R s_C + \varepsilon_R$  (5.11)

$s_C = \beta_C Z_C + \lambda_C s_L + \mu_C s_R + \varepsilon_C$  (5.12)

$s_L = \beta_L Z_L + \lambda_L s_C + \varepsilon_L$  (5.13)

where the s are the mean speeds (over a 1-h period in kilometers/h) for the right-most lane (subscript R) relative to the direction of travel (the slow lane), the center lane (subscript C), and the left lane (subscript L), respectively. The Z are vectors of exogenous variables influencing the mean speeds in the corresponding lanes, the β are vectors of estimable parameters, the λ and μ are estimable scalars, and the ε are disturbance terms.

The equation system (Equations 5.11 through 5.13) is estimated with 2SLS and 3SLS and the estimation results are presented in Table 5.4. These results show that there are noticeable differences between 2SLS and 3SLS parameter estimates in the equations for mean speeds in the right, center, and left lanes. These differences underscore the importance of accounting for contemporaneous correlation of disturbance terms. In this case, it is clear that the disturbances εR, εC, and εL share unobserved factors occurring over the hour during which mean speeds are calculated. These unobserved factors, which are captured in equation disturbances, could include vehicle disablements, short-term driver distractions, weather changes, and so on. One would expect contemporaneous disturbance term correlation to diminish if more complete data were available to estimate the model. Then, the difference between 2SLS and 3SLS parameter estimates would diminish. The interested reader should see Shankar and Mannering (1998) for a more detailed analysis of these mean-speed data.

5.4 Seemingly Unrelated Equations

In some transportation-related studies it is possible to have a series of dependent variables that are considered a group, but do not have direct interaction as simultaneous equations do. An example is the time individual household members spend traveling during a typical week. Consider such a model for individuals in households with exactly three licensed drivers. The time they spend traveling is written as

$tt_1 = \beta_1 Z_1 + \alpha_1 X + \varepsilon_1$  (5.14)

$tt_2 = \beta_2 Z_2 + \alpha_2 X + \varepsilon_2$,  (5.15)

$tt_3 = \beta_3 Z_3 + \alpha_3 X + \varepsilon_3$  (5.16)

where tt1, tt2, and tt3 are the times spent traveling in a typical week for household drivers 1, 2, and 3, respectively, Z1, Z2, and Z3 are traveler characteristics, X is a vector of household characteristics, the β and α are vectors of estimable parameters, and the ε are disturbance terms. Unlike Equations 5.1 and 5.2 and Equations 5.11 through 5.13, Equations 5.14 through 5.16 do not directly interact with each other. That is, tt1 does not directly determine tt2, tt2 does not directly affect tt3, and so on. However, because all three licensed drivers represented by Equations 5.14 through 5.16 are from the same household, they are likely to share unobserved characteristics. In this case, the equations are seemingly unrelated, but there will be contemporaneous

TABLE 5.4
Simultaneous Equation Estimation Results (t statistics in parentheses)

Variable Description                                                  2SLS              3SLS

Dependent Variable: Mean Speed in the Right Lane (km/h)
Constant                                                              –20.82 (–8.48)    –67.92 (–4.75)
Mean speed in the center lane (km/h)a                                 1.10 (55.6)       1.49 (12.8)
Winter: 1 if mean speeds were determined from speed data gathered
  during hours in the months of November, December, January, or
  February; 0 if not                                                  –0.455 (–1.59)    –2.00 (–4.33)
Spring: 1 if mean speeds were determined from speed data gathered
  during hours in the months of March, April, or May; 0 if not        0.08 (0.28)       –1.49 (–3.25)
A.M. peak period: 1 if mean speeds were determined from speed data
  gathered during the hours from 7:00 A.M. to 10:00 A.M.; 0 if not    –1.99 (–6.44)     –0.29 (–1.48)

Dependent Variable: Mean Speed in the Center Lane (km/h)
Constant                                                              23.71 (12.52)     13.58 (6.32)
Mean speed in the right lane (km/h)a                                  0.23 (3.07)       0.38 (4.84)
Mean speed in the left lane (km/h)a                                   0.58 (8.77)       0.52 (7.69)
P.M. peak period: 1 if mean speeds were determined from speed data
  gathered during the hours from 5:00 P.M. to 7:00 P.M.; 0 if not     –1.79 (–5.29)     –0.57 (–5.40)

Dependent Variable: Mean Speed in the Left Lane (km/h)
Constant                                                              –23.40 (–8.43)    –20.56 (–5.19)
Mean speed in the center lane (km/h)a                                 1.21 (57.7)       1.20 (37.7)

Number of observations                                                6362              6362
R-squared – equation for the mean speed in the right lane             0.35              0.71
R-squared – equation for the mean speed in the center lane            0.32              0.90
R-squared – equation for the mean speed in the left lane              0.34              0.82
3SLS system R-squared                                                 —                 0.81

a Endogenous variable.

(cross-equation) correlation of error terms. If Equations 5.14 through 5.16 are estimated separately by OLS, the parameter estimates are consistent but not efficient. Efficient parameter estimates are obtained by considering the contemporaneous correlation of disturbances ε1, ε2, and ε3. Considering contemporaneous correlation in seemingly unrelated equations is referred to as SURE (seemingly unrelated equation estimation) and was initially considered by Zellner (1962). Estimation of seemingly unrelated equations is accomplished using GLS, as is the case in 3SLS estimation (see Appendix 5A). However, cross-equation correlations are estimated by OLS instead of the 2SLS estimates used in 3SLS (see Appendix 5A).
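The following sketch (two assumed equations with correlated disturbances; all values illustrative) carries out the SURE logic directly: equation-by-equation OLS residuals estimate the contemporaneous covariance, which then enters a GLS fit of the stacked system:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1, x2 = rng.normal(size=(2, n))
# Disturbances correlated across the two equations
e = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=n)
y1 = 1.0 + 2.0 * x1 + e[:, 0]
y2 = -1.0 + 0.5 * x2 + e[:, 1]
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Contemporaneous covariance from equation-by-equation OLS residuals
r1 = y1 - X1 @ ols(X1, y1)
r2 = y2 - X2 @ ols(X2, y2)
S = np.cov(np.vstack([r1, r2]))

# GLS on the stacked system with Omega = S kron I_n
Xs = np.block([[X1, np.zeros_like(X2)], [np.zeros_like(X1), X2]])
ys = np.concatenate([y1, y2])
W = np.kron(np.linalg.inv(S), np.eye(n))
beta = np.linalg.solve(Xs.T @ W @ Xs, Xs.T @ W @ ys)
print(beta)                            # approximately [1, 2, -1, 0.5]
```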

5.5 Applications of Simultaneous Equations to Transportation Data

Although many transportation problems lend themselves naturally to simultaneous equations approaches, there have been surprisingly few transportation applications. The most common example is the utilization of individual vehicles in multivehicle households. This problem has been studied using 3SLS by Mannering (1983). Other examples include individuals' travel time to trip-generating activities, such as shopping, and the length of time they are involved in that activity (Hamed and Mannering, 1993). There are countless applications of simultaneous equations in the general economic literature. The reader is referred to Pindyck and Rubinfeld (1990), Kmenta (1997), Kennedy (1998), and Greene (2000) for additional information and examples.

Appendix 5A: A Note on Generalized Least Squares Estimation

OLS assumptions are that disturbance terms have equal variances and are not correlated. Generalized least squares (GLS) is used to relax these OLS assumptions. Under OLS assumptions, in matrix notation,

$E(\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^T) = \sigma^2 \mathbf{I}$,  (5A.1)

where E(.) denotes expected value, $\boldsymbol{\varepsilon}$ is an n × 1 column vector of equation disturbance terms (where n is the total number of observations in the data), $\boldsymbol{\varepsilon}^T$ is the 1 × n transpose of $\boldsymbol{\varepsilon}$, $\sigma^2$ is the disturbance term variance, and I is the n × n identity matrix,

$\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$.  (5A.2)

When heteroscedasticity is present, $E(\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^T) = \mathbf{V}$, where V is the n × n matrix

$\mathbf{V} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}$.  (5A.3)

For disturbance-term correlation, $E(\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^T) = \sigma^2 \boldsymbol{\Omega} = \mathbf{V}$, where

$\boldsymbol{\Omega} = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \cdots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & 1 \end{bmatrix}$.  (5A.4)

Recall that in OLS, parameters are estimated from

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$,  (5A.5)

where $\hat{\boldsymbol{\beta}}$ is a p × 1 column vector (where p is the number of parameters), X is an n × p matrix of data, $\mathbf{X}^T$ is the transpose of X, and Y is an n × 1 column vector. Using V, Equation 5A.5 is rewritten as

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{V}^{-1} \mathbf{Y}$.  (5A.6)

The most difficult aspect of GLS estimation is obtaining an estimate of the V matrix. In 3SLS, that estimate is obtained using the initial 2SLS parameter estimates. In seemingly unrelated regression estimation (SURE), V is estimated from initial OLS estimates of the individual equations.
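As a brief numerical illustration of Equations 5A.5 and 5A.6 (simulated heteroscedastic data; all values assumed, and V is taken as known here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
x = rng.uniform(1, 5, size=n)
sig = 0.5 * x                          # disturbance spread grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=sig)
X = np.column_stack([np.ones(n), x])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                  # eq. 5A.5
Vinv = np.diag(1.0 / sig**2)           # V^{-1}, assumed known
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)    # eq. 5A.6
print(beta_ols, beta_gls)              # GLS is the more efficient of the two
```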

6
Panel Data Analysis

Traditionally, statistical and econometric models have been estimated using cross-sectional or time-series data. In an increasing number of applications, however, data are available based on cross sections of individuals observed over time (or other observational units such as firms, geographic entities, and so on). These data, which combine cross-sectional and time-series characteristics, are panel (or pooled) data, and allow researchers to construct and test realistic behavioral models that cannot be identified using only cross-sectional or time-series data. The modeling of panel data raises new specification issues, however, such as heterogeneity, which, if not accounted for explicitly, may lead to model parameters that are inconsistent and/or meaningless. This chapter focuses on the development of panel data regression models that account for heterogeneity in a variety of ways. Furthermore, this chapter discusses issues related to the possible distortions from both the cross-sectional (heteroscedasticity) and time-series (serial correlation) dimensions to which panel data are vulnerable.

6.1 Issues in Panel Data Analysis

Panel data analyses provide a number of advantages over analyses based solely on cross-sectional or time-series data (Hsiao, 1986). First, from a statistical perspective, by increasing the number of observations, panel data have higher degrees of freedom and less collinearity, particularly in comparison with time-series data, thus improving the efficiency of the parameter estimates. Second, panel data allow a researcher to analyze questions that cannot be adequately addressed using either time-series or cross-sectional data. For example, suppose that a cross section of public transit agencies reveals that, on average, public transit subsidies are associated with 20% increased ridership. If a homogeneous population of public transit firms is available, this might be construed to imply that a firm's ridership will increase by 20% given transit subsidies. This might also be the implication

for a sample of heterogeneous firms. However, an alternative explanation in a sample of heterogeneous public transit firms is that the subsidies have no effect (0% increase) on four fifths of the firms, and raise ridership by 100% on one fifth of the firms. Although these competing hypotheses cannot be tested using a cross-sectional sample (in the absence of a cross-sectional variable that "explains" this difference), it is possible to test between them by identifying the effect of subsidies on a cross section of time series for the different transit systems. Third, panel data allow researchers to test whether more simplistic specifications are appropriate. With panel data, it is possible to account for cross-sectional heterogeneity by introducing additional parameters into the model. Thus, testing for cross-sectional homogeneity is equivalent to testing the null hypothesis that these additional parameters are equal to zero.

Compared with cross-sectional or time-series data, panel data raise new specification issues that need to be considered during analysis. The most important of these is heterogeneity bias (Hausman and Taylor, 1981). Heterogeneity refers to the differences across cross-sectional units that may not be appropriately reflected in the available data — that is, an existing explanatory variable or variables. If heterogeneity across cross-sectional units is not accounted for in a statistical model, estimated parameters are biased because they capture part of the heterogeneity. Indeed, as noted by Greene (2000), cross-sectional heterogeneity should be the central focus of panel data analysis.

A second issue is serial correlation of the disturbance terms, which occurs in time-series studies when the disturbances associated with observations in one time period are dependent on disturbances from prior time periods. For example, positive serial correlation frequently occurs in time-series studies, either because of correlation in the disturbances or, more likely, because of a high degree of temporal correlation in the cumulative effects of omitted variables (serial correlation is discussed in depth in Chapter 4). Although serial correlation does not affect the unbiasedness or consistency of regression parameter estimates, it does affect their efficiency. In the case of positive serial correlation, the estimates of the standard errors are smaller than the true standard errors. In other words, the regression estimates are unbiased but their associated standard errors are biased downward, which biases the t statistics upward and increases the likelihood of rejecting a true null hypothesis. A third issue is heteroscedasticity, which refers to the variance of the disturbances not being constant across observations. As discussed in Chapter 4, this problem arises in many practical applications, particularly when cross-sectional data are analyzed, and, similar to serial correlation, it affects the efficiency of the estimated parameters.

Notwithstanding the advantages associated with panel data, failure to account for cross-sectional heterogeneity, serial correlation, and heteroscedasticity could seriously affect the estimated parameters and lead to inappropriate statistical inferences.

6.2 One-Way Error Component Models

Variable-intercept models across individuals or time (one-way models) and across both individuals and time (two-way models), the simplest and most straightforward models for accounting for cross-sectional heterogeneity in panel data, arise when the null hypothesis of overall homogeneity is rejected. The variable-intercept model assumes that the effects of omitted variables may be individually unimportant but are collectively significant, and thus are treated as a random variable that is independent of the included independent variables. Examples of error-component models in the transportation literature include Dee (1999), Eskeland and Feyzioglu (1997), and Loizides and Tsionas (2002). Because heterogeneity effects are assumed to be constant for given cross-sectional units or for different cross-sectional units during one time period, they are absorbed by the intercept term as a means to account for individual or time heterogeneity (Hsiao, 1986). More formally, a panel data regression is written as

$Y_{it} = \alpha + \boldsymbol{\beta} \mathbf{X}_{it} + u_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T$,  (6.1)

where i refers to the cross-sectional units (individuals, counties, states, etc.), t refers to the time periods, $\alpha$ is a scalar, $\boldsymbol{\beta}$ is a P × 1 vector, and $\mathbf{X}_{it}$ is the itth observation on the P explanatory variables. A one-way error component model for the disturbances, which is the most commonly utilized panel data formulation, is specified as

$u_{it} = \mu_i + \nu_{it}$,  (6.2)

where $\mu_i$ is the unobserved cross-sectional specific effect and $\nu_{it}$ are random disturbances. For example, in an investigation of the effects of operating subsidies on the performance of transit systems, $Y_{it}$ might be a performance indicator for transit system i for year t, $\mathbf{X}_{it}$ a set of explanatory variables including operating subsidies, and the $\mu_i$ capture the unobservable transit system-specific effects such as, say, managerial skills of the system's administration.

When the $\mu_i$ are assumed to be fixed parameters to be estimated and the $\nu_{it}$ are random disturbances that follow the usual regression assumptions, then combining Equations 6.1 and 6.2 yields the following model, where inference is conditional on the particular n cross-sectional units (transit systems) that are observed, and is thus called a fixed effects model:

$Y_{it} = \alpha + \boldsymbol{\beta} \mathbf{X}_{it} + \mu_i + \nu_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T$,  (6.3)

on which ordinary least squares (OLS), which provides best linear unbiased estimators (BLUE), is used to obtain $\alpha$, $\boldsymbol{\beta}$, and $\mu_i$. Note that, especially

when n is large, many indicator variables are included in the model, and the matrices to be inverted by OLS are of large dimension (n + P). As such, a least squares dummy variable (LSDV) estimator for Equation 6.3 is obtained for $\boldsymbol{\beta}$ (this estimator is also called the within-group estimator because only the variation within each group is utilized in forming the estimator). Testing for the joint significance of the included fixed effects parameters (the dummy variables) is straightforwardly conducted using the Chow F test

$F_0 = \dfrac{(RRSS - URSS)/(n - 1)}{URSS/(nT - n - P)} \sim F_{n-1,\, n(T-1)-P}$,  (6.4)

where RRSS are the restricted residual sums of squares from OLS on the pooled model and URSS are the unrestricted residual sums of squares from the LSDV regression.
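The within (LSDV) estimator and the F test of Equation 6.4 are straightforward to compute. The sketch below (simulated panel; all values assumed) uses a single regressor and exploits the fact that the within regression reproduces the LSDV residual sum of squares:

```python
import numpy as np

rng = np.random.default_rng(6)
n_units, T, P = 50, 10, 1
i = np.repeat(np.arange(n_units), T)   # unit index of each observation
mu = rng.normal(scale=2.0, size=n_units)
x = rng.normal(size=n_units * T) + 0.8 * mu[i]  # x correlated with effects
y = 1.0 + 2.0 * x + mu[i] + rng.normal(size=n_units * T)

def within(v):
    """Subtract each unit's mean (the within transformation)."""
    return v - (np.bincount(i, weights=v) / T)[i]

xw, yw = within(x), within(y)
beta_fe = (xw @ yw) / (xw @ xw)        # within (LSDV) estimate of beta
urss = np.sum((yw - beta_fe * xw) ** 2)

# Restricted (pooled) model with a common intercept
Xp = np.column_stack([np.ones_like(x), x])
bp = np.linalg.lstsq(Xp, y, rcond=None)[0]
rrss = np.sum((y - Xp @ bp) ** 2)

F = ((rrss - urss) / (n_units - 1)) / (urss / (n_units * T - n_units - P))
print(beta_fe, F)                      # a large F favors the fixed effects
```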

The fixed effects specification suffers from an obvious shortcoming in that it requires the estimation of many parameters, with the associated losses of degrees of freedom. This is avoided if the $\mu_i$ are considered to be random variables, such that $\mu_i \sim IID(0, \sigma_\mu^2)$ and $\nu_{it} \sim IID(0, \sigma_\nu^2)$, the $\mu_i$ and $\nu_{it}$ are independent, and the $\mathbf{X}_{it}$ are independent of the $\mu_i$ and $\nu_{it}$ for all i, t.

Unlike the fixed effects model, where inferences are conditional on the particular cross-sectional units sampled, an alternative formulation, called the random effects model, is an appropriate specification if the n cross-sectional units are randomly drawn from a large population. Furthermore, it can be shown that a random effects specification implies a homoscedastic disturbance variance, $VAR(u_{it}) = \sigma_\mu^2 + \sigma_\nu^2$ for all i, t, and serial correlation only for disturbances of the same cross-sectional unit (Hsiao, 1986). In general, the following holds for a random effects model:

$COV(u_{it}, u_{js}) = \begin{cases} \sigma_\mu^2 + \sigma_\nu^2 & \text{for } i = j,\; t = s \\ \sigma_\mu^2 & \text{for } i = j,\; t \neq s \\ 0 & \text{otherwise} \end{cases}$

and

$COR(u_{it}, u_{js}) = \begin{cases} 1 & \text{for } i = j,\; t = s \\ \sigma_\mu^2/(\sigma_\mu^2 + \sigma_\nu^2) & \text{for } i = j,\; t \neq s \\ 0 & \text{otherwise.} \end{cases}$

To obtain a generalized least squares (GLS, see Chapter 5) estimator of the random effects regression parameters, the disturbance variance–covariance

matrix from Equation 6.3 needs to be estimated. This is typically a large matrix, and its estimation can only be successful with a small n and T (Baltagi, 1985). As a result, this estimation is typically accomplished using either one of a variety of modified GLS approaches (see, for example, Nerlove, 1971b; Wansbeek and Kapteyn, 1982, 1983; Swamy and Arora, 1972; and Taylor, 1980) or using maximum likelihood estimation (Breusch, 1987).

The existence of two fundamentally different model specifications for accounting for heterogeneity in panel data analyses raises the obvious question of which specification, fixed or random effects, is preferred in any particular application. Despite the research interest this issue has attracted, it is not a straightforward question to answer, especially when one considers that the selection of whether to treat the effects as fixed or random can make a significant difference in the estimates of the parameters (see Hausman, 1978, for an example). The most important issue when considering these alternative specifications is the context of the analysis. In the fixed-effects model, inferences are conditional on the effects that are in the sample, while in the random-effects model inferences are made unconditionally with respect to the population of effects (Hsiao, 1986). In other words, the essential difference between these two modeling specifications is whether the inferences from the estimated model are confined to the effects in the model or whether the inferences are made about a population of effects (from which the effects in the model are a random sample). In the former case the fixed-effects model is appropriate, whereas the latter is suited for the random-effects model.

Although the choice between the two different models is based on a priori hypotheses regarding the phenomenon under investigation, the fixed-effects model has a considerable virtue in that it does not assume that the individual effects are uncorrelated with the regressors, $E(u_{it} | \mathbf{X}_{it}) = 0$, as is assumed by the random-effects model. In fact, the random-effects model may be biased and inconsistent due to omitted variables (Hausman and Taylor, 1981; Chamberlain, 1978). With the intent of identifying potential correlation between the individual effects and the regressors, Hausman (1978) devised a test to examine the null hypothesis of no correlation between the individual effects and $\mathbf{X}_{it}$. This test assumes that under the null hypothesis both the LSDV and the GLS estimators are consistent (with GLS asymptotically efficient), whereas under the alternative hypothesis the GLS estimator is biased and inconsistent for $\boldsymbol{\beta}$, but the LSDV estimator remains unbiased and consistent. As a result, the test is based on the difference $\hat{\mathbf{d}} = \hat{\boldsymbol{\beta}}_{GLS} - \hat{\boldsymbol{\beta}}_{LSDV}$ and the assumption that $COV(\hat{\boldsymbol{\beta}}_{GLS}, \hat{\mathbf{d}}) = 0$. The test statistic is given by

$h = \hat{\mathbf{d}}^T \left[ VAR(\hat{\mathbf{d}}) \right]^{-1} \hat{\mathbf{d}}$,

which, under the null hypothesis, is asymptotically distributed as $\chi_P^2$, where P denotes the dimension of $\boldsymbol{\beta}$ (note that the Hausman test is not

valid for heteroscedastic or serially correlated disturbances; an alternative Wald statistic for the Hausman test, not sensitive to heteroscedasticity or serial correlation, was derived by Arellano, 1993). Hausman's test is not a tool for conclusively deciding between the fixed- and random-effects specifications. A rejection of the null hypothesis of no correlation suggests the possible inconsistency of the random effects model and the possible preference for a fixed-effects specification.
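A simplified sketch of the Hausman statistic follows (simulated data; all values assumed). For brevity, pooled OLS stands in for the random-effects GLS estimator; like GLS, it is consistent only when the effects are uncorrelated with the regressor, so the statistic is illustrative rather than the exact GLS-based test:

```python
import numpy as np

rng = np.random.default_rng(7)
n_units, T = 200, 5
i = np.repeat(np.arange(n_units), T)
mu = rng.normal(size=n_units)
x = rng.normal(size=n_units * T) + 0.8 * mu[i]  # effects correlated with x
y = 2.0 * x + mu[i] + rng.normal(size=n_units * T)

def within(v):
    return v - (np.bincount(i, weights=v) / T)[i]

# Within (LSDV) estimate and its variance
xw, yw = within(x), within(y)
b_fe = (xw @ yw) / (xw @ xw)
s2_fe = np.sum((yw - b_fe * xw) ** 2) / (n_units * (T - 1) - 1)
var_fe = s2_fe / (xw @ xw)

# Pooled stand-in for the random-effects estimator
b_re = (x @ y) / (x @ x)
s2_re = np.sum((y - b_re * x) ** 2) / (n_units * T - 1)
var_re = s2_re / (x @ x)

d = b_re - b_fe
h = d**2 / (var_fe - var_re)           # asymptotically chi-squared, 1 d.f.
print(h)                               # large h: prefer the fixed effects
```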

6.2.1 Heteroscedasticity and Serial Correlation

As previously noted, the disturbance components of the models in Equations 6.1 and 6.2 are vulnerable to distortions from both the cross-sectional and time-series dimensions, which, as discussed in Chapter 4, amount to problems associated with heteroscedasticity and serial correlation. In both cases the parameter estimates are unbiased and consistent, but are not efficient.

In the case of heteroscedasticity, the homoscedastic error component model of Equation 6.2 is generalized to have a heteroscedastic $\mu_i$; that is, $\mu_i \sim (0, w_i^2)$, $i = 1, \ldots, n$, but $\nu_{it} \sim IID(0, \sigma_\nu^2)$. To correct for heteroscedasticity, WLS (weighted least squares) is applied (see Chapter 4). However, the feasibility of the WLS estimates in these cases requires estimates of $\sigma_\nu^2$ and $w_i^2$ for $i = 1, \ldots, n$, which requires large T and small n (Baltagi, 1995). As a result, the application of WLS in panel data is not as straightforward as in OLS regression. To this end, Baltagi and Griffin (1988) derived a GLS approach for estimating heteroscedastic disturbances in panel data and Magnus (1982) developed a maximum likelihood approach that assumes normality.

The disturbances in the model shown in Equation 6.3 are assumed to be correlated over time due to the presence of the same cross-sectional units observed over time. This correlation was shown to be $COR(u_{it}, u_{js}) = \sigma_\mu^2/(\sigma_\mu^2 + \sigma_\nu^2)$, for $i = j$, $t \neq s$. This assumption is not ideal for the basic error component model because it may result in unbiased and consistent but inefficient parameter estimates. Lillard and Willis (1978) generalized the error component model to the case of first-order serially correlated disturbances as follows: $\nu_{it} = \rho \nu_{i,t-1} + \eta_{it}$, where $|\rho| < 1$, $\mu_i \sim IID(0, \sigma_\mu^2)$, and $\eta_{it} \sim IID(0, \sigma_\eta^2)$. Furthermore, the $\mu_i$ are independent of the $\nu_{it}$. Baltagi and Li (1991) derived the appropriate matrix for transforming the autocorrelated disturbances into random terms (disturbances). When using panel data, the transformation has to be applied to all n cross-sectional units individually, leading to a potentially significant loss of degrees of freedom, especially when n is large, if the first time-period observations are discarded and not transformed using $Y_1^* = \sqrt{1 - \rho^2}\, Y_1$ and $X_1^* = \sqrt{1 - \rho^2}\, X_1$ (see Chapter 4 for more details on this transformation). Note that corrections for second- and fourth-order serial correlation are found in Baltagi (1995).

6.3 Two-Way Error Component Models

The disturbances presented in Equation 6.2 are further generalized to include time-specific effects. As Wallace and Hussain (1969), Nerlove (1971b), and Amemiya (1971) suggest, this generalization is called a two-way error components model, whose disturbances are written as

$u_{it} = \mu_i + \lambda_t + \nu_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T$,  (6.5)

where $\mu_i$ is the unobserved cross-sectional specific effect discussed in the previous section, $\lambda_t$ denotes the unobservable time effects, and $\nu_{it}$ are random disturbances. Note here that $\lambda_t$ is individual invariant and accounts for any time-specific effect that is not included in the regression. For example, it could account for strike year effects that disrupt transit service, and so on.

When the $\mu_i$ and $\lambda_t$ are assumed to be fixed parameters to be estimated and the $\nu_{it}$ are random disturbances that follow the usual regression assumptions, combining Equations 6.1 and 6.5 yields a model where inferences are conditional on the particular n cross-sectional units (transit systems) and are to be made over the specific time period of observation. This model is called a two-way fixed effects error component model and is given as

$Y_{it} = \alpha + \boldsymbol{\beta} \mathbf{X}_{it} + \mu_i + \lambda_t + \nu_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T$,  (6.6)

where the $\mathbf{X}_{it}$ are assumed independent of the $\nu_{it}$ for all i, t. Inference for this two-way fixed-effects model is conditional on the particular n individuals and over the T time periods of observation. Similar to the one-way fixed-effects model, the computational difficulties involved with obtaining the OLS estimates for $\boldsymbol{\beta}$ are circumvented by applying the within transformation of Wallace and Hussain (1969). Also, similar to the one-way model, testing for the joint significance of the included cross-sectional and time period fixed-effects parameters (the dummy variables) is straightforwardly computed using an F test

$F_0 = \dfrac{(RRSS - URSS)/(n + T - 2)}{URSS/((n - 1)(T - 1) - P)} \sim F_{n+T-2,\, (n-1)(T-1)-P}$,  (6.7)

where RRSS are the restricted residual sums of squares from OLS on the pooled model and URSS are the unrestricted residual sums of squares from the regression using the within transformation of Wallace and Hussain (1969). As a next step, the existence of individual effects given time effects can also be tested (Baltagi, 1995).

Similar to the one-way error component model case, if both the $\mu_i$ and $\lambda_t$ are random with $\mu_i \sim IID(0, \sigma_\mu^2)$, $\lambda_t \sim IID(0, \sigma_\lambda^2)$, $\nu_{it} \sim IID(0, \sigma_\nu^2)$, the $\mu_i$,

$\lambda_t$, and $\nu_{it}$ independent, and the $\mathbf{X}_{it}$ independent of the $\mu_i$, $\lambda_t$, and $\nu_{it}$, then this formulation is called the random-effects model. Furthermore, it can be shown that a random-effects specification implies a homoscedastic disturbance variance, where $VAR(u_{it}) = \sigma_\mu^2 + \sigma_\lambda^2 + \sigma_\nu^2$ for all i, t, and

$COV(u_{it}, u_{js}) = \begin{cases} \sigma_\mu^2 + \sigma_\lambda^2 + \sigma_\nu^2 & \text{for } i = j,\; t = s \\ \sigma_\mu^2 & \text{for } i = j,\; t \neq s \\ \sigma_\lambda^2 & \text{for } i \neq j,\; t = s \\ 0 & \text{otherwise} \end{cases}$

and

$COR(u_{it}, u_{js}) = \begin{cases} 1 & \text{for } i = j,\; t = s \\ \sigma_\mu^2/(\sigma_\mu^2 + \sigma_\lambda^2 + \sigma_\nu^2) & \text{for } i = j,\; t \neq s \\ \sigma_\lambda^2/(\sigma_\mu^2 + \sigma_\lambda^2 + \sigma_\nu^2) & \text{for } i \neq j,\; t = s \\ 0 & \text{for } i \neq j,\; t \neq s. \end{cases}$

Estimation of the two-way random-effects model is typically accomplished using the GLS estimators of Wallace and Hussain (1969) and Amemiya (1971), or by using maximum likelihood estimation (Baltagi and Li, 1992). For this model specification, Breusch and Pagan (1980) derived a Lagrange-multiplier test for the null hypothesis $H_0: \sigma_\mu^2 = \sigma_\lambda^2 = 0$; this test is based on the normality of the disturbances and is fairly straightforward to compute (for details, see Breusch and Pagan, 1980). Finally, for the two-way error component model, Hausman's (1978) test is still based on the difference between the fixed-effects estimator including both cross-sectional and time indicator variables and the two-way random effects GLS estimator. However, Kang (1985) showed the test does not hold for the two-way error component model and proposed a modified Hausman test.
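The two-way within transformation mentioned above removes both μi and λt by demeaning across both dimensions and adding back the overall mean. A minimal sketch on simulated data (all values assumed):

```python
import numpy as np

rng = np.random.default_rng(8)
n_units, T = 40, 12
mu = rng.normal(size=n_units)          # unit effects
lam = rng.normal(size=T)               # time effects
x = rng.normal(size=(n_units, T)) + mu[:, None] + lam[None, :]
y = 2.0 * x + mu[:, None] + lam[None, :] + rng.normal(size=(n_units, T))

def within2(v):
    """Two-way within transformation: v_it - v_i. - v_.t + v_.."""
    return (v - v.mean(axis=1, keepdims=True)
              - v.mean(axis=0, keepdims=True) + v.mean())

xw, yw = within2(x), within2(y)
print(np.sum(xw * yw) / np.sum(xw * xw))   # recovers 2.0; both effects gone
```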

Example 6.1

The effectiveness of safety belt use in reducing motor vehicle-related fatalities has been the subject of much research interest in the past few years (for more details, see Derrig et al., 2002). To investigate the hypothesized relationship between various exogenous factors, including seat belt usage rates, and traffic fatalities, Derrig et al. (2002) compiled a panel data set of demographic, socioeconomic, political, insurance, and roadway variables for all 50 U.S. states over a 14-year period (1983 through

1996) for a total of 700 observations (state-years). This data set was subsequently enriched with additional information by R. Noland of Imperial College, U.K., and is used in this analysis (for additional information, see Noland and Karlaftis, 2002). The variables used in the analysis appear in Table 6.1.

In this case study, one potential accident indicator variable, total fatalities per vehicle-miles traveled, is used and is regressed on a number of independent variables. For the purposes of this case study, a number of alternative modeling formulations were investigated: OLS regression (no correction for panel effects); a fixed-effects model (state effects); a two-way fixed-effects model (state and time effects); a random-effects model; a fixed-effects model with correction for first-order serial correlation; and a random-effects model with correction for first-order serial correlation. The parameter estimates and corresponding t statistics appear in Table 6.2. Interestingly, the results, and the significance of the parameter estimates in particular, show ample variation between the different specifications. For example, the parameter for percent seat belt use (SEATBELT) is significant at the 99% level for both the fixed- and random-effects specifications but loses much of this significance when incorporating a two-way fixed-effects model or a correction for serial correlation (indicating that without correction for serial correlation the standard error of the

TABLE 6.1
Variables Available for Analysis

Variable Abbreviation    Variable Description
STATE           State
YEAR            Year
STATENUM        State ID number
DEATHS          Number of traffic-related deaths
INJURED         Number of traffic-related injuries
PRIMLAW         Primary seat belt law
SECLAW          Secondary seat belt law
TOTVMT          Total VMT
PI92            Per capita income in 1992 $
POP             Total population
HOSPAREA        Hospitals per square mile
ETHANOL         Alcohol consumption, total ethanol by volume
ETHPC           Per capita ethanol consumption
SEATBELT        Percent seat belt use
PERC1524        Percent population 15–24
PERC2544        Percent population 25–44
PERC4564        Percent population 45–64
PERC65P         Percent population 65 plus
PERC75P         Percent population 75 plus
INMILES         Total lane miles (excluding local roads)
PRECIP          Annual precipitation in inches
EDUC            Percent bachelors degrees
PERCRINT        Percentage of vehicle miles driven on rural interstate highway miles

TABLE 6.2
Alternative Modeling Specifications for Fatalities per VMT (t statistics in parentheses)

Models: (1) OLS, no correction; (2) fixed effects; (3) two-way fixed effects; (4) random effects; (5) fixed effects with AR(1); (6) random effects with AR(1).

Independent Variable    Model 1            Model 2            Model 3            Model 4            Model 5            Model 6
Constant                1.69 (9.22)        —                  —                  0.83 (3.83)        —                  0.28 (1.2)
YEAR                    –7.90E-04 (–8.62)  –6.71E-04 (–3.75)  —                  –4.10E-04 (–3.87)  –4.50E-04 (–2.21)  –1.30E-04 (–1.15)
PI92                    –6.50E-07 (–7.12)  3.30E-07 (1.61)    5.5E-08 (0.23)     –1.90E-07 (–1.33)  1.30E-07 (0.62)    –3.70E-07 (–2.57)
POP                     1E-10 (2.4)        1.1E-09 (2.21)     9.3E-10 (1.91)     1.30E-10 (1.35)    1.50E-09 (2.81)    1.20E-10 (1.26)
HOSPAREA                –0.18 (–1.88)      –0.35 (–0.61)      –0.55 (–0.95)      –0.64 (–3.22)      –0.19 (–0.35)      –0.54 (–2.79)
ETHPC                   1.62 (3.65)        2.6 (1.72)         1.81 (1.19)        1.43 (1.73)        –1.97 (–1.11)      –3.70E-03 (–0.004)
SEATBELT                –2.20E-03 (–1.67)  –4.00E-03 (–3.3)   –2.37E-03 (–1.87)  –4.10E-03 (–3.59)  –2.50E-03 (–1.95)  –2.30E-03 (–1.91)
PERC1524                –0.148 (–5.63)     0.084 (3.22)       0.074 (2.85)       0.036 (1.71)       0.13 (4.41)        0.032 (1.33)
PERC2544                –0.184 (–10.61)    0.097 (2.77)       0.081 (2.26)       1.22E-02 (0.51)    0.16 (4.3)         0.031 (1.24)
PERC4564                0.081 (5.48)       0.063 (1.66)       0.022 (0.56)       0.037 (1.6)        0.052 (1.25)       0.011 (0.51)
PERC75P                 –0.298 (–11.29)    0.226 (2.29)       0.15 (1.48)        –6.20E-03 (–0.15)  0.31 (2.98)        3.80E-03 (0.09)
INMILES                 –2.4E-09 (–1.11)   –3.6E-08 (–1.47)   –3.6E-08 (–1.49)   –2.8E-09 (–0.55)   –3.3E-08 (–1.35)   –4.4E-09 (–0.88)
PRECIP                  –3.10E-05 (–0.8)   3.30E-04 (2.21)    2.30E-04 (1.61)    2.10E-04 (2.77)    2.40E-04 (1.48)    1.80E-04 (2.46)

Model Statistics
N                       400                400                400                400                350                350
R2                      0.650              0.916              0.923              0.650              0.926              0.651

parameters was downward biased; that is, the models without correction underestimated the standard error). On the other hand, the parameters for the hospitals per square mile (HOSPAREA) variable are significant for both random-effects specifications but are not significant for any of the fixed-effects formulations.

This brings up the interesting and at times highly debated issue of model selection. As previously discussed, when inferences are confined to the effects in the model, the effects are more appropriately considered to be fixed. When inferences are made about a population of effects from which those in the data are considered to be a random sample, then the effects should be considered random. In these cases, a fixed-effects model is defended on grounds that inferences are confined to the sample. In favor of selecting the fixed-effects rather than the random-effects formulation were the Hausman tests presented in Table 6.3.
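The Hausman statistic reported in Table 6.3 can be computed directly from the fixed- and random-effects estimates. The following is a minimal sketch in Python with NumPy and SciPy; the input names (b_fe, b_re, and their covariance matrices) are illustrative, not quantities defined in the text.

```python
import numpy as np
from scipy import stats

def hausman(b_fe, b_re, cov_fe, cov_re):
    """Hausman test: H = d'[VAR(b_FE) - VAR(b_RE)]^(-1) d, with d = b_FE - b_RE.

    Under the null that the random-effects estimator is consistent and
    efficient, H is asymptotically chi-squared distributed."""
    d = np.asarray(b_fe) - np.asarray(b_re)
    v = np.asarray(cov_fe) - np.asarray(cov_re)
    stat = float(d @ np.linalg.pinv(v) @ d)   # pseudo-inverse guards against a singular v
    df = int(np.linalg.matrix_rank(v))
    return stat, df, stats.chi2.sf(stat, df)
```

A small p-value, as in Table 6.3, rejects the random-effects specification in favor of fixed effects.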

6.4 Variable Coefficient Models

The discussion in the two previous sections concentrated on models for which the effects of omitted variables are considered as either one-way effects (individual specific, time specific) or two-way effects (individual and time specific). However, there may be cases in which the cross-sectional units examined possess different unobserved socioeconomic and demographic background factors, resulting in response variables that vary over time and across different cross-sectional units (Hsiao, 1986). For example, in Chapter 4, data from small, medium, and large transit systems from Indiana were used to investigate the effects of operating subsidies on the performance of the transit systems (Example 4.1). Karlaftis and McCarthy (1998) originally analyzed these data using an F test (described in the previous section) and rejected the null hypothesis of common intercept for the transit systems. However, also rejected was the null hypothesis of common slope parameters, especially for systems of different size groupings (small, medium, and large). This finding implies that operating subsidies have differential effects (different signs and/or magnitudes) depending on the size of the transit systems being examined.

TABLE 6.3

Fixed-Effects vs. Random-Effects Model Tests

Test Statistics
                          H       d.f.   p-Value^a
Hausman Test (2 vs. 4)    83.76   12     0.000
Hausman Test (5 vs. 6)    89.04   12     0.000

^a p-values < 0.1 favor fixed-effects models (at the 90% level).


In general, when the underlying hypothesis of the relationship between the variables examined is considered theoretically sound but the data do not support the hypothesis of the parameters being equal, it is possible to allow for variations in parameters across cross-sectional units and/or time to account for interunit and/or interperiod heterogeneity. The first case, a model that accounts for parameters that vary over cross-sectional units but are invariant over time, is written as

Y_{it} = \sum_{k=1}^{P} \beta_{ki} X_{kit} + u_{it} = \sum_{k=1}^{P} (\bar{\beta}_k + \xi_{ki}) X_{kit} + u_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T,   (6.8)

where \bar{\beta} is the common-mean parameter vector and the \xi_{ki} are the individual deviations from the common mean (\beta_{ki} = \bar{\beta}_k + \xi_{ki}). If the characteristics of individual samples are of interest, then the \xi_{ki} are treated as fixed constants (fixed-coefficient model), whereas if the population characteristics are of interest and the cross-sectional units available in the sample have been randomly drawn from the population, then the \xi_{ki} are treated as random variables having zero means and constant variances and covariances (random-coefficient model; for more details, see Swamy, 1970, 1971).

Similar to the model of Equation 6.8, if the assumption is made that model parameters are needed to account for individual cross-sectional unit heterogeneity and for specific time periods, then the model is rewritten as

Y_{it} = \sum_{k=1}^{P} (\bar{\beta}_k + \xi_{ki} + \zeta_{kt}) X_{kit} + u_{it}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T.   (6.9)

If the \xi_{ki}, \zeta_{kt}, and \bar{\beta}_k are treated as fixed constants, then the model is a fixed-coefficient model, whereas if the \xi_{ki} and \zeta_{kt} are considered random and the \bar{\beta}_k fixed, the model corresponds to the mixed analysis of variance model (Hartley and Rao, 1967; Hsiao, 1975). Despite the potential flexibility of the variable-coefficient models, they have not been used as extensively in empirical work as the one- and two-way error component models because of the computational complexities involved in their estimation. Finally, Rao (1973) developed a Lagrange multiplier test for the null hypothesis that the regression parameters do not vary across cross-sectional units and/or time in order to test for the need to develop variable-coefficient models. Also, Mundlak (1978) proposed a test for the appropriateness of the random-coefficient model (similar to the Hausman test described in the previous section).

6.5 Additional Topics and Extensions

The discussions in this chapter have concentrated on the development of single-equation regression models on a balanced panel data set. Although this development is well established in the literature and most of the models described are readily available in many commercially available statistical software packages, research in the development of models using panel data is evolving in many different directions.

First is the issue of unbalanced panel data models. The discussion to this point has concentrated on "complete" or "balanced" panels, referring to cases where cross-sectional units (individuals) are observed over the entire sample period. Unfortunately, experience shows that incomplete panels are more common in empirical research (Archilla and Madanat, 2000). For example, in collecting transit data over time, it is possible to find that some transit systems have discontinued operating, new systems have been formed, and others have been combined. These are typical scenarios that could lead to an unbalanced or incomplete data set. Both one-way and two-way error component models for unbalanced panel data can be estimated (see Moulton, 1986; Wansbeek and Kapteyn, 1989).

Second is the issue of error component models in systems of equations (simultaneous equations; see Chapter 5 for more details). In the simultaneous equations framework, error component models are viewed as pertaining either to SURE systems (seemingly unrelated regression equations) or to pure simultaneous equations systems. In the former, the application of fixed- and random-effects models is straightforward, and these are easily implemented in commercially available software (see Avery, 1977, for model development and Karlaftis et al., 1999, for a transportation application). Unfortunately, the same ease of practical application does not hold for simultaneous equations. This is a specialized topic described in Baltagi (1995).

Third is the issue of error components in limited and discrete dependent variable models. The available literature in this area is vast. For limited dependent variable models such as the fixed- and random-effects Tobit model, readers are referred to Maddala (1983), Heckman and McCurdy (1980), and Honore (1992). For discrete dependent variable models, readers should refer to Hsiao (1986) for a rigorous treatment of the subject and Greene (2000) for estimation aspects. Detailed estimation aspects of discrete outcome models with panel data are found in Keane (1994), McFadden (1989), and Madanat et al. (1997). Bhat (2000) and Hensher (2001) offer some interesting applications in the transportation area.


7
Time-Series Analysis

Time-series models have been the focus of considerable research and development in recent years in many disciplines, including transportation. This interest stems from the insights that can be gained by observing and analyzing the behavior of a variable over time. For example, modeling and forecasting a number of macroscopic traffic variables, such as traffic flow, speed, and occupancy, have been an indispensable part of most congestion management systems. Furthermore, forecasting passenger volumes at airport terminals is an integral input to the engineering planning and design processes. Each transportation variable that is observed over time frequently results in a time-series modeling problem where an attempt is made to predict the value of the variable based on a series of past values at regular time intervals.

The main goals of time-series analysis are to model the behavior of historical data and to forecast future outcomes. When analyzing such data, time-domain or frequency-domain approaches are used. The time-domain approach assumes that adjacent points in time are correlated and that future values are related to past and present ones. The frequency-domain approach makes the assumption that time-series characteristics relate to periodic or sinusoidal variations that are uncovered in the data. The focus of this chapter is the time-domain approach due to the predominance of this approach in the transportation literature and the relatively sparse number of applications in the transportation area using the frequency-domain approach (an overview and transportation-related applications of the frequency-domain approach are found in Stathopoulos and Karlaftis, 2001b, and Peeta and Anastassopoulos, 2002).

7.1 Characteristics of Time Series

In the time-domain approach, time-series analysis is based on specific characteristics, called types of movement, which are frequently encountered in such data. A time series of data is decomposed into long-term movement (L), seasonal movement (S), cyclical movement (C), and irregular or random movement (I). The decomposition of time-series characteristics is based on the assumption that these movements interact in either a multiplicative (L × S × C × I) or an additive (L + S + C + I) way. Graphical representations of such movements appear in Figure 7.1.

7.1.1 Long-Term Movements

Long-term (or secular) movements correspond to the general direction in which a time-series graph moves over time. The movement is indicated by a trend curve, frequently represented by a simple trend line. Some of the appropriate methods for estimating trends in a time series are the usual least squares approach, the moving average method, and the method of semi-averages (Spiegel and Stephens, 1998).

7.1.2 Seasonal Movements

When time series tend to follow identical or nearly identical patterns within equal time periods, they are recognized as seasonal movements or variations. Seasonal variations range from yearly periods in economic time series to daily periods in traffic data (when, for example, seasons or periods are considered the peak and off-peak travel periods). To quantify variations within these periods, say, monthly variation within a year, a seasonal index is frequently used. A seasonal index is used to obtain a set of measurements that reflects the seasonal variation (e.g., monthly) in values of a variable (e.g., monthly traffic crashes) for each of the subdivisions of a period. The average seasonal index for the entire period is equal to 100%, and the index fluctuates above and below 100% across subdivisions of the period. A number of methods are used for calculating a seasonal index.

The first is the average percentage method. In this method, data (e.g., monthly traffic crashes) for each subdivision are expressed as a percentage of the average across subdivisions in the period (e.g., yearly traffic crashes

FIGURE 7.1
Examples of time-series movements: long-term movement (with its trend line); long-term and cyclical movement; and long-term, cyclical, and seasonal movement. Each panel plots the variable against time.

divided by 12). When the corresponding subdivisions for all periods are averaged and extreme values are disregarded (i.e., outliers are dealt with appropriately), the resulting values give the seasonal index, whose mean is 100%.

The second method for calculating a seasonal trend is the percentage trend or ratio-to-trend method. In this method, data are subdivided into appropriate periods and the data of each subdivision are expressed as a percentage of the trend value of the series. The values for all periods are averaged, yielding the seasonal index whose mean should be 100%.

The third method is the percentage moving average or ratio-to-moving-average method where, for appropriate periods of the data, a period moving average is computed. From the period moving average, a two-subdivision moving average is computed (called a centered moving average). The required index is produced by expressing the original data as percentages of the centered moving average. Finally, if the original subdivision data are divided by their corresponding seasonal index number, they are deseasonalized or adjusted for seasonal variation.
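A minimal sketch of the first (average percentage) method in Python/NumPy follows; the function names are illustrative, and the trimming of extreme values described above is omitted for brevity.

```python
import numpy as np

def seasonal_index(series, period=12):
    """Average percentage method: express each observation as a percentage of
    its period's average, then average corresponding subdivisions across
    periods and rescale so the index has a mean of 100%."""
    x = np.asarray(series, dtype=float)
    n = (len(x) // period) * period           # drop an incomplete final period
    pct = x[:n].reshape(-1, period)
    pct = pct / pct.mean(axis=1, keepdims=True) * 100.0
    index = pct.mean(axis=0)                  # average across periods
    return index * (100.0 / index.mean())

def deseasonalize(series, period=12):
    """Divide each observation by its seasonal index number."""
    x = np.asarray(series, dtype=float)
    idx = seasonal_index(x, period) / 100.0
    return x / idx[np.arange(len(x)) % period]
```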

7.1.3 Cyclic Movements

Cyclic movements are thought of as alternating swings around the trend curve of a time series. They comprise cycles that may or may not be periodic. Usually, after trend adjustment and deseasonalization, cyclic effects are derived in detail. If data are periodic (in cycles), cyclic indices analogous to seasonal ones are derived (Spiegel and Stephens, 1998).

7.1.4 Irregular or Random Movements

Irregular or random movements do not follow any ordinary pattern or trend and, as such, are characterized by intense variations of short duration. Estimation of irregularities is done after eliminating trend, seasonal, and cyclic movements. Usually, irregularities are of minimum importance to practical problems, but they should be handled carefully as they may represent important information regarding the underlying process of the data examined (random movements are sometimes known as outliers).

7.2 Smoothing Methodologies

Smoothing methodologies are a family of techniques that are employed in time-series analyses to provide reasonably good short-term forecasts. In applying smoothing techniques to data, past values are used to obtain a smoothed value for the time series. This smoothed value is then extrapolated


to forecast the future values for the series. Two major classes of smoothing techniques are discussed here: simple moving averages and exponential smoothing. The first class of methods assigns equal weights to past values of the series to determine the moving average. The second class of methods assigns the most recent data points a larger weight relative to ones from the past. Exponential smoothing methods assume both an underlying pattern in the time series as well as the existence of random effects (irregular movements), and attempt to distinguish between the underlying pattern and the random effects and eliminate the latter (Brockwell and Davis, 1996).

7.2.1 Simple Moving Averages

Time-series data often contain randomness that leads to widely varying predictions. To eliminate the unwanted effect of this randomness, the simple moving averages method takes a set of recent observed values, finds their average, and uses it to predict a value for an upcoming time period. The term moving average implies that when a new observation becomes available, it is available for forecasting immediately (Makridakis et al., 1989). Given a set of observations V_1, V_2, V_3, \ldots, a definition of the moving average of order N is the sequence of arithmetic means given by

\frac{V_1 + V_2 + \cdots + V_N}{N}, \quad \frac{V_2 + V_3 + \cdots + V_{N+1}}{N}, \quad \frac{V_3 + V_4 + \cdots + V_{N+2}}{N}, \ldots   (7.1)

The higher the number of available observations, the better the forecasting effect of the moving averages method. Increasing the number of observations leads to a better forecast value because it tends to minimize the effects of irregular movements that are present in the data. A generalized mathematical representation of the moving averages method is

\hat{V}_{t+1} = S_t = \frac{V_t + V_{t-1} + \cdots + V_{t-N+1}}{N} = \frac{1}{N} \sum_{i=t-N+1}^{t} V_i,   (7.2)

where \hat{V}_{t+1} is the predicted value for the series at time t + 1, S_t is the smoothed value at time t, V_i is the value for the series at time i, i is the time period corresponding to i = t - N + 1, \ldots, t, and N is the number of values (order) included in the moving average estimation. As inferred from Equation 7.2, the method assigns equal weights, and thus equal importance, to the last N observations of the time series. The method of moving averages has two major limitations. The first is the requirement for a large number of past observations in order for the method to provide adequate predictions. The second is that it assigns equal weights to each observation. This second point contradicts the usual empirical evidence that more recent observations should be closer in value to future ones, so they should have an increased

importance in the forecasting process. To address these two limitations, exponential smoothing methods were developed.
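Equation 7.2 translates directly into code; the sketch below (illustrative names) returns the one-step-ahead forecast.

```python
import numpy as np

def moving_average_forecast(values, order):
    """One-step-ahead forecast from Equation 7.2: the unweighted mean of the
    last `order` observations."""
    v = np.asarray(values, dtype=float)
    if v.size < order:
        raise ValueError("series shorter than the moving average order N")
    return v[-order:].mean()

# With V = 10, 12, 11, 13 and N = 3 the forecast is (12 + 11 + 13)/3 = 12.
print(moving_average_forecast([10, 12, 11, 13], order=3))
```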

7.2.2 Exponential Smoothing

Exponential smoothing resembles the moving averages method because it is used to smooth past data to eliminate irregularities. Suppose Equation 7.2 is used to compute the moving average value of a series with only its most recently observed value (period t) available. The observed value in period t - N is approximated with a reasonable estimate such as the predicted value from period t (\hat{V}_t). This result implies that

\hat{V}_{t+1} = \hat{V}_t + \frac{V_t}{N} - \frac{\hat{V}_t}{N} = \frac{1}{N} V_t + \left(1 - \frac{1}{N}\right) \hat{V}_t.   (7.3)

Equation 7.3 implies that the most recently observed value in period t is weighted by 1/N and the forecast for the same period is weighted by (1 - 1/N). Letting k = 1/N, Equation 7.3 becomes

\hat{V}_{t+1} = k V_t + (1 - k) \hat{V}_t,   (7.4)

which is the general form of the equation for exponential smoothing. Equation 7.4 takes care of the first limitation of the moving average method because there is no longer a need to store a large number of past data, since only the most recent datum is used. The value of \hat{V}_t is expanded such that \hat{V}_t = k V_{t-1} + (1 - k) \hat{V}_{t-1}. From Equation 7.4 and its expansion the following is obtained:

\hat{V}_{t+1} = k V_t + (1 - k)\left[k V_{t-1} + (1 - k) \hat{V}_{t-1}\right] = k V_t + k(1 - k) V_{t-1} + (1 - k)^2 \hat{V}_{t-1}.   (7.5)

By recursively substituting for each \hat{V}_i, i = t - 1, t - 2, \ldots, the final equation becomes

\hat{V}_{t+1} = k V_t + k(1 - k) V_{t-1} + k(1 - k)^2 V_{t-2} + k(1 - k)^3 V_{t-3} + \cdots,   (7.6)

from which it is apparent that the more recently observed values are assigned higher weights. This operation relaxes the limitation of the unweighted average used in the moving averages method

(by using exponential smoothing of previous points to forecast future values). Exponential smoothing requires only the most recently observed value of the time series and a value for k. Furthermore, with the use of software and experience, this method is effective for forecasting patterns of past data. In cases of trends or seasonal effects, other methods of smoothing need to be considered. For example, linear exponential smoothing is used when there is a trend in the observed data. In these cases, implementing single exponential smoothing would provide a random disturbance term for each observation. By using this smoothing technique, a smoothed estimate of the trend is

T_t = d\left(S_t - S_{t-1}\right) + (1 - d) T_{t-1},   (7.7)

where S_t is the equivalent of the single exponential smoothed value for period t, T_t is the smoothed trend for period t, and d is the smoothing weight. Equation 7.7 operates much like Equation 7.4 in that the most recent trend has a larger weight than others. Combining these two equations gives

\hat{V}_{t+1} = S_t = k V_t + (1 - k)\left(\hat{V}_t + T_{t-1}\right),   (7.8)

where the additional term T_{t-1} is used to smooth trend effects. Finally, Winters' linear and seasonal exponential smoothing method is similar to linear exponential smoothing but attempts also to account for seasonal movements of past data in addition to trends. The method uses three equations to smooth for randomness, trends, and seasonality, yielding

S_t = k \frac{V_t}{SM_{t-L}} + (1 - k)\left(S_{t-1} + T_{t-1}\right)
T_t = d\left(S_t - S_{t-1}\right) + (1 - d) T_{t-1}
SM_t = y \frac{V_t}{S_t} + (1 - y) SM_{t-L},   (7.9)

where S is the smoothed value of the deseasonalized time series, T is the smoothed value for the trend, SM is the smoothed value for the seasonal factor, and L is the length of seasonality. The third equation in 7.9 provides an index for seasonality effects; this index is the ratio V_t/S_t, the current value over the smoothed value (see Kitagawa and Gerch, 1984). However, irregularity effects are not completely eliminated by accounting for seasonality, and the ratio and the past seasonality value SM_{t-L} are weighted accordingly. The second equation in 7.9 has the same role as in simple linear exponential smoothing. Finally, in the first equation the first term is divided by SM_{t-L} to eliminate the seasonality effects from V_t.

For computational purposes the lagged value SM_{t-L} is used, because without it S_t and SM_t cannot be calculated.

Exponential smoothing, with or without seasonality, has been frequently applied due to the successful application of the progressive decay of the weights of past values. This property is particularly applicable when modeling traffic conditions approaching congestion, since traffic variables typically exhibit extreme peaks, unstable behavior, and rapid fluctuations. In such cases the best-fit model must respond quickly and needs to be less influenced by past values, making exponential smoothing a useful traffic engineering forecasting tool in practice (Williams et al., 1998).
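The recursions in Equations 7.4, 7.7, and 7.8 are equally direct to implement. Below is a minimal Python sketch of simple and linear (trend-corrected) exponential smoothing; the initial values s0 and t0 would come from a fitted trend line, as in Example 7.1, and all names are illustrative.

```python
def simple_exponential_smoothing(values, k):
    """Equation 7.4: V_hat(t+1) = k*V(t) + (1 - k)*V_hat(t)."""
    v_hat = values[0]                    # initialize the forecast at the first value
    for v in values:
        v_hat = k * v + (1.0 - k) * v_hat
    return v_hat                         # one-step-ahead forecast

def linear_exponential_smoothing(values, k, d, s0, t0):
    """Equations 7.7 and 7.8: jointly smooth the level S(t) and trend T(t)."""
    s, trend = s0, t0
    for v in values:
        s_prev = s
        s = k * v + (1.0 - k) * (s_prev + trend)       # Eq. 7.8 (level)
        trend = d * (s - s_prev) + (1.0 - d) * trend   # Eq. 7.7 (trend)
    return s + trend                     # forecast for the next period
```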

Example 7.1

Figure 7.2 presents a classic data set from the time-series literature (Box and Jenkins, 1976). The data are monthly international airline passenger totals from January 1949 to December 1960. Inspection of the plot reveals a trend, a yearly pattern, and an increasing variance. The figure also presents the log transform used to address the increasing variance (as noted in Chapter 4, a log transformation is a variance-stabilizing transformation that can at times be effective in cases of increasing variance).

One may be interested in developing a forecast model for next month's passenger total based on these data. Box and Jenkins developed a model for these data that has become known as the airline model because of its relationship to the airline passenger data. A further illustration of the importance of correcting for the increasing variance is seen when

FIGURE 7.2
Monthly airline passenger data: raw passenger totals and the logged series, January 1949 to December 1960.

comparing the goodness of fit of the airline model to the raw data with the goodness of fit of the airline model to the logged data (Table 7.1).

In a further investigation, an initial value calculation for trend and seasonal effects in the first 3 years of the airline data is estimated (Figure 7.3 and Table 7.2). Because the seasonal effect appears to increase with level, multiplicative seasonal factors are appropriate. In Figure 7.3, a fitted straight line is shown through the first 3 years of data. The intercept of this line is 114 passengers. This value is used for the initial level. The slope of the line is 1.7 passengers, and is used for the initial trend. Table 7.2 presents the calculation of the initial seasonal factors. The average of the first 3 years of data is 145.5. A ratio of observed passenger total to this average is computed for each of the first 3 years of observations. Finally, the three instances of each month are averaged. It so happens that the calculated initial seasonal factors average to exactly 1.0. In cases when multiplicative seasonal factors do not average to 1.0, they should all be divided by the average to obtain an average of 1.0. Additive seasonal factors are calculated in a similar fashion using the difference between observed and average values instead of the ratio. Additive seasonal factors should sum to zero, which is achieved by subtracting an equal allocation of the sum from each of the seasonal factors.
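These initial-value calculations can be reproduced in a few lines of Python/NumPy using the monthly totals of Table 7.2; the exact intercept depends on how the trend line is fitted, so the printed values are approximate.

```python
import numpy as np

# First 3 years of monthly airline passenger totals (Table 7.2).
y = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
              115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140,
              145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166],
             dtype=float)

slope, intercept = np.polyfit(np.arange(len(y)), y, 1)  # initial trend and level
ratios = y / y.mean()                    # actual-to-average ratios (average = 145.5)
factors = ratios.reshape(3, 12).mean(axis=0)  # average the 3 instances of each month
factors = factors / factors.mean()       # force the factors to average to 1.0

print(intercept, slope)                  # close to the 114 and 1.7 used in the text
print(np.round(factors, 2))              # close to the seasonal factors of Table 7.2
```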

TABLE 7.1

Characteristic Measures for Raw and Logged Airline Data

Measure                          Raw Data   Natural Log Transform
Mean square error                135.49     114.85
Root mean square error           11.64      10.71
Mean absolute percentage error   3.18%      2.93%
Mean absolute error              8.97%      8.18%

FIGURE 7.3
First 3 years of airline passenger data: actual data, solid line; fitted (linear trend), dashed line.

7.3 The ARIMA Family of Models

Methods like time-series decomposition or exponential smoothing are used to predict future values of any given variable observed over time. This section focuses on more sophisticated methods for modeling time series whose basic underlying characteristic is autocorrelation (correlation across values in a time series is autocorrelation, whereas correlation in cross-sectional data is simple

TABLE 7.2

First 3 Years of Airline Passenger Data

Month    Airline      Average         Actual-to-      Seasonal
         Passengers   First 3 Years   Average Ratio   Factors
Jan-49   112          145.5           0.77            0.85
Feb-49   118          145.5           0.81            0.90
Mar-49   132          145.5           0.91            1.03
Apr-49   129          145.5           0.89            0.98
May-49   121          145.5           0.83            0.96
Jun-49   135          145.5           0.93            1.06
Jul-49   148          145.5           1.02            1.18
Aug-49   148          145.5           1.02            1.18
Sep-49   136          145.5           0.93            1.10
Oct-49   119          145.5           0.82            0.95
Nov-49   104          145.5           0.71            0.83
Dec-49   118          145.5           0.81            0.97
Jan-50   115          145.5           0.79
Feb-50   126          145.5           0.87
Mar-50   141          145.5           0.97
Apr-50   135          145.5           0.93
May-50   125          145.5           0.86
Jun-50   149          145.5           1.02
Jul-50   170          145.5           1.17
Aug-50   170          145.5           1.17
Sep-50   158          145.5           1.09
Oct-50   133          145.5           0.91
Nov-50   114          145.5           0.78
Dec-50   140          145.5           0.96
Jan-51   145          145.5           1.00
Feb-51   150          145.5           1.03
Mar-51   178          145.5           1.22
Apr-51   163          145.5           1.12
May-51   172          145.5           1.18
Jun-51   178          145.5           1.22
Jul-51   199          145.5           1.37
Aug-51   199          145.5           1.37
Sep-51   184          145.5           1.26
Oct-51   162          145.5           1.11
Nov-51   146          145.5           1.00
Dec-51   166          145.5           1.14


correlation). Before proceeding with the description of the actual models, such as the autoregressive, moving average, or ARMA (autoregressive moving average) models, it is important to discuss three basic characteristics of time-series data used to estimate these models.

The first is the concept of autocorrelation. The basic tool for applying advanced methods of time-series modeling is the autocorrelation coefficient \rho_\tau. An autocorrelation coefficient, similar to the correlation coefficient, describes the association between observations of a particular variable across time periods. Autocorrelation values range from -1 to 1 and provide important insight regarding patterns of the data being examined. If, for example, data are completely random, then autocorrelation values are typically close to 0. It is important to note that a measure similar to autocorrelation is the partial autocorrelation coefficient (\phi_{\tau\tau}), which has interesting characteristics that are helpful in selecting an appropriate forecasting model in practice. Partial autocorrelation is defined as the correlation between two data points X_t and X_{t-\tau}, after removing the effect of the intervening variables X_{t-1}, \ldots, X_{t-\tau+1}. Both the autocorrelation and partial autocorrelation coefficients are used extensively in forecasting.
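The sample analog of these quantities is simple to compute. A sketch (illustrative name, standard lag estimator) in Python/NumPy:

```python
import numpy as np

def autocorrelation(x, lag):
    """Sample autocorrelation: the lag autocovariance divided by the
    variance (the sample counterpart of Equation 7.12 below)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    gamma_0 = np.mean(x * x)
    if lag == 0:
        return 1.0
    gamma_lag = np.sum(x[lag:] * x[:-lag]) / len(x)
    return gamma_lag / gamma_0
```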

A second important concept is stationarity. Time-series models apply to horizontal or, in statistical terms, stationary data only. Thus, before proceeding with modeling a time series the data must be horizontal (without a trend). More formally, assume the existence of a time series X_t, where t is the observation period. The time series is strictly stationary if the joint probability distributions of (X_{t_1}, X_{t_2}, \ldots, X_{t_n}) and (X_{t_1+L}, X_{t_2+L}, \ldots, X_{t_n+L}) are the same for all t_1, t_2, \ldots, t_n and L. For n = 1, the previous definition shows that the univariate distribution of X_{t_1} is the same as that of X_{t_1+L}. Accordingly, E(X_t) = E(X_{t+L}) and VAR(X_t) = VAR(X_{t+L}), implying that the mean \mu and variance \sigma^2 of the time series are constant over time (Shumway and Stoffer, 2000). For n = 2, the joint probability distributions of (X_{t_1}, X_{t_2}) and (X_{t_1+L}, X_{t_2+L}) are the same and have equal covariances:

COV(X_{t_1}, X_{t_2}) = COV(X_{t_1+L}, X_{t_2+L}).   (7.10)

The previous condition depends only on the lag L. The covariance between X_t and X_{t+L} is called the autocovariance \gamma_L and is given by

\gamma_L = COV(X_t, X_{t+L}) = E[(X_t - \mu)(X_{t+L} - \mu)],   (7.11)

with \gamma_0 = VAR(X_t) = \sigma^2. It follows that autocorrelation is expressed mathematically as follows:

\rho_L = \frac{\gamma_L}{\gamma_0},   (7.12)

where \rho_0 = 1 and -1 \le \rho_L \le 1 for all lags L.

A third important concept is differencing. As mentioned previously, one of the requirements of time-series analysis is that the mean of the series is constant over time. Obviously, this assumption is violated when a series exhibits a trend and the usual estimation methods do not apply. In general, a stationary time series has the same statistical behavior at each point in time. Forecasts based on nonstationary series usually exhibit large errors and, as such, series are often differenced to obtain stationarity prior to analysis of the data. The technique of differencing is simple: the first observation is subtracted from the second, the second from the third, and so on. The final outcome has no trend. Formally, the mathematical operator for this process is the differencing operator D. Assume that B is a backshift operator such that

B X_t = X_{t-1}.   (7.13)

For operator B two properties exist:

B^n X_t = X_{t-n}, \qquad B^n B^m X_t = B^{n+m} X_t = X_{t-n-m}.   (7.14)

Here, B \cdot B^{-1} = 1 and, if c is constant, B c = c. Since D X_t = X_t - X_{t-1} = (1 - B) X_t, it is straightforward to show that the differencing operator D is D = 1 - B. This expression is used for first-order differencing, and analogous expressions are derived for higher-order differencing. First-order differencing D is used in removing a linear trend, and second-order differencing D^2 is used for removing a quadratic trend (Shumway and Stoffer, 2000). Furthermore, there may be cases of seasonal variation of time series. In these cases it may be necessary to obtain a seasonal difference, where the differencing operator is denoted as D_S^N, where N is the order of differencing among periods and S is the order of differencing within a period. For example, for monthly data and a yearly period, the differencing operator is D_{12} X_t = (1 - B^{12}) X_t = X_t - X_{t-12}. To achieve stationarity, a combination W_t of the two operators is used:

W_t = D^d D_S^D X_t = (1 - B)^d (1 - B^S)^D X_t.   (7.15)

Several tests for nonstationarity (unit root tests) have been devised. Experiments conducted by Granger and Engle (1984) indicate that the most satisfactory of these tests is the augmented Dickey–Fuller test. In this test, the null hypothesis is that the variable under consideration, say X_t, is not stationary (that is, it requires at least one differencing). The alternative hypothesis is that the time series is already stationary. To test this hypothesis the following regression model is estimated:

\Delta X_t = \phi X_{t-1} + \sum_{j=1}^{p} \theta_j \Delta X_{t-j} + \varepsilon_t,   (7.16)

where the number of lagged difference terms p is the minimum required to render \varepsilon_t white noise (random error). The null hypothesis is that \phi = 0.
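In practice the ADF regression is rarely coded by hand. A sketch using the statsmodels implementation (assuming that library is available) differences a series until the unit-root null is rejected:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def difference_until_stationary(x, alpha=0.01, max_d=2):
    """Apply first differences until the augmented Dickey-Fuller test rejects
    the nonstationarity null at significance level alpha."""
    x = np.asarray(x, dtype=float)
    for d in range(max_d + 1):
        p_value = adfuller(x, autolag="AIC")[1]   # second element is the p-value
        if p_value < alpha:
            return x, d                           # stationary after d differences
        x = np.diff(x)                            # apply D = 1 - B
    return x, max_d
```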

7.3.1 The ARIMA Models

The usual notation for an ARIMA (autoregressive integrated moving average) model is ARIMA(p,d,q), where p refers to the autoregressive order of the model, q to the moving average order, and d to the degree of differencing needed to achieve stationarity. When a series is already stationary, ARIMA models become ARMA models. There are three usual types of time-series models developed in most research: the autoregressive models (AR), the moving average models (MA), and the autoregressive moving average models (ARMA).

In the autoregressive models, the current observation in a series is expressed as a linear function of p previous observations, a constant term, and a disturbance term, and for the present period is expressed as

X_t = k_1 X_{t-1} + k_2 X_{t-2} + \cdots + k_p X_{t-p} + \theta_0 + \varepsilon_t,   (7.17)

where X_t, X_{t-1}, \ldots, X_{t-p} are the observations in periods t, t - 1, \ldots, t - p; p is the number of periods (lags) considered in the development of the model; the k_i are the autoregressive coefficients; \theta_0 is the constant term; and \varepsilon_t is the disturbance for period t (this model is written as AR(p)). Provided the process is stationary, its mean is given as

E(X_t) = \frac{\theta_0}{1 - k_1 - k_2 - \cdots - k_p}   (7.18)

and its variance \gamma_0 as

\gamma_0 = VAR(X_t) = \frac{\sigma^2}{1 - \rho_1 k_1 - \rho_2 k_2 - \cdots - \rho_p k_p},   (7.19)

where \sigma^2 is the variance of the disturbances. Finally, the autocorrelation coefficient is expressed by the following recursive equation:

\rho_\tau = k_1 \rho_{\tau-1} + k_2 \rho_{\tau-2} + \cdots + k_p \rho_{\tau-p}.   (7.20)

Consider two popular cases of the general model presented in Equation 7.17. The AR(1) has only one term,

X_t = k_1 X_{t-1} + \theta_0 + \varepsilon_t,   (7.21)

from which it is inferred that, for the process to be stationary, |k_1| < 1. It can also be inferred that \phi_{11} = \rho_1 and \phi_{\tau\tau} = 0 for \tau > 1. The AR(2) has two terms:

X_t = k_1 X_{t-1} + k_2 X_{t-2} + \theta_0 + \varepsilon_t.   (7.22)

In this case there are three stationarity constraints, |k_2| < 1, k_1 + k_2 < 1, and k_2 - k_1 < 1, while the partial autocorrelations are \phi_{11} = \rho_1, \phi_{22} = (\rho_2 - \rho_1^2)/(1 - \rho_1^2), and \phi_{\tau\tau} = 0 for \tau > 2.

In the moving average models of order q it is assumed that the current observation is the sum of the current and weighted past disturbances as well as a constant term:

X_t = \theta_0 + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},   (7.23)

where X_t is the observation in period t, q is the number of periods (order), the \theta_i are the moving average coefficients, \theta_0 is the constant term, and \varepsilon_t, \ldots, \varepsilon_{t-q} are the random disturbance terms for periods t, t - 1, \ldots, t - q. The mean, variance, autocorrelation coefficient, and partial autocorrelation coefficients for this series are given as

E(X_t) = \theta_0
VAR(X_t) = \sigma^2 \left(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2\right)
\rho_\tau = \frac{\theta_\tau + \theta_1 \theta_{\tau+1} + \cdots + \theta_{q-\tau} \theta_q}{1 + \theta_1^2 + \cdots + \theta_q^2}, \ \tau = 1, \ldots, q; \quad \rho_\tau = 0 \ \text{otherwise}
\phi_{11} = \frac{\theta_1}{1 + \theta_1^2}.   (7.24)

A special case of the previous model is the second-order form (MA(2)),

X_t = \theta_0 + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2},   (7.25)

which has invertibility constraints \theta_1 + \theta_2 < 1, \theta_2 - \theta_1 < 1, and |\theta_2| < 1.

Furthermore, there are time-series models that have both autoregressive and moving average terms. These models are written as ARMA(p,q) and have the following general form:

X_t = \theta_0 + k_1 X_{t-1} + k_2 X_{t-2} + \cdots + k_p X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},   (7.26)

where the notation is consistent with that of Equations 7.17 and 7.23. Stationarity must be ensured to implement the model. As an example, consider the ARMA(3,2) model that combines an autoregressive component of order 3 and a moving average component of order 2,

X_t = \theta_0 + k_1 X_{t-1} + k_2 X_{t-2} + k_3 X_{t-3} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}.   (7.27)

There are more advanced models that belong to the ARIMA family, such as the seasonal ARMA and ARIMA (SARMA and SARIMA) and the General Multiplicative Model. The SARMA(P,Q) and SARIMA(P,D,Q) models are extensions of the ARMA and ARIMA models and include seasonality effects on time series. The General Multiplicative Model is used to account for possible autocorrelation both for seasonal and nonseasonal lags. Its order is (p,d,q) × (P,D,Q)_S. A special case of the model is the ARIMA(0,d,1) × (0,D,1)_S,

W_t = \theta_0 + (1 + \theta_1 B)(1 + \Theta_1 B^S) \varepsilon_t,   (7.28)

where the backward shift operator is B for nonseasonal effects and B^S for seasonal effects. In this model there are nonzero autocorrelations at lags 1, S - 1, S, and S + 1, for which the autocorrelation coefficients are

\rho_1 = \frac{\theta_1}{1 + \theta_1^2}, \quad \rho_S = \frac{\Theta_1}{1 + \Theta_1^2}, \quad \rho_{S-1} = \rho_{S+1} = \rho_1 \rho_S.   (7.29)

For more detailed information regarding the seasonal and multiplicative models, the reader is encouraged to refer to Box and Jenkins (1976) and Abraham and Ledolter (1983).

7.3.2 Estimating ARIMA Models

The Box and Jenkins (1976) approach to estimating models of the general ARIMA family is currently one of the most widely implemented strategies for modeling univariate time-series data. Although the three-stage


approach was originally designed for modeling time series with the ARIMA model, the underlying strategy is applicable to a wide variety of statistical modeling approaches. It provides a logical framework that helps researchers find their way through data uncertainty (Makridakis et al., 1989). The approach consists of three distinct steps. The first step is model identification. The task of identifying the proper model is probably the most difficult one in practice. The values of p and q, the orders of the autoregressive and moving average terms, need to be obtained before applying a model. The selection of p and q is a tedious process because many possible combinations need to be examined. This selection is most frequently done using the autocorrelation and partial autocorrelation functions (ACF and PACF, respectively). It is important to note that before attempting to specify p and q, the data must be stationary, with the values of p and q found by examining the ACF and PACF of the stationary data. The general rules that apply in interpreting these two functions appear in Table 7.3. When the ACF tails off exponentially to 0, the model is AR and its order is determined by the number of significant lags in the PACF. When the PACF tails off exponentially to 0, the model is an MA and its order is determined by the number of statistically significant lags in the ACF. When both the ACF and PACF tail off exponentially to 0, the model is ARMA. Finally, when the data are seasonal, the same procedure is used to estimate the P, D, Q parameters.

The second step is estimation of the model parameters. After selecting a model based on its ACF and PACF, its parameters are estimated by maximizing the corresponding likelihood function. Unfortunately, there are no closed-form expressions for the maximum likelihood estimators of the parameters in the general ARMA(p,q) model. Even in the case of the AR(1) model, the maximum likelihood estimate involves the solution of a cubic equation. Because of the complexity of the MLE procedure for the ARMA(p,q) model, other estimation procedures such as conditional least squares and unconditional least squares have been developed (for details, see Box and Jenkins, 1976; Cryer, 1986; and Abraham and Ledolter, 1983).

The third step is the examination of the disturbances of the fitted models. These errors may be either random or follow a systematic pattern. The first case implies that the selected model has eliminated the underlying patterns from the data and only random errors remain, similar to an ideal linear regression analysis. The second case reveals that underlying patterns in the data were not completely removed and, as such, the disturbances are not random. In a correctly specified model the residuals are uncorrelated, which

TABLE 7.3

ACF and PACF Behavior for ARMA Models

        AR(p)                     MA(q)                     ARMA(p,q)
ACF     Tails off exponentially   Cuts off after lag q      Tails off exponentially
PACF    Cuts off after lag p      Tails off exponentially   Tails off exponentially


is reflected by the absence of significant values in the sample ACF. Furthermore, an overall measure of time-series model adequacy is provided by the Ljung and Box (1978) statistic (also known as the modified Box–Pierce statistic). If the residuals are random (that is, if the Ljung–Box statistic is not significant), the model is suitable for forecasting.
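The three stages map naturally onto standard software. The following sketch uses statsmodels (assuming it is available; the file name and the ARIMA order are illustrative):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.stats.diagnostic import acorr_ljungbox

x = np.loadtxt("series.txt")               # hypothetical univariate series

# Step 1, identification: inspect the ACF/PACF of the differenced data.
print(acf(np.diff(x), nlags=16))
print(pacf(np.diff(x), nlags=16))

# Step 2, estimation: fit a candidate model, here ARIMA(1, 1, 1).
fit = ARIMA(x, order=(1, 1, 1)).fit()
print(fit.summary())

# Step 3, diagnostic checking: Ljung-Box test on the residuals.
print(acorr_ljungbox(fit.resid, lags=[12]))
```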

7.4 Nonlinear Time-Series Models

Until recently, linear time-series approaches, and especially ARMA processes, dominated the field of time-series modeling. However, in many cases both theory and data support a nonlinear model structure. There is ample evidence that many classical data sets, such as sunspot (De Groot and Wurtz, 1991), financial time-series (Engle, 1982), and transportation data (Dougherty, 1995), arise from nonlinear processes. To a large extent, transportation systems represent highly nonlinear problems (Dougherty, 1995). The probable existence of nonstandard phenomena such as non-normality, asymmetric cycles, bimodality, nonlinear relationships between lagged variables, variation of prediction performance over the series, and sensitivity to initial conditions cannot be adequately modeled by the conventional linear ARMA processes (Tong, 1990).

The rationale for concentrating on nonlinear models for univariate time-series analysis is that an initial consideration of the physics involved in the data-generating mechanism may lead to the adoption of a nonlinear model for describing the data. In these cases it is expected that a nonlinear model will describe the data better than a simple linear process (Tjøstheim, 1994). The construction of a nonlinear model consists of modeling the mean and variance of the process given past information on X_t (Tjøstheim, 1994). This modeling is achieved using either parametric or nonparametric approaches. The parametric nonlinear models, briefly reviewed in this section, are divided into four main categories (Tjøstheim, 1994).

7.4.1 Conditional Mean Models

Conditional mean models start by noting that in Gaussian processes the conditional mean is linear. For the first-order model,

X_t = f(X_{t-1}, \theta) + \varepsilon_t   (7.30)

and it follows that

E(X_{T+1} \mid X_T = x) = f(x, \theta),   (7.31)

where \theta is an unknown parameter vector to be estimated. Selecting different functional forms for f in Equation 7.31 results in some well-known nonlinear models. Consider, for example, the threshold AR (autoregressive) model,

f(x, \theta) = \theta_1 x \, l(x \le k) + \theta_2 x \, l(x > k),   (7.32)

where l(\cdot) is the indicator function and k is a threshold. The threshold AR models are very useful as they provide a way of combining two AR processes. Other functional forms include the exponential AR,

f(x, \theta) = \left[\theta_1 + \theta_2 \, EXP(-\theta_3 x^2)\right] x,   (7.33)

where \theta_3 > 0, and the logistic AR model,

f(x, \theta) = \theta_1 x + \theta_2 x \left[1 + EXP(-\theta_3 (x - \theta_4))\right]^{-1},   (7.34)

where \theta_3 > 0. An even less restrictive model than the ones presented previously is formulated as

X_t = a^T X_{t-1} + \sum_{i=1}^{k} b_i \, \phi_i\left(\gamma_i^T X_{t-1}\right) + \varepsilon_t,   (7.35)

where a^T and the \gamma_i^T are parameter vectors and the \phi_i are functions known as squashing, transition, or transfer functions (usually, these functions are logistic or sigmoidal; Bishop, 1995). These models are commonly known as neural networks and have been applied with significant success to a variety of transportation applications (for an extensive review of neural network applications in transport, see Dougherty, 1995). Among the advantages of these models are that there is no need for initial estimation of the parameters of the models and that they exhibit great robustness when approximating arbitrary functions (Hornik et al., 1989).
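To make the threshold idea concrete, the sketch below simulates the two-regime threshold AR model of Equation 7.32; the parameter values are illustrative.

```python
import numpy as np

def simulate_threshold_ar(n, theta1=0.8, theta2=-0.4, k=0.0, seed=0):
    """Threshold AR(1): the autoregressive coefficient switches according to
    whether the previous value lies below or above the threshold k."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        coef = theta1 if x[t - 1] <= k else theta2
        x[t] = coef * x[t - 1] + rng.standard_normal()
    return x

series = simulate_threshold_ar(500)
```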

7.4.2 Conditional Variance Models

With conditional variance models the variance is usually an expression of risk or volatility in economic time series, and it is frequently observed that for larger values of a variable X_t there is also a larger fluctuation than for lower values of X_t. To model such data, Engle (1982) introduced the autoregressive conditionally heteroscedastic (ARCH) models and Bollerslev (1986) the generalized autoregressive conditionally heteroscedastic models (GARCH). The characteristic of these models is that they can deal with changes in the variability of a time series, contrary to ARIMA models, which assume constant variability. The simplest ARCH(1) model is given as (Shumway and Stoffer, 2000)

X_t = \sigma_t \varepsilon_t, \quad \sigma_t^2 = a_0 + a_1 X_{t-1}^2,   (7.36)

where \varepsilon_t is the white noise error term distributed as N(0,1). In the case of the ARCH(1) model it is inferred that X_t \mid X_{t-1} is distributed as N(0, a_0 + a_1 X_{t-1}^2). It can also be shown that the ARCH(1) model is transformed to a non-Gaussian AR(1) model using

X_t^2 = a_0 + a_1 X_{t-1}^2 + u_t,   (7.37)

where u_t = \sigma_t^2 (\varepsilon_t^2 - 1). Among the properties of the ARCH model are the following (see Shumway and Stoffer, 2000, for proofs and more details): E(X_t) = 0, E(u_t) = 0, and COV(u_{t+h}, u_t) = 0 for h > 0. The ARCH(1) is straightforwardly extended to the ARCH(m) model. The GARCH model is an extension of the ARCH model, and is described as (Bollerslev, 1986)

X_t = \sigma_t \varepsilon_t, \quad \sigma_t^2 = a_0 + \sum_{j=1}^{m} a_j X_{t-j}^2 + \sum_{j=1}^{r} b_j \sigma_{t-j}^2.   (7.38)

For example, the GARCH(1,1) is given as follows:

X_t = \sigma_t \varepsilon_t, \quad \sigma_t^2 = a_0 + a_1 X_{t-1}^2 + b_1 \sigma_{t-1}^2.   (7.39)

There are other variations of the ARCH model, such as the EARCH (exponential ARCH) model and the IARCH (integrated ARCH) model. Modeling the conditional mean and variance can also be achieved using nonparametric approaches. One of the most commonly used nonparametric approaches for estimating the conditional mean and variance is the kernel estimator (Tjøstheim, 1994). Nonparametric approaches, and especially the kernel estimators and the special class of k-nearest neighbor approaches, have found vast applicability in the transportation area (Smith et al., 2002; Kharoufeh and Goulias, 2002). These approaches generally assume that the largest part of the knowledge about the relationship lies in the data rather than with the person developing the model (Eubank, 1988).
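A simulation sketch of the GARCH(1,1) recursion of Equation 7.39 follows (illustrative coefficients chosen so that a1 + b1 < 1; setting b1 = 0 gives the ARCH(1) model of Equation 7.36):

```python
import numpy as np

def simulate_garch11(n, a0=0.1, a1=0.15, b1=0.8, seed=1):
    """GARCH(1,1): X_t = sigma_t * eps_t with
    sigma_t^2 = a0 + a1 * X_{t-1}^2 + b1 * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    sigma2 = a0 / (1.0 - a1 - b1)       # start at the unconditional variance
    for t in range(1, n):
        sigma2 = a0 + a1 * x[t - 1] ** 2 + b1 * sigma2
        x[t] = np.sqrt(sigma2) * rng.standard_normal()
    return x
```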


7.4.3 Mixed Models

The general form of a mixed model that contains both a nonlinear mean and a nonlinear variance is

X_t = f(X_{t-1}, \ldots, X_{t-p}; \theta_1) + g(X_{t-1}, \ldots, X_{t-p}; \theta_2)\, \varepsilon_t,   (7.40)

where g > 0 and f and g are known functions, so that

E(X_t \mid X_{t-1} = x_1, \ldots, X_{t-p} = x_p) = f(x_1, \ldots, x_p; \theta_1)
VAR(X_t \mid X_{t-1} = x_1, \ldots, X_{t-p} = x_p) = g^2(x_1, \ldots, x_p; \theta_2)\, VAR(\varepsilon_t),   (7.41)

and \theta_1 and \theta_2 are parameter vectors. A simple form containing the conditional mean and variance is the bilinear model, which is written as

X_t = \theta_1 + \theta_2 X_{t-1} + \theta_3 \varepsilon_{t-1} + \theta_4 X_{t-1} \varepsilon_{t-1} + \varepsilon_t.   (7.42)

Despite their flexible nature, bilinear models cannot be readily applied in practice because, in general, an explicit expression for them cannot be computed; furthermore, for a given conditional mean and variance it is difficult to decide on the appropriate type of bilinear model.

7.4.4 Regime Models

Regime models refer to models that attempt to describe stochastic changes in the nonlinear model structure. A straightforward example is the case when the autoregressive parameter in an AR(1) process is replaced by a Markov chain a_t. Tjøstheim (1986a) suggests that by taking two values a_1 and a_2, the model results in a simple doubly stochastic model

X_t = a_t X_{t-1} + \varepsilon_t,   (7.43)

where the series X_t alternates between the two AR processes

X_t = a_1 X_{t-1} + \varepsilon_t \quad \text{and} \quad X_t = a_2 X_{t-1} + \varepsilon_t,   (7.44)

and the change is regulated by the transition probability of the hidden Markov chain a_t. A special case occurs when a_t is independent and identically distributed, resulting in a random coefficients autoregressive model (Nicholls and Quinn, 1982).


7.5 Multivariate Time-Series Models

The models discussed in the previous sections are univariate in nature in that they are used to forecast only one variable at a time. However, in the long run, a single variable may not be appropriate because the phenomenon under study may be influenced by other external events. In general, there are three approaches for modeling and forecasting multivariate time series: the transfer function models, the vector autoregressive models, and the state-space (Kalman filter) models. Kalman filters are the most general of all multivariate models and are briefly reviewed here.

Before proceeding, it is important to note that the state-space model and the more widely known Kalman filter model refer to the same basic underlying model. The term state-space refers to the model and the term Kalman filter refers to the estimation of the state. Most times, the process of estimating a state-space model begins by estimating an ARMA equivalent to capture the statistically significant AR and MA dimensions. However, the state-space approach has advantages over the more widely used family of ARIMA models (see Durbin, 2000). First, it is based on the analysis of the model structure and, as such, the various components that make up the series (trend, seasonal variation, cycle, etc.) together with the effects of explanatory variables and interventions are modeled separately before being integrated in the final model. Second, it is flexible in that changes in the structure of the modeled system over time are explicitly modeled. Third, multivariate observations are handled as straightforward extensions of univariate theory, unlike ARIMA models. Fourth, it is rather simple to deal with missing observations. Fifth, the Markovian nature of state-space models enables increasingly large models to be handled effectively without disproportionate increases in computational effort. And, finally, it is straightforward to predict future values by projecting the Kalman filter forward into the future and obtaining their estimated values and standard errors.

On the other hand, the mathematical theory of state-space is quite complex. To understand it fully on a mathematical level requires a good background in probability theory, matrix theory, multivariate statistics, and the theory of Hilbert space. However, the state-space model is so useful, especially for multivariate time-series processes, that it deserves wider attention in the transportation literature. Readers interested in the finer mathematical details of state-space modeling should refer to Shumway and Stoffer (2000) and Durbin (2000), and to Whittaker et al. (1997) for transportation applications.

The concept of the state of a system emerged from physical science. In that field, the state consists of a vector that includes all the information about the system that carries over into the future. That is, the state vector consists of a linearly independent set of linear combinations from the past that are correlated with future endogenous variables. To construct such a vector, the simplest and most straightforward way is through the one-step predictor, then the two-step predictor, and so forth, so long as the predictors extracted in this way are linearly independent. The set of predictors extracted in this way is, by construction, a state vector. They contain all the relevant information from the past, as all predictors of the future, for any time horizon, are linearly dependent on them. It is also a minimal set, as they are linearly independent. The space spanned by the selected predictors is called the predictor space. The state-space model for an r-variate (multivariate) time series is written

y_t = M_t x_t + D z_t + u_t
x_t = \Phi x_{t-1} + w_t, \quad t = 1, \ldots, n.   (7.45)

These two equations are called the observation equation and the state equation, respectively, where x_t, a p-vector, is the (unobserved) state of the process at time t. The evolution of x_t is governed by the state equation: the state at time t evolves by autoregression on the previous state and is further modified by the innovation term w_t. The sequence w_t, t = 1, \ldots, n consists of independent and identically distributed random p-vectors. The observed data vector, y_t, is a possibly time-dependent linear transformation of x_t with the addition of a regression on the k-vector z_t of known exogenous regressors and observational noise modeled by u_t. The estimation of Equation 7.45 is done by maximum likelihood (Shumway and Stoffer, 2000).
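For concreteness, a minimal Kalman filter recursion for Equation 7.45 is sketched below, with time-invariant matrices and without the exogenous term D z_t; all names are illustrative and the maximum likelihood estimation of the system matrices is omitted.

```python
import numpy as np

def kalman_filter(y, M, Phi, Q, R, x0, P0):
    """Forward recursions for y_t = M x_t + u_t, x_t = Phi x_{t-1} + w_t,
    with VAR(w_t) = Q and VAR(u_t) = R; returns the filtered states."""
    x, P = x0, P0
    states = []
    for yt in y:
        # Prediction step: propagate the state and its covariance.
        x = Phi @ x
        P = Phi @ P @ Phi.T + Q
        # Update step: correct with the new observation.
        S = M @ P @ M.T + R                 # innovation covariance
        K = P @ M.T @ np.linalg.inv(S)      # Kalman gain
        x = x + K @ (yt - M @ x)
        P = P - K @ M @ P
        states.append(x)
    return np.array(states)
```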

7.6 Measures of Forecasting Accuracy

Because the basic goal when analyzing time series is to produce forecasts of some future value of a variable, it is imperative that its predictive accuracy be determined. Usually, accuracy examines how well the model reproduces the already known data (goodness of fit) (Makridakis et al., 1983). An accuracy measure is often defined in terms of the forecasting error, which is the difference of the actual and the predicted values of the series (\varepsilon_t = X_t - F_t, where F_t is the forecast of X_t). Some measures of accuracy that are used with most time-series models are shown in Table 7.4. Many of these measures can also be used for other types of models, including linear regression, count models, etc.

The first four performance measures are absolute measures and are of limited value when used to compare different time series. The mean square error is the most frequently used measure in the literature. However, its value is often questioned in evaluating the relative forecasting accuracy between different data sets, because it is dependent on the scale of measurement and is sensitive to outliers (Clemens and Hendry, 1993; Chatfield, 1992). In transportation modeling, commonly used measures for evaluating the accuracy of the

forecasting models are the mean square error (MSE), the mean absolute deviation (MAD), and the mean absolute percent error (MAPE). Much of the time-series literature related to transportation problems has demonstrated the need to use absolute percent error as a basis for comparison in order to eliminate the effect of the variability observed in most transportation data sets.
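The measures of Table 7.4 reduce to a few lines of Python/NumPy; the function name is illustrative.

```python
import numpy as np

def accuracy_measures(actual, forecast):
    """Error-based accuracy measures of Table 7.4."""
    actual = np.asarray(actual, dtype=float)
    e = actual - np.asarray(forecast, dtype=float)   # forecasting errors
    pe = e / actual * 100.0                          # percentage errors
    return {
        "ME": e.mean(),
        "MAD": np.abs(e).mean(),
        "SSE": np.sum(e ** 2),
        "MSE": np.mean(e ** 2),
        "RMSE": np.sqrt(np.mean(e ** 2)),
        "MPE": pe.mean(),
        "MAPE": np.abs(pe).mean(),
    }
```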

Example 7.2

Toward meeting the goal of reducing congestion, decreasing travel times, and giving drivers the ability to make better route and departure time choices, a dynamic traffic map and advanced traveler information system

TABLE 7.4

Some Measures of Accuracy Used to Evaluate Time-Series Models

Measure                          Corresponding Equation
Mean error                       ME = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i
Mean absolute deviation          MAD = \frac{1}{n} \sum_{i=1}^{n} |\varepsilon_i|
Sum of square errors             SSE = \sum_{i=1}^{n} \varepsilon_i^2
Mean square error                MSE = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i^2
Root mean square error           RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \varepsilon_i^2}
Standard deviation of errors     SDE = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \varepsilon_i^2}
Percentage error                 PE_t = \frac{X_t - F_t}{X_t} \times 100\%
Mean percentage error            MPE = \frac{1}{n} \sum_{i=1}^{n} PE_i
Mean absolute percentage error   MAPE = \frac{1}{n} \sum_{i=1}^{n} |PE_i|


for the central Athens, Greece, area has been operating since 1996 (NTUA, 1996). As a first attempt to meet driver needs, travel time information is given over the Internet, updated every 15 min. Speeds and travel times along 17 major urban signalized arterials are estimated using flow (volume) and occupancy data collected directly from the controllers installed at 144 locations in the downtown urban network (for more information, see Stathopoulos and Karlaftis, 2001a, 2002).

The data set available is sizable. In an effort to reduce the size, a new data set was created containing flow (volume) measurements for the period between January 3, 2000 and May 31, 2000 for the afternoon peak period (17:30 to 20:00); note that weekend and holiday data were excluded. This approach yielded a data set for 106 days containing approximately 7300 3-min flow measurements to be used in the analysis.

As depicted in Figure 7.4, the ACF (for loop detector 101) demonstrates considerably slow decay, indicating a lack of stationarity. Interestingly, using only first-order differencing yields an ACF and a PACF that tail off exponentially (Figure 7.5). As indicated previously, series whose ACF and PACF both tail off exponentially indicate a combined ARMA process. Nevertheless, because of the importance of correct differencing in the development of robust ARMA models, stationarity was also tested using the augmented Dickey–Fuller (ADF) test. At the 99%

FIGURE 7.4
Autocorrelation (top) and partial autocorrelation (bottom) functions for the raw data (lags 1 to 16).

significance level, the null hypothesis of nonstationarity was rejected after first differencing was performed. Finally, Table 7.5 presents the forecasting accuracy results from the application of a variety of different approaches to these data, including exponential smoothing, AR, MA, and ARMA models (all models were estimated on the differenced data). The success of these approaches for one-step-ahead forecasting was evaluated with the MAPE measure.

FIGURE 7.5
Autocorrelation (top) and partial autocorrelation (bottom) functions after first differencing.

TABLE 7.5

Results Comparison for Different Time-Series Approaches

                        Mean Absolute Error   Mean Absolute Percent
                        (veh/3 min)           Error (%)
Exponential smoothing   14                    21.43
AR(1)                   12                    18.73
MA(1)                   14                    22.22
ARMA(1,1)               12                    17.87
AR(2)                   12                    18.30
MA(2)                   13                    20.65
ARMA(2,2)               13                    17.45


8
Latent Variable Models

This chapter presents tools for illuminating structure in data in the presenceof measurement difficulties, endogeneity, and unobservable or latent vari-ables. Structure in data refers to relationships between variables in the data,including direct or causal relationships, indirect or mediated relationships,associations, and the role of errors of measurement in the models. Mea-surement difficulties include challenges encountered when an unobserv-able or latent construct is indirectly measured through exogenousobservable variables. Examples of latent variables in transportation includeattitudes toward transportation policies or programs such as gas taxes, vanpooling, high-occupancy vehicle lanes, and mass transit. Interest mightalso be centered on the role of education, socioeconomic status, or attitudeson driver decision making — which are also difficult to measure directly.As discussed in previous chapters, endogeneity refers to variables whosevalues are largely determined by factors or variables included in the sta-tistical model or system.

In many analyses, the initial steps attempt to uncover structure in data that can then be used to formulate and specify statistical models. These situations arise predominantly in observational settings, when the analyst does not have control over many of the measured variables, or when the study is exploratory and there are no well-articulated theories regarding the structure in the data. There are several approaches to uncovering data structure. Principal components analysis is widely used as an exploratory method for revealing structure in data. Factor analysis, a close relative of principal components analysis, is a statistical approach for examining the underlying structure in multivariate data. And structural equation models refer to a formal modeling framework developed specifically for dealing with unobservable or latent variables, endogeneity among variables, and the complex underlying data structures of social phenomena often entwined in transportation applications.


8.1 Principal Components Analysis

Principal components analysis has two primary objectives: to reduce a relatively large multivariate data set, and to interpret data (Johnson and Wichern, 1996). Principal components analysis "explains" the variance–covariance structure using a few linear combinations of the originally measured variables. Through this process a more parsimonious description of the data is provided, reducing or explaining the variance of many variables with fewer, well-chosen combinations of variables. If, for example, a large proportion (70 to 90%) of the total population variance is attributed to a few uncorrelated principal components, then these components can replace the original variables without much loss of information and also describe different dimensions in the data. Principal components analysis relies on the correlation matrix of variables, so the method is suitable for variables measured on the interval and ratio scales.

If the original variables are uncorrelated, then principal components analysis accomplishes nothing. Thus, for example, experimental data with randomized treatments are not well suited to principal components analysis. However, observational data, which typically contain a large number of correlated variables, are ideally suited for principal components analysis. If it is found that the variance in 20 or 30 original variables is described adequately with four or five principal components (dimensions), then principal components analysis will have succeeded.

Principal components analysis begins by noting that n observations, each with p variables or measurements upon them, are expressed in an n × p matrix X:

$$\mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}. \quad (8.1)$$

Principal components analysis is not a statistical model, and there is no distinction between dependent and independent variables. If the principal components analysis is useful, there are K < n principal components, with the first principal component

$$Z_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p, \quad (8.2)$$

which maximizes the variability across individuals, subject to the constraint

$$a_{11}^2 + a_{12}^2 + \cdots + a_{1p}^2 = 1. \quad (8.3)$$

Thus, VAR[Z1] is maximized given the constraint in Equation 8.3, with the constraint imposed simply because the solution is otherwise indeterminate: one could simply increase one or more of the aij values to increase the variance. A second principal component, Z2, is then sought that maximizes the variability across individuals subject to the constraints that

$$a_{21}^2 + a_{22}^2 + \cdots + a_{2p}^2 = 1$$

and COR[Z1, Z2] = 0. In keeping, a third principal component is added subject to the same constraint on the aij values, with the additional constraint that Z3 be uncorrelated with both Z1 and Z2. Additional principal components are added up to p, the number of variables in the original data set.

The eigenvalues of the sample variance–covariance matrix of X are the variances of the principal components. The corresponding eigenvectors provide the parameters that satisfy Equation 8.3.

Note from Appendix A that the symmetric p × p sample variance–covariance matrix is given as

$$s^2[\mathbf{X}] = \begin{bmatrix} s^2[x_1] & s[x_1, x_2] & \cdots & s[x_1, x_p] \\ s[x_2, x_1] & s^2[x_2] & \cdots & s[x_2, x_p] \\ \vdots & \vdots & \ddots & \vdots \\ s[x_p, x_1] & s[x_p, x_2] & \cdots & s^2[x_p] \end{bmatrix}. \quad (8.4)$$

The diagonal elements of this matrix represent the estimated variances of random variables 1 through p, and the off-diagonal elements represent the estimated covariances between variables.

The sum of the p eigenvalues of the sample variance–covariance matrix is equal to the sum of the diagonal elements of s^2[X], or the sum of the variances of the p variables in matrix X; that is,

$$\lambda_1 + \lambda_2 + \cdots + \lambda_p = VAR[x_1] + VAR[x_2] + \cdots + VAR[x_p]. \quad (8.5)$$

Because the sum of the diagonal elements represents the total sample variance, and the sum of the eigenvalues is equal to the trace of s^2[X], the variance in the principal components accounts for all of the variation in the original data. There are p eigenvalues, and the proportion of total variance explained by the jth principal component is given by

$$\frac{\lambda_j}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}, \quad j = 1, 2, \ldots, p. \quad (8.6)$$

To avoid excessive influence of measurement units, the principal components analysis is carried out using a standardized variance–covariance matrix, or the correlation matrix. The correlation matrix is simply the variance–covariance matrix obtained by using the standardized variables instead of the original variables, such that the standardized values

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{s_j}, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, p,$$

replace the original Xij terms. Because the correlation matrix is often used, variables used in principal components analysis are restricted to interval and ratio scales unless corrections are made. Using the correlation matrix, the sum of the diagonal terms, and hence the sum of the eigenvalues, is equal to p (the number of variables).

With the basic statistical mechanics in place, the basic steps in principal components analysis are as follows (see Manly, 1986): standardize all observed variables in the X matrix; calculate the variance–covariance matrix, which is the correlation matrix after standardization; determine the eigenvalues and corresponding eigenvectors of the correlation matrix (the parameters of the ith principal component are given by the eigenvector, whereas the variance is given by the eigenvalue); and discard any components that account for a relatively small proportion of the variation in the data.
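A minimal sketch of these four steps, assuming numpy and a placeholder data matrix in place of the actual survey data:

```python
# Illustrative sketch (not from the text) of the PCA steps listed above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(281, 23))          # placeholder for the survey data

# Step 1: standardize each variable.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: the variance-covariance matrix of the standardized variables is
# the correlation matrix; its trace equals p.
R = np.corrcoef(Z, rowvar=False)

# Step 3: eigenvalues (component variances) and eigenvectors (component
# parameters), sorted from largest to smallest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Equation 8.6: proportion of total variance explained by each component.
prop = eigvals / eigvals.sum()
print(np.cumsum(prop)[:10])             # compare with Figure 8.1

# Step 4: retain only components explaining a nontrivial share of variance
# (the 5% cutoff here is an arbitrary illustration).
keep = eigvecs[:, prop > 0.05]
scores = Z @ keep                       # principal component scores
```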

The intent of principal components analysis is to illuminate underlying commonality, or structure, in the data. In exploratory studies relying on observational data, it is often the case that some variables measure the same underlying construct, and this underlying construct is what is being sought in principal components analysis. The goal is to reduce the number of potential explanatory variables and gain insight regarding what underlying dimensions have been captured in the data.

Example 8.1

A survey of 281 commuters was conducted in the Seattle metropolitan area. The intent of the survey was to gather information on commuters' opinions or attitudes regarding high-occupancy vehicle (HOV) lanes (lanes that are restricted for use by vehicles with two or more occupants). Commuters were asked a series of attitudinal questions regarding HOV lanes and their use, in addition to a number of sociodemographic and commuting behavior questions. A description of the variables is provided in Table 8.1.

It is useful to assess whether a large number of variables, including attitudinal, sociodemographic, and travel behavior variables, can be distilled into a smaller set of variables. To do this, a principal components analysis is performed on the correlation matrix (not the variance–covariance matrix). Figure 8.1 shows a graph of the first ten principal components. The graph shows that the first principal component represents about 19% of the total variance, the second principal component an additional 10%, the third principal component about 8%, the fourth about 7%, and the remaining principal components about 5% each. Ten principal components account for about 74% of the variance, and six principal components account for about 55% of the variance contained in the 23 variables


TABLE 8.1
Variables Collected during HOV Lane Survey, Seattle, WA

Variable Abbreviation   Variable Description

MODE          Usual mode of travel: 0 if drive alone, 1 if two person carpool, 2 if three or more person carpool, 3 if van pool, 4 if bus, 5 if bicycle or walk, 6 if motorcycle, 7 if other
HOVUSE        Have used HOV lanes: 1 if yes, 0 if no
HOVMODE       If used HOV lanes, what mode is most often used: 0 in a bus, 1 in two person carpool, 2 in three or more person carpool, 3 in van pool, 4 alone in vehicle, 5 on motorcycle
HOVDECLINE    Sometimes eligible for HOV lane use but do not use: 1 if yes, 0 if no
HOVDECREAS    Reason for not using HOV lanes when eligible: 0 if slower than regular lanes, 1 if too much trouble to change lanes, 2 if HOV lanes are not safe, 3 if traffic moves fast enough, 4 if forget to use HOV lanes, 5 if other
MODE1YR       Usual mode of travel 1 year ago: 0 if drive alone, 1 if two person carpool, 2 if three or more person carpool, 3 if van pool, 4 if bus, 5 if bicycle or walk, 6 if motorcycle, 7 if other
COM1YR        Commuted to work in Seattle a year ago: 1 if yes, 0 if no
FLEXSTAR      Have flexible work start times: 1 if yes, 0 if no
CHNGDEPTM     Changed departure times to work in the last year: 1 if yes, 0 if no
MINERLYWRK    On average, number of minutes leaving earlier for work relative to last year
MINLTWRK      On average, number of minutes leaving later for work relative to last year
DEPCHNGREAS   If changed departure times to work in the last year, reason: 0 if change in travel mode, 1 if increasing traffic congestion, 2 if change in work start time, 3 if presence of HOV lanes, 4 if change in residence, 5 if change in lifestyle, 6 if other
CHNGRTE       Changed route to work in the last year: 1 if yes, 0 if no
CHNGRTEREAS   If changed route to work in the last year, reason: 0 if change in travel mode, 1 if increasing traffic congestion, 2 if change in work start time, 3 if presence of HOV lanes, 4 if change in residence, 5 if change in lifestyle, 6 if other
I90CM         Usually commute to or from work on Interstate 90: 1 if yes, 0 if no
I90CMT1YR     Usually commuted to or from work on Interstate 90 last year: 1 if yes, 0 if no
HOVPST5       On your past five commutes to work, how often have you used HOV lanes
DAPST5        On your past five commutes to work, how often did you drive alone
CRPPST5       On your past five commutes to work, how often did you carpool with one other person
CRPPST52MR    On your past five commutes to work, how often did you carpool with two or more people
VNPPST5       On your past five commutes to work, how often did you take a van pool
BUSPST5       On your past five commutes to work, how often did you take a bus
NONMOTPST5    On your past five commutes to work, how often did you bicycle or walk
MOTPST5       On your past five commutes to work, how often did you take a motorcycle
OTHPST5       On your past five commutes to work, how often did you take a mode other than those listed in variables 18 through 24
(continued)


that were used in the principal components analysis. Thus, there is some evidence that some variables, at least, are explaining similar dimensions of the underlying phenomenon.

Table 8.2 shows the variable parameters for the six principal components. For example, the first principal component is given by

$$\begin{aligned} Z_1 = {} & -0.380(\text{HOVPST5}) + 0.396(\text{DAPST5}) - 0.303(\text{CRPPST5}) \\ & - 0.109(\text{CRPPST52MR}) - 0.161(\text{BUSPST5}) - 0.325(\text{HOVSAVTIME}) \\ & + 0.364(\text{HOVOPN}) - 0.339(\text{GPTOHOV}) + 0.117(\text{GEND}). \end{aligned}$$

TABLE 8.1 (CONTINUED)
Variables Collected during HOV Lane Survey, Seattle, WA

Variable Abbreviation   Variable Description

CHGRTEPST5    On your past five commutes to work, how often have you changed route or departure time
HOVSAVTIME    HOV lanes save all commuters time: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
HOVADUSE      Existing HOV lanes are being adequately used: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
HOVOPN        HOV lanes should be open to all traffic: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
GPTOHOV       Converting some regular lanes to HOV lanes is a good idea: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
GTTOHOV2      Converting some regular lanes to HOV lanes is a good idea only if it is done before traffic congestion becomes serious: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
GEND          Gender: 1 if male, 0 if female
AGE           Age in years: 0 if under 21, 1 if 22 to 30, 2 if 31 to 40, 3 if 41 to 50, 4 if 51 to 64, 5 if 65 or older
HHINCM        Annual household income (U.S. dollars): 0 if no income, 1 if 1 to 9,999, 2 if 10,000 to 19,999, 3 if 20,000 to 29,999, 4 if 30,000 to 39,999, 5 if 40,000 to 49,999, 6 if 50,000 to 74,999, 7 if 75,000 to 100,000, 8 if over 100,000
EDUC          Highest level of education: 0 if did not finish high school, 1 if high school, 2 if community college or trade school, 3 if college/university, 4 if post college graduate degree
FAMSIZ        Number of household members
NUMADLT       Number of adults in household (aged 16 or older)
NUMWRKS       Number of household members working outside the home
NUMCARS       Number of licensed motor vehicles in the household
ZIPWRK        Postal zip code of workplace
ZIPHM         Postal zip code of home
HOVCMNT       Type of survey comment left by respondent regarding opinions on HOV lanes: 0 if no comment on HOV lanes, 1 if comment not in favor of HOV lanes, 2 if comment positive toward HOV lanes but critical of HOV lane policies, 3 if comment positive toward HOV lanes, 4 if neutral HOV lane comment


All of the variables had estimated parameters (or loadings). However, parameters less than 0.1 were omitted from Table 8.2 because of their relatively small magnitude. The first principal component loaded strongly on travel behavior variables and HOV attitude variables. In addition, Z1 increases with decreases in any non-drive-alone travel variables (HOV, carpool, bus), increases with decreases in pro-HOV attitudes, and increases for males. By analyzing the principal components in this way, some of the relationships between variables are better understood.

8.2 Factor Analysis

Factor analysis is a close relative of principal components analysis. It was developed early in the 20th century by Karl Pearson and Charles Spearman with the intent to gain insight into psychometric measurements, specifically the directly unobservable variable intelligence (Johnson and Wichern, 1992). The aim of the analysis is to reduce the number of p variables to a smaller set of parsimonious K < p variables. The objective is to describe the covariance among many variables in terms of a few unobservable factors. There is one important difference, however, between principal components and factor analysis: factor analysis is based on a specific statistical model, whereas principal components analysis is not. As was the case with principal components analysis, factor analysis relies on the correlation matrix, and so factor analysis is suitable for variables measured on interval and ratio scales.

Just as for other statistical models, there should be a theoretical rationale for conducting a factor analysis (Pedhazur and Pedhazur-Schmelkin, 1991).

FIGURE 8.1  Cumulative proportion of variance explained by ten principal components: HOV lane survey data.

[Figure 8.1 plots the cumulative proportion of variance explained: Comp. 1, 0.188; Comp. 2, 0.295; Comp. 3, 0.370; Comp. 4, 0.436; Comp. 5, 0.494; Comp. 6, 0.549; Comp. 7, 0.599; Comp. 8, 0.648; Comp. 9, 0.695; Comp. 10, 0.738.]

TABLE 8.2
Factor Loadings of Principal Components Analysis: HOV Lane Survey Data

Variable       Loadings on Comp. 1 through Comp. 6

Travel Behavior Variables
HOVPST5        –0.380  –0.284  0.236
DAPST5         0.396  0.274  –0.283  0.128
CRPPST5        –0.303  –0.223  0.240  0.282  0.221
CRPPST52MR     –0.109  0.167  0.196  –0.107
VNPPST5        –0.146
BUSPST5        –0.161  –0.140  –0.227  0.112  –0.514  –0.395
NONMOTPST5     0.471
MOTPST5        0.104  0.381  –0.418
CHGRTEPST5     0.525  –0.302

HOV Attitude Variables
HOVSAVTIME     –0.325  0.301  –0.140
HOVADUSE       –0.321  0.227  –0.133
HOVOPN         0.364  –0.216  0.210
GPTOHOV        –0.339  0.125  0.230  –0.115
GTTOHOV2       –0.260  0.245  –0.153

Sociodemographic Variables
GEND           0.117  0.388  0.180  –0.199
AGE            0.268  0.341  –0.363  –0.270
HHINCM         0.304  0.131  0.489  0.101
EDUC           0.188  0.247  0.443  0.247
FAMSIZ         0.429  –0.122
NUMADLT        0.516  –0.188  –0.128  –0.133
NUMWRKS        0.451  –0.242  –0.137
NUMCARS        0.372  –0.106  0.107  –0.268

Note: Loadings < 0.10 shown as blanks; each variable's nonblank loadings are listed in component order.


One should not simply "feed" all variables into a factor analysis with the intention to uncover real dimensions in the data. There should be a theoretically motivated reason to suspect that some variables may be measuring the same underlying phenomenon, with the expectation of examining whether the data support this expected underlying measurement model or process.

The factor analysis model is formulated by expressing the Xi terms as linear functions, such that

$$\begin{aligned} X_1 - \mu_1 &= l_{11}F_1 + l_{12}F_2 + \cdots + l_{1m}F_m + \varepsilon_1 \\ X_2 - \mu_2 &= l_{21}F_1 + l_{22}F_2 + \cdots + l_{2m}F_m + \varepsilon_2 \\ &\;\;\vdots \\ X_p - \mu_p &= l_{p1}F_1 + l_{p2}F_2 + \cdots + l_{pm}F_m + \varepsilon_p, \end{aligned} \quad (8.7)$$

where in matrix notation the factor analysis model is given as

$$\mathbf{X}_{p \times 1} - \boldsymbol{\mu}_{p \times 1} = \mathbf{L}_{p \times m}\mathbf{F}_{m \times 1} + \boldsymbol{\varepsilon}_{p \times 1}, \quad (8.8)$$

where the F are factors and the lij are the factor loadings. The εi are associated only with the Xi, and the p random errors and m factor loadings are unobservable or latent. With p equations and p + m unknowns, the unknowns cannot be directly solved without additional information. To solve for the unknown factor loadings and errors, restrictions are imposed, and the types of restrictions determine the type of factor analysis model: the factor rotation method used determines whether the model is orthogonal or oblique. Factor loadings that are either close to 1 or close to 0 are sought. A factor loading close to 1 suggests that a variable Xi is largely influenced by Fj. In contrast, a factor loading close to 0 suggests that a variable Xi is not substantively influenced by Fj. A collection of factor loadings that is as diverse as possible is sought, lending itself to easy interpretation.

The orthogonal factor analysis model satisfies the following conditions:

$$\begin{aligned} &E[\mathbf{F}] = \mathbf{0}, \quad COV[\mathbf{F}] = \mathbf{I}, \\ &E[\boldsymbol{\varepsilon}] = \mathbf{0}, \quad COV[\boldsymbol{\varepsilon}] = \boldsymbol{\Psi}, \text{ where } \boldsymbol{\Psi} \text{ is a diagonal matrix}, \\ &\mathbf{F} \text{ and } \boldsymbol{\varepsilon} \text{ are independent}. \end{aligned} \quad (8.9)$$

Varimax rotation, which maximizes the sum of the variances of the factor loadings, is a common method for conducting an orthogonal rotation, although there are many other methods.

The oblique factor analysis model relaxes the restriction of uncorrelated factors, resulting in factors that are non-orthogonal. Oblique factor analysis is conducted with the intent to achieve a structure that is more interpretable. Specifically, computational strategies have been developed to rotate factors such that clusters of variables are best represented, without the constraint of orthogonality. However, the oblique factors produced by such rotations are often not easily interpreted, sometimes resulting in factors with less-than-obvious meaning (that is, with many cross loadings).

Interpretation of factor analysis is straightforward. Variables that have high factor loadings are thought to be highly influential in describing the factor, whereas variables with low factor loadings are less influential in describing the factor. Inspection of the variables with high factor loadings on a specific factor is used to uncover structure or commonality among the variables. One must then determine the underlying constructs that are common to variables that load highly on specific factors.
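A minimal sketch of such an analysis, assuming a placeholder data matrix and the FactorAnalysis class in recent versions of scikit-learn (which accepts rotation="varimax"):

```python
# Illustrative sketch (not from the text) of an exploratory factor analysis
# with an orthogonal varimax rotation.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(281, 23))               # placeholder data matrix

fa = FactorAnalysis(n_components=6, rotation="varimax")
fa.fit(X)

# fa.components_ has factors as rows and variables as columns; transpose
# to get a p x m loadings matrix. Loadings near 1 in absolute value mark
# variables that strongly reflect a factor.
loadings = fa.components_.T

# Suppress small loadings for interpretation, as in Tables 8.2 and 8.3.
display = np.where(np.abs(loadings) < 0.10, 0.0, loadings)
print(np.round(display, 3))
```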

Example 8.2

Continuing from the previous example, a factor analysis on continuous variables is conducted to determine which variables might be explaining similar underlying phenomena. The same set of variables used for the principal components analysis is again used in an exploratory factor analysis with orthogonal varimax rotation. Because six principal components explained roughly 55% of the data variance, the number of factors estimated was six.

Table 8.3 shows the factor loadings resulting from this factor analysis. As was done in the principal components analysis, factor loadings less than 0.1 are blanks in the table. Table 8.3, in general, is sparser than Table 8.2, simply because factors are orthogonal to one another (correlations between factors are zero), and because a factor set solution was sought that maximizes the variances of the factor loadings. Inspection of Table 8.3 shows that the equation for X1 is given as HOVPST5 = 0.306F1 + 0.658F2 + 0.323F3 + 0.147F4 + 0.147F5, where factors 1 through 5 are unobserved or latent. As in the principal components analysis, travel behavior variables and HOV attitudinal variables seem to load heavily on factor 1. That many of the factors include travel behavior variables suggests that many dimensions of travel behavior are reflected in these variables. There appear to be two factors that reflect dimensions in the data related to attitudes toward HOV lanes. Sociodemographic variables do not seem to load on any particular factor, and so probably represent unique dimensions in the data.


TABLE 8.3
Factor Loadings of Factor Analysis: HOV Lane Survey Data

Variable       Loadings on Factor 1 through Factor 6

Travel Behavior Variables
HOVPST5        0.306  0.658  0.323  0.147  0.147
DAPST5         –0.278  –0.846  –0.358  –0.206  –0.167
CRPPST5        0.221  0.930  –0.217  –0.166  –0.109
CRPPST52MR     0.112  0.987
VNPPST5        0.675
BUSPST5        0.113  0.108  0.983
NONMOTPST5     0.298
MOTPST5        0.992
CHGRTEPST5     –0.125  –0.113  0.158

HOV Attitude Variables
HOVSAVTIME     0.734  0.142
HOVADUSE       0.642  0.110  0.135
HOVOPN         –0.814  –0.117
GPTOHOV        0.681  0.156
GTTOHOV2       0.515

Sociodemographic Variables
AGE            –0.128
HHINCM         –0.167
EDUC
FAMSIZ         –0.129
NUMWRKS        0.105
NUMCARS

Note: Loadings < 0.10 shown as blanks; each variable's nonblank loadings are listed in factor order.


8.3 Structural Equation Modeling

A structural equation model (SEM) represents a natural extension of a measurement model and is a mature statistical modeling technique. The SEM is a tool developed largely by clinical sociologists and psychologists. It is designed to deal with several difficult modeling challenges, including cases in which some variables of interest to a researcher are unobservable or latent and are measured using one or more exogenous variables, endogeneity among variables, and complex underlying social phenomena. For example, in a study of transit ridership, one might want to determine the impact of a latent variable such as attitude toward transit on transit ridership. Because attitude toward transit itself is not directly observable, one might ask a question or series of questions with the intent to indirectly measure this variable. These observed variables (answers to attitudinal questions) are poorly measured variables meant to reflect the latent variable attitude toward transit. In addition, experiences riding transit might influence attitudes toward transit, and attitudes toward transit might influence those experiences: an endogenous relationship where each variable could influence the other.

When measurement errors in independent variables are incorporated into a regression equation (via a poorly measured variable), the variances of the measurement errors in the regressors are transmitted to the model residual, thereby inflating the model residual variance. The estimated model variance is thus larger than would be experienced if no measurement errors were present. This outcome has deleterious effects on standard errors of parameter estimates and on goodness-of-fit criteria, including the standard F-ratio and R-squared measures. If the parameter estimation method is least squares, then parameter estimates are biased and a function of the measurement error variances. The SEM framework resolves these potential problems by explicitly incorporating measurement errors into the modeling framework. In addition, the SEM can accommodate a latent variable as a dependent variable, something that cannot be done in traditional regression analysis.

8.3.1 Basic Concepts in Structural Equation Modeling

SEMs have two components, a measurement model and a structural model. The measurement model is concerned with how well various measured exogenous variables measure latent variables. A classical factor analysis is a measurement model, and determines how well various variables load on a number of factors, or latent variables. The measurement models within a SEM incorporate estimates of measurement errors of exogenous variables and their intended latent variable. The structural model is concerned with how the model variables are related to one


another. SEMs allow for direct, indirect, and associative relationships to be explicitly modeled, unlike ordinary regression techniques, which implicitly model associations. It is the structural component of SEMs that enables substantive conclusions to be drawn about the relationship between latent variables, and the mechanisms underlying a process or phenomenon. The structural component of SEMs is similar to a system of simultaneous equations discussed previously in Chapter 5. Because of the ability of SEMs to specify complex underlying relationships, SEMs lend themselves to graphical representations, and these graphical representations have become the standard means for presenting and communicating information about SEMs.

Like factor and principal components analyses, SEMs rely on information contained in the variance–covariance matrix. Similar to other statistical models, the SEM requires the specification of relationships between observed and unobserved variables. Observed variables are measured, whereas unobserved variables are latent variables (similar to factors in a factor analysis) that represent underlying unobserved constructs. Unobserved variables also include error terms that reflect the portion of the latent variable not explained by their observed counterparts. In a SEM, there is a risk that the number of model parameters sought will exceed the number of model equations needed to solve them. Thus, there is a need to distinguish between fixed and free parameters: fixed parameters are set by the analyst and free parameters are estimated from the data. The collection of fixed and free parameters specified by the analyst implies a variance–covariance structure in the data, which is compared to the observed variance–covariance matrix to assess model fit.

There are three types of relationships that are modeled in the SEM. An association is a casual (not causal) relationship between two independent variables, and is depicted as a double-headed arrow between variables. A direct relationship is one where the independent variable influences the dependent variable, and is shown with a directional arrow, where the direction of the arrow is assumed to coincide with the direction of influence from the exogenous to the endogenous variable. An indirect relationship is one where an independent variable influences a dependent variable indirectly through a third independent variable. For example, variable A has a direct effect on variable B, which has a direct effect on variable C, so variable A has an indirect effect on variable C. Note that in this framework a variable may serve as both an endogenous variable in one relationship and an exogenous variable in another.

Figure 8.2 shows a graphical representation of two different linear regression models with two independent variables, as would be depicted in the SEM nomenclature. The independent variables X1 and X2, shown in rectangles, are measured exogenous variables, have direct effects on variable Y1, and are correlated with each other. The model in the bottom of the figure reflects a fundamentally different relationship among variables. First, variables X3 and X4 directly influence Y2. Variable X4 is also directly


influenced by variable X3. The SEM shown in the top of the figure implies a variance–covariance matrix different from the model shown in the bottom of the figure. The models also show that although the independent variables have direct effects on the dependent variable, they do not fully explain the variability in Y, as reflected by the error terms, depicted as ellipses in the figure. The additional error term, e3, is that portion of variable X4 not fully explained by variable X3. Latent variables, if entered into these models, would also be depicted as ellipses in the graphical representation of the SEM.

An obvious issue of concern is how these two different SEMs depicted in Figure 8.2 imply different variance–covariance matrices. The model depicted in the top of Figure 8.2 represents a linear regression model with two independent variables that covary, such that Y1 = β0 + β1X1 + β2X2 + e1. The model depicted in the bottom of the figure represents two simultaneous regressions, Y2 = β0 + β3X3 + β4X4 + e2 and X4 = β0 + β5X3 + e3. In this second SEM, the variable X4 serves as both an exogenous and an endogenous variable. The collective set of constraints implied by these two SEMs determines the model-implied variance–covariance structure. The original correlation matrix is completely reproduced if all effects (direct, indirect, and associative) are accounted for in a model. This is the saturated model, and is uninteresting simply because there is no parsimony achieved by such a model. Without compromising the statistical validity of the model, a natural goal is to simplify an underlying complex data-generating process with a relatively simple model. How the paths are drawn in the development of SEMs determines the presumed variance–covariance matrix.
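One way to see how a path diagram implies a variance–covariance structure is to simulate it. The sketch below (with made-up parameter values, not taken from the text) simulates the bottom model of Figure 8.2 and inspects the covariance of the simulated variables:

```python
# Illustrative simulation of the bottom path diagram in Figure 8.2:
# X3 is exogenous; X4 and Y2 are both endogenous.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x3 = rng.normal(size=n)
x4 = 0.5 + 0.65 * x3 + rng.normal(size=n)             # X4 = b0 + b5*X3 + e3
y2 = 1.0 + 0.4 * x3 + 0.3 * x4 + rng.normal(size=n)   # Y2 = b0 + b3*X3 + b4*X4 + e2

# Rows are variables (X3, X4, Y2); this is the model-implied structure
# that a SEM would compare against the observed covariance matrix.
implied = np.cov(np.vstack([x3, x4, y2]))
print(np.round(implied, 2))
```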

Example 8.3

To examine simple path models, consider the relationships between an attitudinal variable in the HOV survey and two behavioral variables. Table 8.4 shows the results of two SEMs. These overly simple

FIGURE 8.2  SEMs depicting standard linear regression models with two variables.


models illustrate the two SEMs depicted in Figure 8.2, with the exception that no covariation between independent variables is allowed in Model I. Model I is a linear regression model with two variables, CRPPST5 and HOVPST5. The second SEM consists of two simultaneous regressions, as shown in the figure. The constraint of zero covariance between CRPPST5 and HOVPST5 in Model I does not agree well with the observed covariance of 1.51 between these two variables. These two competing models imply different variance–covariance matrices, and the agreement of these two model-implied covariance matrices with the observed variance–covariance matrix is used to assess model fit.

Example 8.3 illustrates how SEMs imply sets of regression equations, which in turn have implications on the variance–covariance matrix. It is the discrepancy between the observed and the implied variance–covariance matrix that forms the basis of SEMs. When latent variables are brought into the SEM framework, the true value of SEMs is realized. The SEM is able to cope with both endogenous and exogenous variables, as well as observed and unobserved variables, expressed as embedded linear relationships reflecting complex underlying relationships. Detailed examples of how complex SEMs are estimated and how they have been applied in the construction and transit industries are presented in Molenaar et al. (2000) and Karlaftis (2002), respectively.

TABLE 8.4
Implied Variance–Covariance Matrices for Two Simple SEMs: HOV Survey Data

Sample Variance–Covariance Matrix
              HOVPST5   CRPPST5   HOVSAVTI
HOVPST5       3.29
CRPPST5       1.51      2.32
HOVSAVTI      0.74      0.67      1.97

Implied Variance–Covariance Matrix: SEM Model I
HOVPST5       3.28
CRPPST5       0.00      2.30
HOVSAVTI      0.43      0.46      1.89

Implied Variance–Covariance Matrix: SEM Model II
CRPPST5       2.292
HOVPST5       1.490     3.281
HOVSAVTI      0.333     0.734     1.959

Model I: HOVSAVTI = 1.52 + 0.201(CRPPST5) + 0.123(HOVPST5).
Model II: HOVSAVTI = 1.58 + 0.224(HOVPST5); HOVPST5 = 0.55 + 0.650(CRPPST5).


8.3.2 The Structural Equation Model

The focus in this section is to provide a general framework for SEMs, to demonstrate how the parameters are estimated, and to illustrate how results are interpreted and used. The interested reader can consult Hoyle (1995) and Arminger et al. (1995) for a presentation and discussion of the full gamut of alternative SEM specifications.

SEMs, similar to other statistical models, are used to evaluate theories or hypotheses using empirical data. The empirical data are contained in a p × p variance–covariance matrix S, which is an unstructured estimator of the population variance–covariance matrix Σ. A SEM is then hypothesized to be a function of θ, a vector of Q unknown structural parameters, which in turn will generate a model-implied variance–covariance matrix Σ(θ). All variables in the model, whether observed or latent, are classified as either independent (exogenous) or dependent (endogenous). A dependent variable in a SEM diagram is a variable that has a one-way arrow pointing to it. The set of these variables is collected into a vector η, while independent variables are collected in the vector ξ, such that (following Bentler and Weeks, 1980)

$$\boldsymbol{\eta} = \mathbf{B}\boldsymbol{\eta} + \boldsymbol{\Gamma}\boldsymbol{\xi} + \mathbf{u}, \quad (8.10)$$

where B and Γ contain the estimated regression parameters for the dependent and independent variables, respectively, and u is a vector of regression disturbances. The exogenous factor covariance matrix is represented as Φ = COV[ξ, ξᵀ], and the error covariance matrix as Ψ = COV[u, uᵀ]. The variance–covariance matrix for the model in Equation 8.10 is

$$\boldsymbol{\Sigma}(\boldsymbol{\theta}) = \mathbf{G}\left(\mathbf{I} - \mathbf{B}\right)^{-1}\left[\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Gamma}^{T} + \boldsymbol{\Psi}\right]\left[\left(\mathbf{I} - \mathbf{B}\right)^{-1}\right]^{T}\mathbf{G}^{T}, \quad (8.11)$$

where G is a selection matrix containing either 0 or 1 to select the observed variables from all the dependent variables in η. There are p² elements or simultaneous equations in Equation 8.11, one for each element in Σ(θ). Some of the p² equations are redundant, however, leaving

$$p^* = \frac{p(p + 1)}{2}$$

independent equations. These p* independent equations are used to solve for the unknown parameters θ, which consist of the elements of B, Γ, Φ, and Ψ. The estimated model-implied variance–covariance matrix is then given as Σ(θ̂).

Model identification in SEM can present serious challenges. There are Q unknown model parameters (comprising θ), which must be solved using p*


simultaneous independent equations. There are two necessary and sufficient conditions for SEM identification. The first condition is that the number of simultaneous equations must be equal to or greater than the number of unknown model parameters, such that Q ≤ p*. The second condition is that each and every free model parameter must be identified, which can be difficult (Hoyle, 1995).

Once the SEM has been specified, and identification conditions are met, solutions for the parameters are obtained. Parameters are estimated using a discrepancy function criterion, where the differences between the sample variance–covariance matrix and the model-implied variance–covariance matrix are minimized. The discrepancy function is

$$F = F\left[\mathbf{S}, \boldsymbol{\Sigma}(\boldsymbol{\theta})\right]. \quad (8.12)$$

Different estimation methods in SEM have varying distributional assumptions and, in turn, require different discrepancy functions. For example, maximum likelihood estimated model parameters, which require that specific distributional and variable assumptions are met, are obtained using the discrepancy function

$$F_{MLE} = LN\left|\boldsymbol{\Sigma}(\boldsymbol{\theta})\right| + TRACE\left[\mathbf{S}\,\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta})\right] - LN\left|\mathbf{S}\right| - p. \quad (8.13)$$

For detailed discussions on other discrepancy functions and corresponding estimation methods, including maximum likelihood (MLE), generalized least squares (GLS), asymptotically distribution-free (ADF), scale-free least squares (SLS), unweighted least squares (ULS), and Browne's method, see Arbuckle and Wothke (1995), Hoyle (1995), or Arminger et al. (1995).
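A minimal sketch of Equation 8.13, assuming numpy and arbitrary positive definite matrices in place of real SEM output:

```python
# Illustrative sketch (not from the text) of the ML discrepancy function.
import numpy as np

def f_mle(S: np.ndarray, sigma: np.ndarray) -> float:
    """F_MLE = LN|Sigma| + TRACE[S Sigma^-1] - LN|S| - p (Equation 8.13)."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(sigma)
    _, logdet_s = np.linalg.slogdet(S)
    trace_term = np.trace(S @ np.linalg.inv(sigma))
    return logdet_sigma + trace_term - logdet_s - p

# If the model reproduces S exactly (a saturated model), the discrepancy
# is zero; X^2 in Equation 8.14 is (n - 1) times the minimized discrepancy.
S = np.array([[3.29, 1.51],
              [1.51, 2.32]])
print(f_mle(S, S))                       # 0.0
print(f_mle(S, np.diag(np.diag(S))))     # worse fit: independence-style model
```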

A useful feature of discrepancy functions is that they can be used to test the null hypothesis H0: Σ = Σ(θ), where

$$X^2 = \hat{F}\left(n - 1\right) \sim \chi^2\left(p^* - Q\right). \quad (8.14)$$

This equation shows that, given that the model is correct, the variables are approximately multivariate normally distributed, and the sample size is sufficiently large, the product of the minimized discrepancy function and the sample size minus one is asymptotically chi-square distributed with degrees of freedom equal to p* − Q. Also, it can be shown that SEM parameter estimates are asymptotically unbiased, consistent, and asymptotically efficient (Hoyle, 1995).

Equation 8.14 needs to be applied with care. Its unsuitability as a criterion for model assessment and selection was pointed out early in SEM theory development because the test statistic is largely a function of sample size (Joreskog, 1969; Bentler and Bonett, 1980; Gulliksen and Tukey, 1958). Thus, the X2 best serves the analyst in the selection of the best from competing


models estimated on the same data; its absolute value should be evaluated with respect to the sample size on which the statistic is estimated.

8.3.3 Non-Ideal Conditions in the Structural Equation Model

Recall that ideal conditions in SEM include multivariate normality of independent variables, the correct model functional form, independent and dependent variables measured on the interval or ratio scale, and a sufficiently large sample size. A large number of studies have been conducted to assess the impact of continuous yet non-normal variables on SEMs (see, for example, Browne, 1984; Chou et al., 1991; Hu et al., 1992; Finch et al., 1994; Kline, 1998). Non-normality can arise from poorly distributed continuous variables or coarsely categorized continuous variables. Non-normality can be detected in a number of ways, including box plots, histograms, normal probability plots, and inspection of multivariate kurtosis. Numerous studies have arrived at similar conclusions regarding the impact of non-normality in SEMs. The X2 test statistic becomes inflated as the data become more non-normal. In addition, the GLS and MLE methods of parameter estimation produce inflated X2 test statistics with small sample sizes, even if multivariate normality is satisfied. Model goodness-of-fit indices are also underestimated under non-normality, and non-normality leads to moderate to severe underestimation of standard errors of parameter estimates.

There are several remedies for dealing with non-normality. The asymptotically distribution-free (ADF) estimator is a generalized least squares estimation approach that does not rely on multivariate normality (Browne, 1984). The ADF estimator produces asymptotically unbiased estimates of the X2 test statistic, parameter estimates, and standard errors. The scaled X2 test statistic, developed by Satorra and Bentler (see Satorra, 1990), corrects or rescales the X2 test statistic so that it approximates the referenced χ2 distribution.

Bootstrapping is a third method for dealing with non-normal samples. Bootstrapping is based on the principle that the obtained random sample is a fair representation of the population distribution, and that by resampling the sample, estimates of parameters and their standard errors can be obtained from the empirical distribution as defined by the resampling. Efron and Tibshirani (1986) have demonstrated that in many studies the sampling distribution can be reasonably approximated by data obtained from a single sample. Details of the bootstrap approach to SEM can be found in Bollen and Stine (1992).
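A minimal sketch of the resampling principle (not of a full SEM bootstrap), assuming numpy and a placeholder two-column data matrix; the bootstrapped statistic here is simply a covariance:

```python
# Illustrative sketch (not from the text): bootstrap standard error of a
# covariance by resampling cases with replacement.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(281, 2))
X[:, 1] += 0.5 * X[:, 0]                 # induce some covariance

n, n_boot = X.shape[0], 2000
stats = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)     # resample rows with replacement
    stats[b] = np.cov(X[idx], rowvar=False)[0, 1]

# The empirical distribution of the resampled statistic approximates its
# sampling distribution; its standard deviation estimates the standard error.
print("bootstrap estimate:", stats.mean())
print("bootstrap standard error:", stats.std(ddof=1))
```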

Nominal and ordinal scale variables also cause problems in SEMs, resulting in biased estimates of X2 test statistics, estimated parameters, and their standard errors. One approach, developed by Muthen (1984), consists of a continuous/categorical variable methodology (CVM) weighted least squares estimator and discrepancy function, which results in unbiased, consistent, and efficient parameter estimates when variables are measured on nominal and ordinal scales. However, this estimator requires large sample


sizes (at least 500 to 1000 cases) and is difficult to estimate for overly complex models (Hoyle, 1995). Other approaches include variable reexpressions (Cattell and Burdsal, 1975), variable transformations (Daniel and Wood, 1980; Emerson and Stoto, 1983), and alternating conditional expectations and Box–Cox transformations (de Veaux, 1990).

Interactions and nonlinear effects arise frequently in the modeling of real data. In SEMs, interactions and nonlinear effects present challenges above and beyond those encountered in simple linear regression. There are two general approaches to handling these problems: the indicant product approach and the multisample approach. The indicant product approach is well developed only for multiplicative cases and requires a centering transformation. The multisample approach is more flexible, avoids some multicollinearity and distributional problems associated with the product indicant approach, and is suitable under the widest range of conditions (Rigdon et al., 1998). Most currently available SEM software packages can accommodate the multisample approach. The reader is encouraged to consult Schumacker and Marcoulides (1998) for a complete treatment of this topic as it relates to SEMs.

8.3.4 Model Goodness-of-Fit Measures

Model goodness-of-fit (GOF) measures are an important part of any statistical model assessment. GOF measures in SEMs are an unsettled topic, primarily as a result of a lack of consensus on which GOF measures serve as "best" measures of model fit to empirical data (Arbuckle and Wothke, 1995). For detailed discussions of these debates and a multitude of SEM GOF methods, see Mulaik et al. (1989), MacCallum (1990), Steiger (1990), Bollen and Long (1993), and Arbuckle and Wothke (1995).

Several important concepts are routinely applied throughout SEM GOF tests that enable the assessment of statistical models. A saturated model is a model that is perfectly fit to the data; its variance–covariance structure is completely unconstrained, and it represents an unappealing model. It is the most general model possible, and is used as a standard of comparison to the estimated model. Because the saturated model is as complex as the original data, it does not summarize the data into succinct and useful relationships. In contrast, the independence model is constrained such that no relationships exist in the data and all variables in the model are independent of each other. This model presents the "worst case" model. The saturated and independence models can be viewed as two extremes within which the best model will lie.

There are a large number of GOF criteria available for assessing the fit of SEMs. Several important and widely used GOF measures are described here, and the reader is encouraged to examine other GOF measures in the references provided.

The first class of GOF indices includes measures of parsimony. Models with few parameters are preferred to models with many parameters, providing that the important underlying model assumptions are not violated.


This is supported by a general desire to explain complex phenomena with as simple a model as possible. Three simple measures of parsimony are the number of model parameters Q, the degrees of freedom of the model being tested, df = p* − Q, and the parsimony ratio

$$PR = \frac{d}{d_i}, \quad (8.15)$$

where d is the degrees of freedom of the estimated model and di is the degrees of freedom of the independence model. PR represents the number of parameter constraints of the estimated model as a fraction of the number of constraints in the independence model (a higher PR is preferred).

There are several GOF indices based on the discrepancy function, F, shown in Equation 8.12. As stated previously, the X2 test statistic, derived from the discrepancy function, needs to be treated with care because it depends largely on sample size: small samples tend to accept (fail to reject) the null hypothesis, and large samples tend to reject it.

The X2 statistic is the minimum value of the discrepancy function F times the sample size minus one (see Equation 8.14). The p-value is the probability of obtaining a discrepancy function as large as or larger than the one obtained by random chance if the model is correct, distributional assumptions are correct, and the sample size is sufficiently large. The statistic X2/(model degrees of freedom) has been suggested as a useful GOF measure. Rules of thumb suggest that this measure (except under ULS and SLS estimation) should be close to 1 for correct models. In general, researchers have recommended that this statistic lie somewhere below 5, with values close to 1 preferred (Carmines and McIver, 1981; Marsh and Hocevar, 1985; Byrne, 1989).

Another class of fit measures is based on the population discrepancy. These measures rely on the notion of a population discrepancy function (as opposed to the sample discrepancy function) to estimate GOF measures, including the noncentrality parameter (NCP), the root mean square error of approximation (RMSEA), and PCLOSE, the p-value associated with a hypothesis test of RMSEA = 0.05. For details on these measures, the reader should consult Steiger et al. (1985) and Browne and Cudeck (1993).

Information theoretic measures are designed primarily for use with maximum likelihood methods, and are meant to provide a measure of the amount of information contained in a given model. There are many measures that can be used to assess fit in this class. The Akaike information criterion (AIC; Akaike, 1987) is given as

$$AIC = X^2 + 2q. \quad (8.16)$$

Lower values of AIC are preferred to higher values because higher values of X2 correspond to greater lack of fit. In the AIC, a penalty is imposed on models with larger numbers of parameters q, similar to the adjusted R-squared measure


in regression. The Browne–Cudeck (1989) criterion is similar to AIC, except that it imposes a slightly greater penalty for model complexity than does AIC. It is also the only GOF measure in this class designed specifically for the analysis of moment structures (Arbuckle and Wothke, 1995).

Yet another class of GOF measures is designed to compare the fitted model to baseline models. The normed fit index (NFI), developed by Bentler and Bonett (1980), is given as

$$NFI = 1 - \frac{X^2}{X_b^2}, \quad (8.17)$$

where Xb2 is the minimum discrepancy (X2) of the baseline comparison model, usually the saturated or independence model. NFI takes on values between 0 and 1. Values close to 1 indicate a close fit to the saturated model (as typically measured against the terribly fitting independence model). Rules of thumb for this measure suggest that models with a NFI less than 0.9 can be improved substantially (Bentler and Bonett, 1980). Other GOF measures in this category include the relative fit index (RFI), the incremental fit index (IFI), the Tucker–Lewis coefficient, and the comparative fit index (CFI), discussion of which can be found in Bollen (1986), Bentler (1990), and Arbuckle and Wothke (1995).

8.3.5 Guidelines for Structural Equation Modeling

Similar to other statistical modeling methods, some guidelines can aid in the estimation of SEMs. To review, there are several reasons for using a SEM. First, SEMs handle measurement problems well. They are ideal when exogenous variables measure an underlying unobservable or latent construct or constructs. Second, SEMs provide a way to check the entire structure of data assumptions, not just whether or not the dependent variable predictions fit observations well. Third, SEMs cope well with endogeneity among variables. When numerous endogenous variables are present in a model, the SEM framework can account for the feedback present in these relationships.

The need for the analyst to have a well-articulated theory regarding the data-generating process prior to structural equation modeling is compounded. The complexity of variable relationships accommodated in the SEM framework translates to a significant increase in the potential for data mining. One must be careful to prespecify direct, indirect, and associative relationships between variables that correspond with theory and expectation. In addition, an understanding of the measurement model or challenge in the data is needed, along with knowledge of which underlying latent constructs are being sought and which variables reflect these constructs. In transportation applications, latent constructs include such characteristics as attitudes in favor of or against a policy or program, educational background, level of satisfaction with a program or service, and socioeconomic status, to name but a few. Finally, assessment of a SEM should take into account many


criteria, including theoretical appeal of the model specification, overall X2 goodness of fit between observed and implied variance–covariance matrices, individual variable parameters and their standard errors, and GOF indices. As always, the analyst should rely on consensus building over time to establish a model as a reliable depiction of the underlying reality.

Example 8.4

The results from the principal components and factor analyses are used to develop the SEM specification. In the SEM presented here, three important latent variables are thought to define the structure in the HOV survey data. The first is HOV attitude, which reflects the respondent's attitude toward HOV lanes and their use, positive values reflecting support for HOV lanes and negative values reflecting distaste for HOV lanes. The second is socioeconomic status, which reflects the relative level of personal wealth combined with relative social status. The third latent variable thought to be important is HOV experience, which is thought to measure how much experience using HOV lanes the respondent has had. It is expected that HOV experience will influence someone's HOV attitude; someone who has experienced the benefits of utilizing an HOV facility will have a better attitude toward HOV facilities (presuming that they are effective). It might also be that HOV attitude influences HOV experience, that is, respondents with a better attitude are more likely to use HOV facilities. It is also suspected that respondents with higher socioeconomic status will in general have better attitudes toward HOV facilities.

FIGURE 8.3  SEM model of HOV survey data: three latent variables.


As shown in Figure 8.3 (with error terms represented in ellipses), the responses to questions in the HOV survey were determined by underlying latent variables (identified in part by the factor and principal components analyses). For example, HOV experience is influenced positively by a respondent who regularly engages in HOV use, either through carpooling, bus, or some other HOV-qualifying use. The relationship between latent variables and the measured variables is shown in the figure. Table 8.5 shows the estimation results for the SEM depicted in Figure 8.3. The initial

TABLE 8.5
Maximum Likelihood Estimation Results for SEM of HOV Survey Data

Estimated Parameters                     Estimate   Standard Error   Z-Value

Regression Parameters
HOV Attitude <- Socioeconomic Status     –0.001     0.080            –0.017
HOV Attitude <- HOV Experience           0.507      0.078            6.509
CRPPST5 <- HOV Experience                0.848      0.085            10.001
HOVPST5 <- HOV Experience                1.724      0.079            21.908
BUSPST5 <- HOV Experience                0.386      0.061            6.324
HOVSAVTI <- HOV Attitude                 0.905      0.068            13.333
HOVADUSE <- HOV Attitude                 0.661      0.056            11.875
HOVOPN <- HOV Attitude                   –1.164     0.075            –15.505
GTTOHOV2 <- HOV Attitude                 0.542      0.065            8.283
NUMCARS <- Socioeconomic Status          0.832      0.185            4.497
NUMWRKS <- Socioeconomic Status          0.374      0.091            4.114
HHINCM <- Socioeconomic Status           0.431      0.126            3.410
GPTOHOV <- HOV Attitude                  0.847      0.067            12.690

Intercepts
HOVPST5                                  0.973      0.105            9.234
CRPPST5                                  0.643      0.089            7.259
BUSPST5                                  0.290      0.061            4.785
HOVADUSE                                 1.268      0.063            20.053
HOVOPN                                   1.733      0.089            19.458
GPTOHOV                                  1.757      0.077            22.914
HOVSAVTI                                 1.787      0.079            22.742
GTTOHOV2                                 1.892      0.071            26.675
NUMCARS                                  2.418      0.061            39.524
HHINCM                                   6.220      0.088            70.537
NUMWRKS                                  1.849      0.046            40.250

Goodness-of-Fit Measures
Degrees of freedom = p* – Q              77 – 34 = 43
Chi-square                               129.66
Parsimony ratio                          0.652
Chi-square/degrees of freedom            3.02
Akaike's information criterion: Saturated/fitted/independence   154/198/4503
Browne–Cudeck criterion: Saturated/fitted/independence          150/200/4504
Normed fit index                         0.971


expectation that HOV experience and HOV attitude were mutually endogenous was not supported in the modeling effort. Instead, HOV experience directly influenced HOV attitude. As seen in the table, there is negligible support for an effect of socioeconomic status on HOV attitude; however, it was the best specification of all specifications tested, and overall model GOF indices are reasonable. All other estimated model parameters were statistically significant at well beyond the 90% level of confidence.

As shown in Table 8.5, the SEM reflects 12 simultaneous regression equations. Constraints were imposed to ensure identifiability; for example, the mean and variance of the variable socioeconomic status were set to 0 and 1, respectively. Some of the data were missing as a result of incomplete surveys; maximum likelihood estimates were used to impute missing values (for details, see Arbuckle and Wothke, 1995). Because of computational constraints, other estimation methods are not available when there are missing data, limiting the number of estimation methods that were available.

GOF statistics for the SEM are shown at the bottom of the table. The GOF statistics are in general encouraging. The X2 statistic divided by the model degrees of freedom is around 3, an acceptable value. Akaike's information criterion is close to the value for the saturated model. Finally, the normed fit index is about 0.971, a value fairly close to 1.
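These reported indices can be checked directly against Equation 8.16 using the values in Table 8.5 (X2 = 129.66, df = 43, and q = 34 estimated parameters):

$$X^2/df = 129.66/43 \approx 3.02, \qquad AIC = 129.66 + 2(34) \approx 198,$$

while the saturated model, with zero discrepancy and q = p* = 77 parameters, gives AIC = 2(77) = 154, the saturated value reported in the table.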

Given the significance of the SEM variables, the theoretical appeal of the structural portion of the model, and the collective GOF indices, the SEM is considered to be a good approximation of the underlying data-generating process. Some improvements are possible, especially regarding the role of the variable socioeconomic status.


9
Duration Models

In many instances, the need arises to study the elapsed time until the occurrence of an event or the duration of an event. Data such as these are referred to as duration data and are encountered often in the field of transportation research. Examples include the time until a vehicle accident occurs, the time between vehicle purchases, the time devoted to an activity (shopping, recreation, etc.), and the time until the adoption of new transportation technologies. Duration data are usually continuous and can, in most cases, be modeled using least squares regression. The use of estimation techniques that are based on hazard functions, however, often provides additional insights into the underlying duration problem (Kiefer, 1988; Hensher and Mannering, 1994).

9.1 Hazard-Based Duration Models

Hazard-based duration modeling has been extensively applied in a number of fields including biostatistics (Kalbfleisch and Prentice, 1980; Fleming and Harrington, 1991) and economics (Kiefer, 1988). In the transportation field, the application of such models has been limited, save for a recent flurry of applications. A good example of this is the research on the duration of trip-generating activities, such as the length of time spent shopping and engaging in recreational/social visits and the length of time spent staying at home between trips (see Hamed and Mannering, 1993; Bhat 1996a, b; Ettema and Timmermans, 1997; Kharoufeh and Goulias, 2002).

To study duration data, hazard-based models are applied to study the conditional probability of a time duration ending at some time t, given that the duration has continued until time t. For example, consider the duration of time until a vehicle accident occurs as beginning when a person first becomes licensed to drive (see Mannering, 1993). Hazard-based duration models can account for the possibility that the likelihood of a driver becoming involved in an accident may change over time. As time passes, driving experience is gained and there may be changes in drivers' confidence, skill, and risk-taking behavior. Thus, one would expect the probability of an


Thus, one would expect the probability of an accident to increase or decrease, although it may also be possible that the probability of an accident would remain constant over time. Probabilities that change as time passes are ideally suited to hazard-function analyses.

Developing hazard-based duration models begins with the cumulative distribution function:

F(t) = P(T < t), (9.1)

where P denotes probability, T is a random time variable, and t is some specified time. Using the example of time until an accident, Equation 9.1 gives the probability of having an accident before some transpired time, t. The density function corresponding to this distribution function (the first derivative of the cumulative distribution with respect to time) is

f(t) = dF(t)/dt, (9.2)

and the hazard function is

h(t) = f(t)/[1 – F(t)], (9.3)

where h(t) is the conditional probability that an event will occur (for example, an accident, death, or the end of a shopping trip) between time t and t + dt, given that the event has not occurred up to time t. In other words, h(t) gives the rate at which event durations are ending at time t (such as the duration in an accident-free state that would end with the occurrence of an accident), given that the event duration has not ended up to time t. The cumulative hazard H(t) is the integrated hazard function, and provides the cumulative rate at which events are ending up to or before time t.

The survivor function, which provides the probability that a duration is greater than or equal to some specified time, t, is also frequently used in hazard analyses for interpretation of results. The survivor function is

S(t) = P(T ≥ t). (9.4)

If one of these functions is known — the density, cumulative distribution, survivor, hazard, or integrated hazard — any of the others are readily obtained. The relationships between these functions are

S(t) = 1 – F(t) = ∫_t^∞ f(u)du = EXP[–H(t)]

f(t) = dF(t)/dt = –dS(t)/dt = h(t)EXP[–H(t)]

h(t) = f(t)/S(t) = f(t)/[1 – F(t)] = dH(t)/dt

H(t) = ∫_0^t h(u)du = –LN[S(t)]. (9.5)
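
These identities are easy to verify numerically. The short Python sketch below uses a Weibull duration (the parameter values are illustrative assumptions, not estimates from the text) and checks that h = f/S and H = –LN(S) hold:

```python
import numpy as np

# Numerical check of the Equation 9.5 identities for a Weibull duration;
# lam = 0.02 and P = 1.745 are illustrative values only.
lam, P = 0.02, 1.745
t = np.linspace(0.01, 240.0, 1000)

h = lam * P * (lam * t) ** (P - 1)   # hazard
H = (lam * t) ** P                   # cumulative hazard, the integral of h
S = np.exp(-H)                       # survivor function, S(t) = EXP[-H(t)]
f = h * S                            # density, f(t) = h(t)EXP[-H(t)]

assert np.allclose(h, f / S)         # h(t) = f(t)/S(t)
assert np.allclose(H, -np.log(S))    # H(t) = -LN[S(t)]
```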


Hazard, density, cumulative distribution, and survivor functions are graphically illustrated in Figure 9.1.

The slope of the hazard function (the first derivative with respect to time) has important implications. It captures the dependence of the probability of a duration ending on the length of the duration (duration dependence). Consider a driver's probability of having an accident and the length of time without having an accident. Figure 9.2 presents four possible hazard functions for this case. The first hazard function, h1(t), has dh1(t)/dt < 0 for all t. This hazard is monotonically decreasing in duration, implying that the longer drivers go without having an accident, the less likely they are to have one soon. The second hazard function is nonmonotonic and has dh2(t)/dt > 0 and dh2(t)/dt < 0, depending on the length of duration t.

FIGURE 9.1 Illustration of hazard (h(t)), density (f(t)), cumulative distribution (F(t)), and survivor (S(t)) functions.

FIGURE 9.2 Illustration of four alternate hazard functions.


In this case, accident probabilities increase or decrease with duration. The third hazard function has dh3(t)/dt > 0 for all t and is monotonically increasing in duration. This implies that the longer drivers go without having an accident, the more likely they are to have an accident soon. Finally, the fourth hazard function has dh4(t)/dt = 0, which means that accident probabilities are independent of duration and no duration dependence exists.

In addition to duration dependence, hazard-based duration models account for the effect of covariates on probabilities. For example, for vehicle accident probabilities, gender, age, and alcohol consumption habits would all be expected to influence accident probabilities. Proportional hazards and accelerated lifetime models are two alternative methods that have been popular in accounting for the influence of covariates.

The proportional hazards approach assumes that the covariates, which are factors that affect accident probabilities, act multiplicatively on some underlying hazard function. This underlying (or baseline) hazard function is denoted h0(t) and is the hazard function assuming that all elements of a covariate vector, X, are zero. For simplicity, covariates are assumed to influence the baseline hazard through the function EXP(βX), where β is a vector of estimable parameters. Thus the hazard rate with covariates is

h(t|X) = h0(t)EXP(βX). (9.6)

This proportional hazards approach is illustrated in Figure 9.3.

The second approach for incorporating covariates in hazard-based models is to assume that the covariates rescale (accelerate) time directly in a baseline survivor function, which is the survivor function when all covariates are zero. This accelerated lifetime method again assumes that covariates influence the process through the function EXP(βX).

FIGURE 9.3 Illustration of the proportional hazards model.


The accelerated lifetime model is written as

S(t|X) = S0[t EXP(βX)], (9.7)

which leads to the conditional hazard function

h(t|X) = h0[t EXP(βX)]EXP(βX). (9.8)

Accelerated lifetime models have, along with proportional hazards models, enjoyed widespread use and are estimable by standard maximum likelihood methods (see Kalbfleisch and Prentice, 1980).
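
To see the two specifications side by side, the Python sketch below evaluates Equations 9.6 and 9.8 for a Weibull baseline hazard; the parameter values and the scalar covariate effect bx = βX are illustrative assumptions, not estimates from the text:

```python
import numpy as np

# Weibull baseline hazard with illustrative lam, P, and covariate effect bx.
lam, P, bx = 0.02, 1.745, 0.5
t = np.linspace(0.01, 240.0, 5)

h0 = lam * P * (lam * t) ** (P - 1)    # baseline hazard h0(t)
h_ph = h0 * np.exp(bx)                 # proportional hazards, Equation 9.6
h_aft = lam * P * (lam * t * np.exp(bx)) ** (P - 1) * np.exp(bx)  # Equation 9.8
print(np.column_stack([t, h0, h_ph, h_aft]))
```

For a Weibull baseline, both forms remain Weibull hazards (the acceleration simply rescales λ), which is why the Weibull family admits both proportional hazards and accelerated lifetime interpretations.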

9.2 Characteristics of Duration Data

Duration data are often left or right censored. For example, consider the time of driving a vehicle until a driver's first accident. Suppose data are only available for reported accidents over a specified time period beginning at time a in Figure 9.4 and ending at time b. Observation 1 is not observed, since it does not fall within the time period of observation. Observation 2 is left and right censored because it is not known when driving began and the first accident is not observed in the a to b time interval. Observation 3 is complete, with both start and ending times in the observed period. Observations 4 and 6 are left censored, and observation 5 is right censored.

FIGURE 9.4 Illustration of duration data.


Hazard-based models can readily account for right-censored data (see Kalbfleisch and Prentice, 1980). With left-censored data the problem becomes one of determining the distribution of duration start times so that they can be used to determine the contribution of the left-censored data to the likelihood function of the model. Left-censored data create a far more difficult problem because of the additional complexity added to the likelihood function. For additional details regarding censoring problems in duration modeling, the interested reader should refer to Kalbfleisch and Prentice (1980), Heckman and Singer (1984), and Fleming and Harrington (1991).

Another challenge may arise when a number of observations end their durations at the same time. This is referred to as the problem of tied data. Tied data can arise when data collection is not precise enough to identify exact duration-ending times. When duration exits are grouped at specific times, the likelihood function for proportional hazards and accelerated lifetime models becomes increasingly complex. Kalbfleisch and Prentice (1980) and Fleming and Harrington (1991) provide an excellent discussion of the problem of tied data.

9.3 Nonparametric Models

Although applications in the transportation field are rare, there are numerous nonparametric (distribution-free) survival analysis applications in fields such as medicine and epidemiology. Because of the predominant use of semiparametric and parametric methods in the field of transportation, only the fundamentals of nonparametric methods are covered here. It should be noted that the lack of transportation applications of nonparametric methods in no way implies that these methods are not useful. Nonparametric methods afford the ability to model survival or duration data without relying on specific or well-behaved statistical distributions.

There are two popular approaches for generating survival functions with nonparametric methods: the product-limit (PL) method developed by Kaplan and Meier (1958) and life tables. The PL estimate is based on individual survival times, whereas the life table method groups survival times into intervals.

The basic method for calculating survival probabilities using the PL method begins by specifying the probability of surviving r years (without event A occurring) as the conditional probability of surviving r years given survival for r – 1 years, times the probability of surviving r – 1 years (or months, days, minutes, etc.). In notation, the probability of surviving k or more years is given by

Ŝ(k) = (pk | pk–1)(pk–1 | pk–2) ⋯ (p3 | p2)(p2 | p1)p1, (9.9)

where (pk | pk–1) is the proportion of observed subjects surviving to period k, given survival to period k – 1, and so on.


This PL estimator produces a survival step-function that is based purely on the observed data. The Kaplan–Meier method provides useful estimates of survival probabilities and a graphical presentation of the survival distribution. It is the most widely applied nonparametric method in survival analysis (Lee, 1992).

A few observations about this nonparametric method are worth mentioning. Kaplan–Meier estimates are limited. If the largest (survival) observation is right censored, the PL estimate is undefined beyond this observation. If the largest observation is not right censored, then the PL estimate at that time equals zero, which is technically correct because no survival time longer than this was observed. In addition, the median survival time cannot be estimated if more than 50% of the observations are censored and the largest observation is censored.

The PL method assumes that censoring is independent of survival times. If this is false, the PL method is inappropriate. For example, a study of drivers' crash probabilities in which drivers remove themselves from the study due to the burden of participating would violate the independence assumption.

Life table analysis, an alternative method similar to the PL method, is described in detail by Lee (1992).
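
As an illustrative sketch of the PL calculation, the following Python function applies Equation 9.9 to a small set of hypothetical durations (the data values are invented for demonstration):

```python
import numpy as np

def kaplan_meier(times, observed):
    """Product-limit (Kaplan-Meier) survival estimate, Equation 9.9.

    times:    array of durations
    observed: 1 if the duration ended (event), 0 if right censored
    """
    times = np.asarray(times, dtype=float)
    observed = np.asarray(observed)
    event_times = np.sort(np.unique(times[observed == 1]))
    surv = 1.0
    estimate = []
    for t in event_times:
        at_risk = np.sum(times >= t)            # subjects surviving to t
        deaths = np.sum((times == t) & (observed == 1))
        surv *= (at_risk - deaths) / at_risk    # conditional survival proportion
        estimate.append((t, surv))
    return estimate

# Hypothetical durations (minutes) with one right-censored observation.
print(kaplan_meier([10, 25, 25, 40, 60], [1, 1, 1, 0, 1]))
```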

9.4 Semiparametric Models

Both semiparametric and fully parametric hazard-based models have been widely cited in the literature. Fully parametric models assume a distribution of duration times (for example, Weibull, exponential, and so on) and also make a parametric assumption on the functional form of the influence of the covariates on the hazard function (usually EXP(βX), as discussed previously). Semiparametric models, in contrast, are more general in that they do not assume a duration time distribution, although they do retain the parametric assumption of the covariate influence.

A semiparametric approach for modeling the hazard function is convenient when little or no knowledge of the functional form of the hazard is available. Such an approach was developed by Cox (1972) and is based on the proportional hazards approach. The Cox proportional hazards model is semiparametric because EXP(βX) is still used as the functional form of the covariate influence. The model is based on the ratio of hazards, so the probability of an observation i exiting a duration at time ti, given that at least one observation exits at time ti, is given as

EXP(βXi) / Σj∈Ri EXP(βXj), (9.10)


where Ri denotes the set of observations, j, with durations greater than or equal to ti. This model is readily estimated using standard maximum likelihood methods. If only one observation completes its duration at each time (no tied data), and no observations are censored, the partial log likelihood is

LL(β) = Σi=1..I [βXi – LN(Σj∈Ri EXP(βXj))]. (9.11)

If no observations are censored and tied data are present, with more than one observation exiting at time ti, the partial log likelihood is the sum of the individual likelihoods of the ni observations that exit at time ti:

LL(β) = Σi=1..I [Σj∈ti βXj – ni LN(Σj∈Ri EXP(βXj))]. (9.12)
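
A minimal sketch of Equation 9.11, assuming hypothetical duration and covariate arrays, might be coded as follows:

```python
import numpy as np

def cox_partial_loglik(beta, times, X):
    """Partial log likelihood of Equation 9.11 (no ties, no censoring).

    times: duration for each observation
    X:     covariate matrix, one row per observation
    """
    eta = X @ beta
    ll = 0.0
    for i in range(len(times)):
        risk_set = times >= times[i]     # R_i: observations with durations >= t_i
        ll += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return ll

# Hypothetical data: four durations and a single covariate.
times = np.array([4.0, 10.0, 25.0, 51.0])
X = np.array([[1.0], [0.0], [1.0], [0.0]])
print(cox_partial_loglik(np.array([0.2]), times, X))
```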

Example 9.1

A survey of 96 Seattle-area commuters was conducted to examine the duration of time that commuters delay their work-to-home trips in an effort to avoid peak-period traffic congestion. All commuters in the sample were observed to delay their work-to-home trip to avoid traffic congestion (see Mannering and Hamed, 1990, for additional details). All delays were observed in their entirety, so there is neither left nor right censoring in the data. Observed commuter delays ranged from 4 min to 4 h, with an average delay of 51.3 min and a standard deviation of 37.5 min.

To determine significant factors that affect the duration of commuters' delay, a Cox proportional hazards model is estimated using the variables shown in Table 9.1. Model estimation results are presented in Table 9.2, which shows that the multiplicative effect of the male indicator variable increases the hazard (see Equation 9.6) and thus shortens the expected commuting delay. The finding that men have shorter delays is not highly significant, though, with a t statistic of 1.07. The ratio of actual travel time, at the time of expected departure, to free-flow (uncongested) travel time decreases the hazard and increases delay. This variable captures the impact of high levels of congestion on increasing overall work-to-home commute departure delay. The greater the distance from home to work, the greater the departure delay, perhaps capturing the time needed for congestion to dissipate. Finally, the greater the resident population in the area (zone) of work, the greater the length of delay. This variable likely reflects the availability of activity opportunities during the delay.

Note from Table 9.2 that a constant is not estimated.


It is not possible to estimate a constant because the Cox proportional hazards model is homogeneous of degree zero in X (see Equation 9.10). Thus, any variable that does not vary across observations cannot be estimated, including the constant term.
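
To interpret the estimates in Table 9.2 through Equation 9.6, note that an indicator parameter enters the hazard multiplicatively; for example, the male indicator implies the hazard is multiplied by EXP(0.246). A one-line sketch:

```python
import numpy as np

# Multiplicative hazard effect of the male indicator in Table 9.2 (Equation 9.6).
print(np.exp(0.246))   # about 1.28: men exit the delay at a roughly 28% higher rate
```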

TABLE 9.1

Variables Available for Work-to-Home Commuter Delay Model

Variable No. | Variable Description
1 | Minutes delayed to avoid congestion
2 | Primary activity performed while delaying: 1 if perform additional work, 2 if engage in nonwork activities, or 3 if do both
3 | Number of times delayed in the past week to avoid congestion
4 | Mode of transportation used on work-to-home commute: 1 if by single-occupancy vehicle, 2 if by carpool, 3 if by van pool, 4 if by bus, 5 if by other
5 | Primary route to work in Seattle area: 1 if Interstate 90, 2 if Interstate 5, 3 if State Route 520, 4 if Interstate 405, 5 if other
6 | In the respondent's opinion, is the home-to-work trip traffic congested: 1 if yes, 0 if no
7 | Commuter age in years: 1 if under 25, 2 if 26 to 30, 3 if 31 to 35, 4 if 36 to 40, 5 if 41 to 45, 6 if 46 to 50, 7 if 51 or older
8 | Respondent's gender: 1 if female, 0 if male
9 | Number of cars in household
10 | Number of children in household
11 | Annual household income (U.S. dollars): 1 if less than 20,000, 2 if 20,000 to 29,999, 3 if 30,000 to 39,999, 4 if 40,000 to 49,999, 5 if 50,000 to 59,999, 6 if over 60,000
12 | Respondent has flexible work hours? 1 if yes, 0 if no
13 | Distance from work to home (in kilometers)
14 | Respondent faces level of service D or worse on work-to-home commute? 1 if yes, 0 if no
15 | Ratio of actual travel time at time of expected departure to free-flow (noncongested) travel time
16 | Population of work zone
17 | Retail employment in work zone
18 | Service employment in work zone
19 | Size of work zone (in hectares)

TABLE 9.2

Cox Proportional Hazard Model Estimates of the Duration of Commuter Work-to-Home Delay to Avoid Congestion

Independent Variable | Estimated Parameter | t Statistic
Male indicator (1 if male commuter, 0 if not) | 0.246 | 1.07
Ratio of actual travel time at time of expected departure to free-flow (noncongested) travel time | –1.070 | –2.59
Distance from home to work in kilometers | –0.028 | –1.73
Resident population of work zone | –0.1761E-04 | –1.58
Number of observations | 96
Log likelihood at zero | –361.56
Log likelihood at convergence | –325.54


9.5 Fully Parametric Models

With fully parametric models, a variety of distributional alternatives for the hazard function have been used with regularity in the literature. These include the gamma, exponential, Weibull, log-logistic, and Gompertz distributions, among others. Table 9.3 shows the names and corresponding hazard functions (with their associated distribution parameters) for a variety of parametric duration models. The choice of any one of these alternatives is justified on theoretical grounds or by statistical evaluation. The choice of a specific distribution has important implications relating not only to the shape of the underlying hazard, but also to the efficiency and potential bias of the estimated parameters.

This chapter looks closely at three distributions: exponential, Weibull, and log-logistic. For additional distributions used in hazard-based analysis, the reader is referred to Kalbfleisch and Prentice (1980) and Cox and Oakes (1984).

The simplest distribution to apply and interpret is the exponential. With parameter λ > 0, the exponential density function is

f(t) = λEXP(–λt) (9.13)

with hazard

TABLE 9.3

Some Commonly Used Hazard Functions for Parametric Duration Models

Name | Hazard Function, h(t)
Compound exponential | h(t) = P/(θ0 + t)
Exponential | h(t) = λ
Exponential with gamma heterogeneity | h(t) = λ/(1 + θλt)
Gompertz | h(t) = λP EXP(λt)
Gompertz–Makeham | h(t) = λ0 + λ1EXP(λ2t)
Log-logistic | h(t) = λP(λt)^(P–1)/[1 + (λt)^P]
Weibull | h(t) = λP(λt)^(P–1)
Weibull with gamma heterogeneity | h(t) = λP(λt)^(P–1)/[1 + θ(λt)^P]


h(t) = λ. (9.14)

Equation 9.14 implies that this distribution's hazard is constant, as illustrated by h4(t) in Figure 9.2. This means that the probability of a duration ending is independent of time and there is no duration dependence. In the case of the time until a vehicle accident occurs, the probability of having an accident stays the same regardless of how long one has gone without having an accident. This is a potentially restrictive assumption because it does not allow for duration dependence.

The Weibull distribution is a more generalized form of the exponential. It allows for positive duration dependence (the hazard is monotonically increasing in duration and the probability of the duration ending increases over time), negative duration dependence (the hazard is monotonically decreasing in duration and the probability of the duration ending decreases over time), or no duration dependence (the hazard is constant in duration and the probability of the duration ending is unchanged over time). With parameters λ > 0 and P > 0, the Weibull distribution has the density function

f(t) = λP(λt)^(P–1) EXP[–(λt)^P], (9.15)

with hazard

h(t) = λP(λt)^(P–1). (9.16)

As indicated in Equation 9.16, if the Weibull parameter P is greater than 1, the hazard is monotone increasing in duration (see h3(t) in Figure 9.2); if P is less than 1, it is monotone decreasing in duration (see h1(t) in Figure 9.2); and if P equals 1, the hazard is constant in duration and reduces to the exponential distribution's hazard, with h(t) = λ (see h4(t) in Figure 9.2). Because the Weibull distribution is a more generalized form of the exponential distribution, it provides a more flexible means of capturing duration dependence. However, it is still limited because it requires the hazard to be monotonic over time. In many applications, a nonmonotonic hazard is theoretically justified.

The log-logistic distribution allows for nonmonotonic hazard functions and is often used as an approximation of the more computationally cumbersome lognormal distribution. The log-logistic with parameters λ > 0 and P > 0 has the density function

f(t) = λP(λt)^(P–1)[1 + (λt)^P]^(–2), (9.17)

and hazard function

h(t) = λP(λt)^(P–1)/[1 + (λt)^P]. (9.18)


The log-logistic's hazard is identical to the Weibull's except for the denominator. Equation 9.18 indicates that if P < 1, then the hazard is monotone decreasing in duration (see h1(t) in Figure 9.2); if P = 1, then the hazard is monotone decreasing in duration from parameter λ; and if P > 1, then the hazard increases in duration from zero to an inflection point, t = (P – 1)^(1/P)/λ, and decreases toward zero thereafter (see h2(t) in Figure 9.2).

Example 9.2

Using the same data as in Example 9.1, we wish to study work-to-home departure delay by estimating exponential, Weibull, and log-logistic proportional hazards models. Model estimation results are presented in Table 9.4. Note that these estimated fully parametric models include a constant term. Unlike in the Cox proportional hazards model, such terms are estimable.

The signs of the parameters are identical to those found earlier in the Cox model. Of particular interest in these models are the duration effects. The exponential model assumes that the hazard function is constant, and thus the probability of a departure delay ending is independent of the time spent delaying. This assumption is tested by looking at the Weibull model estimation results. Recall that the exponential model implicitly constrains P = 1.0. So if the Weibull estimate of P is significantly different from 1.0, the exponential model will not be valid and the hazard function is not constant over time. Table 9.4 shows the Weibull model parameter P is positive (indicating a monotonically increasing hazard) and significantly different

TABLE 9.4

Hazard Model Parameter Estimates of the Duration of Commuter Work-to-Home Delay to Avoid Congestion (t statistics in parentheses)

Independent Variable | Exponential | Weibull | Log-Logistic
Constant | 1.79 (1.02) | 1.74 (2.46) | 1.66 (3.07)
Male indicator (1 if male commuter, 0 if not) | 0.173 (0.46) | 0.161 (1.03) | 0.138 (0.953)
Ratio of actual travel time at time of expected departure to free-flow (noncongested) travel time | –0.825 (–1.04) | –0.907 (–3.00) | –0.752 (–3.02)
Distance from home to work in kilometers | –0.035 (–0.65) | –0.032 (–1.55) | –0.041 (–2.69)
Resident population of work zone | –0.130E-05 (–0.57) | –0.137E-04 (–1.68) | –0.133E-05 (–1.90)
P (distribution parameter) | — | 1.745 (9.97) | 2.80 (10.17)
λ (distribution parameter) | 0.020 (5.67) | 0.018 (14.88) | 0.024 (15.67)
Log likelihood at convergence | –113.75 | –93.80 | –91.88


from 0. To compare exponential and Weibull models, the appropriate test of whether P is significantly different from 1 is given as

t = (P – 1)/S.E.(P) = (1.745 – 1)/0.175 = 4.26. (9.19)

This t statistic shows that P is significantly different from 1.0 and the Weibull model is preferred over the exponential. The resulting hazard function for the Weibull distribution is illustrated in Figure 9.5.

The log-logistic distribution relaxes the monotonicity of the hazard function. Table 9.4 shows that the parameter P is greater than 1, indicating that the hazard increases in duration from zero to an inflection point, t = (P – 1)^(1/P)/λ, and decreases toward 0 thereafter. The value of t is computed to be 51.23 min, the time after which the hazard starts to decrease. The log-logistic hazard for this problem is shown in Figure 9.6.
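
The inflection point is a one-line calculation from the Table 9.4 estimates; a sketch (rounded parameter values, so the result differs slightly from the reported 51.23 min):

```python
# Inflection point of the log-logistic hazard, t = (P - 1)**(1/P)/lam,
# using the Example 9.2 estimates P = 2.80 and lam = 0.024.
P, lam = 2.80, 0.024
print((P - 1) ** (1 / P) / lam)   # about 51 min
```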

FIGURE 9.5 Estimated hazard function for the Weibull model in Example 9.2.

FIGURE 9.6 Estimated hazard function for the log-logistic model in Example 9.2.


9.6 Comparisons of Nonparametric, Semiparametric, and Fully Parametric Models

The choice among nonparametric, semiparametric, and fully parametric methods for estimating survival or duration models is complicated. When there is little information about the underlying distribution, due to the small size of the sample or the lack of a theory that would suggest a specific distribution, a nonparametric approach may be appropriate. Nonparametric methods are a preferred choice when the underlying distribution is not known, whereas parametric methods are more suitable when underlying distributions are known or are theoretically justified (Lee, 1992).

Semiparametric models may also be a good choice when little is known about the underlying hazard distribution. However, there are two drawbacks. First, duration effects are difficult to track. This is problematic if the primary area of interest is how probabilities of duration exits change with respect to duration. Second, a potential loss in efficiency may result. It can be shown that in data where censoring exists and the underlying survival distribution is known, the Cox semiparametric proportional hazards model does not produce efficient parameter estimates. However, a number of studies have shown that the loss in efficiency is typically small (Efron, 1977; Oakes, 1977; Hamed and Mannering, 1990).

Comparing various hazard distributional assumptions for fully parametric models can also be difficult. The relative difference between a Weibull and an exponential model can be approximated by the significance of the Weibull P parameter, which represents the difference between the two distributions (see Equations 9.14 and 9.16). In Example 9.2, it was noted that the Weibull model is superior to the exponential because the Weibull-estimated value of P is significantly different from 1.0. A likelihood ratio test comparing the two models can be conducted using the log likelihoods at convergence. The test statistic is

X² = –2[LL(βe) – LL(βw)], (9.20)

where LL(βe) is the log likelihood at convergence for the exponential distribution and LL(βw) is the log likelihood at convergence for the Weibull distribution. This X² statistic is χ² distributed with 1 degree of freedom (representing the additional parameter estimated, P). In the case of the models discussed in Example 9.2, this results in an X² test statistic of 39.9 [–2(–113.75 – (–93.80))]. With 1 degree of freedom, the confidence level is over 99.99%. Thus we are more than 99.99% confident that the Weibull model provides a better fit than the exponential model. This comparison is made directly because the exponential and Weibull models are derivatives of one another. This is sometimes referred to as nesting, as the exponential is simply a special case of the Weibull (with P constrained to be equal to 1).
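
This computation is simple to reproduce; a sketch using scipy for the χ² tail probability is:

```python
from scipy.stats import chi2

# Likelihood ratio test of Equation 9.20 with the Example 9.2 values.
ll_exponential, ll_weibull = -113.75, -93.80
x2 = -2.0 * (ll_exponential - ll_weibull)     # 39.9
print(x2, chi2.sf(x2, df=1))                  # survival function gives the p-value
```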


The difference between the Weibull and log-logistic models and other distributions is more difficult to test because the models may not be nested. One possible comparison for distributional models that are not nested is to compare likelihood ratio statistics (see Nam and Mannering, 2000). In this case, the likelihood ratio statistic is

X² = –2[LL(0) – LL(βc)], (9.21)

where LL(0) is the initial log likelihood (with all parameters equal to zero), and LL(βc) is the log likelihood at convergence. This statistic is χ² distributed with the degrees of freedom equal to the number of estimated parameters included in the model. One could select the distribution that provides the highest level of significance for this statistic to determine the best-fit distribution.

An informal method suggested by Cox and Oakes (1984) is to use plots of the survival and hazard distributions obtained using nonparametric methods to guide the selection of a parametric distribution. With this method, a visual inspection of the shapes and characteristics (inflection points, slopes) of the survival and hazard curves provides insight into the selection of an appropriate parametric distribution.

9.7 Heterogeneity

When formulating proportional hazards models, an implicit assumption is that the survival function (see Equation 9.6) is homogeneous across observations. Thus, all the variation in durations is assumed to be captured by the covariate vector X. A problem arises when some unobserved factors, not included in X, influence durations. This is referred to as unobserved heterogeneity and can result in a major specification error that can lead one to draw erroneous inferences on the shape of the hazard function, in addition to producing inconsistent parameter estimates (Heckman and Singer, 1984; Gourieroux et al., 1984).

In fully parametric models, the most common approach to account for heterogeneity is to introduce a heterogeneity term designed to capture unobserved effects across the population and to work with the resulting conditional survival function. With a heterogeneity term, w, having a distribution over the population g(w), along with a conditional survival function, S(t|w), the unconditional survival function becomes

S(t) = ∫ S(t|w)g(w)dw. (9.22)

To see how this heterogeneity term is applied, consider a Weibull distribution with gamma heterogeneity (Hui, 1990).


Without loss of generality, w is assumed to be gamma distributed with mean 1 and variance θ = 1/k, so

g(w) = [k^k/Γ(k)] w^(k–1) EXP(–kw). (9.23)

With the Weibull distribution and S(t) = f(t)/h(t), Equations 9.15 and 9.16 give

S(t|w) = EXP[–w(λt)^P]. (9.24)

The unconditional survival distribution can then be written as (with θ = 1/k):

S(t) = ∫_0^∞ S(t|w)g(w)dw = [1 + θ(λt)^P]^(–1/θ), (9.25)

resulting in the hazard function

h(t) = λP(λt)^(P–1)[S(t)]^θ. (9.26)

Note that if θ = 0, heterogeneity is not present because the hazard reduces to Equation 9.16 and the variance of the heterogeneity term w is zero.

Example 9.3

Using the data from Example 9.1, a Weibull model is estimated with gamma heterogeneity and compared with the results of the Weibull model reported in Table 9.4. Model estimation results are presented in Table 9.5. Although there is some variation in the parameter values and corresponding t statistics, of particular concern is the estimate of θ (with its corresponding t statistic of 1.58). The correct comparison of the two models is the likelihood ratio test

X² = –2[LL(βw) – LL(βwh)], (9.27)

where LL(βw) is the log likelihood at convergence for the Weibull model and LL(βwh) is the log likelihood at convergence for the Weibull model with gamma heterogeneity. This results in an X² statistic of 3.84 [–2(–93.80 – (–91.88))]. With 1 degree of freedom, this gives a confidence level of 95%, indicating that we are 95% confident that heterogeneity is present in the underlying Weibull survival process (if the Weibull specification is correct).

The selection of a heterogeneity distribution should not be taken lightly. The consequences of incorrectly specifying g(w) are potentially severe and can result in inconsistent estimates, as demonstrated both theoretically and


empirically by Heckman and Singer (1984). Thus, while the gamma distribution has been a popular approach in accounting for heterogeneity, some caution must be exercised when interpreting estimation results based on the selection of a specific parametric form for heterogeneity. Fortunately, for choosing among distributions, it has been shown that if the correct distribution is used for the underlying hazard function, parameter estimates are not highly sensitive to alternative distributional assumptions of heterogeneity (Kiefer, 1988). Also, Meyer (1990) and Han and Hausman (1990) have shown that if the baseline hazard is semiparametric (a Cox proportional hazards model), the choice of heterogeneity distribution may be less important, with the estimation results being less sensitive to the distribution chosen. In other work, Bhat (1996a) has explored the use of nonparametric heterogeneity corrections in semiparametric and parametric models. The reader is referred to his work for further information on heterogeneity-related issues.

9.8 State Dependence

In duration modeling, state dependence refers to a number of processes that seek to establish a relationship between past duration experiences and current durations. Conceptually, the understanding of how past experience affects current behavior is a key component in modeling that captures important habitual behavior. State dependence is classified into three types.

Type I state dependence is duration dependence, which, as discussed above, is the conditional probability of a duration ending soon, given that

TABLE 9.5

Weibull vs. Weibull with Gamma Heterogeneity Model Parameter Estimates of the Duration of Commuter Work-to-Home Delay to Avoid Congestion (t statistics in parentheses)

Independent Variable | Weibull | Weibull with Gamma Heterogeneity
Constant | 1.74 (2.46) | 1.95 (3.08)
Male indicator (1 if male commuter, 0 if not) | 0.161 (1.03) | 0.216 (1.58)
Ratio of actual travel time at time of expected departure to free-flow (noncongested) travel time | –0.907 (–3.00) | –0.751 (–2.70)
Distance from home to work in kilometers | –0.032 (–1.55) | –0.034 (–1.92)
Resident population of work zone | –0.137E-04 (–1.68) | –0.120E-05 (–1.57)
P (distribution parameter) | 1.745 (9.97) | 2.31 (6.18)
λ (distribution parameter) | 0.018 (14.88) | 0.024 (15.67)
θ (heterogeneity parameter) | — | 0.473 (1.58)
Log likelihood at convergence | –93.80 | –91.88


it has lasted to some known time. Type I state dependence is captured in the shape of the hazard function (see Figure 9.2).

Type II state dependence is occurrence dependence. This refers to the effect that the number of previous durations has on a current duration. For example, suppose interest is focused on modeling the duration of a traveler's time spent shopping. The number of times the traveler previously shopped during the same day may have an influence on the current shopping duration. This is accounted for in the model by including the number of previous shopping durations as a variable in the covariate vector.

Type III state dependence, lagged duration dependence, accounts for the effect that the lengths of previous durations have on a current duration. Returning to the example of the time spent shopping, an individual may have developed habitual behavior with respect to the time spent shopping that would make the length of previous shopping durations an excellent predictor of current shopping duration. This effect would be accounted for in the model's covariate vector by including the length of previous durations as a variable.

When including variables that account for Type II or Type III state duration dependence, the findings can easily be misinterpreted. This misinterpretation arises because unobserved heterogeneity is captured in the parameter estimates of the state dependence variables. To illustrate, suppose income is an important variable for determining shopping duration. If income is omitted from the model, it becomes part of the unobserved heterogeneity. Because it is correlated with occurrence and lagged duration dependence, the parameter estimates of these variables capture both true state dependence and spurious state dependence (unobserved heterogeneity). Heterogeneity corrections (as in Equation 9.22) in the presence of occurrence and lagged duration variables do not necessarily help untangle true state dependence from heterogeneity (see Heckman and Borjas, 1980). One simple solution is to instrument the state variables by regressing them against exogenous covariates and using the regression-predicted values as variables in the duration model. This procedure produces consistent but not efficient estimators. For additional information, the reader is referred to the extensive empirical study of state dependence and heterogeneity provided in Kim and Mannering (1997).

9.9 Time-Varying Covariates

Covariates that change over a duration are problematic. For example, suppose we are studying the duration between an individual's vehicular accidents. During the time between accidents it is possible that the individual would change the type of vehicle driven. This, in turn, could affect the probability of having a vehicular accident. If the covariate vector, X, changes over the duration being studied, parameter estimates may be


biased. Time-varying covariates are difficult to account for, but are incorporated in hazard models by allowing the covariate vector to be a function of time. The hazard and likelihood functions are then appropriately rewritten. The likelihood function becomes more complex, but estimation is typically simplified because time-varying covariates usually make only a few discrete changes over the duration being studied. However, the interpretation of findings and, specifically, duration effects is much more difficult. The interested reader should see Peterson (1976), Cox and Oakes (1984), and Greene (2000) for additional information on the problems and solutions associated with time-varying covariates.

9.10 Discrete-Time Hazard Models

An alternative to standard continuous-time hazard models is to use a discrete-time approach. In discrete-time models, time is segmented into uniform discrete categories and exit probabilities in each time period are estimated using logistic regression or another discrete outcome modeling approach (see Chapter 11). For example, for developing a model of commuters' delay in work-to-home departure to avoid traffic congestion, one could categorize the observed continuous values into 30-min time intervals. Then, a logistic regression could be estimated to determine the exit probabilities (probability of leaving work for home) in each of these time periods. This discrete approach allows for a very general form of the hazard function (and duration effects) that can change from one time interval to the next, as shown in Figure 9.7. However, the hazard is implicitly assumed to be constant during each of the discrete time intervals.

Discrete hazard models have the obvious drawback of inefficient parameter estimators because of the information lost in discretizing continuous data. However, when the discrete time intervals are short and the probability of exiting in any discrete time interval is small, studies have shown that discrete hazard models produce results very close to those of continuous-time hazard models (Green and Symons, 1983; Abbot, 1985; Ingram and Kleinman, 1989).

Although discrete-time hazard models are inefficient, they provide at least two advantages over their continuous-time counterparts. First, time-varying covariates are easily handled by incorporating new, time-varying values in the discrete time intervals (there is still the assumption that they are constant over the discrete time interval). Second, tied data (groups of observations exiting a duration at the same time), which, as discussed above, are problematic when using continuous-data approaches, are not a problem with discrete hazard models. This is because the grouped data fall within a single discrete time interval and the subsequent lack of information on "exact" exit times does not complicate the estimation. Discrete-time hazard models continue to be a viable alternative to continuous-time approaches because problems associated with tied


data and time-varying covariates are sometimes more serious than the efficiency losses encountered when discretizing the data.

Advances continue to be made by a number of researchers in the field of discrete-time hazard models. For example, Han and Hausman (1990) developed a generalized discrete-time hazard approach with gamma heterogeneity that demonstrates the continuing promise of the approach.

9.11 Competing Risk Models

The normal assumption for hazard-model analyses is that durations end as a result of a single event. For example, in the commuters' delay example discussed previously, the delay duration ends when the commuter leaves for home. However, multiple outcomes that produce different durations can exist. For example, if the commuter delay problem is generalized to consider any delay in the trip from work to avoid traffic congestion (not just the work-to-home trip option), then multiple duration-ending outcomes would exist. These are referred to as competing risks and could include work-to-home, work-to-shopping, and work-to-recreation trips as options that would end commuters' delay at work.

The most restrictive approach to competing risk models is to assume independence among the possible risk outcomes. If this assumption is made, separate hazard models for each possible outcome are estimated (see Katz, 1986; Gilbert, 1992). This approach is limiting because it ignores the interdependence among risks that is often present and of critical interest in the

FIGURE 9.7 Illustration of hazard function using discrete data analysis.


analysis of duration data. Accounting for interdependence among competing risks is not easily done, due to the resulting complexity in the likelihood function. However, Diamond and Hausman (1984) have developed a model based on strict parametric assumptions about the interdependence of risks. In subsequent work, Han and Hausman (1990) extend this model by providing a more flexible form of parametric interdependence that also permits one to test statistically for independence among competing risks.


Part III

Count and Discrete Dependent Variable Models


10 Count Data Models

Count data consist of non-negative integer values and are encountered frequently in the modeling of transportation-related phenomena. Examples of count data variables in transportation include the number of driver route changes per week, the number of trip departure changes per week, drivers' frequency of use of intelligent transportation systems (ITS) technologies over some time period, the number of vehicles waiting in a queue, and the number of accidents observed on road segments per year.

A common mistake is to model count data as continuous data by applying standard least squares regression. This is not correct because regression models yield predicted values that are non-integers and can also predict values that are negative, both of which are inconsistent with count data. These limitations make standard regression analysis inappropriate for modeling count data without modifying dependent variables.

Count data are properly modeled by using a number of methods, the most popular of which are Poisson and negative binomial regression models. Poisson regression is the more popular of the two and is applied to a wide range of transportation count data. The Poisson distribution approximates rare-event count data, such as accident occurrences, failures in manufacturing or processing, and the number of vehicles waiting in a queue. One requirement of the Poisson distribution is that the mean of the count process equals its variance. When the variance is significantly larger than the mean, the data are said to be overdispersed. There are numerous reasons for overdispersion, some of which are discussed later in this chapter. In many cases, overdispersed count data are successfully modeled using a negative binomial model.

10.1 Poisson Regression Model

To help illustrate the principal elements of a Poisson regression model, consider the number of accidents occurring per year at various intersections


in a city. In a Poisson regression model, the probability of intersection i having yi accidents per year (where yi is a non-negative integer) is given by

P(yi) = EXP(–λi)λi^yi / yi!, (10.1)

where P(yi) is the probability of intersection i having yi accidents per year and λi is the Poisson parameter for intersection i, which is equal to the expected number of accidents per year at intersection i, E[yi]. Poisson regression models are estimated by specifying the Poisson parameter λi (the expected number of events per period) as a function of explanatory variables. For the intersection accident example, explanatory variables might include the geometric conditions of the intersections, signalization, pavement types, visibility, and so on. The most common relationship between explanatory variables and the Poisson parameter is the log-linear model,

λi = EXP(βXi) or, equivalently, LN(λi) = βXi, (10.2)

where Xi is a vector of explanatory variables and β is a vector of estimable parameters. In this formulation, the expected number of events per period is given by E[yi] = λi = EXP(βXi). This model is estimable by standard maximum likelihood methods, with the likelihood function given as

L(β) = Πi EXP[–EXP(βXi)][EXP(βXi)]^yi / yi!. (10.3)

The log of the likelihood function is simpler to manipulate and more appropriate for estimation,

LL(β) = Σi=1..n [–EXP(βXi) + yiβXi – LN(yi!)]. (10.4)
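
Equation 10.4 can be coded directly; the arrays in the Python sketch below are hypothetical, and the LN(y!) term is computed with the gammaln function:

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglik(beta, X, y):
    """Poisson log likelihood of Equation 10.4; LN(y!) computed via gammaln."""
    xb = X @ beta
    return np.sum(-np.exp(xb) + y * xb - gammaln(y + 1.0))

# Tiny hypothetical example: a constant and one covariate.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([0.0, 2.0, 4.0])
print(poisson_loglik(np.array([0.1, 0.6]), X, y))
```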

As with most statistical models, the estimated parameters are used to make inferences about the unknown population characteristics thought to impact the count process. Maximum likelihood estimates produce Poisson parameters that are consistent, asymptotically normal, and asymptotically efficient.

To provide some insight into the implications of parameter estimation results, elasticities are computed to determine the marginal effects of the independent variables. Elasticities provide an estimate of the impact of a variable on the expected frequency and are interpreted as the effect of a 1% change in the variable on the expected frequency λi. For example, an elasticity of –1.32


is interpreted to mean that a 1% increase in the variable reduces the expected frequency by 1.32%. Elasticities are the correct way of evaluating the relative impact of each variable in the model. The elasticity of frequency λi is defined as

E_xik^λi = (∂λi/∂xik)(xik/λi) = βk xik, (10.5)

where E represents the elasticity, xik is the value of the kth independent variable for observation i, βk is the estimated parameter for the kth independent variable, and λi is the expected frequency for observation i. Note that elasticities are computed for each observation i. It is common to report a single elasticity as the average elasticity over all i.
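
A sketch of this averaging, using one parameter from Table 10.2 and hypothetical covariate values, is:

```python
import numpy as np

# Average elasticity of Equation 10.5: beta_k * x_ik averaged over i.
# beta_k is the major-road AADT estimate from Table 10.2; the AADT
# values themselves are hypothetical stand-ins for the sample.
beta_k = 0.0000812
aadt1 = np.array([12870.0, 8000.0, 20000.0, 15000.0])
print(np.mean(beta_k * aadt1))
```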

The elasticity in Equation 10.5 is only appropriate for continuous variables such as highway lane width, distance from outside shoulder edge to roadside features, and vertical curve length. It is not valid for noncontinuous variables such as indicator variables that take on values of 0 or 1. For indicator variables, a pseudo-elasticity is computed to estimate an approximate elasticity of the variables. The pseudo-elasticity gives the incremental change in frequency caused by changes in the indicator variables. The pseudo-elasticity, for indicator variables, is computed as

E_xik^λi = [EXP(βk) – 1]/EXP(βk). (10.6)

Also, note that the Poisson probabilities for observation i are calculated using the following recursive formulas:

P0,i = EXP(–λi)
Pj,i = (λi/j)Pj–1,i, j = 1, 2, 3, ...; i = 1, 2, ..., n, (10.7)

where P0,i is the probability that observation i experiences 0 events in the specified observation period.
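
The recursion of Equation 10.7 is straightforward to implement; a sketch, seeding λ with the sample mean accident count from Table 10.1 for illustration, is:

```python
import math

def poisson_probs(lam, j_max):
    """Poisson probabilities via the recursion of Equation 10.7."""
    probs = [math.exp(-lam)]                 # P_0 = EXP(-lam)
    for j in range(1, j_max + 1):
        probs.append(probs[-1] * lam / j)    # P_j = (lam/j) * P_{j-1}
    return probs

print(poisson_probs(2.62, 5))
```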

10.2 Poisson Regression Model Goodness-of-Fit Measures

There are numerous goodness-of-fit (GOF) statistics used to assess the fit of the Poisson regression model to observed data. As mentioned in previous chapters, when selecting among alternative models, GOF statistics should be considered along with model plausibility and agreement with expectations.


The likelihood ratio test is a common test used to assess two competing models. It provides evidence in support of one model, usually a full or complete model, over another competing model that is restricted by having a reduced number of model parameters. The likelihood ratio test statistic is

X² = –2[LL(βR) – LL(βU)], (10.8)

where LL(βR) is the log likelihood at convergence of the "restricted" model (sometimes considered to have all parameters in β equal to 0, or just to include the constant term, to test overall fit of the model), and LL(βU) is the log likelihood at convergence of the unrestricted model. The X² statistic is χ² distributed with the degrees of freedom equal to the difference in the numbers of parameters in the restricted and unrestricted models (the difference in the number of parameters in the βR and the βU parameter vectors).

The sum of model deviances, G², is equal to zero for a model with perfect fit. Note, however, that because the observed yi is an integer while the predicted expected value λ̂i is continuous, a G² equal to zero is a theoretical lower bound. This statistic is given as

G² = 2 Σi=1..n yi LN(yi/λ̂i). (10.9)

An equivalent measure to R² in ordinary least squares linear regression is not available for a Poisson regression model, due to the nonlinearity of the conditional mean (E[y|X]) and heteroscedasticity in the regression. A similar statistic is based on standardized residuals,

R²p = 1 – Σi=1..n [(yi – λ̂i)/√λ̂i]² / Σi=1..n [(yi – ȳ)/√ȳ]², (10.10)

where the numerator is similar to a sum of squared errors and the denominator is similar to a total sum of squares.

Another measure of overall model fit is the ρ² statistic. The ρ² statistic is

ρ² = 1 – LL(β)/LL(0), (10.11)

where LL(β) is the log likelihood at convergence with parameter vector β and LL(0) is the initial log likelihood (with all parameters set to zero).


The perfect model would have a likelihood function equal to one (all selected alternative outcomes would be predicted by the model with probability one, and the product of these across the observations would also be one) and the log likelihood would be zero, giving a ρ² of one (see Equation 10.11). Thus the ρ² statistic is between zero and one, and the closer it is to one, the more variance the estimated model is explaining. Additional GOF measures for the Poisson regression model are found in Cameron and Windmeijer (1993), Gurmu and Trivedi (1994), and Greene (1995b).
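
The GOF measures of Equations 10.9 and 10.10 can be sketched as follows; the observed counts and fitted values are hypothetical, and terms with yi = 0 drop out of G² because yi LN(yi/λ̂i) tends to 0:

```python
import numpy as np

# Sketch of Equations 10.9 and 10.10 with hypothetical y and fitted lam_hat.
y = np.array([0.0, 2.0, 5.0, 1.0, 3.0])
lam_hat = np.array([0.8, 1.9, 4.2, 1.5, 2.6])

mask = y > 0                                                  # y*LN(y/lam) -> 0 as y -> 0
g2 = 2.0 * np.sum(y[mask] * np.log(y[mask] / lam_hat[mask]))  # Equation 10.9
sse = np.sum((y - lam_hat) ** 2 / lam_hat)                    # standardized residuals
sst = np.sum((y - y.mean()) ** 2 / y.mean())
rp2 = 1.0 - sse / sst                                         # Equation 10.10
print(g2, rp2)
```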

Example 10.1

Accident data from California (1993 to 1998) and Michigan (1993 to 1997) were collected (Vogt and Bared, 1998; Vogt, 1999). The data represent a culled data set from the original studies, which included data from four states across numerous time periods and over five different intersection types. A reduced set of explanatory variables is used for injury accidents on three-legged stop-controlled intersections with two lanes on the minor and four lanes on the major road. The accident data are thought to be approximately Poisson or negative binomial distributed, as suggested by previous studies on the subject (Miaou and Lum, 1993; Miaou, 1994; Shankar et al., 1995; Poch and Mannering, 1996; Milton and Mannering, 1998; and Harwood et al., 2000). The variables in the study are summarized in Table 10.1.

Table 10.2 shows the parameter estimates of a Poisson regression estimated on the accident data. This model contains a constant and four

TABLE 10.1

Summary of Variables in California and Michigan Accident Data

Variable Abbreviation | Variable Description | Maximum/Minimum Values | Mean of Observations | Standard Deviation of Observations
STATE | Indicator variable for state: 0 = California; 1 = Michigan | 1/0 | 0.29 | 0.45
ACCIDENT | Count of injury accidents over observation period | 13/0 | 2.62 | 3.36
AADT1 | Average annual daily traffic on major road | 33058/2367 | 12870 | 6798
AADT2 | Average annual daily traffic on minor road | 3001/15 | 596 | 679
MEDIAN | Median width on major road in feet | 36/0 | 3.74 | 6.06
DRIVE | Number of driveways within 250 ft of intersection center | 15/0 | 3.10 | 3.90


variables: two average annual daily traffic (AADT) variables, median width, and number of driveways. The mainline AADT appears to have a smaller influence than the minor road AADT, contrary to what is expected. Also, as median width increases, accidents decrease. Finally, the number of driveways close to the intersection increases the number of intersection injury accidents. The signs of the estimated parameters are in line with expectation.

The mathematical expression for this Poisson regression model is as follows:

E[yi] = λi = EXP(βXi) = EXP[–0.83 + 0.00008(AADT1i) + 0.0005(AADT2i) – 0.06(MEDIANi) + 0.07(DRIVEi)].

The model parameters are additive in the exponent or multiplicative on the expected value of yi. As in a linear regression model, standard errors of the estimated parameters are provided, along with approximate t values and p-values associated with the null hypothesis of zero effect. In this case, the results show all estimated model parameters to be statistically significant beyond the 0.01 level of significance.
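
A model of this form can be estimated with standard software; the sketch below uses the statsmodels Poisson class on synthetic placeholder data (the arrays are randomly generated stand-ins, not the California and Michigan observations):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic placeholder data standing in for the 84 intersections.
rng = np.random.default_rng(0)
n = 84
aadt1 = rng.uniform(2367, 33058, n)
aadt2 = rng.uniform(15, 3001, n)
median = rng.uniform(0, 36, n)
drive = rng.integers(0, 16, n).astype(float)
y = rng.poisson(2.62, n)

X = sm.add_constant(np.column_stack([aadt1, aadt2, median, drive]))
fit = sm.Poisson(y, X).fit()     # maximum likelihood estimation of Equation 10.4
print(fit.params)
```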

Example 10.2

Inspection of the output shown in Table 10.2 reveals numerous properties of the fitted model. The value of the log-likelihood function for the fitted model is –169.25, whereas the restricted log likelihood is –246.18. The restricted or reduced log likelihood is associated with a model with the constant term only. A likelihood ratio test comparing the fitted model

TABLE 10.2

Poisson Regression of Injury Accident Data

Independent Variable | Estimated Parameter | t Statistic
Constant | –0.826 | –3.57
Average annual daily traffic on major road | 0.0000812 | 6.90
Average annual daily traffic on minor road | 0.000550 | 7.38
Median width in feet | –0.0600 | –2.73
Number of driveways within 250 ft of intersection | 0.0748 | 4.54
Number of observations | 84
Restricted log likelihood (constant term only) | –246.18
Log likelihood at convergence | –169.25
Chi-squared (and associated p-value) | 153.85 (<0.0000001)
R²p | 0.4792
G² | 176.5


and the reduced model results in X² = 153.85, which is sufficient to reject the fit of the reduced model. Thus it is very unlikely (p-value less than 0.0000001) that randomness alone would produce the observed decrease in the log-likelihood function.

Note that G² is 186.48, which is only relevant in comparison to other competing models. The R²p is 0.48, which again serves as a comparison to competing models. A model with a higher X², lower G², and higher R²p is sought, in addition to a more appealing set of individual predictor variables, model specification, and agreement with expectation and theory.

Table 10.3 shows the average elasticities computed for the variables in the model shown in Table 10.2. These elasticities are obtained by enumerating through the sample (applying Equation 10.5 to all observations) and computing the average of these elasticities.

Poisson regression is a powerful analysis tool, but, as with all statistical methods, it is used inappropriately if its limitations are not fully understood. There are three common analysis errors (Lee and Mannering, 2002). The first is failure to recognize that the data are truncated. The second is failure to recognize that the mean and variance are not equal, as required by the Poisson distribution. The third is failure to recognize that the data contain a preponderance of zeros. These limitations and their remedies are now discussed.

10.3 Truncated Poisson Regression Model

Truncation of data can occur in the routine collection of transportation data. For example, if the number of times per week an in-vehicle navigation system is used on the morning commute to work is recorded during weekdays, the data are right truncated at 5, the maximum number of uses in any given week. Estimating a Poisson regression model without accounting for this truncation will result in biased estimates of the parameter vector β, and

TABLE 10.3

Average Elasticities of the Poisson Regression Model Shown in Table 10.2

Independent Variable | Elasticity
Average annual daily traffic on major road | 1.045
Average annual daily traffic on minor road | 0.327
Median width in feet | –0.228
Number of driveways within 250 ft of intersection | 0.232


erroneous inferences will be drawn. Fortunately, the Poisson model is adapted easily to account for such truncation. The right-truncated Poisson model is written as (see Johnson and Kotz, 1969)

P(yi) = [λi^yi / yi!] / [Σmi=0..r (λi^mi / mi!)], (10.12)

where P(yi) is the probability of commuter i using the system yi times per week; λi is the Poisson parameter for commuter i; mi is the number of uses per week; and r is the right truncation (in this case, 5 times per week). A driver decision-making example of a right-truncated Poisson regression is provided by Mannering and Hamed (1990a) in their study of weekly departure delays.
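
A sketch of Equation 10.12, with the truncation point r = 5 of this example and an assumed Poisson parameter, is:

```python
from math import factorial

def right_truncated_poisson(y, lam, r=5):
    """Right-truncated Poisson probability of Equation 10.12."""
    norm = sum(lam ** m / factorial(m) for m in range(r + 1))
    return (lam ** y / factorial(y)) / norm

# With r = 5, probabilities over 0..5 uses per week sum to one by construction.
print(sum(right_truncated_poisson(y, lam=2.0) for y in range(6)))
```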

10.4 Negative Binomial Regression Model

A common analysis error results from failing to satisfy the property of the Poisson distribution that restricts the mean and variance to be equal, E[yi] = VAR[yi]. If this equality does not hold, the data are said to be underdispersed (E[yi] > VAR[yi]) or overdispersed (E[yi] < VAR[yi]), and the parameter vector β is biased if corrective measures are not taken. Overdispersion can arise for a variety of reasons, depending on the phenomenon under investigation (for additional discussion and examples, see Karlaftis and Tarko, 1998). The primary reason in many studies is that variables influencing the Poisson rate across observations have been omitted from the regression.

The negative binomial model is derived by rewriting Equation 10.2 such that, for each observation i,

\lambda_i = \text{EXP}(\beta X_i + \varepsilon_i),  (10.13)

where EXP(εi) is a gamma-distributed error term with mean 1 and variance α². The addition of this term allows the variance to differ from the mean as below:

VAR[yi] = E[yi][1 + αE[yi]] = E[yi] + αE[yi]². (10.14)

The Poisson regression model is regarded as a limiting model of the negative binomial regression model as α approaches zero, which means that the selection between these two models is dependent on the value of α. The parameter α is often referred to as the overdispersion parameter. The negative binomial distribution has the form


P(y_i) = \frac{\Gamma((1/\alpha) + y_i)}{\Gamma(1/\alpha)\, y_i!} \left(\frac{1/\alpha}{(1/\alpha) + \lambda_i}\right)^{1/\alpha} \left(\frac{\lambda_i}{(1/\alpha) + \lambda_i}\right)^{y_i},  (10.15)

where Γ(·) is a gamma function. This results in the likelihood function

L(\lambda_i) = \prod_i \frac{\Gamma((1/\alpha) + y_i)}{\Gamma(1/\alpha)\, y_i!} \left(\frac{1/\alpha}{(1/\alpha) + \lambda_i}\right)^{1/\alpha} \left(\frac{\lambda_i}{(1/\alpha) + \lambda_i}\right)^{y_i}.  (10.16)

When the data are overdispersed, the estimated variance term is larger than under a true Poisson process. As overdispersion becomes larger, so does the estimated variance, and consequently all of the standard errors of parameter estimates become inflated.
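To make Equations 10.15 and 10.16 concrete, the sketch below (an illustration under the parameterization above, not the authors' code) evaluates the negative binomial log likelihood with log-gamma terms for numerical stability:

```python
import numpy as np
from scipy.special import gammaln

def nb_log_likelihood(y, lam, alpha):
    """Log of the likelihood in Equation 10.16; theta = 1/alpha is the
    term appearing throughout Equation 10.15."""
    y, lam = np.asarray(y, float), np.asarray(lam, float)
    theta = 1.0 / alpha
    ll = (gammaln(theta + y) - gammaln(theta) - gammaln(y + 1.0)
          + theta * np.log(theta / (theta + lam))
          + y * np.log(lam / (theta + lam)))
    return ll.sum()

# Hypothetical counts and fitted Poisson rates
print(nb_log_likelihood([0, 2, 1], [0.5, 1.2, 0.8], alpha=0.5))
```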

A test for overdispersion is provided by Cameron and Trivedi (1990), based on the assumption that, under the Poisson model, (yi − E[yi])² − E[yi] has mean zero, where E[yi] is the predicted count λ̂i. Thus, null and alternative hypotheses are generated, such that

H_0: \text{VAR}[y_i] = E[y_i]
H_A: \text{VAR}[y_i] = E[y_i] + \alpha g(E[y_i]),  (10.17)

where g(·) is a function of the predicted counts that is most often given the values g(E[yi]) = E[yi] or g(E[yi]) = E[yi]². To conduct this test, a simple linear regression is estimated in which Zi is regressed on Wi, where

Z_i = \frac{(y_i - E[y_i])^2 - y_i}{E[y_i]\sqrt{2}}, \qquad W_i = \frac{g(E[y_i])}{E[y_i]\sqrt{2}}.  (10.18)

After running the regression (Zi = bWi) with g(E[yi]) = E[yi] and then with g(E[yi]) = E[yi]², if b is statistically significant in either case, then H0 is rejected for the associated function g.
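The mechanics of the test reduce to one auxiliary regression per choice of g(·). The sketch below is illustrative (the function name and the example counts and fitted means are hypothetical):

```python
import numpy as np

def overdispersion_test(y, mu, g="mean"):
    """Cameron-Trivedi test of Equations 10.17 and 10.18: regress Z_i
    on W_i with no intercept and return the slope b and its t statistic.
    mu holds the Poisson-predicted counts E[y_i]."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    z = ((y - mu) ** 2 - y) / (mu * np.sqrt(2.0))
    w = (mu if g == "mean" else mu ** 2) / (mu * np.sqrt(2.0))
    b = (w @ z) / (w @ w)                     # no-intercept OLS slope
    resid = z - b * w
    se = np.sqrt((resid @ resid) / (len(z) - 1) / (w @ w))
    return b, b / se

b, t = overdispersion_test([0, 3, 1, 7, 2], [1.1, 2.0, 1.4, 3.2, 1.8])
print(b, t)   # a significant b rejects H0 for the chosen g
```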

Example 10.3

Consider again the Poisson regression model of injury accidents estimated previously. Even though the model presented in Example 10.2 appears to fit the data fairly well, overdispersion might be expected. Using the regression test proposed by Cameron and Trivedi, with


g(E[yi]) = E[yi], b = 1.28 with a t statistic of 2.75, and with g(E[yi]) = E[yi]², b = 0.28 with a t statistic of 2.07. Based on the model output, both cases are statistically significant at the 5% level of significance (95% level of confidence). This result suggests that randomness alone does not satisfactorily explain the magnitude of the overdispersion, and a Poisson model is rejected in favor of a negative binomial model.

Example 10.4

With evidence that overdispersion is present, a negative binomial model is estimated using the accident data. The results of the estimated negative binomial regression model are shown in Table 10.4.

As with the Poisson regression model, the signs of the estimated parameters are as expected and are significant. In addition, the overdispersion parameter α is statistically significant, confirming that the variance is larger than the mean. The restricted log-likelihood test suggests that the fitted model is better than a model with only the constant term.

10.5 Zero-Inflated Poisson and Negative Binomial Regression Models

There are certain phenomena where an observation of zero events during the observation period can arise from two qualitatively different conditions. One condition may result from simply failing to observe an event during the observation period. Another, qualitatively different condition may result from an inability ever to experience an event. Consider the following example. A transportation survey asks how many times a commuter has taken

TABLE 10.4

Negative Binomial Regression of Injury Accident Data

Independent Variable                                   Estimated Parameter   t Statistic
Constant                                                    –0.931             –2.37
Average annual daily traffic on major road                   0.0000900          3.47
Average annual daily traffic on minor road                   0.000610           3.09
Median width in feet                                        –0.0670            –1.99
Number of driveways within 250 ft of intersection            0.0632             2.24
Overdispersion parameter, α                                  0.516              3.09
Number of observations                                      84
Restricted log likelihood (constant term only)            –169.25
Log likelihood at convergence                             –153.28
Chi-squared (and associated p-value)                        31.95 (<0.0000001)


mass transit to work during the past week. An observed zero could arise in two distinct ways. First, last week the commuter may have opted to take the van pool instead of mass transit. Alternatively, the commuter may never take transit, as a result of other commitments on the way to and from the place of employment. Thus two states are present, one a normal count-process state and the other a zero-count state.

At times what constitutes a zero-count state may be less clear. Consider vehicle accidents occurring per year on 1-km sections of highway. For straight sections of roadway with wide lanes, low traffic volumes, and no roadside objects, the likelihood of a vehicle accident occurring may be extremely small, but still present because an extreme human error could cause an accident. These sections may be considered in a zero-accident state because the likelihood of an accident is so small (perhaps the expectation would be that a reported accident would occur once in a 100-year period). Thus, the zero-count state may refer to situations where the likelihood of an event occurring is extremely rare in comparison to the normal-count state, where event occurrence is inevitable and follows some known count process (see Lambert, 1992). Two aspects of this nonqualitative distinction of the zero state are noteworthy. First, there is a preponderance of zeros in the data — more than would be expected under a Poisson process. Second, a sampling unit is not required to be in the zero or near-zero state into perpetuity, and can move from the zero or near-zero state to the normal-count state with positive probability. Thus, the zero or near-zero state reflects one of negligible probability compared to the normal state.

Data obtained from two-state regimes (normal-count and zero-count states) often suffer from overdispersion if considered as part of a single, normal-count state, because the number of zeros is inflated by the zero-count state. Example applications in transportation are found in Miaou (1994) and Shankar et al. (1997). It is common not to know whether an observation is in the zero state, so the statistical analysis process must uncover the separation of the two states as part of the model estimation process. Models that account for this dual-state system are referred to as zero-inflated models (see Mullahy, 1986; Lambert, 1992; Greene, 2000).

To address phenomena with zero-inflated counting processes, the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) regression models have been developed. The ZIP model assumes that the events, Y = (y1, y2, …, yn), are independent and the model is

y_i = 0 \quad \text{with probability } p_i + (1 - p_i)\,\text{EXP}(-\lambda_i)
y_i = y \quad \text{with probability } (1 - p_i)\,\frac{\text{EXP}(-\lambda_i)\,\lambda_i^{y}}{y!}, \quad y = 1, 2, 3, \ldots  (10.18)


where y is the number of events per period. Maximum likelihood estimates are used to estimate the parameters of a ZIP regression model, and confidence intervals are constructed by likelihood ratio tests.
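A minimal sketch of the two-state probabilities in the equation above (the function and the example values are hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def zip_probability(y, lam, p):
    """ZIP probability of count y: p is the zero-state probability p_i
    and lam the Poisson parameter lambda_i of the normal-count state."""
    poisson = np.exp(-lam + y * np.log(lam) - gammaln(y + 1.0))
    return p * (y == 0) + (1.0 - p) * poisson

print(zip_probability(0, lam=1.5, p=0.3))   # inflated zero probability
print(zip_probability(2, lam=1.5, p=0.3))
```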

The ZINB regression model follows a similar formulation with events, Y = (y1, y2, …, yn), independent and

y_i = 0 \quad \text{with probability } p_i + (1 - p_i)\,u_i^{1/\alpha}
y_i = y \quad \text{with probability } (1 - p_i)\,\frac{\Gamma((1/\alpha) + y)}{\Gamma(1/\alpha)\, y!}\, u_i^{1/\alpha} (1 - u_i)^{y}, \quad y = 1, 2, 3, \ldots  (10.19)

where u_i = (1/α)/[(1/α) + λ_i]. Maximum likelihood methods are again used to estimate the parameters of a ZINB regression model.

Zero-inflated models imply that the underlying data-generating process has a splitting regime that provides for two types of zeros. The splitting process is assumed to follow a logit (logistic) or probit (normal) probability process, or other probability processes. A point to remember is that there must be underlying justification to believe the splitting process exists (resulting in two distinct states) prior to fitting this type of statistical model. There should be a basis for believing that part of the process is in a zero-count state.

To test the appropriateness of using a zero-inflated model rather than a traditional model, Vuong (1989) proposed a test statistic for non-nested models that is well suited for situations where the distributions (Poisson or negative binomial) are specified. The statistic is calculated as (for each observation i)

m_i = \text{LN}\!\left[\frac{f_1(y_i \mid X_i)}{f_2(y_i \mid X_i)}\right],  (10.20)

where f1(yi|Xi) is the probability density function of model 1, and f2(yi|Xi) is the probability density function of model 2. Using this, Vuong's statistic for testing the non-nested hypothesis of model 1 vs. model 2 is (Greene, 2000; Shankar et al., 1997)


V = \frac{\sqrt{n}\,\bar{m}}{S_m},  (10.21)

where m̄ is the mean of the mi over the sample,

\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i,

Sm is the standard deviation of the mi, and n is the sample size. Vuong's statistic is asymptotically standard normal distributed (to be compared to z-values), and if |V| is less than Vcritical (1.96 for a 95% confidence level), the test does not support the selection of one model over another. Large positive values of V greater than Vcritical favor model 1 over model 2, whereas large negative values support model 2. For example, if comparing negative binomial alternatives, one would let f1(·) be the density function of the ZINB and f2(·) be the density function of the negative binomial model. In this case, assuming a 95% critical confidence level, if V > 1.96 the statistic favors the ZINB, whereas a value of V < –1.96 favors the negative binomial. Values in between mean that the test is inconclusive. The Vuong test can also be applied in an identical manner to compare ZIP and Poisson models.
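Given per-observation densities from the two competing models, Equations 10.20 and 10.21 reduce to a few lines of arithmetic. The sketch below uses hypothetical density values and the sample standard deviation for Sm:

```python
import numpy as np

def vuong_statistic(f1, f2):
    """Vuong statistic: m_i = LN[f1(y_i|X_i)/f2(y_i|X_i)] per Equation
    10.20, combined as sqrt(n) * mean(m) / S_m per Equation 10.21."""
    m = np.log(np.asarray(f1, float) / np.asarray(f2, float))
    return np.sqrt(len(m)) * m.mean() / m.std(ddof=1)

# Hypothetical densities; compare the result with +/-1.96 at 95%
f1 = [0.12, 0.30, 0.25, 0.08]   # e.g., ZINB densities
f2 = [0.10, 0.28, 0.27, 0.09]   # e.g., negative binomial densities
print(vuong_statistic(f1, f2))
```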

Because overdispersion will almost always include excess zeros, it is not always easy to determine whether excess zeros arise from true overdispersion or from an underlying splitting regime. This could lead one to erroneously choose a negative binomial model when the correct model may be a ZIP. For example, recall from Equation 10.14 that the simple negative binomial gives

VAR[yi] = E[yi][1 + αE[yi]]. (10.22)

For the ZIP model it can be shown that

\text{VAR}[y_i] = E[y_i]\left[1 + \frac{p_i}{1 - p_i}\,E[y_i]\right].  (10.23)

Thus, the term pi/(1 – pi) could be erroneously interpreted as α. For guidance in unraveling this problem, consider the Vuong statistic comparing the ZINB and negative binomial models (with f1(·) the density function of the ZINB and f2(·) the density function of the negative binomial in Equation 10.20). Shankar et al. (1997) provide model-selection guidelines for this case based on possible values of the Vuong test and overdispersion parameter


(α) t statistics. These guidelines, for the 95% confidence level, are presented in Table 10.5. Again, it must be stressed that great care must be taken in the selection of models. If there is not a compelling reason to suspect that two states are present, the use of a zero-inflated model may simply be capturing model misspecification that could result from factors such as unobserved effects (heterogeneity) in the data.

Example 10.5

To determine whether a zero-inflated process is appropriate for the accident data used in previous examples, one must believe that three-legged stop-controlled intersections with two lanes on the minor and four lanes on the major road are capable of being perfectly safe with respect to injury accidents. If convinced that perfectly safe intersections may exist, a zero-inflated negative binomial model may be appropriate. The results from such a model are shown in Table 10.6.

The first noteworthy observation is that the negative binomial model without zero inflation appears to account for the number of observed zeros as well as the zero-inflated model. The Vuong statistic of 0.043 suggests that the test is inconclusive. Thus there is no statistical support for selecting the ZINB model over the standard negative binomial model (see also Table 10.5, where the t statistic of the overdispersion parameter, 1.93, nearly satisfies the 95% confidence guideline used). This intuitively makes sense because there is no compelling reason to believe that intersections, which are noted as accident hot spots, would be in what might be considered a zero-accident state.

10.6 Panel Data and Count Models

The time-series nature of count data from a panel (where groups of data areavailable over time) creates a serial correlation problem that must be dealt

TABLE 10.5

Decision Guidelines for Model Selection (Using the 95% Confidence Level) among Negative Binomial (NB), Poisson, Zero-Inflated Poisson (ZIP), and Zero-Inflated Negative Binomial (ZINB) Models Using the Vuong Statistic and the Overdispersion Parameter α

                                              t Statistic of the NB Overdispersion Parameter α
Vuong Statistic for ZINB (f1(·)) and            <|1.96|                       >|1.96|
NB (f2(·)) Comparison
  V < –1.96                                     ZIP or Poisson as             NB
                                                alternative to NB
  V > 1.96                                      ZIP                           ZINB


with in estimation. Hausman et al. (1984) proposed random- and fixed-effects approaches for panel count data to resolve problems of serial correlation. The random-effects approach accounts for possible unobserved heterogeneity in the data in addition to possible serial correlation. In contrast, the fixed-effects approach does not allow for unobserved heterogeneity.

Transportation applications of fixed- and random-effects Poisson and negative binomial models have been quite limited. However, two studies are noteworthy in this regard. Johansson (1996) studied the effect of a lowered speed limit on the number of accidents on roadways in Sweden. And Shankar et al. (1998) compared standard negative binomial and random-effects negative binomial models in a study of accidents caused by median crossovers in Washington State. The reader is referred to these sources and Hausman et al. (1984) for additional information on random- and fixed-effects count models.

TABLE 10.6

Zero-Inflated Negative Binomial Regression of Injury Accident Data: Logistic Distribution Splitting Model

Variable Description                                        Estimated Parameter   t Statistic

Negative Binomial Accident State
Constant                                                        –13.84             –4.33
Log of average annual daily traffic on major road                 1.337              4.11
Log of average annual daily traffic on minor road                 0.289              2.82
Number of driveways within 250 ft of intersection                 0.0817             2.91
Overdispersion parameter, α                                       0.0527             1.93

Zero-Accident State
Constant                                                         –5.18               0.77

Number of observations                                           84
Log likelihood at convergence (Poisson)                        –171.52
Log likelihood at convergence (negative binomial)              –154.40
Log likelihood at convergence (zero-inflated negative binomial) –154.40
Poisson zeros: actual/predicted                                  29/15.6
Negative binomial zeros: actual/predicted                        29/25.5
Zero-inflated negative binomial zeros: actual/predicted          29/25.5
Vuong statistic for testing zero-inflated negative binomial
  vs. the normal-count model                                      0.043


11
Discrete Outcome Models

Discrete or nominal scale data often play a dominant role in transportation because many interesting policy-sensitive analyses deal with such data. Examples of discrete data in transportation include the mode of travel (automobile, bus, rail transit), the type or class of vehicle owned, and the type of a vehicular accident (run-off-road, rear-end, head-on, etc.). From a conceptual perspective, such data are classified as those involving a behavioral choice (choice of mode or type of vehicle to own) or those simply describing discrete outcomes of a physical event (type of vehicle accident). The methodological approach used to model these conceptual perspectives statistically is often identical. However, the underlying theory used to derive these models is often quite different. Discrete models of behavioral choices are derived from economic theory, often leading to additional insights in the analysis of model estimation results, whereas models of physical phenomena are derived from simple probabilistic theory.

Some discrete data encountered in transportation applications are ordered. An example is telecommuting-frequency data that have outcomes of never, sometimes, and frequently. In contrast to data that are not ordered, ordinal discrete data possess additional information on the ordering of responses that can be used to improve the efficiency of the model's parameter estimates. For example, using the information that "never" is less than "sometimes," which is less than "frequently," can increase the efficiency of parameter estimates. Guidelines and considerations for choosing between ordered and unordered discrete modeling approaches are presented later in this chapter.

11.1 Models of Discrete Data

In formulating a statistical model for unordered discrete outcomes, it is common to start with a linear function of covariates that influences specific discrete outcomes. For example, in the event of a vehicular accident, possible discrete crash outcomes are rear-end, sideswipe, run-off-road,


head-on, turning, and other. Let Tin be a linear function that determines discrete outcome i for observation n, such that

T_{in} = \beta_i X_{in},  (11.1)

where βi is a vector of estimable parameters for discrete outcome i, and Xin is a vector of the observable characteristics (covariates) that determine discrete outcomes for observation n. To arrive at a statistically estimable probabilistic model, a disturbance term εin is added, giving

T_{in} = \beta_i X_{in} + \varepsilon_{in}.  (11.2)

The addition of the disturbance term is supported on a number of grounds, such as the possibility that (1) variables have been omitted from Equation 11.1 (some important data may not be available), (2) the functional form of Equation 11.1 may be incorrectly specified (it may not be linear), (3) proxy variables may be used (variables that approximate missing variables in the database), and (4) variations in βi that are not accounted for (βi may vary across observations). As is shown later in this chapter, the existence of some of these possibilities can have adverse effects on the estimation of βi, so caution must be exercised in the interpretation of εin.

To derive an estimable model of discrete outcomes, with I denoting all possible outcomes for observation n and Pn(i) being the probability of observation n having discrete outcome i (i ∈ I),

P_n(i) = P(T_{in} \geq T_{In}) \quad \forall\, I \neq i.  (11.3)

By substituting Equation 11.2 into Equation 11.3, the latter is written as

P_n(i) = P(\beta_i X_{in} + \varepsilon_{in} \geq \beta_I X_{In} + \varepsilon_{In}) \quad \forall\, I \neq i  (11.4)

or

P_n(i) = P(\beta_i X_{in} - \beta_I X_{In} \geq \varepsilon_{In} - \varepsilon_{in}) \quad \forall\, I \neq i.  (11.5)

Estimable models are developed by assuming a distribution of the random disturbance term, ε. Two popular modeling approaches are probit and logit.

11.2 Binary and Multinomial Probit Models

A distinction is often drawn between binary models (models that consider two discrete outcomes) and multinomial models (models that consider three


or more discrete outcomes) because the derivation between the two can vary significantly. Probit models arise when the disturbance terms in Equation 11.5 are assumed to be normally distributed. In the binary case (two outcomes, denoted 1 or 2), Equation 11.5 is written as

P_n(1) = P(\beta_1 X_{1n} - \beta_2 X_{2n} \geq \varepsilon_{2n} - \varepsilon_{1n}).  (11.6)

This equation estimates the probability of outcome 1 occurring for observation n, where ε1n and ε2n are normally distributed with mean zero, variances σ₁² and σ₂², respectively, and covariance σ₁₂. An attractive feature of normally distributed variates is that the addition or subtraction of two normal variates also produces a normally distributed variate. In this case ε2n − ε1n is normally distributed with mean zero and variance σ₁² + σ₂² − 2σ₁₂. The resulting cumulative normal function is

P_n(1) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\frac{1}{\sigma}(\beta_1 X_{1n} - \beta_2 X_{2n})} \text{EXP}\!\left(-\frac{w^2}{2}\right) dw.  (11.7)

If Φ(·) is the standardized cumulative normal distribution, then

P_n(1) = \Phi\!\left[\frac{1}{\sigma}(\beta_1 X_{1n} - \beta_2 X_{2n})\right],  (11.8)

where σ = (σ₁² + σ₂² − 2σ₁₂)^0.5. The term 1/σ in Equations 11.7 and 11.8 is a scaling of the function determining the discrete outcome and can be set to any positive value, although σ = 1 is typically used. An example of the general shape of this probability function is shown in Figure 11.1.

FIGURE 11.1
General shape of probit outcome probabilities (Pn(1) as a function of β1X1n − β2X2n).


The parameter vector β is readily estimated using standard maximum likelihood methods. If δin is defined as being equal to 1 if the observed discrete outcome for observation n is i and zero otherwise, the likelihood function is

L = \prod_{n=1}^{N} \prod_{i=1}^{I} \left[P_n(i)\right]^{\delta_{in}},  (11.9)

where N is the total number of observations. In the binary case with i = 1 or 2, the log likelihood from Equation 11.9 is (again, the log likelihood is used in estimation without loss of generality because the log transformation does not affect the ordering)

LL = \sum_{n=1}^{N} \left\{ \delta_{1n}\, \text{LN}\, \Phi\!\left[\frac{1}{\sigma}(\beta_1 X_{1n} - \beta_2 X_{2n})\right] + \delta_{2n}\, \text{LN}\!\left(1 - \Phi\!\left[\frac{1}{\sigma}(\beta_1 X_{1n} - \beta_2 X_{2n})\right]\right) \right\}.  (11.10)

The derivation of the multinomial probit model is provided in numerous sources, most notably Daganzo (1979). The problem with the multinomial probit is that the outcome probabilities are not closed form, and estimation of the likelihood functions requires numerical integration. The difficulties of extending the probit formulation to more than two discrete outcomes have led researchers to consider other disturbance term distributions.

11.3 Multinomial Logit Model

From a model estimation perspective, a desirable property of an assumed distribution of disturbances (ε in Equation 11.5) is that the maximums of randomly drawn values from the distribution have the same distribution as the values from which they were drawn. The normal distribution does not possess this property (the maximums of randomly drawn values from the normal distribution are not normally distributed). A disturbance term distribution with such a property greatly simplifies model estimation because Equation 11.10 could be applied to the multinomial case by replacing β2X2n with the highest value (maximum) of all other βIXIn ∀ I ≠ 1. Distributions of the maximums of randomly drawn values from some underlying distribution are referred to as extreme value distributions (Gumbel, 1958). Extreme value distributions are categorized into three families: Type 1, Type 2, and Type 3 (see Johnson and Kotz, 1970). The most common extreme value distribution is the Type 1 distribution (sometimes referred to as the Gumbel distribution). It has the desirable property that maximums of randomly drawn values from the extreme value Type 1 distribution are also


extreme value Type 1 distributed. The probability density function of the extreme value Type 1 distribution is

f(\varepsilon) = \eta\, \text{EXP}[-\eta(\varepsilon - \omega)]\, \text{EXP}\{-\text{EXP}[-\eta(\varepsilon - \omega)]\},  (11.11)

with corresponding distribution function

F(\varepsilon) = \text{EXP}\{-\text{EXP}[-\eta(\varepsilon - \omega)]\},  (11.12)

where η is a positive scale parameter, ω is a location parameter (mode), and the mean is ω + 0.5772/η. An illustration of the extreme value Type 1 distribution with location parameter equal to zero and various scale parameter values is presented in Figure 11.2.
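This max-stability property is easy to check by simulation. The sketch below (a numerical illustration, not part of the original text) draws groups of extreme value Type 1 variates with ω = 0 and η = 1 and confirms that their maxima behave as a Type 1 variate with location shifted to LN(5):

```python
import numpy as np

rng = np.random.default_rng(0)

# numpy's Gumbel "scale" equals 1/eta in the notation used here
draws = rng.gumbel(loc=0.0, scale=1.0, size=(100_000, 5))
maxima = draws.max(axis=1)

# The maximum of 5 such draws is Type 1 with location LN(5) and the
# same scale, so its mean should be near LN(5) + 0.5772
print(maxima.mean(), np.log(5) + 0.5772)
```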

To derive an estimable model based on the extreme value Type 1 distribution, a revised version of Equation 11.4 yields

P_n(i) = P\!\left(\beta_i X_{in} + \varepsilon_{in} \geq \max_{\forall I \neq i}\left(\beta_I X_{In} + \varepsilon_{In}\right)\right).  (11.13)

For the extreme value Type 1 distribution, if all εIn are independently and identically (same variances) distributed random variates with location parameters ωIn and a common scale parameter η (which implies equal

FIGURE 11.2
Illustration of an extreme value Type 1 distribution (ω = 0; η = 0.5, 1, and 2).


variances), then the maximum of βIXIn + εIn is extreme value Type 1 distributed with location parameter

\frac{1}{\eta}\, \text{LN}\!\left[\sum_{\forall I \neq i} \text{EXP}(\eta\, \beta_I X_{In})\right]  (11.14)

and scale parameter η (see Gumbel, 1958). If ε′n is a disturbance term associated with the maximum of all possible discrete outcomes i (see Equation 11.13), with mode equal to zero and scale parameter η, and β′X′n is the parameter and covariate product associated with the maximum of all possible discrete outcomes i, then it can be shown that

\beta' X'_n = \frac{1}{\eta}\, \text{LN}\!\left[\sum_{\forall I \neq i} \text{EXP}(\eta\, \beta_I X_{In})\right].  (11.15)

This result arises because, for extreme value Type 1 distributed variates ε, the addition of a positive scalar constant, say a, changes the location parameter from ω to ω + a without affecting the scale parameter η (see Johnson and Kotz, 1970). So, if ε′n has location parameter equal to zero and scale parameter η, adding the scalar shown in Equation 11.14 gives an extreme value distributed variate with location parameter (β′X′n) equal to Equation 11.14 (as shown in Equation 11.15) and scale parameter η.

Using these results, Equation 11.13 is written as

P_n(i) = P(\beta_i X_{in} + \varepsilon_{in} \geq \beta' X'_n + \varepsilon'_n)  (11.16)

or

P_n(i) = P(\beta_i X_{in} - \beta' X'_n \geq \varepsilon'_n - \varepsilon_{in}).  (11.17)

And, because the difference between two independently distributed extreme value Type 1 variates with common scale parameter η is logistic distributed,

P_n(i) = \frac{1}{1 + \text{EXP}\!\left[-\eta\left(\beta_i X_{in} - \beta' X'_n\right)\right]}.  (11.18)

Rearranging terms,

P_n(i) = \frac{\text{EXP}(\eta\, \beta_i X_{in})}{\text{EXP}(\eta\, \beta_i X_{in}) + \text{EXP}(\eta\, \beta' X'_n)}.  (11.19)


Substituting β′X′n with Equation 11.15 and setting η = 1 (there is no loss of generality; see Johnson and Kotz, 1970), the equation becomes

P_n(i) = \frac{\text{EXP}(\beta_i X_{in})}{\text{EXP}(\beta_i X_{in}) + \text{EXP}\!\left[\text{LN}\!\left(\sum_{\forall I \neq i} \text{EXP}(\beta_I X_{In})\right)\right]}  (11.20)

or

P_n(i) = \frac{\text{EXP}(\beta_i X_{in})}{\sum_{\forall I} \text{EXP}(\beta_I X_{In})},  (11.21)

which is the standard multinomial logit formulation. For estimation of the parameter vectors β by maximum likelihood, the log likelihood function is

LL = \sum_{n=1}^{N} \sum_{i=1}^{I} \delta_{in}\left[\beta_i X_{in} - \text{LN} \sum_{\forall I} \text{EXP}(\beta_I X_{In})\right],  (11.22)

where I is the total number of outcomes, δin is as defined in Equation 11.9, and all other variables are as defined previously.
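A compact sketch of Equations 11.21 and 11.22 (the utility values are hypothetical, and the max-subtraction is a standard numerical safeguard rather than part of the derivation):

```python
import numpy as np

def mnl_probabilities(V):
    """Equation 11.21: V is an (N x I) array of fitted values
    beta_i * X_in for each observation and outcome."""
    expV = np.exp(V - V.max(axis=1, keepdims=True))  # overflow guard
    return expV / expV.sum(axis=1, keepdims=True)

def mnl_log_likelihood(V, chosen):
    """Equation 11.22: chosen[n] indexes the observed outcome for
    observation n (the nonzero delta_in indicators)."""
    P = mnl_probabilities(V)
    return np.log(P[np.arange(len(chosen)), chosen]).sum()

V = np.array([[-2.1, -1.4, -3.0],    # hypothetical route utilities
              [-1.8, -2.2, -1.9]])
print(mnl_probabilities(V))
print(mnl_log_likelihood(V, chosen=[1, 2]))
```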

When applying the multinomial logit model, it is important to realize that the choice of the extreme value Type 1 distribution is made on the basis of computational convenience, although this distribution is similar to the normal distribution (see Figure 11.3 for a two-outcome model and compare with Figure 11.1). Other restrictive assumptions made in the derivation of the

FIGURE 11.3
Comparison of binary logit and probit outcome probabilities (P(i) as a function of β1X1n − β2X2n).


model are the independence of disturbance terms and their common variance (homoscedasticity). These assumptions have estimation consequences that are discussed later in this chapter.

11.4 Discrete Data and Utility Theory

In many applications it is useful to tie the statistical model to an underlying theoretical construct. An example is the economic and transportation literature where discrete outcome models have been tied to utility theory. Traditional approaches from microeconomic theory have decision makers choosing among a set of alternatives such that their utility (satisfaction) is maximized subject to the prices of the alternatives and an income constraint (see Nicholson, 1978). When dealing with decision-makers' choices among discrete alternatives, discrete outcome models are often referred to as discrete choice models, but it is important to remember, in a more general sense, that any discrete outcome, whether a consumer choice or not, can be modeled using a discrete outcome model.

Because utility theory consists of decision makers selecting a utility-maximizing alternative based on prices of alternatives and an income constraint, any purchase affects the remaining income, and thus all purchases are interrelated. This creates an empirical problem because one theoretically cannot isolate specific choice situations. For example, when considering a traveler's choice of mode (e.g., car, bus, or rail), one would also have to consider the choice of breakfast cereal because any purchase will affect remaining income. This difficulty can make the empirical analysis of choice data virtually intractable. However, the economic literature provides a number of options in this regard by placing restrictions on the utility function. To illustrate these, a utility function is defined by the consumption of m goods (y1, y2, …, ym) such that

u = f(y1, y2,…, ym). (11.23)

As an extremely restrictive case, it is assumed that the consumption of one good is independent of the consumption of all other goods. The utility function is then written as

u = f1(y1) + f2(y2) + … + fm(ym). (11.24)

This is referred to as an additive utility function and, in nearly all applications, it is unrealistically restrictive. For example, the application of such an assumption implies that the acquisitions of two types of breakfast cereal are independent, although it is clear that the purchase of one will affect the purchase of the other. A more realistic restriction on the utility function is


to separate decisions into groups and to assume that consumption of goods within the groups is independent of those goods in other groups. For example, if utility is a function of the consumption of breakfast cereal and a word-processing software package, these two choices are separated and modeled in isolation. This is referred to as separability and is an important construct in applied economic theory (Phlips, 1974). It is this property that permits the focus on specific choice groups such as the choice of travel mode to work.

The previous discussion views utility in the traditional sense in that utility is maximized subject to an income constraint and this maximization produces a demand for goods y1, y2, …, ym. When applying discrete outcome models, the utility function is typically written with prices and income as arguments. When the utility function is written in this way, it is said to be indirect, and it can be shown that the relationship between this indirect utility and the resulting demand equation for some good m is given by Roy's identity (Phlips, 1974; Mannering and Winston, 1985)

y_m^0 = -\frac{\partial V / \partial p_m}{\partial V / \partial Inc},  (11.25)

where V is the indirect utility, pm is the price of good m, Inc is the income of the decision maker, and y⁰m is the utility-maximizing demand for good m. As will be shown later, the application of this equation proves to be critical in model development and the economic interpretation of model findings.

Applying the utility framework within discrete outcome models is straightforward. By using the notation above, T in Equations 11.1 through 11.3 becomes the utility determining the choice (as opposed to a function determining the outcome). But in addition to the restrictions typically used in applying utility theory to discrete outcome models (separability of utility functions and the use of an indirect utility function with prices and income as arguments in the function), behavioral implications are noteworthy. The derivations of discrete outcome models provided here imply that the model is compensatory. That is, changes in factors that determine the function T in Equation 11.1 for each discrete outcome do not matter as long as the total value of the function remains the same. This is potentially problematic in some utility-maximizing choice situations. For example, consider the choice of a vehicle purchase where price, fuel efficiency, and seating capacity determine the indirect utility (variables in the X vector in Equation 11.1). Decision makers with families of five may get very little utility from a two-seat sports car. Yet the model assumes that the decision maker is compensated for this by lowering price or increasing fuel efficiency. Some argue that a more realistic framework is one in which there are thresholds beyond which the decision maker does not accept an alternative. Models that use thresholds and relax the


compensatory assumption are referred to as noncompensatory models. Because of their inherent complexity, noncompensatory models have rarely been used in practice (Tversky, 1972).

11.5 Properties and Estimation of Multinomial Logit Models

The objective of the multinomial logit (MNL) model is to estimate a function that determines outcome probabilities. As an example, for choice models based on utility theory this function is the indirect utility. To illustrate this, consider a commuter's choice of route from home to work where the choices are to take an arterial, a two-lane road, or a freeway. By using the form of the MNL model given in Equation 11.21, the choice probabilities for the three routes are, for each commuter n (n subscripting omitted),

P(a) = \frac{e^{V_a}}{e^{V_a} + e^{V_t} + e^{V_f}}, \quad P(t) = \frac{e^{V_t}}{e^{V_a} + e^{V_t} + e^{V_f}}, \quad P(f) = \frac{e^{V_f}}{e^{V_a} + e^{V_t} + e^{V_f}},  (11.26)

where P(a), P(t), and P(f) are the probabilities that commuter n selects the arterial, two-lane road, and freeway, respectively, and Va, Vt, and Vf are the corresponding indirect utility functions. Broadly speaking, the variables defining these functions are classified into two groups — those that vary across outcome alternatives and those that do not. In route choice, distance and number of traffic signals are examples of variables that vary across outcome alternatives. Commuter income and other commuter-specific characteristics (number of children, number of vehicles, and age of commuting vehicle) are variables that do not vary across outcome alternatives. The distinction between these two sets of variables is important because the MNL model is derived using the difference in utilities (see Equation 11.17). Because of this differencing, estimable parameters relating to variables that do not vary across outcome alternatives can, at most, be estimated in I – 1 of the functions determining the discrete outcome (I is the total number of discrete outcomes). The parameter of at least one of the discrete outcomes must be normalized to zero to make parameter estimation possible (this is illustrated in a forthcoming example).

Given these two variable types, the utility functions for Equation 11.26 are defined as

V_a = \beta_{1a} + \beta_{2a} X_a + \beta_{3a} Z
V_t = \beta_{1t} + \beta_{2t} X_t + \beta_{3t} Z  (11.27)
V_f = \beta_{1f} + \beta_{2f} X_f + \beta_{3f} Z


where Xa, Xt, and Xf are vectors of variables that vary across arterial, two-lane, and freeway choice outcomes, respectively, as experienced by commuter n; Z is a vector of characteristics specific to commuter n; the β1 are constant terms; the β2 are vectors of estimable parameters corresponding to outcome-specific variables in the X vectors; and the β3 are vectors corresponding to variables that do not vary across outcome alternatives. Note that the constant terms are effectively the same as variables that do not vary across alternative outcomes and at most are estimated for I – 1 of the outcomes.

Example 11.1

A survey of 151 commuters was conducted. Information was collected on their route selection on their morning trip from home to work. All commuters departed from the same origin (a large residential complex in suburban State College, PA) and went to work in the downtown area of State College. Distance was measured precisely from parking lot of origin to parking lot of destination, so there is variance in distances among commuters even though they departed and arrived in the same general areas. Commuters had a choice of three alternate routes: a four-lane arterial (speed limit = 60 km/h, two lanes each direction), a two-lane highway (speed limit = 60 km/h, one lane each direction), and a limited-access four-lane freeway (speed limit = 90 km/h, two lanes each direction). Each of these three routes shared some common portions for access and egress because, for example, the same road to the downtown area is used by both the freeway and two-lane road alternatives, since the freeway exits onto the same city street as the two-lane road.

To analyze commuters' choice of route, a MNL model is estimated given the variables shown in Table 11.1. Model estimation results are presented in Table 11.2. Based on this table, the estimated utility functions are (see Equation 11.27)

Va = –0.942(DISTA)

Vt = 1.65 – 1.135(DISTT) + 0.128(VEHAGE)  (11.28)

Vf = –3.20 – 0.694(DISTF) + 0.233(VEHAGE) + 0.766(MALE)

The interpretation of the parameters is straightforward. First, consider the constant terms. Note that Va does not have a constant. This is because constants are estimated as variables that do not vary across alternative outcomes and therefore can appear in at most I – 1 functions (constants are estimated as βX, where X is a vector of 1's; this vector does not vary across alternate outcomes). The lack of a constant in the arterial function establishes it as a 0 baseline. Thus, all else being equal, the two-lane road is more likely to be selected (with its positive constant) relative to the


TABLE 11.1

Variables Available for Home to Work Route-Choice Model Estimation

Variable No.   Variable Description
1    Route chosen: 1 if arterial, 2 if two-lane road, 3 if freeway
2    Traffic flow rate of arterial at time of departure (vehicles per hour)
3    Traffic flow rate of two-lane road at time of departure (vehicles per hour)
4    Traffic flow rate of freeway at time of departure (vehicles per hour)
5    Number of traffic signals on the arterial
6    Number of traffic signals on the two-lane road
7    Number of traffic signals on the freeway
8    Distance on the arterial in kilometers
9    Distance on the two-lane road in kilometers
10   Distance on the freeway in kilometers
11   Seat belts: 1 if wearing, 0 if not
12   Number of passengers in vehicle
13   Commuter age in years: 1 if less than 23, 2 if 24 to 29, 3 if 30 to 39, 4 if 40 to 49, 5 if 50 and older
14   Gender: 1 if male, 0 if female
15   Marital status: 1 if single, 0 if married
16   Number of children in household (aged 16 or younger)
17   Annual household income (U.S. dollars): 1 if less than 20,000, 2 if 20,000 to 29,999, 3 if 30,000 to 39,999, 4 if 40,000 to 49,999, 5 if more than 50,000
18   Age of the vehicle used on the trip in years

TABLE 11.2

MNL Estimation Results for the Choice of Route to Work

Independent Variable                                   Variable Mnemonic   Estimated Parameter   t Statistic
Two-lane road constant                                                          1.65               1.02
Freeway constant                                                               –3.20              –1.16

Variables That Vary across Alternative Outcomes
Distance on the arterial in kilometers                      DISTA             –0.942              –3.99
Distance on the two-lane road in kilometers                 DISTT             –1.135              –5.75
Distance on the freeway in kilometers                       DISTF             –0.694              –2.24

Variables That Do Not Vary across Alternative Outcomes
Male indicator (1 if male commuter, 0 if not;
  defined for the freeway utility function)                 MALE               0.766               1.19
Vehicle age in years (defined for the two-lane
  road utility function)                                    VEHAGE             0.128               1.87
Vehicle age in years (defined for the freeway
  utility function)                                         VEHAGE             0.233               2.75

Number of observations                                      151
Log likelihood at zero                                     –165.89
Log likelihood at convergence                               –92.51


arterial, and the freeway is less likely to be selected relative to the arterial (with its negative constant). And, all else being equal, the freeway is less likely to be selected than the two-lane road. Omitting a constant from the arterial function is arbitrary; a constant could have been omitted from one of the other functions instead, and the same convergence of the likelihood function would have resulted. The key here is that all of the constants are relative, and the omission of one constant over another simply rescales the remaining constants. For example, if the constant is omitted from the two-lane function (so it becomes zero), the estimated constant for the arterial is –1.65 and the estimated constant for the freeway is –4.85 (thus preserving the same magnitude of differences).

The distance from home to work along the various routes was also included in the model. Distance varies across routes, so this variable is included in all utility functions. The negative sign indicates that increasing distance decreases the likelihood of a route being selected. The estimation of separate parameters for the three distances indicates that distance is not valued equally on the three routes. In this case, distance on the two-lane road is most onerous, followed by distance on the arterial and distance on the freeway.

The positive parameter on the male indicator variable indicates that men are more likely to take the freeway. The vehicle age variables indicate that the older the commuting vehicle, the more likely the commuter will drive on the freeway, followed by the two-lane road. This variable does not vary across alternative outcomes, so at most it is estimated in I – 1 functions (in this case two). Age of the vehicle on the arterial is implicitly set to zero, and the same relativity logic discussed above for the constants applies.

11.5.1 Statistical Evaluation

The statistical significance of individual parameters in an MNL model is approximated using a one-tailed t test. To determine if the estimated parameter is significantly different from zero, the test statistic t*, which is approximately t distributed, is

t^* = \frac{\beta - 0}{\text{S.E.}(\beta)},  (11.29)

where S.E.(β) is the standard error of the parameter. Note that because the MNL is derived from an extreme value distribution and not a normal distribution, the use of t statistics is not strictly correct, although in practice it is a reliable approximation.

A more general and appropriate test is the likelihood ratio test. This test is used for a wide variety of reasons, such as assessing the significance of individual parameters, evaluating the overall significance of the model (similar to the F test for ordinary least squares regression), examining the appropriateness of


estimating separate parameters for the same variable in different outcome functions (such as the separate distance parameters estimated in Example 11.1), and examining the transferability of results over time and space. The likelihood ratio test statistic is

X² = –2[LL(βR) – LL(βU)],  (11.30)

where LL(βR) is the log likelihood at convergence of the "restricted" model, and LL(βU) is the log likelihood at convergence of the "unrestricted" model. This statistic is χ² distributed with degrees of freedom equal to the difference in the numbers of parameters between the restricted and unrestricted models (the difference in the number of parameters in the βR and βU parameter vectors). By comparing the improvement in likelihoods as individual variables are added, this test statistic is the correct way to test for the significance of individual variables because, as mentioned above, the more commonly used t statistic is based on the assumption of normality, which is not the case for the MNL model.
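As a sketch, the test amounts to a chi-squared tail probability (the values shown are those worked in Example 11.2, which follows):

```python
from scipy.stats import chi2

def likelihood_ratio_test(ll_restricted, ll_unrestricted, df):
    """Equation 11.30: X^2 = -2[LL(beta_R) - LL(beta_U)]."""
    x2 = -2.0 * (ll_restricted - ll_unrestricted)
    return x2, 1.0 - chi2.cdf(x2, df)

x2, p = likelihood_ratio_test(-94.31, -92.51, df=2)
print(x2, 1.0 - p)   # X^2 = 3.6; level of confidence = 0.835
```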

Example 11.2

For the model estimated in Example 11.1, it is interesting to test whether the effects of distance are indeed different across routes (as estimated) or whether there is one effect across all routes. To test this, the model shown in Table 11.2 is reestimated while constraining the distance parameters to be the same across all three functions (see Equation 11.28). Keeping all other variables in the model, the single distance parameter is estimated as –1.147 (t* of –5.75), and the log likelihood at convergence of this restricted model (estimating one instead of three distance parameters) is –94.31. With the unrestricted model's log likelihood at convergence of –92.51 (as shown in Table 11.2), the application of Equation 11.30 yields an X² value of 3.6. The degrees of freedom are 2 (because two fewer parameters are estimated in the restricted model). An X² value of 3.6 with 2 degrees of freedom yields a level of confidence of 0.835. Thus, using a 90% level of confidence, the model with a single distance effect cannot be rejected (the model with individual effects is not preferred to the model with a single effect). The interpretation is that sampling variability alone can explain the observed difference in log likelihoods, and the model with individual distance effects is rejected.

Finally, a common measure of overall model fit is the ρ² statistic (it is similar to R² in regression models in terms of purpose). The ρ² statistic is

\rho^2 = 1 - \frac{LL(\beta)}{LL(0)},  (11.31)

where LL(β) is the log likelihood at convergence with parameter vector β and LL(0) is the initial log likelihood (with all parameters set to zero). The


perfect model has a likelihood function equal to one (all selected alternative outcomes are predicted by the model with probability one, and the product of these across the observations is also equal to one), and the log likelihood is zero, giving a ρ² of one (see Equation 11.31). Thus the ρ² statistic lies between zero and one, and a statistic close to one suggests that the model is predicting the outcomes with near certainty. For Example 11.1, the ρ² is 0.442 [1 – (–92.51/–165.89)].

As is the case with R² in regression analysis, the disadvantage of the ρ² statistic is that it will always improve as additional parameters are estimated, even though the additional parameters may be statistically insignificant. To account for the estimation of potentially insignificant parameters, a corrected ρ² is estimated as

\text{corrected } \rho^2 = 1 - \frac{LL(\beta) - K}{LL(0)},  (11.32)

where K is the number of parameters estimated in the model (the number of parameters in the vector β). For Example 11.1, the corrected ρ² is 0.394 [1 – ((–92.51 – 8)/–165.89)].

11.5.2 Interpretation of Findings

Two other techniques are used to assess individual parameter estimates: elasticities and marginal rates of substitution. Elasticity computations measure the magnitude of the impact of specific variables on the outcome probabilities. Elasticity is computed from the partial derivative for each observation n (n subscripting omitted):

E^{P(i)}_{x_{ki}} = \frac{\partial P(i)}{\partial x_{ki}} \times \frac{x_{ki}}{P(i)},  (11.33)

where P(i) is the probability of outcome i and xki is the value of variable k for outcome i. It is readily shown by taking the partial derivative of the MNL model that Equation 11.33 becomes

E^{P(i)}_{x_{ki}} = \left[1 - P(i)\right] \beta_{ki}\, x_{ki}.  (11.34)

Elasticity values are interpreted as the percent effect that a 1% change in xki has on the outcome probability P(i). If the computed elasticity value is less than one, the variable xki is said to be inelastic, and a 1% change in xki will have less than a 1% effect on the probability of outcome i being selected. If the computed elasticity is greater than one, it is said to be elastic, and a 1% change in xki will have more than a 1% effect on the probability of outcome i being selected.
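As a sketch, Equation 11.34 is one line per observation. The probabilities and distances below are hypothetical, with the DISTA parameter taken from Table 11.2:

```python
import numpy as np

def direct_elasticity(P_i, beta_k, x_ki):
    """Equation 11.34: percent change in the probability of outcome i
    from a 1% change in x_ki, evaluated at each observation."""
    return (1.0 - np.asarray(P_i)) * beta_k * np.asarray(x_ki)

P_a = [0.25, 0.40, 0.10]        # hypothetical arterial probabilities
dist_a = [8.0, 5.5, 11.0]       # hypothetical arterial distances (km)
print(np.mean(direct_elasticity(P_a, -0.942, dist_a)))
```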


Example 11.3

To determine the elasticities for distances on the three routes based on the model estimated in Example 11.1 (assuming this is a tenable model), Equation 11.34 is applied over all observations, N (the entire 151 commuters). It is found that the average elasticity for distance (averaged over N commuters) on the arterial is –6.48. This means that for the average commuter a 1% increase in distance on the arterial will decrease the probability of the arterial being selected by 6.48%. Among the 151 commuters in the sample, elasticities range from a high of –13.47 to a low of –1.03. The computed average elasticities for distances on the two-lane road and the freeway are –3.07 and –6.60, respectively. These findings show that distance on the two-lane road has the least effect on the selection probabilities.

There are several points to keep in mind when using elasticities. First, the values are point elasticities and, as such, are valid only for small changes in xki; considerable error may be introduced when an elasticity is used to estimate the probability change caused by a doubling of xki. Second, elasticities are not applicable to indicator variables (those variables that take on values of 0 or 1, such as the male indicator in Table 11.2). Some measure of the sensitivity of indicator variables is made by computing a pseudo-elasticity. The equation is

E^{P(i)}_{x_{ki}} = \frac{\text{EXP}[\Delta(\beta_{ki} x_{ki})] \sum_{\forall I_n} \text{EXP}(\beta_{kI} x_{kI})}{\text{EXP}[\Delta(\beta_{ki} x_{ki})] \sum_{\forall I_n} \text{EXP}(\beta_{kI} x_{kI}) + \sum_{\forall I \neq I_n} \text{EXP}(\beta_{kI} x_{kI})} - 1,  (11.35)

where In is the set of alternate outcomes with xk in the function determining theoutcome, and I is the set of all possible outcomes. See Shankar and Mannering(1996) and Chang and Mannering (1999) for an application of this equation.

The elasticities described previously are referred to as direct elasticities because they capture the effect that a change in a variable determining the likelihood of alternative outcome i has on the probability that outcome i will be selected. It may also be of interest to determine the effect that a variable influencing the probability of outcome j has on the probability of outcome i. For example, referring to the route choice problem (Example 11.1), interest may be centered on knowing how a change in the distance on the arterial affects the probability that the two-lane road will be chosen. Known as a cross-elasticity, its value is computed using the equation

E^{P(i)}_{x_{kj}} = -P(j)\, \beta_{kj}\, x_{kj}.  (11.36)

Note that this equation implies that there is one cross-elasticity for all i (i ≠ j). This means that an increase in distance on the arterial results in an equal


increase in the likelihood that the two-lane and freeway alternatives will be chosen. This property of uniform cross-elasticities is an artifact of the error term independence assumed in deriving the MNL model. This is not always a realistic assumption and is discussed further when independence of irrelevant alternatives (IIA) properties and nested logit models are discussed later in this chapter. When computed using the sample of 151 commuters, the average cross-elasticity for distance on the arterial is 1.63, indicating that a 1% increase in distance on the arterial increases the probability of the two-lane road being selected by 1.63% and the probability of the freeway being selected by 1.63%.

Because logit models are compensatory (as discussed in Section 11.4), marginal rates of substitution are computed to determine the relative magnitude of any two parameters estimated in the model. In MNL models, this rate is computed simply as the ratio of the parameters for any two variables in question (in this case a and b):

MRS_{ab|i} = \frac{\beta_{ai}}{\beta_{bi}}.  (11.37)

Example 11.4

One might want to know the marginal rate of substitution between distance and vehicle age on the two-lane road based on the model estimated in Example 11.1. The estimated parameters are –1.135 for distance and 0.128 for vehicle age. The marginal rate of substitution between distance and vehicle age is –0.113 km/vehicle-year (0.128/–1.135). This estimate means that for each year a vehicle ages (which increases the probability of the two-lane route being selected), distance would have to increase 0.113 km on average to ensure that the same route choice probability is maintained.

11.5.3 Specification Errors

A number of restrictive assumptions are made when deriving the MNL model, and, when these are not met, potentially serious specification errors may result. In addition, as with any statistical model, there are consequences of omitting key variables and including irrelevant variables in the model estimation. Arguably the most commonly overlooked and misunderstood limitation of the MNL model is the independence of irrelevant alternatives (IIA) property. Recall that a critical assumption in the derivation of the MNL model is that the disturbances (the ε in Equation 11.5) are independently and identically distributed. When this assumption does not hold, a major specification error results. This problem arises when only some of the functions, which determine possible outcomes, share unobserved elements (that show up in the disturbances). The "some of the functions" is a critical distinction because if all outcomes shared the same unobserved effects, the


problem would self-correct because in the differencing of outcome functions (see Equation 11.17) common unobserved effects would cancel out. Because the common elements cancel in the differencing, a logit model with only two outcomes can never have an IIA violation.

To illustrate the IIA problem, it is easily shown that for MNL models the ratio of any two outcome probabilities is independent of the functions determining any other outcome, since (with common denominators as shown in Equation 11.21)

\frac{P_1}{P_2} = \frac{\text{EXP}(\beta_1 X_1)}{\text{EXP}(\beta_2 X_2)}  (11.38)

for each observation n (n subscripting omitted). To illustrate the problem that can arise as a result of this property, consider the estimation of a model of choice of travel mode to work where the alternatives are a personal vehicle, a red transit bus, or a blue transit bus. The red and blue transit buses clearly share unobserved effects that will appear in their disturbance terms, and they will have exactly the same functions (βrbXrb = βbbXbb) if the only difference in their observable characteristics is their color. For illustrative purposes, assume that, for a sample commuter, all three modes have the same value of βiXi (the red bus and blue bus will, and assume that the costs, times, and other factors that determine the likelihood of the personal vehicle being chosen work out to the same value as the buses). Then the predicted probabilities will give each mode a 33% chance of being selected. This is unrealistic because the correct answer should be a 50% chance of taking a personal vehicle and a 50% chance of taking a bus (red and blue buses combined), and not 33.33 and 66.67%, respectively, as the MNL would predict. The consequence of an IIA violation is incorrect probability estimates.

In most applications the IIA violation is subtler than in the previous example. There are a number of statistical tests that can be conducted to test for IIA violations. One of the more common of these tests was developed by Small and Hsiao (1985). The Small–Hsiao IIA test is easy to conduct. The procedure is first to split the data randomly into two samples (NA and NB) containing the same number of observations. Two separate models are then estimated, producing parameter estimates βA and βB. A weighted average of these parameters is obtained from

\beta^{AB} = \left(\frac{1}{\sqrt{2}}\right)\beta^{A} + \left(1 - \frac{1}{\sqrt{2}}\right)\beta^{B}.  (11.39)

Then, a restricted set of outcomes, D, is created as a subsample from the full set of outcomes. The sample NB is then reduced to include only those observations in which the observed outcome lies in D. Two models are estimated with the reduced sample (NB′) using D as if it were the entire outcome set


(B′ in superscripting denotes the sample reduced to observations with outcomes in D). One model is estimated by constraining the parameter vector to be equal to βAB as computed above. The second model estimates the unconstrained parameter vector βB′. The resulting log likelihoods are used to evaluate the suitability of the model structure by computing the X² test statistic with the number of degrees of freedom equal to the number of parameters in βAB (also the same number as in βB′). This statistic is

X² = –2[LLB′(βAB) – LLB′(βB′)].  (11.40)

The test is then repeated by interchanging the roles of the NA and NB subsamples (reducing the NA sample to observations where the observed outcomes lie in D, and proceeding as before). Using the same notation, Equation 11.39 becomes

\beta^{BA} = \left(\frac{1}{\sqrt{2}}\right)\beta^{B} + \left(1 - \frac{1}{\sqrt{2}}\right)\beta^{A}  (11.41)

and the X2 test statistic is

X² = –2[LLA′(βBA) – LLA′(βA′)].  (11.42)
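The arithmetic of Equations 11.39 through 11.42 is compact enough to sketch directly; the parameter vectors below are hypothetical placeholders, and the log likelihoods are the values reported in Example 11.5:

```python
import numpy as np

def small_hsiao(beta_A, beta_B, ll_constrained, ll_unconstrained):
    """Weighted parameter average of Equation 11.39 and the X^2
    statistic of Equation 11.40, computed from the reduced-outcome-set
    (D) log likelihoods."""
    w = 1.0 / np.sqrt(2.0)
    beta_AB = w * np.asarray(beta_A) + (1.0 - w) * np.asarray(beta_B)
    x2 = -2.0 * (ll_constrained - ll_unconstrained)
    return beta_AB, x2

_, x2 = small_hsiao([0.5, -1.0], [0.7, -1.2], -25.83, -24.53)
print(x2)   # 2.6, evaluated with 4 degrees of freedom
```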

Example 11.5

It might be plausible that the model estimated in Example 11.1 has an IIA violation because the two-lane road and the arterial may share unobserved elements (because they are lower-level roads relative to the freeway, with no access control, lower design speeds, and so on). A Small–Hsiao test is used to evaluate this suspicion. The sample is randomly split to give 75 observations in NA and 76 observations in NB. Two models (one each for the A and B subsamples) are estimated with the eight parameters as shown in Table 11.2. Based on the estimation results, βAB

is computed as shown in Equation 11.39. The NB sample is then reduced to include only those observations in which the arterial or two-lane road was chosen. The models required to compute Equation 11.40 are estimated (only four parameters are estimated because the parameters specific to the freeway cannot be estimated, as freeway users are no longer in the sample). The estimation results give LLB′(βAB) = –25.83 and LLB′(βB′) = –24.53. Applying Equation 11.40 gives an X² of 2.6 with 4 degrees of freedom. This means that the MNL structure cannot be rejected at any reasonable confidence level (the confidence level is 37.3%). Following the same procedure, reversing the NA and NB subsamples and applying Equations 11.41 and 11.42 gives an X² of 0.18 with 4 degrees of freedom, which means the MNL structure again cannot be rejected at any reasonable confidence level (the confidence level is 0.4%). Based on this test, the MNL model structure cannot be refuted, and IIA violations resulting from shared unobserved effects of the arterial and two-lane road are not statistically significant.


Aside from IIA violations, there are a number of other possible specification errors, including the following:

• Omitted variables. The omission of a relevant explanatory variable results in inconsistent estimates of logit model parameters and choice probabilities if any of the following hold: (1) the omitted variable is correlated with other variables included in the model, (2) the mean values of the omitted variable vary across alternative outcomes and outcome-specific constants are not included in the model, or (3) the omitted variable is correlated across alternative outcomes or has a different variance in different outcomes. Because one or more of these conditions are likely to hold, omitting relevant variables is a serious specification problem.

• Presence of an irrelevant variable. Estimates of parameters and choice probabilities remain consistent in the presence of an irrelevant variable, but the standard errors of the parameter estimates will increase (a loss of efficiency).

• Disturbances that are not independently and identically distributed (IID). Dependence among a subset of possible outcomes causes the IIA problem, resulting in inconsistent parameter estimates and outcome probabilities. Disturbances with different variances (not identically distributed) also result in inconsistent parameter estimates and outcome probabilities. Such a case could arise in a model of choice of mode to work in which "comfort" is a major part of the disturbance. The wide variety of vehicles may make the variance of the disturbance term larger for the private-vehicle option than for competing modes such as bus and train.

• Random parameter variations. Standard MNL estimation assumes that the estimated parameters are the same across all observations. Violations of this assumption occur if there is some compelling reason to believe that parameters vary across observations in a way that cannot be accounted for in the model. For example, suppose that a price parameter, the cost of taking the mode, is estimated in a model of choice of mode to work. The parameter estimate inherently assumes that price is equally onerous across the population. Realistically, price is less important to wealthy people. In such cases, estimating the parameter associated with a transformed variable (such as price divided by income) instead of simply price is an attempt to account for this variation. In other applications, the parameter variation may be essentially random (such as the effect of patience on the parameter associated with mode travel time) in that available data cannot uncover the relationship causing the variation in the parameter. Such random parameter variation gives inconsistent estimates of parameters and outcome probabilities. The mixed logit model described in Section 11.8 addresses this modeling challenge.


• Correlation between explanatory variables and disturbances, and endogenous variables. If a correlation exists between X and ε, then parameter estimates will be inconsistent. An example of such correlation is a model of mode-to-work choice where distance is an explanatory variable and comfort is unobserved. If it is the practice of the transit agency to put new comfortable buses on longer routes, a correlation will exist. An example of an endogenous variable problem is a discrete model of the severity of vehicle accidents in icy conditions (with discrete outcome categories such as property damage only, injury, and fatality) and the presence of an ice warning sign (an X variable). Because ice warning signs are likely to be posted at high-severity locations, their presence is correlated with severity and thus is endogenous (see Carson and Mannering, 2001, for this example).

• Erroneous data. If erroneous data are used, parameters and outcome probabilities will be incorrectly estimated (also erroneous).

• State dependence and heterogeneity. A potential estimation problem can arise in discrete outcome models if information on previous outcomes is used to determine current outcome probabilities. To demonstrate this, consider the earlier example of a commuter choosing among three routes on a morning commute to work. On day 1, the driver is observed taking the freeway. In modeling the commuter's choice on day 2, it is tempting to use information observed from the previous day as an independent variable in the X vector. Conceptually, this may make sense because it could be capturing important habitual behavior. Such habitual behavior is called state dependence (Heckman, 1981). Unfortunately, the inclusion of such a state variable may also capture residual heterogeneity, which would lead one to observe spurious state dependence. To illustrate, suppose that the disturbance term includes unobserved characteristics (unobserved heterogeneity) that are commonly present among drivers selecting specific routes. If a variable indicating previous route selection is included, it is unclear whether the estimated parameter of this variable is capturing true habitual behavior (state dependence) or is simply picking up some mean commonality in drivers' disturbance terms (unobserved heterogeneity). This is an important distinction because the presence or absence of habitual behavior could lead one to draw significantly different behavioral conclusions. Isolating true state dependence from unobserved heterogeneity is not an easy task (see Heckman, 1981). As a result, extreme caution must be used when interpreting the parameters of variables that are based on previous outcomes.

One noteworthy general aspect of endogeneity in discrete outcome models relates to the effect of a single observation on the right-hand-side


(explanatory) variable. Consider the choice of a used vehicle (make, model, and vintage) in which price is an explanatory variable. Naturally, because there are a limited number of specific used vehicles, such as a 1992 Mazda Miata, an increase in selection probabilities will increase the price so that supply–demand equilibrium is met, ensuring that the demand for the 1992 Mazda Miata does not exceed the supply of 1992 Miatas. Thus one could argue that price is endogenous (as it would surely be in an aggregate regression model of vehicle demand). However, because the effect of any single observation on vehicle price is infinitesimal, many have contended that price is exogenous for estimation purposes. Still, if one were to forecast used-vehicle market shares using the model (using the summation of individual outcome probabilities as a basis), vehicle prices would need to be forecast internally as a function of total vehicle demand. As a result, prices must be treated as endogenous for forecasting. See Berkovec (1985) and Mannering and Winston (1987) for examples of such forecasting strategies.

Authors of recent work have argued that price variables still present a problem in individual discrete outcome models. This is because prices tend to be higher for products that have attributes that are observed by the consumer but not by the analyst. An example is unobserved aspects of vehicle quality (such as fit and finish) that result in a higher price. This missing variable sets up a correlation between the price variable and the disturbances that causes an obvious specification problem (see above).

11.5.4 Data Sampling

There are two general types of sampling strategies for collecting data to estimate discrete outcome models: random and stratified random sampling. All standard MNL model derivations assume that the data used to estimate the model are drawn randomly from a population of observations. Complications may arise in estimation when this is not the case. Stratified sampling refers to a host of nonrandom sampling alternatives. The idea of stratified sampling is that some known population of observations is partitioned into subgroups (strata), and random sampling is conducted in each of these subgroups. This type of sampling is particularly useful when one wants to gain information on a specific group that is a small percentage of the total population of observations (such as transit riders in the choice of transportation mode or households with incomes exceeding $200,000 per year). Note that random sampling is a special case of stratified sampling in which the number of observations chosen from each stratum is in exact proportion to the size of the stratum in the population of observations.

There are four special cases of stratified sampling: exogenous sampling, outcome-based sampling, enriched sampling, and double sampling.

Exogenous sampling refers to sampling in which selection of the strata is based on values of the X (right-hand-side) variables. For example, in the previous route-choice example (Example 11.1), to ensure that users of older vehicles are well represented, one may have chosen (for the 151 observations)


to sample 75 commuters with vehicles of age 10 years or younger and 76 commuters commuting in vehicles older than 10 years. In such cases, Manski and Lerman (1977) have shown that standard maximum likelihood estimation (treating the sample as though it were a random sample) is appropriate.

Outcome-based sampling may be used to obtain a sufficient representation of a specific outcome or may be an artifact of the data-gathering process. For example, the percentage of individuals choosing public transit as a mode from home to work in certain U.S. cities may be very small (on the order of 5% or less). To obtain a sufficient representation of transit riders, one may survey transit users at bus stops and thus overrepresent transit users in the overall mode-choice sample. If the proportions of outcomes in the sample are not equal to the proportions of outcomes in the overall population, an estimation correction must be made. The correction is straightforward, providing that a full set of outcome-specific constants is specified in the model (because adding a constant to every outcome function leaves the probabilities unchanged, only I – 1 constants can be specified, where I is the number of outcomes). Under these conditions, standard MNL estimation correctly estimates all parameters except the outcome-specific constants. To correct the constant estimates, each constant must have the following subtracted from it:

$$\mathrm{LN}\!\left(\frac{SF_i}{PF_i}\right), \qquad (11.43)$$

where SF_i is the fraction of observations having outcome i in the sample and PF_i is the fraction of observations having outcome i in the total population.

Example 11.6

In Example 11.1 a random sample was used, and the population fractions were 0.219 choosing the arterial, 0.682 choosing the two-lane road, and 0.099 choosing the freeway. Suppose instead that the estimation results shown in Table 11.2 had been obtained by equal sampling of the three alternatives (SF_a = 0.333, SF_t = 0.333, SF_f = 0.333). The constant estimates would then have to be corrected. Using the estimated constants in Table 11.2 and applying Equation 11.43, the corrected constant estimates are 0 – LN(0.333/0.219) = –0.419 for the arterial, 1.65 – LN(0.333/0.682) = 2.37 for the two-lane road (the logarithm is negative here, so the correction raises the constant), and –3.20 – LN(0.333/0.099) = –4.41 for the freeway.
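The arithmetic of Equation 11.43 is easily scripted. The following Python fragment reproduces the Example 11.6 correction; the sample fractions, population fractions, and constants are taken directly from the example above.

```python
import numpy as np

# Correcting outcome-specific constants for outcome-based sampling
# (Equation 11.43), using the Example 11.6 figures.
SF = {'arterial': 0.333, 'two_lane': 0.333, 'freeway': 0.333}   # sample fractions
PF = {'arterial': 0.219, 'two_lane': 0.682, 'freeway': 0.099}   # population fractions
constants = {'arterial': 0.0, 'two_lane': 1.65, 'freeway': -3.20}

corrected = {i: constants[i] - np.log(SF[i] / PF[i]) for i in constants}
print(corrected)   # approx. arterial -0.419, two_lane 2.37, freeway -4.41
```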

Enriched sampling and double sampling are the last two general types of stratified sampling. Enriched sampling is the merging of a random (or stratified random) sample with a sample of another type. For example, a random sample of route choices may be merged with a sample of commuters observed taking one of the routes, such as the freeway. Some types of enriched sampling problems reduce to the same correction used for outcome-based samples. Others may result in estimation complications (see Cosslett, 1981).


Finally, double sampling usually refers to the process where information from a random sample is used to direct the gathering of additional data, often targeted at oversampling underrepresented components of the population. Estimation of MNL models with double sampling can complicate the likelihood function. The reader is referred to Manski and Lerman (1977), Manski and McFadden (1981), and Cosslett (1981) for details on estimation alternatives in sampling and additional sampling strategies.

11.5.5 Forecasting and Aggregation Bias

When using logit models for forecasting aggregate impacts, there is a potential for forecasting bias that arises from the nonlinearity of the model. To illustrate this problem, consider what happens in a simple case when a population consists of two observations, a and b, in a model with one independent variable, x. To predict the population probabilities, one approach is to use the average x in the model to arrive at a probability of outcome i, over the population, of P(i|x_avg). However, Figure 11.4 shows that this is incorrect because of the nonlinearity of outcome probabilities with respect to x. In the two-observation case, the correct population probability is the average of the individual probabilities of a and b, as shown by P_ab(i|x_ab) in Figure 11.4. Figure 11.4 shows the bias that can result from using x_avg instead of averaging probabilities over the entire population. Given this potential bias, the forecasting problem becomes one of minimizing this bias (often referred to as aggregation bias), since one rarely has the entire population needed to forecast outcome shares in the population.

FIGURE 11.4 Example of population aggregation bias. (Outcome probability plotted against x, showing P(i|x_a), P(i|x_b), their average P_ab(i|x_ab), and the biased P(i|x_avg).)


Theoretically, all biases in estimating outcome shares in a population are eliminated by the equation

$$S_i = \int_{X} g_i(X)\, h(X)\, dX, \qquad (11.44)$$

where S_i is the population share of outcome i, g_i(X) is the functional representation of the model, and h(X) is the distribution of model variables over the population. The problem in applying this equation is that h(X) is not completely known. However, there are four common approximations to Equation 11.44 that are used in practice: sample enumeration, density functions, distribution moments, and classification.

• Sample enumeration. This procedure involves using the same sample that was used to estimate the model to predict the population probabilities. Outcome probabilities for all observations in the sample are computed, and these probabilities are averaged to approximate the true population outcome probabilities for each alternative. In doing this with the route-choice data used to estimate the model in Example 11.1, average route-choice probabilities are 0.218, 0.684, and 0.098 for the arterial, two-lane road, and freeway, respectively. To illustrate the possible bias, if average values of X are used instead, population route-choice probabilities are estimated to be 0.152, 0.794, and 0.053 for the arterial, two-lane road, and freeway, respectively, which are far from the observed shares of 0.219, 0.682, and 0.099 (a short sketch of this comparison follows this list). Sample enumeration represents an excellent approximation to Equation 11.44 but is restrictive when transferring results to other populations.

• Density functions. Constructing density function approximations of the x can approximate Equation 11.44. The advantage of this approach is its great flexibility in applying results to different populations. The disadvantages are that the functions are often difficult to construct on theoretical or empirical grounds and that it is difficult to capture covariances among the x.

• Distribution moments. This approach attempts to describe h(X) by considering moments and cross moments to represent the spread and shape of variable distributions and their interactions in the population. The major limitation of this approach is the gathering of theoretical and/or empirical information to support the representation of moments and cross moments.

• Classification. The classification approach attempts to categorize the population into nearly homogeneous groups and to use averages of the x from these groups. This approach is easy to apply, but the assumption of homogeneity among population groups is often dubious and can introduce considerable error.
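The sample-enumeration sketch promised above compares the correct average-of-probabilities calculation against the biased probability-at-average-X shortcut. The parameter values and data below are illustrative only; they are not the Example 11.1 estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([[0.0, 0.0], [1.65, -1.1], [-3.2, -0.7]])   # hypothetical constants, slopes
X = np.column_stack([np.ones(500), rng.exponential(5.0, 500)])

def probs(x):
    v = x @ beta.T                                   # utility of each alternative
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

enumerated = probs(X).mean(axis=0)   # correct: average the individual probabilities
at_mean = probs(X.mean(axis=0))      # biased: probability evaluated at the average X
print(enumerated, at_mean)           # the two generally differ, illustrating Figure 11.4
```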


For additional information on forecasting and possible aggregation bias, the interested reader should see Koppelman (1975) for a detailed exploration of the topic and Mannering and Harrington (1981) for an example of an application of alternative bias-reduction techniques.

11.5.6 Transferability

A concern with all models is whether their estimated parameters are transferable spatially (among regions or cities) or temporally (over time). From a spatial perspective, transferability is desirable because it means that parameters of models estimated in other places can be used, thus saving the cost of additional data collection and estimation. Temporal transferability ensures that forecasts made with the model have some validity in that the estimated parameters are stable over time. When testing spatial and temporal transferability, likelihood ratio tests are applied. Suppose the transferability of parameters between two regions, a and b, is tested (or, equivalently, between two time periods). One could conduct the following likelihood ratio test:

$$\chi^2 = -2\left[LL(\beta_T) - LL(\beta_a) - LL(\beta_b)\right], \qquad (11.45)$$

where LL(β_T) is the log likelihood at convergence of the model estimated with the data from both regions (or both time periods), LL(β_a) is the log likelihood at convergence of the model using region a data, and LL(β_b) is the log likelihood at convergence of the model using region b data. In this test the same variables are used in all three models (total model, region a model, and region b model). This statistic is χ² distributed with degrees of freedom equal to the sum of the number of estimated parameters in all regional models (a and b in this case, but additional regions can be added to this test) minus the number of estimated parameters in the overall model. The resulting χ² statistic provides the probability that the models have different parameters. Alternatively, one could conduct the following test:

$$\chi^2 = -2\left[LL(\beta_{ba}) - LL(\beta_a)\right], \qquad (11.46)$$

where LL(β_ba) is the log likelihood at convergence of a model applying the converged parameters of region b (estimated using only region b data) to the data of region a (that is, restricting the parameters to be the estimated parameters of region b), and LL(β_a) is the log likelihood at convergence of the model using region a data. This test can also be reversed, using LL(β_ab) and LL(β_b). The statistic is χ² distributed with degrees of freedom equal to the number of estimated parameters in β_ba, and the resulting χ² statistic provides the probability that the models have different parameters. The combination of these two tests yields a good assessment of the model's transferability.
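As a sketch of how these tests are computed in practice, the following Python fragment applies Equation 11.45 to placeholder log-likelihood values and parameter counts; only the test arithmetic is shown, not the model estimation.

```python
from scipy.stats import chi2

# Likelihood ratio test for transferability (Equation 11.45); all values
# below are placeholders, not estimates from this chapter.
LL_total, k_total = -650.2, 8    # pooled model (both regions)
LL_a, k_a = -310.5, 8            # region a model
LL_b, k_b = -330.1, 8            # region b model

x2 = -2.0 * (LL_total - LL_a - LL_b)
df = (k_a + k_b) - k_total       # regional parameters minus pooled parameters
p = 1.0 - chi2.cdf(x2, df)       # small p: reject parameter equality (no transferability)
print(x2, df, p)
```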

A key criterion in demonstrating the spatial or temporal transferability of models is that they be well specified. Omitted variables and other specification errors may lead one to erroneously reject transferability.


11.6 Nested Logit Model (Generalized Extreme Value Model)

As noted previously, a restrictive property of the MNL model is the independence of irrelevant alternatives (IIA). Recall that this property arises from the model's derivation, which assumes independence of disturbance terms (ε_in) among alternative outcomes. To overcome the IIA limitation of simple MNL models, McFadden (1981) developed a class of models known as generalized extreme value (GEV) models, which includes the MNL model and extensions. McFadden's derivation of this class of models, based on the assumption of generalized extreme value distributed disturbance terms, allows IIA problems to be readily addressed. The nested logit model is one of the more commonly used models in this class. The idea behind a nested logit model is to group alternative outcomes suspected of sharing unobserved effects into nests (this sharing sets up the disturbance-term correlation that violates the derivation assumption). Because the outcome probabilities are determined by differences in the functions determining these probabilities (both observed and unobserved; see Equation 11.17), shared unobserved effects cancel out in each nest, providing that all alternatives in the nest share the same unobserved effects. This canceling out will not occur if a nest (group of alternatives) contains some alternative outcomes that share unobserved effects and others that do not (this sets up an IIA violation within the nest).

As an example of a nested structure, consider the route-choice problem described in Example 11.1. Suppose it is suspected that the arterial and two-lane road share unobserved elements (i.e., they are lower-level roads relative to the freeway, with no access control and lower design speeds). When developing a nested structure to deal with the suspected disturbance-term correlation, the structure shown in Figure 11.5 is used. By grouping the arterial and two-lane road in the same nest, their shared unobserved elements cancel.

FIGURE 11.5 Nested structure of route choice with shared disturbances between arterial and two-lane road. (Upper level: Nonfreeway and Freeway; lower level, under Nonfreeway: Arterial and Two-Lane.)


Mathematically, McFadden (1981) has shown that the GEV disturbance assumption leads to the following model structure for observation n choosing outcome i:

$$P_n(i) = \frac{\mathrm{EXP}\left[\beta_i X_{in} + \phi_i LS_{in}\right]}{\sum_{\forall I} \mathrm{EXP}\left[\beta_I X_{In} + \phi_I LS_{In}\right]} \qquad (11.47)$$

$$P_n(j \mid i) = \frac{\mathrm{EXP}\left[\beta_{j|i} X_{jn}\right]}{\sum_{\forall J} \mathrm{EXP}\left[\beta_{J|i} X_{Jn}\right]} \qquad (11.48)$$

$$LS_{in} = \mathrm{LN}\!\left[\sum_{\forall J} \mathrm{EXP}\left(\beta_{J|i} X_{Jn}\right)\right], \qquad (11.49)$$

where P_n(i) is the unconditional probability of observation n having discrete outcome i; the X are vectors of measurable characteristics that determine the probability of discrete outcomes; the β are vectors of estimable parameters; P_n(j|i) is the probability of observation n having discrete outcome j conditioned on the outcome being in outcome category i (for example, for the nested structure shown in Figure 11.5, the outcome category i would be nonfreeway and P_n(j|i) would be the binary logit model of the choice between the arterial and two-lane road); J is the conditional set of outcomes (conditioned on i); I is the unconditional set of outcome categories (the upper two branches of Figure 11.5); LS_in is the inclusive value (logsum); and φ_i is an estimable parameter. Note that this equation system implies that the unconditional probability of having outcome j is

$$P_n(j) = P_n(i) \times P_n(j \mid i). \qquad (11.50)$$

Estimation of a nested model is usually done in a sequential fashion. The procedure is first to estimate the conditional model (Equation 11.48) using only the observations in the sample that are observed having discrete outcomes in J. In the example illustrated in Figure 11.5, this is a binary model of commuters observed taking the arterial or the two-lane road. Once these estimation results are obtained, the logsum is calculated (this is the denominator of one or more of the conditional models; see Equations 11.48 and 11.49) for all observations, both those selecting an outcome in J and those not (for all commuters in our example case). Finally, these computed logsums (in our example there is just one) are used as independent variables in the functions shown in Equation 11.47. Note that not all unconditional outcomes need to have a logsum in their respective functions (the example shown in Figure 11.5 would have a logsum present only in the function for the nonfreeway choice).
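A compact Python sketch of this two-step logic for the Figure 11.5 structure follows. The lower-level functions V_art and V_two and the upper-level functions V_nonfw and V_fw are placeholders standing in for fitted values from the two estimation steps; no estimation code is shown.

```python
import numpy as np

def logsum(V_art, V_two):
    # Inclusive value for the nonfreeway nest (Equation 11.49)
    return np.log(np.exp(V_art) + np.exp(V_two))

def p_nonfreeway(V_nonfw, V_fw, LS, phi):
    # Upper-level probability (Equation 11.47), with the logsum entering
    # only the nonfreeway function, scaled by the estimable parameter phi
    num = np.exp(V_nonfw + phi * LS)
    return num / (num + np.exp(V_fw))

# Step 1: estimate the binary lower model on nonfreeway commuters only.
# Step 2: compute the logsum for ALL commuters from the fitted lower model.
LS = logsum(V_art=-1.2, V_two=0.3)
# Step 3: estimate the upper model with LS as an independent variable.
print(p_nonfreeway(V_nonfw=0.0, V_fw=-0.5, LS=LS, phi=0.91))
```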

Caution needs to be exercised when using the sequential estimation procedure described above, because it has been shown that sequential estimation of a nested logit model results in variance–covariance matrices that are too small, and thus the t statistics are inflated (typically by about


10 to 15%). This problem is resolved by estimating the entire model at once using full information maximum likelihood. See Greene (2000) for details on this approach.

It is important to note that the interpretation of the estimated parameter associated with the logsums (φ_i) has the following important elements:

• The φ_i must be greater than 0 and less than 1 in magnitude to be consistent with the nested logit derivation (see McFadden, 1981).

• If φ_i = 1, the assumed shared unobserved effects in the nest are not significant, and the nested model reduces to a simple MNL (see Equations 11.47 through 11.49 with φ_i = 1). In fact, for the example route-choice model, estimating the nest shown in Figure 11.5 yields a φ_i for the nonfreeway outcome of the upper nest that is not significantly different from 1, indicating that the simple MNL model is the appropriate structure and that the arterial and two-lane road do not share significant unobserved effects (this finding is also corroborated by the IIA test conducted in Example 11.5). When estimating the nested structure shown in Figure 11.5, φ_i for the nonfreeway choice is estimated as 0.91 with a standard error of 0.348. For testing whether this parameter estimate is significantly different from 1, the t statistic is (again, this is an approximation because of the non-normality in the logit model)

$$t^* = \frac{0.91 - 1}{0.348} = -0.259.$$

A one-tailed t test gives a confidence level of only 60%. It is important to exercise caution when using estimation packages that calculate t statistics, because they typically report values where the parameter is compared with zero as shown in Equation 11.29 (which would have been 2.62 in the case above). In the previous example, the 0.91 parameter estimate is significantly different from zero but not significantly different from 1.

• If φ_i is less than zero, then factors increasing the likelihood of an outcome being chosen in the lower nest will decrease the likelihood of the nest being chosen. For example, in the route-choice case, if distance on the arterial were reduced, the logsum on the nonfreeway nest would increase, but the negative φ_i would decrease the likelihood of a nonfreeway route being chosen. This does not make sense, because an improvement in a nonfreeway route should increase the likelihood of such a route being chosen.

• If φ_i is equal to zero, then changes in nest outcome probabilities will not affect the probability of nest selection, and the correct model structure is recursive (separated).


In closing, it is important to note that the nested structure can be estimated with more than the two levels shown in Figure 11.5, depending on the number of alternative outcomes and the hypothesized disturbance correlations. In addition, the nesting does not imply a decision tree or an ordering of how decisions are made; the proposed nesting structure conveys nothing about a hierarchical decision-making process. The nesting is purely an empirical method for eliminating IIA violations.

11.7 Special Properties of Logit Models

Two special properties of logit models are noteworthy: (1) subsampling of alternative outcomes for model estimation and (2) using the logsum to estimate consumer welfare effects. There are many discrete outcome situations where the number of alternative outcomes is very large. For example, to estimate which make, model, and vintage of vehicle someone may choose to own, a choice model might have thousands of alternative outcomes, such as a 2002 Ford Mustang, 2001 Ford Mustang, 2002 Honda Accord, 2001 Honda Accord, and so on. Fortunately, the independently and identically distributed extreme value distribution used in the derivation of the MNL model permits consistent estimation of model parameters using a subsample of the available outcome set (McFadden, 1978). The estimation procedure is one of reducing the set of outcomes available to each observation to a manageable size. In doing this, one must include the outcome observed for each observation in the estimation sample, and this is supplemented with additional outcome possibilities selected randomly from the complete outcome set (a different set of randomly chosen outcomes is generated for each observation). For example, suppose there are just two observations of individuals choosing over 3500 different makes, models, and vintages of vehicles. Observation 1 is observed owning a 1994 Mazda Miata and observation 2 is observed owning a 2001 Dodge Caravan. If ten outcomes are made available for estimation, the outcome set of observation 1 would include the 1994 Mazda Miata (with associated attributes as the X vector) in addition to nine other make, model, and vintage options drawn randomly from the 3500 available vehicles. The outcome set of observation 2 would include a 2001 Dodge Caravan (with associated attributes as the X vector) and nine other make, model, and vintage options drawn randomly from the 3500 available vehicles. This subsampling of alternative outcomes is legitimate for estimation of the β, but when one estimates the probabilities and logsums, the entire set of outcome alternatives must be used (enumeration through the entire set of outcomes, not just the subsample, must be conducted to estimate outcome probabilities). See Mannering and Winston (1985) for an example of this sampling approach using a nested logit modeling structure of vehicle choice.
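A minimal sketch of this outcome-set reduction follows; the alternative identifiers are made up, and the only requirement implemented is that each observation's chosen alternative is retained while the remaining alternatives are drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimation_set(chosen, full_set, size=10):
    # Keep the observed (chosen) outcome and draw size - 1 others at random,
    # producing a different reduced outcome set for each observation.
    others = [a for a in full_set if a != chosen]
    draw = rng.choice(len(others), size - 1, replace=False)
    return [chosen] + [others[k] for k in draw]

full_set = [f"vehicle_{k}" for k in range(3500)]
print(estimation_set("vehicle_42", full_set))   # 1 chosen + 9 random alternatives
```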


When logit models are coupled with the theory of utility maximization, the denominator of the logit probability (the logsum) can be used to compute important welfare effects. The basis for this calculation is the concept of compensating variation (CV), which is the (hypothetical) amount of money that individuals would be paid (or debited) to make them as well off after a change in X as they were prior to the change in X. Small and Rosen (1981) show that the CV for each observation n is

$$CV_n = \frac{1}{\lambda}\left\{ \mathrm{LN}\!\left[\sum_{\forall I} \mathrm{EXP}\big(\beta_I X_{In}^{f}\big)\right] - \mathrm{LN}\!\left[\sum_{\forall I} \mathrm{EXP}\big(\beta_I X_{In}^{o}\big)\right] \right\}, \qquad (11.51)$$

where λ is the marginal utility of income, X^o refers to initial values of X, X^f refers to final values of X (after a policy change), and all other terms are as defined previously. In most applications the marginal utility of income is equal in magnitude but opposite in sign to the cost parameter associated with alternative outcomes (the cost parameter estimated in the discrete outcome model). The procedure is to apply Equation 11.51 to the sample (by enumerating all observations) and then to expand the results to the population (using the proportion of the sample in the overall population as a basis for this expansion). CV is a powerful tool and has been used to estimate the value consumers place on automobile safety devices (Winston and Mannering, 1984), the value of high-occupancy vehicle lanes (Mannering and Hamed, 1990b), and the value of various transportation modes in urban areas (Niemeier, 1997).
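The following is a small sketch of Equation 11.51 for a single observation. The fitted functions V0 and V1 and the marginal utility of income lam are placeholder values, not estimates from any model in this chapter.

```python
import numpy as np

def cv(V0, V1, lam):
    # Compensating variation (Equation 11.51): difference of logsums before
    # and after the change in X, scaled by the marginal utility of income.
    return (np.log(np.exp(V1).sum()) - np.log(np.exp(V0).sum())) / lam

V0 = np.array([-0.9, 0.4, -1.2])    # beta_I * X_I at initial values
V1 = np.array([-0.9, 0.1, -1.2])    # e.g., a policy that worsens alternative 2
print(cv(V0, V1, lam=0.05))         # negative CV: consumers are worse off
```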

11.8 Mixed MNL Models

As noted previously, a general specification concern in the application of logit models is possible random variation in parameters across observations. Such random parameter variation presents a serious specification problem, resulting in inconsistent estimates of parameters and outcome probabilities. To address this problem, mixed MNL models have been developed. As with standard MNL models, a linear function T_in is specified to determine the discrete outcome i for observation n, such that

$$T_{in} = \beta_i X_{in} + \varepsilon_{in}. \qquad (11.52)$$


As in the standard MNL case, X_in is a vector of the observable characteristics (covariates) that determine discrete outcomes for observation n, and ε_in is assumed to be independently and identically extreme value Type I distributed. However, instead of assuming that β_i does not vary over observations as in the standard logit formulation, the mixed logit formulation has β_i as a vector of estimable parameters for discrete outcome i that varies across observations. This variation is with density q(β_i|φ), where φ is a vector of parameters of the density distribution. The density distribution, q(β_i|φ), is referred to as the mixing distribution. The resulting outcome probabilities are determined from

$$P_n(i \mid \varphi) = \int \frac{\mathrm{EXP}\left(\beta_i X_{in}\right)}{\sum_{\forall I} \mathrm{EXP}\left(\beta_I X_{In}\right)}\, q(\beta_i \mid \varphi)\, d\beta_i, \qquad (11.53)$$

where P_n(i|φ) is the mixed logit outcome probability, which is the probability of outcome i conditional on q(β_i|φ). It can be shown that this mixed logit formulation does not have the IIA property of the standard logit model. Furthermore, any discrete outcome model can be approximated by a mixed logit model with an appropriate specification of q(β_i|φ) (see McFadden and Train, 2000). This approximation makes the mixed MNL model a versatile model that overcomes the IIA restriction and the assumption of β_i that are invariant over observations in the standard MNL model. The reader is referred to McFadden and Train (2000) for additional information on this class of models and a review of other work on mixed logit models.
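In practice the integral in Equation 11.53 is approximated by simulation. The sketch below averages standard logit probabilities over random draws from an assumed normal mixing distribution; the dimensions and parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_logit_prob(X, i, b, s, draws=1000):
    # Simulated mixed logit probability: average the logit probability over
    # draws of beta from an assumed normal mixing density q(beta | b, s).
    p = 0.0
    for _ in range(draws):
        beta = rng.normal(b, s)           # one draw from the mixing density
        v = X @ beta                       # utility of each alternative
        e = np.exp(v - v.max())
        p += e[i] / e.sum()
    return p / draws                       # averaging approximates the integral

X = np.array([[1.0, 2.0], [1.0, 3.5], [1.0, 1.0]])   # 3 alternatives, 2 variables
print(mixed_logit_prob(X, i=0, b=np.array([0.5, -0.4]), s=np.array([0.2, 0.1])))
```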

11.9 Models of Ordered Discrete Data

In many transportation applications discrete data are ordered. Examples include situations where respondents are asked for quantitative ratings (for example, "on a scale from 1 to 10, rate the following"), ordered opinions (for example, disagree, neutral, agree), or categorical frequency data (for example, property damage only crash, injury crash, and fatal crash). Although these response data are discrete, as were the data used in the models discussed previously in this chapter, applying the standard or nested multinomial discrete models presented earlier does not account for the ordinal nature of the data, and the ordering information is lost. Amemiya (1985) showed that if an unordered model (such as the MNL model) is used to model ordered data, the model parameter estimates remain consistent, but there is a loss of efficiency. To address the problem of ordered discrete data, ordered probability models have been developed. However, because of the restrictions placed on how variables affect ordered discrete outcome probabilities, not all ordered data are best modeled using ordered probability models.


In some cases unordered probability models are preferred, even though the data are ordinal. Such cases are explained later in this section.

Ordered probability models (both probit and logit) have been in widespread use since the mid-1970s (see McKelvey and Zavoina, 1975). Ordered probability models are derived by defining an unobserved variable, z, that is used as a basis for modeling the ordinal ranking of the data. This unobserved variable is typically specified as a linear function for each observation (n subscripting omitted), such that

$$z = \beta X + \varepsilon, \qquad (11.54)$$

where X is a vector of variables determining the discrete ordering for observation n, β is a vector of estimable parameters, and ε is a random disturbance. Using this equation, observed ordinal data, y, for each observation are defined as

$$\begin{aligned}
y &= 1 \quad \text{if } z \le \mu_0 \\
y &= 2 \quad \text{if } \mu_0 < z \le \mu_1 \\
y &= 3 \quad \text{if } \mu_1 < z \le \mu_2 \\
&\;\vdots \\
y &= I \quad \text{if } z \ge \mu_{I-2},
\end{aligned} \qquad (11.55)$$

where the μ are estimable parameters (referred to as thresholds) that define y, which corresponds to integer ordering, and I is the highest integer-ordered response. Note that during estimation, non-numerical orderings such as never, sometimes, and frequently are converted to integers (for example, 1, 2, and 3) without loss of generality.

The μ are parameters that are estimated jointly with the model parameters (β). The estimation problem then becomes one of determining the probability of the I specific ordered responses for each observation n. This determination is accomplished by making an assumption about the distribution of ε. If ε is assumed to be normally distributed across observations with mean = 0 and variance = 1, an ordered probit model results, with ordered selection probabilities as follows:

$$\begin{aligned}
P(y = 1) &= \Phi(-\beta X) \\
P(y = 2) &= \Phi(\mu_1 - \beta X) - \Phi(-\beta X) \\
P(y = 3) &= \Phi(\mu_2 - \beta X) - \Phi(\mu_1 - \beta X) \\
&\;\vdots \\
P(y = I) &= 1 - \Phi(\mu_{I-2} - \beta X),
\end{aligned} \qquad (11.56)$$

© 2003 by CRC Press LLC

Page 293: Data Analysis Book

where Φ(.) is the cumulative standard normal distribution function,

$$\Phi(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} \mathrm{EXP}\!\left(-\frac{w^2}{2}\right) dw. \qquad (11.57)$$

Note that in Equation 11.56, the threshold μ_0 is set equal to zero without loss of generality (this implies that one need only estimate I – 2 thresholds). Figure 11.6 provides an example with five possible ordered outcomes. For estimation, Equation 11.56 is written as

$$P(y = i) = \Phi(\mu_i - \beta X) - \Phi(\mu_{i-1} - \beta X), \qquad (11.58)$$

where μ_i and μ_{i–1} represent the upper and lower thresholds for outcome i. The likelihood function over the population of N observations is

$$L(\beta, \mu \mid y, X) = \prod_{n=1}^{N} \prod_{i=1}^{I} \left[\Phi(\mu_i - \beta X_n) - \Phi(\mu_{i-1} - \beta X_n)\right]^{\delta_{in}}, \qquad (11.59)$$

where δ_in is equal to 1 if the observed discrete outcome for observation n is i, and zero otherwise. This equation leads to the log likelihood

$$LL = \sum_{n=1}^{N} \sum_{i=1}^{I} \delta_{in}\, \mathrm{LN}\!\left[\Phi(\mu_i - \beta X_n) - \Phi(\mu_{i-1} - \beta X_n)\right]. \qquad (11.60)$$

Maximization of this log likelihood function is subject to the constraint μ_0 ≤ μ_1 ≤ μ_2 ≤ ... ≤ μ_{I–2}. If the assumption is made that ε in Equation 11.54 is

FIGURE 11.6 Illustration of an ordered probability model with μ_0 = 0. (The density f(ε) is partitioned by –βX, μ_1 – βX, μ_2 – βX, and μ_3 – βX into regions corresponding to y = 1 through y = 5.)


logistically distributed across observations with mean = 0 and variance = 1, an ordered logit model results, and the derivation proceeds the same as for the ordered probit model. Because the ordered probit model is not afflicted with the estimation difficulties encountered in estimating the multinomial probit model for unordered discrete data, the ordered probit is usually chosen over the ordered logit because of its underlying assumption of normality.
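To make the estimation problem concrete, the following is a bare-bones Python sketch of the ordered probit log likelihood in Equation 11.60, with μ_0 fixed at zero and the threshold ordering enforced through a cumulative-sum reparameterization. It is a sketch under these assumptions, not a production estimator (no standard errors or convergence diagnostics).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_ll(params, X, y, I):
    # params = [beta (k values), I - 2 free threshold increments];
    # y holds integer outcomes in {1, ..., I}.
    k = X.shape[1]
    beta, mu_free = params[:k], params[k:]
    # mu = [-inf, 0, mu_1, ..., mu_{I-2}, +inf]; cumsum of absolute
    # increments keeps the thresholds increasing.
    mu = np.concatenate([[-np.inf, 0.0], np.cumsum(np.abs(mu_free)), [np.inf]])
    z = X @ beta
    p = norm.cdf(mu[y] - z) - norm.cdf(mu[y - 1] - z)    # Equation 11.58
    return -np.sum(np.log(np.clip(p, 1e-12, None)))      # negative of Equation 11.60

# Usage on data X (N x k) and outcomes y:
# start = np.concatenate([np.zeros(X.shape[1]), np.ones(I - 2)])
# fit = minimize(neg_ll, start, args=(X, y, I))
```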

In terms of evaluating the effect of individual estimated parameters in ordered probability models, Figure 11.7 shows that a positive value of β_k implies that an increase in x_k will unambiguously increase the probability that the highest ordered discrete category results (y = 5 in Figure 11.7) and unambiguously decrease the probability that the lowest ordered discrete category results (y = 1 in Figure 11.7).

The problem with ordered probability models lies in the interpretation of the intermediate categories (y = 2, y = 3, and y = 4 in Figure 11.7). Depending on the location of the thresholds, it is not necessarily clear what effect a positive or negative β_k has on the probabilities of these "interior" categories. This difficulty arises because the areas between the shifted thresholds may yield increasing or decreasing probabilities after shifts to the left or right (see Figure 11.7). For example, suppose the five categories in Figure 11.7 were disagree strongly, disagree, neutral, agree, and agree strongly. It is tempting to interpret a positive β_k as implying that an increase in x_k will increase the likelihood of agreeing, but this conclusion is incorrect because of the ambiguity in the interior category probabilities. The correct interpretation is that an increase in x_k increases the likelihood of agreeing strongly and decreases the likelihood of disagreeing strongly.

To obtain a sense of the direction of the effects on the interior categories, marginal effects are computed for each category. These marginal effects give the direction of the probability change for each category:

FIGURE 11.7 Illustration of an ordered probability model with an increase in βX (with μ_0 = 0). (The thresholds shift relative to the density f(ε), reallocating probability among y = 1 through y = 5.)


TABLE 11.3

Variables Available to Study High-Occupancy Vehicle (HOV) Lane Opinions

Variable No. | Variable Description
1 | Usual mode of travel: 0 if drive alone, 1 if two-person carpool, 2 if three-or-more-person carpool, 3 if van pool, 4 if bus, 5 if bicycle or walk, 6 if motorcycle, 7 if other
2 | Have used HOV lanes: 1 if yes, 0 if no
3 | If used HOV lanes, mode most often used: 0 in a bus, 1 in two-person carpool, 2 in three-or-more-person carpool, 3 in van pool, 4 alone in vehicle, 5 on motorcycle
4 | Sometimes eligible for HOV lane use but do not use: 1 if yes, 0 if no
5 | Reason for not using HOV lanes when eligible: 0 if slower than regular lanes, 1 if too much trouble to change lanes, 2 if HOV lanes are not safe, 3 if traffic moves fast enough, 4 if forget to use HOV lanes, 5 if other
6 | Usual mode of travel 1 year ago: 0 if drive alone, 1 if two-person carpool, 2 if three-or-more-person carpool, 3 if van pool, 4 if bus, 5 if bicycle or walk, 6 if motorcycle, 7 if other
7 | Commuted to work in Seattle a year ago: 1 if yes, 0 if no
8 | Have flexible work start times: 1 if yes, 0 if no
9 | Changed departure times to work in the last year: 1 if yes, 0 if no
10 | On average, number of minutes leaving earlier for work relative to last year
11 | On average, number of minutes leaving later for work relative to last year
12 | If changed departure times to work in the last year, reason: 0 if change in travel mode, 1 if increasing traffic congestion, 2 if change in work start time, 3 if presence of HOV lanes, 4 if change in residence, 5 if change in lifestyle, 6 if other
13 | Changed route to work in the last year: 1 if yes, 0 if no
14 | If changed route to work in the last year, reason: 0 if change in travel mode, 1 if increasing traffic congestion, 2 if change in work start time, 3 if presence of HOV lanes, 4 if change in residence, 5 if change in lifestyle, 6 if other
15 | Usually commute to or from work on Interstate 90: 1 if yes, 0 if no
16 | Usually commuted to or from work on Interstate 90 last year: 1 if yes, 0 if no
17 | On your past five commutes to work, how often have you used HOV lanes
18 | On your past five commutes to work, how often did you drive alone
19 | On your past five commutes to work, how often did you carpool with one other person
20 | On your past five commutes to work, how often did you carpool with two or more people
21 | On your past five commutes to work, how often did you take a van pool
22 | On your past five commutes to work, how often did you take a bus
23 | On your past five commutes to work, how often did you bicycle or walk
24 | On your past five commutes to work, how often did you take a motorcycle
25 | On your past five commutes to work, how often did you take a mode other than those listed in variables 18 through 24
26 | On your past five commutes to work, how often have you changed route or departure time
27 | HOV lanes save all commuters time: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
28 | Existing HOV lanes are being adequately used: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
29 | HOV lanes should be open to all traffic: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
30 | Converting some regular lanes to HOV lanes is a good idea: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly

© 2003 by CRC Press LLC

Page 296: Data Analysis Book

$$\frac{\partial P(y = i)}{\partial X} = \left[\phi(\mu_{i-1} - \beta X) - \phi(\mu_i - \beta X)\right]\beta, \qquad (11.61)$$

where φ(.) is the standard normal density function.
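A short sketch of Equation 11.61 follows; the parameter vector, inner thresholds, and evaluation point are placeholder values, not estimates from any model in this chapter.

```python
import numpy as np
from scipy.stats import norm

def marginal_effects(beta, mu_inner, x):
    # mu = [-inf, 0, mu_1, ..., mu_{I-2}, +inf], so row i below gives
    # dP(y = i)/dX = [phi(mu_{i-1} - z) - phi(mu_i - z)] * beta (Equation 11.61).
    mu = np.concatenate([[-np.inf, 0.0], mu_inner, [np.inf]])
    z = x @ beta
    return np.array([(norm.pdf(mu[i - 1] - z) - norm.pdf(mu[i] - z)) * beta
                     for i in range(1, len(mu))])

print(marginal_effects(np.array([0.5, -0.3]),        # placeholder beta
                       np.array([0.6, 0.9, 1.3]),    # placeholder thresholds
                       np.array([1.0, 2.0])))        # evaluation point x
```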

Example 11.7

A survey of 281 commuters was conducted in the Seattle metropolitan area. The survey's intent was to gather information on commuters' opinions of high-occupancy vehicle (HOV) lanes (lanes that are restricted for use by vehicles with two or more occupants). The variables obtained from this study are provided in Table 11.3.

Among other questions, commuters were asked whether they agreed with the statement "HOV lanes should be open to all vehicles, regardless of vehicle occupancy level" (variable number 29 in Table 11.3). The question provided the ordered responses strongly disagree, disagree, neutral, agree, and agree strongly, and the observed percentage frequencies of response in these five categories were 32.74, 21.71, 8.54, 12.10, and 24.91, respectively. To understand the factors determining commuter opinions, an ordered probit model of this survey question is appropriate.

TABLE 11.3 (CONTINUED)

Variables Available to Study High-Occupancy Vehicle (HOV) Lane Opinions

Variable No. | Variable Description
31 | Converting some regular lanes to HOV lanes is a good idea only if it is done before traffic congestion becomes serious: 0 if strongly disagree, 1 if disagree, 2 if neutral, 3 if agree, 4 if agree strongly
32 | Gender: 1 if male, 0 if female
33 | Age in years: 0 if under 21, 1 if 22 to 30, 2 if 31 to 40, 3 if 41 to 50, 4 if 51 to 64, 5 if 65 or older
34 | Annual household income (U.S. dollars): 0 if no income, 1 if 1 to 9,999, 2 if 10,000 to 19,999, 3 if 20,000 to 29,999, 4 if 30,000 to 39,999, 5 if 40,000 to 49,999, 6 if 50,000 to 74,999, 7 if 75,000 to 100,000, 8 if over 100,000
35 | Highest level of education: 0 if did not finish high school, 1 if high school, 2 if community college or trade school, 3 if college/university, 4 if post-college graduate degree
36 | Number of household members
37 | Number of adults in household (aged 16 or older)
38 | Number of household members working outside the home
39 | Number of licensed motor vehicles in the household
40 | Postal zip code of workplace
41 | Postal zip code of home
42 | Type of survey comment left by respondent regarding opinions on HOV lanes: 0 if no comment on HOV lanes, 1 if comment not in favor of HOV lanes, 2 if comment positive toward HOV lanes but critical of HOV lane policies, 3 if comment positive toward HOV lanes, 4 if neutral HOV lane comment

© 2003 by CRC Press LLC

Page 297: Data Analysis Book

Ordered probit estimation results are presented in Table 11.4. The results show that commuters who normally drive alone and are 50 years old or older are more likely to agree strongly, and less likely to disagree strongly, that HOV lanes should be open to all traffic. The findings also show that as income increases, and the more frequently a commuter is observed varying from normal routes and departure times (presumably in an effort to avoid traffic congestion), the more likely the commuter is to agree strongly (and the less likely to disagree strongly). Finally, commuters with flexible work hours were less likely to agree strongly (more likely to disagree strongly). Statistical assessment of this model is the same as that for other maximum likelihood models (t* statistics, ρ², χ² likelihood ratio tests, and so on). The estimated threshold values do not have any direct interpretation but are needed to compute outcome probabilities and marginal effects.
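As an illustration of how the Table 11.4 estimates translate into outcome probabilities (Equation 11.56), the following fragment computes the five response probabilities for a hypothetical commuter; the x vector is made up and follows the variable order in the table.

```python
import numpy as np
from scipy.stats import norm

beta = np.array([-0.806, 1.134, -0.267, 0.0028, 0.229, 0.073])   # Table 11.4 parameters
mu = np.array([-np.inf, 0.0, 0.636, 0.879, 1.26, np.inf])        # thresholds, mu_0 = 0
# Hypothetical commuter: constant, drives alone, no flexible hours,
# income of $40,000/year (in thousands), under 50, one route/time change.
x = np.array([1.0, 1.0, 0.0, 40.0, 0.0, 1.0])

z = x @ beta
p = norm.cdf(mu[1:] - z) - norm.cdf(mu[:-1] - z)   # P(y = 1..5), Equation 11.56
print(p, p.sum())                                  # five probabilities summing to 1
```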

As mentioned in the introductory paragraph of this section, there are times when an unordered probability model provides the best fit for ordered data. Such cases arise because ordered probability models place a restriction on how variables in X affect outcome probabilities. For example, suppose one is modeling the severity of vehicle accidents with outcomes of property damage only, injury, and fatality. These are clearly ordered data. Suppose that one of the key factors in determining the level of injury is whether or not an air bag was deployed. The air bag-deployment indicator variable in an ordered

TABLE 11.4

Ordered Probit Estimation Results for the Question "High-Occupancy Vehicle (HOV) Lanes Should Be Open to All Vehicles, Regardless of Vehicle Occupancy Level" (disagree strongly, disagree, neutral, agree, agree strongly)

Independent Variable | Estimated Parameter | t Statistic
Constant | -0.806 | -3.57
Drive-alone indicator (1 if commuter's usual mode is drive alone, 0 otherwise) | 1.134 | 6.66
Flexible work-time indicator (1 if commuter has flexible work hours, 0 otherwise) | -0.267 | -1.97
Commuter's income (in thousands of dollars per year) | 0.0028 | 1.52
Older age indicator (1 if commuter is 50 years old or older, 0 otherwise) | 0.229 | 1.36
Number of times in the past five commutes the commuter changed from normal routes or departure time | 0.073 | 1.39
Threshold 1 | 0.636 | 15.24
Threshold 2 | 0.879 | 22.57
Threshold 3 | 1.26 | 20.93

Number of observations: 281
Log likelihood at zero: -481.78
Log likelihood at convergence: -396.93
ρ²: 0.176


model would either increase the probability of a fatality (and decrease the probability of property damage only) or decrease the probability of a fatality (and increase the probability of property damage only). But the reality may be that the deployment of an air bag decreases fatalities but increases the number of minor injuries (from the air bag deployment itself). In an unordered discrete-modeling framework, estimation would result in the parameter for the air bag deployment variable having a negative value in the severity function for the fatality outcome and also a negative value for the property damage-only outcome (with the injury outcome having an increased probability in the presence of an air bag deployment). In this case, an ordered probability model is not appropriate because it does not have the flexibility to explicitly control interior category probabilities. One must be cautious in the selection of ordered and unordered models. A trade-off is inherently being made between recognizing the ordering of responses and losing the flexibility in specification offered by unordered probability models.


12 Discrete/Continuous Models

Statistical approaches for studying continuous data (ordinary least squares regression) and discrete data (logit and probit models) have been used for decades in the analysis of transportation data. However, researchers have also identified a class of transportation-related problems that involve interrelated discrete and continuous data. Examples include consumers' choice of the type of vehicle to own (discrete) and the number of kilometers to drive it (continuous), choice of route (discrete) and driving speed (continuous), and choice of trip-generating activity (discrete) and duration in the activity (continuous).

Interrelated discrete/continuous data are easily overlooked and sometimes difficult to identify. The transportation literature, and the literature in many other fields, is strewn with well-known examples of data that have been erroneously modeled as continuous data when the data are really part of an interrelated discrete/continuous process. The consequences of this oversight are serious: biased estimation results that could significantly alter the inferences and conclusions drawn from the data.

12.1 Overview of the Discrete/Continuous Modeling Problem

The problem of interrelated discrete/continuous data is best viewed as a problem of selectivity, with observed data being the outcome of a selection process that results in a nonrandom sample of data in the observed discrete categories. To better understand this, consider the route-choice problem presented previously in Chapter 11 (Example 11.1). In this example, commuters had a choice of three alternative routes for their morning commute (from home to work): a four-lane arterial (speed limit = 60 km/h, two lanes each direction), a two-lane highway (speed limit = 60 km/h, one lane each direction), and a limited-access four-lane freeway (speed limit = 90 km/h, two lanes each direction). If one were interested in studying average travel speed between work and home, a classic discrete/continuous problem arises. This is because the observed travel-speed data


arise from a self-selection process. As such, commuters observed on each of the three routes represent a nonrandom sample because commuters' choice of route is influenced by their propensity to drive fast. Commuters who tend to drive faster may be more likely to choose the freeway, while commuters who tend to drive slower may be more likely to choose the arterial. From a model estimation perspective, travel speeds are observed only on the chosen route. Thus, it is not known how fast observed arterial users would have driven had they taken the freeway, or how fast freeway users would have driven had they taken the arterial. If faster drivers are indeed more likely to choose the freeway and slower drivers more likely to choose the arterial, the speed behavior of arterial drivers on the freeway is likely to be different from the speed behavior of drivers observed on the freeway. If this interrelationship between route (discrete) and travel speed (continuous) is ignored, selectivity bias will result in the statistical models estimated to predict travel speed.

The selectivity bias problem is best illustrated by a simple example. Consider the estimation of a regression model of average travel speed on the freeway route from work to home:

$$s_{fn} = \beta_f X_n + \varepsilon_{fn}, \qquad (12.1)$$

where s_fn is the average speed of commuter n on the freeway, X_n is a vector of commuter n characteristics influencing average travel speed, β_f is a vector of estimable parameters, and ε_fn are unobserved characteristics influencing travel speed. The selectivity bias that would result from estimating this equation with observed data is shown in Figure 12.1. In this figure, commuter data

FIGURE 12.1 An illustration of self-selection bias in discrete/continuous data. (Observed freeway speeds s_n are plotted against β_f X_n; "Line 1" is the fit to observed freeway users only, "Line 2" the true relationship using all commuters.)


indicated by a "+" represent the data of observed freeway users, and commuter data indicated by a "–" represent the unobserved speeds of nonfreeway users had they chosen to drive on the freeway. Because freeway users are a self-selected group (for whatever reasons) of faster drivers (with faster drivers being more likely to choose the freeway), they are underrepresented at lower values of β_f X_n and overrepresented at higher values of β_f X_n. If Equation 12.1 is estimated with only the observed data (observed freeway users only), the resulting estimates will be biased, as indicated by "Line 1" in Figure 12.1. The true equation of freeway speed with all data (observed and unobserved) is given by "Line 2."

From an average speed-forecasting perspective, the problem with "Line 1" is that any policy causing a shift in route choice will result in biased forecasts of average speed because of the selectivity bias present in the equation estimation. For example, if the arterial were closed for reconstruction, the arterial users moving to the freeway would have very different unobserved characteristics (ε) determining their speeds than the ε of the originally observed freeway users. This problem would also exist if a single equation were estimated for the average speeds on all three routes, because each route contains a self-selected sample of commuters based on their speed preferences.

12.2 Econometric Corrections: Instrumental Variables and Expected Value Method

Perhaps the most simplistic method of addressing the misspecification induced by discrete/continuous processes is to use an instrumental variables approach. To do this, consider Example 11.1, with commuters selecting from three possible routes (freeway, four-lane arterial, and two-lane road), and a revision of Equation 12.1 such that the average travel speed to work on the chosen route of commuter n is

$$s_n = \beta X_n + \alpha Z_n + \varepsilon_n, \qquad (12.2)$$

where s_n is the average speed of commuter n on commuter n's chosen route; X_n is a vector of commuter n characteristics influencing average travel speed that are not a function of the route chosen (such as income and driver age); Z_n is a vector of characteristics commuter n faces that influence average travel speed and are a function of the route chosen (such as the number of traffic signals and travel distance); β and α are corresponding vectors of estimable parameters; and ε_n are unobserved characteristics influencing travel speed. Direct estimation of Equation 12.2 would result in biased and inefficient parameter estimates because Z_n is endogenous due to the discrete/continuous interrelation between travel speed and route choice. This


is because, as a commuter chooses a higher speed, elements in the Z_n vector will change, since the likelihood of selecting a specific route is interrelated with speed preferences. As was the case in simultaneous equations models involving all continuous variables (see Chapter 5), one could replace the elements of Z_n with estimates derived from regressing Z_n against all exogenous variables. The procedure consists of estimating regression equations for all elements of Z_n and using the regression-predicted values, Ẑ_n, to estimate Equation 12.2, such that

$$s_n = \beta X_n + \alpha \hat{Z}_n + \varepsilon_n. \qquad (12.3)$$

An alternative to the instrumental variables approach is to interact the endogenous variables directly with the corresponding discrete-outcome model (the route-choice model in this case). The most common approach is to replace the elements of the endogenous variable vector Z_n with their expected values. Thus, each element of Z_n, z_jn, is replaced by (with subscript j denoting elements of Z_n)

$$E[z_{jn}] = \sum_{\forall i} P_n(i)\, z_{ijn}, \qquad (12.4)$$

where P_n(i) is the predicted probability of commuter n selecting discrete outcome (route) i, as determined by a discrete outcome model, and z_ijn is the value of variable j for commuter n under outcome i. Equation 12.2 then becomes

$$s_n = \beta X_n + \sum_{\forall j} \alpha_j \left[\sum_{\forall i} P_n(i)\, z_{ijn}\right] + \varepsilon_n. \qquad (12.5)$$
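The expected-value replacement in Equation 12.4 is a simple probability-weighted average. The sketch below computes E[z_jn] for two hypothetical route-varying variables; the route probabilities and route attributes are made-up numbers, not the Example 12.1 data.

```python
import numpy as np

# Route-choice probabilities for one commuter, from a discrete outcome model:
# P_n(arterial), P_n(two-lane), P_n(freeway).
P = np.array([0.25, 0.60, 0.15])
# Each row holds one route-varying variable measured on each route.
z = np.array([[9.0, 12.0, 15.0],    # distance (km) on each route
              [11.0, 4.0, 1.0]])    # traffic signals on each route

Ez = z @ P                          # E[z_jn] = sum_i P_n(i) * z_ijn (Equation 12.4)
print(Ez)                           # expected values entering Equation 12.5
```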

Example 12.1

To demonstrate the expected value selectivity-bias correction method, consider the data used previously in Example 11.1. Recall that these data are from a survey of 151 commuters departing from the same origin (a large residential complex) and going to work in the same downtown area. Distance is measured precisely from the parking lot of origin to the parking lot of destination, so there is variance in distances among commuters even though they depart and arrive in the same general areas. Commuters are observed choosing one of three alternative routes: a four-lane arterial, a two-lane highway, and a limited-access four-lane freeway. To estimate a model of average commuting speed, the variables shown in Table 12.1 are available.

For the equation correction, to estimate the expected values of the endogenous variables that arise because of the discrete/continuous interaction, the route-choice probabilities are determined from the logit model estimated in Example 11.1. Recall that those logit model parameter estimates gave the following utility functions:

$V_a = -0.942(DISTA)$

$V_t = 1.65 - 1.135(DISTT) + 0.128(VEHAGE)$ (12.6)

$V_f = -3.20 - 0.694(DISTF) + 0.233(VEHAGE) + 0.764(MALE)$

where DISTA, DISTT, and DISTF are the distances of individual travelers on the arterial, two-lane highway, and freeway routes, respectively; VEHAGE is the age of the commuter's vehicle in years; and MALE is an indicator variable equal to 1 if the commuter is male and 0 if female.

These give estimated route probabilities:

$P_n(a) = \frac{e^{V_a}}{e^{V_a} + e^{V_t} + e^{V_f}}$, $P_n(t) = \frac{e^{V_t}}{e^{V_a} + e^{V_t} + e^{V_f}}$, $P_n(f) = \frac{e^{V_f}}{e^{V_a} + e^{V_t} + e^{V_f}}$. (12.7)
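The probabilities in Equation 12.7 are straightforward to evaluate. A sketch for a single hypothetical commuter (the distances, vehicle age, and gender value below are invented for illustration):

```python
# Route-choice probabilities (Equation 12.7) from the utility functions of
# Equation 12.6; the commuter characteristics below are hypothetical.
import numpy as np

def route_probabilities(dist_a, dist_t, dist_f, vehage, male):
    v = np.array([-0.942 * dist_a,                                         # V_a
                  1.65 - 1.135 * dist_t + 0.128 * vehage,                  # V_t
                  -3.20 - 0.694 * dist_f + 0.233 * vehage + 0.764 * male]) # V_f
    expv = np.exp(v)
    return expv / expv.sum()  # [P(a), P(t), P(f)], summing to 1

print(route_probabilities(dist_a=6.0, dist_t=8.0, dist_f=10.0, vehage=5.0, male=1))
```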

Table 12.2 gives estimation results for two models. The "uncorrected" model is estimated ignoring discrete/continuous interactions. The "corrected" model uses Equations 12.6 and 12.7 as input to estimate Equation 12.5.


TABLE 12.1
Variables Available for Home to Work Average Speed Model Estimation

1. Average commuting speed in km/h
2. Route chosen: 1 if arterial, 2 if rural road, 3 if freeway
3. Traffic flow rate at time of departure in vehicles per hour
4. Number of traffic signals on the selected route
5. Distance along the selected route in kilometers
6. Seat belts: 1 if wearing, 0 if not
7. Number of passengers in vehicle
8. Commuter age in years: 1 if younger than 23, 2 if 24 to 29, 3 if 30 to 39, 4 if 40 to 49, 5 if 50 and older
9. Gender: 1 if male, 0 if female
10. Marital status: 1 if single, 0 if married
11. Number of children (aged 16 or younger)
12. Annual household income (U.S. dollars): 1 if less than 20,000, 2 if 20,000 to 29,999, 3 if 30,000 to 39,999, 4 if 40,000 to 49,999, 5 if more than 50,000
13. Age of vehicle driven in years
14. Manufacturer origin of vehicle driven: 1 if domestic, 0 if foreign
15. Operating cost of trip based on vehicle fuel efficiency (in U.S. dollars)


Both models show that commuters tend to have higher average commute speeds if they have longer trips, have more children, and are male, and lower average commute speeds as the number of traffic signals they encounter increases and as their commute vehicle ages. Although the two models are consistent in the direction of the effects that the included variables have on average commute speeds, the magnitudes of these effects and their statistical significance vary widely in some cases. Specifically, in the corrected model, the number of traffic signals and vehicle age become statistically insignificant. These shifting results underscore the need to consider the discrete/continuous nature of the problem and the role that commuter selectivity plays in the model. For example, if the uncorrected model were used, erroneous inferences could be drawn concerning the number of traffic signals and vehicle age.

12.3 Econometric Corrections: Selectivity-Bias Correction Term

Another popular approach to resolve selectivity bias problems and arrive at unbiased parameter estimates is to develop an expression for a selectivity bias correction term (see Heckman, 1976, 1978, 1979). In the context of the example problem, this is done by noting that the average travel speed, s, for commuter n from home to work is written as (refer to Equation 12.1)

$E(s_n \mid i) = \boldsymbol{\beta}_i\mathbf{X}_n + E(\varepsilon_{in} \mid i)$, (12.8)

where $E(s_n \mid i)$ is the average commute speed of commuter n conditional on the chosen route i, $\mathbf{X}_n$ is a vector of commuter n characteristics influencing average travel speed, $\boldsymbol{\beta}_i$ is a vector of estimable parameters, and $E(\varepsilon_{in} \mid i)$ is the conditional expectation of the unobserved characteristics. Note that variables specific to route i ($\mathbf{Z}_n$ in Equation 12.2) are omitted from the right-hand side of this equation. Application of this equation provides bias-corrected and consistent estimates of $\boldsymbol{\beta}_i$. This is because the conditional expectation of $\varepsilon_{in}$, $E(\varepsilon_{in} \mid i)$, accounts for the nonrandomly observed commute speeds that are selectively biased by commuters' self-selected choice of route.

TABLE 12.2
Corrected and Uncorrected Regression Models of Average Commute Speed (t statistics in parentheses)

Independent Variable | Uncorrected Parameter Estimates | Corrected Parameter Estimates
Constant | 14.54 (2.97) | 15.37 (2.14)
Distance on chosen route in kilometers | 4.08 (7.71) | 4.09 (2.48)
Number of signals on chosen route | –0.48 (–1.78) | –1.30 (–0.81)
Number of children | 3.10 (2.16) | 2.85 (1.75)
Single indicator (1 if commuter is not married, 0 if married) | 3.15 (1.52) | 3.98 (1.70)
Vehicle age in years | –0.51 (–2.13) | –0.16 (–0.57)
Number of observations | 151 | 151
R2 | 0.32 | 0.13
Corrected R2 | 0.30 | 0.10

The problem then becomes one of deriving a closed-form representation of $E(\varepsilon_{in} \mid i)$ that can be used for equation estimation. Such a derivation was developed by Hay (1980) and Dubin and McFadden (1984) based on the assumption that the discrete-outcome portion of the problem is represented by a multinomial logit model.

To see this (suppressing the subscripting n), let $\boldsymbol{\xi}$ denote a vector of discrete-outcome disturbance terms ($\xi_1, \xi_2, \xi_3, \ldots, \xi_J$), where J is the total number of discrete outcomes. The conditional expectation of the continuous-equation disturbance $\varepsilon$ (conditioned on discrete outcome i) is written as

$E(\varepsilon \mid i) = \frac{1}{P_i}\int \varepsilon\, \Pr(i \mid \varepsilon)\, f(\varepsilon)\, d\varepsilon$, (12.9)

where $P_i$ is the probability of discrete outcome i and $f(\varepsilon)$ is the unconditional density of $\varepsilon$. If it is assumed that $\boldsymbol{\xi}$ is generalized extreme value distributed, $\sigma^2$ is the unconditional variance of $\varepsilon$, and $\rho_i$ is the correlation between $\varepsilon$ and the resulting discrete-outcome logistic error terms (resulting from the differencing of the $\xi_i$ and $\xi_j$), then Hay (1980) and Dubin and McFadden (1984) have shown that Equation 12.9 is

$E(\varepsilon \mid i) = \frac{\sqrt{6}}{\pi}\,\sigma\left[\sum_{j \ne i}^{J} \rho_j\left(\frac{P_j \ln P_j}{1 - P_j}\right) + \rho_i \ln P_i\right]$. (12.10)

When using Equation 12.10, selectivity bias in discrete/continuous models is corrected by undertaking the following three steps:

1. Estimate a multinomial logit model to predict the probabilities of discrete outcomes i for each observation.

2. Use the logit-predicted outcome probabilities to compute the portion of Equation 12.10 in large brackets ([.]) for each observation.

3. Use the values computed in step 2 to estimate Equation 12.8 using standard least squares regression methods. The term $\sqrt{6}\,\sigma\rho_i/\pi$ becomes a single estimable parameter.

Thus, Equation 12.8 is estimated for each route i as

$s_{in} = \boldsymbol{\beta}_i\mathbf{X}_n + \alpha_i\left[\sum_{j \ne i}^{J}\left(\frac{P_j \ln P_j}{1 - P_j}\right) + \ln P_i\right] + \eta_n$, (12.11)

where

$\alpha_i = \sqrt{6}\,\sigma\rho_i/\pi$,

$\eta_n = \varepsilon_n - \frac{\sqrt{6}}{\pi}\,\sigma\left[\sum_{j \ne i}^{J} \rho_j\left(\frac{P_j \ln P_j}{1 - P_j}\right) + \rho_i \ln P_i\right]$,

and $\eta_n$ is a disturbance term. It is common practice not to include $\rho_i$ in the summation over discrete outcomes J, as in Equation 12.11. By doing this, an equality restriction on the correlations $\rho_j$ across the discrete outcomes is imposed. This restriction is relaxed by moving $\rho_i$ within the summation, making it necessary to estimate a total of $J - 1$ selectivity bias parameters ($\alpha_j$) for each continuous equation corresponding to discrete outcome i. However, it has been shown empirically by Hay (1980) and Mannering (1986a) that this restriction on $\rho_i$ is reasonable.
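The three-step procedure is compact to code once the step 1 probabilities are in hand. The sketch below uses random placeholder probabilities, covariates, and speeds rather than the Example 12.2 data, and it imposes the common-correlation restriction so the bracketed term of Equation 12.10 carries a single estimable parameter:

```python
# Sketch of the three-step selectivity bias correction (Equations 12.10, 12.11)
# under the common-correlation restriction. All data are hypothetical.
import numpy as np

def correction_term(P, i):
    """Step 2: sum_{j != i} P_j ln(P_j) / (1 - P_j) + ln(P_i), per observation."""
    others = [j for j in range(P.shape[1]) if j != i]
    Pj = P[:, others]
    return (Pj * np.log(Pj) / (1.0 - Pj)).sum(axis=1) + np.log(P[:, i])

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=103)  # placeholder step 1 logit probabilities
X = rng.normal(size=(103, 4))            # placeholder covariates
s = rng.normal(size=103)                 # placeholder speeds on the chosen route

# Step 3: least squares with the correction appended as one extra regressor;
# its coefficient estimates alpha_i = sqrt(6) * sigma * rho_i / pi.
design = np.column_stack([np.ones(103), X, correction_term(P, i=1)])
params, *_ = np.linalg.lstsq(design, s, rcond=None)
```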

Example 12.2

The selectivity bias correction term approach is demonstrated by estimating an average commute speed model for the two-lane highway users, using the data from Example 12.1 and the discrete model estimated in Example 11.1. For comparison purposes, it is also interesting to estimate a model that excludes the selectivity bias correction term.

Model estimation results are presented in Table 12.3. The table shows that 103 of the 151 commuters chose the two-lane highway as their commute route.

TABLE 12.3
Corrected and Uncorrected Regression Models of Average Commute Speed (t statistics in parentheses)

Variable Description | Parameter Estimates with Selectivity Correction | Uncorrected Parameter Estimates
Constant | 77.04 (5.25) | 26.12 (5.03)
Number of passengers in vehicle | 4.31 (1.92) | 5.07 (2.14)
Seat belt indicator (1 if commuter wearing seat belts, 0 if not) | 3.72 (1.60) | 4.29 (1.74)
Gender indicator (1 if male, 0 if female) | 3.52 (1.53) | 3.19 (1.30)
Driver age in years | 0.23 (1.96) | 0.12 (1.57)
Annual household income (U.S. dollars) | –0.000175 (–2.01) | –0.00009 (–1.12)
Single indicator (1 if commuter is not married, 0 if married) | 4.18 (1.72) | 6.03 (2.40)
Selectivity bias correction $\alpha_k$ | 12.55 (3.60) | —
Number of observations | 103 | 103
R2 | 0.22 | 0.11
Corrected R2 | 0.16 | 0.05

For both the selectivity-corrected and uncorrected models, commuters who were men, had more passengers in their vehicle, wore seat belts, were older, and were not married tended to have higher commute speeds, while commuters with larger annual incomes tended to drive at slower speeds.

For the selectivity-corrected model, the selectivity bias parameter is significantly different from zero. This is a strong indication that sample selectivity is playing a role, because the null hypothesis of no selectivity bias is rejected with over 99% confidence, as indicated by the t statistic. The implication is that the model is seriously misspecified when the selectivity bias term is omitted. This is seen by inspecting the uncorrected parameter estimates: they show noticeable differences when compared with the parameters of the selectivity-corrected model, particularly the constant term. These findings show that if selectivity bias is ignored, erroneous inferences and conclusions are drawn from the estimation results.

12.4 Discrete/Continuous Model Structures

In developing discrete/continuous equation systems, one must provide a construct to link the discrete and continuous components. The most obvious of these constructs is a reduced form approach. A common way to implement a reduced form is to start with the discrete model. Let $T_{in}$ be a linear function that determines discrete outcome i for observation n, and let $y_{in}$ be the corresponding continuous variable in the discrete/continuous modeling system. Then write

$T_{in} = \boldsymbol{\beta}_i\mathbf{X}_{in} + \lambda_i y_{in} + \varepsilon_{in}$, (12.12)

where $\boldsymbol{\beta}_i$ is a vector of estimable parameters for discrete outcome i, $\mathbf{X}_{in}$ is a vector of observable characteristics (covariates) that determine discrete outcomes for observation n, $\lambda_i$ is an estimable parameter, and $\varepsilon_{in}$ is a disturbance term. Let the corresponding continuous equation be the linear function

$y_{in} = \boldsymbol{\theta}_i\mathbf{W}_{in} + \zeta_{in}$, (12.13)

where $\boldsymbol{\theta}_i$ is a vector of estimable parameters for the continuous variable observed for discrete outcome i, $\mathbf{W}_{in}$ is a vector of observable characteristics (covariates) that determine $y_{in}$, and $\zeta_{in}$ is a disturbance term. Equation 12.13 is estimated using ordinary least squares with an appropriate selectivity bias correction (such as adding a selectivity bias correction term). For estimation of the discrete outcome portion of the discrete/continuous process, note that $y_{in}$ is endogenous in Equation 12.12 because $y_{in}$ changes with changing $T_{in}$ due to the interrelated discrete/continuous structure. To avoid estimation problems, a reduced form model is obtained by substituting Equation 12.13 into Equation 12.12, giving


$T_{in} = \boldsymbol{\beta}_i\mathbf{X}_{in} + \lambda_i\boldsymbol{\theta}_i\mathbf{W}_{in} + \lambda_i\zeta_{in} + \varepsilon_{in}$. (12.14)

With Equation 12.14, a discrete outcome model is readily derived. For example, if the $\varepsilon_{in}$ are assumed to be generalized extreme value distributed, a multinomial logit model results, with the probability of observation n having outcome i given as (see Chapter 11)

$P_n(i) = \frac{\exp(\boldsymbol{\beta}_i\mathbf{X}_{in} + \lambda_i\boldsymbol{\theta}_i\mathbf{W}_{in})}{\sum_{\forall I}\exp(\boldsymbol{\beta}_I\mathbf{X}_{In} + \lambda_I\boldsymbol{\theta}_I\mathbf{W}_{In})}$. (12.15)

Note that because the term $\lambda_i\zeta_{in}$ does not vary across outcomes i, it cancels out and does not enter the logit model structure. Examples of applications of reduced form models are given in Hay (1980) and Mannering (1986a).
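Because $\lambda_i$ and $\boldsymbol{\theta}_i$ enter Equation 12.15 only through their product, a reduced form logit can be coded with a single composite coefficient vector per outcome, as in the sketch below (all arrays are hypothetical placeholders):

```python
# Reduced form logit sketch for Equation 12.15: lambda_i * theta_i enters as
# one composite coefficient vector per outcome. All arrays are hypothetical.
import numpy as np

def reduced_form_probs(X, W, beta, composite):
    """X, W: (n_obs, n_outcomes, n_covars); beta, composite: (n_outcomes, n_covars)."""
    util = np.einsum('nik,ik->ni', X, beta) + np.einsum('nik,ik->ni', W, composite)
    expu = np.exp(util)
    return expu / expu.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3, 2))
W = rng.normal(size=(5, 3, 2))
beta = rng.normal(size=(3, 2))
composite = rng.normal(size=(3, 2))  # lambda_i * theta_i, not separately identified
print(reduced_form_probs(X, W, beta, composite).sum(axis=1))  # each row sums to 1
```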

Another common approach to linking discrete/continuous model structures is based on utility theory and ensures economic consistency. If the discrete outcome model is based on utility maximization (see Chapter 11), a model structure is derived by noting that indirect utility (used as the basis to arrive at outcome probabilities) is related to commodity demand by Roy's identity (as discussed in Chapter 11),

$y_{in}^0 = -\,\frac{\partial V_{in}/\partial p_{in}}{\partial V_{in}/\partial Inc_n}$, (12.16)

where $V_{in}$ is the indirect utility of discrete alternative i to consumer n, $p_{in}$ is the consumer's unit price of consuming i, $Inc_n$ is the decision-maker's income, and $y_{in}^0$ is the utility-maximizing demand for i. The discrete/continuous link is made by either specifying the indirect utility function, $V_{in}$, or the commodity demand $y_{in}^0$.

As an example, consider consumers' choice of vehicle make, model, and vintage, and the annual usage of the vehicle (kilometers driven per year). This is a classic discrete/continuous problem because consumers select a vehicle type, defined by make (Ford), model (Mustang), and vintage (2001), with some expectation of how much it will be driven. Aggregate data support the relationship between vehicle type and usage. For example, it is well known that newer model vehicles are driven more, on average, than older vehicles. From a modeling perspective, consumers who own older vehicles are likely to be a self-selected group of people with lower levels of usage, whereas consumers owning newer vehicles may be people with higher levels of usage. Because older-vehicle owners are not observed driving newer vehicles and vice versa, a selectivity bias problem arises (see Mannering and Winston, 1985, and Mannering, 1986a, b, for detailed work on discrete/continuous vehicle type and usage interactions).

To develop an economically consistent model structure, a vehicle utilization equation is specified as (alternatively, an indirect utility function could be specified first)

$y_{in} = \boldsymbol{\beta}\mathbf{X}_n + \boldsymbol{\theta}\mathbf{Z}_{in} + \gamma(Inc_n - r_n p_{in}) + \varepsilon_{in}$, (12.17)

where $y_{in}$ is the annual utilization (for example, kilometers per year) of vehicle i, $\mathbf{X}_n$ is a vector of consumer characteristics that determine vehicle utilization, $\mathbf{Z}_{in}$ is a vector of vehicle characteristics that determine vehicle utilization, $Inc_n$ is consumer n's annual income, $r_n$ is the expected annual vehicle utilization for consumer n, $p_{in}$ is the consumer's unit price of utilization (for example, dollars per kilometer driven), $\boldsymbol{\beta}$, $\boldsymbol{\theta}$, and $\gamma$ are estimable parameters, and $\varepsilon_{in}$ is a disturbance term. The expected utilization, $r_n$, is needed to capture the income effect of consumption ($Inc_n - r_n p_{in}$) but is determined exogenously by regressing observed utilization against exogenous variables and using the predicted values in Equation 12.17 (see Mannering, 1986b, for a complete discussion of this).

With Equation 12.17, we can apply Roy's identity as a partial differential equation (see Equation 12.16):

$\boldsymbol{\beta}\mathbf{X}_n + \boldsymbol{\theta}\mathbf{Z}_{in} + \gamma(Inc_n - r_n p_{in}) + \varepsilon_{in} = -\,\frac{\partial V_{in}/\partial p_{in}}{\partial V_{in}/\partial Inc_n}$ (12.18)

and solve for $V_{in}$, giving

$V_{in} = e^{-\gamma p_{in}}\left[\boldsymbol{\beta}\mathbf{X}_n + \boldsymbol{\theta}\mathbf{Z}_{in} + \gamma(Inc_n - r_n p_{in}) - r_n\right] + \xi_{in}$, (12.19)

where $\xi_{in}$ is a disturbance term added for discrete outcome model estimation. If the $\xi_{in}$ are assumed to be generalized extreme value distributed, a logit model of discrete outcomes results. In estimating Equation 12.17, corrections such as the instrumental variables or expected value methods need to be applied to the vector of vehicle characteristics $\mathbf{Z}_{in}$ to account for the endogeneity of vehicle characteristics. The major drawback of the economically consistent approach is that a nonlinear form of either the indirect utility function or the continuous equation results. For example, in the model above, the linear utilization equation produced a nonlinear indirect utility function. This nonlinearity complicates estimation of the discrete outcome model. In choosing between the reduced form and economically consistent approaches, many applications require a trade-off between ease of estimation and compromising theory.
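The link between Equations 12.17 and 12.19 can be checked symbolically. In the sketch below, which assumes the exponential form of Equation 12.19 given above, A stands in for the linear-in-parameters part $\boldsymbol{\beta}\mathbf{X}_n + \boldsymbol{\theta}\mathbf{Z}_{in}$; applying Roy's identity recovers the utilization equation exactly:

```python
# Symbolic check (sympy) that an indirect utility of the Equation 12.19 form
# reproduces the linear utilization equation through Roy's identity (12.16).
import sympy as sp

A, gamma, Inc, r, p = sp.symbols('A gamma Inc r p', positive=True)
V = sp.exp(-gamma * p) * (A + gamma * (Inc - r * p) - r)  # Equation 12.19 form
y = -sp.diff(V, p) / sp.diff(V, Inc)                      # Roy's identity
print(sp.simplify(y - (A + gamma * (Inc - r * p))))       # prints 0
```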

For applications of discrete/continuous modeling to a variety of transportation problems, the reader is referred to Damm and Lerman (1981), Mannering and Winston (1985), Train (1986), Hensher and Milthorpe (1987), Mannering and Hensher (1987), and Mannering et al. (1990).


References

Abbot, R. (1985). Logistic regression in survival analysis. American Journal of Epidemiology 121, 465–471.
Abraham, B. and T. Ledolter (1983). Statistical Methods for Forecasting. Wiley, New York.
Aczel, A.D. (1993). Complete Business Statistics. Erwin, Homewood, IL.
Aigner, D., K. Lovell, and P. Schmidt (1977). Formulation and estimation of stochastic frontier production models. Journal of Econometrics 6, 21–37.
Akaike, H. (1987). Factor analysis and AIC. Psychometrika 52, 317–332.
Amemiya, T. (1971). The estimation of the variances in a variance-components model. International Economic Review 12, 1–13.
Amemiya, T. (1985). Advanced Econometrics. Harvard University Press, Cambridge, MA.
Ansari, A.R. and R.A. Bradley (1960). Rank sum tests for dispersion. Annals of Mathematical Statistics 31, 1174–1189.
Anselin, L. (1988). Spatial Econometrics: Methods and Models. Kluwer Academic Publishers, Boston, MA.
Arbuckle, J.L. and W. Wothke (1995). AMOS Users' Guide Version 4.0. SmallWaters Corporation, Chicago, IL.
Arbuthnott, I. (1910). An argument for Divine Providence taken from the constant regularity observed in the births of both sexes. Philosophical Transactions 27, 186–190.
Archilla, A.R. and S.M. Madanat (2000). Development of a pavement rutting model from experimental data. ASCE Journal of Transportation Engineering 126, 291–299.
Arellano, M. (1993). On the testing of correlated effects with panel data. Journal of Econometrics 59, 87–97.
Arminger, G., C. Clogg, and M. Sobel (1995). Handbook of Statistical Modeling for the Social and Behavioral Sciences. Plenum Press, New York.
Ash, C. (1993). The Probability Tutoring Book. IEEE Press, New York.
Avery, R.B. (1977). Error components and seemingly unrelated regressions. Econometrica 45, 199–209.
Ayyub, B. (1998). Uncertainty Modeling and Analysis in Civil Engineering. CRC Press, Boca Raton, FL.
Baltagi, B.H. (1985). Econometric Analysis of Panel Data. John Wiley & Sons, New York.
Baltagi, B.H. (1998). Econometrics. Springer, New York.
Baltagi, B.H. and J.M. Griffin (1988). A generalized error component model with heteroscedastic disturbances. International Economic Review 29, 745–753.
Baltagi, B.H. and Q. Li (1991). A transformation that will circumvent the problem of autocorrelation in an error component model. Journal of Econometrics 48, 385–393.

Baltagi, B.H. and Q. Li (1992). A monotonic property for iterative GLS in the two-way random effects model. Journal of Econometrics 53, 45–51.
Bartels, R. (1982). The rank version of von Neumann's ratio test for randomness. Journal of the American Statistical Association 77, 40–46.
Beach, C.M. and J.G. MacKinnon (1978). A maximum likelihood procedure for regression with autocorrelated errors. Econometrica 46, 51–58.
Ben-Akiva, M. and S.R. Lerman (1985). Discrete Choice Analysis. MIT Press, Cambridge, MA.
Bentler, P. (1990). Comparative fit indexes in structural models. Psychological Bulletin 107, 238–246.
Bentler, P. and D. Bonett (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin 88, 588–606.
Bentler, P. and D. Weeks (1980). Linear structural equations with latent variables. Psychometrika 45, 289–307.
Berkovec, J. (1985). New car sales and used car stocks: a model of the automobile market. Rand Journal of Economics 16(2), 195–214.
Bhat, C.R. (1996a). A hazard-based duration model of shopping activity with nonparametric baseline specification and nonparametric control for unobserved heterogeneity. Transportation Research 30B, 189–207.
Bhat, C.R. (1996b). A generalized multiple durations proportional hazard model with an application to activity behavior during the evening work-to-home commute. Transportation Research 30B, 465–480.
Bhat, C.R. (2000). Accommodating variations in responsiveness to level-of-service measures in travel mode choice modeling. Transportation Research 32A, 495–507.
Birnbaum, Z.W. and R.A. Hall (1960). Small sample distributions for multi-sample statistics of the Smirnov type. Annals of Mathematical Statistics 22, 592–596.
Bishop, Ch.M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York.
Bollen, K. (1986). Sample size and Bentler and Bonett's nonnormed fit index. Psychometrika 51, 375–377.
Bollen, K.A. and J.S. Long, Eds. (1993). Testing Structural Equation Models. Sage, Newbury Park, CA.
Bollen, K. and R. Stine (1992). Bootstrapping goodness-of-fit measures in structural equation models. Sociological Methods and Research 21, 205–229.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics 31, 307–327.
Box, G.E.P. and G.M. Jenkins (1976). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
Bradley, J.V. (1968). Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ.
Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Breusch, T.S. (1987). Maximum likelihood estimation of random effects models. Journal of Econometrics 36, 383–389.
Breusch, T.S. and A.R. Pagan (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica 47, 1287–1294.
Brockwell, P.J. and R.A. Davis (1991). Time Series: Theory and Methods, 2nd ed. Springer-Verlag, New York.
Brockwell, P.J. and R.A. Davis (1996). Introduction to Time Series and Forecasting. Springer-Verlag, New York.

Brown, G.W. and A.M. Mood (1951). On median tests for linear hypotheses. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, Ed. University of California Press, Berkeley, 159–166.
Browne, M. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology 37, 62–83.
Browne, M. and R. Cudeck (1989). Single sample cross-validation indices for covariance structures. Multivariate Behavioral Research 24, 445–455.
Browne, M. and R. Cudeck (1993). Alternative ways of assessing model fit. In Testing Structural Equation Models, K. Bollen and J. Long, Eds. Sage, Newbury Park, CA, 136–162.
Byrne, B. (1989). A Primer of LISREL: Basic Applications and Programming for Confirmatory Factor Analytic Methods. Springer-Verlag, New York.
Cameron, A. and P. Trivedi (1990). Regression based tests for overdispersion in the Poisson model. Journal of Econometrics 46, 347–364.
Cameron, C. and F. Windmeijer (1993). R-Squared Measures for Count Data Regression Models with Applications to Health Care Utilization. Working Paper 93-24, Department of Economics, University of California, Davis.
Capon, J. (1961). Asymptotic efficiency of certain locally most powerful rank tests. Annals of Mathematical Statistics 32, 88–100.
Carlin, B. and T. Louis (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall/CRC, New York.
Carmines, E. and J. McIver (1981). Analyzing models with unobserved variables. In Social Measurement: Current Issues, G. Bohrnstedt and E. Borgatta, Eds. Sage, Beverly Hills, CA.
Carson, J. and F. Mannering (2001). The effects of ice warning signs on accident frequencies and severities. Accident Analysis and Prevention 33, 99–109.
Cattell, R. and C. Burdsal (1975). The radial parcel double factoring design: a solution to the item-vs.-parcel controversy. Multivariate Behavioral Research 10, 165–179.
Chamberlain, G. (1978). Omitted variable bias in panel data: estimating the returns to schooling. Annales de l'INSEE 30(1), 49–82.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic Studies 47, 225–238.
Chang, L.-Y. and F. Mannering (1999). Analysis of injury severity and vehicle occupancy in truck- and non-truck-involved accidents. Accident Analysis and Prevention 31(5), 579–592.
Chatfield, C. (1992). A commentary on error measures. International Journal of Forecasting 8, 100–102.
Cheng, C. (1992). Optimal Sampling for Traffic Volume Estimation. Ph.D. dissertation, Carlson School of Management, University of Minnesota, Minneapolis.
Chou, C., P. Bentler, and A. Satorra (1991). Scaled test statistics and robust standard errors for non-normal data in covariance structure analysis: a Monte Carlo study. British Journal of Mathematical and Statistical Psychology 44, 347–357.
Chung, K.L. (1979). Probability Theory with Stochastic Processes. Springer-Verlag, New York.
Clemens, M.P. and D.F. Hendry (1993). On the limitations of the comparison of mean square forecast errors. Journal of Forecasting 12, 615–637.
Cochran, W.G. (1950). The comparison of percentages in matched samples. Biometrika 37, 256–266.

Cochrane, D. and G. Orcutt (1949). Application of least squares regression to relationships containing autocorrelated error terms. Journal of the American Statistical Association 44, 32–61.
Cohen, A.C. (1959). Simplified estimators for the normal distribution when samples are singly censored or truncated. Technometrics 1, 217–237.
Cohen, A.C. (1961). Tables for maximum likelihood estimates: singly truncated and singly censored samples. Technometrics 3, 535–541.
Conover, W.J. (1980). Practical Nonparametric Statistics, 2nd ed. John Wiley & Sons, New York.
Cosslett, S. (1981). Efficient estimation of discrete-choice models. In Structural Analysis of Discrete Data with Econometric Applications, C. Manski and D. McFadden, Eds. MIT Press, Cambridge, MA.
Cox, D. and D. Oakes (1984). Analysis of Survival Data. Chapman & Hall/CRC, Boca Raton, FL.
Cox, D.R. (1972). Regression models and life tables. Journal of the Royal Statistical Society B34, 187–200.
Cox, D.R. and D.V. Hinkley (1974). Theoretical Statistics. Chapman & Hall, London.
Cox, D.R. and A. Stuart (1955). Some quick tests for trend in location and dispersion. Biometrika 42, 80–95.
Cryer, J.D. (1986). Time Series Analysis. Duxbury Press, Boston, MA.
Daganzo, C. (1979). Multinomial Probit: The Theory and Its Application to Demand Forecasting. Academic Press, New York.
Damm, D. and S. Lerman (1981). A theory of activity scheduling behavior. Environment and Planning 13A, 703–718.
Daniel, C. and F. Wood (1980). Fitting Equations to Data. Wiley, New York.
Daniel, W. (1978). Applied Nonparametric Statistics. Houghton Mifflin, Boston, MA.
Daniels, H.E. (1950). Rank correlation and population models. Journal of the Royal Statistical Society B 12, 171–181.
David, F.N. and D.E. Barton (1958). A test for birth-order effects. Annals of Human Eugenics 22, 250–257.
Dee, G.S. (1999). State alcohol policies, teen drinking, and traffic fatalities. Journal of Public Economics 72, 289–315.
De Gooijer, J.G. and K. Kumar (1992). Some recent developments in non-linear time series modeling, testing and forecasting. International Journal of Forecasting 8, 135–256.
De Groot, C. and D. Wurtz (1991). Analysis of univariate time series with connectionist nets: a case study of two classical examples. Neurocomputing 3, 177–192.
Deming, W. (1966). Some Theory of Sampling. Dover, New York.
Derrig, R.A., M. Segui-Gomez, A. Abtahi, and L.L. Liu (2002). The effect of population safety belt usage rates on motor vehicle-related fatalities. Accident Analysis and Prevention 34(1), 101–110.
de Veaux, R.D. (1990). Finding transformations for regression using the ACE algorithm. In Modern Methods of Data Analysis, J. Fox and J.S. Long, Eds. Sage, Newbury Park, CA, 177–208.
Devore, J. and N. Farnum (1999). Applied Statistics for Engineers and Scientists. Duxbury Press, New York.
Dhrymes, P.J. (1986). Limited dependent variables. In Handbook of Econometrics, Vol. 3, Z. Griliches and M. Intriligator, Eds. North Holland Publishing Co., New York, 1567–1631.

Diamond, P. and J. Hausman (1984). The retirement and unemployment behavior of older men. In Retirement and Economic Behavior, H. Aaron and G. Burtless, Eds. Brookings Institute, Washington, D.C.
Dougherty, M. (1995). A review of neural networks applied to transport. Transportation Research Part C 3(4), 247–260.
Dubin, J. and D. McFadden (1984). An econometric analysis of residential electric appliance holdings and consumption. Econometrica 52, 345–362.
Duran, B.J. (1976). A survey of nonparametric tests for scale. Communications in Statistics, Part A: Theory and Methods 14, 1287–1312.
Durbin, J. (1951). Incomplete blocks in ranking experiments. British Journal of Psychology (Statistical Section) 4, 85–90.
Durbin, J. (1960). Estimation of parameters in time-series regression models. Journal of the Royal Statistical Society Series B 22, 139–153.
Durbin, J. (1970). Testing for serial correlation in least-squares regression when some of the regressors are lagged dependent variables. Econometrica 38, 410–421.
Durbin, J. (2000). The Foreman lecture: the state space approach to time series analysis and its potential for official statistics. Australian and New Zealand Journal of Statistics 42(1), 1–23.
Durbin, J. and G. Watson (1951). Testing for serial correlation in least squares regression II. Biometrika 38, 159–178.
Edgington, E.S. (1961). Probability table for number of runs of signs of first differences in ordered series. Journal of the American Statistical Association 56, 156–159.
Efron, B. (1977). Efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association 72, 557–565.
Efron, B. and R. Tibshirani (1986). Bootstrap methods for standard errors, confidence intervals and other measures of statistical accuracy. Statistical Science 1, 54–74.
Emerson, J. and M. Stoto (1983). Transforming data. In Understanding Robust and Exploratory Data Analysis, D.C. Hoaglin, F. Mosteller, and J. Tukey, Eds. Wiley, New York, 97–127.
Engle, R.F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of UK inflation. Econometrica 50, 987–1008.
Engle, R.F. and C.W. Granger (1987). Cointegration and error correction: representation, estimation and testing. Econometrica 55, 251–276.
Eskeland, G.S. and T.N. Feyzioglu (1997). Is demand for polluting goods manageable? An econometric study of car ownership and use in Mexico. Journal of Development Economics 53, 423–445.
Ettema, D.F. and H. Timmermans (1997). Activity-Based Approaches to Travel Analysis. Pergamon/Elsevier, Amsterdam.
Eubank, R.L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York.
FHWA (1997). Status of the Nation's Surface Transportation System: Condition and Performance. Federal Highway Administration, Washington, D.C.
Fildes, R. (1984). Bayesian forecasting. In Forecasting Accuracy of Major Time Series, S. Makridakis et al., Eds. Wiley, Chichester, U.K.
Finch, J., P. Curran, and S. West (1994). The Effects of Model and Data Characteristics on the Accuracy of Parameter Estimates and Standard Errors in Confirmatory Factor Analysis. Unpublished manuscript.
Fleming, T. and D. Harrington (1991). Counting Processes and Survival Analysis. John Wiley & Sons, New York.

Fomby, T.B., R.C. Hill, and S.R. Johnson (1984). Advanced Econometric Methods. Springer-Verlag, New York.
Fomunung, I., S. Washington, R. Guensler, and W. Bachman (2000). Validation of the MEASURE automobile emissions model: a statistical analysis. Journal of Transportation Statistics 3(2), 65–84.
Freund, J.E. and A.R. Ansari (1957). Two-Way Rank Sum Tests for Variance. Technical Report to Office of Ordnance Research and National Science Foundation 34. Virginia Polytechnic Institute, Blacksburg, VA.
Freund, R. and W. Wilson (1997). Statistical Methods. Academic Press, New York.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32, 675–701.
Garber, N.G. (1991). Impacts of Differential Speed Limits on Highway Speeds and Accidents. AAA Foundation for Traffic Safety, Washington, D.C.
Gastwirth, J.L. (1965). Percentile modifications of two-sample rank tests. Journal of the American Statistical Association 60, 1127–1141.
Geary, R.C. (1966). A note on residual covariance and estimation efficiency in regression. American Statistician 20, 30–31.
Gelman, A., J. Carlin, H. Stern, and D.B. Rubin (1995). Bayesian Data Analysis. Chapman & Hall, New York.
Gibbons, J.D. (1985a). Nonparametric Methods for Quantitative Analysis, 2nd ed. American Sciences Press, Syracuse, NY.
Gibbons, J.D. (1985b). Nonparametric Statistical Inference, 2nd ed. Marcel Dekker, New York.
Gibbons, J.D. and J.L. Gastwirth (1970). Properties of the percentile modified rank tests. Annals of the Institute of Statistical Mathematics, Suppl. 6, 95–114.
Gilbert, C. (1992). A duration model of automobile ownership. Transportation Research 26B, 97–114.
Glasser, G.J. and R.F. Winter (1961). Critical values of the rank correlation coefficient for testing the hypothesis of independence. Biometrika 48, 444–448.
Glenberg, A. (1996). Learning from Data: An Introduction to Statistical Reasoning, 2nd ed. Lawrence Erlbaum Associates, Mahwah, NJ.
Godfrey, L.G. (1988). Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches. Cambridge University Press, New York.
Goldfeld, S.M. and R.E. Quandt (1965). Some tests for homoscedasticity. Journal of the American Statistical Association 60, 539–547.
Goldfeld, S.M. and R.E. Quandt (1972). Nonlinear Methods in Econometrics. North-Holland, Amsterdam.
Gourieroux, C., A. Monfort, and A. Trognon (1984). Pseudo maximum likelihood methods: theory. Econometrica 52, 681–700.
Granger, C.W.J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37, 424–438.
Granger, C.W. and P. Engle (1984). Dynamic model specification with equilibrium constraints: co-integration and error correction. Draft manuscript.
Green, D. and P. Hu (1984). The influence of the price of gasoline on vehicle use in multivehicle households. Transportation Research Record 988, 19–23.
Green, M. and M. Symons (1983). A comparison of the logistic risk function and the proportional hazards model in prospective epidemiologic studies. Journal of Chronic Diseases 36, 715–724.
Greene, W. (1990a). Econometric Analysis. Macmillan, New York.

Greene, W. (1990b). A gamma distributed stochastic frontier model. Journal of Econometrics 46, 141–163.
Greene, W. (1992). LIMDEP Version 6.0. Econometric Software, Inc., Bellport, NY.
Greene, W. (1995a). LIMDEP Version 7.0. Econometric Software, Inc., Plainview, NY.
Greene, W. (1995b). Count Data. Manuscript, Department of Economics, Stern School of Business, New York University, New York.
Greene, W.H. (2000). Econometric Analysis, 4th ed. Prentice-Hall, Upper Saddle River, NJ.
Griffiths, W.E., R.C. Hill, and G.G. Judge (1993). Learning and Practicing Econometrics. John Wiley & Sons, New York.
Gujarati, D. (1992). Essentials of Econometrics. McGraw-Hill, New York.
Gullikson, H. and J. Tukey (1958). Reliability for the law of comparative judgment. Psychometrika 23, 95–110.
Gumbel, E. (1958). Statistics of Extremes. Columbia University Press, New York.
Gurmu, S. and P. Trivedi (1994). Recent Developments in Models of Event Counts: A Survey. Manuscript, Department of Economics, University of Virginia, Charlottesville.
Haldar, A. and S. Mahadevan (2000). Probability, Reliability, and Statistical Methods in Engineering Design. John Wiley & Sons, New York.
Hamed, M. and F. Mannering (1990). A note on commuters' activity duration and the efficiency of proportional hazards models. Unpublished manuscript, University of Washington, Seattle.
Hamed, M. and F. Mannering (1993). Modeling travelers' postwork activity involvement: toward a new methodology. Transportation Science 17, 381–394.
Han, A. and J. Hausman (1990). Flexible parametric estimation of duration and competing risk models. Journal of Applied Econometrics 5, 1–28.
Hartley, H.O. and J.N.K. Rao (1967). Maximum likelihood estimation for the mixed analysis of variance model. Biometrika 54, 93–108.
Harvey, A.C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica 44, 461–465.
Harwood, D.W., F.M. Council, E. Hauer, W.E. Hughes, and A. Vogt (2000). Prediction of the Expected Safety Performance of Rural Two-Lane Highways. Federal Highway Administration Report FHWA-RD-99-207, Washington, D.C.
Hauer, E. (1997). Observational Before-After Studies in Road Safety. Pergamon, Tarrytown, NY.
Hauer, E. and J. Bamfo (1997). Two tools for finding what function links the dependent variable to the explanatory variables. In Proceedings of the ICTCT 1997 Conference, Lund, Sweden.
Hausman, J., B. Hall, and Z. Griliches (1984). Economic models for count data with an application to the patents–R&D relationship. Econometrica 52, 909–938.
Hausman, J.A. (1978). Specification tests in econometrics. Econometrica 46, 1251–1271.
Hausman, J.A. and W.E. Taylor (1981). Panel data and unobservable individual effects. Econometrica 49, 1377–1398.
Hay, J. (1980). Occupational Choice and Occupational Earnings. Ph.D. dissertation, Yale University, New Haven, CT.
Heckman, J. (1976). The common structure of statistical models for truncation, sample selection, and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5, 475–492.
Heckman, J. (1978). Dummy endogenous variables in a simultaneous equation system. Econometrica 46, 931–960.

Heckman, J. (1979). Sample selection bias as a specification error. Econometrica 47, 153–162.
Heckman, J. (1981). Statistical models for discrete panel data. In Structural Analysis of Discrete Data with Econometric Applications, C. Manski and D. McFadden, Eds. MIT Press, Cambridge, MA.
Heckman, J. and G. Borjas (1980). Does unemployment cause future unemployment? Definitions, questions and answers from a continuous time model of heterogeneity and state dependence. Econometrica 47, 247–283.
Heckman, J. and T.E. McCurdy (1980). A life-cycle model of female labor supply. Review of Economic Studies 47, 47–74.
Heckman, J. and B. Singer (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 52, 271–320.
Hensher, D.A. (2001). The sensitivity of the valuation of travel time savings to the specification of unobserved effects. Transportation Research Part E: Logistics and Transportation Review 37E, 129–142.
Hensher, D. and F. Mannering (1994). Hazard-based duration models and their application to transport analysis. Transport Reviews 14, 63–82.
Hensher, D. and F. Milthorpe (1987). Selectivity correction in discrete-continuous choice analysis: with empirical evidence for vehicle choice and use. Regional Science and Urban Economics 17, 123–150.
Hildreth, C. and J. Lu (1960). Demand relations with autocorrelated disturbances. Technical Bulletin 276, Michigan State University, Agriculture Experiment Station.
Hochberg, Y. and A.C. Tamhane (1987). Multiple Comparison Procedures. Wiley, New York.
Hoeffding, W. (1951). Optimum nonparametric tests. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, Ed. University of California Press, Berkeley, CA, 83–92.
Hogg, R.V. and A.T. Craig (1992). Introduction to Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ.
Hollander, M. (1970). A distribution-free test for parallelism. Journal of the American Statistical Association 65, 387–394.
Hollander, M. and D.A. Wolfe (1973). Nonparametric Statistical Methods. John Wiley & Sons, New York.
Honore, B.E. (1992). Trimmed LAD and least squares estimation of truncated and censored regression models with fixed effects. Econometrica 60, 533–565.
Hornik, K., M. Stinchcombe, and H. White (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366.
Hoyle, R.H., Ed. (1995). Structural Equation Modeling: Concepts, Issues, and Applications. Sage, Thousand Oaks, CA.
Hsiao, C. (1975). Some estimation methods for a random coefficients model. Econometrica 43, 305–325.
Hsiao, C. (1986). Analysis of Panel Data. Cambridge University Press, Cambridge, U.K.
Hu, L., P. Bentler, and Y. Kano (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin 112, 351–362.
Hui, W. (1990). Proportional hazard Weibull mixtures. Working paper, Department of Economics, Australian National University, Canberra.

Iman, R.L., D. Quade, and D.A. Alexander (1975). Exact probability levels for the Kruskal-Wallis test. In Selected Tables in Mathematical Statistics, Vol. 3. Institute of Mathematical Statistics, American Mathematical Society, Providence, RI, 329–384.
Ingram, D. and J. Kleinman (1989). Empirical comparisons of proportional hazards and logistic regression models. Statistics in Medicine 8, 525–538.
Johansson, P. (1996). Speed limitation and motorway casualties: a time series count data regression approach. Accident Analysis and Prevention 28, 73–87.
Johnson, N. and S. Kotz (1970). Distributions in Statistics: Continuous Univariate Distributions, Vol. 1. John Wiley & Sons, New York.
Johnson, R.A. (1994). Miller & Freund's Probability and Statistics for Engineers, 5th ed. Prentice-Hall, Englewood Cliffs, NJ.
Johnson, R. and D. Wichern (1992). Multivariate Statistical Analysis, 3rd ed. Prentice-Hall, Englewood Cliffs, NJ.
Joreskog, K. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 34, 183–202.
Jorgenson, D.L. (1997). The Development of a Speed Monitoring Program for Indiana. Master's thesis, School of Civil Engineering, Purdue University, West Lafayette, IN.
Kaarsemaker, L. and A. van Wijngaarden (1953). Tables for use in rank correlation. Statistica Neerlandica 7, 41–54.
Kalbfleisch, J. and R. Prentice (1980). The Statistical Analysis of Failure Time Data. John Wiley & Sons, New York.
Kamat, A.R. (1956). A two-sample distribution-free test. Biometrika 43, 377–387.
Kang, S. (1985). A note on the equivalence of specification tests in the two-factor multivariate variance components model. Journal of Econometrics 28, 193–203.
Kaplan, E. and P. Meier (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457–481.
Karlaftis, M., J. Golias, and S. Papadimitriou (2002). Transit quality as an integrated traffic management strategy: measuring perceived service quality. Journal of Public Transportation 4(1), 13–26.
Karlaftis, M.G. and P.S. McCarthy (1998). Operating subsidies and performance: an empirical study. Transportation Research 32A, 359–375.
Karlaftis, M.G. and K.C. Sinha (1997). Modeling approach for rolling stock deterioration prediction. ASCE Journal of Transportation Engineering 123(3), 227–228.
Karlaftis, M.G. and A.P. Tarko (1998). Heterogeneity considerations in accident modeling. Accident Analysis and Prevention 30, 425–433.
Karlaftis, M.G., P.S. McCarthy, and K.C. Sinha (1999). System size and cost structure of transit industry. ASCE Journal of Transportation Engineering 125(3), 208–215.
Katz, L. (1986). Layoffs, recall and the duration of unemployment. Working Paper 1825, National Bureau of Economic Research, Cambridge, MA.
Keane, M. (1994). A computationally practical simulation estimator for panel data. Econometrica 62, 95–116.
Keller, G. and B. Warrack (1997). Statistics for Management and Economics, 4th ed. Duxbury Press, Belmont, CA.
Kendall, M.G. (1962). Rank Correlation Methods, 3rd ed. Hafner, New York.
Kennedy, P. (1998). A Guide to Econometrics. MIT Press, Cambridge, MA.
Kharoufeh, J.P. and K. Goulias (2002). Non-parametric identification of daily activity durations using kernel density estimators. Transportation Research 36B, 59–82.

Kiefer, N. (1988). Economic duration data and hazard functions. Journal of Economic Literature 26, 646–679.
Kim, P.J. and R.J. Jennrich (1973). Tables of the exact sampling distribution of the two-sample Kolmogorov-Smirnov criterion Dmn (m ≤ n). In Selected Tables in Mathematical Statistics, Vol. 1. Institute of Mathematical Statistics, American Mathematical Society, Providence, RI, 79–170.
Kim, S.G. and F. Mannering (1997). Panel data and activity duration models: econometric alternatives and applications. In Panels for Transportation Planning: Methods and Applications, T. Golob, R. Kitamura, and L. Long, Eds. Kluwer, Norwell, MA, 349–373.
King, G., R. Keohane, and S. Verba (1994). Designing Social Inquiry. Princeton University Press, Princeton, NJ.
Kitagawa, G. and W. Gersch (1984). A smoothness priors modeling of time series with trend and seasonality. Journal of the American Statistical Association 79, 378–389.
Kitagawa, G. and W. Gersch (1996). Smoothness Priors Analysis of Time Series. Springer-Verlag, New York.
Kline, R. (1998). Principles and Practice of Structural Equation Modeling. Guilford Press, New York.
Klotz, J.H. (1962). Nonparametric tests for scale. Annals of Mathematical Statistics 33, 495–512.
Kmenta, J. (1971). Elements of Econometrics. Macmillan Press, New York.
Koppelman, F. (1975). Travel Prediction with Models of Individualistic Choice Behavior. Ph.D. dissertation, Department of Civil Engineering, MIT, Cambridge, MA.
Kottegoda, N. and R. Rosso (1997). Probability, Statistics, and Reliability for Civil and Environmental Engineers. McGraw-Hill, New York.
Kruskal, W.H. and W.A. Wallis (1952). Use of ranks in the one-criterion analysis of variance. Journal of the American Statistical Association 47, 583–621; errata (1953), 48, 905–911.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14.
Lancaster, T. (1968). Grouping estimators on heteroscedastic data. Journal of the American Statistical Association 63, 182–191.
Larson, R. and M. Marx (1986). An Introduction to Mathematical Statistics and Its Applications, 2nd ed. Prentice-Hall, Englewood Cliffs, NJ.
Laubscher, N.F. and R.E. Odeh (1976). A confidence interval for the scale parameter based on Sukhatme's two-sample statistic. Communications in Statistics - Theory and Methods 14, 1393–1407.
Laubscher, N.F., R.E. Odeh, F.E. Steffens, and E.M. deLange (1968). Exact critical values for Mood's distribution-free test statistic for dispersion and its normal approximation. Technometrics 10, 497–508.
Lee, E. (1992). Statistical Methods for Survival Data Analysis, 2nd ed. John Wiley & Sons, New York.
Lee, J. and F. Mannering (2002). Impact of roadside features on the frequency and severity of run-off-roadway accidents: an empirical analysis. Accident Analysis and Prevention 34, 149–161.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, CA.

Lempers, F.B. and T. Kloek (1973). On a simple transformation for second-order autocorrelated disturbances in regression analysis. Statistica Neerlandica 27, 69–75.
Levy, P. and S. Lemeshow (1999). Sampling of Populations: Methods and Applications, 3rd ed. John Wiley & Sons, New York.
Lillard, L.A. and R.J. Willis (1978). Dynamic aspects of earning mobility. Econometrica 46, 985–1012.
Ljung, G.M. and G.E.P. Box (1978). On a measure of lack of fit in time series models. Biometrika 65, 297–303.
Loizides, J. and E.G. Tsionas (2002). Productivity growth in European railways: a new approach. Transportation Research 36, 633–644.
MacCallum, R. (1990). The need for alternative measures of fit in covariance structure modeling. Multivariate Behavioral Research 25, 157–162.
Madanat, S.M., M.G. Karlaftis, and P.S. McCarthy (1997). Probabilistic infrastructure deterioration models with panel data. ASCE Journal of Infrastructure Systems 3(1), 4–9.
Maddala, G.S. (1983). Limited Dependent and Qualitative Variables in Econometrics. Cambridge University Press, Cambridge, U.K.
Maddala, G.S. (1988). Introduction to Econometrics. Macmillan, New York.
Maghsoodloo, S. (1975). Estimates of the quantiles of Kendall's partial rank correlation coefficient and additional quantile estimates. Journal of Statistical Computing and Simulation 4, 155–164.
Maghsoodloo, S. and L.L. Pallos (1981). Asymptotic behavior of Kendall's partial rank correlation coefficient and additional quantile estimates. Journal of Statistical Computing and Simulation 13, 41–48.
Magnus, J.R. (1982). Multivariate error components analysis of linear and nonlinear regression models by maximum likelihood. Journal of Econometrics 19, 239–285.
Makridakis, S., S.C. Wheelwright, and V.E. McGee (1989). Forecasting: Methods and Applications, 3rd ed. Wiley, New York.
Manly, B. (1986). Multivariate Statistical Methods: A Primer. Chapman & Hall, New York.
Mann, H.B. (1945). Nonparametric tests against trend. Econometrica 13, 245–259.
Mann, H.B. and D.R. Whitney (1947). On a test whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18, 50–60.
Mannering, F. (1983). An econometric analysis of vehicle use in multivehicle households. Transportation Research 17A, 183–189.
Mannering, F. (1993). Male/female driver characteristics and accident risk: some new evidence. Accident Analysis and Prevention 25, 77–84.
Mannering, F. and M. Hamed (1990a). Occurrence, frequency, and duration of commuters' work-to-home departure delay. Transportation Research 24B, 99–109.

Mannering, F. and M. Hamed (1990b). Commuter welfare approach to high occupancy vehicle lane evaluation: an exploratory analysis. Transportation Research 24A, 371–379.
Mannering, F. and I. Harrington (1981). Use of density function and Monte Carlo simulation techniques in the evaluation of policy impacts on travel demand. Transportation Research Record 801, 8–15.
Mannering, F. and D. Hensher (1987). Discrete/continuous econometric models and their application to transport analysis. Transport Reviews 7(3), 227–244.
Mannering, F. and C. Winston (1985). Dynamic empirical analysis of household vehicle ownership and utilization. Rand Journal of Economics 16(2), 215–236.
Mannering, F. and C. Winston (1987). U.S. automobile market demand. In Blind Intersection: Policy and the Automobile Industry. Brookings Institution, Washington, D.C., 36–60.
Mannering, F., S. Abu-Eisheh, and A. Arnadottir (1990). Dynamic traffic equilibrium with discrete/continuous econometric models. Transportation Science 24, 105–116.
Manski, C. and S. Lerman (1977). The estimation of choice probabilities from choice-based samples. Econometrica 45, 1977–1988.
Manski, C. and D. McFadden (1981). Alternative estimators and sample designs for discrete choice analysis. In Structural Analysis of Discrete Data with Econometric Applications, C. Manski and D. McFadden, Eds. MIT Press, Cambridge, MA.
Marsh, H. and D. Hocevar (1985). Application of confirmatory factor analysis to the study of self-concept: first- and higher-order factor models and their invariance across groups. Psychological Bulletin 97, 562–582.
Mathsoft (1999). S-Plus 2000 Statistical Software. Mathsoft, Inc., Bellevue, WA.
McCarthy, P.S. (2001). Transportation Economics Theory and Practice: A Case Study Approach. Blackwell, Boston, MA.
McFadden, D. (1978). Modeling the choice of residential location. In Spatial Interaction Theory and Residential Location, A. Karlqvist, L. Lundqvist, F. Snickars, and J. Weibull, Eds. North Holland, Amsterdam.
McFadden, D. (1981). Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, C. Manski and D. McFadden, Eds. MIT Press, Cambridge, MA.
McFadden, D. (1989). A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica 57, 995–1026.
McFadden, D. and K. Train (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics 15, 447–470.
McKelvey, R. and W. Zavoina (1975). A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology 4, 103–120.
McNemar, Q. (1962). Psychological Statistics. Wiley, New York.
Mehta, C. and N. Patel (1983). A network algorithm for performing Fisher's exact test in r × c contingency tables. Journal of the American Statistical Association 78, 427–434.
Meyer, B. (1990). Unemployment insurance and unemployment spells. Econometrica 58, 757–782.

Miaou, S.P. (1994). The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis and Prevention 26, 471–482.
Miaou, S.P. and H. Lum (1993). Modeling vehicle accidents and highway geometric design relationships. Accident Analysis and Prevention 25, 689–709.
Miller, R.G. (1981). Simultaneous Statistical Inference, 2nd ed. Springer, New York.
Milton, J. and F. Mannering (1998). The relationship among highway geometrics, traffic-related elements and motor vehicle accident frequencies. Transportation 25, 395–413.
Molenaar, K., S. Washington, and J. Diekmann (2000). A structural equation model of contract dispute potential. ASCE Journal of Construction Engineering and Management 124(4), 268–277.
Montgomery, D. (1991). Design and Analysis of Experiments, 3rd ed. John Wiley & Sons, New York.
Mood, A.M. (1954). On the asymptotic efficiency of certain nonparametric two-sample tests. Annals of Mathematical Statistics 25, 514–522.
Moran, P.A.P. (1951). Partial and multiple rank correlation. Biometrika 38, 26–32.
Moses, L.E. (1963). Rank tests for dispersion. Annals of Mathematical Statistics 34, 973–983.
Moulton, B.R. (1986). Random group effects and the precision of regression estimates. Journal of Econometrics 32, 385–397.
Mulaik, S., L. James, J. Van Alstine, N. Bennett, S. Lind, and C. Stilwell (1989). Evaluation of goodness of fit indices for structural equation models. Psychological Bulletin 105, 430–445.
Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics 33, 341–365.
Mundlak, Y. (1978). On the pooling of time series and cross-section data. Econometrica 46, 69–85.
Muthen, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika 49, 115–132.
Muthen, B. (1987). Response to Freedman's critique of path analysis: improve credibility by better methodological training. Journal of Educational Statistics 12, 178–184.
Myers, R.H. (1990). Classical and Modern Regression with Applications, 2nd ed. Duxbury Press, Belmont, CA.
Nam, D. and F. Mannering (2000). Hazard-based analysis of highway incident duration. Transportation Research 34A, 85–102.
National Technical University of Athens, Department of Transportation Planning and Engineering (1996). http://www/transport.ntua.gr/map.
Nerlove, M. (1971a). Further evidence on the estimation of dynamic economic relations from a time-series of cross-sections. Econometrica 39, 359–382.
Nerlove, M. (1971b). A note on error components models. Econometrica 39, 383–396.
Neter, J., M. Kutner, C. Nachtsheim, and W. Wasserman (1996). Applied Linear Statistical Models, 4th ed. Irwin, Boston, MA.
Nicholls, D.F. and B.G. Quinn (1982). Random coefficient autoregressive models: an introduction. Lecture Notes in Statistics, Vol. 11. Springer, New York.
Nicholson, W. (1978). Microeconomic Theory. Dryden Press, Hinsdale, IL.
Niemeier, D. (1997). Accessibility: an evaluation using consumer welfare. Transportation 24(4), 377–396.

Noether, G.E. (1972). Distribution-free confidence intervals. American Statistician 26, 39–41.
Noland, R. and M.G. Karlaftis (2002). Seat-belt usage and motor vehicle fatalities: are the results methodology-sensitive? Working paper, Imperial College, London, U.K.
Oakes, D. (1977). The asymptotic information in censored survival data. Biometrika 64, 441–448.
Okutani, I. and Y.J. Stephanedes (1984). Dynamic prediction of traffic volume through Kalman filtering theory. Transportation Research 18B, 1–11.
Olmstead, P.S. (1946). Distribution of sample arrangements for runs up and down. Annals of Mathematical Statistics 17, 24–33.
Park, R.E. (1966). Estimation with heteroscedastic error terms. Econometrica 34, 888.
Pedhazur, E. and L. Pedhazur Schmelkin (1991). Measurement, Design, and Analysis: An Integrated Approach. Lawrence Erlbaum Associates, Hillsdale, NJ.
Peeta, S. and I. Anastassopoulos (2002). Automatic real-time detection and correction of erroneous detector data with Fourier transforms for on-line traffic control architectures. Transportation Research Record 1811, 1–16.
Peterson, A. (1976). Bounds for a joint distribution function with fixed sub-distribution functions: applications to competing risks. Proceedings of the National Academy of Sciences U.S.A. 73, 11–13.
Phlips, L. (1974). Applied Consumption Analysis. North-Holland, Amsterdam.
Pindyck, R. and D. Rubinfeld (1990). Econometric Models and Economic Forecasts. McGraw-Hill, New York.
Pindyck, R.S. and D.L. Rubinfeld (1991). Econometric Models and Economic Forecasts, 3rd ed. McGraw-Hill, New York.
Poch, M. and F. Mannering (1996). Negative binomial analysis of intersection accident frequencies. Journal of Transportation Engineering 122, 391–401.
Pratt, J.W. and J.D. Gibbons (1981). Concepts of Nonparametric Theory. Springer, New York.
Rao, C.R. (1972). Estimation of variance and covariance components in linear models. Journal of the American Statistical Association 67, 112–115.
Rao, C.R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. John Wiley & Sons, New York.
Rhyne, A.L. and R.G.D. Steel (1965). Tables for a treatment versus control multiple comparisons sign test. Technometrics 7, 293–306.
Rigdon, E., R. Schumacker, and W. Wothke (1998). A comparative review of interaction and nonlinear modeling. In Interaction and Nonlinear Effects in Structural Equation Modeling, R. Schumacker and G. Marcoulides, Eds. Lawrence Erlbaum Associates, Mahwah, NJ.
Rosenbaum, S. (1953). Tables for a nonparametric test of dispersion. Annals of Mathematical Statistics 24, 663–668.
Rosenkrantz, W. (1997). Introduction to Probability and Statistics for Scientists and Engineers. McGraw-Hill, New York.
Ross, S.M. (1988). A First Course in Probability, 3rd ed. Macmillan, New York.
Satorra, A. (1990). Robustness issues in structural equation modeling: a review of recent developments. Quality and Quantity 24, 367–386.
Schumacker, R.E. and R.G. Lomax (1996). A Beginner's Guide to Structural Equation Modeling. Lawrence Erlbaum Associates, Mahwah, NJ.
Schumacker, R.E. and G.A. Marcoulides, Eds. (1998). Interaction and Non-Linear Effects in Structural Equation Modeling. Lawrence Erlbaum Associates, Mahwah, NJ.


Sedlmeier, P. (1999). Improving Statistical Reasoning: Theoretical Models and Practical Implications. Lawrence Erlbaum Associates, Mahwah, NJ.
Shankar, V. and F. Mannering (1996). An exploratory multinomial logit analysis of single-vehicle motorcycle accident severity. Journal of Safety Research 27(3), 183–194.
Shankar, V. and F. Mannering (1998). Modeling the endogeneity of lane-mean speeds and lane-speed deviations: a structural equations approach. Transportation Research 32A, 311–322.
Shankar, V., F. Mannering, and W. Barfield (1995). Effect of roadway geometrics and environmental factors on rural accident frequencies. Accident Analysis and Prevention 27, 371–389.
Shankar, V., J. Milton, and F. Mannering (1997). Modeling accident frequencies as zero-altered probability processes: an empirical inquiry. Accident Analysis and Prevention 29, 829–837.
Shankar, V., R. Albin, J. Milton, and F. Mannering (1998). Evaluating median crossover likelihoods with clustered accident counts: an empirical inquiry using the random effects negative binomial model. Transportation Research Record 1635, 44–48.
Shumway, R.H. (1988). Applied Statistical Time Series Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Shumway, R.H. and S.A. Stoffer (2000). Time Series Analysis and Its Applications. Springer-Verlag, New York.
Siegel, S. (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd ed. McGraw-Hill, New York.
Siegel, S. and J.W. Tukey (1960). A nonparametric sum of ranks procedure for relative spread in unpaired samples. Journal of the American Statistical Association 55, 429–445; correction, 56, 1005, 1961.
Small, K. and C. Hsiao (1985). Multinomial logit specification tests. International Economic Review 26, 619–627.
Small, K. and H. Rosen (1981). Applied welfare economics with discrete choice models. Econometrica 49, 105–130.
Smith, B.L. and M.J. Demetsky (1994). Short-term traffic flow prediction: neural network approach. Transportation Research Record 1453, 98–104.
Smith, B.L. and M.J. Demetsky (1997). Traffic flow forecasting: comparison of modeling approaches. ASCE Journal of Transportation Engineering 123, 261–266.
Smith, B.L., B.M. Williams, and K.R. Oswald (2002). Comparison of parametric and nonparametric models for traffic flow forecasting. Transportation Research Part C: Emerging Technologies 10(4), 303–321.
Smith, L. (1998). Linear Algebra, 3rd ed. Springer, New York.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology 15, 73–101.
Spiegel, M.R. and L.J. Stephens (1998). Schaum's Outlines: Statistics, 3rd ed. McGraw-Hill International, New York.
Stathopoulos, A. and M.G. Karlaftis (2001a). Spectral and cross-spectral analysis of urban traffic flows. In Proceedings of the 4th IEEE Conference on Intelligent Transportation Systems, August 25–29, Oakland, CA.
Stathopoulos, A. and M.G. Karlaftis (2001b). Temporal and spatial variations of real-time traffic data in urban areas. Transportation Research Record 1768, 135–140.
Stathopoulos, A. and M.G. Karlaftis. A multivariate state-space approach for urban traffic flow modeling and prediction. Transportation Research Part C: Emerging Technologies, in press.


Steel, R.G.D. (1959a). A multiple comparison sign test: treatments versus control. Journal of the American Statistical Association 54, 767–775.
Steel, R.G.D. (1959b). A multiple comparison rank sum test: treatments versus control. Biometrics 15, 560–572.
Steel, R.G.D. (1960). A rank sum test for comparing all pairs of treatments. Technometrics 2, 197–207.
Steel, R.G.D. (1961). Some rank sum multiple comparisons tests. Biometrics 17, 539–552.
Steel, R., J. Torrie, and D. Dickey (1997). Principles and Procedures of Statistics: A Biometrical Approach, 3rd ed. McGraw-Hill, New York.
Steiger, J. (1990). Structural model evaluation and modification: an interval estimation approach. Multivariate Behavioral Research 25, 173–180.
Steiger, J., A. Shapiro, and M. Browne (1985). On the multivariate asymptotic distribution of sequential chi-square statistics. Psychometrika 50, 253–263.
Sukhatme, B.V. (1957). On certain two-sample nonparametric tests for variances. Annals of Mathematical Statistics 28, 188–194.
Swamy, P.A.V.B. (1971). Statistical Inference in Random Coefficient Regression Models. Springer-Verlag, New York.
Swamy, P.A.V.B. and S.S. Arora (1972). The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica 40, 261–275.
Swed, F.S. and C. Eisenhart (1943). Tables for testing randomness of grouping in a sequence of alternatives. Annals of Mathematical Statistics 14, 66–87.
Taylor, W.E. (1980). Small sample considerations in estimation from panel data. Journal of Econometrics 13, 203–223.
Terry, M.E. (1952). Some rank order tests which are most powerful against specific parametric alternatives. Annals of Mathematical Statistics 23, 346–366.
Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis. Indagationes Mathematicae 12, 85–91.
Theil, H. (1978). Introduction to Econometrics. Prentice-Hall, Englewood Cliffs, NJ.
Tjøstheim, D. (1986). Some doubly stochastic time series models. Journal of Time Series Analysis 7, 51–72.
Tjøstheim, D. (1994). Non-linear time series: a selective review. Scandinavian Journal of Statistics 21, 97–130.
Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica 26, 24–36.
Tong, H. (1990). Non-Linear Time Series: A Dynamical System Approach. Oxford University Press, Oxford, U.K.
Train, K. (1986). Qualitative Choice Analysis: Theory, Econometrics and an Application to Automobile Demand. MIT Press, Cambridge, MA.
Tukey, J.W. (1959). A quick, compact, two-sample test to Duckworth's specifications. Technometrics 1, 31–48.
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
Tversky, A. (1972). Elimination by aspects: a theory of choice. Psychological Review 79, 281–299.
U.S. DOT (1977). Procedural Guide for Speed Monitoring. Federal Highway Administration, Office of Highway Planning, Washington, D.C.
U.S. DOT (1978). Interim Speed Monitoring Procedures. Federal Highway Administration, Office of Highway Planning, Washington, D.C.


U.S. DOT (1980). Speed Monitoring Program Procedural Manual. Federal Highway Administration, Office of Highway Planning, Washington, D.C.
U.S. DOT (1992). Speed Monitoring Program Procedural Manual. Federal Highway Administration, Office of Highway Planning, Washington, D.C.
U.S. DOT (1997a). Transportation in the United States: A Review. Bureau of Transportation Statistics, Washington, D.C.
U.S. DOT (1997b). National Transportation Statistics. Bureau of Transportation Statistics, Washington, D.C.
Van der Waerden, B.L. (1952). Order tests for the two-sample problem and their power, I, II, III. Proceedings, Koninklijke Nederlandse Akademie van Wetenschappen, Series A 55 (Indagationes Mathematicae 14), 453–459; Indagationes Mathematicae 15, 303–310, 311–316, 1953; correction, 15, 80, 1953.
Van der Waerden, B.L. and E. Nievergelt (1956). Tafeln zum Vergleich zweier Stichproben mittels X-test und Zeichentest. Springer, Berlin.
Vardeman, S. (1994). Statistics for Engineering Problem Solving. PWS, Boston, MA.
Vardeman, S. and J. Jobe (1994). Basic Engineering Data Collection and Analysis. Duxbury/Thomas Learning, New York.
Vogt, A. (1999). Accident Models for Rural Intersections: Four-Lane by Two-Lane Stop-Controlled and Two-Lane by Two-Lane Signalized. Federal Highway Administration Report FHWA-RD-99-128, Washington, D.C.
Vogt, A. and J. Bared (1998). Accident Prediction Models for Two-Lane Rural Roads: Segments and Intersections. Federal Highway Administration Report FHWA-RD-98-133, Washington, D.C.
Vuong, Q. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–334.
Wald, A. and J. Wolfowitz (1940). On a test whether two samples are from the same population. Annals of Mathematical Statistics 11, 147–162.
Wallace, T.D. and A. Hussain (1969). The use of error components models in combining cross-section and time-series data. Econometrica 37, 55–72.
Wansbeek, T.J. and A. Kapteyn (1982). A simple way to obtain the spectral decomposition of variance components models for balanced data. Communications in Statistics A11, 2105–2112.
Wansbeek, T.J. and A. Kapteyn (1983). A note on spectral decomposition and maximum likelihood estimation of ANOVA models with balanced data. Statistics and Probability Letters 1, 213–215.
Washington, S. (2000a). Conducting statistical tests of hypotheses: five common misconceptions found in transportation research. Transportation Research Record 1665, 1–6.
Washington, S. (2000b). Iteratively specified tree-based regression models: theoretical development and example applied to trip generation. ASCE Journal of Transportation Engineering 126, 482–491.
Washington, S. and J. Wolf (1997). Hierarchical tree-based versus linear regression: theory and example applied to trip generation. Transportation Research Record 1581, 82–88.
Washington, S., J. Wolf, and R. Guensler (1997). A binary recursive partitioning method for modeling hot-stabilized emissions from motor vehicles. Transportation Research Record 1587, 96–105.


Washington, S., J. Metarko, I. Fomunung, R. Ross, F. Julian, and E. Moran (1999). An inter-regional comparison: fatal crashes in the southeastern and non-southeastern United States: preliminary findings. Accident Analysis & Prevention 31, 135–146.
Westenberg, J. (1948). Significance test for median and interquartile range in samples from continuous populations of any form. Proceedings, Koninklijke Nederlandse Akademie van Wetenschappen 51, 252–261.
White, D. and S. Washington (2001). Safety restraint use rate as a function of law enforcement and other factors: preliminary analysis. Journal of the Transportation Research Board 1779, 109–115.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817–838.
Whittaker, J., S. Garside, and K. Lindeveld (1997). Tracking and predicting a network traffic process. International Journal of Forecasting 13, 51–61.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics 1, 80–83.
Wilcoxon, F. (1947). Probability tables for individual comparisons by ranking methods. Biometrics 3, 119–122.
Wilcoxon, F. (1949). Some Rapid Approximate Statistical Procedures. American Cyanamid Co., Stanford Research Laboratory, Stanford, CA.
Wilcoxon, F., S.K. Katti, and R.A. Wilcox (1972). Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. In Selected Tables in Mathematical Statistics, Vol. 1. Institute of Mathematical Statistics, American Mathematical Society, Providence, RI, 171–259.
Williams, B.M., P.K. Durvasula, and D.E. Brown (1998). Urban traffic flow prediction: application of seasonal ARIMA and exponential smoothing models. Transportation Research Record 1644, 132–144.
Winston, C. and F. Mannering (1984). Consumer demand for automobile safety. American Economic Review 74(2), 316–319.
Wolf, J., R. Guensler, and S. Washington (1998). High emitting vehicle characterization using regression tree analysis. Transportation Research Record 1641, 58–65.
Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association 57, 348–368.


Appendix A: Statistical Fundamentals

This appendix provides a brief review of statistical fundamentals that are used in the chapters of this book. Because this appendix serves solely as a review of fundamentals, readers wanting greater depth should consult other references that provide more detailed coverage of the topics. For methods from an engineering perspective, see Ash (1993), Johnson (1994), Vardeman (1994), Kottegoda and Rosso (1997), Rosenkrantz (1997), Ayyub (1998), Devore and Farnum (1999), Haldar and Mahadevan (2000), and Vardeman and Jobe (1994). For methods presented from statistics, economics, social sciences, and biology perspectives, see Larson and Marx (1986), Pedhazur and Pedhazur-Schmelkin (1991), Glenberg (1996), Freund and Wilson (1997), Steel et al. (1997), Smith (1998), and Greene (2000).

A.1 Matrix Algebra Review

A review of matrix algebra is undertaken in this section because matrix algebra is a critical and unavoidable tool in multivariate statistical methods. Computations are easier when using matrix algebra than when using simple algebra. Before standard matrix notation is introduced, summation operators, which are useful when working with matrices, are presented. Some common summation operators include

\[
\sum_{i=1}^{N} X_i = X_1 + X_2 + \cdots + X_N, \qquad \sum_{i=1}^{N} C = NC,
\]
\[
\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i, \qquad \sum_{i=1}^{n} (C X_i + K) = C \sum_{i=1}^{n} X_i + nK. \qquad (A.1)
\]


An n × p matrix is expressed as

\[
\mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}, \qquad (A.2)
\]

where i is the row index and j is the column index of matrix X. In many statistical modeling applications, the rows of matrix X correspond to observations within the data set, and the columns of matrix X correspond to independent variables measured on each observation. Therefore a matrix X of size n × p contains observations on n individuals, each with p measured or observed attributes.

A square matrix is a matrix that has an equal number of rows and columns, that is, when n = p. A matrix containing only one column is a column vector, or simply a vector. A row vector is a matrix containing only one row. Matrices are said to be equal only if they have the same dimension and if all corresponding elements in the matrices are equal.

The transpose of a matrix A is obtained by interchanging the columns and rows of matrix A, and is denoted A′ or A^T, such that

\[
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}, \qquad
\mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{12} & a_{22} & a_{32} \\ a_{13} & a_{23} & a_{33} \end{bmatrix}. \qquad (A.3)
\]

Matrix addition and subtraction require that the matrices added or subtracted have the same dimension. The sum or difference of two matrices is the corresponding sum or difference of the individual elements in the two matrices, such that

\[
\mathbf{A} + \mathbf{B} = \begin{bmatrix} a_{11} & \cdots & a_{1p} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{np} \end{bmatrix} + \begin{bmatrix} b_{11} & \cdots & b_{1p} \\ \vdots & \ddots & \vdots \\ b_{n1} & \cdots & b_{np} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1p} + b_{1p} \\ \vdots & \ddots & \vdots \\ a_{n1} + b_{n1} & \cdots & a_{np} + b_{np} \end{bmatrix}. \qquad (A.4)
\]

A.1.1 Matrix Multiplication

Matrix multiplication can be done in two ways: a matrix can be multiplied by a scalar, or by another matrix. To multiply a matrix by a scalar, each element in the matrix is multiplied by the scalar to obtain the scalar–matrix product. The product of two matrices is found by computing the cross products of the rows of matrix A with the columns of matrix B and summing across cross products. In matrix multiplication, the order of multiplication matters; therefore, a matrix is premultiplied or postmultiplied by another matrix. In computing products of matrices, matrix dimensions are critical. To obtain the product AB, the number of columns in matrix A must equal the number of rows in matrix B. The resultant matrix C, which results when matrix A is postmultiplied by B, contains rows equal to the number of rows in matrix A and columns equal to the number of columns in matrix B.

Example A.1

Consider the product of two matrices A and B as follows:

\[
\mathbf{A}_{2 \times 3} = \begin{bmatrix} 4 & 6 & 2 \\ 5 & 1 & 5 \end{bmatrix}
\quad \text{and} \quad
\mathbf{B}_{2 \times 2} = \begin{bmatrix} 5 & 1 \\ 3 & 3 \end{bmatrix}.
\]

Premultiplying matrix A by matrix B gives

\[
\mathbf{B}_{2 \times 2}\,\mathbf{A}_{2 \times 3} = \begin{bmatrix} 5 & 1 \\ 3 & 3 \end{bmatrix} \begin{bmatrix} 4 & 6 & 2 \\ 5 & 1 & 5 \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \end{bmatrix}.
\]

Calculating the entries of this matrix product results in

\[
\begin{aligned}
C_{11} &= 5(4) + 1(5) = 25; & C_{12} &= 5(6) + 1(1) = 31; & C_{13} &= 5(2) + 1(5) = 15;\\
C_{21} &= 3(4) + 3(5) = 27; & C_{22} &= 3(6) + 3(1) = 21; & C_{23} &= 3(2) + 3(5) = 21.
\end{aligned}
\]

Thus, the resultant matrix C is given as

\[
\mathbf{C}_{2 \times 3} = \begin{bmatrix} 25 & 31 & 15 \\ 27 & 21 & 21 \end{bmatrix}.
\]

Note that the matrix product AB is not defined.

When computing products of matrices, the product AB generally does not equal the product BA. In fact, it is possible to have one of the two products defined and the other not defined, as in Example A.1. If two matrices A and B can be multiplied, they are said to be conformable. Calculation of matrix products for dimensions up to 3 × 3 is fairly easy by hand, but as the size of the matrices increases, the computation becomes cumbersome. Computers are ideal for multiplying matrices.
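Matrix products of this size are also easy to verify by machine. The following minimal sketch uses Python with the numpy library (an illustrative assumption; any matrix-capable software serves equally well) to reproduce Example A.1:

```python
import numpy as np

A = np.array([[4, 6, 2],
              [5, 1, 5]])   # 2 x 3 matrix from Example A.1
B = np.array([[5, 1],
              [3, 3]])      # 2 x 2 matrix from Example A.1

C = B @ A                   # premultiply A by B: (2 x 2)(2 x 3) -> (2 x 3)
print(C)                    # [[25 31 15]
                            #  [27 21 21]]

try:
    A @ B                   # (2 x 3)(2 x 2): columns of A != rows of B
except ValueError:
    print("The product AB is not defined.")
```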

A.1.2 Linear Dependence and Rank of a Matrix

Column vectors are linearly dependent if one vector can be expressed as a linear combination of the others. If this is not the case, then the vectors are linearly independent.


Example A.2

A matrix A is given as

\[
\mathbf{A}_{3 \times 4} = \begin{bmatrix} 1 & 4 & 5 & 0 \\ 4 & 4 & 5 & 1 \\ 5 & 4 & 5 & 6 \end{bmatrix}.
\]

Note that the second column is a linear combination of the third column, such that

\[
A_{i,2} = \tfrac{4}{5}\,A_{i,3}, \quad i = 1, 2, 3.
\]

Stated more formally, scalar multiples of the columns of a matrix A of size n × m can be sought such that the sum of scalar–column-vector products λ₁C₁ + λ₂C₂ + ⋯ + λ_mC_m = 0. If a combination of scalars λ₁, λ₂, …, λ_m, not all zero, is found such that the linear combination sums to zero, then at least two of the columns are linearly dependent. If the only set of scalars for which this holds is λ₁ = λ₂ = ⋯ = λ_m = 0, then the columns of the matrix are linearly independent. Linear or near-linear dependencies can arise in statistical modeling and cause difficulties in data analysis. A linear dependency between two columns suggests that the two columns contain the same information, differing only by a constant.

The rank of a matrix is defined to be the maximum number of linearly independent columns in the matrix. In the previous example, the rank of the matrix is 3. A matrix is said to have full rank if none of the column vectors is linearly dependent. In statistical modeling applications, a full-rank matrix of independent variables suggests that each variable is measuring different information, or a different dimension in the data.
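As a minimal numerical sketch (assuming Python with numpy, and using the matrix from Example A.2), the column dependency and the rank can be checked directly:

```python
import numpy as np

A = np.array([[1, 4, 5, 0],
              [4, 4, 5, 1],
              [5, 4, 5, 6]], dtype=float)   # matrix from Example A.2

# The second column equals (4/5) times the third column.
print(np.allclose(A[:, 1], (4 / 5) * A[:, 2]))   # True

# The rank is 3, short of full column rank (4), reflecting the dependency.
print(np.linalg.matrix_rank(A))                  # 3
```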

A.1.3 Matrix Inversion (Division)

Matrix inversion is the equivalent of division in algebra. In (nonmatrix) algebra, if a number is multiplied by its inverse or reciprocal, the result is always 1, such that (12)(1/12) = 1. In matrix algebra, the inverse of a matrix A is another matrix, denoted by A⁻¹. The inverse property holds such that A⁻¹A = AA⁻¹ = I, where I is the identity matrix, a diagonal matrix whose elements on the main diagonal are all ones and all other elements are zeros, such that

\[
\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}. \qquad (A.5)
\]


A matrix inverse is defined only for square matrices. Some matrices do not have an inverse, but if a matrix does have an inverse, the inverse is unique. Also, an inverse of a square r × r matrix exists only if the matrix has full rank. Such a matrix is said to be nonsingular. An r × r matrix with rank less than r is said to be singular, and does not have an inverse.

In ordinary algebra the problem 10x = 30 is solved by dividing both sides of the equation by 10 to obtain x = 3. To illustrate the equivalent operation using matrix algebra, consider the following example.

Example A.3

The following three equations have three unknowns, X, Y, and Z:

\[
\begin{aligned}
6X + 8Y + 10Z &= 40\\
10X + 9Y + 24Z &= 26\\
14X + 10Y + 6Z &= 12.
\end{aligned}
\]

These equations can be rewritten in matrix form as AB = C, where

\[
\mathbf{A} = \begin{bmatrix} 6 & 8 & 10 \\ 10 & 9 & 24 \\ 14 & 10 & 6 \end{bmatrix}, \qquad
\mathbf{B} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}, \qquad
\mathbf{C} = \begin{bmatrix} 40 \\ 26 \\ 12 \end{bmatrix}.
\]

Then, to solve for the unknown quantities in vector B, the following matrix operations are performed:

AB = C
A⁻¹AB = A⁻¹C (premultiply both sides by A⁻¹)
IB = A⁻¹C (a matrix multiplied by its inverse is I)
B = A⁻¹C (a matrix multiplied by I is itself).

One need only find the inverse of matrix A to solve for the unknown quantities in vector B.
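In practice this computation is delegated to software. A minimal sketch (assuming Python with numpy) solves the system of Example A.3 both by the inverse route shown above and by np.linalg.solve, which factors A rather than explicitly inverting it and is numerically preferable:

```python
import numpy as np

A = np.array([[6, 8, 10],
              [10, 9, 24],
              [14, 10, 6]], dtype=float)
C = np.array([40, 26, 12], dtype=float)

B = np.linalg.solve(A, C)     # solves AB = C directly
print(B)                       # the unknowns [X, Y, Z]

A_inv = np.linalg.inv(A)       # the B = A^-1 C route from the example
print(A_inv @ C)               # same solution
```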

Computing matrix inversions for matrices larger than 3 × 3 is a formidable computation, which justifies the extensive use of computers for these types of problems. In statistical modeling applications, statistical analysis software is used to perform these operations. The general formula for computing the individual ijth element of matrix A⁻¹, the inverse of matrix A, is given as


\[
a^{ij} = \frac{C_{ji}}{|\mathbf{A}|}, \qquad (A.6)
\]

where C_{ji} is the jith cofactor of A, and |A| is the determinant of A (Greene, 2000). The interested reader can examine cofactors, determinants, and matrix inversion computations in more detail by consulting any number of references on matrix algebra (such as Greene, 2000, and Smith, 1998).

A.1.4 Eigenvalues and Eigenvectors

Eigenvalues are the characteristic roots of a matrix, whereas eigenvectors are the characteristic vectors. Characteristic roots or eigenvalues of a matrix have many applications. A useful set of results for analyzing a square matrix A arises from the solutions to the following set of equations:

\[
\mathbf{A}\mathbf{E} = \lambda \mathbf{E}, \qquad (A.7)
\]

where A is a square matrix, λ is a vector of eigenvalues, and E contains the matrix eigenvectors. The eigenvalues that solve Equation A.7 can be obtained by performing matrix manipulations, constraining the solutions such that E^T E = 1 (to remove the indeterminacy), and then solving

\[
|\mathbf{A} - \lambda \mathbf{I}| = 0. \qquad (A.8)
\]

The solutions to these equations are nonzero only if the matrix A − λI is singular, that is, has a zero determinant. The eigenvalues of a symmetric matrix are always real, and fortunately most matrices that are solved for eigenvalues and eigenvectors in statistical modeling endeavors are symmetric.

Example A.4

To find the characteristic roots of the matrix X, where

\[
\mathbf{X} = \begin{bmatrix} 4 & 2 \\ 1 & 5 \end{bmatrix},
\]

Equation A.8 is applied, so

\[
|\mathbf{X} - \lambda \mathbf{I}| = \begin{vmatrix} 4 - \lambda & 2 \\ 1 & 5 - \lambda \end{vmatrix} = (4 - \lambda)(5 - \lambda) - 2(1) = \lambda^2 - 9\lambda + 18 = 0.
\]

The solutions, or eigenvalues, of this second-order polynomial are λ = 6 and λ = 3.


With the vector of eigenvalues determined, the original formulation shown in Equation A.7 can be used to find the eigenvectors. Equation A.7 can be manipulated to obtain

\[
(\mathbf{A} - \lambda \mathbf{I})\mathbf{E} = \mathbf{0}. \qquad (A.9)
\]

Equation A.9 can then be used to find the eigenvectors of matrix A. Recall, however, that the constraint E^T E = 1 must be imposed to obtain a unique solution.

Example A.5

Continuing with the previous example, the eigenvectors of the matrix X are obtained by applying Equation A.9,

\[
(\mathbf{X} - \lambda \mathbf{I})\mathbf{E} = \begin{bmatrix} 4 - \lambda & 2 \\ 1 & 5 - \lambda \end{bmatrix} \begin{bmatrix} E_1 \\ E_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
\]

Substituting the values 6 and 3 for λ, respectively, results in the following solutions:

\[
\lambda = 6: \quad -2E_1 + 2E_2 = 0 \ \text{ and } \ E_1 - E_2 = 0, \ \text{ so } E_1 = E_2;
\]
\[
\lambda = 3: \quad E_1 + 2E_2 = 0 \ \text{ and } \ E_1 + 2E_2 = 0, \ \text{ so } E_1 = -2E_2.
\]

There are an infinite number of solutions to these equations because the constraint E^T E = 1 has not yet been imposed. When that constraint is met, the eigenvectors can be shown to be

\[
\mathbf{E}_{\lambda = 6} = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix} \quad \text{and} \quad \mathbf{E}_{\lambda = 3} = \begin{bmatrix} -2/\sqrt{5} \\ 1/\sqrt{5} \end{bmatrix}.
\]
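A minimal numerical sketch for Examples A.4 and A.5 (assuming Python with numpy; the returned eigenvectors are normalized to unit length and may differ from those above by sign):

```python
import numpy as np

X = np.array([[4, 2],
              [1, 5]], dtype=float)

eigenvalues, eigenvectors = np.linalg.eig(X)
print(eigenvalues)    # 3 and 6 (the order may differ)
print(eigenvectors)   # columns are the unit-length eigenvectors:
                      # (-2, 1)/sqrt(5) for 3 and (1, 1)/sqrt(2) for 6,
                      # possibly with opposite signs
```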

A.1.5 Useful Matrices and Properties of Matrices

There are numerous useful matrices that arise in statistical modeling. A matrix is symmetric if A = A^T; thus a symmetric matrix must also be square. A diagonal matrix is a square matrix whose off-diagonal elements are all zeros. There are two types of diagonal matrices: the identity matrix and the scalar matrix. The identity matrix is denoted by I. It is a diagonal matrix whose elements on the main diagonal are all ones, as shown in Equation A.5. Premultiplying or postmultiplying any r × r matrix A by an r × r identity matrix I results in A, such that AI = IA = A. The scalar matrix is a diagonal matrix whose main-diagonal elements are a constant λ. Multiplying an r × r matrix A by the r × r scalar matrix λI is equivalent to multiplying A by the scalar λ. A column vector with all elements equal to 1 is denoted by i; a square matrix with all elements equal to 1 is denoted by J.

Some useful theorems in matrix algebra (shown without proof), which enable calculations necessary in statistical modeling applications, include the following:

A + B = B + A
(A + B) + C = A + (B + C)
(AB)C = A(BC)
C(A + B) = CA + CB
λ(A + B) = λA + λB
(A^T)^T = A
(A + B)^T = A^T + B^T
(AB)^T = B^T A^T
(ABC)^T = C^T B^T A^T
(AB)^{-1} = B^{-1} A^{-1}
(ABC)^{-1} = C^{-1} B^{-1} A^{-1}
(A^{-1})^{-1} = A
(A^{-1})^T = (A^T)^{-1}
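These identities are easy to spot-check numerically. A sketch (assuming numpy, with randomly generated and therefore almost surely nonsingular matrices) verifies two of them:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

print(np.allclose((A @ B).T, B.T @ A.T))                 # (AB)^T = B^T A^T
print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ np.linalg.inv(A)))  # (AB)^-1 = B^-1 A^-1
```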

A.1.6 Matrix Algebra and Random Variables

Matrix algebra is extremely efficient for dealing with random variables, random vectors, and random matrices. In particular, matrix algebra is extremely useful for manipulating matrices used in the development of statistical models and methods. This section introduces some of the basic matrices used in the statistical modeling of random phenomena. Other matrices are shown in the text as necessary.

A common starting point in many statistical modeling applications is a matrix of random variables of size n × p,

\[
\mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}, \qquad (A.10)
\]

where the p columns in matrix X represent variables (for example, gender, weight, number of lanes, etc.), and the n rows in matrix X represent observations across sampling units (for example, individuals, autos, road sections, etc.).

A mean matrix E[X], representing the means across individuals, is given as

\[
E[\mathbf{X}] = \begin{bmatrix} E[x_{11}] & \cdots & E[x_{1p}] \\ \vdots & \ddots & \vdots \\ E[x_{n1}] & \cdots & E[x_{np}] \end{bmatrix}, \qquad (A.11)
\]

where the means of the elements in the mean matrix are calculated using

\[
E[x_{ij}] = \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}; \quad j = 1, 2, \ldots, p. \qquad (A.12)
\]

A deviations matrix is obtained by subtracting matrix E[X] from X. Building on the fact that VAR[X_i] = E[(X_i − E[X_i])²], in matrix form the variance–covariance matrix of matrix X is obtained using

\[
\mathrm{VAR}[\mathbf{X}]_{p \times p} = E\big[(\mathbf{X} - E[\mathbf{X}])^T (\mathbf{X} - E[\mathbf{X}])\big], \qquad (A.13)
\]

where the deviations matrix (X − E[X]) is n × p, its transpose is p × n, and so their product is p × p. A correlation matrix is obtained by computing the variance–covariance matrix using standardized variables, where the standardized values

\[
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}; \quad i = 1, 2, \ldots, n; \ j = 1, 2, \ldots, p,
\]

with s_j the standard deviation of variable j, replace the original x_{ij} terms.
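A minimal sketch of Equations A.11 through A.13 (assuming Python with numpy and simulated data; the sample version below divides by n − 1, which is the convention np.cov also follows):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(100, 3))      # n = 100 observations on p = 3 variables

x_bar = X.mean(axis=0)             # column means, as in Equation A.12
D = X - x_bar                      # deviations matrix
S = D.T @ D / (X.shape[0] - 1)     # sample variance-covariance matrix

print(np.allclose(S, np.cov(X, rowvar=False)))   # True
print(np.corrcoef(X, rowvar=False))              # correlation matrix
```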

A.2 Probability, Conditional Probability, and Statistical Independence

The probability of an event or outcome is the proportion of times the event occurs in a long-run sequence or number of trials (Johnson, 1994). In notation, probability is defined as

\[
P(A) = \frac{\mathrm{count}(A)}{\mathrm{trials}}, \qquad (A.14)
\]

where count(A) is the number of times event A occurs, and trials is the number of times the experiment was repeated or the recorded number of observations where event A could have occurred, typically denoted as n.


Note that P(A) converges in probability as the number of trials approaches infinity. Thus, one view of probability supports the notion that many trials need to be observed prior to obtaining a reliable estimate of probability. In contrast, a Bayesian statistician can generate estimates of probability using subjective information (see Chapter 2 for additional details).

Conditional probability is defined as the probability of one event, say, event A, given that event B has already occurred. In notation, conditional probability is given as

\[
P(A|B) = \frac{P(AB)}{P(B)}, \qquad (A.15)
\]

where P(AB) is the joint probability of events A and B occurring together. Conditional probability is a cornerstone of statistical modeling. For example, many statistical models explain or predict the probability of an outcome, Y, given the independent variables X, such that the model provides P(Y|X).

Statistical hypothesis tests are conditional probabilities. A classical or frequentist statistical hypothesis test, presented in general form, is given as

\[
P(\text{data} \,|\, \text{true null hypothesis}) = \frac{P(\text{data, true null hypothesis})}{P(\text{true null hypothesis})}. \qquad (A.16)
\]

Often the denominator on the right-hand side of Equation A.16, the probability that the null hypothesis is true, is unknown and is in fact the quantity sought. Thus, the classical hypothesis test does not provide the probability of the null hypothesis being true, but instead the conditional probability of observing the data given a true null hypothesis. Despite this fact, the conditional probability does provide objective evidence regarding the plausibility of the null hypothesis.

Bayes' theorem is also based on the notion of conditional probability and is derived in the following manner:

\[
P(A|B) = \frac{P(AB)}{P(B)} \quad \text{and} \quad P(B|A) = \frac{P(AB)}{P(A)};
\]

so, by substitution,

\[
P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}, \qquad (A.17)
\]

which is Bayes' theorem. Bayes' theorem is used to circumvent practical and philosophical problems of the classical hypothesis test result.
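A minimal numeric sketch of Bayes' theorem, with purely hypothetical probabilities chosen for illustration:

```python
# Hypothetical inputs: P(B|A) = 0.9, P(A) = 0.2, P(B|not A) = 0.25.
p_b_given_a = 0.9
p_a = 0.2
p_b_given_not_a = 0.25

# Total probability of B, then Bayes' theorem (Equation A.17).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_a_given_b)   # 0.18 / 0.38 = 0.4737 (approximately)
```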


Statistical independence is a useful concept in the development of some statistical methods. It describes a condition where the probability of occurrence of one event, A, is entirely unaffected by the probability of occurrence of another event, B. So if events A and B are independent, then

\[
P(AB) = P(A)\,P(B). \qquad (A.18)
\]

By using the conditional probability formula in Equation A.15, it can be seen that for statistically independent events,

\[
P(A|B) = \frac{P(AB)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A). \qquad (A.19)
\]

Equation A.19 reveals that if events A and B are statistically independent, then the probability that event A occurs given that event B has already occurred is simply the probability of event A. This is intuitive because event B does not influence event A.

A.3 Estimating Parameters in Statistical Models — Least Squares and Maximum Likelihood

In the text of the book many pages are devoted to estimating parameters in statistical models. Parameters are an integral part of statistical models. Often, beginning with a theory of a process or phenomenon that is known to generate data, a model is postulated that may involve any number of variables. The general model may be

\[
\mathbf{Y} = f(\mathbf{X}; \boldsymbol{\beta}) + \boldsymbol{\varepsilon}, \qquad (A.20)
\]

where Y is a vector of outcomes, β is a vector of estimated model parameters, X is a matrix of variables across n observations thought to impact or influence Y, and ε is a vector of disturbances — that portion of Y not accounted for by the function of β and the X. The majority of the text of this book discusses various ways of organizing functions, f, that relate β and the matrix X to the outcome vector Y. For example, the ordinary least squares regression model relates the X terms to Y through the expression Y = Xβ + ε, where β is the vector of estimated parameters, or the β's. The parameters in this or any statistical model are estimated from observational or experimental data.

There are numerous methods for estimating parameters in statistical models. Parameter estimation methods include ordinary or unweighted least


squares, weighted least squares, maximum likelihood, and method of moments, to name but a few. These methods are discussed in detail, where appropriate, in the development of statistical models in the text. In this section two common parameter estimation methods are briefly presented, ordinary least squares and maximum likelihood, mainly to familiarize the reader with these concepts.

Ordinary least squares estimation involves minimizing the squared differences between observed and predicted observations, such that

\[
Q = \min \sum_{i} \big(Y_i - \hat{Y}_i\big)^2, \qquad (A.21)
\]

where Ŷ_i is the value of Y predicted by the statistical model for the ith trial or observation. The predicted value of Y is a function of the X terms and the collection of estimated parameters. In least squares estimation, the parameter estimates are obtained by solving Equation A.21. Note that no assumptions about statistical distributions are required to obtain parameter estimates using least squares estimation.

Maximum likelihood methods also provide estimates of parameters in statistical models; however, the method for obtaining them is fundamentally different. The principle behind maximum likelihood is that different populations generate different samples, and so any particular sample is more likely to come from some populations than from others. For example, if a random sample y₁, y₂, …, y_n was drawn, there is some parameter θ (for simplicity assumed to be the sample mean) that is most likely to have generated the sample. Figure A.1 shows two different statistical distributions, A and B, which represent two different assumed sample means. In the figure, the sample mean associated with distribution A is much more likely to generate the sample of y than the sample mean associated with distribution B.

FIGURE A.1 Illustration of maximum likelihood estimation. [Figure: two candidate densities f(ε), labeled A and B, plotted over the sample values y₁, …, y_N.]

Maximum likelihood estimation seeks the parameter or set of parameters that are most likely to have generated the observed data (y) among all possible θ's. Unlike least squares estimation, a particular probability distribution must be specified to obtain maximum likelihood estimates.
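Both estimation principles can be sketched numerically. The following illustration (assuming Python with numpy and simulated data) obtains least squares estimates for a linear model via the normal equations, and illustrates maximum likelihood with the normal-mean case, where the sample mean is the maximum likelihood estimator:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares (Equation A.21): minimize the sum of squared residuals.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)             # normal equations
print(beta_hat)                                           # close to [1.0, 2.0]

# Maximum likelihood for the mean of a normal sample: the sample mean
# maximizes the likelihood, which is the idea illustrated in Figure A.1.
sample = rng.normal(loc=5.0, scale=1.0, size=500)
print(sample.mean())                                      # near 5.0
```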

Different parameter estimation methods and their associated properties are discussed in the text of the book when appropriate. Desirable properties of estimators in general terms are discussed in Chapter 3.

A.4 Useful Probability Distributions

Probability distributions are key to statistical and econometric analysis. By identifying the properties of underlying data-generating processes, one can readily answer probability-related questions, such as the probability of the occurrence of an event. Fortunately, well-understood probability distributions approximate a large number of commonly studied phenomena. For example, arrival times of vehicles at the back of a queue and accident occurrence on a section of roadway are often approximated well by a Poisson distribution. Many samples, such as vehicle speeds on a segment of freeway, are well approximated by normal distributions.

There are generally two types of probability distributions: continuous and discrete. Continuous distributions arise from variables that can take on any value within a range of values and are measured on the ratio or interval scale. Discrete distributions arise from ordinal data or from count data, which can take on only integer values. Two useful properties of probability distributions are that the probability of any particular outcome lies between 0 and 1, and that the sum of probabilities over all possible outcomes is 1.

In this section four commonly used statistical distributions are discussed: the standard normal Z distribution, the t distribution, the χ² distribution, and the F distribution. Tables of probabilities for these distributions are provided in Appendix C. Other probability distributions are discussed in relation to specific modeling applications when necessary. In addition, numerous probability distribution functions are described in Appendix B.

A.4.1 The Z Distribution

The Z distribution can be derived from the central limit theorem. If the mean X̄ is calculated on a sample of n observations drawn from a distribution with mean μ and known finite variance σ², then the sampling distribution of the test statistic Z* is approximately standard normal distributed, regardless of the characteristics of the population distribution. For example, the population could be normal, Poisson, binomial, or beta distributed. The standard normal distribution is characterized by a bell-shaped curve, with mean zero and variance equal to one. The random variable Z* is computed as follows:

\[
Z^* = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}, \qquad (A.22)
\]

where X̄ denotes the sample mean. The statistic Z* is a random variable whose distribution approaches that of the standard normal distribution as n approaches infinity. This is an extremely useful result. It says that regardless of the distribution of the population from which samples are drawn, a test statistic computed using Equation A.22 will approach the standard normal distribution as the sample becomes large. In practice, sample sizes of 20 or more will result in a virtually normally distributed sampling distribution of the mean (Glenberg, 1996).

The Z distribution is also extremely useful for examining normal distributions with any mean and variance. By using the standard normal transformation,

\[
Z_i = \frac{X_i - \mu}{\sigma}, \qquad (A.23)
\]

original variables X_i obtained from any normal distribution can be standardized to new variables Z_i that are standard normal distributed. The distribution functions for both the normal distribution and the standard normal distribution are provided in Appendix B. Percentiles of the standard normal distribution Z are provided in Appendix C.

A.4.2 The t Distribution

The t distribution is used for conducting many of the statistical tests described throughout this book. Recall that the Z* statistic forms the basis for confidence interval estimation and hypothesis testing when σ² is known. When σ² is unknown, however, it is natural to replace it with its unbiased estimator s². When this is done, the test statistic

\[
t^* = \frac{\bar{X} - \mu}{s / \sqrt{n}} \qquad (A.24)
\]

is obtained, where t* is approximately t distributed with n − 1 degrees of freedom.
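A small simulation sketch of Equations A.22 and A.24 (assuming Python with numpy and scipy; the exponential population is deliberately non-normal, illustrating that Z* is approximately standard normal nonetheless):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
x = rng.exponential(scale=2.0, size=50)   # population mean 2, std. dev. 2
mu_0, sigma = 2.0, 2.0

z_star = (x.mean() - mu_0) / (sigma / np.sqrt(x.size))          # Equation A.22
print(z_star, 2 * stats.norm.sf(abs(z_star)))                    # two-sided p

t_star = (x.mean() - mu_0) / (x.std(ddof=1) / np.sqrt(x.size))   # Equation A.24
print(t_star, 2 * stats.t.sf(abs(t_star), df=x.size - 1))
```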


This probability distribution was developed by W. S. Gosset, an employee of the Guinness Brewery, who in 1908 published his work under the pseudonym "Student." Gosset called the statistic t and, since then, its distribution has been called Student's t distribution. The t distribution is similar to the standard normal. Figure A.2 depicts both a t distribution and a standard normal distribution. Like the standard normal distribution, t is symmetric around zero. It is mound shaped, whereas the normal distribution is bell shaped. The extent to which the t distribution is more spread out than the normal distribution is determined by the degrees of freedom of the distribution. As the sample size becomes large, the t distribution approaches the standard normal Z distribution. It is important to note also that the t distribution requires that the population from which samples are drawn is normal.

FIGURE A.2 The standard normal Z and the t(3) probability density functions. [Figure: the Normal(0, 1) and Student's t(3) densities plotted together.]

The distribution functions for the t distribution are provided in Appendix B. Percentiles of the t distribution are provided in Appendix C.

A.4.3 The χ² Distribution

The χ² distribution is extremely useful because it arises in numerous common situations. Statistical theory shows that the square of a standard normal variable Z is χ² distributed with 1 degree of freedom. Also, let Z₁, Z₂, …, Z_k be k independent standard normal random variables. If each of these random variables is squared, their sum will follow a χ² distribution with k degrees of freedom, such that

\[
X^2 = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k. \qquad (A.25)
\]

The X² statistic shows that the χ² distribution arises from the sum of independent squared normal random variables. As a sum of squares, the χ²


random variable cannot be negative and, as a result, is bounded on the low end by zero. The χ² distribution is therefore skewed to the right. Figure A.3 shows several χ² distributions with different degrees of freedom. Note that as the degrees of freedom increase, the χ² distribution approximates the normal distribution. In fact, as the degrees of freedom increase, the χ² distribution approaches a normal distribution with mean equal to the degrees of freedom and variance equal to two times the degrees of freedom.

FIGURE A.3 The χ² density function with various degrees of freedom. [Figure: χ² densities with 2, 5, 10, and 15 degrees of freedom.]

Another useful application of the χ² distribution is to assess whether the variance of a sample is the same as the variance in the population. If s² is the estimated variance of a random sample of size n drawn from a normal population having variance σ², then the sampling distribution of the test statistic X² is approximately χ² distributed with ν = n − 1 degrees of freedom, such that

\[
X^2 = \frac{(n - 1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}. \qquad (A.26)
\]

Finally, the χ² distribution can be used to compare two distributions. If frequencies of events are observed in certain categories or frequency bins under both observed and expected distributions, then the χ² distribution can be used to test whether the observed frequencies are equivalent to the expected frequencies. The test statistic is given as


\[
X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(I-1)(J-1)}, \qquad (A.27)
\]

where O_{ij} and E_{ij} are the observed and expected frequencies in bin ij, and I and J are the number of rows and columns in a two-way contingency table. The test statistic can easily be made to accommodate multiway tables, where there might be, for example, I rows, J columns, and K elements associated with each ijth bin. In this case the degrees of freedom in Equation A.27 become (I − 1)(J − 1)(K − 1). The expected frequency in Equation A.27 may be the result of a model of statistical independence, a frequency distribution based on an assumed statistical distribution like the normal distribution, or an empirical distribution (where two empirical distributions are compared). Caution needs to be applied when using the test statistic shown in Equation A.27, because small expected frequencies can compromise the reliability of the test statistic. In such cases, exact methods should be applied, as described in Mehta and Patel (1983).

The distribution functions for the χ² distribution are provided in Appendix B. Percentiles of the χ² distribution are provided in Appendix C.
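The test in Equation A.27 is routinely carried out by software. A minimal sketch (assuming Python with scipy; the contingency table is hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 contingency table of observed frequencies.
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value, dof)   # dof = (I - 1)(J - 1) = 2
print(expected)             # expected frequencies under independence
```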

A.4.4 The F Distribution

Another statistical distribution that is extremely useful in statistics, and the final distribution discussed in this section, is the F distribution. This distribution is named after the English statistician Sir Ronald A. Fisher, who discovered it in 1924. The F distribution is the distribution of the ratio of two independent χ² random variables, each divided by its own degrees of freedom. For example, let χ²₁ be a χ² random variable with ν₁ degrees of freedom, and χ²₂ be a χ² random variable with ν₂ degrees of freedom. The ratio of these two random variables is F distributed with ν₁ and ν₂ degrees of freedom, respectively. That is,

\[
F^* = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2} \sim F_{\nu_1, \nu_2}. \qquad (A.28)
\]

A useful application of the F distribution is to test the ratio of variances of two samples drawn from a normal population, or to test whether two samples were drawn from a single population with variance σ². If s₁² and s₂² are the estimated variances of independent random samples of size n₁ and n₂, respectively, then the sampling distribution of the test statistic F* is approximately F distributed with n₁ − 1 (numerator) and n₂ − 1 (denominator) degrees of freedom, such that

\[
F^* = \frac{s_1^2}{s_2^2} \sim F_{n_1 - 1,\, n_2 - 1}. \qquad (A.29)
\]


The F distributions, shown in Figure A.4 with different degrees of freedom, are asymmetric — a quality "inherited" from the χ² distributions; in addition, their shape resembles that of the χ² distributions. Note also that F(2, 10) ≠ F(10, 2).

The distribution functions for the F distribution are provided in Appendix B. Percentiles of the F distribution are provided in Appendix C.

FIGURE A.4 The F density function for different degrees of freedom combinations. [Figure: the F(2, 10), F(10, 2), and F(30, 30) densities.]
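A minimal simulation sketch of the variance-ratio test in Equation A.29 (assuming Python with numpy and scipy; the two samples are simulated normal data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
s1 = rng.normal(scale=1.0, size=30)
s2 = rng.normal(scale=1.5, size=25)

F = s1.var(ddof=1) / s2.var(ddof=1)          # Equation A.29
dfn, dfd = s1.size - 1, s2.size - 1
p = 2 * min(stats.f.cdf(F, dfn, dfd),
            stats.f.sf(F, dfn, dfd))         # two-sided p-value
print(F, p)
```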


Appendix B: Glossary of Terms

A

Abscissa: The horizontal or x-coordinate of a data point as graphed on a Cartesian coordinate system. The y-coordinate is called the ordinate.

Accelerated lifetime model: An approach used in hazard-based analysis of duration data that assumes covariates rescale time directly in a baseline survivor function, which is the survivor function when all covariates are zero.

Accuracy: Degree to which some estimate matches the true state of nature. Typically, accuracy is a measure of the long-run average of predictions compared to the true, often unknown, average (see also precision, bias, efficiency, and consistency).

Aggregate: The value of a single variable that summarizes, adds, or represents the mean of a group or collection of data.

Aggregation: Compounding of primary data, usually for the purpose of expressing them in summary or aggregate form.

Aggregation bias: A problem that arises when forecasting aggregate (population) impacts using disaggregate nonlinear models, such as the logit model.

Alpha level: The probability selected by the analyst that reflects the degree of acceptable risk for rejecting the null hypothesis when, in fact, the null hypothesis is true. The degree of risk is not interpretable for an individual event or outcome; instead, it represents the long-run probability of making a Type I error (see also beta and Type II error).

Alternative hypothesis: The hypothesis that one accepts when the null hypothesis (the hypothesis under test) is rejected. It is usually denoted by HA or H1 (see also null hypothesis).

Analysis of variance (ANOVA): Analysis of the total variability of a set of data (measured by their total sum of squares) into components that are attributed to different sources of variation. The sources of variation are apportioned according to variation as a result of random fluctuations and those as a result of systematic differences across groups.

Approximation error: In general, an error due to approximation from making a rough calculation, estimate, or guess. When performing numerical calculations, approximations result from rounding errors, for example, 22/7 ≈ 3.1429.

Arithmetic mean: The result of summing all measurements from a population or sample and dividing by the number of population or sample members. The arithmetic mean is also called the average, which is a measure of central tendency.

Attitude survey: Surveys individually designed to measure reliable and valid information regarding respondents' attitudes and/or preferences in order to assist in making critical decisions and to focus resources where they are most needed.

Autocorrelation: The temporal association across a time series of observations; also referred to as serial correlation.

Autoregressive integrated moving average (ARIMA) process: Time series that are made stationary by differencing and whose differenced observations are linearly dependent on past observations and past innovations.

Autoregressive moving average (ARMA) process: Stationary time series whose observations are linearly dependent on past observations and past innovations.

Autoregressive (AR) process: Stationary time series that are characterized by a linear relationship between adjacent observations. The order of the process p defines the number of past observations upon which current observations depend.

Average: The arithmetic mean of a set of observations. The average is a measure of central tendency of a scattering of observations, as are also the median and mode.

B

Backshift operator: Model convention that defines stepping backward in a time-indexed data series (see also time series).

Bar chart or diagram: A graph of observed data using a sequence of rectangles, whose widths are fixed and whose heights are proportional to the number of observations, proportion of total observations, or probability of occurrence.


Bayes’ theorem: A theorem, developed by Bayes, that relates the con-ditional probability of occurrence of an event to the probabilitiesof other events. Bayes’ theorem is given as

,

where P(A|B) is the probability of event A given that event B hasoccurred, P(A) is the probability of event A, and P(A ) is theprobability of event A not occurring. Bayes’ theorem is used toovercome some of the interpretive and philosophical shortcom-ings of frequentist or classical statistical methods, and to incor-porate subjective information for obtaining parameter estimates.

Bayesian statistical philosophy: Bayesian statisticians base statisticalinference on a number of philosophical underpinnings that differin principle from frequentist or classical statistical thought. First,Bayesians believe that research results should reflect updates ofprior research results. In other words, prior knowledge should beincorporated formally into current research to obtain the best“posterior” or resultant knowledge. Second, Bayesians believethat much is gained from insightful prior, subjective informationregarding to the likelihood of certain types of events. Third, Baye-sians use Bayes’ theorem to translate probabilistic statements intodegrees of belief, instead of a classical confidence interval inter-pretation (see also frequentist statistical philosophy).

Before–after data: A study where data are collected prior to and follow-ing an event, treatment, or action. The event, treatment, or actionapplied between the two periods is thought to affect the dataunder investigation. The purpose of this type of study is to showa relationship between the data and the event, treatment, or ac-tion. In experimental research all other factors are either random-ized or controlled (see also panel data, time-series data, and cross-sectional data).

Bernoulli distribution: Another name for binomial distribution.Bernoulli trial: An experiment where there is a fixed probability p of

“success,” and a fixed probability 1 – p of “failure.” In a Bernoulliprocess, the events are independent across trials.

Best fit: See goodness of fit.Beta distribution: A distribution, which is closely related to the F-dis-

tribution, used extensively in analysis of variance. In Bayesianinference, the beta distribution is sometimes used as the priordistribution of a parameter of interest. The beta distribution, whenused to describe the distribution of lambda of a mixture of Poisson

P A BP B A P A

P B A P A P B A P A

© 2003 by CRC Press LLC

Page 350: Data Analysis Book

distributions, results in the negative binomial distribution. Thebeta density is given by

,

where and are shape parameters, the mean is /( + ), andvariance is /[( + +1)( + )].

Beta error: The probability assigned (or implied) by the analyst thatreflects the degree of acceptable risk for accepting the null hy-pothesis when, in fact, the null hypothesis is false. The degree ofrisk is not interpretable for an individual event or outcome; in-stead, it represents the long-run probability of making a Type IIerror (see also Type I error and alpha).

Beta parameters: The parameters in statistical models, often representedby the Greek letter . Beta parameters are meant to represent fixedparameters of the population.

Bias: When estimating population parameters, an estimator is biasedif its expected value does not equal the parameter it is intendedto estimate. In sampling, a bias is a systematic error introducedby selecting items nonrandomly from a population that is as-sumed to be random. A survey question may result in biasedresponses if it is poorly phrased (see also unbiased).

Binomial Distribution: The distribution of the number of successes inn trials, when the probability of success (and failure) remainsconstant from trial to trial and the trials are independent. Thisdiscrete distribution is also known as Bernoulli distribution orprocess, and is given by

,

where x is the number of “successes” out of n trials, p is theprobability of success, and ! represents the factorial operator, suchthat 3! = 3 2 1 (see also negative binomial).

Bivariate distribution: A distribution resulting from two random variables, for example, the joint distribution of vehicle model year (MY) and annual mileage (AM) driven. If MY and AM are independent variables, then the bivariate distribution is approximately uniform across cells. Because older vehicles are driven less per year on average than newer vehicles, these two variables are dependent; that is, annual mileage is dependent on model year. A contingency table, or cross-classification analysis, is useful for testing statistical independence among two or more variables.

C

Categorical variable: Also called a nominal scale variable, a variable that has no particular ordering, and intervals between these variables are without meaning. A categorical variable is a discrete variable. Examples of categorical variables include vehicle manufacturer and gender. For a categorical variable to be defined appropriately, it must consist of mutually exclusive and collectively exhaustive categories.

Causation or causality: When event A causes event B, there exists a material, plausible, underlying reason or explanation relating event B with event A. Causality is not proved with statistics — statistics and statistical models instead provide probabilistic evidence supporting or refuting a causal relation. The statistical evidence is a necessary but not sufficient condition to demonstrate causality. Causality is more defensible in light of statistical evidence obtained from designed experiments, whereas data obtained from observational and quasi-experiments too often yield relationships where causation cannot be discerned from association (see also post hoc theorizing and correlation).

Censored distribution: A statistical distribution that occurs when a response above or below a certain threshold value is fixed at that threshold. For example, for a stadium that holds 50,000 seats, the number of tickets sold cannot exceed 50,000, even though there might be a demand for more than 50,000 seats.

Central limit theorem: If the mean X̄ is calculated on a sample drawn from a distribution with mean μ and known finite variance σ², then the sampling distribution of the test statistic Z* is approximately standard normal distributed, regardless of the characteristics of the parent distribution (i.e., normal, Poisson, binomial, etc.). Thus, a random variable Z* computed as follows is approximately standard normal (mean = 0, variance = 1) distributed:

\[
Z^* = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}.
\]

The distribution of the random variable Z* approaches that of the standard normal distribution as n approaches infinity.

Central tendency: The tendency of quantitative data to cluster around some variate value.

Chebyshev's inequality: A useful theorem, which states that for a probability distribution with mean μ and standard deviation σ, the probability that an observation drawn from the distribution differs from μ by more than kσ is less than 1/k², or, stated mathematically,

P(|x − μ| > kσ) < 1/k².
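
The bound holds for any distribution and can be checked empirically; the heavy-tailed parent distribution and the value of k below are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_t(df=3, size=100_000)   # a heavy-tailed distribution
    k = 2.0

    share = np.mean(np.abs(x - x.mean()) > k * x.std())
    print(share, "<=", 1 / k**2)             # empirical share vs. Chebyshev bound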

Chi-square distribution: A distribution that is of great importance for inferences concerning population variances or standard deviations, for comparing statistical distributions, and for comparing estimated statistical models. It arises in connection with the sampling distribution of the sample variance for random samples from normal populations. The chi-squared density or distribution function is given by

f(x) = x^(n/2 − 1) EXP(−x/2) / [2^(n/2) Γ(n/2)]; x ≥ 0,

where n/2 and 2 are shape and scale parameters, respectively, and Γ(z) is the gamma function. The chi-square distribution is a special case of the gamma distribution (with n/2 and 2 for shape and scale parameters).

Class: Observations grouped according to convenient divisions of the variate range, usually to simplify subsequent analysis (the upper and lower limits of a class are called class boundaries; the interval between them, the class interval; and the frequency falling into the class, the class frequency).

Classical statistical philosophy: See frequentist and Bayesian statistical philosophy.

Cluster sampling: A type of sampling whereby observations are selected at random from several clusters instead of at random from the entire population. It is intended that the heterogeneity in the phenomenon of interest is reflected within the clusters; i.e., members in the clusters are not homogeneous with respect to the response variable. Cluster sampling is less satisfactory from a statistical standpoint but often is more economical and/or practical.

Coefficient: See parameter.


Coefficient of determination: Employed in ordinary least squares regression and denoted R², the proportion of total variance in the data taken up or "explained" by the independent variables in the regression model. It is the ratio or proportion of explained variance to total variance and is bounded by 0 and 1 for models with intercept terms. Because adding explanatory variables to a model cannot reduce the value of R², adjusted R² is often used to compare models with different numbers of explanatory variables. The value of R² from a model cannot be evaluated as "good" or "bad" in isolation; it can only be judged relative to other models that have been estimated on similar phenomena. Thus an R² of 30% for some phenomenon may be extremely informative, whereas for another phenomenon it might be quite uninformative.
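
Computationally, R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with illustrative observed and predicted values:

    import numpy as np

    y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])        # observed responses (illustrative)
    y_hat = np.array([3.0, 4.2, 5.0, 6.3, 7.1])    # model predictions (illustrative)

    ss_res = np.sum((y - y_hat) ** 2)              # unexplained (residual) variation
    ss_tot = np.sum((y - y.mean()) ** 2)           # total variation about the mean
    r_squared = 1 - ss_res / ss_tot
    print(r_squared)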

Collectively exhaustive: When a categorical variable is collectively exhaustive, it represents all possible categories into which a random variable may fall.

Compensating variation: Used in utility theory, the amount of money that would have to be paid to (or taken away from) individuals to render them as well off after a change in a specific variable as they were before a change in that variable.

Conditional probability: The long-run likelihood that an event will occur given that a specific condition has already occurred, e.g., the probability that it will rain today, given that it rained yesterday. The standard notation for conditional probability is P(A|B), which corresponds to the probability that event A occurs given that event B has already occurred. It can also be shown that

P(A|B) = P(AB) / P(B),

where P(AB) is the probability that both event A and event B occur.

Confidence coefficient: The measure of probability (see alpha error) associated with a confidence interval that the interval will include the true population parameter of interest.

Confidence interval or region: A calculated range of values known to contain the true parameter of interest over the average of repeated trials with specific certainty (probability). The correct interpretation of a confidence interval is as follows: If the analyst were to draw samples repeatedly at the same levels of the independent variables and compute the test statistic (mean, regression slope, etc.), then the true population parameter would lie in the (1 − α) confidence interval (1 − α) × 100 times out of 100.
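
A sketch of a (1 − α) interval for a mean, using a normal critical value from scipy.stats (assumed available; for a sample this small a t critical value would be more defensible). All numbers are illustrative:

    import numpy as np
    from scipy.stats import norm

    x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3])   # illustrative sample
    alpha = 0.05

    x_bar, se = x.mean(), x.std(ddof=1) / np.sqrt(len(x))
    z_crit = norm.ppf(1 - alpha / 2)                    # two-sided critical value
    print(x_bar - z_crit * se, x_bar + z_crit * se)     # 95% confidence interval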

Confounded variables: Generally a variable in a statistical model that is correlated with a variable that is not included in a model; sometimes called an omitted variable problem. When variables in a model are correlated with variables excluded from a model, then the estimates of parameters in the model are biased. The direction of bias depends on the correlation between the confounded variables. In addition, the estimate of model error is also biased.

Consistency: An estimate of a population parameter, such as the population mean or variance, obtained from a sample of observations is said to be consistent if the estimate approaches the value of the true population parameter as the sample size approaches infinity. Stated more formally,

P(|θ̂ − θ| < ε) → 1 as n → ∞, for all ε > 0,

where θ̂ are estimated parameters, θ are true population parameters, n is sample size, and ε is a positive small difference between estimated and true population parameters.

Contemporaneous correlation: Correlation among disturbance terms of different statistical model equations within a system of equations. This correlation is typically caused by shared unobserved effects of variates appearing in multiple equations.

Contingency coefficient: A measure of the strength of the association between two variables, usually qualitative, on the basis of data tallied into a contingency table. The statistic is never negative and has a maximum value less than 1, depending on the number of rows and columns in the contingency table.

Contingency table: Cross-classification or contingency table analysis is a statistical technique that relies on properties of the multinomial distribution, statistical independence, and the chi-square distribution to determine the strength of association (or lack thereof) between two or more factors or variables.

Continuous variable: A variable that is measured either on the interval or ratio scale. A continuous variable can theoretically take on an infinite number of values within an interval. Examples of continuous variables include measurements in distance, time, and mass. A special case is a data set consisting of counts, which are non-negative integer values (see also discrete, nominal, and ordinal data).

Control group: A comparison group of experimental units that do not receive a treatment and are used to provide a comparison to the experimental group with respect to the phenomenon of interest.

Correlation: The relationship (association or dependence) between two or more qualitative or quantitative variables. When two variables are correlated they are said to be statistically dependent; when they are uncorrelated they are said to be statistically independent. For continuous variables Pearson's product moment correlation coefficient is often used, whereas for rank or ordinal data Kendall's coefficient of rank correlation is used.

Correlation coefficient: A measure of the interdependence between two variates or variables. For interval or ratio scale variables it is a fraction, which lies between −1 for perfect negative correlation and +1 for perfect positive correlation.

Correlation matrix: For a set of variables X1, …, Xn, with the correlation between Xi and Xj denoted by rij, this is a square symmetric matrix with values rij.

Count data: Ratio-scale data that are non-negative integers.

Covariance: The expected value of the product of the deviations of two random variables from their respective means; i.e., E[(X1 − μ1)(X2 − μ2)], estimated in a sample as Σ(i=1 to n) (X1i − μ1)(X2i − μ2)/n. Correlation and covariance are related statistics in that the correlation is the standardized form of the covariance; that is, covariance is a measure of the association in original units, whereas the correlation is the measure of association in standardized units.
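
A sketch of this relationship, with illustrative paired data (dividing by n, as in the expression above):

    import numpy as np

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # e.g., vehicle age (illustrative)
    x2 = np.array([15.0, 13.0, 12.5, 11.0, 9.0])  # e.g., annual mileage, 1000s (illustrative)

    cov = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))  # covariance in original units
    corr = cov / (x1.std() * x2.std())                  # standardized form: correlation
    print(cov, corr)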

Covariates: Term often used to refer to independent variables in the model; the variables on the right-hand side of a statistical model.

Cross-sectional data: Data collected on some variable, at the same point or during a particular period of time, from different geographical regions, organizations, households, etc. (see also panel data, time-series data, and before–after data).

Cumulative distribution function: A function related to a distribution's density function that is written as F(x) = P[X ≤ x], where P denotes probability, X is a random variable, and x is some specified value of the random variable. The density function corresponding to the distribution function is the first derivative of the cumulative distribution with respect to x, f(x) = dF(x)/dx.

Cumulative frequency: The frequency with which an observed variable takes on a value equal to or less than a specified value. Cumulative frequencies are often depicted in bar charts or histograms, known as cumulative frequency distributions.

D

Data: Information or measurements obtained from a survey, experiment, investigation, or observational study. Data are stored in a database, usually in electronic form. Data are discrete (measured on the nominal or ordinal scale) or continuous (measured on the interval or ratio scale).

Data mining: Process undertaken with the intent to uncover previously undiscovered relationships in data. Unfortunately, data mining leads to a higher likelihood of illusory correlation, omitted variable bias, and post hoc theorizing, all of which threaten high-quality research and scientific investigations. Data mining should be undertaken with great caution, and conclusions drawn from it should be made with sufficient caveats and pessimism. The logical role of data mining in research is for generating research ideas that deserve follow-on detailed research investigations.

Deduction: A type of logic or thinking process used in statistical hypothesis testing. Deductive logic stipulates that if event A causes or leads to event B, and event B is not observed to occur, then event A also did not occur. In probabilistic terms, if the null hypothesis is true (A), then the sample data should be observed with high probability (B). If, instead, the sample data are observed with low probability, then the null hypothesis is rejected as implausible (see also induction).

Degrees of freedom: The number of free variables in a set of observations used to estimate statistical parameters. For example, the estimation of the population standard deviation computed on a sample of observations requires an estimate of the population mean, which consumes 1 degree of freedom to estimate; thus the sample standard deviation has n − 1 degrees of freedom remaining. The error around a linear regression function has n − P degrees of freedom, as P degrees of freedom have been used to estimate the parameters in the regression model.

Density function: Often used interchangeably with probability density function, the function that defines a probability distribution in that integration of this function gives probability values.

Dependent variables: If a function is given by Y = f(X1, …, Xn), it is customary to refer to X1, …, Xn as independent or explanatory variables, and Y as the dependent or response variable. The majority of statistical investigations in transportation aim to predict or explain values (or expected values) of dependent variables given known or observed values of independent variables.

Descriptive statistics: Statistics used to display, describe, graph, or depict data. Descriptive statistics do not generally include modeling of data, but are used extensively to check assumptions of statistical models.

Deterministic model or process: This model, as opposed to a stochastic model, is one that contains effectively no or negligible random elements and for which, therefore, the future course of the system is completely determined by its position, velocities, etc. An example of a deterministic model is given by force = mass × acceleration.

Discrepancy function: Used primarily in structural equation models, this function measures the differences between the observed variance–covariance matrix and the variance–covariance matrix implied by a postulated structural equation model. All else being equal, a small value of the discrepancy function is preferred to a larger value, since larger values indicate a greater lack of fit to the data (see also structural equations and goodness-of-fit statistics).

Discrete/continuous models: A class of models that has a system of both discrete and continuous interrelated dependent variables.

Discrete variable: A variable (or outcome) that is measured on the nominal or ordinal scale. Examples of discrete variables include mode of travel (with discrete possibilities being car, bus, train) and accident severity (with discrete possibilities being property damage only, injury, fatality).

Dispersion: The degree of scatter or concentration of observations around the center or middle. Dispersion is usually measured as a deviation about some central value such as the mean, standard or absolute deviation, or by an order statistic such as deciles, quintiles, and quartiles.

Distribution: The set of frequencies or probabilities assigned to various outcomes of a particular event or trial. Densities (derived from continuous data) and distributions (derived from discrete data) are often used interchangeably.

Distribution function: See cumulative distribution function.

Disturbances: Also referred to as errors, residuals, and model noise, these are random variables added to a statistical model equation to account for unobserved effects. The choice of their statistical distribution typically plays a key role in parametric model derivation, estimation, and inference.

Double sampling: The process by which information from a random sample is used to direct the gathering of additional data, often targeted at oversampling underrepresented components of the population.

Dummy variables: See indicator variables.

E

Econometrics: The development and application of statistical and/or mathematical principles and techniques for solving economic problems.


Efficiency: A statistical estimator or estimate is said to be efficient if it has small variance. In most cases, a statistical estimate is preferred if it is more efficient than alternative estimates. It can be shown that the Cramer–Rao bound represents the best possible efficiency (lowest variance) for an unbiased estimator. That is, if an unbiased estimator is shown to attain the Cramer–Rao bound, then there are no other unbiased estimators that are more efficient. It is possible in some cases to find a more efficient estimate of a population parameter that is biased.

Elasticity: Commonly used to determine the relative importance of a variable in terms of its influence on a dependent variable. It is generally interpreted as the percent change in the dependent variable induced by a 1% change in the independent variable.

Empirical: Derived from experimentation or observation rather than underlying theory.

Endogenous variable: In a statistical model, a variable whose value is determined by influences or variables within the statistical model. An assumption of statistical modeling is that explanatory variables are exogenous. When explanatory variables are endogenous, parameter estimation problems arise (see also exogenous variable, instrumental variables, simultaneous equations, and structural equations).

Enriched sampling: Refers to merging a random sample with a sample from a targeted population. For example, a random sample of commuters' mode choices may be merged with a sample of commuters observed taking the bus.

Error: In statistics, the difference between an observed value and its "expected" value as predicted or explained by a model. In addition, errors occur in data collection, sometimes resulting in outlying observations. Finally, Type I and Type II errors refer to specific interpretive errors made when analyzing the results of hypothesis tests.

Error mean square: In analysis of variance and regression, the residual or error sum of squares divided by the degrees of freedom. Often called the mean square error (MSE), the error mean square provides an estimate of the residual or error variance of the population from which the sample was drawn.

Error of observation: An error arising from imperfections in the method of observing a quantity, whether due to instrumental or to human factors.

Error rate: In hypothesis testing, the unconditional probability of making an error; that is, erroneously accepting or rejecting a statistical hypothesis. Note that the probabilities of Type I and Type II errors, α and β, are conditional probabilities. The first is subject to the condition that the null hypothesis is true, and the second is subject to the condition that the null hypothesis is false.

Error term: See disturbance.

Error variance: The variance of the random or unexplainable component of a model; the term is used mainly in the presence of other sources of variation, as, for example, in regression analysis or in analysis of variance.

Exogenous variable: In a statistical model, a term referring to a variable whose value is determined by influences outside of the statistical model. An assumption of statistical modeling is that explanatory variables are exogenous. When explanatory variables are endogenous, parameter estimation problems arise (see also endogenous variable, instrumental variables, simultaneous equations, and structural equations).

Expectation: The expected or mean value of a random variable, or a function of that variable such as the mean or variance.

Experiment: A set of measurements carried out under specific and controlled conditions to discover, verify, or illustrate a theory, hypothesis, or relationship. Experiment is the cornerstone of statistical theory and is the only accepted method for suggesting causal relationships between variables. Experimental hypotheses are not proved using statistics; however, they can be disproved. Elements of an experiment generally include a control group, randomization, and repeat observations.

Experimental data: Data obtained by conducting experiments under controlled conditions. Quasi-experimental data are obtained when some factors are controlled as in an experiment, but some factors are not. Observational data are obtained when no exogenous factors other than the treatment are controlled or manipulated. Analysis and modeling based on quasi-experimental and observational data are subject to illusory correlation and confounding of variables.

Experimental design: Plan or blueprint for conducting an experiment.

Experimental error: Any error in an experiment, whether due to stochastic variation or bias (not including mistakes in design or avoidable imperfections in technique).

Exponential: A constant raised to the power x. The function F(x) = a^x is an exponential function.

Exponential distribution: A continuous distribution that is typically used to model life cycles, growth, survival, hazard rate, or decay of materials or events. Its probability density function is given by

f(x) = λ EXP(−λx),

where λ is the single parameter of the exponential distribution.
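
A sketch evaluating the density and confirming by simulation that the sample mean approaches 1/λ; the rate below is illustrative:

    import numpy as np

    lam = 0.5                                  # rate parameter (illustrative)
    x = np.linspace(0, 10, 5)

    pdf = lam * np.exp(-lam * x)               # f(x) = lambda * EXP(-lambda * x)
    samples = np.random.default_rng(1).exponential(scale=1 / lam, size=1000)
    print(pdf, samples.mean())                 # sample mean approaches 1/lambda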


Exponential smoothing: Time-series regression in which recent observations are given more weight by way of exponentially decaying regression parameters.

F

F distribution: A distribution of fundamental importance in analysis of variance and regression. It arises naturally as a ratio of estimated variances and sums of squares. The F density function is given as

f(x) = {Γ[(n + m)/2] (n/m)^(n/2) x^(n/2 − 1)} / {Γ(n/2) Γ(m/2) [1 + (n/m)x]^((n + m)/2)},

where Γ(z) is the gamma function, and n and m are the numerator and denominator degrees of freedom for the F distribution, respectively (see gamma distribution for gamma function).

F ratio: The ratio of two independent unbiased estimates of the variance of a normal distribution, which has widespread application in the analysis of variance.

F test: A computed statistic that under an appropriate null hypothesis has an approximate F distribution. It is used routinely to test full and restricted statistical models.

Factor analysis: An analytical technique used to reduce the number of P variables to a smaller set of parsimonious K < P variables, so that the covariance among many variables is described in terms of a few unobserved factors. An important difference between principal components and factor analysis is that factor analysis is based on a specific statistical model, whereas principal components analysis is not (see also principal components analysis and structural equations).

Factorial experiment: An experiment designed to examine the effect of one or more factors, with each factor applied to produce orthogonal or independent effects on the response.

Fixed effects model: A model appropriate when unobserved heterogeneity of sampled units is thought to represent the effects of a particular sampled unit (see random effects model).


Full factorial experiment: An experiment investigating all the possible treatment combinations that may be formed from the factors under investigation.

Full information maximum likelihood (FIML): A likelihood function written such that all possible information and restrictions related to model estimation are included. For example, in simultaneous equations estimation, the FIML approach accounts for overidentification of model parameters and cross-equation correlation of disturbances.

Frequency: The number of occurrences of a given type of event.

Frequentist statistical philosophy: Also called classical statistical philosophy, this philosophy represents the majority of currently applied statistical techniques in practice, although Bayesian philosophy is gaining increased acceptance. In simple terms, classical statisticians assert that probabilities are obtained through long-run repeated observations of events. Frequentist statistical philosophy results in hypothesis tests that provide an estimate of the probability of observing the sample data conditional on a true null hypothesis, whereas Bayesian statistical philosophy results in an estimate of the probability of the null hypothesis being true conditional on the observed sample data (see also Bayesian statistical philosophy).

G

Gamma distribution: A distribution that includes as special cases the chi-square distribution and the exponential distribution. It has many important applications; in Bayesian inference, for example, it is sometimes used as the a priori distribution for the parameter (mean) of a Poisson distribution. It is also used to describe unobserved heterogeneity in some models. The gamma density is given by

f(x) = x^(α − 1) EXP(−x/β) / [β^α Γ(α)]; x > 0,

where α is the shape parameter, β is a scale parameter, and Γ(z) is the gamma function, which is given by the formula

Γ(z) = ∫₀^∞ t^(z − 1) EXP(−t) dt.

When the location parameter is equal to 0 and the scale parameter is equal to 1, the gamma distribution reduces to the standard gamma distribution, with mean α and standard deviation √α.

Gantt chart: A bar chart showing actual performance or output expressed as a percentage of a quota or planned performance per unit of time.

Gaussian distribution: Another name for the normal distribution.

Generalized extreme value (GEV): A generalized distribution used to derive a family of models that includes the multinomial and nested logit models.

Generalized least squares (GLS): An estimation procedure that generalizes least squares estimation by accounting for possible heterogeneity and correlation of disturbance terms.

Goodness-of-fit (GOF) statistics: A class of statistics used to assess the fit of a model to observed data. There are numerous GOF measures, including the coefficient of determination R², the F test, the chi-square test for frequency data, and numerous other measures. GOF statistics may be used to measure the fit of a statistical model to estimation data, or to data used for validation. GOF measures are not statistical tests; statistical tests such as the F-ratio and likelihood-ratio tests are used to compare models.

Gumbel distribution: The density for a continuous variable that approximates many decay, survival, and growth processes, as well as provides appealing qualities in analytical problems. The Gumbel distribution is given by

F(x) = EXP[−EXP(−η(x − ω))],

where η is a positive scale parameter, ω is a location parameter (mode), and the mean is ω + 0.5772/η.

H

Hazard function: Function used to describe the conditional probability that an event will occur between time t and t + dt, given that the event has not occurred up to time t. It is written as h(t) = f(t)/[1 − F(t)], where F(t) is the cumulative distribution function and f(t) is the density function. In words, the hazard function h(t) gives the rate at which event durations are ending at time t (such as the duration in an accident-free state that would end with the occurrence of an accident), given that the event duration has not ended up to time t.
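
For the exponential distribution this ratio reduces to the constant λ, which makes it a convenient check of the definition; the rate and evaluation times below are illustrative:

    import numpy as np

    lam, t = 0.3, np.linspace(0.1, 5, 5)   # illustrative rate and times

    f = lam * np.exp(-lam * t)             # density f(t)
    F = 1 - np.exp(-lam * t)               # cumulative distribution F(t)
    h = f / (1 - F)                        # hazard: rate of durations ending at t
    print(h)                               # constant at lambda for the exponential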


Heterogeneity: A term used to describe samples or individuals from different populations, which differ with respect to the phenomenon of interest. If the populations are not identical, they are said to be heterogeneous, and by extension the sample data are also said to be heterogeneous (see also unobserved heterogeneity).

Heteroscedasticity: In regression analysis, the property that the conditional distributions of the response variable Y for fixed values of the independent variables do not all have constant variance. Nonconstant variance in a regression model results in inflated estimates of model mean square error. Standard remedies include transformations of the response and/or employing a generalized linear model.

Histogram: A univariate frequency diagram in which rectangles proportional in area to the class frequencies are erected on sections of the horizontal axis, the width of each section representing the corresponding class interval of the variate.

Holt–Winters smoothing: A seasonal time-series modeling approach in which the original series is decomposed into its level, trend, and seasonal components, with each of the components modeled with exponentially smoothed regression.

Homogeneity: Term used in statistics to describe samples or individuals from populations that are similar with respect to the phenomenon of interest. If the populations are similar, they are said to be homogeneous, and by extension the sample data are also said to be homogeneous.

Homoscedasticity: In regression analysis, the property that the conditional distributions of Y for fixed values of the independent variable all have the same variance (see also heteroscedasticity).

Hypothesis: In statistics, a statement concerning the value of parameters or form of a probability distribution for a designated population or populations. More generally, a statistical hypothesis is a formal statement about the underlying mechanisms that have generated some observed data.

Hypothesis testing: Term used to refer to testing whether or not observed data support a stated position or hypothesis. Support of a research hypothesis suggests that the data would have been unlikely if the hypothesis were indeed false (see also Type I and Type II errors).

I

Identification problem: A problem encountered in the estimation of statistical models (in particular simultaneous and structural equations models) that creates difficulty in uncovering underlying equation parameters from a reduced form model.

Illusory correlation: Also called spurious correlation, an omitted variable problem, similar to confounding. Illusory correlation is used to describe the situation where Y and X1 are correlated, yet the relation is illusory, because X1 is actually correlated with X2, which is the true "cause" of the changes in Y.

Independence of irrelevant alternatives (IIA): A property of multinomial logit models that is an outgrowth of the model derivation, which assumes disturbance terms are independent. If the outcome disturbances are correlated, the model will give erroneous outcome probabilities when forecasting.

Independent events: In probability theory, two events are said to be statistically independent if, and only if, the probability that they will both occur equals the product of the probabilities that each one, individually, will occur. Independent events are not correlated, whereas dependent events are (see also dependence and correlation).

Independent variables: Two variables are said to be statistically independent if, and only if, their joint probability equals the product of their individual probabilities. Independent variables are not correlated, whereas dependent variables are (see also dependence and correlation).

Indicator variables: Variables used to quantify the effect of a qualitative or discrete variable in a statistical model. Also called dummy variables, indicator variables typically take on values of zero or one. Indicator variables are coded from ordinal or nominal variables. For a nominal variable with n levels, n − 1 indicator variables are coded for use in a statistical model. For example, for the nominal variable Vehicle Type (truck, van, or auto), the analyst would code X1 = 1 for truck, 0 otherwise, and X2 = 1 for van, 0 otherwise. When both X1 and X2 are coded as zero, the observation corresponds to an auto.
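
A sketch of this coding scheme (variable names illustrative):

    vehicle_types = ["truck", "van", "auto", "truck", "auto"]  # nominal variable

    # n = 3 levels -> n - 1 = 2 indicator variables; "auto" is the base category
    rows = [(1 if v == "truck" else 0,   # X1: truck indicator
             1 if v == "van" else 0)     # X2: van indicator
            for v in vehicle_types]
    print(rows)   # (0, 0) identifies an auto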

Indirect least squares: A method used in simultaneous equations estimation that applies ordinary least squares to the reduced form model.

Indirect utility: Utility that has prices and incomes as arguments. It is based on optimal consumption and is tied to underlying demand equations by Roy's identity.

Induction: The type of logic or thinking process where inferences are drawn about an entire class or group based on observations on a few of its members (see also deduction).


Inference: The process of inferring qualities, characteristics, or relationships observed in the sample onto the population. Statistical inference relies on both deductive and inductive logic systems.

Information criteria: Model performance measures that balance decrease in model error with increase in model complexity. In general, information criteria are based on the Gaussian likelihood of the model estimates with a penalty for the number of model parameters.

Innovation series: The zero-mean uncorrelated stochastic component of a time series of observations that remains after all deterministic and correlated elements have been appropriately modeled. The innovations are also referred to as the series noise.

Instrumental variables: A common approach used in model estimation to handle endogenous variables, that is, variables that are correlated with the disturbance term, causing a violation in model assumptions. The ideal instrument is a variable that is highly correlated with the endogenous variable it replaces but is not correlated with the disturbance term.

Interaction: Two variables X1 and X2 are said to interact if the value of X1 influences the value of X2 positively or negatively. An interaction is a synergy between two or more variables and reflects the fact that their combined effect on a response depends not only on the levels of the individual variables, but on their combined levels as well.

Interval estimate: The estimation of a population parameter by specifying a range of values bounded by an upper and a lower limit, within which the true value is asserted to lie.

Interval scale: A measurement scale that has equal differences between pairs of points anywhere on the scale, but the zero point is arbitrary; thus, ratio comparisons cannot be made. An example interval scale is temperature in degrees Celsius. Each interval on the scale represents a single degree; however, 50°C is not twice as hot as 25°C (see also nominal, ordinal, and ratio scales).

J

Joint probability: The joint density function of two random variables, or bivariate density.


K

Kendall's coefficient of rank correlation: Denoted as τij, where i and j refer to two variables, a coefficient that reflects the degree of linear association between two ordinal variables, and is bounded between +1 for perfect positive correlation and −1 for perfect negative correlation. The formula for τij is given by

τij = S / [n(n − 1)/2],

where S is the sum of scores (see a nonparametric statistics reference for computing scores) and n is sample size.
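
In practice the scores are rarely tallied by hand; scipy.stats (assumed available) computes the coefficient directly. Ranks below are illustrative:

    from scipy.stats import kendalltau

    x = [1, 2, 3, 4, 5]          # ordinal ranks (illustrative)
    y = [2, 1, 4, 3, 5]

    tau, p_value = kendalltau(x, y)
    print(tau, p_value)          # tau near +1 indicates strong positive association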

L

Latent variable: A variable that is not directly observable. Examples include intelligence, education, and satisfaction. Typically, latent variables or constructs are measured inexactly with many variables. For example, intelligence may be partially captured with IQ score, GPA (grade point average) from college, and number of books read per year.

Least squares estimation: Also called ordinary least squares, a technique of estimating statistical parameters from sample data whereby parameters are determined by minimizing the squared differences between model predictions and observed values of the response. The method may be regarded as possessing an empirical justification in that the process of minimization gives an optimum fit of observation to theoretical models; in restricted cases, such as normal theory linear models, estimated parameters have optimum statistical properties of unbiasedness and efficiency.

Level of significance: The probability of rejecting a null hypothesis when it is in fact true. It is also known as α, or the probability of committing a Type I error.

Likelihood function: The probability or probability density of obtaining a given set of sample values, from a certain population, when this probability or probability density is regarded as a function of the parameter of the population and not as a function of the sample data (see also maximum likelihood method).

Likelihood ratio test: A test frequently used to test for restrictions in models estimated by maximum likelihood. The test is used for a wide variety of reasons, including the testing of the significance of individual parameters and the overall significance of the model. The likelihood ratio test statistic is −2[LL(βR) − LL(βU)], where LL(βR) is the log likelihood at convergence of the "restricted" model and LL(βU) is the log likelihood at convergence of the "unrestricted" model. This statistic is χ² distributed with the degrees of freedom equal to the difference in the numbers of parameters in the restricted and unrestricted models, that is, the difference in the number of parameters in the βR and βU parameter vectors.
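
Given the two converged log likelihoods, the test takes only a few lines; the log-likelihood values and parameter-count difference below are illustrative:

    from scipy.stats import chi2

    ll_restricted = -512.4    # LL(beta_R), illustrative
    ll_unrestricted = -505.1  # LL(beta_U), illustrative
    df = 3                    # difference in number of parameters

    stat = -2 * (ll_restricted - ll_unrestricted)   # chi-square distributed
    p_value = chi2.sf(stat, df)                     # survival function = 1 - CDF
    print(stat, p_value)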

Limited information maximum likelihood (LIML): A likelihood function written such that some possible information on the model is excluded. For example, in simultaneous equations estimation, the LIML approach does not account for possible cross-equation correlation of disturbances.

Linear correlation: A somewhat ambiguous expression used to denote either Pearson's product moment correlation, in cases where the corresponding variables are continuous, or a correlation coefficient on ordinal data such as Kendall's rank correlation coefficient. There are other linear correlation coefficients in addition to the two listed here.

Linear model: A mathematical model in which the equations relating the random variables and parameters are linear in the parameters. Although the functional form of the model Y = β0 + β1X1 + β2X1² includes the nonlinear term β2X1², the model itself is linear, because the statistical parameters are linearly related to the random variables.

Logistic distribution: A distribution used for various growth models and in logistic regression models. The general logistic distribution is given by

f(x) = EXP[−(x − α)/β] / {β [1 + EXP(−(x − α)/β)]²},

where α is the location parameter and β is the scale parameter. The mean of the general logistic distribution is α, and the variance is β²π²/3.


Logit model: A discrete outcome model derived, in multinomial form (three or more outcomes), by assuming the disturbances are Weibull distributed (Gumbel extreme value type I).

Log-logistic distribution: A distribution used in a number of applications, including duration modeling. In duration modeling the log-logistic distribution allows for nonmonotonic hazard functions and is often used as an approximation of the more computationally cumbersome lognormal distribution. The log-logistic with parameters λ > 0 and P > 0 has the density function

f(x) = λP(λx)^(P − 1) / [1 + (λx)^P]².

Lognormal distribution: A distribution where random variable X has the lognormal distribution if LN(X) is normally distributed with mean μ and variance σ². The density for the lognormal distribution is given by

f(x) = (2πσ²x²)^(−1/2) EXP[−(LN(x) − μ)²/(2σ²)]; x > 0,

where μ and σ are location and scale parameters of the lognormal distribution. The mean and variance of the lognormal distribution are EXP(μ + σ²/2) and EXP[2(μ + σ²)] − EXP(2μ + σ²), respectively.

M

Marginal effect: An economic term used to measure the effect that a unit change in an independent variable has on the response variable of interest. Marginal effects are indicators of how influential or important a variable is in a particular data-generating process (see also significance).

Marginal rate of substitution (MRS): An economic term used to measure the trade-off consumers are willing to make between attributes of a good. For example, if one were to use a logit model in the analysis of discrete outcome data, the MRS is determined from the ratio of parameter estimates.

Maximum likelihood method: A method of parameter estimation in which a parameter is estimated by the value that maximizes the likelihood function. In other words, the maximum likelihood estimator is the value of the parameter that maximizes the probability of the observed sample. The method can also be used for the simultaneous estimation of several parameters, such as regression parameters. Estimates obtained using this method are called maximum likelihood estimates.
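
A minimal sketch: estimating a Poisson mean λ by numerically maximizing the log likelihood. The closed-form maximum likelihood estimate is the sample mean, which makes the numerical answer easy to verify; the counts are illustrative:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from math import lgamma

    counts = np.array([2, 0, 3, 1, 2, 4, 1])   # illustrative count data

    def neg_log_likelihood(lam):
        # Poisson log likelihood: sum over observations of x*ln(lam) - lam - ln(x!)
        return -np.sum(counts * np.log(lam) - lam
                       - np.array([lgamma(x + 1) for x in counts]))

    res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
    print(res.x, counts.mean())   # the numerical MLE equals the sample mean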

Mean: That value of a variate such that the sum of deviations from it is zero; thus, it is the sum of a set of values divided by their number.

Mean square error: For unbiased estimators, an estimate of the population variance, usually denoted MSE. For biased estimators, the mean squared deviation of an estimator from the true value is equal to the variance plus the squared bias. The square root of the mean square error is referred to as the root mean square error.

Median: The middle-most number in an ordered series of numbers. It is a measure of central tendency that is often more robust than the sample mean; that is, the median is less sensitive to outliers.

Mixed multinomial logit models: A class of multinomial logit models that has a mixing distribution introduced to account for random variations in parameters across the population.

Mode: The most common or most probable value observed in a set of observations or sample.

Model: A formal expression of a theory or causal or associative relationship between variables, which is regarded by the analyst as having generated the observed data. A statistical model is always a simplified expression of a more complex process; thus, the analyst should anticipate some degree of approximation a priori. A statistical model that can explain the greatest amount of underlying complexity with the simplest model form is preferred to a more complex model.

Moving average (MA) processes: Stationary time series that are characterized by a linear relationship between observations and past innovations. The order of the process q defines the number of past innovations on which the current observation depends.

Multicollinearity: A term that describes the state when two (or more) independent variables in a statistical model are correlated with each other. Multicollinearity causes problems with the efficiency of parameter estimates. It also raises some philosophical issues, because it becomes difficult to determine which variables (both, either, or neither) are causal and which are the result of illusory correlation.

Multinomial logit model (MNL): A multinomial (three or more outcome) discrete model derived by assuming the disturbances are Weibull distributed (Gumbel extreme value type I).


Multiple linear regression: A linear regression involving two or more independent variables. Simple linear regression, which is merely used to illustrate the basic properties of regression models, contains one explanatory variable and is rarely if ever used in practice.

Mutually exclusive events: In probability theory, two events are said to be mutually exclusive if and only if they are represented by disjoint subsets of the sample space, namely, by subsets that have no elements or events in common. By definition, the probability of mutually exclusive events A and B both occurring is zero.

N

Negative binomial distribution: A discrete distribution characterized by the number of trials required to obtain a fixed number of successes in a binomial process. It also arises as a combination of gamma distributed heterogeneity of Poisson means. The probability distribution of the negative binomial is given as

P(Y = n) = C(n − 1, k − 1) p^k (1 − p)^(n − k), for n = k, k + 1, k + 2, ...,

where k is the number of successes, p is the probability of success, and C(a, b) is the number of combinations of b objects taken from a objects, defined as

C(a, b) = a! / [(a − b)! b!].

The mean and variance of the negative binomial distribution are k/p and k(1 − p)/p², respectively. The negative binomial probability distribution provides the probability that the k-th success occurs on trial n of a sequence of Bernoulli trials (see also binomial and Bernoulli).
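
Evaluating the formula directly (k and p below are illustrative):

    from math import comb

    def neg_binomial_pmf(n, k, p):
        # P(Y = n): probability the k-th success occurs on trial n
        return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

    k, p = 3, 0.4                       # illustrative success count and probability
    print([neg_binomial_pmf(n, k, p) for n in range(k, k + 5)])
    print(k / p, k * (1 - p) / p**2)    # mean and variance of trials to k successes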

Nested logit model: A modification of the multinomial logit model, a model that is derived from a generalized extreme value distribution and that can eliminate common logit-model problems related to the independence of irrelevant alternatives property. The approach of a nested logit model is to group alternative outcomes suspected of sharing unobserved effects into nests of outcome possibilities.

Noise: A convenient term for a series of random disturbances or deviations from the actual distribution. Statistical noise is a synonym for error term, disturbance, or random fluctuation.


Nominal scale: A variable measured on a nominal scale is the same as a categorical or discrete variable. The nominal scale lacks order and does not possess even intervals between levels of the variable. An example of a nominal scale variable is vehicle type, where levels of response include truck, van, and auto (see also ordinal, interval, and ratio scales of measurement).

Nonlinear relation: A relation where a scatter plot between two variables X1 and X2 will not produce a straight-line trend. In many cases a linear trend is observed between two variables by transforming the scale of one or both variables. For example, a scatter plot of LN(X1) and X2 might produce a linear trend. In this case, the variables are said to be nonlinearly related in their original scales, but linear in the transformed scale of X1.

Nonparametric statistics: Statistical methods that do not rely on statistical distributions with estimable parameters. In general, nonparametric statistical methods are best suited for ordinal scale variables, for dealing with small samples, and for cases when the assumptions of parametric methods are suspect (see also parametric statistics).

Nonrandom sample: A sample selected by a nonrandom method. For example, a scheme whereby units are self-selected would yield a nonrandom sample, where units that prefer to participate do so. Some aspects of nonrandom sampling can be overcome, however.

Normal distribution: A continuous distribution that was first studied in connection with errors of measurement and, thus, is referred to as the "normal curve of errors." The normal distribution forms the cornerstone of a substantial portion of statistical theory. Also called the Gaussian distribution, the normal distribution has the two parameters μ and σ; when μ = 0 and σ = 1, the normal distribution is transformed into the standard normal distribution. The normal distribution is characterized by its symmetric shape and bell-shaped appearance. The normal distribution probability density is given by

f(x) = (2πσ²)^(−1/2) EXP[−(x − μ)²/(2σ²)].
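
A sketch evaluating the density (parameter values illustrative):

    import numpy as np

    def normal_pdf(x, mu=0.0, sigma=1.0):
        # f(x) = EXP(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
        return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    x = np.array([-1.0, 0.0, 1.0])
    print(normal_pdf(x))                      # standard normal: peak 0.3989 at x = 0
    print(normal_pdf(x, mu=2.0, sigma=0.5))   # shifted and rescaled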

Null hypothesis: In general, a term relating to the particular research hypothesis tested, as distinct from the alternative hypothesis, which is accepted if the research hypothesis is rejected. Contrary to intuition, the null hypothesis is often a research hypothesis that the analyst would prefer to reject in favor of the alternative hypothesis, but this is not always the case. Erroneous rejection of the null hypothesis is known as a Type I error, whereas erroneous acceptance of the null hypothesis is known as a Type II error.

O

Observational data: Non-experimental data. Because there is no control of potential confounding variables in a study based on observational data, the support for conclusions based on observational data must be strongly supported by logic, underlying material explanations, identification of potential omitted variables and their expected biases, and caveats identifying the limitations of the study.

Omitted variable bias: Variables affecting the dependent variable that are omitted from a statistical model are problematic. Irrelevant omitted variables cause no bias in parameter estimates. Important variables that are uncorrelated with included variables will also cause no bias in parameter estimates, but the estimate of σ² is biased high. Omitted variables that are correlated with an included variable X1 will produce biased parameter estimates. The sign of the bias depends on the product of the covariance of the omitted variable and X1 and b1, the biased parameter. For example, if the covariance is negative and b1 is negative, then the parameter is biased positive. In addition, σ² is also biased.

One-tailed test: Also known as a one-sided test, a test of a statistical hypothesis in which the region of rejection consists of either the right-hand tail or the left-hand tail of the sampling distribution of the test statistic. Philosophically, a one-tailed test represents the analyst's a priori belief that a certain population parameter is either negative or positive.

Ordered probability models: A general class of models used to model discrete outcomes that are ordered (for example, from low to high). Ordered probit and ordered logit are the most common of these models. They are based on the assumption of normal and Weibull distributed disturbances, respectively.

Ordinal scale: A scale of measurement that occurs when a variable can take on ordered values, but there is not an even interval between levels of the variable. Examples of ordinal variables include measures of the desirability of different models of vehicles (Mercedes SLK, Ford Mustang, Toyota Camry), where the responses are highly desirable, desirable, and least desirable (see also nominal, interval, and ratio scales of measurement).


Ordinary differencing: Creating a transformed series by subtracting the immediately adjacent observations.

Ordinary least squares (OLS): See least squares estimation.

Orthogonal: A condition of two variables where the linear correlation coefficient between them is zero. In observational data, the correlation between two variables is almost always nonzero. Orthogonality is an extremely desirable property among independent variables in statistical models, and typically arises by careful design of data collection through experimentation or cleverly crafted stated preference surveys.

Outliers: Observations that are identified as such because they "appear" to lie outside a large number of apparently similar observations or experimental units according to a specified model. In some cases, outliers are traced to errors in data collecting, recording, or calculation and are corrected or appropriately discarded. However, outliers sometimes arise without a plausible explanation. In these cases, it is usually the analyst's omission of an important variable that differentiates the outlier from the remaining, otherwise similar observations, or a misspecification of the statistical model that fails to capture the correct underlying relationships. Outliers of this latter kind should not be discarded from the "other" data unless they are modeled separately, and their exclusion justified.

P

p-value: The smallest level of significance that leads to rejection of the null hypothesis.

Panel data: Data that are obtained from experimental units across various points in time, but not necessarily continuously over time like time-series data (see also before–after data, time-series data, and cross-sectional data).

Parameter: Sometimes used interchangeably with coefficient, an unknown quantity that varies over a certain set of inputs. In statistical modeling, it usually occurs in expressions defining frequency or probability distributions in terms of their relevant parameters (such as mean and variance of a normal distribution), or in statistical models describing the estimated effect of a variable or variables on a response. Of utmost importance is the notion that statistical parameters are merely estimates, computed from the sample data, which are meant to provide insight regarding what the true population parameter value is, although the true population parameter always remains unknown to the analyst.

Parametric statistics: Methods of statistical analysis that utilize parameterized statistical distributions. These methods typically require that a host of assumptions be satisfied. For example, a standard t test for comparing population means assumes that distributions are approximately normally distributed and have equal variances (see also nonparametric statistics).

Pearson's product moment correlation coefficient: Denoted as rij, where i and j refer to two variables, a coefficient that reflects the degree of linear association between two continuous (ratio or interval scale) variables and that is bounded between +1 for perfect positive correlation and −1 for perfect negative correlation. The formula for rij is given by

rij = sij / (si sj),

where sij is the covariance between variables i and j, and si and sj are the standard deviations of variables i and j, respectively.

Point estimate: A single estimated value of a parameter, or an estimate without a measure of sampling variability.

Poisson distribution: A discrete distribution that is often referred to as the distribution of rare events. It is typically used to describe the probability of occurrence of an event over time, space, or length. In general, the Poisson distribution is appropriate when the following conditions hold: the probability of "success" in any given trial is relatively small; the number of trials is large; and the trials are independent. The probability distribution function for the Poisson distribution is given as

P(X = x) = P(x; λ) = λ^x EXP(−λ) / x!, for x = 0, 1, 2, 3, ...,

where x is the number of occurrences per interval, and λ is the mean number of occurrences per interval.
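
Evaluating the formula directly (λ below is illustrative):

    from math import exp, factorial

    def poisson_pmf(x, lam):
        # P(X = x) = lam^x * EXP(-lam) / x!
        return lam**x * exp(-lam) / factorial(x)

    lam = 2.5                                   # mean occurrences per interval
    print([round(poisson_pmf(x, lam), 4) for x in range(6)])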

Population: In statistical usage, a term that is applied to any finite or infinite collection of individuals. It is important to distinguish between the population, for which statistical parameters are fixed and unknown at any given instant in time, and the sample of the population, from which estimates of the population parameters are computed. Population statistics are generally unknown because the analyst can rarely afford to measure all members of a population, so a random sample is drawn.

Post hoc theorizing: Theorizing that is likely to occur when the analyst attempts to explain analysis results after the fact (post hoc). In this less than ideal approach to scientific discovery, the analyst develops hypotheses to explain the data, instead of the reverse (collecting data to nullify a well-articulated hypothesis). The number of post hoc theories that can be developed to "fit" the data is limited only by the imagination of a group of scientists. With an abundance of competing hypotheses, and little forethought as to which hypothesis is afforded more credence, there is little in the way of statistical justification to prefer one hypothesis to another. Especially when data are observational, there is little evidence to eliminate the prospect of illusory correlation.

Power: The probability that a statistical test of some hypothesis rejects the null hypothesis when the null hypothesis is false. The power is greatest when the probability of a Type II error is least. Power is 1 − β, whereas level of confidence is 1 − α.

Precision: The degree of scatter to which repeated estimates agree with the true state of nature. Typically, precision is a measure of the variation of predictions around the true and often unknown average (see also accuracy, bias, efficiency, and consistency).

Prediction interval: A calculated range of values known to contain some future observation over the average of repeated trials with specific certainty (probability). The correct interpretation of a prediction interval is as follows: If the analyst were to repeatedly draw samples at the same levels of the independent variables and compute the test statistic (mean, regression slope, etc.), then a future observation will lie in the (1 − α) prediction interval (1 − α) × 100 times out of 100. The prediction interval differs from the confidence interval in that the confidence interval provides certainty bounds around a mean, whereas the prediction interval provides certainty bounds around an observation.

Principal components analysis: An analytical tool used to explain the variance–covariance structure of a relatively large multivariate data set using a few linear combinations of the originally measured variables. It is used to reveal structure in data and enable the identification of underlying dimensions in the data (see also factor analysis and structural equations).

Probability density functions: Synonymous with probability distributions (sometimes referred to simply as density functions); knowing the probability that a random variable takes on certain values, judgments are made regarding how likely or unlikely were the observed values. In general, observing an unlikely outcome tends to support the notion that chance was not acting alone. By posing alternative hypotheses to explain the generation of data, an analyst can conduct hypothesis tests to determine which of two competing hypotheses best supports the observed data.

Probit model: A discrete outcome model derived from assuming that the disturbances are multivariate normally distributed.

Proportional hazards model: An approach used in hazard-based analysis of duration data that assumes that the covariates act multiplicatively on some underlying (or base) hazard function.

R

Random effects model: When unobserved heterogeneity of sampledunits is thought to represent a representative random sample ofeffects from a larger population of interest, then these effects arethought to be random. In random effects models, interest is notcentered on the effects of sampled units.

Random error: A deviation of an observed from a true value, whichoccurs as though chosen at random from a probability distribu-tion of such errors.

Randomization: Process used in the design of experiments. When cer-tain factors cannot be controlled and omitted variable bias hasthe potential to occur, randomization is used to assign subjects totreatment and control groups randomly, such that any systemat-ically omitted variable bias is distributed evenly among the twogroups. Randomization should not be confused with randomsampling, which serves to provide a representative sample.

Random sampling: A sampling strategy whereby population members have an equal probability of being recruited into the sample. Often called simple random sampling, it provides the greatest assurance that the sample is representative of the population of interest.

Random selection: Synonymous with random sampling, a sample selected from a finite population is said to be random if every possible sample has equal probability of selection. This applies to sampling without replacement. A random sample with replacement is still considered random as long as the population is sufficiently large that the replaced experimental unit has small probability of being recruited into the sample again.

Random variable: A variable whose exact value is not known prior to measurement.

Range: The largest minus the smallest of a set of variate values.

Ratio scale: A variable measured on a ratio scale has order, possesses even intervals between levels of the variable, and has an absolute zero. An example of a ratio scale variable is height, where levels of response include 0.0 and 2000.0 inches (see also discrete, continuous, nominal, ordinal, and interval).

Raw data: Data that have not been subjected to any sort of mathematical manipulation or statistical treatment such as grouping, coding, censoring, or transformation.

Reduced form: An equation derived from the combination of two or more equations. Reduced form models are used in a number of statistical applications, including simultaneous equations modeling and the analysis of interrelated discrete/continuous data.

Regression: A statistical method for investigating the interdependence of variables.

Repeatability: Degree of agreement between successive runs of an experiment.

Replication: The execution of an experiment or survey more than once to increase the precision and to obtain a closer estimation of the sampling error.

Representative sample: A sample that is representative of a population (it is a moot point whether the sample is chosen at random or selected to be “typical” of certain characteristics; therefore, it is better to use the term for samples that turn out to be representative, however chosen, rather than apply it to a sample chosen with the objective of being representative). Random samples are, by definition, representative samples.

Reproducibility: An experiment or survey is said to be reproducible if, on repetition or replication under similar conditions, it gives the same results.

Residual: The difference between the observed value and the fitted value in a statistical model. Residual is synonymous with error, disturbance, and statistical noise.

Residual method: In time-series analysis, a classical method of estimating cyclical components by first eliminating the trend, seasonal variations, and irregular variations, thus leaving the cyclical relatives as residuals.

Robustness: A method of statistical inference is said to be robust if it remains relatively unaffected when some of its underlying assumptions are not met.

Roy’s identity: An identity relating an indirect utility function with its underlying demand function.
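
For reference, a standard statement of the identity (supplied here, not part of the original entry) is, in LaTeX, where V(p, m) is the indirect utility function, p the price vector, and m income:

    \[
      x_j(p, m) \;=\; -\,\frac{\partial V(p, m)/\partial p_j}{\partial V(p, m)/\partial m},
    \]

giving the Marshallian demand x_j for good j.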

S

Sample: A part or subset of a population, which is obtained through a recruitment or selection process, usually with the objective of obtaining an improved understanding of the parent population.

Sample size: The number of sampling units which have been sampled from the larger population of ultimate interest.

Sampling error: Also called sampling variability; that part of the difference between an unknown population value and its statistical estimate, which arises from chance fluctuations in data values as a natural consequence of random sampling.

Scatter diagram: Also known as a scatter plot, a graph showing two corresponding scores for an individual as a point; the result is a swarm of points.

Seasonal cycle length: The length of the characteristic recurrent pattern in seasonal time series, given in terms of the number of discrete observation intervals.

Seasonal differencing: Creating a transformed series by subtracting observations that are separated in time by one seasonal cycle.

Seasonality: The time-series characteristic defined by a recurrent pattern of constant length in terms of discrete observation intervals.

Seemingly unrelated equations: A system of equations that are not directly interrelated, but have correlated disturbances because of shared unobserved effects across disturbance terms.

Selectivity bias: The bias that results from using a sample that is populated by an outcome of some nonrandom process. For example, if a model were developed to study the accident rates of sports car owners, great care would have to be taken in interpreting the estimation results. This is because sports car owners are likely to be a self-selected sample of faster drivers willing to take more risks. Thus, in interpreting the results, it would be difficult to untangle the role that having a faster, high-performance car has on accident rates from the fact that riskier drivers are attracted to such cars.

Self-selection in survey collection: Phenomenon that can occur when survey respondents are allowed to decline participation in a survey. The belief is that respondents who are opposed to or apathetic about the objectives of the survey are less likely to participate, and their removal from the sample will bias the results of the survey. Self-selection can also occur because respondents who are either strongly opposed to or strongly supportive of a survey’s objectives respond to the survey. A classic example is television news polls that solicit call-in responses from listeners — the results from which are practically useless for learning how the population at large feels about an issue.

Serial correlation: The temporal association between observations in a series ordered across time. Also referred to as autocorrelation.

Significance: An effect is statistically significant if the value of the statistic used to test it lies outside acceptable limits, i.e., if the hypothesis that the effect is not present is rejected. Statistical significance does not imply practical significance, which is determined by also evaluating the marginal effects.

Simultaneous equations: A system of interrelated equations with dependent variables determined simultaneously.

Skewness: The lack of symmetry in a probability distribution. In a skewed distribution the mean and median are not coincident.

Smoothing: The process of removing fluctuations in an ordered series so that the result is “smooth” in the sense that the first differences are regular and higher-order differences are small. Although smoothing can be carried out using freehand methods, it is usual to make use of moving averages or the fitting of curves by least squares procedures. The philosophical grounds for smoothing stem from the notion that measurements are made with error, such that artificial “bumps” are observed in the data, whereas the data really should represent a smooth or continuous process. When these “lumpy” data are smoothed appropriately, the data are thought to better reflect the true process that generated the data. An example is the speed–time trace of a vehicle, where speed is measured in integer miles per hour. Accelerations of the vehicle computed from differences in successive speeds are overestimated due to the lumpy nature of measuring speed. Thus, an appropriate smoothing process on the speed data results in data that more closely resemble the underlying data-generating process. Of course, the technical difficulty with smoothing lies in selecting the appropriate smoothing process, as the true data-generating process is typically never observed.
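
A minimal moving-average sketch (an illustration, not from the text) shows the effect on the speed example; numpy is assumed and the trace is invented.

    import numpy as np

    # A speed-time trace recorded in integer mi/h (illustrative values).
    speed = np.array([30, 31, 31, 32, 34, 33, 35, 36, 36, 38], dtype=float)

    window = 3                                 # centered 3-point moving average
    smooth = np.convolve(speed, np.ones(window) / window, mode="valid")

    # First differences approximate acceleration; the raw integer-valued
    # trace yields exaggerated, "lumpy" accelerations relative to the
    # smoothed series.
    print(np.diff(speed).max(), np.diff(smooth).max())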

Spurious correlation: See illusory correlation.

Standard deviation: The sample standard deviation s is the square root of the sample variance and is given by the formula

\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}, \]

where n is the sample size and \bar{x} is the sample mean. The sample standard deviation shown here is a biased estimator, even though the sample variance is unbiased, and the bias becomes larger as the sample size becomes smaller.

Standard error: The square root of the variance of the sampling distribution of a statistic.

Standard error of estimate: The standard deviation of the observed values about a regression line.

Standard error of the mean: Standard deviation of the means of several samples drawn at random from a large population.

Standard normal transformation: Fortunately, the analyst can transform any normally distributed variable into a standard normal distributed variable by making use of a simple transformation. Given a normally distributed variable X, Z is defined such that

\[ Z = \frac{X - \mu_X}{\sigma_X}. \]

The new variable Z is normally distributed with μ = 0 and σ = 1 and is a standard normal variable.
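
A two-line numerical check (an illustration, not from the text), with numpy assumed and invented data:

    import numpy as np

    x = np.array([62.0, 71.5, 58.3, 66.8, 70.1, 64.4])
    z = (x - x.mean()) / x.std(ddof=1)    # Z = (X - mean) / standard deviation
    print(z.mean(), z.std(ddof=1))        # approximately 0, exactly 1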

Standard scores: Scores expressed in terms of standard deviations from the mean. Standard scores are obtained using the standard normal transformation (see also normal distribution and Z-value).

State dependence: Often used to justify the inclusion of lagged variables in a model that represent previous conditions. For example, if one were modeling an individual’s choice of home-to-work mode on Tuesday, using their mode choice on Monday could represent the “mode state.” Great caution must be exercised when using and interpreting the results of such state variables because they may be capturing spurious state dependence, which means the parameter is picking up the effects of unobserved factors and not true state effects.

Statistic: A summary value calculated from a sample of observations.

Statistical independence: In probability theory, two events are said to be statistically independent if, and only if, the probability that they both occur is equal to the product of their individual probabilities. That is, one event does not depend on another for its occurrence or non-occurrence. In statistical notation, statistical independence is given by

\[ P(A \cap B) = P(A)\,P(B), \]

where P(A ∩ B) is the probability that both event A and event B occur.

Statistical inference: Also called inductive statistics, a form of reasoning from sample data to population parameters; that is, any generalization, prediction, estimate, or decision based on a sample and made about the population. There are two schools of thought in statistical inference: classical or frequentist statistics, and Bayesian inference (see also induction and deduction).

Statistics: The branch of mathematics that deals with all aspects of the science of decision making and analysis of data in the face of uncertainty.

Stochastic: An adjective that implies that a process or data-generating mechanism involves a random component or components. A statistical model consists of stochastic and deterministic components.

Stratification: The division of a population into parts, known as strata.

Stratified random sampling: A method of sampling from a population whereby the population is divided into subsamples, known as strata, especially for the purpose of drawing a sample, and then assigned proportions of the sample are sampled from each stratum. The process of stratification is undertaken to reduce the variability of stratification statistics. Strata are generally selected such that interstrata variability is maximized and intrastrata variability is small. When stratified sampling is performed as desired, estimates of strata statistics are more precise than the same estimates computed on a simple random sample.
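
A minimal sketch of proportional allocation (an illustration, not from the text); numpy is assumed, and the strata sizes and total sample size are invented.

    import numpy as np

    rng = np.random.default_rng(1)
    strata = {"urban": 6000, "suburban": 3000, "rural": 1000}   # population sizes
    n_total = 200
    N = sum(strata.values())

    for name, size in strata.items():
        n_h = round(n_total * size / N)                 # proportional allocation
        members = rng.choice(size, n_h, replace=False)  # simple random sample within the stratum
        print(name, n_h, members[:5])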

Structural equations: Equations that capture the relationships among a system of variables, often a system consisting of latent variables. Structural equations are typically used to model an underlying behavioral theory.

Survivor function: A function that gives the probability that an event duration is greater than or equal to some specified time, t. It is frequently used in hazard analyses of duration data and is written S(t) = P[T ≥ t], where P denotes probability, T is a random time variable, and t is some specified value of the random variable.

Systematic error: An error that is in some sense biased, having a distribution with a mean that is not zero (as opposed to a random error).

T

t distribution: Distribution of values obtained from the sampling distribution of the mean when the variance is unknown. It is used extensively in hypothesis testing — specifically, tests about particular values of population means. The probabilities of the t distribution approach the standard normal distribution probabilities rapidly as n exceeds 30. The density of the t distribution is given as

\[ f(x) = C(n)\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}, \]

where C(n) is the normalizing constant given by

\[ C(n) = \frac{\Gamma\left[(n+1)/2\right]}{\sqrt{n\pi}\;\Gamma(n/2)}. \]

The mean of the t distribution is zero and the variance is n/(n – 2) for n > 2.

Three-stage least squares (3SLS): A method used in simultaneous equation estimation. Stage 1 obtains two-stage least squares (2SLS) estimates of the model system. Stage 2 uses the 2SLS estimates to compute residuals to determine cross-equation correlations. Stage 3 uses generalized least squares (GLS) to estimate model parameters.

Tied data: A problem encountered in the analysis of duration data in which a number of observations end their durations at the same time. When duration exits are grouped at specific times, the likelihood function for proportional and accelerated lifetime hazard models becomes increasingly complex.

Time-series data: A set of ordered observations on a quantitative characteristic of an individual or collective phenomenon taken at different points of time. Although it is not a requirement, it is common for these points to be equidistant in time (see also cross-sectional and before–after data).

Transferability: A concern with all models is whether or not their estimated parameters are transferable spatially (among regions or cities) or temporally (over time). From a spatial perspective, transferability is desirable because it means that parameters of models estimated in other places can be used, thus saving the cost of additional data collection and estimation. Temporal transferability ensures that forecasts made with the model have some validity in that the estimated parameters are stable over time (see also validation).

Transformation: The change in the scale of a variable. Transformations are performed to simplify calculations, to meet specific statistical modeling assumptions, to linearize an otherwise nonlinear relation with another variable, to impose practical limitations on a variable, and to change the characteristic shape of a probability distribution of the variable in its original scale.

Truncated distribution: A statistical distribution that occurs when a response above or below a certain threshold value is discarded. For example, assume that certain instrumentation can read measurements only within a certain range — data obtained from this instrument will result in a truncated distribution, as measurements outside the range are discarded. If measurements are recorded at the extreme range of the measurement device, but assigned a maximum or minimum value, then the distribution is censored.

Two-stage least squares (2SLS): A method used in simultaneous equation estimation. Stage 1 regresses each endogenous variable on all exogenous variables. Stage 2 uses the regression-estimated values from stage 1 as instruments, and ordinary least squares is used to obtain statistical models.
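
The two stages can be written out in a few lines of numpy (a sketch under simulated data, not from the text); here x is a single endogenous regressor and z an exogenous instrument.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    z = rng.normal(size=n)                    # exogenous instrument
    u = rng.normal(size=n)                    # shared unobservable
    x = 0.9 * z + u + rng.normal(size=n)      # endogenous regressor
    y = 1.0 + 2.0 * x + 3.0 * u + rng.normal(size=n)

    # Stage 1: regress the endogenous variable on the exogenous variable(s).
    Z = np.column_stack([np.ones(n), z])
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

    # Stage 2: ordinary least squares of y on the stage 1 fitted values.
    X_hat = np.column_stack([np.ones(n), x_hat])
    print(np.linalg.lstsq(X_hat, y, rcond=None)[0])   # slope near 2; naive OLS on x is biased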

Two-tailed test: A statistical test of significance in which the direction of an anticipated effect is not known a priori.

Type I error: If, as the result of a test statistic computed on sample data, a statistical hypothesis is rejected when it should be accepted, i.e., when it is true, then a Type I error is committed. The α, or level of significance, is preselected by the analyst to determine the Type I error rate. The level of confidence of a particular test is given by 1 – α.

Type II error: If, as the result of a test statistic computed on sample data, a statistical hypothesis is accepted when it is false, i.e., when it should have been rejected, then a Type II error is committed. The β error rate is preselected by the analyst to determine the Type II error rate, and the power of a particular statistical test is given by 1 – β.

U

Unbiased estimator: An estimator whose expected value (the mean of the sampling distribution) equals the parameter it is supposed to estimate. In general, unbiased estimators are preferred to biased estimators of population parameters. There are rare cases, however, when biased estimators are preferred because they result in estimators with smaller standard errors.

Uniform distribution: Distributions that are appropriate for cases when the probability of obtaining an outcome within a range of outcomes is constant. It is expressed in terms of either discrete or continuous data. The probability density function for the uniform distribution is given by

\[ P(X = x) = U(x; \alpha, \beta) = \begin{cases} \dfrac{1}{\beta - \alpha}, & \text{for } \alpha < x < \beta \\[4pt] 0, & \text{elsewhere,} \end{cases} \]

where x is the value of the random variable, α is the lowermost value of the interval for x, and β is the uppermost value of the interval for x.

Unobserved heterogeneity: A problem that arises in statistical modeling when some unobserved factors (not included in the model) systematically vary across the population. Ignoring unobserved heterogeneity can result in model specification errors that can lead one to draw erroneous inferences on model results.

Utility maximization: A basic premise of economic theory whereby consumers facing choices are assumed to maximize their personal utility subject to cost and income constraints. Utility theory may be used in the derivation and interpretation of discrete outcome models (but is not necessary).

V

Validation: A term used to describe the important activity of defending a statistical model. The only way to validate the generalizability or transferability of an estimated model is to make forecasts or backcasts with the model and compare them to data that were not used to estimate the model. This exercise is called external validation. The importance of this step of model building cannot be overstated, but it remains perhaps the least-practiced step of model building, because it is expensive and time-consuming, and because some modelers and practitioners confuse goodness-of-fit statistics computed on the sample data with the same statistics computed on validation data.

Validity: Degree to which some procedure is founded on logic (internal or formal validity) or corresponds to nature (external or empirical validity).

Variability: A statistical term used to describe and quantify the spread or dispersion of data around its center, usually the mean. Knowledge of data variability is essential for conducting statistical tests and for fully understanding data. Thus, it is often desirable to obtain measures of both central tendency and spread. In fact, it may be misleading to consider only measures of central tendency when describing data.

Variable: A quantity that may take any one of a specified set of values.

Variance: A generic term used to describe the spread of data around the center of the data. The center typically is a mean, a median, or a regression function. The variance is the square of the standard deviation.

Variate: A quantity that may take any of the values of a specified set with a specified relative frequency or probability; also known as a random variable.

W

Weibull distribution: A distribution commonly used in discrete outcome modeling and duration modeling. With parameters λ > 0 and P > 0, the Weibull distribution has the density function

\[ f(x) = \lambda P (\lambda x)^{P-1} \exp\left[-(\lambda x)^P\right]. \]

Weight: A numerical coefficient attached to an observation, frequently by multiplication, so that it will assume a desired degree of importance in a function of all the observations of the set.

Weighted average: An average of quantities to which a series of weights have been attached in order to make allowance for their relative importance. Weights are commonly used to accommodate nonrandom samples in statistical models, such as stratified samples, and are used to remediate heteroscedasticity with weighted least squares.

White noise: For time-series analysis, noise that is defined as a series whose elements are uncorrelated and normally distributed with mean zero and constant variance. The residuals from properly specified and estimated time-series models should be white noise.
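
A quick diagnostic sketch (an illustration, not from the text): compare the sample autocorrelations of a residual series against approximate 95% bounds; numpy is assumed and the series is simulated.

    import numpy as np

    rng = np.random.default_rng(3)
    e = rng.normal(0.0, 1.0, size=400)       # mean-zero, constant-variance series

    def acf(series, lag):
        s = series - series.mean()
        return (s[:-lag] * s[lag:]).sum() / (s * s).sum()

    bound = 2 / np.sqrt(len(e))              # approximate 95% bounds for white noise
    print([round(acf(e, k), 3) for k in range(1, 6)], "+/-", round(bound, 3))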

Z

Z-value: A measure of a variable that quantifies its relative position within a standard normal distribution. It is the number of standard deviations from 0. As rules of thumb, Z-values of ±1, ±2, and ±3 standard deviations from zero contain approximately 68, 95, and 99.7% of all observations in the distribution.

Appendix C: Statistical Tables

TABLE C.1

Normal Distribution

z f(z) F(z) 1–F(z) z f(z) F(z) 1–F(z)

–4.00 0.0001 0.0000 1.0000 –3.50 0.0009 0.0002 0.9998–3.99 0.0001 0.0000 1.0000 –3.49 0.0009 0.0002 0.9998–3.98 0.0001 0.0000 1.0000 –3.48 0.0009 0.0003 0.9998–3.97 0.0001 0.0000 1.0000 –3.47 0.0010 0.0003 0.9997–3.96 0.0002 0.0000 1.0000 –3.46 0.0010 0.0003 0.9997–3.95 0.0002 0.0000 1.0000 –3.45 0.0010 0.0003 0.9997–3.94 0.0002 0.0000 1.0000 –3.44 0.0011 0.0003 0.9997–3.93 0.0002 0.0000 1.0000 –3.43 0.0011 0.0003 0.9997–3.92 0.0002 0.0000 1.0000 –3.42 0.0011 0.0003 0.9997–3.91 0.0002 0.0001 1.0000 –3.41 0.0012 0.0003 0.9997

–3.90 0.0002 0.0001 1.0000 –3.40 0.0012 0.0003 0.9997–3.89 0.0002 0.0001 1.0000 –3.39 0.0013 0.0003 0.9997–3.88 0.0002 0.0001 1.0000 –3.38 0.0013 0.0004 0.9996–3.87 0.0002 0.0001 1.0000 –3.37 0.0014 0.0004 0.9996–3.86 0.0002 0.0001 0.9999 –3.36 0.0014 0.0004 0.9996–3.85 0.0002 0.0001 0.9999 –3.35 0.0015 0.0004 0.9996–3.84 0.0003 0.0001 0.9999 –3.34 0.0015 0.0004 0.9996–3.83 0.0003 0.0001 0.9999 –3.33 0.0016 0.0004 0.9996–3.82 0.0003 0.0001 0.9999 –3.32 0.0016 0.0004 0.9996–3.81 0.0003 0.0001 0.9999 –3.31 0.0017 0.0005 0.9995

–3.80 0.0003 0.0001 0.9999 –3.30 0.0017 0.0005 0.9995–3.79 0.0003 0.0001 0.9999 –3.29 0.0018 0.0005 0.9995–3.78 0.0003 0.0001 0.9999 –3.28 0.0018 0.0005 0.9995–3.77 0.0003 0.0001 0.9999 –3.27 0.0019 0.0005 0.9995–3.76 0.0003 0.0001 0.9999 –3.26 0.0020 0.0006 0.9994–3.75 0.0003 0.0001 0.9999 –3.25 0.0020 0.0006 0.9994–3.74 0.0004 0.0001 0.9999 –3.24 0.0021 0.0006 0.9994–3.73 0.0004 0.0001 0.9999 –3.23 0.0022 0.0006 0.9994–3.72 0.0004 0.0001 0.9999 –3.22 0.0022 0.0006 0.9994–3.71 0.0004 0.0001 0.9999 –3.21 0.0023 0.0007 0.9993

–3.70 0.0004 0.0001 0.9999 –3.20 0.0024 0.0007 0.9993–3.69 0.0004 0.0001 0.9999 –3.19 0.0025 0.0007 0.9993–3.68 0.0005 0.0001 0.9999 –3.18 0.0025 0.0007 0.9993–3.67 0.0005 0.0001 0.9999 –3.17 0.0026 0.0008 0.9992–3.66 0.0005 0.0001 0.9999 –3.16 0.0027 0.0008 0.9992–3.65 0.0005 0.0001 0.9999 –3.15 0.0028 0.0008 0.9992–3.64 0.0005 0.0001 0.9999 –3.14 0.0029 0.0008 0.9992–3.63 0.0006 0.0001 0.9999 –3.13 0.0030 0.0009 0.9991–3.62 0.0006 0.0001 0.9999 –3.12 0.0031 0.0009 0.9991–3.61 0.0006 0.0001 0.9999 –3.11 0.0032 0.0009 0.9991

–3.60 0.0006 0.0002 0.9998 –3.10 0.0033 0.0010 0.9990–3.59 0.0006 0.0002 0.9998 –3.09 0.0034 0.0010 0.9990–3.58 0.0007 0.0002 0.9998 –3.08 0.0035 0.0010 0.9990–3.57 0.0007 0.0002 0.9998 –3.07 0.0036 0.0011 0.9989–3.56 0.0007 0.0002 0.9998 –3.06 0.0037 0.0011 0.9989–3.55 0.0007 0.0002 0.9998 –3.05 0.0038 0.0011 0.9989–3.54 0.0008 0.0002 0.9998 –3.04 0.0039 0.0012 0.9988–3.53 0.0008 0.0002 0.9998 –3.03 0.0040 0.0012 0.9988–3.52 0.0008 0.0002 0.9998 –3.02 0.0042 0.0013 0.9987–3.51 0.0008 0.0002 0.9998 –3.01 0.0043 0.0013 0.9987

–3.50 0.0009 0.0002 0.9998 –3.00 0.0044 0.0014 0.9987

–3.00 0.0044 0.0014 0.9987 –2.50 0.0175 0.0062 0.9938–2.99 0.0046 0.0014 0.9986 –2.49 0.0180 0.0064 0.9936–2.98 0.0047 0.0014 0.9986 –2.48 0.0184 0.0066 0.9934–2.97 0.0049 0.0015 0.9985 –2.47 0.0189 0.0068 0.9932–2.96 0.0050 0.0015 0.9985 –2.46 0.0194 0.0069 0.9930–2.95 0.0051 0.0016 0.9984 –2.45 0.0198 0.0071 0.9929–2.94 0.0053 0.0016 0.9984 –2.44 0.0203 0.0073 0.9927–2.93 0.0054 0.0017 0.9983 –2.43 0.0208 0.0076 0.9925–2.92 0.0056 0.0018 0.9982 –2.42 0.0213 0.0078 0.9922–2.91 0.0058 0.0018 0.9982 –2.41 0.0219 0.0080 0.9920

–2.90 0.0060 0.0019 0.9981 –2.40 0.0224 0.0082 0.9918–2.89 0.0061 0.0019 0.9981 –2.39 0.0229 0.0084 0.9916–2.88 0.0063 0.0020 0.9980 –2.38 0.0235 0.0087 0.9913–2.87 0.0065 0.0021 0.9980 –2.37 0.0241 0.0089 0.9911–2.86 0.0067 0.0021 0.9979 –2.36 0.0246 0.0091 0.9909–2.85 0.0069 0.0022 0.9978 –2.35 0.0252 0.0094 0.9906–2.84 0.0071 0.0023 0.9977 –2.34 0.0258 0.0096 0.9904–2.83 0.0073 0.0023 0.9977 –2.33 0.0264 0.0099 0.9901–2.82 0.0075 0.0024 0.9976 –2.32 0.0271 0.0102 0.9898–2.81 0.0077 0.0025 0.9975 –2.31 0.0277 0.0104 0.9896

–2.80 0.0079 0.0026 0.9974 –2.30 0.0283 0.0107 0.9893–2.79 0.0081 0.0026 0.9974 –2.29 0.0290 0.0110 0.9890–2.78 0.0084 0.0027 0.9973 –2.28 0.0296 0.0113 0.9887–2.77 0.0086 0.0028 0.9972 –2.27 0.0303 0.0116 0.9884–2.76 0.0089 0.0029 0.9971 –2.26 0.0310 0.0119 0.9881–2.75 0.0091 0.0030 0.9970 –2.25 0.0317 0.0122 0.9878–2.74 0.0094 0.0031 0.9969 –2.24 0.0325 0.0126 0.9875–2.73 0.0096 0.0032 0.9968 –2.23 0.0332 0.0129 0.9871–2.72 0.0099 0.0033 0.9967 –2.22 0.0339 0.0132 0.9868–2.71 0.0101 0.0034 0.9966 –2.21 0.0347 0.0135 0.9865

–2.70 0.0104 0.0035 0.9965 –2.20 0.0355 0.0139 0.9861–2.69 0.0107 0.0036 0.9964 –2.19 0.0363 0.0143 0.9857–2.68 0.0110 0.0037 0.9963 –2.18 0.0371 0.0146 0.9854–2.67 0.0113 0.0038 0.9962 –2.17 0.0379 0.0150 0.9850–2.66 0.0116 0.0039 0.9961 –2.16 0.0387 0.0154 0.9846–2.65 0.0119 0.0040 0.9960 –2.15 0.0396 0.0158 0.9842–2.64 0.0122 0.0042 0.9959 –2.14 0.0404 0.0162 0.9838–2.63 0.0126 0.0043 0.9957 –2.13 0.0413 0.0166 0.9834–2.62 0.0129 0.0044 0.9956 –2.12 0.0422 0.0170 0.9830–2.61 0.0132 0.0045 0.9955 –2.11 0.0431 0.0174 0.9826

–2.60 0.0136 0.0047 0.9953 –2.10 0.0440 0.0179 0.9821–2.59 0.0139 0.0048 0.9952 –2.09 0.0449 0.0183 0.9817–2.58 0.0143 0.0049 0.9951 –2.08 0.0459 0.0188 0.9812–2.57 0.0147 0.0051 0.9949 –2.07 0.0468 0.0192 0.9808–2.56 0.0151 0.0052 0.9948 –2.06 0.0478 0.0197 0.9803–2.55 0.0155 0.0054 0.9946 –2.05 0.0488 0.0202 0.9798–2.54 0.0158 0.0055 0.9945 –2.04 0.0498 0.0207 0.9793–2.53 0.0163 0.0057 0.9943 –2.03 0.0508 0.0212 0.9788–2.52 0.0167 0.0059 0.9941 –2.02 0.0519 0.0217 0.9783–2.51 0.0171 0.0060 0.9940 –2.01 0.0529 0.0222 0.9778

–2.50 0.0175 0.0062 0.9938 –2.00 0.0540 0.0227 0.9772

–2.00 0.0540 0.0227 0.9772 –1.50 0.1295 0.0668 0.9332–1.99 0.0551 0.0233 0.9767 –1.49 0.1315 0.0681 0.9319–1.98 0.0562 0.0238 0.9761 –1.48 0.1334 0.0694 0.9306–1.97 0.0573 0.0244 0.9756 –1.47 0.1354 0.0708 0.9292–1.96 0.0584 0.0250 0.9750 –1.46 0.1374 0.0722 0.9278–1.95 0.0596 0.0256 0.9744 –1.45 0.1394 0.0735 0.9265–1.94 0.0608 0.0262 0.9738 –1.44 0.1415 0.0749 0.9251–1.93 0.0619 0.0268 0.9732 –1.43 0.1435 0.0764 0.9236–1.92 0.0632 0.0274 0.9726 –1.42 0.1456 0.0778 0.9222–1.91 0.0644 0.0281 0.9719 –1.41 0.1476 0.0793 0.9207

–1.90 0.0656 0.0287 0.9713 –1.40 0.1497 0.0808 0.9192–1.89 0.0669 0.0294 0.9706 –1.39 0.1518 0.0823 0.9177–1.88 0.0681 0.0301 0.9699 –1.38 0.1540 0.0838 0.9162–1.87 0.0694 0.0307 0.9693 –1.37 0.1561 0.0853 0.9147–1.86 0.0707 0.0314 0.9686 –1.36 0.1582 0.0869 0.9131–1.85 0.0721 0.0322 0.9678 –1.35 0.1604 0.0885 0.9115–1.84 0.0734 0.0329 0.9671 –1.34 0.1626 0.0901 0.9099–1.83 0.0748 0.0336 0.9664 –1.33 0.1647 0.0918 0.9082–1.82 0.0761 0.0344 0.9656 –1.32 0.1669 0.0934 0.9066–1.81 0.0775 0.0352 0.9648 –1.31 0.1691 0.0951 0.9049

–1.80 0.0790 0.0359 0.9641 –1.30 0.1714 0.0968 0.9032–1.79 0.0804 0.0367 0.9633 –1.29 0.1736 0.0985 0.9015–1.78 0.0818 0.0375 0.9625 –1.28 0.1759 0.1003 0.8997–1.77 0.0833 0.0384 0.9616 –1.27 0.1781 0.1020 0.8980–1.76 0.0848 0.0392 0.9608 –1.26 0.1804 0.1038 0.8962–1.75 0.0863 0.0401 0.9599 –1.25 0.1827 0.1056 0.8943–1.74 0.0878 0.0409 0.9591 –1.24 0.1849 0.1075 0.8925–1.73 0.0893 0.0418 0.9582 –1.23 0.1872 0.1094 0.8907–1.72 0.0909 0.0427 0.9573 –1.22 0.1895 0.1112 0.8888–1.71 0.0925 0.0436 0.9564 –1.21 0.1919 0.1131 0.8869

–1.70 0.0940 0.0446 0.9554 –1.20 0.1942 0.1151 0.8849–1.69 0.0957 0.0455 0.9545 –1.19 0.1965 0.1170 0.8830–1.68 0.0973 0.0465 0.9535 –1.18 0.1989 0.1190 0.8810–1.67 0.0989 0.0475 0.9525 –1.17 0.2012 0.1210 0.8790–1.66 0.1006 0.0485 0.9515 –1.16 0.2036 0.1230 0.8770–1.65 0.1023 0.0495 0.9505 –1.15 0.2059 0.1251 0.8749–1.64 0.1040 0.0505 0.9495 –1.14 0.2083 0.1271 0.8729–1.63 0.1057 0.0515 0.9485 –1.13 0.2107 0.1292 0.8708–1.62 0.1074 0.0526 0.9474 –1.12 0.2131 0.1314 0.8686–1.61 0.1091 0.0537 0.9463 –1.11 0.2155 0.1335 0.8665

–1.60 0.1109 0.0548 0.9452 –1.10 0.2178 0.1357 0.8643–1.59 0.1127 0.0559 0.9441 –1.09 0.2203 0.1379 0.8621–1.58 0.1145 0.0570 0.9429 –1.08 0.2226 0.1401 0.8599–1.57 0.1163 0.0582 0.9418 –1.07 0.2251 0.1423 0.8577–1.56 0.1182 0.0594 0.9406 –1.06 0.2275 0.1446 0.8554–1.55 0.1200 0.0606 0.9394 –1.05 0.2299 0.1469 0.8531–1.54 0.1219 0.0618 0.9382 –1.04 0.2323 0.1492 0.8508–1.53 0.1238 0.0630 0.9370 –1.03 0.2347 0.1515 0.8485–1.52 0.1257 0.0643 0.9357 –1.02 0.2371 0.1539 0.8461–1.51 0.1276 0.0655 0.9345 –1.01 0.2396 0.1563 0.8438

–1.50 0.1295 0.0668 0.9332 –1.00 0.2420 0.1587 0.8413

–1.00 0.2420 0.1587 0.8413 –0.50 0.3521 0.3085 0.6915–0.99 0.2444 0.1611 0.8389 –0.49 0.3538 0.3121 0.6879–0.98 0.2468 0.1635 0.8365 –0.48 0.3555 0.3156 0.6844–0.97 0.2492 0.1660 0.8340 –0.47 0.3572 0.3192 0.6808–0.96 0.2516 0.1685 0.8315 –0.46 0.3589 0.3228 0.6772–0.95 0.2541 0.1711 0.8289 –0.45 0.3605 0.3264 0.6736–0.94 0.2565 0.1736 0.8264 –0.44 0.3621 0.3300 0.6700–0.93 0.2589 0.1762 0.8238 –0.43 0.3637 0.3336 0.6664–0.92 0.2613 0.1788 0.8212 –0.42 0.3653 0.3372 0.6628–0.91 0.2637 0.1814 0.8186 –0.41 0.3668 0.3409 0.6591

–0.90 0.2661 0.1841 0.8159 –0.40 0.3683 0.3446 0.6554–0.89 0.2685 0.1867 0.8133 –0.39 0.3697 0.3483 0.6517–0.88 0.2709 0.1894 0.8106 –0.38 0.3711 0.3520 0.6480–0.87 0.2732 0.1921 0.8078 –0.37 0.3725 0.3557 0.6443–0.86 0.2756 0.1949 0.8051 –0.36 0.3739 0.3594 0.6406–0.85 0.2780 0.1977 0.8023 –0.35 0.3752 0.3632 0.6368–0.84 0.2803 0.2004 0.7995 –0.34 0.3765 0.3669 0.6331–0.83 0.2827 0.2033 0.7967 –0.33 0.3778 0.3707 0.6293–0.82 0.2850 0.2061 0.7939 –0.32 0.3790 0.3745 0.6255–0.81 0.2874 0.2090 0.7910 –0.31 0.3802 0.3783 0.6217

–0.80 0.2897 0.2119 0.7881 –0.30 0.3814 0.3821 0.6179
–0.79 0.2920 0.2148 0.7852 –0.29 0.3825 0.3859 0.6141
–0.78 0.2943 0.2177 0.7823 –0.28 0.3836 0.3897 0.6103
–0.77 0.2966 0.2207 0.7793 –0.27 0.3847 0.3936 0.6064
–0.76 0.2989 0.2236 0.7764 –0.26 0.3857 0.3974 0.6026
–0.75 0.3011 0.2266 0.7734 –0.25 0.3867 0.4013 0.5987
–0.74 0.3034 0.2296 0.7703 –0.24 0.3876 0.4052 0.5948
–0.73 0.3056 0.2327 0.7673 –0.23 0.3885 0.4091 0.5909
–0.72 0.3079 0.2358 0.7642 –0.22 0.3894 0.4129 0.5871
–0.71 0.3101 0.2389 0.7611 –0.21 0.3902 0.4168 0.5832

–0.70 0.3123 0.2420 0.7580 –0.20 0.3910 0.4207 0.5793–0.69 0.3144 0.2451 0.7549 –0.19 0.3918 0.4247 0.5754–0.68 0.3166 0.2482 0.7518 –0.18 0.3925 0.4286 0.5714–0.67 0.3187 0.2514 0.7486 –0.17 0.3932 0.4325 0.5675–0.66 0.3209 0.2546 0.7454 –0.16 0.3939 0.4364 0.5636–0.65 0.3230 0.2579 0.7421 –0.15 0.3945 0.4404 0.5596–0.64 0.3251 0.2611 0.7389 –0.14 0.3951 0.4443 0.5557–0.63 0.3271 0.2643 0.7357 –0.13 0.3956 0.4483 0.5517–0.62 0.3292 0.2676 0.7324 –0.12 0.3961 0.4522 0.5478–0.61 0.3312 0.2709 0.7291 –0.11 0.3965 0.4562 0.5438

–0.60 0.3332 0.2742 0.7258 –0.10 0.3970 0.4602 0.5398–0.59 0.3352 0.2776 0.7224 –0.09 0.3973 0.4641 0.5359–0.58 0.3372 0.2810 0.7190 –0.08 0.3977 0.4681 0.5319–0.57 0.3391 0.2843 0.7157 –0.07 0.3980 0.4721 0.5279–0.56 0.3411 0.2877 0.7123 –0.06 0.3982 0.4761 0.5239–0.55 0.3429 0.2912 0.7088 –0.05 0.3984 0.4801 0.5199–0.54 0.3448 0.2946 0.7054 –0.04 0.3986 0.4840 0.5160–0.53 0.3467 0.2981 0.7019 –0.03 0.3988 0.4880 0.5120–0.52 0.3485 0.3015 0.6985 –0.02 0.3989 0.4920 0.5080–0.51 0.3503 0.3050 0.6950 –0.01 0.3989 0.4960 0.5040

–0.50 0.3521 0.3085 0.6915 0.00 0.3989 0.5000 0.5000

0.00 0.3989 0.5000 0.5000 0.50 0.3521 0.6915 0.30850.01 0.3989 0.5040 0.4960 0.51 0.3503 0.6950 0.30500.02 0.3989 0.5080 0.4920 0.52 0.3485 0.6985 0.30150.03 0.3988 0.5120 0.4880 0.53 0.3467 0.7019 0.29810.04 0.3986 0.5160 0.4840 0.54 0.3448 0.7054 0.29460.05 0.3984 0.5199 0.4801 0.55 0.3429 0.7088 0.29120.06 0.3982 0.5239 0.4761 0.56 0.3411 0.7123 0.28770.07 0.3980 0.5279 0.4721 0.57 0.3391 0.7157 0.28430.08 0.3977 0.5319 0.4681 0.58 0.3372 0.7190 0.28100.09 0.3973 0.5359 0.4641 0.59 0.3352 0.7224 0.2776

0.10 0.3970 0.5398 0.4602 0.60 0.3332 0.7258 0.27420.11 0.3965 0.5438 0.4562 0.61 0.3312 0.7291 0.27090.12 0.3961 0.5478 0.4522 0.62 0.3292 0.7324 0.26760.13 0.3956 0.5517 0.4483 0.63 0.3271 0.7357 0.26430.14 0.3951 0.5557 0.4443 0.64 0.3251 0.7389 0.26110.15 0.3945 0.5596 0.4404 0.65 0.3230 0.7421 0.25790.16 0.3939 0.5636 0.4364 0.66 0.3209 0.7454 0.25460.17 0.3932 0.5675 0.4325 0.67 0.3187 0.7486 0.25140.18 0.3925 0.5714 0.4286 0.68 0.3166 0.7518 0.24820.19 0.3918 0.5754 0.4247 0.69 0.3144 0.7549 0.2451

0.20 0.3910 0.5793 0.4207 0.70 0.3123 0.7580 0.24200.21 0.3902 0.5832 0.4168 0.71 0.3101 0.7611 0.23890.22 0.3894 0.5871 0.4129 0.72 0.3079 0.7642 0.23580.23 0.3885 0.5909 0.4091 0.73 0.3056 0.7673 0.23270.24 0.3876 0.5948 0.4052 0.74 0.3034 0.7703 0.22960.25 0.3867 0.5987 0.4013 0.75 0.3011 0.7734 0.22660.26 0.3857 0.6026 0.3974 0.76 0.2989 0.7764 0.22360.27 0.3847 0.6064 0.3936 0.77 0.2966 0.7793 0.22070.28 0.3836 0.6103 0.3897 0.78 0.2943 0.7823 0.21770.29 0.3825 0.6141 0.3859 0.79 0.2920 0.7852 0.2148

0.30 0.3814 0.6179 0.3821 0.80 0.2897 0.7881 0.21190.31 0.3802 0.6217 0.3783 0.81 0.2874 0.7910 0.20900.32 0.3790 0.6255 0.3745 0.82 0.2850 0.7939 0.20610.33 0.3778 0.6293 0.3707 0.83 0.2827 0.7967 0.20330.34 0.3765 0.6331 0.3669 0.84 0.2803 0.7995 0.20040.35 0.3752 0.6368 0.3632 0.85 0.2780 0.8023 0.19770.36 0.3739 0.6406 0.3594 0.86 0.2756 0.8051 0.19490.37 0.3725 0.6443 0.3557 0.87 0.2732 0.8078 0.19210.38 0.3711 0.6480 0.3520 0.88 0.2709 0.8106 0.18940.39 0.3697 0.6517 0.3483 0.89 0.2685 0.8133 0.1867

0.40 0.3683 0.6554 0.3446 0.90 0.2661 0.8159 0.18410.41 0.3668 0.6591 0.3409 0.91 0.2637 0.8186 0.18140.42 0.3653 0.6628 0.3372 0.92 0.2613 0.8212 0.17880.43 0.3637 0.6664 0.3336 0.93 0.2589 0.8238 0.17620.44 0.3621 0.6700 0.3300 0.94 0.2565 0.8264 0.17360.45 0.3605 0.6736 0.3264 0.95 0.2541 0.8289 0.17110.46 0.3589 0.6772 0.3228 0.96 0.2516 0.8315 0.16850.47 0.3572 0.6808 0.3192 0.97 0.2492 0.8340 0.16600.48 0.3555 0.6844 0.3156 0.98 0.2468 0.8365 0.16350.49 0.3538 0.6879 0.3121 0.99 0.2444 0.8389 0.1611

0.50 0.3521 0.6915 0.3085 1.00 0.2420 0.8413 0.1587

1.00 0.2420 0.8413 0.1587 1.50 0.1295 0.9332 0.06681.01 0.2396 0.8438 0.1563 1.51 0.1276 0.9345 0.06551.02 0.2371 0.8461 0.1539 1.52 0.1257 0.9357 0.06431.03 0.2347 0.8485 0.1515 1.53 0.1238 0.9370 0.06301.04 0.2323 0.8508 0.1492 1.54 0.1219 0.9382 0.06181.05 0.2299 0.8531 0.1469 1.55 0.1200 0.9394 0.06061.06 0.2275 0.8554 0.1446 1.56 0.1182 0.9406 0.05941.07 0.2251 0.8577 0.1423 1.57 0.1163 0.9418 0.05821.08 0.2226 0.8599 0.1401 1.58 0.1145 0.9429 0.05701.09 0.2203 0.8621 0.1379 1.59 0.1127 0.9441 0.0559

1.10 0.2178 0.8643 0.1357 1.60 0.1109 0.9452 0.05481.11 0.2155 0.8665 0.1335 1.61 0.1091 0.9463 0.05371.12 0.2131 0.8686 0.1314 1.62 0.1074 0.9474 0.05261.13 0.2107 0.8708 0.1292 1.63 0.1057 0.9485 0.05151.14 0.2083 0.8729 0.1271 1.64 0.1040 0.9495 0.05051.15 0.2059 0.8749 0.1251 1.65 0.1023 0.9505 0.04951.16 0.2036 0.8770 0.1230 1.66 0.1006 0.9515 0.04851.17 0.2012 0.8790 0.1210 1.67 0.0989 0.9525 0.04751.18 0.1989 0.8810 0.1190 1.68 0.0973 0.9535 0.04651.19 0.1965 0.8830 0.1170 1.69 0.0957 0.9545 0.0455

1.20 0.1942 0.8849 0.1151 1.70 0.0940 0.9554 0.04461.21 0.1919 0.8869 0.1131 1.71 0.0925 0.9564 0.04361.22 0.1895 0.8888 0.1112 1.72 0.0909 0.9573 0.04271.23 0.1872 0.8907 0.1094 1.73 0.0893 0.9582 0.04181.24 0.1849 0.8925 0.1075 1.74 0.0878 0.9591 0.04091.25 0.1827 0.8943 0.1056 1.75 0.0863 0.9599 0.04011.26 0.1804 0.8962 0.1038 1.76 0.0848 0.9608 0.03921.27 0.1781 0.8980 0.1020 1.77 0.0833 0.9616 0.03841.28 0.1759 0.8997 0.1003 1.78 0.0818 0.9625 0.03751.29 0.1736 0.9015 0.0985 1.79 0.0804 0.9633 0.0367

1.30 0.1714 0.9032 0.0968 1.80 0.0790 0.9641 0.0359
1.31 0.1691 0.9049 0.0951 1.81 0.0775 0.9648 0.0352
1.32 0.1669 0.9066 0.0934 1.82 0.0761 0.9656 0.0344
1.33 0.1647 0.9082 0.0918 1.83 0.0748 0.9664 0.0336
1.34 0.1626 0.9099 0.0901 1.84 0.0734 0.9671 0.0329
1.35 0.1604 0.9115 0.0885 1.85 0.0721 0.9678 0.0322
1.36 0.1582 0.9131 0.0869 1.86 0.0707 0.9686 0.0314
1.37 0.1561 0.9147 0.0853 1.87 0.0694 0.9693 0.0307
1.38 0.1540 0.9162 0.0838 1.88 0.0681 0.9699 0.0301
1.39 0.1518 0.9177 0.0823 1.89 0.0669 0.9706 0.0294

1.40 0.1497 0.9192 0.0808 1.90 0.0656 0.9713 0.02871.41 0.1476 0.9207 0.0793 1.91 0.0644 0.9719 0.02811.42 0.1456 0.9222 0.0778 1.92 0.0632 0.9726 0.02741.43 0.1435 0.9236 0.0764 1.93 0.0619 0.9732 0.02681.44 0.1415 0.9251 0.0749 1.94 0.0608 0.9738 0.02621.45 0.1394 0.9265 0.0735 1.95 0.0596 0.9744 0.02561.46 0.1374 0.9278 0.0722 1.96 0.0584 0.9750 0.02501.47 0.1354 0.9292 0.0708 1.97 0.0573 0.9756 0.02441.48 0.1334 0.9306 0.0694 1.98 0.0562 0.9761 0.02381.49 0.1315 0.9319 0.0681 1.99 0.0551 0.9767 0.0233

1.50 0.1295 0.9332 0.0668 2.00 0.0540 0.9772 0.0227

2.00 0.0540 0.9772 0.0227 2.50 0.0175 0.9938 0.00622.01 0.0529 0.9778 0.0222 2.51 0.0171 0.9940 0.00602.02 0.0519 0.9783 0.0217 2.52 0.0167 0.9941 0.00592.03 0.0508 0.9788 0.0212 2.53 0.0163 0.9943 0.00572.04 0.0498 0.9793 0.0207 2.54 0.0158 0.9945 0.00552.05 0.0488 0.9798 0.0202 2.55 0.0155 0.9946 0.00542.06 0.0478 0.9803 0.0197 2.56 0.0151 0.9948 0.00522.07 0.0468 0.9808 0.0192 2.57 0.0147 0.9949 0.00512.08 0.0459 0.9812 0.0188 2.58 0.0143 0.9951 0.00492.09 0.0449 0.9817 0.0183 2.59 0.0139 0.9952 0.0048

2.10 0.0440 0.9821 0.0179 2.60 0.0136 0.9953 0.00472.11 0.0431 0.9826 0.0174 2.61 0.0132 0.9955 0.00452.12 0.0422 0.9830 0.0170 2.62 0.0129 0.9956 0.00442.13 0.0413 0.9834 0.0166 2.63 0.0126 0.9957 0.00432.14 0.0404 0.9838 0.0162 2.64 0.0122 0.9959 0.00422.15 0.0396 0.9842 0.0158 2.65 0.0119 0.9960 0.00402.16 0.0387 0.9846 0.0154 2.66 0.0116 0.9961 0.00392.17 0.0379 0.9850 0.0150 2.67 0.0113 0.9962 0.00382.18 0.0371 0.9854 0.0146 2.68 0.0110 0.9963 0.00372.19 0.0363 0.9857 0.0143 2.69 0.0107 0.9964 0.0036

2.20 0.0355 0.9861 0.0139 2.70 0.0104 0.9965 0.00352.21 0.0347 0.9865 0.0135 2.71 0.0101 0.9966 0.00342.22 0.0339 0.9868 0.0132 2.72 0.0099 0.9967 0.00332.23 0.0332 0.9871 0.0129 2.73 0.0096 0.9968 0.00322.24 0.0325 0.9875 0.0126 2.74 0.0094 0.9969 0.00312.25 0.0317 0.9878 0.0122 2.75 0.0091 0.9970 0.00302.26 0.0310 0.9881 0.0119 2.76 0.0089 0.9971 0.00292.27 0.0303 0.9884 0.0116 2.77 0.0086 0.9972 0.00282.28 0.0296 0.9887 0.0113 2.78 0.0084 0.9973 0.00272.29 0.0290 0.9890 0.0110 2.79 0.0081 0.9974 0.0026

2.30 0.0283 0.9893 0.0107 2.80 0.0079 0.9974 0.00262.31 0.0277 0.9896 0.0104 2.81 0.0077 0.9975 0.00252.32 0.0271 0.9898 0.0102 2.82 0.0075 0.9976 0.00242.33 0.0264 0.9901 0.0099 2.83 0.0073 0.9977 0.00232.34 0.0258 0.9904 0.0096 2.84 0.0071 0.9977 0.00232.35 0.0252 0.9906 0.0094 2.85 0.0069 0.9978 0.00222.36 0.0246 0.9909 0.0091 2.86 0.0067 0.9979 0.00212.37 0.0241 0.9911 0.0089 2.87 0.0065 0.9980 0.00212.38 0.0235 0.9913 0.0087 2.88 0.0063 0.9980 0.00202.39 0.0229 0.9916 0.0084 2.89 0.0061 0.9981 0.0019

2.40 0.0224 0.9918 0.0082 2.90 0.0060 0.9981 0.00192.41 0.0219 0.9920 0.0080 2.91 0.0058 0.9982 0.00182.42 0.0213 0.9922 0.0078 2.92 0.0056 0.9982 0.00182.43 0.0208 0.9925 0.0076 2.93 0.0054 0.9983 0.00172.44 0.0203 0.9927 0.0073 2.94 0.0053 0.9984 0.00162.45 0.0198 0.9929 0.0071 2.95 0.0051 0.9984 0.00162.46 0.0194 0.9930 0.0069 2.96 0.0050 0.9985 0.00152.47 0.0189 0.9932 0.0068 2.97 0.0049 0.9985 0.00152.48 0.0184 0.9934 0.0066 2.98 0.0047 0.9986 0.00142.49 0.0180 0.9936 0.0064 2.99 0.0046 0.9986 0.0014

2.50 0.0175 0.9938 0.0062 3.00 0.0044 0.9987 0.0014

3.00 0.0044 0.9987 0.0014 3.50 0.0009 0.9998 0.00023.01 0.0043 0.9987 0.0013 3.51 0.0008 0.9998 0.00023.02 0.0042 0.9987 0.0013 3.52 0.0008 0.9998 0.00023.03 0.0040 0.9988 0.0012 3.53 0.0008 0.9998 0.00023.04 0.0039 0.9988 0.0012 3.54 0.0008 0.9998 0.00023.05 0.0038 0.9989 0.0011 3.55 0.0007 0.9998 0.00023.06 0.0037 0.9989 0.0011 3.56 0.0007 0.9998 0.00023.07 0.0036 0.9989 0.0011 3.57 0.0007 0.9998 0.00023.08 0.0035 0.9990 0.0010 3.58 0.0007 0.9998 0.00023.09 0.0034 0.9990 0.0010 3.59 0.0006 0.9998 0.0002

3.10 0.0033 0.9990 0.0010 3.60 0.0006 0.9998 0.00023.11 0.0032 0.9991 0.0009 3.61 0.0006 0.9999 0.00013.12 0.0031 0.9991 0.0009 3.62 0.0006 0.9999 0.00013.13 0.0030 0.9991 0.0009 3.63 0.0006 0.9999 0.00013.14 0.0029 0.9992 0.0008 3.64 0.0005 0.9999 0.00013.15 0.0028 0.9992 0.0008 3.65 0.0005 0.9999 0.00013.16 0.0027 0.9992 0.0008 3.66 0.0005 0.9999 0.00013.17 0.0026 0.9992 0.0008 3.67 0.0005 0.9999 0.00013.18 0.0025 0.9993 0.0007 3.68 0.0005 0.9999 0.00013.19 0.0025 0.9993 0.0007 3.69 0.0004 0.9999 0.0001

3.20 0.0024 0.9993 0.0007 3.70 0.0004 0.9999 0.00013.21 0.0023 0.9993 0.0007 3.71 0.0004 0.9999 0.00013.22 0.0022 0.9994 0.0006 3.72 0.0004 0.9999 0.00013.23 0.0022 0.9994 0.0006 3.73 0.0004 0.9999 0.00013.24 0.0021 0.9994 0.0006 3.74 0.0004 0.9999 0.00013.25 0.0020 0.9994 0.0006 3.75 0.0003 0.9999 0.00013.26 0.0020 0.9994 0.0006 3.76 0.0003 0.9999 0.00013.27 0.0019 0.9995 0.0005 3.77 0.0003 0.9999 0.00013.28 0.0018 0.9995 0.0005 3.78 0.0003 0.9999 0.00013.29 0.0018 0.9995 0.0005 3.79 0.0003 0.9999 0.0001

3.30 0.0017 0.9995 0.0005 3.80 0.0003 0.9999 0.00013.31 0.0017 0.9995 0.0005 3.81 0.0003 0.9999 0.00013.32 0.0016 0.9996 0.0004 3.82 0.0003 0.9999 0.00013.33 0.0016 0.9996 0.0004 3.83 0.0003 0.9999 0.00013.34 0.0015 0.9996 0.0004 3.84 0.0003 0.9999 0.00013.35 0.0015 0.9996 0.0004 3.85 0.0002 0.9999 0.00013.36 0.0014 0.9996 0.0004 3.86 0.0002 0.9999 0.00013.37 0.0014 0.9996 0.0004 3.87 0.0002 1.0000 0.00013.38 0.0013 0.9996 0.0004 3.88 0.0002 1.0000 0.00013.39 0.0013 0.9997 0.0003 3.89 0.0002 1.0000 0.0001

3.40 0.0012 0.9997 0.0003 3.90 0.0002 1.0000 0.00013.41 0.0012 0.9997 0.0003 3.91 0.0002 1.0000 0.00013.42 0.0011 0.9997 0.0003 3.92 0.0002 1.0000 0.00003.43 0.0011 0.9997 0.0003 3.93 0.0002 1.0000 0.00003.44 0.0011 0.9997 0.0003 3.94 0.0002 1.0000 0.00003.45 0.0010 0.9997 0.0003 3.95 0.0002 1.0000 0.00003.46 0.0010 0.9997 0.0003 3.96 0.0002 1.0000 0.00003.47 0.0010 0.9997 0.0003 3.97 0.0001 1.0000 0.00003.48 0.0009 0.9998 0.0003 3.98 0.0001 1.0000 0.00003.49 0.0009 0.9998 0.0002 3.99 0.0001 1.0000 0.0000

3.50 0.0009 0.9998 0.0002 4.00 0.0001 1.0000 0.0000

TABLE C.2

Critical Values for the t Distribution

ν 0.10 0.05 0.025 0.01 0.005 0.001 0.0005

1 3.078 6.314 12.706 31.821 63.657 318.309 636.619
2 1.886 2.920 4.303 6.965 9.925 22.327 31.599
3 1.638 2.353 3.182 4.541 5.841 10.215 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869

6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781

10 1.372 1.812 2.228 2.764 3.169 4.144 4.587

11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073

16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850

21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725

26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646

35 1.306 1.690 2.030 2.438 2.724 3.340 3.591
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
45 1.301 1.679 2.014 2.412 2.690 3.281 3.520
50 1.299 1.676 2.009 2.403 2.678 3.261 3.496

100 1.290 1.660 1.984 2.364 2.626 3.174 3.390
∞ 1.282 1.645 1.960 2.326 2.576 3.091 3.291

TABLE C.3

Critical Values for the Chi-Square Distribution χ²

ν .9999 .9995 .999 .995 .99 .975 .95 .90

1 .0⁷157 .0⁶393 .0⁵157 .0⁴393 .0002 .0010 .0039 .0158
2 .0002 .0010 .0020 .0100 .0201 .0506 .1026 .2107
3 .0052 .0153 .0243 .0717 .1148 .2158 .3518 .5844
4 .0284 .0639 .0908 .2070 .2971 .4844 .7107 1.0636
5 .0822 .1581 .2102 .4117 .5543 .8312 1.1455 1.6103

6 .1724 .2994 .3811 .6757 .8721 1.2373 1.6354 2.20417 .3000 .4849 .5985 .9893 1.2390 1.6899 2.1673 2.83318 .4636 .7104 .8571 1.3444 1.6465 2.1797 2.7326 3.48959 .6608 .9717 1.1519 1.7349 2.0879 2.7004 3.3251 4.1682

10 .889 1.2650 1.4787 2.1559 2.5582 3.2470 3.9403 4.8652

11 1.1453 1.5868 1.8339 2.6032 3.0535 3.8157 4.5748 5.577812 1.4275 1.9344 2.2142 3.0738 3.5706 4.4038 5.2260 6.303813 1.7333 2.3051 2.6172 3.5650 4.1069 5.0088 5.8919 7.041514 2.0608 2.6967 3.0407 4.0747 4.6604 5.6287 6.5706 7.789515 2.4082 3.1075 3.4827 4.6009 5.2293 6.2621 7.2609 8.5468

16 2.7739 3.5358 3.9416 5.1422 5.8122 6.9077 7.9616 9.312217 3.1567 3.9802 4.4161 5.6972 6.4078 7.5642 8.6718 10.085218 3.5552 4.4394 4.9048 6.2648 7.0149 8.2307 9.3905 10.864919 3.9683 4.9123 5.4068 6.8440 7.627 8.9065 10.1170 11.650920 4.3952 5.3981 5.9210 7.4338 8.2604 9.5908 10.8508 12.4426

21 4.8348 5.8957 6.4467 8.0337 8.8972 10.2829 11.5913 13.239622 5.2865 6.4045 6.9830 8.6427 9.5425 10.9823 12.3380 14.041523 5.7494 6.9237 7.5292 9.2604 10.1957 11.6886 13.0905 14.848024 6.2230 7.4527 8.0849 9.8862 10.8564 12.4012 13.8484 15.658725 6.7066 7.9910 8.6493 10.5197 11.5240 13.1197 14.6114 16.4734

26 7.1998 8.5379 9.2221 11.1602 12.1981 13.8439 15.3792 17.291927 7.7019 9.0932 9.8028 11.8076 12.8785 14.5734 16.1514 18.113928 8.2126 9.6563 10.3909 12.4613 13.5647 15.3079 16.9279 18.939229 8.7315 10.2268 10.9861 13.1211 14.2565 16.0471 17.7084 19.767730 9.2581 10.8044 11.5880 13.7867 14.9535 16.7908 18.4927 20.5992

31 9.7921 11.3887 12.1963 14.4578 15.6555 17.5387 19.2806 21.433632 10.3331 11.9794 12.8107 15.1340 16.3622 18.2908 20.0719 22.270633 10.8810 12.5763 13.4309 15.8153 17.0735 19.0467 20.8665 23.110234 11.4352 13.1791 14.0567 16.5013 17.7891 19.8063 21.6643 23.952335 11.9957 13.7875 14.6878 17.1918 18.5089 20.5694 22.4650 24.7967

36 12.5622 14.4012 15.3241 17.8867 19.2327 21.3359 23.2686 25.643337 13.1343 15.0202 15.9653 18.5858 19.9602 22.1056 24.0749 26.492138 13.7120 15.6441 16.6112 19.2889 20.6914 22.8785 24.8839 27.343039 14.2950 16.2729 17.2616 19.9959 21.4262 23.6543 25.6954 28.195840 14.8831 16.9062 17.9164 20.7065 22.1643 24.4330 26.5093 29.0505

41 15.48 17.54 18.58 21.42 22.91 25.21 27.33 29.9142 16.07 18.19 19.24 22.14 23.65 26.00 28.14 30.7743 16.68 18.83 19.91 22.86 24.40 26.79 28.96 31.6344 17.28 19.48 20.58 23.58 25.15 27.57 29.79 32.4945 17.89 20.14 21.25 24.31 25.90 28.37 30.61 33.35

46 18.51 20.79 21.93 25.04 26.66 29.16 31.44 34.2247 19.13 21.46 22.61 25.77 27.42 29.96 32.27 35.0848 19.75 22.12 23.29 26.51 28.18 30.75 33.10 35.9549 20.38 22.79 23.98 27.25 28.94 31.55 33.93 36.8250 21.01 23.46 24.67 27.99 29.71 32.36 34.76 37.69

60 27.50 30.34 31.74 35.53 37.48 40.48 43.19 46.4670 34.26 37.47 39.04 43.28 45.44 48.76 51.74 55.3380 41.24 44.79 46.52 51.17 53.54 57.15 60.39 64.2890 48.41 52.28 54.16 59.20 61.75 65.65 69.13 73.29

100 55.72 59.90 61.92 67.33 70.06 74.22 77.93 82.36

200 134.02 140.66 143.84 152.24 156.43 162.73 168.28 174.84300 217.33 225.89 229.96 240.66 245.97 253.91 260.88 269.07400 303.26 313.43 318.26 330.90 337.16 346.48 354.64 364.21500 390.85 402.45 407.95 422.30 429.39 439.94 449.15 459.93600 479.64 492.52 498.62 514.53 522.37 534.02 544.18 556.06

700 569.32 583.39 590.05 607.38 615.91 628.58 639.61 652.50800 659.72 674.89 682.07 700.73 709.90 723.51 735.36 749.19900 750.70 766.91 774.57 794.47 804.25 818.76 831.37 846.07

1000 842.17 859.36 867.48 888.56 898.91 914.26 927.59 943.131500 1304.80 1326.30 1336.42 1362.67 1375.53 1394.56 1411.06 1430.25

2000 1773.30 1798.42 1810.24 1840.85 1855.82 1877.95 1897.12 1919.392500 2245.54 2273.86 2287.17 2321.62 2338.45 2363.31 2384.84 2409.823000 2720.44 2751.65 2766.32 2804.23 2822.75 2850.08 2873.74 2901.173500 3197.36 3231.23 3247.14 3288.25 3308.31 3337.93 3363.53 3393.224000 3675.88 3712.22 3729.29 3773.37 3794.87 3826.60 3854.03 3885.81

4500 4155.71 4194.37 4212.52 4259.39 4282.25 4315.96 4345.10 4378.865000 4636.62 4677.48 4696.67 4746.17 4770.31 4805.90 4836.66 4872.285500 5118.47 5161.42 5181.58 5233.60 5258.96 5296.34 5328.63 5366.036000 5601.13 5646.08 5667.17 5721.59 5748.11 5787.20 5820.96 5860.056500 6084.50 6131.36 6153.35 6210.07 6237.70 6278.43 6313.60 6354.32

7000 6568.49 6617.20 6640.05 6698.98 6727.69 6769.99 6806.52 6848.80
7500 7053.05 7103.53 7127.22 7188.28 7218.03 7261.85 7299.69 7343.48
8000 7538.11 7590.32 7614.81 7677.94 7708.68 7753.98 7793.08 7838.33
8500 8023.63 8077.51 8102.78 8167.91 8199.63 8246.35 8286.68 8333.34
9000 8509.57 8565.07 8591.09 8658.17 8690.83 8738.94 8780.46 8828.50

9500 8995.90 9052.97 9079.73 9148.70 9182.28 9231.74 9274.42 9323.78
10000 9482.59 9541.19 9568.67 9639.48 9673.95 9724.72 9768.53 9819.19

Critical Values for the Chi-Square Distribution χ²

ν .10 .05 .025 .01 .005 .001 .0005 .0001

1 2.7055 3.8415 5.0239 6.6349 7.8794 10.8276 12.1157 15.13672 4.6052 5.9915 7.3778 9.2103 10.5966 13.8155 15.2018 18.42073 6.2514 7.8147 9.3484 11.3449 12.8382 16.2662 17.7300 21.10754 7.7794 9.4877 11.1433 13.2767 14.8603 18.4668 19.9974 23.51275 9.2364 11.0705 12.8325 15.0863 16.7496 20.5150 22.1053 25.7448

6 10.6446 12.5916 14.4494 16.8119 18.5476 22.4577 24.1028 27.85637 12.0170 14.0671 16.0128 18.4753 20.2777 24.3219 26.0178 29.87758 13.3616 15.5073 17.5345 20.0902 21.9550 26.1245 27.8680 31.82769 14.6837 16.9190 19.0228 21.6660 23.5894 27.8772 29.6658 33.7199

10 15.9872 18.3070 20.4832 23.2093 25.1882 29.5883 31.4198 35.5640

11 17.2750 19.6751 21.9200 24.7250 26.7568 31.2641 33.1366 37.367012 18.5493 21.0261 23.3367 26.2170 28.2995 32.9095 34.8213 39.134413 19.8119 22.3620 24.7356 27.6882 29.8195 34.5282 36.4778 40.870714 21.0641 23.6848 26.1189 29.1412 31.3193 36.1233 38.1094 42.579315 22.3071 24.9958 27.4884 30.5779 32.8013 37.6973 39.7188 44.2632

16 23.5418 26.2962 28.8454 31.9999 34.2672 39.2524 41.3081 45.924917 24.7690 27.5871 30.1910 33.4087 35.7185 40.7902 42.8792 47.566418 25.9894 28.8693 31.5264 34.8053 37.1565 42.3124 44.4338 49.189419 27.2036 30.1435 32.8523 36.1909 38.5823 43.8202 45.9731 50.795520 28.4120 31.4104 34.1696 37.5662 39.9968 45.3147 47.4985 52.3860

21 29.6151 32.6706 35.4789 38.9322 41.4011 46.7970 49.0108 53.9620
22 30.8133 33.9244 36.7807 40.2894 42.7957 48.2679 50.5111 55.5246
23 32.0069 35.1725 38.0756 41.6384 44.1813 49.7282 52.0002 57.0746
24 33.1962 36.4150 39.3641 42.9798 45.5585 51.1786 53.4788 58.6130
25 34.3816 37.6525 40.6465 44.3141 46.9279 52.6197 54.9475 60.1403

26 35.5632 38.8851 41.9232 45.6417 48.2899 54.0520 56.4069 61.6573
27 36.7412 40.1133 43.1945 46.9629 49.6449 55.4760 57.8576 63.1645
28 37.9159 41.3371 44.4608 48.2782 50.9934 56.8923 59.3000 64.6624
29 39.0875 42.5570 45.7223 49.5879 52.3356 58.3012 60.7346 66.1517
30 40.2560 43.7730 46.9792 50.8922 53.6720 59.7031 62.1619 67.6326

31 41.4217 44.9853 48.2319 52.1914 55.0027 61.0983 63.5820 69.105732 42.5847 46.1943 49.4804 53.4858 56.3281 62.4872 64.9955 70.571233 43.7452 47.3999 50.7251 54.7755 57.6484 63.8701 66.4025 72.029634 44.9032 48.6024 51.9660 56.0609 58.9639 65.2472 67.8035 73.481235 46.0588 49.8018 53.2033 57.3421 60.2748 66.6188 69.1986 74.9262

36 47.2122 50.9985 54.4373 58.6192 61.5812 67.9852 70.5881 76.3650
37 48.3634 52.1923 55.6680 59.8925 62.8833 69.3465 71.9722 77.7977
38 49.5126 53.3835 56.8955 61.1621 64.1814 70.7029 73.3512 79.2247
39 50.6598 54.5722 58.1201 62.4281 65.4756 72.0547 74.7253 80.6462
40 51.8051 55.7585 59.3417 63.6907 66.7660 73.4020 76.0946 82.0623

41 52.95 56.94 60.56 64.95 68.05 74.74 77.46 83.4742 54.09 58.12 61.78 66.21 69.34 76.08 78.82 84.8843 55.23 59.30 62.99 67.46 70.62 77.42 80.18 86.2844 56.37 60.48 64.20 68.71 71.89 78.75 81.53 87.6845 57.51 61.66 65.41 69.96 73.17 80.08 82.88 89.07

46 58.64 62.83 66.62 71.20 74.44 81.40 84.22 90.4647 59.77 64.00 67.82 72.44 75.70 82.72 85.56 91.8448 60.91 65.17 69.02 73.68 76.97 84.04 86.90 93.2249 62.04 66.34 70.22 74.92 78.23 85.35 88.23 94.6050 63.17 67.50 71.42 76.15 79.49 86.66 89.56 95.97

60 74.40 79.08 83.30 88.38 91.95 99.61 102.69 109.5070 85.53 90.53 95.02 100.43 104.21 112.32 115.58 122.7580 96.58 101.88 106.63 112.33 116.32 124.84 128.26 135.7890 107.57 113.15 118.14 124.12 128.30 137.21 140.78 148.63

100 118.50 124.34 129.56 135.81 140.17 149.45 153.17 161.32

200 226.02 233.99 241.06 249.45 255.26 267.54 272.42 283.06
300 331.79 341.40 349.87 359.91 366.84 381.43 387.20 399.76
400 436.65 447.63 457.31 468.72 476.61 493.13 499.67 513.84
500 540.93 553.13 563.85 576.49 585.21 603.45 610.65 626.24
600 644.80 658.09 669.77 683.52 692.98 712.77 720.58 737.46

700 748.36 762.66 775.21 789.97 800.13 821.35 829.71 847.78800 851.67 866.91 880.28 895.98 906.79 929.33 938.21 957.38900 954.78 970.90 985.03 1001.63 1013.04 1036.83 1046.19 1066.401000 1057.72 1074.68 1089.53 1106.97 1118.95 1143.92 1153.74 1174.931500 1570.61 1609.23 1630.35 1644.84 1674.97 1686.81 1712.30

2000 2081.47 2105.15 2125.84 2150.07 2166.16 2214.68 2243.812500 2591.04 2617.43 2640.47 2667.43 2685.89 2724.22 2739.25 2771.573000 3099.69 3128.54 3153.70 3183.13 3203.28 3245.08 3261.45 3296.663500 3607.64 3638.75 3665.87 3697.57 3719.26 3764.26 3781.87 3819.744000 4115.05 4148.25 4177.19 4211.01 4234.14 4282.11 4300.88 4341.22

4500 4622.00 4657.17 4689.83 4723.63 4748.12 4798.87 4818.73 4861.40
5000 5128.58 5165.61 5197.88 5235.57 5261.34 5314.73 5335.62 5380.48
5500 5634.83 5673.64 5707.45 5746.93 5773.91 5829.81 5851.68 5898.63
6000 6140.81 6181.31 6216.59 6257.78 6285.92 6344.23 6367.02 6415.98
6500 6646.54 6688.67 6725.36 6768.18 6797.45 6858.05 6881.74 6932.61

7000 7152.06 7195.75 7233.79 7278.19 7308.53 7371.35 7395.90 7448.627500 7657.38 7702.58 7741.93 7787.86 7819.23 7884.18 7909.57 7964.068000 8162.53 8209.19 8249.81 8297.20 8329.58 8396.59 8422.78 8479.008500 8667.52 8715.59 8757.44 8806.26 8839.60 8908.62 8935.59 8993.489000 9172.36 9221.81 9264.85 9315.05 9349.34 9420.30 9448.03 9507.53

9500 9677.07 9727.86 9772.05 9823.60 9858.81 9931.67 9960.13 10021.21
10000 10181.66 10233.75 10279.07 10331.93 10368.03 10442.73 10471.91 10534.52

TABLE C.4

Critical Values for the F Distribution

For given values of ν1 and ν2, this table contains values of F(0.10; ν1, ν2), defined by Prob[F ≥ F(0.10; ν1, ν2)] = α = 0.10.

ν2 \ ν1 1 2 3 4 5 6 7 8 9 10 50 100 ∞

1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 60.19 62.69 63.01 63.33
2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 9.39 9.47 9.48 9.49
3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 5.23 5.15 5.14 5.13
4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 3.92 3.80 3.78 3.76

5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 3.30 3.15 3.13 3.106 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 2.94 2.77 2.75 2.727 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72 2.70 2.52 2.50 2.478 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 2.54 2.35 2.32 2.299 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 2.42 2.22 2.19 2.16

10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32 2.12 2.09 2.0611 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25 2.04 2.01 1.9712 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19 1.87 1.94 1.9013 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14 1.92 1.88 1.8514 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10 1.87 1.83 1.80

15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06 1.83 1.79 1.7616 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03 1.79 1.76 1.7217 3.03 2.64 2.44 2.31 2.22 2.5 2.10 2.06 2.03 2.00 1.76 1.73 1.6918 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98 1.74 1.70 1.6619 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96 1.71 1.67 1.63

20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94 1.69 1.65 1.6125 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87 1.61 1.56 1.5250 2.81 2.41 2.20 2.06 1.97 1.90 1.84 1.80 1.76 1.73 1.44 1.39 1.34100 2.76 2.36 2.14 2.00 1.91 1.83 1.78 1.73 1.69 1.66 1.35 1.29 1.20

2.71 2.30 2.08 1.94 1.85 1.77 1.72 1.67 1.63 1.60 1.24 1.17 1.00

F v v0 1 1 2. , , F v v0 1 1 2. , ,


TABLE C.4 (CONTINUED)

Critical Values for the F Distribution

For given values of ν1 and ν2, this table contains values of F0.05,ν1,ν2, defined by Prob[F ≥ F0.05,ν1,ν2] = 0.05.

ν2\ν1      1      2      3      4      5      6      7      8      9     10     50    100      ∞

   1   161.4  199.5  215.7  224.6  230.2  234.0  236.8  238.9  240.5  241.9  251.8  253.0  254.3
   2   18.51  19.00  19.16  19.25  19.30  19.33  19.35  19.37  19.38  19.40  19.48  19.49  19.50
   3   10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81   8.79   8.58   8.55   8.53
   4    7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.70   5.66   5.63

   5    6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77   4.74   4.44   4.41   4.36
   6    5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   3.75   3.71   3.67
   7    5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.64   3.32   3.27   3.23
   8    5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.35   3.02   2.97   2.93
   9    5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.14   2.80   2.76   2.71

  10    4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.64   2.59   2.54
  11    4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90   2.85   2.51   2.46   2.40
  12    4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80   2.75   2.40   2.35   2.30
  13    4.67   3.81   3.41   3.18   3.03   2.92   2.83   2.77   2.71   2.67   2.31   2.26   2.21
  14    4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.65   2.60   2.24   2.19   2.13

  15    4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59   2.54   2.18   2.12   2.07
  16    4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54   2.49   2.12   2.07   2.01
  17    4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55   2.49   2.45   2.08   2.02   1.96
  18    4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46   2.41   2.04   1.98   1.92
  19    4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48   2.42   2.38   2.00   1.94   1.88

  20    4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.39   2.35   1.97   1.91   1.84
  25    4.24   3.39   2.99   2.76   2.60   2.49   2.40   2.34   2.28   2.24   1.84   1.78   1.71
  50    4.03   3.18   2.79   2.56   2.40   2.29   2.20   2.13   2.07   2.03   1.60   1.52   1.45
 100    3.94   3.09   2.70   2.46   2.31   2.19   2.10   2.03   1.97   1.93   1.48   1.39   1.28

   ∞    3.84   3.00   2.60   2.37   2.21   2.10   2.01   1.94   1.88   1.83   1.35   1.25   1.00


TABLE C.4 (CONTINUED)

Critical Values for the F Distribution

For given values of ν1 and ν2, this table contains values of F0.025,ν1,ν2, defined by Prob[F ≥ F0.025,ν1,ν2] = 0.025.

ν2\ν1      1      2      3      4      5      6      7      8      9     10     50    100      ∞

   1   647.8  799.5  864.2  899.6  921.8  937.1  948.2  956.7  963.3  968.6   1008   1013   1018
   2   38.51  39.00  39.17  39.25  39.30  39.33  39.36  39.37  39.39  39.40  39.48  39.49  39.50
   3   17.44  16.04  15.44  15.10  14.88  14.73  14.62  14.54  14.47  14.42  14.01  13.96  13.90
   4   12.22  10.65   9.98   9.60   9.36   9.20   9.07   8.98   8.90   8.84   8.38   8.32   8.26

   5   10.01   8.43   7.76   7.39   7.15   6.98   6.85   6.76   6.68   6.62   6.14   6.08   6.02
   6    8.81   7.26   6.60   6.23   5.99   5.82   5.70   5.60   5.52   5.46   4.98   4.92   4.85
   7    8.07   6.54   5.89   5.52   5.29   5.12   4.99   4.90   4.82   4.76   4.28   4.21   4.14
   8    7.57   6.06   5.42   5.05   4.82   4.65   4.53   4.43   4.36   4.30   3.81   3.74   3.67
   9    7.21   5.71   5.08   4.72   4.48   4.32   4.20   4.10   4.03   3.96   3.47   3.40   3.33

  10    6.94   5.46   4.83   4.47   4.24   4.07   3.95   3.85   3.78   3.72   3.22   3.15   3.08
  11    6.72   5.26   4.63   4.28   4.04   3.88   3.76   3.66   3.59   3.53   3.03   2.96   2.88
  12    6.55   5.10   4.47   4.12   3.89   3.73   3.61   3.51   3.44   3.37   2.87   2.80   2.72
  13    6.41   4.97   4.35   4.00   3.77   3.60   3.48   3.39   3.31   3.25   2.74   2.67   2.60
  14    6.30   4.86   4.24   3.89   3.66   3.50   3.38   3.29   3.21   3.15   2.64   2.56   2.49

  15    6.20   4.77   4.15   3.80   3.58   3.41   3.29   3.20   3.12   3.06   2.55   2.47   2.40
  16    6.12   4.69   4.08   3.73   3.50   3.34   3.22   3.12   3.05   2.99   2.47   2.40   2.32
  17    6.04   4.62   4.01   3.66   3.44   3.28   3.16   3.06   2.98   2.92   2.41   2.33   2.25
  18    5.98   4.56   3.95   3.61   3.38   3.22   3.10   3.01   2.93   2.87   2.35   2.27   2.19
  19    5.92   4.51   3.90   3.56   3.33   3.17   3.05   2.96   2.88   2.82   2.30   2.22   2.13

  20    5.87   4.46   3.86   3.51   3.29   3.13   3.01   2.91   2.84   2.77   2.25   2.17   2.09
  25    5.69   4.29   3.69   3.35   3.13   2.97   2.85   2.75   2.68   2.61   2.08   2.00   1.91
  50    5.34   3.97   3.39   3.05   2.83   2.67   2.55   2.46   2.38   2.32   1.75   1.66   1.54
 100    5.18   3.83   3.25   2.92   2.70   2.54   2.42   2.32   2.24   2.18   1.59   1.48   1.37

   ∞    5.02   3.69   3.12   2.79   2.57   2.41   2.29   2.19   2.11   2.05   1.43   1.27   1.00


TABLE C.4 (CONTINUED)

Critical Values for the F Distribution

For given values of ν1 and ν2, this table contains values of F0.01,ν1,ν2, defined by Prob[F ≥ F0.01,ν1,ν2] = 0.01.

ν2\ν1      1      2      3      4      5      6      7      8      9     10     50    100      ∞

   1    4052   5000   5403   5625   5764   5859   5928   5981   6022   6056   6303   6334   6366
   2   98.50  99.00  99.17  99.25  99.30  99.33  99.36  99.37  99.39  99.40  99.48  99.49  99.50
   3   34.12  30.82  29.46  28.71  28.24  27.91  27.67  27.49  27.35  27.23  26.35  26.24  26.13
   4   21.20  18.00  16.69  15.98  15.52  15.21  14.98  14.80  14.66  14.55  13.69  13.58  13.46

   5   16.26  13.27  12.06  11.39  10.97  10.67  10.46  10.29  10.16  10.05   9.24   9.13   9.02
   6   13.75  10.92   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87   7.09   6.99   6.88
   7   12.25   9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.72   6.62   5.86   5.75   5.65
   8   11.26   8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.91   5.81   5.07   4.96   4.86
   9   10.56   8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.35   5.26   4.52   4.41   4.31

  10   10.04   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.94   4.85   4.12   4.01   3.91
  11    9.65   7.21   6.22   5.67   5.32   5.07   4.89   4.74   4.63   4.54   3.81   3.71   3.60
  12    9.33   6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.39   4.30   3.57   3.47   3.36
  13    9.07   6.70   5.74   5.21   4.86   4.62   4.44   4.30   4.19   4.10   3.38   3.27   3.17
  14    8.86   6.51   5.56   5.04   4.69   4.46   4.28   4.14   4.03   3.94   3.22   3.11   3.00

  15    8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.89   3.80   3.08   2.98   2.87
  16    8.53   6.23   5.29   4.77   4.44   4.20   4.03   3.89   3.78   3.69   2.97   2.86   2.75
  17    8.40   6.11   5.18   4.67   4.34   4.10   3.93   3.79   3.68   3.59   2.87   2.76   2.65
  18    8.29   6.01   5.09   4.58   4.25   4.01   3.84   3.71   3.60   3.51   2.78   2.68   2.57
  19    8.18   5.93   5.01   4.50   4.17   3.94   3.77   3.63   3.52   3.43   2.71   2.60   2.49

  20    8.10   5.85   4.94   4.43   4.10   3.87   3.70   3.56   3.46   3.37   2.64   2.54   2.42
  25    7.77   5.57   4.68   4.18   3.85   3.63   3.46   3.32   3.22   3.13   2.40   2.29   2.17
  50    7.17   5.06   4.20   3.72   3.41   3.19   3.02   2.89   2.78   2.70   1.95   1.82   1.70
 100    6.90   4.82   3.98   3.51   3.21   2.99   2.82   2.69   2.59   2.50   1.74   1.60   1.45

   ∞    6.63   4.61   3.78   3.32   3.02   2.80   2.64   2.51   2.41   2.32   1.53   1.32   1.00


TABLE C.4 (CONTINUED)

Critical Values for the F Distribution

For given values of ν1 and ν2, this table contains values of F0.005,ν1,ν2, defined by Prob[F ≥ F0.005,ν1,ν2] = 0.005.

ν2\ν1      1      2      3      4      5      6      7      8      9     10     50    100      ∞

   1   16211  20000  21615  22500  23056  23437  23715  23925  24091  24224  25211  25337  25465
   2   198.5  199.0  199.2  199.2  199.3  199.3  199.4  199.4  199.4  199.4  199.5  199.5  199.5
   3   55.55  49.80  47.47  46.19  45.39  44.84  44.43  44.13  43.88  43.69  42.21  42.02  41.83
   4   31.33  26.28  24.26  23.15  22.46  21.97  21.62  21.35  21.14  20.97  19.67  19.50  19.32

   5   22.78  18.31  16.53  15.56  14.94  14.51  14.20  13.96  13.77  13.62  12.45  12.30  12.14
   6   18.63  14.54  12.92  12.03  11.46  11.07  10.79  10.57  10.39  10.25   9.17   9.03   8.88
   7   16.24  12.40  10.88  10.05   9.52   9.16   8.89   8.68   8.51   8.38   7.35   7.22   7.08
   8   14.69  11.04   9.60   8.81   8.30   7.95   7.69   7.50   7.34   7.21   6.22   6.09   5.95
   9   13.61  10.11   8.72   7.96   7.47   7.13   6.88   6.69   6.54   6.42   5.45   5.32   5.19

  10   12.83   9.43   8.08   7.34   6.87   6.54   6.30   6.12   5.97   5.85   4.90   4.77   4.64
  11   12.23   8.91   7.60   6.88   6.42   6.10   5.86   5.68   5.54   5.42   4.49   4.36   4.23
  12   11.75   8.51   7.23   6.52   6.07   5.76   5.52   5.35   5.20   5.09   4.17   4.04   3.90
  13   11.37   8.19   6.93   6.23   5.79   5.48   5.25   5.08   4.94   4.82   3.91   3.78   3.65
  14   11.06   7.92   6.68   6.00   5.56   5.26   5.03   4.86   4.72   4.60   3.70   3.57   3.44

  15   10.80   7.70   6.48   5.80   5.37   5.07   4.85   4.67   4.54   4.42   3.52   3.39   3.26
  16   10.58   7.51   6.30   5.64   5.21   4.91   4.69   4.52   4.38   4.27   3.37   3.25   3.11
  17   10.38   7.35   6.16   5.50   5.07   4.78   4.56   4.39   4.25   4.14   3.25   3.12   2.98
  18   10.22   7.21   6.03   5.37   4.96   4.66   4.44   4.28   4.14   4.03   3.14   3.01   2.87
  19   10.07   7.09   5.92   5.27   4.85   4.56   4.34   4.18   4.04   3.93   3.04   2.91   2.78

  20    9.94   6.99   5.82   5.17   4.76   4.47   4.26   4.09   3.96   3.85   2.96   2.83   2.69
  25    9.48   6.60   5.46   4.84   4.43   4.15   3.94   3.78   3.64   3.54   2.65   2.52   2.38
  50    8.63   5.90   4.83   4.23   3.85   3.58   3.38   3.22   3.09   2.99   2.10   1.95   1.81
 100    8.24   5.59   4.54   3.96   3.59   3.33   3.13   2.97   2.85   2.74   1.84   1.68   1.51

   ∞    7.88   5.30   4.28   3.72   3.35   3.09   2.90   2.74   2.62   2.52   1.60   1.36   1.00


TABLE C.4 (CONTINUED)

Critical Values for the F Distribution

For given values of ν1 and ν2, this table contains values of F0.001,ν1,ν2, defined by Prob[F ≥ F0.001,ν1,ν2] = 0.001.

ν2\ν1      1      2      3      4      5      6      7      8      9     10     50    100      ∞

   2   998.5  999.0  999.2  999.2  999.3  999.3  999.4  999.4  999.4  999.4  999.5  999.5  999.5
   3   167.0  148.5  141.1  137.1  134.6  132.8  131.6  130.6  129.9  129.2  124.7  124.1  123.5
   4   74.14  61.25  56.18  53.44  51.71  50.53  49.66  49.00  48.47  48.05  44.88  44.47  44.05

   5   47.18  37.12  33.20  31.09  29.75  28.83  28.16  27.65  27.24  26.92  24.44  24.12  23.79
   6   35.51  27.00  23.70  21.92  20.80  20.03  19.46  19.03  18.69  18.41  16.31  16.03  15.75
   7   29.25  21.69  18.77  17.20  16.21  15.52  15.02  14.63  14.33  14.08  12.20  11.95  11.70
   8   25.41  18.49  15.83  14.39  13.48  12.86  12.40  12.05  11.77  11.54   9.80   9.57   9.33
   9   22.86  16.39  13.90  12.56  11.71  11.13  10.70  10.37  10.11   9.89   8.26   8.04   7.81

  10   21.04  14.91  12.55  11.28  10.48   9.93   9.52   9.20   8.96   8.75   7.19   6.98   6.76
  11   19.69  13.81  11.56  10.35   9.58   9.05   8.66   8.35   8.12   7.92   6.42   6.21   6.00
  12   18.64  12.97  10.80   9.63   8.89   8.38   8.00   7.71   7.48   7.29   5.83   5.63   5.42
  13   17.82  12.31  10.21   9.07   8.35   7.86   7.49   7.21   6.98   6.80   5.37   5.17   4.97
  14   17.14  11.78   9.73   8.62   7.92   7.44   7.08   6.80   6.58   6.40   5.00   4.81   4.60

  15   16.59  11.34   9.34   8.25   7.57   7.09   6.74   6.47   6.26   6.08   4.70   4.51   4.31
  16   16.12  10.97   9.01   7.94   7.27   6.80   6.46   6.19   5.98   5.81   4.45   4.26   4.06
  17   15.72  10.66   8.73   7.68   7.02   6.56   6.22   5.96   5.75   5.58   4.24   4.05   3.85
  18   15.38  10.39   8.49   7.46   6.81   6.35   6.02   5.76   5.56   5.39   4.06   3.87   3.67
  19   15.08  10.16   8.28   7.27   6.62   6.18   5.85   5.59   5.39   5.22   3.90   3.71   3.51

  20   14.82   9.95   8.10   7.10   6.46   6.02   5.69   5.44   5.24   5.08   3.77   3.58   3.38
  25   13.88   9.22   7.45   6.49   5.89   5.46   5.15   4.91   4.71   4.56   3.28   3.09   2.89
  50   12.22   7.96   6.34   5.46   4.90   4.51   4.22   4.00   3.82   3.67   2.44   2.25   2.06
 100   11.50   7.41   5.86   5.02   4.48   4.11   3.83   3.61   3.44   3.30   2.08   1.87   1.65

   ∞   10.83   6.91   5.42   4.62   4.10   3.74   3.47   3.27   3.10   2.96   1.75   1.45   1.00
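Critical values such as those in Tables C.3 and C.4 can be reproduced with standard statistical software. As a minimal sketch, assuming Python with scipy is available (chi2.ppf and f.ppf are scipy.stats' inverse cumulative distribution functions):

from scipy.stats import chi2, f

# An upper-tail critical value satisfies Prob[X >= x] = alpha,
# which corresponds to x = ppf(1 - alpha).
print(chi2.ppf(1 - 0.05, df=40))       # ~55.76; compare Table C.3, nu = 40
print(f.ppf(1 - 0.01, dfn=6, dfd=20))  # ~3.87; compare Table C.4, alpha = 0.01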


Appendix D: Variable Transformations

D.1 Purpose of Variable Transformations

Transformations are used to present or translate data to different scales. In modeling applications, transformations are often used to improve the compatibility of data with the assumptions underlying a modeling process, to linearize a relationship between two variables whose relationship is nonlinear, or to modify the range of values of a variable. Both dependent and independent variables can be transformed.

Although transformations may result in an improvement of specific modeling assumptions, such as linearity or homoscedasticity, they can result in the violation of other assumptions. Thus, transformations must be used in an iterative fashion, with close monitoring of other modeling assumptions as transformations are made.

A difficulty arises when transforming the dependent variable Y, because the variable is then expressed in a form that may not be of interest to the investigation, such as the log of Y, the square root of Y, or the inverse of Y. When comparing statistical models, the comparisons should always be made on the original, untransformed scale of Y. These comparisons extend to goodness-of-fit statistics and model validation exercises. An example in transportation of the necessity to compare models in original units is described in Fomunung et al. (2000).
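As a brief illustration of making comparisons in original units, the following sketch (Python with numpy; all data and parameter values are hypothetical) fits a log-linear model and then computes a goodness-of-fit statistic on the untransformed scale of Y. The naive back-transformation EXP(fitted value) is used for simplicity; it ignores the retransformation bias that arises when the disturbance variance is nonzero.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 5.0, size=100)
# Hypothetical data generated with a multiplicative disturbance
y = 2.0 * np.exp(0.4 * x) * np.exp(rng.normal(0.0, 0.3, size=100))

# Ordinary least squares fit of LN(Y) = b0 + b1*X
A = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)

# Back-transform the predictions and assess fit in the original units of Y
y_hat = np.exp(A @ b)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print("RMSE in original units of Y:", rmse)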

Transformations reflect not only assumptions about the underlying relation between variables, but also the underlying disturbance structure of the model. For example, exponential transformations imply a multiplicative disturbance structure in the underlying model, and not the additive disturbance structure that is assumed in linear regression. When the underlying function Y = α EXP(βX) + ε is suspected, a log transformation gives

LN(Y) = LN(α EXP(βX) + ε) = LN[(α EXP(βX))(1 + ε/(α EXP(βX)))] = LN(α) + βX + LN(1 + ε/(α EXP(βX))).

Although this model is linear in X, the disturbance term is not the one specified in ordinary least squares regression: it is a function of X, α, β, and ε, and thus the disturbance enters multiplicatively. It is important that disturbance terms always be checked after transformations to make sure they are still compatible with modeling assumptions. Guidelines for applying transformations include the following (a brief simulation sketch follows the list):


1. Transformations on a dependent variable will change the distribution of disturbance terms in a model. Thus, incompatibility of model disturbances with an assumed distribution can sometimes be remedied with transformations of the dependent variable.

2. Nonlinearities between the dependent variable and an independent variable are often linearized by transforming the independent variable. Transformations of an independent variable do not change the distribution of disturbance terms.

3. When a relationship between a dependent and an independent variable requires extensive transformations to meet linearity and disturbance distribution requirements, there are often alternative methods for estimating the parameters of the relation, such as nonlinear and generalized least squares regression.

4. Confidence intervals computed on transformed variables need to be computed by transforming back to the original units of interest.
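The brief simulation sketch referenced above illustrates guideline 1 and the multiplicative-disturbance argument: it generates data from Y = α EXP(βX) EXP(ε) with hypothetical values α = 2 and β = 0.5, fits the log-transformed model by least squares, and summarizes the residuals on the log scale (Python with numpy; a minimal check, not a substitute for formal diagnostics).

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 0.5                      # hypothetical true parameters
x = rng.uniform(0.0, 5.0, size=200)
eps = rng.normal(0.0, 0.2, size=200)        # disturbance on the log scale
y = alpha * np.exp(beta * x) * np.exp(eps)  # multiplicative disturbance

# Ordinary least squares fit of LN(Y) = b0 + b1*X
A = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
resid = np.log(y) - A @ b

print("EXP(b0) estimates alpha:", np.exp(b[0]))
print("b1 estimates beta:", b[1])
# On the log scale these residuals should appear homoscedastic; had the true
# disturbance been additive (Y = alpha*EXP(beta*X) + eps), the residuals of
# the logged model would generally vary systematically with X.
print("residual standard deviation:", resid.std())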

D.2 Commonly Used Variable Transformations

Taken in the context of modeling the relationship between a dependent variable Y and an independent variable X, there are several motivations for transforming a variable or variables. It should be noted that most transformations attempt to make the relation between Y and X linear, because linear relationships are easier to model. Transformations applied to Y and X in their originally measured units are done merely for the convenience of the modeler, and not because of an underlying problem with the data.

The following figures show common transformations used to linearize a relationship between two random variables, X and Y. Provided are plots of the relationships between X and Y in their untransformed states, followed by examples of transformations on X, Y, or both that are used to linearize the relation.

D.2.1 Parabolic Transformations

Parabolic transformations are used to linearize a nonlinear or curvilinear relation. The parabolic transformation is used when the true relation between Y and X is given as Y = α + βX + γX². The transformation is done by simply adding a squared or quadratic term to the right-hand side of the equation, which is really more than a mere transformation.

The nature of the relationship depends on the values of α, β, and γ. Figure D.1 shows an example of a relation between Y and X when all parameters are positive, and Figure D.2 shows an example of the relation between Y and X when α and β are positive and γ is negative.
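A minimal sketch of the parabolic case (Python with numpy; the data values are hypothetical): numpy's polyfit is one convenient way to estimate the model once the quadratic term is added.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 6.1, 12.0, 19.8, 30.1, 42.3])  # hypothetical observations

# polyfit returns coefficients with the highest power first: gamma, beta, alpha
gamma, beta, alpha = np.polyfit(x, y, deg=2)
print(f"estimated relation: Y = {alpha:.2f} + {beta:.2f} X + {gamma:.2f} X^2")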


D.2.2 Hyperbolic Transformations

Hyperbolic transformations are used to linearize a variety of curvilinear shapes. A hyperbolic transformation is used when the true relation between Y and X is given as Y = X/(α + βX), as shown in Figures D.3 and D.4. By transforming both Y and X using the inverse transformation, one generates a linear relationship such that 1/Y = β0 + β1(1/X) + disturbance, where α = β1 and β = β0. Figure D.3 shows an example of a relation between Y and X when β is positive, whereas Figure D.4 shows an example of the relation between Y and X when β is negative.
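A minimal sketch of the hyperbolic linearization (Python with numpy; hypothetical parameter values, noise-free for clarity): regressing 1/Y on 1/X returns β as the intercept and α as the slope.

import numpy as np

alpha, beta = 1.5, 0.8                  # hypothetical true values
x = np.linspace(0.5, 10.0, 50)
y = x / (alpha + beta * x)

# Linearized form: 1/Y = beta + alpha*(1/X)
A = np.column_stack([np.ones_like(x), 1.0 / x])
(b0, b1), *_ = np.linalg.lstsq(A, 1.0 / y, rcond=None)
print("intercept estimates beta:", b0)  # ~0.8
print("slope estimates alpha:", b1)     # ~1.5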

FIGURE D.1
Parabolic relation (functional form: Y = α + βX + γX², where α > 0, β > 0, γ > 0).

FIGURE D.2
Parabolic relation (functional form: Y = α + βX + γX², where α > 0, β > 0, γ < 0).

FIGURE D.3
Hyperbolic relation (functional form: Y = X/(α + βX), where β > 0).


D.2.3 Exponential Functions

The natural log transformation is used to correct heterogeneous variance in some cases, and when the data exhibit curvature between Y and X of a certain type. Figures D.5 and D.6 show the nature of the relationship between Y and X for data that are linearized using the log transformation. The underlying relation is Y = α EXP(βX), where α and β are parameters of the relation. To get this relation in linear model form, both sides of the equation are transformed to obtain LN(Y) = LN(α EXP(βX)) = LN(α) + LN(EXP(βX)) = LN(α) + βX = β0 + β1X. In linearized form, β0 = LN(α) and β1 = β.

FIGURE D.4
Hyperbolic relation (functional form: Y = X/(α + βX), where β < 0).

FIGURE D.5
Exponential relation (functional form: Y = α EXP(βX), where β > 0).

FIGURE D.6
Exponential relation (functional form: Y = α EXP(βX), where β < 0).


Figure D.5 shows examples of the relation between Y and X for β > 0, and Figure D.6 shows examples of the relation between Y and X for β < 0.

D.2.4 Inverse Exponential Functions

Sometimes the exponential portion of the mechanism is proportional to the inverse of X instead of to untransformed X. The underlying relationship, which is fundamentally different from the exponential relations shown in Figures D.5 and D.6, is given by Y = α EXP(β/X). Taking the log of both sides of this relationship yields the linear model form of the relation, LN(Y) = LN(α) + β/X = β0 + β1(1/X). Figure D.7 shows examples of the inverse exponential relation when β is positive, and Figure D.8 shows examples when β is negative.
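A corresponding sketch for the inverse exponential form (Python with numpy; hypothetical values, noise-free for clarity): LN(Y) is regressed on 1/X.

import numpy as np

alpha, beta = 3.0, -2.0                 # hypothetical true values
x = np.linspace(0.5, 8.0, 60)
y = alpha * np.exp(beta / x)

# Linearized form: LN(Y) = LN(alpha) + beta*(1/X)
A = np.column_stack([np.ones_like(x), 1.0 / x])
(b0, b1), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print("EXP(b0) estimates alpha:", np.exp(b0))  # ~3.0
print("b1 estimates beta:", b1)                # ~-2.0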

D.2.5 Power Functions

Power transformations are needed when the underlying structure is of the form Y = αX^β, and transformations on both variables are needed to linearize the function. The linear form of the power function is LN(Y) = LN(αX^β) = LN(α) + β LN(X) = β0 + β1 LN(X).

FIGURE D.7
Inverse exponential functions (functional form: Y = α EXP(β/X), where β > 0).

FIGURE D.8
Inverse exponential functions (functional form: Y = α EXP(β/X), where β < 0).


The shape of the power function depends on the sign and magnitude of β. Figure D.9 depicts examples of power functions with β greater than zero, and Figure D.10 depicts examples of power functions with β less than zero.
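A final sketch for the power form (Python with numpy; hypothetical values, noise-free for clarity): both variables are log-transformed, and the slope of the log-log regression estimates β directly.

import numpy as np

alpha, beta = 2.0, 0.6                  # hypothetical true values
x = np.linspace(1.0, 50.0, 80)
y = alpha * x ** beta

# Linearized form: LN(Y) = LN(alpha) + beta*LN(X)
A = np.column_stack([np.ones_like(x), np.log(x)])
(b0, b1), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print("EXP(b0) estimates alpha:", np.exp(b0))  # ~2.0
print("b1 estimates beta:", b1)                # ~0.6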

FIGURE D.9
Power functions (functional form: Y = αX^β, where β > 0; curves shown for β > 1, β = 1, and 0 < β < 1).

FIGURE D.10
Power functions (functional form: Y = αX^β, where β < 0; curves shown for β = −1, β < −1, and −1 < β < 0).