
Decision 411: Forecasting
Fall Term 1, 2009
Professor Bob Nau
Notes for Class 1

1.1 Introduction
    a. Statistical forecasting
    b. Signal vs. noise
    c. Some simple cases
    d. Forecasting risks

1.2 Get to know your data
    a. Acquiring, merging, synchronizing, and cleaning
    b. Bits and bytes, text and numbers
    c. Grabbing data off web pages and document files

1.3 The (not-so) simplest model: a constant forecast
    a. Almost all you need to remember about basic statistics
    b. Forecasting with the mean model

1.1 INTRODUCTION

"I have seen the future and it is very much like the past, only longer." —K. Albran, The Profit

1.1(a) Statistical forecasting

Forecasting can take many forms—staring into crystal balls or bowls of tea leaves, seeking the opinions of experts, drawing historical parallels, brainstorming, scenario analysis, creating Monte Carlo simulation models by inserting random variables in spreadsheet models, building econometric models based on economic theories, building meteorological or climatological models based on physical theories, etc.—but statistical forecasting, which is the main topic of this course, is the art and science of forecasting from data, with or without a solid theory to guide you. The key idea is simple: look for statistical patterns in the data that were stable in the past and which, after closer analysis, you believe will remain stable in the future. In other words, figure out the way in which the future will look very much like the past, only longer.

The devil is in the details, though. When you first look at your data, you may see a lot of bumps and wiggles and curves and X-Y relationships whose pattern is not obvious, or (what is often worse) you may see patterns that aren't really there. Some important patterns may not be visible because you have not looked at the data in the right way, or identified all the relevant explanatory variables, or thought about all their possible connections. These obstacles to pattern-recognition can be overcome by using the statistical tools we will study in this course: techniques of exploratory data analysis, use of data transformations to decompose complex patterns into simpler ones, identifying plausible models based on the patterns that you see and other knowledge you may have, and eventually fitting the models to data and comparing their goodness-of-fit in both absolute and relative terms. By the end of the day you hope to come up with a model that makes good predictions, whose margins of error are known, and which tells you some things you didn't already know about trends and causal factors that are important for your general understanding of your business or your field.

1.1(b) Signal vs. noise

The variable you want to forecast can be viewed as a combination of signal and noise. The signal is the predictable component, and the noise is what is left over. The term "noise" is intended to conjure up an analogy with the sort of noise or static that you hear on a busy street or when listening to a cell phone with a weak signal. As a matter of fact, audible noise and noise in your data are statistically the same thing: if you digitize a noisy audio signal and analyze it on the computer, it looks just like noisy data, and if you play noisy data back through a speaker, it sounds just like ordinary noise. The technical term for a data series that is pure noise is a sequence of "independent identically-distributed (i.i.d.) random variables." The RAND() function in Excel generates i.i.d. random values uniformly distributed between 0 and 1 whenever it is recalculated or copied. The Crystal Ball software package that is used in Fuqua's Decision Models course has additional random functions for generating i.i.d. random variables of other types (normal, beta, gamma, binomial, etc.). In this course we won't be creating our own random variables; we will be analyzing ones that arise naturally.

It is up to you to find a forecasting model (in the form of a mathematical equation) that finds the signal buried in the noise and extrapolates it in an appropriate fashion. This is not always easy, because sometimes it is hard to separate the two. On one hand, very complex patterns in data may look quite random until you know what you are looking for; on the other hand, data that are completely random may appear to the naked eye to have some kind of interesting pattern. Sensitive statistical tests are needed to get a better idea of whether the pattern you see in the data is really random or whether there is some signal yet to be extracted. If you fail to detect a signal that is really there, or falsely detect a signal that isn't really there, your forecasts will be at best suboptimal and at worst dangerously misleading.

1.1(c) Some simple cases

Some of the simplest signal-noise patterns that you find in data are i.i.d. variations around a line or curve, or i.i.d. changes relative to an earlier period in time. The very simplest case is that of i.i.d. variations around a horizontal line, like this:


[Figure: time sequence plot for X (constant mean = 49.9744), showing actual values, forecasts, and 95% limits]

The appropriate forecasting model for this series is the mean model (a.k.a. constant model), in which you simply predict that the series will equal its mean value in every period. A variable that measures a response to a survey question, or a property of objects coming off an assembly line, or customers arriving at a service location might look like this. Variables that have more complicated patterns might look like this after some mathematical transformation. The mean model is not entirely trivial: you need to estimate the mean as well as the standard deviation of the variations around the mean, and the standard deviation needs to be used appropriately to calculate confidence intervals for the predictions. The accuracy of the parameter estimates and forecasts will also depend on the sample size in a systematic way. These issues are discussed in more detail later in this handout.
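If you want to see for yourself what pure noise looks like, it is easy to generate. Here is a minimal Python sketch along the lines of filling a worksheet column with RAND() values; the seed and the shift-and-scale to a mean near 50 are arbitrary choices for illustration, not anything from the plot above.

    import random

    random.seed(411)  # fixed seed so the example is reproducible
    # 100 i.i.d. values uniformly distributed around 50, like a column
    # of =50+20*(RAND()-0.5) formulas in Excel
    x = [50 + 20 * (random.random() - 0.5) for _ in range(100)]

    n = len(x)
    mean = sum(x) / n
    sd = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    print(f"sample mean = {mean:.2f}, sample std. dev. = {sd:.2f}")
    # The mean model would forecast every future period to equal `mean`.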


A slightly more interesting pattern is that of i.i.d. variations around a sloping line on a plot of your variable-of-interest versus some other variable, which indicates some degree of correlation between the two variables. (The coefficient of correlation between two variables measures the strength of the linear relationship between them, on a scale of -1 to +1. We'll discuss that in depth in the next class.)

[Figure: time sequence plot for Z (linear trend = 51.5832 + 0.968143·t), showing actual values, forecasts, and 95% limits]

The appropriate forecasting model in this case would be a simple regression model. If the X-axis is the time index, it is called a trend-line model. If you transform the variable by "de-trending," i.e., subtracting out the trend line, it becomes a time series that looks like it was generated by the mean model.1

1 In terms of its statistical properties, a series that has been de-trended by subtracting out an estimated trend line is not exactly like a series that is sampled from a random process with zero trend. The de-trended series has an estimated trend of exactly zero within the data sample (because by definition it has been adjusted to zero out the estimated trend), whereas a series that is sampled from a random process with no trend will generally have a non-zero but not statistically significant estimated trend due to sampling error.
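To make the de-trending transformation concrete, here is a minimal Python sketch with a fabricated series standing in for Z (the trend coefficients are invented for illustration, not taken from the plot above):

    import numpy as np

    t = np.arange(1, 101)                      # time index 1..100
    rng = np.random.default_rng(411)
    z = 50 + 1.0 * t + rng.normal(0, 10, 100)  # i.i.d. noise around a line

    slope, intercept = np.polyfit(t, z, 1)     # estimate the trend line
    detrended = z - (intercept + slope * t)    # subtract it out

    # The de-trended series looks like output of the mean model: its
    # estimated trend is exactly zero within the sample (see footnote 1).
    print(np.polyfit(t, detrended, 1)[0])      # ~0, up to rounding error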


Another more interesting pattern is that of a time series that undergoes i.i.d. changes from one period to the next, which might look like this:

[Figure: time sequence plot for Y (random walk with drift = 0.434184), showing actual values, forecasts, and 95% limits]

The appropriate forecasting model for this series is the random walk model, so called because the variable takes random steps up and down as it goes forward. This model is of profound importance in financial data analysis. If you transform the variable by computing its period-to-period changes (the "first difference" of the time series), it becomes a time series that is described by the mean model. What is particularly interesting about the random walk model is the precise way in which the confidence limits for the forecasts get wider at longer forecast horizons, which is central to the theory of option pricing in finance.

Notice that there appear to be some interesting patterns in this graph, e.g., a series of regular peaks and valleys from period 60 onward. This is just a statistical illusion: the time series was artificially generated using i.i.d. random steps. This is typical of random walk patterns—they don't look as random as they are! You need to analyze the statistical properties of the steps (in particular, their autocorrelations) to determine whether they are truly random or whether they have non-random properties such as momentum or mean-reversion or periodicity.
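Here is a small Python sketch of these points: it builds a random walk out of i.i.d. steps and then recovers the steps by taking the first difference (the drift and step size are arbitrary). The walk itself may appear to have waves and cycles, but the lag-1 autocorrelation of the steps comes out near zero.

    import numpy as np

    rng = np.random.default_rng(2009)
    steps = rng.normal(0.4, 3.0, 120)  # i.i.d. steps with drift 0.4
    y = np.cumsum(steps)               # the random walk itself

    # First difference: recovers the i.i.d. steps, which the mean model
    # describes.
    diff = np.diff(y)

    # Lag-1 autocorrelation of the steps: near zero if truly random.
    r1 = np.corrcoef(diff[:-1], diff[1:])[0, 1]
    print(f"lag-1 autocorrelation of steps: {r1:+.3f}")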


Another commonly seen pattern in time series data is that of i.i.d. variations around a seasonal pattern, which might look like this:

[Figure: time sequence plot of RetailxAutoNSA with forecasts, 1992-2006 (values ×100,000)]

This pattern is seen in retail sales, travel, tourism, housing construction, power and water consumption, agriculture, and many other measures of economic activity and environmental conditions. The noise is often not immediately apparent on such a graph, because the random variations are usually small in comparison to the seasonal variations, but the random variations are very important in seasonally-adjusted terms. Here, if you look closely, you will see that the seasonal variations are not identical and there are also changes in short-term trends due to business-cycle effects. Another important type of pattern is that of correlations among many different variables. You are probably familiar with X-Y scatterplots in which one variable is plotted against another. When working with many variables, we often want to look at an array of scatterplots between all pairs of variables, which is a so-called scatterplot matrix like the following:

[Figure: scatterplot matrix of the variables age, features, price, sqfeet, and tax]

A scatterplot matrix is the visual companion of a correlation matrix that shows all pairwise coefficients of correlation between variables. (Both of our software packages—FSBstats+ and Statgraphics—will draw scatterplot matrices for you.) Here the data set consists of facts about houses sold in a community in a given year, and the object might be to study how the selling price is correlated with features of the house such as its age, its square footage, its tax value, and the number of features it includes out of some set of desirable features. By eyeballing the scatterplot matrix you can see at a glance whether there are strong linear or nonlinear relationships among any pairs of variables, and you can also quickly identify qualitative properties of their individual distributions, such as whether they are randomly scattered or have only a few possible values or have any extreme values ("outliers"). Here the selling price seems to be linearly related to all 4 of the other variables, as suggested by the red lines that have been drawn on the plots in which price is the Y-axis variable.2 You can also see that the "features" variable has only 8 possible values, which are evidently integers from 0 to 7, since it is a counting variable. Complex patterns like this are usually fitted with multiple regression models. (The red lines here are not regression lines, just hand-drawn lines that have been added for visual emphasis.)

The examples shown above illustrate some of the most basic patterns we will look for—means and standard deviations that may or may not be stable, trends that may or may not be linear, walks up and down that may or may not be random, seasonality, cyclical patterns, correlations among many variables—and suggest some forecasting models that might be used to explain them. These basic models will serve as building blocks for more sophisticated models that we will study later on.

1.1(d) Risks of forecasting

Forecasting is a risky enterprise: those who live by the crystal ball often end up eating broken glass. There are three distinct sources of forecasting risk and corresponding ways to measure and try to reduce them. I'll state them here in general terms, and we will discuss them in more detail in the context of the mean model (later in this handout) and in other models that will be covered later.

(i) Intrinsic risk is random variation that is beyond explanation with the data and tools you have available. It’s the noise in the system. The intrinsic risk is usually measured in terms of the “standard error of the model,” which is the estimated standard deviation of the noise in the variable you are trying to predict.3 Although there is always some intrinsic risk (the future is always to some extent unpredictable), your estimate of its magnitude can sometimes be reduced by refining a model so that it finds additional patterns in what was previously thought to be noise, e.g., by using more or better explanatory variables. This doesn’t necessarily mean the original model was “wrong,” but merely that it was too simplistic or under-informed.

(ii) Parameter risk is the risk due to errors in estimating the parameters of the forecasting model you are using, assuming that you are fitting the correct model to the data. This is usually a much smaller source of forecast error than intrinsic risk, if the model is really correct. Parameter risk is measured in terms of the "standard errors of the coefficient estimates" in a forecasting model. Parameter risk can be reduced in principle by obtaining a larger sample of data. However, when you are predicting time series, more sample data is not always better: using a larger sample might mean including older data that is not as representative of current conditions. No pattern really stays the same forever; this difficulty is known as the "blur of history" problem.

2 In a scatterplot matrix, a given variable is the Y-axis variable in all of the plots in its corresponding row, and it is the X-axis variable in all of the plots in its corresponding column.

3 In general the term "standard error" refers to the estimated standard deviation of the error that is being made in estimating a coefficient or predicting a value of some variable.

The standard error of the forecast that a model ultimately produces is computed from a formula that involves the standard error of the model and the standard errors of the coefficients, thus taking into account both the intrinsic risk and the parameter risk. The standard error of the forecast is always greater than the standard error of the model, i.e., the standard error of the model is a lower bound on the standard error of the forecast. How much greater it is depends on the standard errors of the various coefficients.

To the extent that you can’t reduce or eliminate intrinsic risk and parameter risk, you can and should try to quantify them in realistic terms, so as to be honest in your reporting and so that appropriate risk-return tradeoffs can be made when decisions are based on the forecast.

(iii) Model risk is the risk of choosing the wrong model, i.e., making the wrong assumptions about whether or how the future will resemble the past. This is usually the most serious form of forecast error, and there is no "standard error" for measuring it, because every model assumes itself to be correct. Model risk can be reduced by following good statistical practices, which I will emphasize throughout this course. In fact, you might say that this course is mostly about how to choose the right forecasting model rather than the wrong one. If you follow good practices for exploring the data, understanding the assumptions that are behind the models, and testing the assumptions as a routine part of your analysis, you are much less likely to make serious errors in choosing a model. The risk of choosing the wrong model is very high if you try to rely on "automatic" forecasting methods without understanding your own data, systematically exploring it, using your own judgment and experience, and carefully testing the model assumptions. There is no magic bullet—that's why you need to take this course!

How do you know when your model is good, or at least not obviously bad? One very basic test of your model is whether its errors really look like pure noise, i.e., independent and identically distributed random values. If the errors are not pure noise, then by definition there is some pattern in them, and you could make them smaller by adjusting the model to explain that pattern. We will study a number of statistical tests that are used to determine whether the errors of a forecasting model are truly random.
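As a preview, here is one crude version of such a check, sketched in Python: the lag-1 autocorrelation of the errors, which should be near zero if the errors are pure noise. (The formal tests we will study are more sensitive than this; the series below are fabricated for illustration.)

    import numpy as np

    def lag1_autocorr(errors):
        """Correlation between each error and the one before it."""
        e = np.asarray(errors) - np.mean(errors)
        return np.sum(e[:-1] * e[1:]) / np.sum(e * e)

    rng = np.random.default_rng(1)
    pure_noise = rng.normal(0, 1, 200)
    print(f"{lag1_autocorr(pure_noise):+.3f}")  # near 0: nothing left

    # Errors with leftover pattern (here, a sine signal the model missed):
    patterned = pure_noise + np.sin(np.arange(200) / 5)
    print(f"{lag1_autocorr(patterned):+.3f}")   # well above 0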


1.2 GET TO KNOW YOUR DATA

1.2(a) Acquiring, merging, synchronizing, and cleaning it

To construct a statistical forecasting model, the first thing you need to know (or find out) is where to look for data that may be relevant.4 Vast amounts of data are available on-line, including free databases provided by agencies of the federal government, proprietary databases maintained by research firms and trade associations, internal company databases, news services, scientific journals, etc. The Fuqua library has subscriptions to many commercial databases that are used in business research. If the data that you need is not directly accessible via the internet or your school or company network—e.g., if you find an interesting data set mentioned in a newspaper or journal article—you may be able to obtain a copy of it by simply e-mailing the author.

After locating the data you wish to use, you need to obtain it in a useful file format. Most sources nowadays are capable of providing data directly in the form of Excel files, although you may sometimes find that you need to obtain data in some other form (e.g., text form) and then import it into Excel. (More about this later.) If your data comes from multiple sources, you will need to merge it together in an appropriate way, e.g., by copying and pasting columns of data from different sources onto the same Excel worksheet and/or by setting up named ranges on different worksheets in the same workbook. When doing this, you need to pay close attention to the time scales or row indices of the different variables in order to make sure that they are properly aligned for analysis. If your variables are misaligned on your spreadsheet, you may end up inadvertently building a model that "cheats" by predicting past values of one variable from future values of other variables, or even from future values of itself. It may have great stats, but it is useless for predicting the future. (Don't laugh, this happens!)

Before starting to analyze the data you have obtained, you should ask:

• Where did it come from? (How was it originally recorded? By whom? How frequently? Under what conditions?)

• Where has it been? (What other systems has it passed through? How has it been adjusted, aggregated, averaged, or otherwise massaged?)

• Is it clean or dirty? (Are there data entry errors? Missing data? Misalignment of time periods? Changes in reporting practices? Bizarre events?) And last but not least...

• In what units is it measured? Is it measured in monthly totals or an annual rate? In nominal or constant (inflation-adjusted) units of currency? Does it represent the current level of something, or does it represent the absolute change from one period to another, or the percentage change from one period to another? Has it been seasonally adjusted, and if so, how? Are the units consistent from one variable to another, e.g., do some variables represent quantities of goods or advertising spots while others represent dollar sales of the same goods or dollar expenditures on the same advertising? Which units should you use, and how should you expect them to be statistically related to each other?

4 If your instructor has not already provided it to you…


Later, when you write up the results of your analysis, the variables in your data set should be clearly annotated to indicate their sources, units of measurement, and any problems or peculiarities you are aware of.

When gathering data from different sources, you need to pay close attention to the definitions and descriptions of the variables so that you know exactly how they should be lined up with each other and how they should be transformed (if necessary) to be in units that are mutually compatible and which shed the right light on the problem you are trying to solve. For example, you may start out with a variable measured in units of dollars—say, consumer spending on some type of commodity. You may or may not wish to adjust the series for inflation by dividing it by a general price index so as to convert from units of nominal dollars to constant dollars (i.e., equivalent purchasing power). Or you may wish to adjust by a product-specific price index to convert from units of dollars to (approximate) quantities sold. Or you may wish to divide by population in order to measure consumption in per-capita terms. Or you might want to look at ratios or reciprocals of variables. For example, you might start out with data on automobile fuel efficiency measured in terms of miles-per-gallon. It might make more sense to analyze fuel efficiency in terms of gallons-per-mile, which is the reciprocal.

In some cases you may need to reconcile different sampling intervals, e.g., some variables may be monthly while others are quarterly. To use them in the same model you will have to convert one to the other in whatever way is appropriate. In converting monthly to quarterly data, sometimes you want to add three monthly values together to get the appropriate quarterly value (e.g., if the variable measures a quantity of some commodity bought or sold), and sometimes you may want to sample every third month (e.g., if the variable is an instantaneous measurement of a price or an economic indicator).
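If you do this kind of conversion in Python rather than Excel, here is a sketch of the two cases using pandas (the monthly series is fabricated for illustration):

    import pandas as pd

    idx = pd.date_range("2008-01-31", periods=12, freq="M")  # month-ends
    monthly = pd.Series(range(1, 13), index=idx)

    # A flow variable (e.g., units bought or sold): add the three months.
    quarterly_total = monthly.resample("Q").sum()     # 6, 15, 24, 33

    # An instantaneous measurement (e.g., a price or an indicator):
    # sample the last month of each quarter.
    quarterly_sampled = monthly.resample("Q").last()  # 3, 6, 9, 12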

Sampling-interval conversions are usually straightforward to do with Excel formulas, although sometimes tedious, and it’s easy to make mistakes if you are not careful. And when doing the conversions, you need to do them in such a way that they match up correctly in time. For example, quarter-ending dates are not the same for all companies, and not all monthly variables have the same meaning—some may be averaged over a month while others are sampled on the first or last day of the month, and so on. Also, variables that are monthly totals or averages of daily or weekly data have statistical properties that are very different from those of variables that are instantaneous measurements performed once a month. (Monthly variables that are constructed from daily or weekly averages tend to be more “autocorrelated.” We’ll discuss this statistical phenomenon in more detail later in the course.) Another problem is that not all quarters or months have the same number of weekdays or weekends, because they do not consist of whole numbers of weeks. Business-day adjustments for quarterly or monthly data are not too hard—there are tables for this—but conversions of weekly data to monthly data or vice versa are messy.5

5 Various ways to rationalize the day/week/month/quarter system have been proposed, but alas for forecasting, they have not caught on. One obvious solution would be to have 13 months, each consisting of exactly 28 days, plus a separate New-Year's-Day and an occasional leap day. (This works because 13x28=364.) The thirteenth month could be called "Vacation." A more ingenious solution would be to have each quarter consist of two 4-week months and one 5-week month, so that both quarters and months would consist of whole numbers of weeks, and we would still have 12 months and 4 quarters.


The data you obtain may be clean or dirty. It may or may not contain coding errors or missing values or redefinitions of variables in mid-stream. These are things you need to look for at the descriptive stage of your analysis. Be sure to look at graphs of all your variables before you toss them into models, look at statistics such as the minimum and maximum to make sure there are no crazy numbers, and look at the number of non-missing cases to make sure the variables are all close to the same length. If you find errors or missing values, you can sometimes correct them or interpolate between neighboring values, but in other situations you may have to delete rows of data to keep the model honest or delete entire variables to have a decent sample size for the ones that are used together. The Excel add-in that we will be using in this course (FSBstats) makes it easy to generate plots and tables of summary statistics for many variables at once, which is helpful in looking for such problems.
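FSBstats does this inside Excel; if you happen to work in Python instead, a few pandas calls perform the same descriptive checks (the file name here is hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_excel("merged_data.xlsx")  # hypothetical merged data set

    print(df.describe())    # min and max flag crazy numbers at a glance
    print(df.count())       # non-missing cases per variable: are they
                            # all close to the same length?
    print(df.isna().sum())  # where the missing values are

    df.plot(subplots=True)  # quick plot of every variable
    plt.show()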

The bad news here is that the process of assembling, cleaning, adjusting, and documenting the units of the data is often the most time-consuming step in forecasting, and failure to attend to these mundane details may lead to disastrous errors of modeling. The good news is that you often learn a great deal in the process, gaining insight into the trends and forces which are influencing the variables you wish to predict, even before you start fitting models. You may also identify ways in which your organization's data can be better collected, better organized, better integrated, and better summarized for purposes of decision-making in the future.

1.2(b) Bits and bytes, text and numbers

As you probably know, data is stored in the computer in the form of "bits and bytes." The basic unit of information is a bit, a single binary digit (0 or 1). This is the amount of information stored in an electronic switch or magnetic dot that can be in one of two states (on/off, up/down). A group of 8 bits is a byte, which can represent a number between 0 and 255 in binary notation or else a character out of an alphabet of 256 characters. When data is stored in the form of text, one byte equals one character. If you type a 10-page document with 500 words on each page and an average of 7 characters per word (including spaces), it requires a minimum of 10x500x7 = 35,000 bytes of storage in your computer's memory or disk drive. (If you type the document in plain-text format using Notepad, it will take up almost exactly this much space. If you use Word, it will take up more space for formatting information and other overhead.)

Numeric data can be stored in either "text" form or "binary" form. Data usually enters and leaves the computer in text form, because that is the form in which it is typed in or printed out. Inside the computer it can still be stored in text form, although it is usually converted to binary form for purposes of computation. When numbers are stored in text form, one byte equals one decimal digit or decimal point or +/- sign or number-separating character or line-break character. Data files in text form are usually created in tab-delimited, space-delimited, or comma-separated-value (CSV) format, in which consecutive data fields in the same row are separated by tabs, spaces, or commas, respectively, and breaks between rows are indicated by carriage-returns or line-feeds. Thus, the average amount of storage per number in text form depends in general on the number of digits of precision and the number of fields in a row of data: a 2-digit positive integer takes 3 bytes (including the tab or space or comma character next to it), a 10-digit negative number with a decimal point takes 13 bytes, and an extra byte is needed for every break between rows. In some cases fixed format may be used, in which a fixed number of characters is allocated to each data field.

Numbers are stored in binary form in analytic software such as Excel or Statgraphics. In "double-precision binary floating point" form, which is most commonly used, each number requires 8 bytes of storage, which are used to store it in scientific notation, i.e., in the form 6.0221415 x 10^23, with up to 15 significant digits and a power of 10 between -308 and +308. 15 digits of precision is enough to keep track of whole numbers on the order of 1 quadrillion (10^15) and is adequate for most business purposes unless you are counting in yen.

Why do you need to know anything about this? There are several reasons why you should know the basic facts about data storage given above. First, you should have at least an order-of-magnitude idea of how much disk space and internal memory is needed for working with large data sets. Personal computers have such large disk drives these days that the size of a data set per se is rarely an issue as far as storage is concerned. A gigabyte of storage space holds more than 100 million numbers at 8 bytes per number. However, if you are running regressions on data sets containing many thousands of data points, you can very quickly end up with files that are many megabytes in size, and eventually they start to add up. If your data set consists of 50 variables that each have 5000 observations, it needs around 2M bytes (=50x5000x8) of storage space, and if you start fitting a lot of regression models to it and storing all the results on separate worksheets, you can easily end up with a file that is 20M in size, and if you save the file under a new name every time you add something new to it (which is usually a good idea), then you may end up with many files this size while working on a single assignment. This will still not begin to fill up your hard disk (or even your iPod), but it can get unwieldy for, say, sending your results to others in the form of e-mail attachments. Very large files may also cause your analytical software to bog down or choke when it is performing computations, because it needs a lot of extra memory space for intermediate calculations (e.g., for inverting matrices while running regressions). You are pushing the limit with the software we will use in this course if your data worksheet has half-a-million cells, e.g., 10,000 rows and 50 columns. (Don't panic, they won't usually be anywhere near that large, but occasionally people get ambitious on their final projects.)

The second reason you need to know a little bit about data storage is that you will sometimes need to convert binary numeric data to text format or vice versa in order to move it from one computer program to another or to do various kinds of data-editing. Human brains read and write and store numbers in text form, so regardless of the form in which a computer program works with numbers internally, it must convert them to text form when communicating with us.
Any computer program that is used for analyzing numbers must be able to import and export those numbers in text form, and therefore different computer programs can always pass numeric data back and forth in the form of text files, even if they have different proprietary binary formats for their own files. So, for example, if you need to obtain data from a source that cannot export it in Excel format, it almost certainly will be able to export it in text format, so you should be familiar with the process of importing a text file into Excel. (This is not hard: when you open a text file in Excel, a text-import wizard pops up and guides you through it. Mainly you just need to check a box to indicate whether the field-separating characters are tabs, spaces, or commas.) Also, you sometimes need to perform data-editing or data-cleaning tasks via search-and-replace operations or via concatenation (joining) of character strings. In some cases these operations are more easily done in a text-editing program, and in other cases they are more easily done in Excel.

1.2(c) Grabbing data off web pages and document files

Another useful trick you should learn is that of copying numeric data in text form right off a web page or a word-processing document or a pdf file. When shopping for data on the internet, you may find that you are able to get a display of rows and columns of numbers on a web page or in a document file of some kind, but there is no option to download the original data file from which the table was created. No problem—just highlight the screen area that contains the table of data and copy and paste it into Excel. This usually works if the displayed data is in html, Word, or pdf form, i.e., if it is not merely "bitmapped." (Even in that situation there is still hope: you can use optical-character-recognition software or ask your low-paid subordinates to type the numbers in for you.) Pdf files are sometimes bitmapped images, but usually they have the text buried in them in order to be scalable and searchable, and in that case you can usually copy the text off the page after using the "select" arrow to highlight it.

When a table of data is copied from a document into Excel in this way, an entire row of numbers sometimes ends up as a long text string stored in a single cell, with spaces between the numbers. No problem—just highlight the single column of text strings and use the Data/Text-to-Columns command to break it up into multiple columns of numbers. If you end up with a row of data in Excel that you want to transpose into a column, you can do this by copying (or cutting) it to the clipboard and then using the Edit/Paste-Special/Values/Transpose command to transpose it from row to column format when pasting it back. You can also transpose a whole table of numbers this way. If you want to "unwrap" multiple rows of data in a table to obtain a single long column of numbers, you can do this by first copying the table into Word. If it shows up there as a bunch of lines of text with numbers separated by spaces, you can use search-and-replace to replace spaces with ^p's (paragraph breaks), which creates a single column that can be copied and pasted back to Excel. (If the numbers are separated by more than one space to begin with, first use search-and-replace a few times to replace several consecutive spaces with a single space.) If the copied table of numbers shows up in Word as a table rather than rows of text, you can use the "Merge Cells" command to convert it to a table with a single column and then copy and paste that back into Excel. There's nearly always a way to shovel your data around.
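The same reshaping tricks have short counterparts in Python, if you prefer that route; in this sketch the file name and layout are hypothetical:

    import pandas as pd

    # "Text-to-Columns": split text with whitespace-separated fields
    # into columns.
    df = pd.read_csv("table_from_pdf.txt", sep=r"\s+", header=None)

    # "Paste-Special/Transpose": swap rows and columns.
    transposed = df.T

    # "Unwrapping" a table into one long column, row by row.
    unwrapped = pd.Series(df.to_numpy().ravel())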


1.3 THE (NOT-SO) SIMPLEST MODEL: A CONSTANT FORECAST

Consider the following time series consisting of 20 observations: 114, 126, 123, 112, 68, 116, 50, 108, 163, 79, 67, 98, 131, 83, 56, 109, 81, 61, 90, 92.

[Figure: time series plot of X for periods 1-20]

How should we forecast what will happen next? The simplest forecasting model that we might consider is the mean model, which assumes that the time series consists of i.i.d. variations around a fixed mean value.

1.3(a) Almost all you need to remember about basic statistics

To set the stage for using the mean model for forecasting, let's review some of the most basic concepts of statistics. Let:

    X = random variable
    N = size of entire population (possibly infinite)
    n = size of a finite sample

The population ("true") mean μ is the average of all the values in the population:

    μ = (1/N) · Σ_{i=1}^N x_i

The population variance σ² is the average squared deviation from the true mean:

    σ² = (1/N) · Σ_{i=1}^N (x_i − μ)²

The population standard deviation σ is the square root of the population variance, i.e., the "root mean squared" deviation from the true mean.


In forecasting applications, we never observe the whole population. The problem is to forecast from a finite sample, so statistics such as means and standard deviations must be estimated with error.

The sample mean X̄ is the average of all the values in the sample:

    X̄ = (1/n) · Σ_{i=1}^n x_i

The sample variance s² is the average squared deviation from the sample mean, except with a factor of n−1 rather than n in the denominator:

    s² = (1/(n−1)) · Σ_{i=1}^n (x_i − X̄)²

…and the sample standard deviation s is its square root.

Why the factor of n−1 in the denominator, rather than n? This corrects for the fact that the mean has been estimated from the same sample, which "fudges" it in a direction that makes the mean squared deviation around it less than it ought to be. Technically we say that a "degree of freedom for error" has been used up by calculating the sample mean from the same data. The correct adjustment to get an "unbiased" estimate of the true variance is to divide by the number of degrees of freedom, not the number of data points.

The corresponding statistical functions in Excel are:

• Population mean = AVERAGE(x1, …, xN)
• Population variance = VARP(x1, …, xN)
• Population std. dev. = STDEVP(x1, …, xN)
• Sample mean = AVERAGE(x1, …, xn)
• Sample variance = VAR(x1, …, xn)
• Sample std. dev. = STDEV(x1, …, xn)
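If you want to check these calculations outside of Excel, Python's standard statistics module has direct counterparts; here they are applied to the 20-observation series introduced in section 1.3:

    import statistics

    x = [114, 126, 123, 112, 68, 116, 50, 108, 163, 79,
         67, 98, 131, 83, 56, 109, 81, 61, 90, 92]

    print(statistics.mean(x))       # like AVERAGE: 96.35
    print(statistics.pvariance(x))  # like VARP  (divide by n)
    print(statistics.pstdev(x))     # like STDEVP
    print(statistics.variance(x))   # like VAR   (divide by n-1)
    print(statistics.stdev(x))      # like STDEV: 28.96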

Now, why the obsession with squared error? It is conventional in the field of statistics to measure variability in terms of average squared deviations instead of average absolute deviations around a central value, because squared error has a lot of nice properties:

• The central value around which average squared deviations are minimized is the mean, so by attempting to minimize squared deviations we are implicitly calculating means.

• Variances (rather than standard deviations) are additive when random variables that are statistically independent are added together.

• From a decision-theoretic viewpoint, large errors usually have disproportionately worse consequences than small errors, hence squared error is more representative of the economic consequences of error.



• Variances and covariances play a key role in normal distribution theory & regression analysis.

Moving onward, the standard error of the mean is:

    SE_mean = s/√n

• This is the estimated standard deviation of the error that we would make in using the sample mean as an estimate of the true mean μ, if we repeated this exercise with other independent samples of size n.

• It measures the precision of our estimate of the (unknown) true mean from a limited sample of data.

• As n gets larger, SE_mean gets smaller, and the distribution of the error in estimating the mean approaches a normal distribution. This result is known as the "Central Limit Theorem."

You may ask, what's the difference between a standard deviation and a standard error?

• The term “standard deviation” (usually) refers to the actual root-mean-squared deviation of a population or a sample of data around its mean.

• The term “standard error” refers to the estimated root-mean-squared deviation of the error in a parameter estimate or a forecast under repeated sampling.

• Thus, a standard error is the "standard deviation of the error" in estimating or forecasting something.

1.3(b) Forecasting with the mean model

Now let's go forecasting with the mean model:

• Let x̂_{n+1} denote a forecast of x_{n+1} based on data observed up to period n.

• If x_{n+1} is assumed to be independently drawn from the same population as the sample x_1, …, x_n, then the forecast that minimizes mean squared error is simply the sample mean:

    x̂_{n+1} = X̄

What is the standard deviation of the error we can expect to make in using x̂_{n+1} as a forecast for x_{n+1}? This is called the standard error of the forecast, and it has two components:

    SE_fcst = √(s² + SE_mean²) = s·√(1 + 1/n) ≈ s·(1 + 1/(2n))

The s² term under the square root measures the intrinsic risk (the "noise" in the data); s is also called the standard error of the model in this case. The SE_mean² term measures the parameter risk (the error in estimating the "signal" in the data).


Note that if you square both sides, what you have is that the estimated variance of the forecast error is the sum of the estimated variance of the noise and the estimated variance of the mean. Variances of the different components of forecast error are usually additive in this way, even for more complicated models.

For the mean model, the result is that the forecast standard error is slightly larger than the sample standard deviation, namely by a factor of about 1+(1/(2n)). For a sample size of n=20 there is not much difference: 1+(1/40) = 1.025, so the increase in forecast standard error due to parameter risk (i.e., the need to estimate an unknown mean) is only 2.5%. In general, the estimated parameter risk is a relatively small component of the forecast standard error if (i) the number of data points is large in relation to the number of parameters estimated, and (ii) the model is not attempting to extrapolate trends too far into the future or otherwise make predictions for what will happen far away from the "center of mass" of the data to which it was fitted (e.g., for historically unprecedented values of independent variables in a regression model).

A point forecast should always be accompanied by a confidence interval to indicate the accuracy that is claimed for it, but what does "confidence" mean? It's sort of like "probability," but not exactly. Rather:

• An x% confidence interval is an interval calculated by a rule which has the property that the interval will cover the true value x% of the time under simulated conditions, assuming the model is correct.

• Loosely speaking, there is an x% probability that your data will fall in your x% confidence interval for the prediction—but only if your model and its underlying assumptions are correct and the sample size is reasonably large. This is why we test model assumptions and why we should be cautious in drawing inferences from small samples.

The basic formula is: confidence interval = forecast ± t standard errors

• If the distribution of forecast errors is assumed to be normal, a confidence interval for the forecast is equal to the forecast plus-or-minus some number (“t”) of forecast standard errors

• For a 95% interval, the appropriate value of t is the “critical value of the t distribution with a tail probability of .05 and n−1 degrees of freedom”, which is denoted by t.05,n−1

• In Excel, t.05,n−1 = TINV(.05, n−1)

• In most cases (as we will see), t.05,n−1 ≈ 2, so a 95% confidence interval is "plus-or-minus two standard errors."
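If you are working in Python rather than Excel, the same critical value comes from scipy. Note the difference in conventions: TINV takes the total two-tailed probability, while t.ppf takes a cumulative probability, hence the 1 − .05/2 below.

    from scipy import stats

    n = 20
    t_crit = stats.t.ppf(1 - 0.05 / 2, n - 1)  # = TINV(.05, n-1)
    print(round(t_crit, 3))  # 2.093, close to the rule-of-thumb value 2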

The t distribution is called "Student's t distribution" because it was discovered by W.S. Gosset of Guinness Brewery, who was a pioneer of statistical quality control in the brewing industry but also published scientific articles anonymously under the pen name "Student". The t distribution is the distribution of the quantity

    (X̄ − μ) / SE_mean


which is the number of standard errors by which the sample mean deviates from the true mean when the standard deviation of the population is unknown (i.e., when SE_mean is calculated from s rather than σ). The t distribution resembles a standard normal (z) distribution but with "fatter tails" for small n, as shown in the following chart and table.

In comparing a t distribution to a standard normal distribution, what matters is the “tail area probability” that falls outside some given number of standard errors: a 95% confidence interval is the number of standard errors +/- outside of which there is a tail area probability of 5%. For a lower number of d.f., the “tails” are slightly “fatter”, so a greater number of standard errors is needed for a given level of confidence.

The number of standard errors to use for calculating confidence intervals is very similar for normal and t distributions—i.e., it doesn't really depend on the sample size—except for very low d.f. or very high confidence:

# std. errors needed for 2-sided confidence interval with confidence level:

    d.f.       50%   80%   90%   95%   99%   99.5%  99.9%
    Infinity*  0.67  1.28  1.64  1.96  2.58  2.81   3.29
    200        0.68  1.29  1.65  1.97  2.60  2.84   3.34
    100        0.68  1.29  1.66  1.98  2.63  2.87   3.39
    50         0.68  1.30  1.68  2.01  2.68  2.94   3.50
    20         0.69  1.33  1.73  2.09  2.85  3.15   3.85
    10         0.70  1.38  1.81  2.23  3.17  3.58   4.59

    *Normal

[Figure: "Normal vs. t: much difference?" densities of the standard normal distribution and t distributions with 20, 10, and 5 d.f.]


Empirical rules of thumb

• For n>20, a 68% confidence interval is roughly plus-or-minus one standard error and a 95% confidence interval is plus-or-minus two standard errors.

• A 50% confidence interval is roughly plus or minus two-thirds of a standard error, i.e., one-third the width of a 95% confidence interval.

• A prediction interval that covers 95% of the data is often too wide to be very informative. A 1-out-of-20 chance of falling outside some interval is a bit hard to visualize. 50% (a “coin flip”) or 90% (1-out-of-10) might be easier for a non-specialist to understand.

• Because the distribution of errors is generally bell-shaped with a central peak and “thin tails” (you hope!), the 95% limits are pretty far out compared to where most of the data is really expected to fall.

• Another thing to consider in deciding which level of confidence to use for constructing a confidence interval is whether or not you are concerned about extreme events. If it is a high-stakes decision, you may be very interested in the low-probability events in the tails of the distribution, in which case you might really want to focus attention on the location of the edge of the 95% or even the 99% confidence interval for your prediction. (In this case you will also be very interested in whether the distribution of errors is really a normal distribution! If it isn’t, these calculations will not be realistic.) If it is a routine or low-stakes decision, then maybe you are more interested in describing the middle range of the distribution, e.g., the range in which you expect 50% or 80% or 90% of the values to fall.

• My own preference: just report the forecast and its standard error and leave it to others to apply the rules above to obtain whatever confidence intervals they want.

Our example continued: Time series X (n=20*, d.f.=19**): 114, 126, 123, 112, 68, 116, 50, 108, 163, 79, 67, 98, 131, 83, 56, 109, 81, 61, 90, 92.

The "true" parameters of the population from which this series was sampled are μ=100, σ=30, unbeknownst to us in real life. The statistics of the sample, which are all we can observe, are:

    X̄ = 96.35, s = 28.96

…and the standard errors of the mean and forecast are:

    SE_mean = 28.96/√20 = 6.48
    SE_fcst = √(28.96² + 6.48²) = 29.68

We can use the standard error of the forecast to calculate confidence intervals for the forecasts:

• 95% CI* = 96.35 ± 2.093 × 29.68 = [34.2, 158.5]
• 50% CI** = 96.35 ± 0.688 × 29.68 = [75.9, 116.8]


• These are based on the critical t-values t.05,19 = 2.093 and t.5,19 = 0.688.

In presenting your results to others, think about what to round off and how many decimals to display.6 Don't overwhelm your audience with more decimals than are meaningful for understanding the results. The calculations above have been rounded off to 1 decimal place, and it would be silly to report any more decimal places than that. In fact, 1 decimal place is already too much, given that the original data consists of integers. The forecast standard error is around 30, so 0.1 is only 1/300 of a standard error.7 It would suffice here to say that the point forecast is 96 with a 95% confidence interval of [34, 158], or 96 +/- 62, which means rounding off to 1/30 of a standard error. Just because your software can give you up to 15 digits of precision in any calculation doesn't mean that you need to display all of them!

Having said all that, don’t go to the trouble of adjusting the displayed precision of your results in work that you submit in this course. Just copy and paste the default reports that you get in Excel or Statgraphics.
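For the record, here is a short Python sketch that reproduces the example's calculations end to end, using scipy for the critical t-values; it is a check on the arithmetic above, not something you need for the course.

    import math
    import statistics
    from scipy import stats

    x = [114, 126, 123, 112, 68, 116, 50, 108, 163, 79,
         67, 98, 131, 83, 56, 109, 81, 61, 90, 92]
    n = len(x)

    xbar = statistics.mean(x)               # 96.35
    s = statistics.stdev(x)                 # 28.96
    se_mean = s / math.sqrt(n)              # 6.48
    se_fcst = math.sqrt(s**2 + se_mean**2)  # 29.68

    for conf in (0.95, 0.50):
        t = stats.t.ppf(1 - (1 - conf) / 2, n - 1)  # 2.093 and 0.688
        lo, hi = xbar - t * se_fcst, xbar + t * se_fcst
        print(f"{conf:.0%} CI: [{lo:.1f}, {hi:.1f}]")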

The forecast and confidence interval calculated above are for the value of X in the very next period (i.e., period 21), but under the assumptions of the mean model, the forecast and confidence interval for the next period also apply to all future periods. The model assumes that all future observations will be drawn from the same distribution, from here to eternity. So, we can extrapolate this one forecast arbitrarily far into the future, and a plot of the forecasts and confidence intervals stretching off into the future will look like a set of parallel horizontal lines. If X is a variable that is inherently a time series, such as a series of weekly or monthly observations of an economic variable or a weather variable, the long-horizon forecasts produced by the mean model do not look very realistic. The distant future is usually more uncertain than the immediate future, and most time series forecasting models will reflect this by yielding confidence intervals that gradually get wider at longer horizons. The longer-horizon forecasts for the mean model make more sense if X is not really a time series, but rather a set of independent random samples from a population that is not changing with time, in which case all the observations are statistically the same whether they are performed today or tomorrow, sequentially or in bunches. Here are plots of the forecasts and confidence intervals produced by Statgraphics for periods 21 through 30, at both the 95% level of confidence and the 50% level of confidence:

6 You can use the "increase decimal" and "decrease decimal" functions on the Home toolbar in Excel to adjust the number of displayed decimals, and the number displayed in Excel will be the default number displayed in Word or Powerpoint if you copy a table of numbers.

7 Here is a technical fact to note: the difference between the critical t-value for a 95% confidence interval and a 95.5% confidence interval is about 0.05 regardless of the sample size, so unless you are placing large bets on low-probability events, you are splitting hairs if you report your results to a precision of greater than 1/20 of a forecast standard error. Even that amount of precision would be overstated given that there is also model risk. About the only reason for reporting your results with more digits than are really significant is for audit-trail purposes in case someone tries to exactly reproduce your results. But for that purpose you can just keep your spreadsheet files. In fact, you should ALWAYS be sure to keep your spreadsheet files and make sure that the "final" version has a file name that clearly identifies it with your final report.


[Figure: time sequence plot for X (constant mean = 96.35), showing actual values with forecasts and 95% limits for periods 21-30]

[Figure: time sequence plot for X (constant mean = 96.35), showing actual values with forecasts and 50% limits for periods 21-30]

Note that the 50% confidence interval is almost exactly 1/3 the width of the 95% confidence interval. And, as pointed out above, the width of the confidence interval remains the same as we forecast farther into the future. The calculation of confidence intervals shown above takes into account the intrinsic risk and parameter risk in the forecast, assuming the model is correct, i.e., assuming the mean and standard deviation are constant over time and that variations from the mean are statistically independent from one period to another. But what about model risk? For example, suppose it is possible that there is a linear trend. In that case it might be appropriate to fit a trend-line model, which is a simple regression model in which the independent variable is the time index:


[Figure: time sequence plot for X under the linear trend model (fitted trend line = 114.611 − 1.7391t), showing the actual data, the downward-trending forecasts for periods 21 through 30, and the 50% confidence limits.]

This is a very different modeling assumption, and it leads to very different forecasts and confidence intervals for the next 10 periods. The forecasts now trend downward by 1.7391 units per period, while the 50% confidence intervals are only slightly wider than those of the mean model. (The confidence intervals for the trend-line model do get wider at longer forecast horizons, but not by much.) The 50% confidence intervals for the period-30 forecast have relatively little overlap between the two models, so the choice of model makes a big difference that far out.

Is the trend line a plausible alternative model, based on what the data is telling us? Trend-line models will be discussed in more detail in the next class, but as a preview of that discussion, one indicator we can look at is the statistical significance of the estimated trend coefficient in the regression model, as measured by its t-stat. The t-stat of the trend coefficient is 1.61 in this case. (The rest of the regression output is not shown here, but trust me, this is the real t-stat.) It is almost significant at the 0.10 level, for which the critical t-value for a sample of size 20 is 1.73, as shown in the table above. I would call this a “marginally” significant value—not so small as to be completely invisible, but probably not big enough to be useful.

Another thing that naïve users of regression models immediately ask is: “what’s the R-squared?” The (adjusted) R-squared is the percentage by which the error variance of the regression model is less than that of the mean model (the “percent of variance explained”), and in this case it is 7.8%, which is not very impressive. A better statistic to look at is the standard error of the regression model, which is a lower bound on its forecast standard error. The standard error of the regression model is 27.8 in this case, compared to a forecast standard error of 29.7 for the mean model, not a big improvement. However, these two numbers cannot both be correct as far as errors in predicting the future are concerned, because each one is based on the assumption that its own model is the correct one, and the models are different. The model that thinks it is best is not always right.

So, the data does not strongly indicate that there is a trend. However, this is a small sample, and we shouldn’t rely only on the data for such an important inference. We should take into account everything else we know about the situation that might indicate whether a linear trend is to be expected, and if so, whether the estimated trend is at least in the right direction. And even if we do choose the trend-line model, it would be dangerous to extrapolate the trend very far into the future on the basis of such a small sample, because the longer-horizon forecasts and confidence intervals depend very sensitively on the assumption that the trend is linear and constant.
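For concreteness, here is a minimal Python sketch of the comparison just described: it fits both the mean model and the trend-line regression to a stand-in series and prints the trend t-stat, the two standard errors, and the adjusted R-squared. The data are hypothetical, so the printed numbers will not match the 1.61, 27.8, 29.7, and 7.8% figures from the class data.

```python
import numpy as np
from scipy import stats

# Stand-in series with a weak downward trend (the actual class data are
# not reproduced in these notes).
rng = np.random.default_rng(1)
t = np.arange(1, 21)                        # time index 1..20
x = 115 - 1.7 * t + rng.normal(0, 28, 20)   # hypothetical observations

n = len(x)

# Mean model: its standard error is just the sample standard deviation.
se_mean_model = x.std(ddof=1)

# Trend-line model: simple regression of x on the time index.
slope, intercept, r, p, se_slope = stats.linregress(t, x)
resid = x - (intercept + slope * t)
se_regression = np.sqrt(np.sum(resid**2) / (n - 2))  # 2 estimated coefficients

print(f"trend coefficient = {slope:.4f}, t-stat = {slope / se_slope:.2f}")
print(f"standard error of regression = {se_regression:.1f}")
print(f"standard error of mean model = {se_mean_model:.1f}")
# Adjusted R-squared: percent by which the regression's error variance
# is less than the mean model's error variance.
print(f"adjusted R-squared = {1 - (se_regression / se_mean_model)**2:.1%}")
```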


A trend-line model is not very “robust” as far as long-term forecasting is concerned, even less so than the mean model. If you have no a priori idea whether the series has a positive trend, a negative trend, or zero trend, then it is more conservative to assume zero trend (i.e., to stick with the mean model) than to try to estimate a non-zero trend from a small sample. However, if you have good reasons for believing that there is a trend in a particular direction, but you just don’t know whether it is steep enough to stand out against the background noise in a small sample, then the trend-line model might be logically preferred even if its estimated trend coefficient is only marginally significant. Even so, its confidence intervals for long-horizon forecasts still shouldn’t be taken very seriously. We’ll look at alternative models for trend extrapolation in the coming weeks.

There are also other ways in which the mean model might be wrong: successive deviations from the mean could be correlated with each other, the standard deviation might not be constant, or the mean might have undergone a “step change” at some point in the middle of the series. Each of these alternative hypotheses could be tested by fitting a more complicated model and evaluating the significance of its additional coefficients (a quick diagnostic for the first of these is sketched below), but once again, you should not rely only on significance tests to judge the correctness of the model—you should draw on any other knowledge you may have, particularly given the small sample size. You shouldn’t just blindly test a lot of other models without good motivation. Remember that at the end of the day you may have to explain your choice of model to other decision makers, perhaps betting their money and your reputation on it, and you ought to have a better argument than just “this one has good p-values” or “it has an R-squared of 85%” or (worst of all) “my automatic forecasting software picked it.”
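As an example of such a diagnostic, here is a minimal sketch of the first check mentioned above: estimating the lag-1 autocorrelation of the deviations from the mean. This is a quick screening device rather than the model-fitting approach described in the text, the 2/√n threshold is only a rough rule of thumb, and the data series is a stand-in.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Lag-1 autocorrelation of deviations from the sample mean."""
    d = x - x.mean()
    return np.sum(d[1:] * d[:-1]) / np.sum(d ** 2)

# Stand-in data; substitute your own series.
rng = np.random.default_rng(2)
x = rng.normal(96, 29, size=20)

r1 = lag1_autocorrelation(x)
n = len(x)
# Rough rule of thumb: |r1| > 2/sqrt(n) hints that successive deviations
# are correlated, i.e., the independence assumption of the mean model
# may be violated.
print(f"lag-1 autocorrelation = {r1:.2f}, rough threshold = {2 / np.sqrt(n):.2f}")
```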

Let me make this clearer: there are no magic wands for creating good forecasting models, but there are “best practices” you should try to learn, and that’s what this course is about. T-stats, p-values, and R-squared values are statistics you should know how to interpret in practical terms, but they are not the most important numbers in your analysis and they are certainly not the bottom line. Your bottom-line issues are the following:

• What new things have you learned from your data collection and from your analysis that might be useful to yourself or others?

• Does your forecasting model make sense to you, based on everything you know?

• Would it make sense to someone else—can you explain it in plain language?

• Is it no more complicated than it needs to be?

• How big are the errors you will make when using it?

• How is it relevant for decision making?


Yes, the mean model is very simple, but it is the foundation for more sophisticated models we will encounter later (random walk, regression, ARIMA), and it has the same generic features as any statistical forecasting model:

→ One or more parameters to be estimated

→ An estimate of the intrinsic risk, which is called the “standard error of the model”

→ An estimate of the parameter risk(s), which is called the “standard error of the coefficient(s)”

→ A forecast standard error that reflects both intrinsic risk & parameter risk

…and model risk too! If you understand how these factors are estimated and evaluated for the mean model, you will understand most of what you need to know about how to interpret more complex models.
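To make those four generic features concrete, the following sketch (an illustration added to these notes, not part of the original text) computes each of them for the mean model on a stand-in series, using the formulas developed in this section.

```python
import numpy as np

# Stand-in series; substitute your own data.
rng = np.random.default_rng(3)
x = rng.normal(96, 29, size=20)
n = len(x)

mean_hat = x.mean()               # 1. the estimated parameter
se_model = x.std(ddof=1)          # 2. standard error of the model (intrinsic risk)
se_mean = se_model / np.sqrt(n)   # 3. standard error of the mean (parameter risk)

# 4. forecast standard error: intrinsic and parameter risk combined in
# quadrature; algebraically equal to se_model * sqrt(1 + 1/n).
se_fcst = np.sqrt(se_model**2 + se_mean**2)

print(f"estimated mean              = {mean_hat:.2f}")
print(f"standard error of the model = {se_model:.2f}")
print(f"standard error of the mean  = {se_mean:.2f}")
print(f"forecast standard error     = {se_fcst:.2f}")
```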