Statistics review


Class 2

Hypothesis Testing: EE Example

The showroom is not a problem if service time to non-customers is < 5,760 seconds. Our sample mean of 100 people was 4,880 seconds.

Why sample? Because it's not feasible to survey the entire population of 4,000. BUT that's just a sample mean - if we polled 100 other people, we'd probably get something different.

So what we want is some assurance that the true average service time does not exceed 5,760 seconds.

What we did in Bootcamp was create a confidence interval that showed the range of sample means we are likely to obtain IF the population mean is 5760. A 95% confidence interval demonstrates the range of sample means we’re likely to obtain 95% of the time if the population mean is 5760.

So the 95% confidence interval = 5,760 ± t(alpha, df) × s_x, where:
alpha = the % of observations in the tails
t = the t-statistic that corresponds to alpha (what t-value corresponds to that many standard deviations above or below the mean?)
df = the number of degrees of freedom (the number of values that are free to vary - just think "number of observations – 1")
s_x = the standard error of the mean

Standard error of the mean = standard deviation of the sample / √(sample size)

t value associated with 5% of the observations lying in the tails with 99 degrees of freedom was 1.98

The standard error of the mean was 261. Hence, the 95% confidence interval = 5,760 ± 1.98(261). What does the 1.98 mean? It means that 95% of the observations lie within 1.98 standard deviations of the mean.
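As a quick check of that arithmetic, here is a minimal sketch in Python (assuming the n = 100 sample and the 261 standard error quoted above); scipy's `t.ppf` plays the role of looking up the t-value.

```python
# Minimal sketch of the confidence-interval arithmetic above
# (assumes n = 100 observations and a standard error of 261).
from scipy import stats

n = 100
se = 261            # standard error of the mean
mu0 = 5760          # hypothesized population mean, in seconds

t_crit = stats.t.ppf(0.975, df=n - 1)   # ~1.98 with 99 degrees of freedom
lo, hi = mu0 - t_crit * se, mu0 + t_crit * se
print(t_crit, (lo, hi))                 # ~1.98, roughly (5242, 6278)
```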

Hypothesis Testing
So as an example, take a murder trial. The jury is told to presume "not guilty." It's up to the prosecutor to get the jury to reject that hypothesis with evidence. The defense attorney tries to get you to doubt the evidence. A not-guilty verdict means the prosecution failed to sustain its burden of proof. So now let's apply that logic to our hypothesis testing.

The hypothesis to be tested is the null hypothesis ("not guilty"). The alternative hypothesis (H_A) is that the defendant IS guilty. So let's set up our hypothesis test:


H_0: the defendant is not guilty
H_A: the defendant is guilty

Only if the evidence exceeds the minimum burden of proof can the null hypothesis be rejected! BUT:

Type I Error: the null hypothesis is correct but is rejected.
Type II Error: the null hypothesis is incorrect but is not rejected.
A Type I Error is more likely to occur in a civil case - the burden of proof is lower.
A Type II Error is more likely to occur in a criminal case - the burden of proof is really stringent.

So, applying the courtroom logic to our hypothesis testing, the null hypothesis is established first:
H_0: service time to non-buyers ≤ 5,760 seconds
If the null hypothesis isn't true, the alternative must be true: H_A: service time to non-buyers > 5,760 seconds.
Note: could you have reversed these hypotheses? Sure, but it's easier to have the null hypothesis be "if this is true, our response is the status quo." If you reject the null hypothesis, we're going to have to change something.

So if we don't have evidence of a problem, we won't change anything - we'll only implement the policy of charging for help if the evidence shows service time exceeds 5,760 seconds.

The null hypothesis is only rejected if the evidence exceeds the minimum burden of proof.

Test Example
A computer company thinks introducing different colored computers may increase profits. To be profitable, sales have to increase by at least 275 units/week. New colors were test-marketed over 36 weeks. So let's set up our hypothesis test:

H_0: average sales/week ≤ 275 units (not worth it), i.e., H_0: µ ≤ 275

H_A: average sales/week > 275 units (let's do it!), i.e., H_A: µ > 275

Hokay. So we did our little sample and got a mean of 290. Are we done? No! We have to ask: what is the probability that we could draw a sample mean of 290 if the population mean were 275 or less? The answer is called the p-value. To determine the p-value, you need to know how many standard deviations 290 is from 275.

That's the t-statistic: t = (290 – hypothesized population mean) / (standard error of the mean). So we calculate our t-statistic: 1.76, when Excel isn't being weird. We can find the p-value by using the Excel function T.DIST.RT (RT means right tail).

So X = 1.76; Deg_freedom = 35. What determines the p-value that implies "strong evidence"?
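For reference, a small sketch of the same right-tail lookup in Python, using the t-statistic and degrees of freedom from the notes; `stats.t.sf` is the analogue of T.DIST.RT.

```python
# One-tailed p-value for the colored-computer test (scipy analogue of T.DIST.RT).
from scipy import stats

t_stat = 1.76          # (290 - 275) / standard error of the mean
df = 35                # 36 weeks of data - 1
p_value = stats.t.sf(t_stat, df)   # area in the right tail
print(p_value)         # ~0.044
```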

In other words, what's a reasonable doubt? That's the significance level: alpha.


Basically, you set it yourself at the outset: say you set it to 0.05. If your p-value is 5% or less, you reject the null hypothesis. The most popular alphas are .10, .05, and .01.

If the P-value is less than alpha, you are concluding that the likelihood of getting that sample mean if the hypothesized mean is true is small.

With alpha = .10, you're saying: if the probability of getting a sample mean like this when the null hypothesis is true is less than ten percent, you'll reject it. I will only reject the null hypothesis if the p-value is alpha or less. Note: as alpha shrinks, the test becomes more stringent and we are less likely to reject the null hypothesis (you're moving from civil to criminal). So going from an alpha of .10 to .01, you're less likely to make a Type I error (it's harder to reject the null).

Type I = you rejected the null, but you shouldn't have. BUT if you're less likely to reject the null, then by definition you're more likely to fail to reject it when you should have!

Now, had we used a .05 alpha, we would have rejected the null hypothesis and started producing colored computers. BUT if we had used a .01 alpha, we would fail to reject the null hypothesis - we would not produce the colored computers; the evidence isn't strong enough to shift the status quo in favor of producing them. So what's the right answer? We don't know! It all depends on the alpha you choose - and whichever you choose is somewhat arbitrary.

So... what’s the best test? .1? .05? .01? It depends on what the business is willing to stomach as a risk.

Going from 10% to 5% to 1% decreases the likelihood of a Type I error and increases the likelihood of a Type II error. How do you decide which one to go with? Well, it all comes down to the bottom line: how costly is a Type I error relative to a Type II error?

So how costly would it be to manufacture colored computers that don't generate sufficient sales to cover the added costs? On the other hand, how much profit is forgone by failing to produce colored computers that would have been profitable?

Note: why do we say "fail to reject the null hypothesis"? Think back to the courtroom analogy: "innocent" means you didn't do it; "not guilty" means the prosecution didn't sustain its burden of proof. Failing to reject the null hypothesis states that the available evidence isn't strong enough to cause one to believe the null hypothesis is incorrect.

Now then, our test-marketing example used a one-tailed test: we only cared about whether our number was big enough. But there is also something called a two-tailed test.

A two-tailed test is necessary when either a large value or a small value will cause the null hypothesis to be rejected.
Ex) An e-cig manufacturer believes the average age of an e-smoker is 25. If the average age is either significantly higher or lower than 25, the ad plan won't be effective - we reject if the sample mean lies in either tail.

So here is our test:
H_0: µ = 25


H_A: µ ≠ 25
Okay. Here's our data:

Sample mean = 34
Standard error of the mean = 5.1
35 degrees of freedom

t-statistic = (34 – 25)/5.1 ≈ 1.76

So should we reject it? Not necessarily. In the one-tailed test, everything to the right was 5%. Here we have two tails, so each tail only gets 2.5%. For a two-tailed test, use T.DIST.2T.

= 8.7%. So if alpha is 5%, we would fail to reject the null hypothesis - the p-value is bigger than our 5%. Thus we conclude that we don't have enough evidence to say our target market is not 25.

Suppose the sample mean had been 18 and standard error of the mean was 3.2 (with 35 df)

t = (18 – 25)/3.2 = -2.1875 (so 18 lies 2.1875 standard deviations below 25). P-value = ???

YOU CAN'T PLUG A NEGATIVE NUMBER INTO T.DIST.2T. No worries though, just plug in the absolute value (it doesn't matter because the distribution is symmetric). So we plug that in, get a p-value of 3.5%, meaning we would reject the null hypothesis.

3.5% means that there is a 3.5% chance that we would get a sample mean of 18 if the true mean were 25. So - we don’t think our target market is 25.
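A compact sketch of both two-tailed calculations in Python; taking the absolute value of t mirrors the T.DIST.2T workaround described above.

```python
# Two-tailed p-values (scipy analogue of Excel's T.DIST.2T).
from scipy import stats

def two_tailed_p(t_stat, df):
    # T.DIST.2T only accepts positive t, so use |t|; the t distribution is symmetric.
    return 2 * stats.t.sf(abs(t_stat), df)

print(two_tailed_p(1.76, 35))     # ~0.087 -> fail to reject at alpha = .05
print(two_tailed_p(-2.1875, 35))  # ~0.035 -> reject at alpha = .05
```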

Notes on HW
Write out an explanation of what you did.

Consumer Packaging Example
Review

Hypothesis testing says if the mean is this, what are the odds that we would get a sample mean of that

So, the colored computer example: our test-market sample mean was 290, which makes us say "wow, above and beyond break-even!" Buuuuuuut it's also just a sample mean.

So we calculated the P-Value: if the population mean is this, what’s the probability we get this sample mean. So that was a one-tailed test.

Our P-Value was in one tail (critical region being 5%), so we rejected the null hypothesis.

But what about a two-tailed test? Then, the areas in the critical region sum to 5% (each tail is 2.5%)

The example we gave was our e-cig market: if our hypothesized average was 25, we didn’t care if it was 10 years older or 10 years younger - we want to know if we’re advertising to the right people. That foundation laid, let’s go with a real example:

A company is considering two kinds of packaging: which will sell better? Each type is sold in 36 demographically similar sales districts.


Demographically similar is key because we need that to assume an unbiased sampling population. If the districts aren't demographically similar, your sample might not be representative, and therefore biased.

The test market lasts one month. So we plug the data into Excel and find that our mean sales are:

Pack 1: 290.54
Pack 2: 262.75
So we go with Pack 1, right? Well, not so fast - we don't know yet.

Pack 1 sold better with the sample population, but would it sell better for the entire population? Enter statistics.

First things first: one- or two-tailed test? For a two-tailed test, our null hypothesis would be that the sales of each package type are the same; the alternative hypothesis is that they're not:

H_0: µ1 = µ2
H_A: µ1 ≠ µ2
Or you could rewrite it as:
H_0: µ1 – µ2 = 0
H_A: µ1 – µ2 ≠ 0
But... if you reject the null hypothesis, what do you do? All this tells us is that the two aren't the same; one is better than the other - but we don't know which one. So we want to do a one-tailed test.

A one-tailed test allows us to directly test whether one package type sells more than the other

Our sample mean shows higher average sales for package type 1; we want to see if the difference is statistically significant. So our hypotheses are H_0: µ1 – µ2 ≤ 0; H_A: µ1 – µ2 > 0.

We’re saying that µ2≥µ1 is our null hypothesis because if we fail to reject it, we stay with our current state of things

This test will be different though! We have two sample means and are making inferences about two population means. The difference between the two population means is µ1 – µ2. By the variance sum law, the standard deviation of the difference is the square root of the sum of the squares of the two standard deviations.

Remember though: we're dealing with samples, so the standard error of the difference becomes the square root of the sum of each sample variance divided by its sample size.

That's complicated, so we write the standard error of the difference as s_(Xbar1 – Xbar2).

We use the Excel function T.TEST to perform our test:
Array1: data set 1
Array2: data set 2
Tails: how many tails do you want?
Type:

3 is usually what you want - you have two samples with different standard deviations.

So we get 1.1% as an answer
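Here is a rough Python equivalent of that T.TEST call (Tails = 1, Type = 3, i.e., a one-tailed test that doesn't assume equal variances). The district sales numbers below are made up purely for illustration, so the printed p-value won't match the 1.1% from the class data.

```python
# One-tailed, unequal-variance (Welch) t-test, analogous to
# Excel's T.TEST(array1, array2, 1, 3). Data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pack1 = rng.normal(290, 60, size=36)   # stand-in sales for package 1 districts
pack2 = rng.normal(263, 60, size=36)   # stand-in sales for package 2 districts

res = stats.ttest_ind(pack1, pack2, equal_var=False, alternative="greater")
print(res.statistic, res.pvalue)
```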


This means that if package 1 does not actually sell more than package 2, there is only a 1.1 percent chance of drawing sample means that favor package 1 by as much as ours did.

So do we reject the null hypothesis? Well that depends on the significance level

If our significance level were 5%, that means we would reject the null hypothesis if the p-value is 5% or less. So if alpha is .05, we'd reject the null hypothesis and choose package 1. If our significance level were 1%, we would fail to reject the null hypothesis - we would be saying that we're not persuaded that 1 is better than 2, in which case we'd do a more extensive test market. The status quo is that you don't know. Once again, our decision as to what the significance level should be is based on our assessment of which type of error (Type I or Type II) would be more costly to the firm.

Type 1: Rejecting the null hypothesis, but you shouldn’t have. Type 2: failing to reject the null hypothesis when you should have.

The bar you set for evidence is always circumstance-specific.
If you're testing for differences in proportions, the standard error of the difference becomes √(p1(1 – p1)/n1 + p2(1 – p2)/n2), where p1 and p2 are the sample proportions, so your test statistic becomes z = (p1 – p2) divided by that standard error.

Note that this is a z-score, not a t-statistic. When you're dealing with proportions, you need a good sample size (at least 30 observations).
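A small sketch of that two-proportion z-test in Python. The counts are hypothetical, and the unpooled standard error shown matches the formula written above (some textbooks pool the proportions under the null instead).

```python
# Two-proportion z-test with an unpooled standard error (hypothetical counts).
import math
from scipy import stats

x1, n1 = 45, 200   # successes / sample size, group 1 (made up)
x2, n2 = 30, 200   # successes / sample size, group 2 (made up)

p1, p2 = x1 / n1, x2 / n2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se
p_value = 2 * stats.norm.sf(abs(z))   # two-tailed
print(z, p_value)
```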

Asset Returns
So the general idea is: you want a diversified portfolio so that the stocks that do well compensate for the stocks that tank. We've got our data in the spreadsheet; let's investigate the stability of each asset class by comparing the returns for the first 20 years with the second 20.

But first look at the averages: there seems to be a fair bit of difference. Since we want to know whether the monthly returns have been stable across the two 20-year subsets, we will use the following two-tailed test:

H_0: µ1 – µ2 = 0
H_A: µ1 – µ2 ≠ 0
Why not a one-tailed test? The two-tailed test is about showing whether or not they're the same; a one-tailed test would say which period was better. But that's not helpful for investing - we're trying to predict the future.

T.TEST results:
For the first asset class, the p-value is 0.628: we cannot reject the null hypothesis that returns have been stable.
Corporate bonds: a really small p-value - strong evidence to reject the idea that the returns are the same.
Government bonds: 1.1% probability; depending on alpha, we may or may not reject the null hypothesis.


T-bills: an infinitesimal p-value, so reject the idea that returns are the same.

Note: when do you use a one-tailed test versus a two-tailed test?

Use a one-tailed test when you want a value that is AT LEAST XXX.
Use a two-tailed test when you want a value that is EXACTLY XXX - bigger OR smaller is a problem.

Ch. 3: Intro to Regression Analysis

The Basic Basics
First off, dependent v. independent variables.

The dependent variable's value depends on another variable; the independent variable is a given.

Everything we've done thus far assumes that we're operating in a constant environment. If one factor changes, the distribution will shift accordingly. We need a tool to alter predictions for the dependent variable when factors that influence it change.

=> Enter Least Squares Regression Analysis
Now let's think about equations for a sec: they have dependent and independent variables.

So in y = 3x + 5, x is independent and y is dependent. So, let's get this out of the theoretical.

Law of Demand: as the price of a good rises, the quantity demanded decreases.

The graphic representation is called the demand curve.
Note: the demand curve is from the buyer's perspective, so the price is the independent variable and quantity demanded is the dependent variable. The buyer doesn't set the price; it's a given - but based on the price, the buyer decides how much to purchase.

Now then, since the demand curve is a line, it must have an equation. Quantity demanded is the dependent variable; price is the independent variable.
Q = ƒ(P), or Q = a + bP
What does "a" mean in the equation?

=> The amount we could give away for free: the value of Q when P = 0

What about "b"?
=> The change in quantity demanded resulting from a given change in price: ∆Q/∆P

Note that the demand curve has a negative slope: when price goes up, demand goes down and vice versa. Hence we actually write the demand curve equation as Q = a - bP

a is the value of Q when P = 0 - it's the number we could just give away for free.
b tells us how much Q decreases for each one-unit increase in P.
Every demand curve fits the generic Q = a – bP equation.

So suppose the demand curve equation is Q = 10 – 2P:
If the goods were free, people would take 10.
For each $1 increase in price, demand decreases by 2.


Note that this equation is set up in X-intercept form (blame Alfred Marshall, thanks)

The slope is given in run/rise.
Also note that quantity demanded may not be the same as quantity sold.

If you stock out, all you know is what you sold, not what you could have sold (demand)

Now then, it'd be pretty great to actually have the demand equation. How do you do this?

Well, you keep track of your sales and price data and you can come up with a cluster of dots! Woo hoo! Dots!

And of course, you want to take those dots and figure out the regression line so that you can figure out your demand curve equation. Least squares regression fits a line through a set of data and reports the equation.

How does it work? It comes up with a line and squares the distance of each data point from the line. Why square? Because that gets rid of the negative sign - we don't care whether a point is above or below the line, we just want the size of the deviation. Then you sum the squares of the deviations; the line with the least sum is the one that fits the data best. The lower the sum of the squares, the better the line fits the data.

Hence the name "least squares regression."
How do you calculate the slope of the line? Through the (very complicated-looking) equation on slide 39.

So what's our average price? 47. What does p_i refer to? Each of the prices in the list - the first is p_1, then p_2, and so on. So p_i just says: you have a bunch of observations and we're going to index them. What does the formula do? For each observation it measures how much p_i varies from p̄ (the average price), and combines those deviations to get the slope.

So let's do it in Excel:
Known y's: all the dependent variable values
Known x's: all the independent variable values
Const: TRUE
Stats: blank

Note: get the formula, plug in 50 and get 907. Okay, that’s not the exact number you’d sell at that price (the line isn’t a perfect fit), but it is what you’d expect to sell on average
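As a rough Python counterpart to that Excel fit, here is a sketch using statsmodels; the price and quantity arrays are made up, chosen only so the fitted line is in the neighborhood of the one the notes describe (roughly Q = 1652 – 14.9P, which gives ~907 units at P = $50).

```python
# Fit a simple demand regression and compute the point estimate at P = $50.
import numpy as np
import statsmodels.api as sm

price = np.array([40, 42, 45, 47, 50, 52, 55])              # hypothetical prices
quantity = np.array([1060, 1020, 985, 950, 910, 880, 830])  # hypothetical unit sales

X = sm.add_constant(price)          # adds the intercept term "a"
model = sm.OLS(quantity, X).fit()
print(model.params)                 # [a, b]: intercept and slope

# Conditional mean (point estimate) of unit sales when the price is $50:
print(model.predict([[1, 50]]))
```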

Now, we're only going to focus on the dots that correspond to $50, and we're going to imagine them in 3D: we would see a normal distribution. On the horizontal axis, we're focusing on unit sales ONLY when P = $50. When we charge $50, how much will we sell?

The distribution says that, on average, we'll sell 907. Sometimes you're going to sell more, sometimes fewer. Our regression estimate (the point estimate) is the average of unit sales when the price is $50.


What if P = $40? Now we're going to get an average of 1,056 sold. The point estimate is 1,056 - when the price is $40, you will sell an average of 1,056; the entire distribution has shifted. We also call the point estimate the conditional mean: the mean on the condition that the price is $40, or whatever you choose.

Average value of the dependent variable given the independent variable

Whoa. More vocab.
Standard error of regression: the regression equivalent of the standard error of the mean.

Remember, standard deviation is a generic measure of spread, so lots of times we work with a specific version: the standard deviation of this particular distribution. So when we have the distribution of sales at a given price, we want to know: 95% of the time, how much can we expect to sell when the price is $50?

What do you need to know to make that calculation? Basically you're constructing a confidence interval, so you'd need to know the standard deviation - and that's the standard error of regression.

Now then, remember there’s the whole issue with our population of interest and sample? Because it’s rarely practical to sample the entire population.

And of course the problem there is that you're probably not going to get the same sample mean as the population mean. But what we do know is that if you took repeated samples from the same population and averaged THOSE means, you would get the population mean. But... c'mon, we're not going to do that either - we're only going to take our one sample mean and draw inferences from it.

So. Again, we’re going to be using samples to run our regressions most of the time.

What do we do? Well, suppose we believe the population slope is -10. The 95% confidence interval would be -10 ± (t-value)(standard error of the coefficient):

Now, if we want to get our confidence interval, we need to know the standard deviation and the t-value corresponding to a 95% confidence interval

The standard deviation here is the standard error of the coefficient. The t-statistic means that when you have 29 degrees of freedom, 95% of observations will lie within 2.04 standard deviations of the mean. So our confidence interval is -10 ± (2.04)(1.175).

That means that if the population slope were -10, then 95% of the time a sample slope would fall between -12.39 and -7.61.

"-12.39 ≤ ß ≤ -7.61" is the correct way to write it. BUT we got a sample slope of -14.9 - a number outside that range.
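A quick sketch of that interval and the corresponding t-statistic in Python, using the -10 hypothesized slope, the 1.175 standard error, and 29 degrees of freedom from the notes.

```python
# Confidence interval for the slope, and the t-statistic for the sample slope.
from scipy import stats

beta0 = -10          # hypothesized population slope
se_b = 1.175         # standard error of the coefficient
df = 29

t_crit = stats.t.ppf(0.975, df)                       # ~2.05
print(beta0 - t_crit * se_b, beta0 + t_crit * se_b)   # roughly -12.4 to -7.6

b_sample = -14.9
print((b_sample - beta0) / se_b)   # ~ -4.2 standard errors below -10
```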


So we reject our null hypothesis.

t(alpha/2):

t tells us: how many standard deviations from the mean a number is

So this is the t-value that corresponds to alpha over two. Alpha = 5% means the total area in the tails is 5%. Alpha/2 means we're splitting alpha between the two tails - each tail gets 2.5%. So we're looking for the t-value that corresponds to a 95% confidence interval.

n – 2 corresponds to the number of degrees of freedom. The t-value differs depending on the number of degrees of freedom. Why n – 2?

If you had one observation, you could draw infinitely many lines through that point. If you had two observations, you could draw exactly one.

In order to draw a line, you need two points; those first two observations pin the line down. That's why degrees of freedom = n – 2: anything beyond the first two points gives you the degrees of freedom.

s_b = the standard error of the coefficient. Kstat gives us the info we need to construct a confidence interval.

Standard error of the coefficient? Check. It even gives us the t-value for the 95% confidence interval.

95% of the sample slopes will be within 2.05 standard deviations of the population slope

Now, we could also do this same calculation using the standardized distribution.

With 29 degrees of freedom, the critical values are a generic statistical fact - true in any situation with 29 degrees of freedom. So we could also compute a t-statistic to find how many standard deviations -14.9 lies from -10.

t = (my number – what you think the population slope is) / (standard error of the coefficient) = (-14.9 – (-10))/1.17 = -4.19. So this tells us that if the population slope is -10, then -14.9 lies 4.19 standard deviations below -10. But the distribution says we need it to be within 2.04 standard deviations. So again, we reject the null hypothesis.

We could also set this up as a hypothesis test (note: ß = the population slope):
H_0: ß = -10
H_A: ß ≠ -10

Now, most statistical software incorporates a default hypothesis that the population slope is zero.

All the data that is conveyed to you is based on the assumption that your hypothesis is that the population slope is 0


A slope of 0 means there's no relationship between the variables: if X changes, Y doesn't change (ß = ∆Y/∆X = 0 because ∆Y = 0). So that means the alternative hypothesis is that they ARE related.

So, statistically, what you're really saying is:
H_0: ß = 0
H_A: ß ≠ 0

If the two were unrelated, what would it look like? The dots would be scattered all over with no pattern.

If you were to run a regression on that, the line would be completely horizontal: y = a + 0x. So really, y = a.

But... how do you calculate a? It's the average of the values for y. So in our GMAT score v. height example, no matter what the height, your y-intercept would just be the average GMAT score.

BUT we don't know the population regression! We're reaching into the population, pulling out a sample, and making our observations. Our sample equation is going to be a little bit different. Perhaps our sample says GMAT = 500 + 0.08(height). But... what if it says GMAT = 580 – 1.2(height)? Which one is correct?

Well, knowing that the correlation is 0, we can say neither - they came from a population with a correlation of zero

Those numbers mean nothing - each is just a number that's a little different from the true slope. Which brings us back to our original problem: we don't know the true slope.

BLURGH! This is why our default hypothesis is that the population slope = 0.

In most applications, the purpose of the study is to see if X and Y are related.

If the evidence is strong enough, we can conclude that this hypothesis of no relationship is unlikely to be correct - that X and Y are related.

When you run Kstat, the regression output shows a t-statistic (called a t-ratio) based on an implicit assumption that the population slope = 0.

What does the t-ratio of -12.6812 mean? It tells us that a slope of -14.9 lies 12.68 standard deviations below zero.

That’s pretty good evidence that we should reject the hypothesis that Price and Quantity are unrelated.

Makes sense: if we decided -14.9 was too far from -10 to be plausible, then it's REALLY too far from 0.

Remember, 95% of sample slopes lie within 2.04 standard deviations of 0. If our number is 12 standard deviations, then that’s probably not a good assumption.


Significance is our P-value: the likelihood that you’d get a number that’s 12.68 standard deviations from 0. And the answer is virtually 0.

What's that beta-weight number? It's similar to the slope coefficient, except it measures changes in terms of standard deviations.

Every time you increase price by one standard deviation, the demand falls by 0.92 standard deviations

We could ALSO set this up as a one-tailed hypothesis test. Kstat assumes a two-tailed test with H_0: ß = 0; in the case of the price coefficient, that means the null hypothesis gets rejected if the coefficient is far enough from zero in either direction, positive or negative.

But c'mon, raising the price is never the secret to selling more - so a one-tailed test would be better.

So H_0: ß ≥ 0; H_A: ß < 0. In other words, our null is that there's either no relationship or a positive relationship. Because we're still relying on a hypothesized population slope of 0, the t-statistic is unchanged, but we'd rely on a one-tailed test via T.DIST.RT.

Recall that the function returns the area in the right tail, so the area to the left is 1 minus that. This says that the probability of getting a sample slope that lies that many standard deviations below zero, if the population slope were greater than or equal to zero, is virtually zero.

Autorama Case
So we've got our data in the data page; we run univariate statistics - what do we see?

Means (and ranges):
Income: 60,359 (18,900–101,300)
Price: 19,522 (5,100–19,650)

Now let's make a scatterplot, then run the regression. So what's our regression equation?

Constant: 5,788
Income: 0.2275
So P = 5,788 + 0.2275(Income)

That means each extra dollar of income is associated with a $0.22 increase in the price of the car. Or, in terms of something that's actually useful, each $1,000 increase in income is associated with a $227.50 increase in the price of the car.

So our sample regression suggests a positive relationship between income and car prices, but how confident are we that this is really the case?

H_0: income and car prices are unrelated
H_A: income and car prices ARE related
So,

H_0: Income slope coefficient = 0
H_A: Income slope coefficient ≠ 0


We’ll use a 5% significance test to render a conclusion (hey, it’s the default in stat)

Now then, Kstat tells us that with 98 degrees of freedom, 95% of the sample slopes should lie within 1.98 standard deviations of the hypothesized population slope of zero (that's the critical t-value). The t-ratio is 9.0759 - so, assuming the population slope is zero, a slope of 0.2275 is 9.07 standard deviations from zero. Moreover, our p-value says that the likelihood of getting a sample slope of 0.2275 when the population slope is 0 is almost 0%. Therefore, we conclude that car price IS related to income.

Now, what if we'd done a 1% significance test? Kstat gave us the 5% test, but how would you go about doing a 1% test? To find the critical value for a 1% two-tailed test, you want T.INV.2T.

Probability = 0.01; Deg_freedom = 98, and we get 2.63. That means 99% of the time the sample slope will be within 2.63 standard deviations of the hypothesized slope. And again, ours was still way off the map.
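The same 1% critical value in Python, as a sanity check on the Excel lookup:

```python
# Two-tailed critical t-value for alpha = 0.01 with 98 degrees of freedom.
from scipy import stats

alpha = 0.01
df = 98
print(stats.t.ppf(1 - alpha / 2, df))   # ~2.63
```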

BUT we know that 0.2275 is only our sample slope - in the sample, every $1,000 increase in income goes with people spending $227.50 more on cars.

Since that's only a sample slope and we don't know the population slope, let's get a confidence interval:

ß = sample slope ± (t-value corresponding to 98 degrees of freedom)(standard error of the coefficient), which gives 0.1778 ≤ ß ≤ 0.277.

So each $1 increase in income is associated with an increase in the price of the car the customer buys of between $0.1778 and $0.277, and each $1,000 increase leads to an increase of between $177.80 and $277.00.

Now, we can use our tools to construct a confidence interval around the point estimate (the average number of units sold when P = $50, or whatever).

To figure out the plus or minus part of the confidence interval, it should be 907 ± something

That something requires that we know the t-statistic and the standard deviation of the distribution - we call that the standard error of the estimate.

s_ŷ, aka the standard error of the estimated mean.

Introducing stat Prediction: the yellow boxes are places you can plug in numbers

So "value for prediction" is where you put in the price to get the point estimate of the quantity sold. The standard error of the estimate is reported to you. If you want a 95% confidence interval, you can do that - you can change the level too, but 95% is the default. It also gives us the t-statistic. Let's do our confidence interval: 907 ± (t-statistic)(standard error of the estimated mean).

What’s the difference between confidence and prediction intervals??


Confidence interval: the range of values within which the population conditional mean lies. In contrast, the prediction interval contains the values within which an observation selected at random might fall.

If you were to reach into that population and pull out an observation at random, what range would you expect to see? How do you calculate it? Using the standard error of prediction.

Now, the prediction interval is going to be wider - it makes sense if you look at the equation. So... if the point estimate is the same, why does it matter if the dots are scattered or clustered?

Well, that's where goodness of fit comes in. Remember, the point estimate is a conditional mean: on average, this is the value of the dependent variable when the independent variable takes on a given value. In other words, when P = $50, the number of units demanded will, on average, be equal to 907. BUT comparing the two scatterplots, which demand schedule will show greater variation around the mean?

The one where the dots are more scattered. Now, least squares regression will always find the best line - but simply looking at the equation doesn't tell you which equation is a better fit for the data.

Goodness of Fit
The most commonly used measure of goodness of fit is called the coefficient of determination.

AKA R-squared: it measures the percentage of the total variation in the dependent variable that can be explained by the variation in the independent variable. Say wha?

So the total variation is the difference between the actual number and the average. The explained variation is the difference between the point estimate (the fitted value) and the average. Going back to our definition: there is a variation of 156 between the actual number and the mean, and we can explain 150 of it.

But... we were still off by six units. That's the unexplained variation.

The unexplained variation might reflect other independent variables that influence the quantity but for which we don't have data. It could also be random error.

For example, are your monthly cell phone minutes exactly the same every month?

Sometimes you can explain a decrease: yeah, it’s low because I was in Europe for three weeks.But sometimes it’s just random.

BUT remember: the coefficient of determination is a percentage


So take the explained variation and divide by the total variation. Here, we get 96%

The coefficient of determination performs that operation on squared deviations: between each observation and the mean (total variation), and between each fitted value and the mean (explained variation).
R^2 = (explained variation across all observations)/(total variation)
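A tiny sketch of that calculation in Python, with made-up price/quantity data; the fitted values come from the least-squares line, and R^2 is explained variation over total variation as defined above.

```python
# R^2 as explained variation / total variation (hypothetical data).
import numpy as np

x = np.array([40.0, 45, 50, 55, 60])        # hypothetical prices
y = np.array([1100.0, 1030, 955, 900, 830]) # hypothetical quantities

b, a = np.polyfit(x, y, 1)    # least-squares slope and intercept
y_hat = a + b * x             # fitted (point-estimate) values

explained = np.sum((y_hat - y.mean()) ** 2)
total = np.sum((y - y.mean()) ** 2)
print(explained / total)      # the coefficient of determination
```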

And hey look: Kstat reports the coefficient of determination. So if it says 84.72%, what does that mean?

84.72% of the variation in quantity can be explained by the variation in price.

What does this NOT mean? It does NOT mean that your prediction for the quantity will be correct 84.72% of the time.

Now then, R^2 ranges between 0 and 1. R^2 = 1 means the independent variable can explain 100% of the variation - in other words, every dot in the scatterplot is exactly on the regression line.

R^2 = 0 means a variation in the independent variable explains nothing in terms of the variation in the dependent variable.

So think about our GMAT and height relationship: changing the height explains none of the variance between the mean GMAT and the observed score.

Now, R^2 must be adjusted for degrees of freedom. Let's go back to our GMAT = a + b(Height). Suppose we only had two observations: (470, 72) and (626, 61). Well, our R^2 would be 1! Duh. That's because, mathematically, if you have only two points you don't have any degrees of freedom, so you're going to get a perfect R^2 of 1 - both observations lie exactly on the line. Now if we added a third observation, it's less likely that all three are exactly on the line, and we have one degree of freedom. As you add observations, R^2 tends to shrink. That's why we rely on the adjusted R^2: the raw R^2 is biased.

We account for the number of degrees of freedom by using the adjusted coefficient of determination

It’s generally going to be a little less than the regular R^2 (makes sense when you think about it)

Now, don't be deceived by a high R^2!!! It may not always be indicative of useful information. Ex) You're a meteorologist and predict rainfall. Which model will give you the highest R^2?

1. Rainfall = a + b(Dew Point)


2. Rainfall = a + b(Relative Humidity)
3. Rainfall = a + b(# people with umbrellas)
Well, obviously 3 is going to give you the highest R^2, but that's not really helpful for predicting rainfall, is it?

Back to our data set. What if you truncate the data by lopping off three zeros? Sure, go ahead - just be sure to put them back on when you're interpreting your results; but when you're plugging values back into the equation, keep the zeros off.

Newspaper Case
Startup costs are $2M to add a Sunday edition; fixed operating costs are $1M per annum. The company needs to forecast Sunday circulation to determine if the investment is worthwhile.

Break-even = 260,000. What does our adjusted coefficient of determination of 86.08% mean?

That means that 86.08% of the variation in Sunday circulation can be explained by daily circulation.

Thus, in order to know our projected Sunday circulation, we need to know our daily circulation

So let's say it's 190,000. BUT watch out: plug it into your equation in the same form you entered the data (so here, lop off those three zeros). And so you plug it in and get a great number - but remember, that's just a sample estimate.

Take a look at your confidence and prediction intervals: they show just how risky this is.

Adjusted coefficient of determination: 86.08% of the variation in Sunday edition circulation can be explained by knowing the Daily edition circulation. Equation: Sunday = 24.76 + 1.35(Daily). So... what do we project for Sunday when Daily is 190,000?

First of all, you have to plug in 190, because we divided all the data points by 1,000 when we entered them. So we plug in 190 and get 281.486, which is really 281,486. Well, our point estimate is 281,486 and our break-even is 260,000. So that means we're good, right? Maybe.

281,000 is our average sample circulation, based on our estimated sample slope: on average, if daily circulation is 190, Sunday circulation will be about 281. The reality is, the actual number could be different. And our 95% confidence interval says the true conditional mean is between 214 and 348.

Now then. A confidence interval is trying to predict the true conditional mean. A prediction interval takes it one step further: it says, I’m not interested in what the average newspaper circulation is, I want to know what MY newspaper is going to get. If I were to reach into this distribution, what range of values am I likely to see?


It'll be bigger, because you're looking for a single number, not an average, so the range will be greater. And you plug that in and get -18 and 581 (or rather 0 ≤ newspapers ≤ 581).

Eesh. So that’s ugly. What if you tried 90%? That’s a narrower interval, and still pretty good confidence.

We can also set this up as a hypothesis test.
H_0: the Sunday edition loses money
H_A: the Sunday edition does not lose money
But let's refine that even further - if we need 260 and daily circulation is 190, let's make our hypotheses:

H_0: with a daily circulation of 190, the average newspaper will not sell enough Sunday papers to make a profit
H_A: with a daily circulation of 190, the average newspaper WILL sell enough Sunday papers to make a profit
OR

H_0: given x = 190, y ≤ 260
H_A: given x = 190, y > 260

Now we have our hypotheses. We want to know how far our point estimate of 281 is from 260. We have to measure this in terms of standard deviations, so we need a t-statistic

t = (point estimate – 260 (break-even)) / (standard error of the estimated mean)

SO: t = (281.486 – 260)/33.156 = 0.648.

That means our number is 0.648 standard deviations above 260. How do we convert that to a probability? Use T.DIST.RT.

Why a one-tailed test? Because our null hypothesis is that Sunday sales are less than or equal to break-even; a two-tailed test would also treat selling far MORE than break-even as grounds for rejection, which isn't the question we care about. So we plug it into the box: t-value = 0.648; Deg_freedom = 33 (35 observations – 2 coefficients = 33).

And we get 26% - there’s a one in four chance that you’d get a sample like this if Sunday papers lose money. For us to feel really good about this, our p-value would have to be less than five percent - it’s not even close to that.
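The same right-tail calculation in Python, using the numbers above:

```python
# p-value for the break-even test: point estimate 281.486, break-even 260,
# standard error of the estimated mean 33.156, 33 degrees of freedom.
from scipy import stats

t_stat = (281.486 - 260) / 33.156     # ~0.648
print(stats.t.sf(t_stat, df=33))      # ~0.26, matching the 26% above
```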

But - this result refers to the average newspaper whose daily circulation is 190. What about OUR newspaper? What are the odds that one of those newspapers, drawn at random, will make a profit? That’s going to change our hypothesis test a bit.

So: given daily = 190, what is the probability that one observation drawn at random would be profitable?
H_0: with a daily circulation of 190, an individual newspaper will not sell enough Sunday papers to make a profit


H_A: with a daily circulation of 190, an individual newspaper will sell enough Sunday papers to make a profit
H_0: given x = 190, y_i ≤ 260
H_A: given x = 190, y_i > 260

Our t-value here is going to be different. Same numerator (point estimate minus break-even), BUT because we're pulling out a single observation, the denominator is the standard error of prediction.

So t = (281.486 – 260)/147 = 0.146. That means our number lies 0.146 standard deviations above 260. Let's find the odds of getting a number like that: we plug it into T.DIST.RT and get 44% - the chance that our sample would say this makes money when in fact it doesn't.

We cannot reject the null hypothesis. What have we done?

We need to sell 260 to profit.
Based on a daily circulation of 190, our point estimate is 281 - on average, that's how many Sunday papers you'd sell.
90% of the time, a random paper with Daily = 190 would have a Sunday circulation between 31 and 531.
The odds of getting a point estimate like this if the average paper loses money are 26%.
The odds of getting it if an individual paper loses money are 44%.

Multiple Regressions

Practice Set
Economic theory assumes that the quantity demanded depends on more than just price.

When we focus on price, we're assuming that those other factors are constant. A demand curve focuses on the relationship between price and quantity demanded, with all other factors held constant! Now, suppose a major competitor raised his price - how will that affect the demand for your good?

It’ll shift your curve to the right - you’ll likely sell more at the same price.

So this is our demand curve - but is it a single demand curve, or two demand curves? Well, let's circle the quantities sold at our prices when the competitor was charging $50 (spoiler: it's the lower line). In other words, when the competitor charged $50, our demand curve was the lower line, but when the competitor raised his price to $60, our demand curve shifted to the right, causing us to sell more. And look, their slopes are even roughly comparable. BUT initially we didn't report the competitor's price - all regression sees is the price we charged and the quantities we sold. Regression found a single line to best fit the data: right down the center.


So the regression line we had splits the difference. If you were to do a bunch of point estimates, you'd get points along that single regression line. Now, based upon our regression equation, if we charged $40 we could expect to sell 112. BUT... if we account for the fact that our competitor is charging $50, our point estimate is going to be too high - we're overestimating what we'll actually sell. Likewise, if our competitor is charging $60, we'll be underestimating what we'll actually sell. What's the point? We didn't tell the computer that there's another variable involved, so the regression hands us a number that we can pretty much guarantee isn't the best number.

SO: if we're systematically overestimating when the competitor charges $50 and underestimating when the competitor charges $60, we have a problem. When we look at our data, we can see that when the competitor charges $60 we really did get too low an estimate, and too high an estimate when the other guy charges $50.

Blurgh it all? Not so fast. Thus far, we've used a single independent variable in our equation - simple linear regression: Q = a – bP. But if there's more than one variable that's important, we need to include more than one variable in our regression.

Using only one variable when you should use two results in an unreliable demand curve. Good grief! That can't be right! Failure to include an important independent variable is referred to as omitted variable bias.

Enter multiple regression. If a $10 increase in the competitor's price led to a 10-unit increase in unit sales, how much would unit sales increase if the competitor's price increased by $5?

Well, logic would say 5 units. And a $1 increase would lead to a 1 unit increase.

Basic equation: Q = a – b1(P) + b2(CP)
b1: the amount our quantity changes when our price changes
b2: the amount our quantity changes when our competitor's price changes
When we run the regression on our data set, we get:

Q = 115 – 1.38P + 0.94CP
And what does our adjusted R^2 mean?

Given 17 degrees of freedom (17 because we now have three coefficients), 96.3% of the variance in quantity demanded can be explained by knowing the price we charge and the price the competitor charged. What was the R^2 when we only included our own price?

78%! Wow! We just added an additional 18.3% of explained variation. What's the other 4%?

It might involve another variable, or it might just be random variance.


Okay. How do we do this in Excel? Data Analysis, then Regression; the Y variable is your dependent variable.
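For comparison, a rough Python version of the same multiple regression. The observations here are invented; the class data produced the fitted equation Q = 115 – 1.38P + 0.94CP reported above.

```python
# Multiple regression of quantity on our price (P) and competitor price (CP).
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical observations standing in for the class data set.
df = pd.DataFrame({
    "P":  [40, 42, 44, 46, 48, 50, 40, 42, 44, 46, 48, 50],
    "CP": [50, 50, 50, 50, 50, 50, 60, 60, 60, 60, 60, 60],
    "Q":  [105, 103, 100, 97, 94, 92, 115, 112, 110, 106, 104, 101],
})

model = smf.ols("Q ~ P + CP", data=df).fit()
print(model.params)        # intercept and slope coefficients
print(model.tvalues)       # t-ratios against a hypothesized slope of 0
print(model.pvalues)       # p-values
print(model.rsquared_adj)  # adjusted R^2
```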

Now, we get all these lovely slope coefficients that say when price goes up, quantity sold goes down. Well, one thing we might want to test is whether it's really true that we have a negative slope. So Excel gives you a t-statistic that assumes what you're trying to test is whether the two variables are related to each other - it shows how far the coefficient is from 0.

And if you look at ours, we see t-statistics of 17, -19, and 9.5 - so it is EXTREMELY unlikely that there is no relation. And look at the P-values - almost 0% chance that there’s no relation.

Here, this seems like common sense - but in other applications, it might not be so intuitive. That’s why multiple regression and t-statistics and p-values are so helpful - we can see if the variables are truly related to quantity or whatever it is you’re analyzing.

Qualitative Variables

We know that regression calculates coefficients based on numerical data: quantity, price. BUT location isn't a number - it's one place or the other.

What if we were to ignore location and estimate Q = a – b1(Price)? Not a good idea - the regression line would just split the difference between the two demand curves. That would be omitted variable bias.

Not good. What should we do then? We will assign the locations numbers (called dummy variables).

First: what not to do is give Gilbert a 1; Scottsdale a 2 and run a regression

Nope, can’t do that - regression is going to scale the numbers, meaning 2 x Gilbert = Scottsdale

Okay, here’s what you DO do: Select an attribute.

Ex) The attribute is "store is in the Gilbert location." So you go down and assign a 1 to every observation that has the attribute, and a 0 to every observation that does not have the attribute (Scottsdale).

Now then we get an equation: Q = 11,721 – 629PRICE – 1833LOCATION

How do you estimate the quantity demanded in Gilbert if P = $13?

Plug in a 1 for Gilbert (remember, that was the original input)


So Q = 11,721 – 629(13) – 1833(1) = 1,711.
And for Scottsdale?

Plug in a 0 for location:
Q = 11,721 – 629(13) – 1833(0) = 3,544
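Those two point estimates, written out as a tiny Python helper:

```python
# Point estimates from the dummy-variable equation
# (LOCATION = 1 for Gilbert, 0 for Scottsdale).
def predict_q(price, gilbert):
    return 11721 - 629 * price - 1833 * gilbert

print(predict_q(13, 1))   # Gilbert:    1711
print(predict_q(13, 0))   # Scottsdale: 3544
```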

So this is in accordance with our scatterplots: we see that we are in fact selling less in Gilbert than in Scottsdale

How much less? 1833 less. Note that when we estimated sales in Scottsdale, we plugged in a 0 for location. Because we plugged in 0 for the dummy variable, we got the same result as if we had used the formula Q = 11,721 – 629PRICE

So what must be true about "Q = 11,721 – 629PRICE"? In essence, this is the regression line for Scottsdale! The –1833(LOCATION) term is the adjustment you make for Gilbert: you're always subtracting 1,833 from the Scottsdale point estimate. So the dummy coefficient means that, all else being equal, 1,833 fewer units are sold at the Gilbert location than at the Scottsdale location.

More generally, it compares the observations that have the attribute with those that do not have the attribute.

So REALLY the equation is Q = 11,721 – 629PRICE – 1833GILBERT

BUT are unit sales really lower at the Gilbert location? After all, we're relying on sample data - we don't know about the population... To test whether unit sales at the Gilbert location are different from Scottsdale's, we need to run a t-test. The t-statistic measures how far your sample slope lies from the hypothesized population slope, measured in standard deviations.

Now then, most statistical software assumes a hypothesized population slope of zero. In other words, we're asking whether the location coefficient is different from zero - whether location is related to unit sales at all. So we have 9 degrees of freedom (12 observations minus the 3 coefficients that must be estimated), and the output gives us a t-statistic for computing 95% confidence intervals.

t-statistic = 2.262. That means that if the true population slope were zero (no difference between the locations), then 95% of the time the sample slopes would be within 2.262 standard deviations of 0. And look at the t-ratio reported under the regression: we have -11.2574. That means that if the population slope is zero, getting a slope of -1,833 is 11.2574 standard deviations below zero. And look at the significance (that's really the p-value): it's 0.0001%. Those are our odds of drawing that sample slope if the population slope is zero - if location in fact does not matter, the odds of getting a number that lies more than 11 standard deviations below zero are 0.0001%.

So this is very powerful evidence to reject the null hypothesis that Gilbert and Scottsdale are equivalent. And what if we'd reversed how we coded the dummy variable?

Well, then the slope would be +1,833 and it would be the same number of standard deviations above 0.

With reversed coding, the default part of the equation (a – 629PRICE) would be the demand curve for Gilbert, and you'd add the 1,833 to get the Scottsdale number.

BUT the intercept is also going to change. Think about it: now the default is Gilbert (location = 0), and you adjust from Gilbert to get to Scottsdale. We know that Gilbert's intercept is 1,833 less than Scottsdale's, so the intercept will be 9,888, not 11,721!

BUT does it make a difference? Nope - try calculating the Gilbert location at P = 13 with the dummy = 0: Q = 9,888 – 629(13) + 1833(0) = 1,711. And Scottsdale? Q = 9,888 – 629(13) + 1833(1) = 3,544.

Amazing! Math! Numbers! Reversing the coding doesn't change the point estimates; it only changes the identity of the default attribute.

Now then, what if we created a GILBERT variable (1 = Gilbert, 0 = not Gilbert) and a SCOTTSDALE variable (1 = Scottsdale, 0 = not Scottsdale)?

What equation would you get? Q = a – b1P + b2GILBERT + b3SCOTTSDALE

And when you plug it in? ERROR! Why?

"It's likely that one of the variables in your model is perfectly dependent on one or more of the others. Eliminate that variable from your model and try again."

So what does that mean? Basically you’re adjusting Scottsdale for Gilbert and then adjusting right back. “Perfect multicollinearity”: perfectly correlated

So if you picked an observation at random from the data and I told you the value for PRICE, you wouldn't know with certainty the corresponding value for GILBERT or SCOTTSDALE. But if you knew that GILBERT was 1, you'd know with certainty that SCOTTSDALE was 0 - the two have a perfect inverse correlation. Now then, originally we had a single variable in the equation: all Gilbert observations got 1, all non-Gilbert observations got 0.

That gave us an equation with the demand for Scottsdale plus an adjustment for Gilbert. So you don't need a dummy variable for each location, just one - whichever one you don't choose is the default.

In other words, the default portion of the equation picks up observations that aren’t coded as 1.

By trying to run a regression with each location, who’s the default group?

Q = 11,721 – 629PRICE: what does that refer to?


Well it doesn’t refer to anything - we got rid of the default group! It’s everyone who’s neither Scottsdale nor Gilbert - but there is nobody else!

This is the dummy variable trap: creating a separate dummy variable for each possibility leads to perfect multicollinearity.

You MUST leave out one of the options so it will be picked up in the default portion.

BUT now let's imagine we have THREE locations. Shoot. In our previous example, if we had a Gilbert variable, then a 0 was by definition Scottsdale. But now what does "not Gilbert" mean? It means Scottsdale or Glendale. So let's say we had a dummy variable for all three locations and attempted to estimate Q = a – b1PRICE + b2GILBERT + b3SCOTTSDALE + b4GLENDALE.

Well, smart one, you'd have fallen into the dummy variable trap again - perfect multicollinearity! You need to leave out one of the options; then it can get picked up in the default portion of the equation.

So say we include dummies only for Gilbert and Scottsdale: Q = a – b1PRICE + b2GILBERT + b3SCOTTSDALE.

Q = a – b1PRICE is our Glendale value. What does the Gilbert coefficient tell us?

Well, that's the adjustment you make to the default group (Glendale) to get Gilbert - it compares Gilbert to Glendale. And this makes sense: for a Gilbert observation you take the default (Glendale) and add the Gilbert adjustment, plus the Scottsdale adjustment times 0, which is 0.

And the Scottsdale coefficient? That compares Scottsdale with Glendale.

So we go through and make our assignments: GILBERT = 1 for Gilbert stores, SCOTTSDALE = 1 for Scottsdale stores, and both 0 for Glendale. So we get the equation:

Q = 11,919 – 639PRICE – 1918GILBERT – 77SCOTTSDALE
And what does R^2 mean? 97.08% of the total variation in quantity can be explained by price and store location.
Is there a significant difference between unit sales in Glendale and Gilbert?

The t-statistic for Gilbert is -14.24, meaning the sample slope for Gilbert is 14.24 standard deviations below zero, so the significance/p-value is nearly zero - if the two locations were really the same, the odds you'd reach into the population and get a slope like that are close to zero.

So this is very, very powerful evidence to reject the null hypothesis that they’re the same

Okay, now Glendale v. Scottsdale: the t-statistic is -0.55, i.e., 0.55 standard deviations below 0. Well, that's not far at all. If Scottsdale and Glendale were basically the same and you ran regression after regression, 95% of your sample slopes would be within 2.14 standard deviations of 0.


And this number is within 2.14 - in fact, the odds of getting such a slope are 58.9%. So here, we fail to reject the null hypothesis.

BUT what if we omitted Scottsdale and added a Glendale variable? Well, then Scottsdale would be the default. But our intercept would change: it goes from 11,919 to 11,919 – 77, or 11,842. Q = 11,842 – 639PRICE

But what would the coefficient for Gilbert be? Well, if the Glendale-to-Gilbert difference is 1,919, then the Scottsdale-to-Gilbert difference would be (Glendale – Gilbert) – (Glendale – Scottsdale), or about 1,840. Now we're comparing everything to Scottsdale.

HW: Part (a) of problem CE 2.

Standard error of the estimate: the standard deviation of the estimates, which in this case is 145. Standard error of the coefficient: the standard deviation of the slope coefficients.

Now then, the difference between the actual value of the dependent variable and the predicted value is called the residual: basically, how far off is your value from the real number

Y_i – Ŷ_i = residual (Y_i is the actual number; Ŷ_i is the estimate). Okay. So you're at the residuals tab; you sort from lowest to highest.

We see that at the highest prices we overestimated and at the lower prices we underestimated. So you have a pattern in your residuals - and that's bad; you want your residuals to be random. If you make Scottsdale your default, you get the opposite pattern: overestimating at low prices and underestimating at high prices. Why? Because we forced parallel lines. All we did was tell the computer to generate a different intercept; we didn't tell it to look at the slopes of the lines.

So... our dummy variable for location dealt with the different intercepts - our equation was Q = 11,721 – 629P – 1833GILBERT. But clearly we have a difference in slopes - how do we adjust for that? We need a coefficient to adjust the slope.

We call these interaction terms (the book calls it a slope dummy variable): create a variable in which you multiply the value of the dummy variable by the variable whose slope differs. In this case, the new variable is GILBERT × PRICE. So our equation is:

Q = 11,721 – 550PRICE – 1833GILBERT – 70(GILBERT × PRICE)
So the Gilbert slope is -550 – 70 = -620, and Gilbert demand is Q = 9,888 – 620PRICE. This isn't perfect, but it reduces bias - the systematic over/underestimates are gone now. So what does this mean? The first part of the equation, the default demand curve, is the same as before: that's Scottsdale - Scottsdale didn't get a 1 for the dummy variable, so it's the default. Then "-1833GILBERT" is how you adjust the intercept to get the Gilbert intercept.
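Written out in Python, the dummy shifts the intercept and the interaction term shifts the slope:

```python
# Point estimates from the interaction-term equation
# (GILBERT = 1 for the Gilbert store, 0 for Scottsdale).
def predict_q(price, gilbert):
    return 11721 - 550 * price - 1833 * gilbert - 70 * gilbert * price

# Scottsdale (GILBERT = 0): Q = 11,721 - 550P
# Gilbert    (GILBERT = 1): Q =  9,888 - 620P
print(predict_q(13, 0), predict_q(13, 1))
```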


The interaction coefficient of –70 adjusts the slope. Why do we do the dummy variable system?

Why not just separate out Scottsdale and Gilbert? Because we get better results with more observations. More efficient estimates - and the same equation.
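
For anyone replicating this outside of KStat, here is a minimal sketch of the same dummy-plus-interaction setup in Python with statsmodels. The file name and column layout (QUANTITY, PRICE, STORE) are assumptions for illustration, not the actual class data set.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data layout: one row per observation with the store name spelled out.
df = pd.read_excel("store_sales.xlsx")   # assumed file; columns: QUANTITY, PRICE, STORE

# Build the dummy variable by hand: Scottsdale is the default (dummy = 0).
df["GILBERT"] = (df["STORE"] == "Gilbert").astype(int)

# Interaction term: the dummy times the variable whose slope differs.
df["GILBERT_PRICE"] = df["GILBERT"] * df["PRICE"]

# Q = a + b1*PRICE + b2*GILBERT + b3*GILBERT_PRICE
model = smf.ols("QUANTITY ~ PRICE + GILBERT + GILBERT_PRICE", data=df).fit()
print(model.summary())

# Gilbert's own demand curve: add the dummy to the intercept and the
# interaction coefficient to the slope.
b = model.params
print("Gilbert intercept:", b["Intercept"] + b["GILBERT"])
print("Gilbert slope:    ", b["PRICE"] + b["GILBERT_PRICE"])
```

Running everything in one pooled regression like this is exactly the "more observations, more efficient estimates" point above: the two stores share one error estimate instead of being split into two small samples.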

Head-Hunting Agency Case Study

Facts

CEO Seek advertises that it can find a suitable candidate within 15 days or the service is free. Our job is to determine if the premise is feasible. You suspect that it's harder to find a suitable CEO for a large firm than for a smaller firm.

Issues Presented
Is the size of the firm related to the number of days needed to find a suitable candidate?
What would you recommend re the 15-day guarantee?
Does the search for a CEO differ from the search for lower-level managers in terms of the number of days?

Data
48 observations, each for either a CEO search or a lower-level management search.
Each shows the number of days required to find a candidate.
Each observation shows the size of the client firm.
So. Data set: Headhunting.xls

Days: number of days to find a candidate
Size: number of employees (measured in hundreds)
LOWConst: lower-level manager = 1; CEO = 0
LOWSlope: interaction term (size of firm multiplied by the dummy variable)

Regression: Days v. Size. So the coefficient for Size says 0.00597.

∆DAYS/∆SIZE: for every 100-employee increase, you add about 0.006 days to the job search.

But we also have to look at the t-statistic and P-value. If there were no relation between firm size and the days it takes to fill the position, the slope would be 0. So look at the t-ratio: 0.29 standard deviations from zero. And the P-value tells us that if there were no relation between the two, the odds of reaching into the population and pulling out a sample that shows this much of a relationship are 76%. So we fail to reject the null hypothesis that there is no relation between the two.

But do a scatterplot!

So that's why we got that tiny slope! The regression line split the difference!

Better throw in a dummy variable, huh?

Regression with Dummy Variable for CEO and Manager

DAYS = 18.82 – 0.005(SIZE) – 8.935(LOWCONST). So for every 100 more employees, the number of days required goes down by 0.005.


But what does R^2 mean again? 72.29% of the variance can be explained by knowing the firm size and whether you're searching for a CEO or a manager.

What about the t-statistic that corresponds to the SIZE coefficient (KStat calls it the t-ratio)? 0.4899.

So pretty small. And our P-value is 62.53%, meaning that if there is no relation, then 62.53% of the time we'd get numbers that show this kind of relationship. And so we would fail to reject the null.

And what about the dummy variable coefficient? What does that mean (-8.93)

If you’re searching for a lower level manager, then it takes 8.9 fewer days to do the search

And what about the t-ratio of -15.79? What are the odds that we'd get a number like 8.93 if the true difference were 0?

15.79 standard deviations below zero. There's virtually no chance you'd get a number like this if it were the same for both groups.

So reject the null hypothesis that there’s no relation between length and whom you’re searching for.

Regression Using Interaction Terms

Now we get the equation:

Q = 13.17 + 0.088(SIZE) – 1.02(LOW) – 0.17(LOW x SIZE). So how many days does it take to fill a CEO position?

Days = 13.17 + 0.0887(SIZE). Now then, eyeballing our scatterplot, can we see a difference between the intercepts? Hard to say.

But what about the slopes? Yeah, clearly different. Let’s look at the statistical results

Is the intercept for lower-level management different than for CEOs? The coefficient we got was -1.02 - that means it would take about one less day to fill a lower-level position than a CEO position. Any significance to that?

The null hypothesis is that the true coefficient is 0. The t-ratio says this is 1.42 standard deviations from 0, and the P-value says there's a 15.88% chance of getting this number if the null were true.

And the slope adjuster for CEO/manager? We have a t-statistic of 12.54 and a P-value of 0.0000%. So clearly there is a relationship - we reject the null hypothesis. And that makes sense, given the scatterplot - the difference in slopes was really obvious. We couldn't really tell the difference between the intercepts just by eyeballing it, but this one was obvious.

So what's the equation for lower-level searches? DAYS = 12.15 – 0.105(SIZE)

First, you adjust the intercept: 13.17 – 1.02(LOWConst), so we get 12.15. And then the slope? You take the CEO slope of 0.0887 and add the LOWSlope adjustment, which gives –0.105.

Now go to Univariate Statistics.


Now we're going to develop a point estimate and 95% confidence and prediction intervals for firms of 2,000, 5,000, and 9,000 employees. Our predicted values are 14.95, 17.61, and 21.16 days (for CEOs). What about management?

10.62, 8.33, and 5.28, respectively. Conclusions: the larger the firm, the longer the search for a CEO.

The larger the firm, the shorter the search for lower-level managers. The 15-day guarantee works for lower-level managers, but not for CEOs.
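
A minimal sketch of reproducing those point estimates and 95% confidence/prediction intervals with statsmodels, assuming the Headhunting.xls layout described above (the exact numbers will depend on the real data):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_excel("Headhunting.xls")  # assumed columns: Days, Size, LOWConst, LOWSlope

# Days = a + b1*Size + b2*LOWConst + b3*(Size x LOWConst)
model = smf.ols("Days ~ Size + LOWConst + LOWSlope", data=df).fit()

# Firm sizes of 2,000, 5,000, and 9,000 employees; Size is measured in hundreds.
sizes = [20, 50, 90]
new = pd.DataFrame({
    "Size": sizes * 2,
    "LOWConst": [0, 0, 0, 1, 1, 1],   # first three rows = CEO, last three = manager
})
new["LOWSlope"] = new["Size"] * new["LOWConst"]

# summary_frame() gives the point estimate plus 95% confidence and prediction intervals.
pred = model.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```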

Vocabulary

General Terms

Standard deviation: rarely used here, because it describes a full population.

Standard error of the coefficient: standard deviation within your sample coefficient

Sometimes called the Std. Dev. of the Sample or the Std. Error of the Sample.

t-ratio: distance that the coefficient is from a hypothetical slope of 0, measured in standard deviations (Coefficient/Standard Error of the Coefficient)

Specifically, it tells you where the coefficient lies relative to zero, in terms of standard errors of the coefficient.

Significance: exact same thing as the t-ratio, except measured in probability

Coefficient of determination: how much of the variation in the dependent variable can be explained by variation in the independent variable(s); it doesn't take degrees of freedom into account. Adjusted coefficient of determination: the same idea, but adjusted for degrees of freedom.

AKA R^2 (i.e., KStat is weird; everyone else calls it R^2). T-statistic: the number you would use to create the upper and lower bounds of a confidence interval: (sample mean – population mean)/standard error.

Say you want to test whether the true population mean = 1. Sample mean = 0.33, st. dev. = 0.005, df = 129, 5% significance test.

Ho: pop mean = 1. Ha: pop mean ≠ 1. If the t-statistic you generate is less than the one in the t-table, you fail to reject the null hypothesis. SO: t-statistic = |0.33 – 1|/0.005 = 134. The t-table for roughly 100–129 df says 1.984. SO. If your calculated t-statistic is less than the appropriate t-statistic from the t-chart, then you fail to reject the null; here 134 is far bigger than 1.984, so you reject.
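
As a quick sanity check on that arithmetic, here is a small sketch of the same test (numbers taken from the example above; the critical value comes from the t-distribution rather than a printed table):

```python
from scipy import stats

sample_mean = 0.33
hypothesized_mean = 1.0
standard_error = 0.005   # the "st. dev." figure used in the example above
df = 129
alpha = 0.05             # two-tailed 5% test

t_stat = abs(sample_mean - hypothesized_mean) / standard_error   # = 134
t_crit = stats.t.ppf(1 - alpha / 2, df)                          # about 1.98 for 129 df

print(f"t-statistic = {t_stat:.1f}, critical value = {t_crit:.3f}")
print("Reject H0" if t_stat > t_crit else "Fail to reject H0")
```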

Standard error of the mean: Standard Errors...

Standard error of the regression: the standard deviation of the individual observations around the regression equation.

It's what you use to build confidence intervals. Standard error of prediction: incorporates the unknowns and makes the interval a bit bigger.


Standard error of the mean: when in doubt, use this. How to build a confidence interval for a point estimate. Say you don’t want a 95% confidence interval, you want a 90% confidence interval.

Given: coefficient = 0.235, st. error = 0.97, 100 df, 90% confidence interval. Between this and our t-chart, we have everything we need to build a 90% confidence interval.

We’re hypothesizing that the center of our bell curve is .235; we want to know the upper and lower bounds that embrace 90% of that bell curve

Look at the confidence level at the bottom of the t-table. So t-stat = 1.66, and you use that to convert your standard error into the distance in both directions: Coefficient ± (t-statistic from chart)(standard error).

= 0.235 ± (1.66)(0.97). Given: coefficient = 3.5, std. error = 1.2, df = 30, 95% confidence interval.

= 3.5 ± (2.042)(1.2). If your standard error gets smaller, your window will shrink; same if you decrease your confidence level, because you're looking for something quite narrow: "50% of the time it will fall between these numbers" is more specific than "99% of the time it will fall between these numbers." Use the standard error of the coefficient.
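
A minimal sketch of that confidence-interval recipe, with the t-value pulled from the distribution instead of a chart (the two examples above are plugged in as a check; assumes scipy is available):

```python
from scipy import stats

def coef_conf_interval(coef, std_err, df, confidence=0.95):
    """Coefficient ± t * (standard error of the coefficient)."""
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    return coef - t_crit * std_err, coef + t_crit * std_err

# 90% interval for coefficient 0.235, std. error 0.97, 100 df (t is about 1.66)
print(coef_conf_interval(0.235, 0.97, 100, confidence=0.90))

# 95% interval for coefficient 3.5, std. error 1.2, 30 df (t is about 2.042)
print(coef_conf_interval(3.5, 1.2, 30, confidence=0.95))
```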

Multicollinearity

Coefficient t-stat = 0.75; variance inflation factor = 25; 0.05 significance level; 30 df. The variance inflation factor speaks to the coefficient it's related to. So go back to the coefficient and apply the square root of your variance inflation factor: √25 = 5, so the t-statistic could have been as much as 5 x 0.75 = 3.75 without the multicollinearity.

Do an analysis of variance. First column: base equation. Second: extended. Third: the difference between the two. A significance level less than 5% causes you to reject the null hypothesis that the added variables have no explanatory value.

Time Series

Based on the assumption that historical trends continue.

That means you have to ask if time is an element. Price and quantity: time is irrelevant. Sales over time: yeah, time matters.

Convert all your time periods into consecutive periods. Choosing the best model: whichever model gives the smallest residuals in the most recent periods.

Spurious Correlations

So we have this weight/CPI regression: Weight = 70 + 0.54(CPI), significance = 0.000%. So inflation is CLEARLY responsible for weight... These are Type I errors.

So let’s look at the refrigerator again. Demand Curves: You may be trying to estimate a demand curve and then find that when price goes up, demand goes up


But the cause could be that demand is just plain higher. Or maybe the competition raised its prices more.

Beware reverse causation: fire trucks cause fires. Remember: regression doesn't measure causation, merely correlation.

So as an example, chief manager of a wine company believes that preferences for brands of wine are related to household incomes

Sales A refers to Almaden. Run a regression in the form Sales A = a + b(Income A). So we run a regression on our data and come up with the formula Sales A = 3 + 0.50(Income). BUT income and sales are in thousands of dollars.

.50 means for every $1000 increase in income, sales increase by $500. Each $1 increase in income increases Almaden sales by $.50.

And for Bianco wine: Sales B = a + b(Income B). And look, we get the same equation: Sales B = 3 + 0.5(Income B). So it's about the same as A. But look at the scatter plot:

But... that's not a line, that's a curve. But we told regression to find a line, so that's what it did.

Regression doesn’t care what the scatterplot looks like: it just tries to find the best line to fit the data, regardless of whether the data falls into that pattern!

So how do nonlinear relationships affect your estimates? Click on the KStat tab that says Residuals.

So we see that for the first two (lowest) incomes, the model predicts too high. For middle incomes, the model underpredicts. And then the final two incomes are overestimates again. So our residuals (the difference between estimated and observed) have a pattern: low-HIGH-low.

Now then, if the regression line were passing through the center of the scatter plot, there would be no pattern to the residuals, but here we have a clear, nonrandom pattern. So what's responsible for this pattern? Well, we need to add a quadratic:

Sales = -5.999 + 2.78(Income) – 0.13(Income²), so ∆y/∆x = ∆Sales/∆Income = 2.78 – 0.26(Income).

As income rises, the relationship between income and sales gets smaller. Instead of a straight line, it goes up, then goes down: the relationship between sales and income eventually becomes negative. And if you look at the residuals, we can see that our point estimates are spot on. The inclusion of that quadratic term has fit a curve to nonlinear data.
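
A minimal sketch of fitting that quadratic outside of KStat; the file and column names (Income, SalesB) are assumptions for illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wine_sales.csv")   # assumed columns: Income, SalesB (both in thousands)

# Straight line for comparison, then the quadratic: Sales = a + b*Income + c*Income^2
linear = smf.ols("SalesB ~ Income", data=df).fit()
quad = smf.ols("SalesB ~ Income + I(Income**2)", data=df).fit()

print(quad.params)           # something like a ≈ -6, b ≈ 2.78, c ≈ -0.13 per the notes
print(quad.resid.round(2))   # the residuals should now look random, not low-HIGH-low
```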

Phew. Now then, nonlinear data can usually be spotted with a scatterplot. But it may not be so easily spotted if you're using multiple regression (if you have more than two dimensions in your graph). A more reliable means of spotting nonlinearity is through a residual plot.


So going back to our Sales B = 3 + 0.5(Income B) line. We know that's no bueno. BUT we can then plot the residuals:

So not only did we plot the residuals and see a pattern of negative, positive, negative, but the size of the residuals follows a pattern: so that’s pretty clear evidence that we’re trying to fit a line through nonlinear data. And compare to Almaden wine: the dots seemed to follow a linear pattern, line passed through the data. And look what happens when you plot the residuals: They’re all over the place! No pattern to the residuals.

This is good: we want there to be no pattern to residuals - the only way to get a better number is with a crystal ball.

And what about Casarosa wines? Let's run a regression: again, we get Sales C = 3 + 0.5(Income C). So the line is the same as A and B... but there's probably something funky going on with the scatterplot. Well... less funky and more an almost perfect correlation with a single outlier.

And look: that outlier tilts the line up. At the really low levels, we’ll underestimate, and then we’ll overestimate moving on, with the exception of the outlier, which is an underestimate.

And look, we have exactly that result. Now then, Kstat is set up to find outliers.

Go to "Model Analysis," then Residuals. "Std'ized" refers to the studentized residual.

This column measures how far the residual lies from 0, measured in standard deviations. So if your line passes through the center of the data, the residuals should average to 0 (positives and negatives cancel out).

The idea is that some of the residuals will be closer to 0 than others - one that's really far from 0 is an outlier. And if you look, the outlier is highlighted in red: 2.8 standard deviations above 0. The default cutoff they give you is for a 95% confidence interval, and our number here is 2.2622 - 95% of residuals should be within that many standard deviations.

So Kstat will highlight any residual that lies in the 5% tail region.
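
If you're not in KStat, statsmodels can produce the same studentized residuals and flag the big ones. A minimal sketch, with the Casarosa column names assumed for illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

df = pd.read_csv("wine_sales.csv")                   # assumed columns: Income, SalesC
model = smf.ols("SalesC ~ Income", data=df).fit()

influence = OLSInfluence(model)
studentized = influence.resid_studentized_internal   # residuals in standard-deviation units
leverage = influence.hat_matrix_diag                 # handy later for leverage points

flagged = df[abs(studentized) > 2]                   # rough cutoff, analogous to KStat's red rows
print(flagged)
```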

So moving on to the Casarosa example from the email, you run a regression of Sales v. Income and get Sales = 4.025 + 0.34(Income).

Lovely little R^2 too: 99.95%! But that's with the right information: what if it's the wrong info?

To check, go to Charts, Fitted Line Plots, and then plot against income. Almost perfect, as you can see. But what happens when there's bad info? Say you put in 8.41 as 841?

Totally tanked our R2! 7.01%. And our t-ratio is 1.32, with significance of 21.8%

Sales = -192.78 + 30.62 (income)


So... every dollar increase in income translates to an extra 30.62 in sales

And when you plot it on a fitted line plot, the regression line is way steeper than it should be. Note: when you saw the correct data, it looked like the data was upward sloping.

Now it looks flat. But it's not - it just looks that way because that outlier up at 841 messes with the scale.

Okay, that was a typo: always double check your data entry. But... what if the outlier wasn’t a typo? What if there really was that oddball day?

One way to minimize the impact of an outlier is to increase sample size

Well lookie there, the impact of the outlier has been substantially reduced from when it was one of 14. Bias is still there, but it's much smaller.

Note: we just copied and pasted the data several times- that’s not what you’d really do, obviously

Here’s what you really want to do though: unless it was the result of a typo, an outlier probably has a cause

If you can determine the cause, you can add the independent variable that was responsible for the outlier to your equation.

Add an outlier dummy variable!!! In this case, we're saying the cause is a huge Black Friday sale, so our dummy variable is BF = 1, non-BF = 0. Now we have the equation: Sales = 4.01 + 0.34(Income) + 832(BF). And look at the residuals: as you'd expect with an R^2 that high, the fitted values are essentially equal to the actual values, including for the BF observation.

Never, ever, ever omit the outliers from your data sets! Why? Because when you're doing empirical studies, you're testing a theory. So when you omit outliers, you're giving the impression that you rigged your results.

And look at this crazy plot. Of course you’d never get results like this in the real world, but look how this leverage point has a disproportionately large influence on the regression equation. It looks like there’s really no relation between income and sales, but the regression line, due to the leverage point, says otherwise.

But Kstat thinks life is just dandy - this point fits the line exactly, so A LEVERAGE POINT IS NOT AN OUTLIER

And yet clearly something ain’t right: so what do you do? Go to model analysis.


Now look at the leverage column - our leverage point is highlighted in RED. So instead of studentized, you look at leverage.

And KStat has identified that this point has a disproportionate effect on the regression line.

Now then, scatterplots like this are really unusual. Let's have another example: a store near ASU sells higher-end goods aimed at students. Plot income against sales.

So here we have the scatterplot. But this really doesn't make sense: the relationship between what students earn and spend is biased by the inclusion of all those students who make no money and don't shop at the store.

But... what if students are supported by their parents? Or earned money over the summer to spend: so summer income supports spending during the year. Income = $0; spending ≠ $0

So now you can't just get rid of the zero incomes... but those who have no income and still spend, spend at a different rate. So maybe you need two regression equations: one for lower incomes and one for higher; this matches two distinct market segments.

Now, this is an example of the threshold effect. For our purposes, know that this issue exists and that problems arise from it.

Sometimes, it's just a matter of exercising good judgment.

Income vs. Spending

A prime example of heteroskedasticity: people who make very little spend every bit they have (they have to), but people who make more money can spend or save with more discretion.

But check out that residual plot. Why do we care? If there's more than one independent variable, you have to look at the residual plot to spot the problem. We use the Breusch-Pagan test.

Run a regression of the general form Y = a + b1X1 + b2X2 + ... + bnXn

Then you take the residuals (yi – ŷi), square them, and run the regression µ² = a + c1X1 + c2X2 + ... + cnXn

Where µ² is the squared residuals. If there's no heteroskedasticity present, the slopes in the second equation will be zero - there's no correlation between the residuals and the independent variables. If there is such a relationship, at least one slope will differ from zero. Here is the hypothesis test:

Ho: c1 = c2 = ... = cn = 0
Ha: at least one ci ≠ 0

Using the R^2 from the residual regression, the Breusch-Pagan test statistic is calculated by a formula we don't have to work through by hand. We then test it against a chi-square statistic.

But KStat does it for you and reports the corresponding P-value.


Look at it in Model Analysis, look for the statistic and P-value, right next to each other

P-value: the odds of getting a BP statistic this large if there were no heteroskedasticity. A small P-value means strong evidence (and a large one weak evidence) to reject the notion that heteroskedasticity does not exist.

How do you fix heteroskedasticity? Easiest (and the only way we'll cover): convert the data into logarithms (semi-log or log-log), run the regression, examine the residuals for signs of heteroskedasticity, and perform a BP test.

If it fixed the problem, that's the equation to go with. If semi-log didn't fix it, try log-log. If THAT didn't fix it, then don't worry about it for our purposes.
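
Here's a minimal sketch of running a Breusch-Pagan test outside of KStat, using the income/spending idea above (the file and column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("spending.csv")           # assumed columns: Spending, Income

model = smf.ols("Spending ~ Income", data=df).fit()

# het_breuschpagan regresses the squared residuals on the explanatory variables for you.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"BP statistic = {lm_stat:.2f}, P-value = {lm_pvalue:.4f}")

# If the P-value is small, try the semi-log form and test again.
df["LnSpending"] = np.log(df["Spending"])
log_model = smf.ols("LnSpending ~ Income", data=df).fit()
print(het_breuschpagan(log_model.resid, log_model.model.exog))
```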

Ch.7: Dealing with Multicollinearity

So. Hotdog example. Interpret Pdub: -0.0007642. Market share was in terms of percent; price was in cents.

Each one cent increase in the price of a Dubuque hotdog causes its market share to fall by 0.076%

Logical enough: that's the demand curve. But let's evaluate this.

T-statistic: this number is 9.6042 standard deviations below 0. So the odds of getting this number if there were no relationship between price and market share are very, very low.

So our market share is clearly related to the price of an Oscar Meyer hotdog. Ditto for PBPReg:

POscar: 0.00026333. Every cent increase in Oscar Meyer's price causes Dubuque's market share to rise by 0.026%. And look at the t-statistic: 3.13 standard deviations above 0, P-value = 0.0%.

PBPReg: 0.000459698. So for every 1 cent increase in PBPReg, Dubuque gains 0.046% of market share.

T-stat = 5.88, P-val = 0.0%

But... look at R^2: 51.27%. Now we run the regression again, but swap out BPReg and put in all beef.

In terms of Dubuque's market share/price relationship, that stayed the same. Ditto for the Oscar Meyer price/our market share. And PBPBeef has a slope of 0.00040, so for every 1 cent increase in its price, our market share increases by 0.04%.

T-ratio = 5.7, significance = 0.0%


So our colleague who says OM is our only competitor is full of bologna (nyuck nyuck nyuck). Now run the regression one more time and include all the brands.

Well shoot. No impact on relation between our price and market share, OM price and our market share, but look what it did to our BP data!

Well, now it looks like we don't have sufficient information to say that there's a relationship between BP's prices and our market share. Our significance levels went to 29.7% and 72.8%. Useless!

Okay, now this is just funky: each BP hotdog, flying solo, clearly had a relation to our market share.

But together, no impact? Either our market share is influenced, or it's not. We have strong evidence that on their own they matter, but together they don't. So... what's the explanation? Go to Charts and plot Pdub (independent variable) against POscar (independent variable): not much of a correlation - maybe a positive correlation, but it's not strong. Now Oscar Meyer against BPReg:

Again, there appears to be some positive correlation, but it’s not strong

Now BPReg v. BPBeef:

Well, there's a strong correlation there. This is visual evidence of multicollinearity: strong correlation between independent variables.

In Ch. 5, we talked about perfect multicollinearity - when you have perfect multicollinearity, you can't estimate a regression equation at all. But here, the two variables are strongly correlated, though not perfectly.

So we can still estimate a regression equation, but it has a problem

How do you spot multicollinearity? Look at the correlation coefficient: a correlation coefficient of 1 indicates a perfect positive correlation between variables, -1 indicates a perfect negative correlation, and 0 means no correlation. So how do you do this? Go to KStat and click on Correlations.

So we have the correlation coefficients for our hotdog price, and it's positively correlated with all the other brands: as other brands' prices go up, so do ours. But look at the correlation coefficient relating the ball park regular price to the all beef price. As you can see, it's almost 1.
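
The same correlation table is easy to reproduce with pandas; a minimal sketch, with the hotdog column names assumed to match the ones used above:

```python
import pandas as pd

df = pd.read_csv("hotdog.csv")   # assumed columns: MKTDUB, PDUB, POSCAR, PBPREG, PBPBEEF

# Pairwise correlation coefficients between the independent variables.
corr = df[["PDUB", "POSCAR", "PBPREG", "PBPBEEF"]].corr()
print(corr.round(3))             # look for off-diagonal values close to +1 or -1
```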

Why does multicollinearity create problems? Our equation is: MKTDUB = a + b1(PDub) + b2(POscar) + b3(PBPReg) + b4(PBPBeef)

Differentiation: if we were to increase one of the variables by 1 unit, what would happen to our market share?


Partial derivatives: identify the change in the dependent variable resulting from a given change in the independent variable, holding all other values constant. So: the impact of a change in the price of a regular BP hotdog on Dub's market share, holding constant the prices of the other types of hotdogs.

But you can't hold the all beef and regular prices constant relative to each other, because the same company is in charge of both: they always move together. And since BPReg is highly correlated with BPBeef, regression can't see the independent effect that BPReg has on D's market share - and the same goes for BPBeef.

Why do we care? This case is based on the concern that if BP changes its price, it may affect D's market share. And our colleagues said that OM was the only competitor.

So... does D’s market share appear to be sensitive to price of BP hotdogs?

When the regression includes both types of BP hotdogs, the equation misleads D into thinking BP doesn't have an effect; it doesn't appear to matter when you look at that regression alone. But we know, from running individual regressions for each kind of BP hotdog, that there IS an influence.

Well, go back to the regression with just PBPReg. The standard error of the coefficient is 0.000078. But when you run it with both, the standard error of the coefficient increases to 0.00033: about four times bigger than in the other equation.

Multicollinearity inflates the size of the standard error of the coefficient. Why do we care?

t = (b̂ (sample slope) – population slope (the default assumption is 0)) / standard error of the coefficient (Sb̂). And if you have multicollinearity, you make the denominator bigger, decreasing the t-statistic. And if the t-statistic is too small, it makes the variable look less significant. You will be less likely to reject the null hypothesis when a relationship between the variables actually exists (a Type II error).

Type I error: the null hypothesis is true, but you reject it. Type II error: the null hypothesis is false, but you fail to reject it.

And we've already seen this: when only one BP hotdog was in the equation, its t-statistic showed it was significantly related to D's market share. And when both were in the equation, neither was shown to be related to D's market share.

So how do you know if the correlation is large enough to create a problem?

We use variance inflation factors. Variance Inflation Factor: indicates the amount by which multicollinearity inflates the variance of the coefficient; its square root tells you how much the standard error of the coefficient is inflated.

Suppose that the t-statistic is 0.5. So the idea is that there's no relation. But the t-statistic is (sample slope – 0)/standard error of the coefficient, and multicollinearity inflates that standard error.


And that causes your number to be artificially low. So it's misleading. How do you know if it's low enough to make a difference in your conclusion? That's the variance inflation factor.

So... going back to boot camp. Variance: Sum(each outcome – mean)² / number of outcomes.

Less popular because it's harder to interpret. It's the average squared deviation. Makes sense conceptually, but not much use in terms of application because it's a square.

Standard Deviation: square root of the variance. Much easier to wrap your head around intuitively

Why did we go back to that? Suppose the t-statistic is 0.5. If the variance inflation factor = 36, then the standard error of the coefficient has been inflated six times due to multicollinearity (because the standard deviation is the square root of the variance).

To find the variance inflation factors, click statistics, then model analysis.

We get: Pdub: 1.36 (sqrt ≈ 1.2). So our t-statistic is a bit lower than it should be - it should be about 1.2 times what it is. But we already said it's related, so this only strengthens our conclusion: since we already concluded the variable is related to the dependent variable, the depressed t-statistic changes nothing. POscar: 1.66 (sqrt ≈ 1.3). We've already concluded there's a relation, so the fact that the t-statistic is depressed by a factor of about 1.3 isn't too worrisome. PBPReg: 25.96 (sqrt ≈ 5).

So we concluded there's no relation, but the t-statistic should be about 5 times HIGHER than what we got. We got a t-ratio of 1.04, but it should be around 5.2! If that were the case, we'd say strong correlation!

BPBeef: 25.14 (sqrt ≈ 5). The t-statistic is 0.35, but it should be about 1.6 - so likely significant at the 10% level.
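
A minimal sketch of computing those variance inflation factors with statsmodels (again assuming the hotdog column names used above):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("hotdog.csv")   # assumed columns: MKTDUB, PDUB, POSCAR, PBPREG, PBPBEEF

X = sm.add_constant(df[["PDUB", "POSCAR", "PBPREG", "PBPBEEF"]])

# One VIF per independent variable (skip the constant column).
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, sqrt = {vif ** 0.5:.2f}")
```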

So what do we see? Multicollinearity has led us into failing to reject the null hypothesis when it was false!

So is multicollinearity always a problem? Well, it depends on what you’re trying to do.

What impact does multicollinearity have on the point estimate? (Bias would mean the estimates average out to something other than the population slope.) When you included only regular BP hotdogs, the regression gave a prediction without multicollinearity; the prediction with multicollinearity is essentially the same. So the problems are these:

T-statistics are depressed, which may mislead you into concluding variables are not related when, in fact, they are.


Slope coefficients for the independent variables that are correlated are not a reliable measure of how changes in that independent variable affect the dependent variable.

But... if two independent variables are highly correlated and have low t-statistics, how do you know if they're correlated with the dependent variable? We use what's called a "joint F-test."

Suppose PBPReg and PBPBeef have no explanatory value. BP prices have nothing to do with D’s market share.

Well if that’s true, then running a regression with them and without them will do equally well. That’s what the F-test does.

The test works off the residual sum of squares (RSS): Sum(yi – ŷ)², where yi is the actual observation for the dependent variable and ŷ is the predicted value. So say we had MDub = a + b1(PDub) + b2(POscar) and MDub = a + b1(PDub) + b2(POscar) + b3(PBPReg) + b4(PBPBeef).

And if PBPReg and PBPBeef have no explanatory value, then the RSS for both equations would be the same. If the equation without PBPReg and PBPBeef has less explanatory power than the full equation, the RSS without them will be larger than for the full equation.

So you set it up as a hypothesis test.

What does this mean? RSS_r = restricted, RSS_u = unrestricted. Normally, you have to consult an F-table to see if the F-statistic is large enough to cause you to reject the null hypothesis - but KStat reports the P-value associated with the statistic.

Analysis of variance: the left column is the restricted equation; the right is the unrestricted (check the boxes for the variables you're adding).

The difference in the sum of squares shows whether adding the new variables is significant. We get significance = 0.00003%, so that's very, very strong evidence that these two variables add to the explanatory power.

We still don't know on the individual level which variable causes what, but we CAN see that, collectively, the two variables make a difference. BP is a competitor. It's not enough to just run the restricted equation. And what if you had output that reported only the F-statistic?

Use the excel function F.DIST.RT: F-statistic goes in the X box.
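
For the same comparison outside of KStat/Excel, here is a minimal sketch of the joint F-test as an ANOVA of the restricted vs. unrestricted hotdog equations (column names as assumed earlier):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("hotdog.csv")   # assumed columns: MKTDUB, PDUB, POSCAR, PBPREG, PBPBEEF

restricted = smf.ols("MKTDUB ~ PDUB + POSCAR", data=df).fit()
unrestricted = smf.ols("MKTDUB ~ PDUB + POSCAR + PBPREG + PBPBEEF", data=df).fit()

# The ANOVA table compares the residual sums of squares of the two equations
# and reports the F-statistic and its P-value (the step F.DIST.RT does in Excel).
print(anova_lm(restricted, unrestricted))

# Equivalent shortcut: jointly test that both BP coefficients are zero.
print(unrestricted.f_test("PBPREG = 0, PBPBEEF = 0"))
```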

When do we NOT worry about multicollinearity? When you're looking for a point estimate. When you just want a control variable.


So say you want to estimate a demand equation that controls for the state of the economy

Your equation includes real GDP and real GSP. Now then, GDP and GSP may be highly correlated. You can't do anything about that, but in a recession you'll overestimate sales if you throw those variables out. In other words, they have an impact, whether you can do anything about it or not. But you're not interested in the coefficients for each variable; you just want to control for the state of the economy.

NOTE: if you're running a residual plot, do it against the same dependent variable.

Extrapolation

You don't want to extrapolate outside your sample data. You don't know if the pattern is going to continue - and so regression will give you a wider confidence interval.

Hidden Extrapolation
So you pick values within your data. But... even though you pick numbers within your range, you can still do hidden extrapolation if your prediction values never appear at the same time - it's as if you extrapolated outside the sample range. So. You've picked numbers that are all within your range: your standard error of the estimated mean, used to calculate your confidence interval, is 0.002444.

But if you change one input to something that's outside your sample range (and by that I mean moving one variable relative to another more than their usual link would suggest), the standard error of the estimated mean gets way bigger, making your confidence interval bigger.

Ch. 9: Because our brains aren’t full enough

Time Series Analysis and Forecasting
Time series analysis assumes your trend/pattern will continue. When you do that, you're implicitly assuming that sales are a function of time: whenever you assume that past trends will continue in the future, that's what you're saying.

Of course, time isn’t the cause and effect variableSo what we’ve done thus far is to use the real predictive variables (e.g., price of goods, competitors’ prices) so that we can plug them in for analysis

If you want to show where the trend is going in the future, create a time variable and number each period. Then you can plug it into KStat.

Run a regression and then predict using periods 13–16. But what if your data shows that the relationship isn't linear?

You're going to get a bad estimate if you use a straight-line regression to predict. So which model is better for fitting an equation? Compare fitted line plots on a chart.

R^2 is a good tool - but it depends.


If you're doing sales forecasting, you're forecasting into the near future, so you want smaller residuals for the more recent periods. Don't worry too much about the ancient residuals: look and see how well the model has done lately.

So our most recent residuals range from -14 to 41 for the quadratic. To compare to a semi-log model, you have to convert your numbers out of log format and back into units.

And the semi-log residuals go from -4 to 4.1. How do you eliminate heteroskedasticity?

Try running it as a semi-log.

Multicollinearity

When you have two variables that are highly correlated, regression can't figure out which one does what, so it splits the difference between the two variables - and that split is arbitrary. You can ask whether your variables are unnecessary. Multicollinearity means there's no way to trust those individual coefficients, and there's no way to fix it. But you can test whether they belong in there at all by doing a joint F-test: that tells you whether the variables add to the equation or add nothing.

EXAM REVIEW

Things you have to do

Plug variables into the equation - no Predict tab.
Calculate confidence intervals.

Point estimate plus or minus t-statistic x ? (the standard error).
Calculate the t-statistic:

(sample coefficient (slope) – hypothesized population slope)/standard error of the coefficient. BUT the t-statistic that KStat provides assumes your hypothesized slope is 0, so its t-statistic is really just your sample coefficient/standard error of the coefficient.

But if you have a different hypothesis, then that t-statistic is going to be different than what kstat reports

T-Table: if you're doing a 2-tailed test at 5% significance with 22 degrees of freedom, go down to 22, go over to 5%, and that's your critical t-value, above which you reject the null hypothesis and below which you fail to reject.

Ch. 8

So we plot advertising expense against sales and get this scatterplot: pretty ugly, n'est-ce pas?

So why don't we add a quadratic? Okay, so now we add a column for the quadratic and then click Charts > Fitted Line Plots, and plot the predicted values against EXP.

So much prettier. Semi-log: another way to deal with it is to convert the dependent variable into natural logs and estimate the equation in the form LN(Sales) = a + b(EXP).


How do you do that? Create a column that converts the dependent variable into logarithms by using =LN(SALES). Now run a regression on LN(Sales) and Expense.

And we get an equation. The coefficient says that a $1 increase in advertising expenditures leads to a 0.06 increase in the natural log of sales. But it ALSO means that a $1 increase in advertising expenditures leads to a 6% increase in sales.

Same thing, but the second one sounds a lot better in presentation

But when you have the dependent variable expressed as a percentage change, the relationship is nonlinear. And yet... it doesn't fit as well. So let's try log-log: create a log column for both the independent AND dependent variables.

And it does, in fact, look better. It finds a linear relation between the logs, even though the relationship in the original units is, by definition, nonlinear. And sure enough, this fits much, MUCH better than linear or semi-log, and even slightly better than the quadratic. Now, suppose you wanted to predict sales when advertising expenditure = $2,000.

Well, first of all, your input needs to be the log of $2,000. But second, the data was input into the spreadsheet in thousands. SO you'd actually plug in =LN(2). And we GET 2.816.

What does that mean? It means that the natural log of sales is 2.816. So how do you convert that back into sales?

Use =EXP(number), and we get 16.71864. BUT you plugged it in in thousands, so convert the result: $16,718.64. At $2,000 of advertising, predicted sales are about $16,718.
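
A minimal sketch of the log-log version of that calculation (the advertising and sales columns are assumptions, both in thousands, matching the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("advertising.csv")        # assumed columns: Sales, Exp (both in thousands)

df["LnSales"] = np.log(df["Sales"])
df["LnExp"] = np.log(df["Exp"])

loglog = smf.ols("LnSales ~ LnExp", data=df).fit()

# Predict sales at $2,000 of advertising: the input is ln(2) because the data is in thousands.
ln_sales_hat = loglog.params["Intercept"] + loglog.params["LnExp"] * np.log(2)
sales_hat = np.exp(ln_sales_hat)           # convert back out of log form (the =EXP() step)
print(f"Predicted sales: about ${sales_hat * 1000:,.0f}")
```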

Now, there’s this thing called Heteroskedasticity: when the variation in the dependent variable is correlated with the value of the independent variable

Suppose we surveyed firms and regressed workers' salaries against the number of years of experience.

So we would expect a positive correlation. But what's the variation in salaries for people with little experience vs. those with lots?

Less variation for those with little experience. Entry level is entry level.

But more variation as time goes on


So we'd expect to see something like this: the variation changes with the value of the independent variable. Kinda looks like you sneezed on the scatterplot. And what does regression do with that?

Good news is that heteroskedasticity does not result in biasBut why is it a problem?

Because regression reports a single number as the standard error of the coefficient, but there is no such single number: it's always changing! Why is this a problem? Because we rely on the standard error of the coefficient to establish a confidence interval for the population slope.

We also rely on the t-statistic to determine if X and Y are related.

It tells us whether variables are related or whether the evidence is weak. And the standard error of the coefficient is the denominator of the t-statistic.