practice+problems2_4031_f14
DESCRIPTION
Regression Practice problemsTRANSCRIPT
-
1
ISyE4031 Regression and Forecasting
Practice Problems 2
Fall 2014
1. In mobile ad hoc computer networks, messages must be forwarded from computer to computer
until they reach their destinations. The data overhead is the number of bytes of information that
must be transmitted, and it measures the success of a protocol. A regression study was
performed to predict the data overhead (kB) by using the independent variables average speed
(m/s), pause time (s), and link change rate (LCR, 100/s). The studied model was
2163253143322110 xxxxxxxxy
The regression equation is
Overhead = 406 1.93 Speed 0.387 Pause + 1.02 LCR 0.0618 Speed*LCR + 0.290 Pause*LCR + 0.0401 Speed**2
Predictor Coef SE Coef T P Significant?(Y/N) Keep? Why?
Constant 406.491 9.032 45.01 0.000
Speed -1.932 1.137 -1.70 0.107
Pause -0.3866 0.2202 -1.76 0.096
LCR 1.0153 0.9208 1.10 0.285
Speed*LCR -0.06182 0.02102 -2.94 0.009
Pause*LCR 0.28985 0.04259 6.81 0.000
Speed**2 0.04013 0.02056 1.95 0.067
a. By looking at the output and by considering the p values, state whether a predictor should be
kept in the model by using = 0.05, and why? Fill in the table.
b. What would the expected data overhead be if the average speed is 15 m/s, the pause time is 10
s, and the LCR is 20?
2. The weekly sales (in $1000 per week) for fast-food outlets in each of four cities were
collected. The objective is to model sales (y) as a function of traffic flow (in 1,000 of cars),
adjusting for city-to-city variations that might be due to size or other market conditions. The
linear regression model is therefore:
y = + x + C +C + C +, where x is traffic flow,
C1=
otherwise ,0
1city if ,1 , C2 =
otherwise ,0
2city if ,1 , and C3 =
otherwise ,0
3city if ,1,
and City 4 is the base level.
The following output was obtained.
SALES = 1.08 - 1.22 CITY_1 - 0.531 CITY_2 - 1.08 CITY_3 + 0.104 TRAFFIC
Predictor Coef SE Coef T P
Constant 1.0834 0.3210 3.37 0.003
CITY_1 -1.2158 0.2054 -5.92 0.000
CITY_2 -0.5308 0.2848 -1.86 0.078
CITY_3 -1.0765 0.2265 -4.75 0.000
TRAFFIC 0.103673 0.004094 25.32 0.000
S = 0.362307 R-Sq = 97.9% R-Sq(adj) = 97.5%
-
2
Analysis of Variance
Source DF SS MS F P
Regression 4 116.656 29.164 222.17 0.000
Residual Error 19 2.494 0.131
Total 23 119.150
Answer the following questions by using = 0.05.
a. Are the mean weekly sales identical for all four cities? Why? I.e., support your answer by
hypothesis testing and expected values. Use =0.05.
b. Does City 2 have more expected sales than City 4? State the hypothesis that you consider
explicitly.
c. Which city has the most expected sales? Explain by referring to the coefficients, hypothesis
testing, and expected values.
d. What is the expected sales in City 3 when the traffic flow is recorded as 60,000 cars?
e. Suppose that you modeled and solved the problem as a simple linear regression model:
y = + x + , where y: Sales and x: Traffic flow. In order to decide on the significance of the cities, compare this reduced model and the complete model by using a partial (nested) F test.
State the hypothesis explicitly.
SALES = 0.018 + 0.108 TRAFFIC
Analysis of Variance
Source DF SS MS F P
Regression 1 111.34 111.34 313.75 0.000
Residual Error 22 7.81 0.35
Total 23 119.15
3. Screening Techniques.
a. In a linear regression study five predictors are being evaluated by using a stepwise regression.
By considering the quantities given below, perform the first two steps of the stepwise regression,
i.e., write down the selected variable(s) in step1 and step 2. Assume entry= remove= 0.10.
Step1 Predictors t-stat for each Xj
p-value for each Xj
X1 11.21
0.000
X2 22.90
0.000
X3 2.75
0.015
X4 22.41
0.000
X5 10.71
0.000
Step2 Predictors t-stat for each Xj
p-value for each Xj
X1, X2 3.91,9.92
0.002,0.000
X1, X3 10.41,2.37
0.000,0.033
X1, X4 3.84,9.67
0.002,0.000
X1, X5 3.06,2.74
0.009,0.016
X2, X3 24.43,-3.40
0.000,0.004
Predictors
t-stat for each Xj
p-value for each Xj
X2, X4 0.72,-0.41
0.486,0.690
X2, X5 7.19,1.34
0.000,0.200
X3, X4 -3.31,23.86
0.005,0.000
X3, X5 2.02,9.48
0.063,0.000
X4, X5 6.97,1.39
0.000,0.250
Step 1:
Step 2:
b. Consider the best subset output given below. What is the best subset of variables? State your
reasons.
-
3
Response is y
Mallows x x x x x x
Vars R-Sq R-Sq(adj) Cp S 1 2 3 4 5 6
1 51.2 45.1 144.9 2496.8 X
1 47.6 41.1 156.0 2587.1 X
2 93.9 92.1 14.9 945.09 X X
2 79.6 73.8 59.1 1726.6 X X
3 97.2 95.9 6.6 686.37 X X X
3 95.8 93.7 11.1 848.68 X X X
4 98.8 97.9 3.7 492.34 X X X X
4 97.6 95.8 7.3 694.63 X X X X
5 99.0 97.7 5.2 510.32 X X X X X
5 98.8 97.3 5.7 549.94 X X X X X
6 99.0 97.1 7.0 574.97 X X X X X X
4. Residual analysis and diagnostics.
a. A linear regression model, y = + x + x + , was fitted to 24 heat treatment data points, and several test results for diagnostics statistics were produced. Consider the following
four of those 24 observations and state whether they are unusual or not. If an observation is
unusual, is it outlier, leverage, and/or influential? State your reasons explicitly by referring to the
diagnostics statistics.
Observation y SRES1 TRES1 HI1 COOK1
1 0.013 -1.47200 -1.50366 0.053974 0.04121
2 0.068 -3.17206 -3.33242 0.689127 3.48608
3 0.056 -0.12900 -0.12679 0.269057 1.02204
4 0.014 -2.87807 -2.90441 0.058732 0.11762
Observation #1:
Observation #2:
Observation #3:
Observation #4:
5. The data on annual sale revenues (in billions of dollars) of the Eastman Kodak Company over
a 25-year period were considered. The following time series plot depicts the actual annual
revenues (y) over the 25 years.
24222018161412108642
20.0
17.5
15.0
12.5
10.0
7.5
5.0
Index
Y
Time Series Plot of Y
-
4
A quadratic trend model, yt = + t + t2 + was studied and the following results were
obtained. The regression equation is
Y = 3.15 + 1.46 t - 0.0411 t^2
Predictor Coef SE Coef T P
Constant 3.152 1.137 2.77 0.011
t 1.4617 0.2194 6.66 0.000
t^2 -0.041122 0.008832 -4.66 0.000
S = 2.04889 R-Sq = 80.6% R-Sq(adj) = 78.9%
Analysis of Variance
Source DF SS MS F P
Regression 2 384.04 192.02 45.74 0.000
Residual Error 22 92.36 4.20
Total 24 476.39
a. Would you consider this model and the predictors useful (significant)? Support your answer
by looking at the p values, and the tests. Assume = 0.10.
b. Would you change your answer in part (a) when you saw the residual plots given below by
considering the assumptions on errors? In other words, check whether the results justify the
basic error term assumptions or not, i.e., i ~ i.i.d. Normal(0, 2). State the hypotheses that you
are testing and the residual plot that you are referring to explicitly, and use .
5.02.50.0-2.5-5.0
99
90
50
10
1
Residual
Pe
rce
nt
15105
4
2
0
-2
-4
Fitted Value
Re
sid
ua
l
43210-1-2-3
8
6
4
2
0
Residual
Fre
qu
en
cy
24222018161412108642
4
2
0
-2
-4
Observation Order
Re
sid
ua
l
Normal Probability Plot Versus Fits
Histogram Versus Order
Residual Plots for Y
A-D = 0.397, p-value = 0.343
Durbin-Watson statistic = 0.532552
- Each i has a normal distribution:
- Each i has an identical distribution:
- Each i is independent:
6. In a chemical experiment, the true relationship between yield (y) and reaction time (x) is
assumed to be: y = ee x21 .
-
5
a. First, apply a transformation to the equation so that a simple linear regression solution can be
found.
b. Then, consider the solution to the transformed model: *y = -2.3+0.6 x.
What are 1 , 2 , and the predicted value of y (i.e., y ) when x = 5?
7. Answer the following short-answer questions.
a. If the variance inflation factor (VIF) is not less than 1, we can say that there exists
multicollinearity between independent variables. True or False?
b. An unusual data point can be both an outlier and a high leverage point. True or False?
c. What can we detect when t-tests for all (or nearly all) parameters are non-significant
whereas the F-test for overall model adequacy (H0: 1==k = 0) is significant?
d. Suppose a fitted linear regression model is y = 15 + 2 x1 3x2 + 2
2x + 4 x1 x2. What is the
amount of change in the expected value of y for every one-unit increase in x1, holding x2 fixed
at 2?