practice+problems2_4031_f14

5
1 ISyE4031 Regression and Forecasting Practice Problems 2 Fall 2014 1. In mobile ad hoc computer networks, messages must be forwarded from computer to computer until they reach their destinations. The data overhead is the number of bytes of information that must be transmitted, and it measures the success of a protocol. A regression study was performed to predict the data overhead (kB) by using the independent variables average speed (m/s), pause time (s), and link change rate (LCR, 100/s). The studied model was 2 1 6 3 2 5 3 1 4 3 3 2 2 1 1 0 x x x x x x x x y The regression equation is Overhead = 406 1.93 Speed 0.387 Pause + 1.02 LCR ‒ 0.0618 Speed*LCR + 0.290 Pause*LCR + 0.0401 Speed**2 Predictor Coef SE Coef T P Significant?(Y/N) Keep? Why? Constant 406.491 9.032 45.01 0.000 Speed -1.932 1.137 -1.70 0.107 Pause -0.3866 0.2202 -1.76 0.096 LCR 1.0153 0.9208 1.10 0.285 Speed*LCR -0.06182 0.02102 -2.94 0.009 Pause*LCR 0.28985 0.04259 6.81 0.000 Speed**2 0.04013 0.02056 1.95 0.067 a. By looking at the output and by considering the p values, state whether a predictor should be kept in the model by using = 0.05, and why? Fill in the table. b. What would the expected data overhead be if the average speed is 15 m/s, the pause time is 10 s, and the LCR is 20? 2. The weekly sales (in $1000 per week) for fast-food outlets in each of four cities were collected. The objective is to model sales (y) as a function of traffic flow (in 1,000 of cars), adjusting for city-to-city variations that might be due to size or other market conditions. The linear regression model is therefore: y = + x + C + C + C +, where x is traffic flow, C 1 = otherwise , 0 1 city if , 1 , C 2 = otherwise , 0 2 city if , 1 , and C 3 = otherwise , 0 3 city if , 1 , and City 4 is the base level. The following output was obtained. SALES = 1.08 - 1.22 CITY_1 - 0.531 CITY_2 - 1.08 CITY_3 + 0.104 TRAFFIC Predictor Coef SE Coef T P Constant 1.0834 0.3210 3.37 0.003 CITY_1 -1.2158 0.2054 -5.92 0.000 CITY_2 -0.5308 0.2848 -1.86 0.078 CITY_3 -1.0765 0.2265 -4.75 0.000 TRAFFIC 0.103673 0.004094 25.32 0.000 S = 0.362307 R-Sq = 97.9% R-Sq(adj) = 97.5%

Upload: cthunder1

Post on 22-Sep-2015

220 views

Category:

Documents


1 download

DESCRIPTION

Regression Practice problems

TRANSCRIPT

  • 1

    ISyE4031 Regression and Forecasting

    Practice Problems 2

    Fall 2014

    1. In mobile ad hoc computer networks, messages must be forwarded from computer to computer

    until they reach their destinations. The data overhead is the number of bytes of information that

    must be transmitted, and it measures the success of a protocol. A regression study was

    performed to predict the data overhead (kB) by using the independent variables average speed

    (m/s), pause time (s), and link change rate (LCR, 100/s). The studied model was

    2163253143322110 xxxxxxxxy

    The regression equation is

    Overhead = 406 1.93 Speed 0.387 Pause + 1.02 LCR 0.0618 Speed*LCR + 0.290 Pause*LCR + 0.0401 Speed**2

    Predictor Coef SE Coef T P Significant?(Y/N) Keep? Why?

    Constant 406.491 9.032 45.01 0.000

    Speed -1.932 1.137 -1.70 0.107

    Pause -0.3866 0.2202 -1.76 0.096

    LCR 1.0153 0.9208 1.10 0.285

    Speed*LCR -0.06182 0.02102 -2.94 0.009

    Pause*LCR 0.28985 0.04259 6.81 0.000

    Speed**2 0.04013 0.02056 1.95 0.067

    a. By looking at the output and by considering the p values, state whether a predictor should be

    kept in the model by using = 0.05, and why? Fill in the table.

    b. What would the expected data overhead be if the average speed is 15 m/s, the pause time is 10

    s, and the LCR is 20?

    2. The weekly sales (in $1000 per week) for fast-food outlets in each of four cities were

    collected. The objective is to model sales (y) as a function of traffic flow (in 1,000 of cars),

    adjusting for city-to-city variations that might be due to size or other market conditions. The

    linear regression model is therefore:

    y = + x + C +C + C +, where x is traffic flow,

    C1=

    otherwise ,0

    1city if ,1 , C2 =

    otherwise ,0

    2city if ,1 , and C3 =

    otherwise ,0

    3city if ,1,

    and City 4 is the base level.

    The following output was obtained.

    SALES = 1.08 - 1.22 CITY_1 - 0.531 CITY_2 - 1.08 CITY_3 + 0.104 TRAFFIC

    Predictor Coef SE Coef T P

    Constant 1.0834 0.3210 3.37 0.003

    CITY_1 -1.2158 0.2054 -5.92 0.000

    CITY_2 -0.5308 0.2848 -1.86 0.078

    CITY_3 -1.0765 0.2265 -4.75 0.000

    TRAFFIC 0.103673 0.004094 25.32 0.000

    S = 0.362307 R-Sq = 97.9% R-Sq(adj) = 97.5%

  • 2

    Analysis of Variance

    Source DF SS MS F P

    Regression 4 116.656 29.164 222.17 0.000

    Residual Error 19 2.494 0.131

    Total 23 119.150

    Answer the following questions by using = 0.05.

    a. Are the mean weekly sales identical for all four cities? Why? I.e., support your answer by

    hypothesis testing and expected values. Use =0.05.

    b. Does City 2 have more expected sales than City 4? State the hypothesis that you consider

    explicitly.

    c. Which city has the most expected sales? Explain by referring to the coefficients, hypothesis

    testing, and expected values.

    d. What is the expected sales in City 3 when the traffic flow is recorded as 60,000 cars?

    e. Suppose that you modeled and solved the problem as a simple linear regression model:

    y = + x + , where y: Sales and x: Traffic flow. In order to decide on the significance of the cities, compare this reduced model and the complete model by using a partial (nested) F test.

    State the hypothesis explicitly.

    SALES = 0.018 + 0.108 TRAFFIC

    Analysis of Variance

    Source DF SS MS F P

    Regression 1 111.34 111.34 313.75 0.000

    Residual Error 22 7.81 0.35

    Total 23 119.15

    3. Screening Techniques.

    a. In a linear regression study five predictors are being evaluated by using a stepwise regression.

    By considering the quantities given below, perform the first two steps of the stepwise regression,

    i.e., write down the selected variable(s) in step1 and step 2. Assume entry= remove= 0.10.

    Step1 Predictors t-stat for each Xj

    p-value for each Xj

    X1 11.21

    0.000

    X2 22.90

    0.000

    X3 2.75

    0.015

    X4 22.41

    0.000

    X5 10.71

    0.000

    Step2 Predictors t-stat for each Xj

    p-value for each Xj

    X1, X2 3.91,9.92

    0.002,0.000

    X1, X3 10.41,2.37

    0.000,0.033

    X1, X4 3.84,9.67

    0.002,0.000

    X1, X5 3.06,2.74

    0.009,0.016

    X2, X3 24.43,-3.40

    0.000,0.004

    Predictors

    t-stat for each Xj

    p-value for each Xj

    X2, X4 0.72,-0.41

    0.486,0.690

    X2, X5 7.19,1.34

    0.000,0.200

    X3, X4 -3.31,23.86

    0.005,0.000

    X3, X5 2.02,9.48

    0.063,0.000

    X4, X5 6.97,1.39

    0.000,0.250

    Step 1:

    Step 2:

    b. Consider the best subset output given below. What is the best subset of variables? State your

    reasons.

  • 3

    Response is y

    Mallows x x x x x x

    Vars R-Sq R-Sq(adj) Cp S 1 2 3 4 5 6

    1 51.2 45.1 144.9 2496.8 X

    1 47.6 41.1 156.0 2587.1 X

    2 93.9 92.1 14.9 945.09 X X

    2 79.6 73.8 59.1 1726.6 X X

    3 97.2 95.9 6.6 686.37 X X X

    3 95.8 93.7 11.1 848.68 X X X

    4 98.8 97.9 3.7 492.34 X X X X

    4 97.6 95.8 7.3 694.63 X X X X

    5 99.0 97.7 5.2 510.32 X X X X X

    5 98.8 97.3 5.7 549.94 X X X X X

    6 99.0 97.1 7.0 574.97 X X X X X X

    4. Residual analysis and diagnostics.

    a. A linear regression model, y = + x + x + , was fitted to 24 heat treatment data points, and several test results for diagnostics statistics were produced. Consider the following

    four of those 24 observations and state whether they are unusual or not. If an observation is

    unusual, is it outlier, leverage, and/or influential? State your reasons explicitly by referring to the

    diagnostics statistics.

    Observation y SRES1 TRES1 HI1 COOK1

    1 0.013 -1.47200 -1.50366 0.053974 0.04121

    2 0.068 -3.17206 -3.33242 0.689127 3.48608

    3 0.056 -0.12900 -0.12679 0.269057 1.02204

    4 0.014 -2.87807 -2.90441 0.058732 0.11762

    Observation #1:

    Observation #2:

    Observation #3:

    Observation #4:

    5. The data on annual sale revenues (in billions of dollars) of the Eastman Kodak Company over

    a 25-year period were considered. The following time series plot depicts the actual annual

    revenues (y) over the 25 years.

    24222018161412108642

    20.0

    17.5

    15.0

    12.5

    10.0

    7.5

    5.0

    Index

    Y

    Time Series Plot of Y

  • 4

    A quadratic trend model, yt = + t + t2 + was studied and the following results were

    obtained. The regression equation is

    Y = 3.15 + 1.46 t - 0.0411 t^2

    Predictor Coef SE Coef T P

    Constant 3.152 1.137 2.77 0.011

    t 1.4617 0.2194 6.66 0.000

    t^2 -0.041122 0.008832 -4.66 0.000

    S = 2.04889 R-Sq = 80.6% R-Sq(adj) = 78.9%

    Analysis of Variance

    Source DF SS MS F P

    Regression 2 384.04 192.02 45.74 0.000

    Residual Error 22 92.36 4.20

    Total 24 476.39

    a. Would you consider this model and the predictors useful (significant)? Support your answer

    by looking at the p values, and the tests. Assume = 0.10.

    b. Would you change your answer in part (a) when you saw the residual plots given below by

    considering the assumptions on errors? In other words, check whether the results justify the

    basic error term assumptions or not, i.e., i ~ i.i.d. Normal(0, 2). State the hypotheses that you

    are testing and the residual plot that you are referring to explicitly, and use .

    5.02.50.0-2.5-5.0

    99

    90

    50

    10

    1

    Residual

    Pe

    rce

    nt

    15105

    4

    2

    0

    -2

    -4

    Fitted Value

    Re

    sid

    ua

    l

    43210-1-2-3

    8

    6

    4

    2

    0

    Residual

    Fre

    qu

    en

    cy

    24222018161412108642

    4

    2

    0

    -2

    -4

    Observation Order

    Re

    sid

    ua

    l

    Normal Probability Plot Versus Fits

    Histogram Versus Order

    Residual Plots for Y

    A-D = 0.397, p-value = 0.343

    Durbin-Watson statistic = 0.532552

    - Each i has a normal distribution:

    - Each i has an identical distribution:

    - Each i is independent:

    6. In a chemical experiment, the true relationship between yield (y) and reaction time (x) is

    assumed to be: y = ee x21 .

  • 5

    a. First, apply a transformation to the equation so that a simple linear regression solution can be

    found.

    b. Then, consider the solution to the transformed model: *y = -2.3+0.6 x.

    What are 1 , 2 , and the predicted value of y (i.e., y ) when x = 5?

    7. Answer the following short-answer questions.

    a. If the variance inflation factor (VIF) is not less than 1, we can say that there exists

    multicollinearity between independent variables. True or False?

    b. An unusual data point can be both an outlier and a high leverage point. True or False?

    c. What can we detect when t-tests for all (or nearly all) parameters are non-significant

    whereas the F-test for overall model adequacy (H0: 1==k = 0) is significant?

    d. Suppose a fitted linear regression model is y = 15 + 2 x1 3x2 + 2

    2x + 4 x1 x2. What is the

    amount of change in the expected value of y for every one-unit increase in x1, holding x2 fixed

    at 2?