TV Watching Time Project

Surajit Basak

4/16/2015

Upload: surajit-basak

Post on 21-Dec-2015

TV Watching Time Project for statistics and probability.

Contents

Introduction
Problem Statement
List of Technical Tasks
Data Description
    Qualitative Variables
    Quantitative Variables
Analysis
Conclusions and Recommendations
Appendix
    Regression after outlier removal-1
    Regression after outlier removal-2

Introduction

This regression class was very interesting; it taught us more about useful statistical techniques, and about regression analysis in particular. Regression is one of the most important statistical tools used in the professional world. The fact that we can predict future values, or another variable, from data already collected in the real world is what makes regression modeling so vital and useful. We therefore wanted to build a regression model on data we collected ourselves, so that we could apply the material we learned to real-life data and strengthen our knowledge through practice.

Regression analysis can be used in many real-life situations, so finding suitable data is not a problem. As a first step, we looked at a few frequent activities in our daily lives, because activities that are done regularly (if not every day) make data easy to collect. As a second criterion, the result of the regression model should be useful in daily life. From these two considerations, we chose TV watching as the topic. Nowadays most people watch TV, and we can easily profile personal data for each person.

Moreover, TV watching time combined with general personal data is valuable to broadcasting companies, TV manufacturers and advertising companies. If we obtain a significant regression model, these companies can use the result to target viewers based on the specific factors they need. For example, if people aged 40 to 45 with an income of $30,000 to $35,000 have the highest TV watching time, advertising companies should focus on the products this group wants.

We thought this would be a useful exercise, since we learn by applying a regression model to real-world data.

Problem Statement

Although a high proportion of people watch television, some do not, and viewing time and habits depend mostly on personal factors such as the kind of programs people like and how much leisure time they have. We therefore set out to capture data from regular TV watchers, who can plausibly affect a company's profit. A statistical understanding of how many hours people spend watching television will help companies develop an advertising strategy.

In this report, we consider various factors that can affect people's TV watching time: gender, race, employment, spouse presence, cable availability, years of education, number of children, income, and hours spent on leisure. These factors are the independent variables in our regression model, which forecasts the hours spent watching TV. Our overall objective is to show how TV watching time differs with certain significant factors, so that companies can later align their advertising with these factors to increase their profit margin. For example, if we find that women tend to watch more TV than men, advertisers can direct more advertising at women than at men, which may increase their profit margin.

List of Technical Tasks

Our target is to find a suitable regression model to predict TV watching time from the independent variables.

A regression model rests on several assumptions that must hold for the model to be considered valid, so we also need to check those assumptions and make corrections if necessary.

I will start with scatter plots, which show whether there is any linear relationship between the dependent and independent variables.

Next I will run the regression analysis and check whether there are any outliers in the data. If outliers are found, I will remove them until the data has no significant outliers. This takes care of another assumption of regression analysis.

Then I will select a subset model of the significant variables from the full model.

All the other assumptions will be checked on this model to see whether they are satisfied.

Data Description

Rather than taking data from online or other sources, our group decided to collect the data in person, since we wanted our own data (which we considered more accurate) rather than data collected by others. Group members went to several different locations, such as Georgia State, Atlantic Station, and Coca-Cola, and selected people at random; by selecting random people we tried to reduce data collection and sampling bias. We recorded several qualitative and quantitative variables. Below are the independent variables, which we chose because we expected them to be important in affecting TV watching time.

Qualitative Variables:

Gender: 1= Male and 0=Female (Qualitative variable with Nominal Scale)

Asian: 1= Asian and 0 = Non-Asian (Qualitative variable with Nominal Scale)

Caucasian: 1= Caucasian and 0 = Non-Caucasian (Qualitative variable with Nominal Scale)

African-American: 1= African-American and 0 = Non-African-American (Qualitative variable with Nominal Scale)

Employment: 1= Viewer has a job, 0 = he/she does not (Qualitative variable with Nominal Scale)

Spouse: 1= Viewer is married, 0 = he/she is not (Qualitative variable with Nominal Scale)

Cable TV: 1= Viewer has cable connection, 0 = he/she does not (Qualitative variable with Nominal Scale)

Education: Measurement of viewer's education level. (1 ~ High School Diploma, 2 ~ College, 3 ~ Graduate school) (Qualitative variable with Ordinal Scale)

Quantitative Variables:

Age: Quantitative measurement of viewer's age (Quantitative variable with Ratio Scale)

Children: Quantitative measurement of viewer's number of children (Quantitative variable with Ratio Scale)

Income: Quantitative measurement of viewer's income (Income range) (Quantitative variable with Ratio Scale)

Leisure: Quantitative measurement of viewer's time spent on leisure (Hour spent on leisure weekly) (Quantitative variable with Ratio Scale)

Our dependent variable is:

Hours: Hours spent on watching TV weekly (Quantitative variable with Ratio Scale)
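To make the coding scheme concrete, here is a minimal Python sketch that turns one survey answer into the numbers used in the regression. The raw field names and the `encode_respondent` helper are hypothetical (the report does not say how the responses were entered); only the 0/1 and 1 to 3 codes come from the lists above. A respondent whose race is none of the three listed gets 0 on all three dummies, which makes that group the baseline.

```python
# Hypothetical helper: encode one survey answer using the codes listed above.
EDUCATION_CODES = {"high school": 1, "college": 2, "graduate": 3}
RACE_DUMMIES = ("asian", "caucasian", "african-american")


def encode_respondent(answer):
    """Return the regression inputs for one respondent.

    `answer` is an assumed dict of raw survey fields; the keys used here
    are illustrative, not taken from the original questionnaire.
    """
    row = {
        "GENDER": 1 if answer["gender"] == "male" else 0,
        "EMPLOYED": 1 if answer["has_job"] else 0,
        "SPOUSE": 1 if answer["married"] else 0,
        "CABLE": 1 if answer["has_cable"] else 0,
        "EDUCATION": EDUCATION_CODES[answer["education"]],
        "AGE": answer["age"],
        "CHILDREN": answer["children"],
        "INCOME": answer["income"],
        "LEISURE": answer["leisure_hours"],
    }
    # One 0/1 dummy per listed race; all zeros means "none of the three".
    for race in RACE_DUMMIES:
        column = race.upper().replace("-", " ")
        row[column] = 1 if answer["race"] == race else 0
    return row
```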

Analysis:

There are several assumptions that we need to check when performing regression analysis. Because the regression model depends on these assumptions, violating them may produce a regression equation that is not useful at all.

Before proceeding to that analysis, however, we need to check the relationship between the dependent and independent variables. The best way to do this is to look at the correlation matrix or at scatter plots.

Scatter plots are meaningful only for the quantitative variables; the resulting plots are given below.

[Figure: Scatterplot of HOURS vs AGE (x-axis: AGE, y-axis: HOURS)]

[Figure: Scatterplot of HOURS vs CHILDREN (x-axis: CHILDREN, y-axis: HOURS)]

[Figure: Scatterplot of HOURS vs INCOME (x-axis: INCOME, y-axis: HOURS)]

[Figure: Scatterplot of HOURS vs LEISURE (x-axis: LEISURE, y-axis: HOURS)]

The above plots show no clear relationship between the independent variables and the dependent variable. We can see only some support for a negative relationship between Hours and Income. Let us look at the correlation matrix for more information.

The obtained correlation matrix is given below.

Correlation: HOURS, AGE, CHILDREN, INCOME, LEISURE

            HOURS     AGE  CHILDREN  INCOME
AGE         0.037
            0.601

CHILDREN    0.000   0.038
            0.997   0.594

INCOME     -0.375   0.011     0.124
            0.000   0.877     0.079

LEISURE    -0.123  -0.115     0.007   0.204
            0.082   0.104     0.919   0.004

Cell Contents: Pearson correlation
               P-Value
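Each cell in the matrix pairs a Pearson correlation with a P-value. As a minimal sketch of where those numbers come from, the Python below computes r from raw data and the t statistic used to test whether the correlation is zero. The function names are illustrative, and the sample size n = 200 is an assumption inferred from the total degrees of freedom of the full regression; turning t into a P-value needs the t distribution with n - 2 degrees of freedom, which is not included here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_stat(r, n):
    """t statistic (n - 2 degrees of freedom) for testing correlation = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Correlations with HOURS copied from the matrix; n = 200 is assumed.
t_income = correlation_t_stat(-0.375, 200)   # far from zero
t_leisure = correlation_t_stat(-0.123, 200)  # much closer to zero
```

The much larger magnitude of `t_income` is consistent with INCOME being the only quantitative variable whose correlation with HOURS has a P-value below 0.05.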

From the above result it is clear that only Income has a significant correlation with Hours (r = -0.375, P-value 0.000). Although the result suggests eliminating the variables that are not significantly correlated with the dependent variable Hours, we also have many qualitative dummy variables, so I am keeping these “insignificant” variables in my regression.

Before starting to analyze the data, we need to check the assumptions of regression analysis:

i) Linear relationship

ii) Normality

iii) No or little multicollinearity

iv) Homoscedasticity

v) No significant outliers in the model

vi) No serial correlation in the model

The first assumption has already been examined through the scatter plots.

All the other assumptions could also be checked before the regression analysis, but because we may need to select a subset of variables, I am deferring those checks to a later part.

Considering all the independent variables, the full regression model output is given below.

Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN, AFRICAN AMER, EMPLOYED, ...

Analysis of Variance

Source              DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression          12  342.100        54.39%  342.100   28.508    18.58    0.000
  GENDER             1    0.126         0.02%    0.627    0.627     0.41    0.524
  ASIAN              1    0.004         0.00%    0.458    0.458     0.30    0.585
  CAUCASIAN          1    1.176         0.19%    2.751    2.751     1.79    0.182
  AFRICAN AMERICAN   1    1.754         0.28%    0.255    0.255     0.17    0.684
  EMPLOYED           1  268.434        42.68%  227.611  227.611   148.36    0.000
  SPOUSE             1    0.220         0.03%    4.661    4.661     3.04    0.083
  CABLE              1   27.197         4.32%   28.553   28.553    18.61    0.000
  AGE                1    0.441         0.07%    1.435    1.435     0.94    0.335
  EDUCATION          1   18.581         2.95%    5.963    5.963     3.89    0.050
  CHILDREN           1    2.483         0.39%    4.527    4.527     2.95    0.088
  INCOME             1   20.832         3.31%   18.628   18.628    12.14    0.001
  LEISURE            1    0.852         0.14%    0.852    0.852     0.56    0.457
Error              187  286.900        45.61%  286.900    1.534
  Lack-of-Fit      186  286.400        45.53%  286.400    1.540     3.08    0.431
  Pure Error         1    0.500         0.08%    0.500    0.500
Total              199  629.000       100.00%

Model Summary

S        R-sq    R-sq(adj)  PRESS    R-sq(pred)
1.23864  54.39%  51.46%     329.350  47.64%

Coefficients

Term                   Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant              5.944     0.502  (4.954, 6.934)            11.85    0.000
GENDER               -0.116     0.181  (-0.473, 0.242)           -0.64    0.524  1.06
ASIAN                 0.162     0.296  (-0.422, 0.745)            0.55    0.585  1.86
CAUCASIAN             0.372     0.278  (-0.176, 0.919)            1.34    0.182  2.09
AFRICAN AMERICAN      0.111     0.272  (-0.426, 0.648)            0.41    0.684  2.16
EMPLOYED             -3.050     0.250  (-3.544, -2.556)         -12.18    0.000  1.13
SPOUSE               -0.397     0.228  (-0.846, 0.052)           -1.74    0.083  1.69
CABLE                 0.793     0.184  (0.431, 1.156)             4.31    0.000  1.07
AGE                 0.00898   0.00929  (-0.00934, 0.02730)        0.97    0.335  1.12
EDUCATION            -0.263     0.133  (-0.526, 0.000)           -1.97    0.050  1.27
CHILDREN              0.190     0.111  (-0.028, 0.409)            1.72    0.088  1.66
INCOME            -0.000021  0.000006  (-0.000032, -0.000009)    -3.48    0.001  1.30
LEISURE             -0.0294    0.0395  (-0.1074, 0.0485)         -0.75    0.457  1.11

Regression Equation

HOURS = 5.944 - 0.116 GENDER + 0.162 ASIAN + 0.372 CAUCASIAN + 0.111 AFRICAN AMERICAN - 3.050 EMPLOYED - 0.397 SPOUSE + 0.793 CABLE + 0.00898 AGE - 0.263 EDUCATION + 0.190 CHILDREN - 0.000021 INCOME - 0.0294 LEISURE

Fits and Diagnostics for Unusual Observations

Obs  HOURS    Fit  SE Fit  95% CI            Resid  Std Resid  Del Resid         HI  Cook's D     DFITS
  3  5.000  2.021   0.318  (1.394, 2.648)    2.979       2.49       2.52  0.0659293      0.03   0.67055  R
 10  9.000  6.545   0.379  (5.797, 7.292)    2.455       2.08       2.10  0.0937056      0.03   0.67568  R
 11  9.500  2.964   0.291  (2.390, 3.538)    6.536       5.43       5.90  0.0552064      0.13   1.42586  R
 15  8.500  5.766   0.339  (5.096, 6.435)    2.734       2.30       2.32  0.0750477      0.03   0.66147  R
 62  9.500  6.274   0.333  (5.617, 6.930)    3.226       2.70       2.75  0.0720839      0.04   0.76682  R
 71  6.000  2.831   0.301  (2.237, 3.424)    3.169       2.64       2.68  0.0590371      0.03   0.67157  R
 83  7.000  2.951   0.331  (2.299, 3.603)    4.049       3.39       3.49  0.0712014      0.07   0.96693  R
 96  2.000  5.014   0.331  (4.361, 5.667)   -3.014      -2.53      -2.56  0.0713602      0.04  -0.71031  R
 99  3.000  5.445   0.310  (4.833, 6.057)   -2.445      -2.04      -2.06  0.0627761      0.02  -0.53224  R
116  8.000  5.450   0.364  (4.732, 6.169)    2.550       2.15       2.18  0.0865024      0.03   0.66936  R
120  5.500  2.874   0.297  (2.288, 3.461)    2.626       2.18       2.21  0.0576223      0.02   0.54557  R
163  3.500  6.153   0.341  (5.481, 6.824)   -2.653      -2.23      -2.25  0.0755750      0.03  -0.64377  R
188  3.000  5.362   0.391  (4.591, 6.134)   -2.362      -2.01      -2.03  0.0996398      0.03  -0.67419  R
192  7.500  2.939   0.258  (2.431, 3.447)    4.561       3.76       3.91  0.0432661      0.05   0.83049  R

R  Large residual

Durbin-Watson Statistic

Durbin-Watson Statistic = 1.76295
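The influence columns in the unusual-observations table (Cook's D and DFITS) can be recovered from the residual and leverage (HI) columns. The following is a minimal Python sketch using the standard textbook formulas, not anything specific to the software that produced the output; the function names are illustrative. It is applied to observation 11, the most extreme point, with 13 estimated coefficients (12 predictors plus the constant).

```python
import math


def dfits(deleted_residual, leverage):
    """DFITS from the deleted (externally studentized) residual and leverage."""
    return deleted_residual * math.sqrt(leverage / (1 - leverage))


def cooks_distance(std_residual, leverage, n_coefficients):
    """Cook's D from the standardized residual, leverage and coefficient count."""
    return (std_residual ** 2 / n_coefficients) * leverage / (1 - leverage)


# Observation 11 from the full-model output: Std Resid 5.43, Del Resid 5.90,
# HI 0.0552064. The model has 12 predictors plus a constant.
obs11_dfits = dfits(5.90, 0.0552064)
obs11_cooks = cooks_distance(5.43, 0.0552064, 13)
```

Both values agree with the printed ones (DFITS 1.42586, Cook's D 0.13) up to the rounding of the residuals in the table.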

From the above output we can clearly see that many variables are insignificant in the model. Moreover, although the overall regression model is significant, the R-square value is 54.39%, implying that only 54.39% of the variation is explained by the regression model.

Because many variables are insignificant, we should select a model that drops them. But there are many outliers in the data, which can cause some variables to appear insignificant, so I am removing these outliers first.

For a normal distribution, 95% and 99.73% of the values fall within 2 and 3 standard deviations of the mean, respectively. So let's remove all data points whose standardized residual is greater than +2 or less than -2. The output obtained after deleting them and rerunning the regression model is given in Appendix: Regression after outlier removal-1.

We can still see some outliers falling outside the 2 standard deviation interval. By continuing to delete those data points and rerunning the model, we reached the point where no standardized residual lies outside the 3 standard deviation interval.
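The refit-and-delete loop described above can be sketched in Python as follows. This is a simplified illustration, not the procedure actually run for the report: it uses a single predictor instead of the twelve in the full model, and the function names, the `cutoff` default and the `max_rounds` guard are all assumptions made for the example.

```python
import math


def standardized_residuals(xs, ys):
    """Standardized residuals of a one-predictor least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))  # residual std. error
    leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    return [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]


def drop_outliers(xs, ys, cutoff=3.0, max_rounds=10):
    """Refit and drop points with |standardized residual| > cutoff.

    Stops as soon as a fit has no point beyond the cutoff, or after
    max_rounds refits, whichever comes first.
    """
    xs, ys = list(xs), list(ys)
    for _ in range(max_rounds):
        std = standardized_residuals(xs, ys)
        keep = [abs(r) <= cutoff for r in std]
        if all(keep):
            break
        xs = [x for x, k in zip(xs, keep) if k]
        ys = [y for y, k in zip(ys, keep) if k]
    return xs, ys
```

Calling `drop_outliers(xs, ys, cutoff=2.0)` corresponds to the stricter first pass, while the default cutoff of 3 corresponds to the stopping rule in the last paragraph. Each round refits the model, so a point that looked acceptable before can cross the cutoff once more extreme points are gone.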

Since there is roughly a 5% chance that a standardized residual falls outside the 2 standard deviation interval, I am keeping this dataset and running the stepwise selection method with alpha to enter of 0.05 and alpha to remove of 0.15.

The obtained model is given below. All the intermediate regression output is given in the appendix, numbered accordingly.

Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN, AFRICAN AMER, EMPLOYED, ...

Stepwise Selection of Terms

α to enter = 0.05, α to remove = 0.15

Analysis of Variance

Source              DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression           6  184.481        75.48%  184.481   30.747    85.66    0.000
  EMPLOYED           1  162.047        66.30%  142.103  142.103   395.88    0.000
  SPOUSE             1    0.075         0.03%    1.551    1.551     4.32    0.039
  CABLE              1   11.339         4.64%   12.425   12.425    34.61    0.000
  CHILDREN           1    2.968         1.21%    4.224    4.224    11.77    0.001
  INCOME             1    6.257         2.56%    4.870    4.870    13.57    0.000
  LEISURE            1    1.795         0.73%    1.795    1.795     5.00    0.027
Error              167   59.945        24.52%   59.945    0.359
  Lack-of-Fit      166   59.445        24.32%   59.445    0.358     0.72    0.761
  Pure Error         1    0.500         0.20%    0.500    0.500
Total              173  244.425       100.00%

Model Summary

S         R-sq    R-sq(adj)  PRESS    R-sq(pred)
0.599125  75.48%  74.59%     66.7525  72.69%

Coefficients

Term                   Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant              5.572     0.196  (5.186, 5.959)            28.47    0.000
EMPLOYED             -3.193     0.160  (-3.509, -2.876)         -19.90    0.000  1.10
SPOUSE               -0.242     0.116  (-0.472, -0.012)          -2.08    0.039  1.64
CABLE                0.5500    0.0935  (0.3654, 0.7345)           5.88    0.000  1.03
CHILDREN             0.1946    0.0567  (0.0826, 0.3066)           3.43    0.001  1.67
INCOME            -0.000011  0.000003  (-0.000017, -0.000005)    -3.68    0.000  1.13
LEISURE             -0.0450    0.0201  (-0.0847, -0.0053)        -2.24    0.027  1.04

Regression Equation

HOURS = 5.572 - 3.193 EMPLOYED - 0.242 SPOUSE + 0.5500 CABLE + 0.1946 CHILDREN - 0.000011 INCOME - 0.0450 LEISURE

Fits and Diagnostics for Unusual Observations

Obs  HOURS    Fit  SE Fit  95% CI            Resid  Std Resid  Del Resid        HI  Cook's D      DFITS
  1  0.500  1.715   0.088  (1.541, 1.888)   -1.215      -2.05      -2.07  0.021474      0.01  -0.306549  R
  8  0.500  2.082   0.090  (1.904, 2.260)   -1.582      -2.67      -2.72  0.022635      0.02  -0.414255  R
 16  0.000  1.247   0.145  (0.960, 1.534)   -1.247      -2.15      -2.17  0.058811      0.04  -0.542294  R
 21  7.000  5.823   0.173  (5.481, 6.165)    1.177       2.05       2.07  0.083517      0.05   0.625467  R
 29  0.000  1.466   0.128  (1.214, 1.718)   -1.466      -2.50      -2.55  0.045412      0.04  -0.555239  R
 47  0.000  1.379   0.118  (1.147, 1.612)   -1.379      -2.35      -2.38  0.038685      0.03  -0.477636  R
 49  7.000  5.835   0.195  (5.449, 6.220)    1.165       2.06       2.08  0.106367      0.07   0.716874  R
 58  7.500  5.928   0.169  (5.594, 6.262)    1.572       2.74       2.79  0.079613      0.09   0.820652  R
 73  4.000  5.321   0.171  (4.983, 5.658)   -1.321      -2.30      -2.33  0.081243      0.07  -0.692789  R
 80  3.500  4.817   0.168  (4.486, 5.149)   -1.317      -2.29      -2.32  0.078552      0.06  -0.677400  R
 87  4.500  3.109   0.201  (2.712, 3.506)    1.391       2.46       2.50  0.112469      0.11   0.891054  R
 95  7.000  5.781   0.168  (5.449, 6.112)    1.219       2.12       2.14  0.078559      0.05   0.625761  R
106  0.000  1.409   0.113  (1.185, 1.633)   -1.409      -2.39      -2.43  0.035807      0.03  -0.468253  R
110  4.000  5.214   0.186  (4.847, 5.581)   -1.214      -2.13      -2.16  0.096315      0.07  -0.703592  R
115  3.500  2.157   0.092  (1.975, 2.338)    1.343       2.27       2.30  0.023601      0.02   0.357270  R

R  Large residual

Durbin-Watson Statistic

Durbin-Watson Statistic = 2.11398
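To show how the fitted equation is used, the sketch below plugs one viewer's values into the stepwise regression equation. The coefficients are copied from the output above; the `predict_hours` helper and the example viewer are hypothetical and only illustrate the arithmetic.

```python
# Coefficients copied from the stepwise regression equation above.
CONSTANT = 5.572
COEFFICIENTS = {
    "EMPLOYED": -3.193,
    "SPOUSE": -0.242,
    "CABLE": 0.5500,
    "CHILDREN": 0.1946,
    "INCOME": -0.000011,
    "LEISURE": -0.0450,
}


def predict_hours(viewer):
    """Predicted HOURS for a dict holding every predictor in COEFFICIENTS."""
    return CONSTANT + sum(coef * viewer[name] for name, coef in COEFFICIENTS.items())


# Hypothetical viewer, invented for illustration only:
# employed, married, has cable, 2 children, income 50,000, 5 leisure hours.
example_viewer = {"EMPLOYED": 1, "SPOUSE": 1, "CABLE": 1,
                  "CHILDREN": 2, "INCOME": 50000, "LEISURE": 5}
example_hours = predict_hours(example_viewer)  # about 2.30

# The same viewer without a job: the prediction rises by 3.193.
unemployed_hours = predict_hours({**example_viewer, "EMPLOYED": 0})  # about 5.49
```

The comparison makes the size of the EMPLOYED coefficient visible: with everything else held fixed, it moves the prediction far more than any realistic change in the other predictors.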

Here we can see that quite a few variables came out significant. Although some residuals are outside the 2 standard deviation interval, none are outside the 3 standard deviation interval. With 174 data points, about 9 residuals (5% of 174) are expected to fall outside 2 standard deviations under normality, and the observed number is 15, which is reasonably close.
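The expected count used above can be checked with a few lines of Python. This is only a sketch of the arithmetic under an exactly normal model; the `expected_outside` helper is an illustrative name. Note that the exact two-sided tail beyond 2 standard deviations is about 4.55%, slightly below the 5% rule of thumb, so the exact expectation comes out a little under 9.

```python
from statistics import NormalDist


def expected_outside(n, k):
    """Expected number of n standard-normal values falling outside +/- k."""
    tail = 2 * (1 - NormalDist().cdf(k))  # two-sided tail probability
    return n * tail


beyond_2sd = expected_outside(174, 2)  # roughly 8 points
beyond_3sd = expected_outside(174, 3)  # well under 1 point
rule_of_thumb = 0.05 * 174             # the 5% rule gives 8.7
```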

By applying the above method we also took care of the fifth assumption, “No significant outliers in the model”.

Now let's check the other assumptions.

The normality check can be done using the normal probability plot of the residuals, which is given below.

[Figure: Normal Probability Plot (response is HOURS); x-axis: Residual, y-axis: Percent]

The above plot shows no significant deviation, so the normality assumption is validated.

Similarly, the homoscedasticity assumption can be tested using the residuals versus fits plot, which is given below.

[Figure: Versus Fits (response is HOURS); x-axis: Fitted Value, y-axis: Residual]

The plot suggests a slight deviation from randomness; however, all values are within 3 standard deviations. Ignoring this slight deviation, we can say that the homoscedasticity assumption is validated.

The Durbin-Watson statistic of the final model is 2.11398, which is close to 2 and implies no significant serial correlation; thus another assumption is validated.
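For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal Python sketch of that definition follows; the function name and the toy residuals are illustrative. Values near 2 indicate no first-order serial correlation, values near 0 indicate positive correlation, and values near 4 indicate negative correlation. The statistic depends on the order of the rows, so for survey data it reflects the order in which the responses were entered.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a sequence of residuals, in data order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Toy residuals (not from the survey): strictly alternating signs push the
# statistic above 2, while identical residuals give exactly 0.
alternating = durbin_watson([1, -1, 1, -1])
constant = durbin_watson([1, 1, 1, 1])
```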

The last assumption is the absence of multicollinearity, which can be checked using the Variance Inflation Factors (VIFs). All VIFs have low values, implying no multicollinearity in the model. Let's check the correlation matrix to be sure.

Because many of the variables are qualitative, Spearman's rank correlation is the appropriate method; the obtained output is given below.

Spearman Rho: EMPLOYED, SPOUSE, CABLE, CHILDREN, INCOME, LEISURE

          EMPLOYED  SPOUSE   CABLE  CHILDREN  INCOME
SPOUSE      -0.016
             0.838

CABLE        0.112   0.091
             0.139   0.231

CHILDREN    -0.050   0.694  -0.016
             0.511   0.000   0.833

INCOME       0.245   0.062   0.057     0.128
             0.001   0.420   0.452     0.092

LEISURE     -0.067  -0.030  -0.036     0.012   0.162
             0.377   0.695   0.640     0.877   0.033

Cell Contents: Spearman rho
               P-Value
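Spearman's rho is the Pearson correlation computed on ranks, with tied values sharing the average of their ranks. That is what makes it usable for 0/1 dummy variables, where almost every value is tied. The Python below is a minimal sketch of that definition with illustrative function names and toy data, not the survey responses.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy example: a 0/1 dummy against a count variable.
toy_rho = spearman_rho([0, 0, 1, 1], [0, 1, 2, 3])
```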

Here we can see two obvious significant correlations: between Income and Employed, and between Spouse and Children. (Income and Leisure are also weakly but significantly correlated, rho = 0.162, P-value 0.033.) But since the Variance Inflation Factors for all variables are low, there is no problematic multicollinearity in the model.
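The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A short Python sketch of this relationship, with illustrative function names, shows why the values in the coefficient table count as low.

```python
def vif(r_squared):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has the given R-squared."""
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(value):
    """Invert the VIF formula to recover the underlying R-squared."""
    return 1.0 - 1.0 / value


# Largest VIF in the final model: CHILDREN at 1.67.
children_r_squared = r_squared_from_vif(1.67)  # about 0.40
```

So roughly 40% of the variation in CHILDREN is explained by the other predictors, mostly through its correlation with SPOUSE. Commonly quoted warning thresholds for VIF are 5 or 10, well above 1.67.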

So all steps have been performed, and the model performs well.

Conclusions and Recommendations:

From the above analysis we have a fairly clear idea about the data and the outcome. We saw that outliers really affect the model: with the outliers present, most of the variables came out insignificant, and after the outliers were removed, many variables became significant.

Although the final model looks good, we could improve it further by spending more time on it and exploring the data more. With a more iterative approach we could identify additional significant terms, such as interaction terms and higher-order terms, which would improve the model.

The model performs well. The F test shows that the regression model is significant at the 5% significance level (P-value < 0.05). The variables “EMPLOYED”, “SPOUSE”, “CABLE”, “CHILDREN”, “INCOME” and “LEISURE” are all significant at the 5% significance level.

The R-sq is 75.48% and the adjusted R-sq is 74.59%, implying that 75.48% of the variation in Hours is explained by the regression model. Thus the model fits well.
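The fit statistics quoted here follow directly from the sums of squares in the final model's analysis of variance table. As a minimal sketch, the Python below recomputes them; the `fit_statistics` helper is an illustrative name, and the inputs are the regression and error sums of squares, the 174 observations and the 6 predictors of the stepwise model.

```python
def fit_statistics(ss_regression, ss_error, n_obs, n_predictors):
    """Return (R-sq, adjusted R-sq, F-value) from ANOVA sums of squares."""
    ss_total = ss_regression + ss_error
    df_error = n_obs - n_predictors - 1
    r_sq = ss_regression / ss_total
    # Adjusted R-sq compares mean squares instead of raw sums of squares.
    r_sq_adj = 1 - (ss_error / df_error) / (ss_total / (n_obs - 1))
    f_value = (ss_regression / n_predictors) / (ss_error / df_error)
    return r_sq, r_sq_adj, f_value


# Final stepwise model: SS Regression 184.481, SS Error 59.945.
r_sq, r_sq_adj, f_value = fit_statistics(184.481, 59.945, 174, 6)
```

The results match the reported 75.48%, 74.59% and F-value of 85.66 up to rounding. The small gap between R-sq and adjusted R-sq reflects that only 6 predictors are used against 174 observations.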

The significant variables also give us useful information. Since EMPLOYED has a large negative coefficient, advertising companies should target people who are not employed. They should also consider unmarried people, as unmarried persons appear to spend more time watching TV than married people (the beta coefficient is negative).

They should also consider people who have more children and people who have a cable connection to optimize their profit.

However, the regression model also suggests approaching people with lower income and people with less leisure time, which might be a mistake. We should run more tests to see whether this is a real effect or just an artifact of the collected data.

Appendix:

Regression after outlier removal-1:

Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN, AFRICAN AMER, EMPLOYED, ...

Analysis of Variance

Source              DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression          12  258.689        70.75%  258.689   21.557    34.87    0.000
  GENDER             1    3.991         1.09%    0.413    0.413     0.67    0.415
  ASIAN              1    0.731         0.20%    0.121    0.121     0.20    0.659
  CAUCASIAN          1   11.922         3.26%    0.055    0.055     0.09    0.766
  AFRICAN AMERICAN   1    1.642         0.45%    0.158    0.158     0.25    0.614
  EMPLOYED           1  199.412        54.54%  190.482  190.482   308.07    0.000
  SPOUSE             1    0.462         0.13%    2.276    2.276     3.68    0.057
  CABLE              1   18.445         5.04%   17.970   17.970    29.06    0.000
  AGE                1    1.704         0.47%    1.843    1.843     2.98    0.086
  EDUCATION          1    5.017         1.37%    1.086    1.086     1.76    0.187
  CHILDREN           1    4.703         1.29%    6.084    6.084     9.84    0.002
  INCOME             1    8.544         2.34%    6.837    6.837    11.06    0.001
  LEISURE            1    2.116         0.58%    2.116    2.116     3.42    0.066
Error              173  106.967        29.25%  106.967    0.618
  Lack-of-Fit      172  106.467        29.12%  106.467    0.619     1.24    0.630
  Pure Error         1    0.500         0.14%    0.500    0.500
Total              185  365.656       100.00%

Model Summary

S         R-sq    R-sq(adj)  PRESS    R-sq(pred)
0.786324  70.75%  68.72%     126.045  65.53%

Coefficients

Term                   Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant              5.327     0.341  (4.654, 6.000)            15.62    0.000
GENDER                0.097     0.119  (-0.138, 0.332)            0.82    0.415  1.05
ASIAN                 0.086     0.195  (-0.298, 0.470)            0.44    0.659  1.78
CAUCASIAN             0.054     0.182  (-0.305, 0.414)            0.30    0.766  2.06
AFRICAN AMERICAN      0.089     0.177  (-0.260, 0.438)            0.50    0.614  2.12
EMPLOYED             -3.139     0.179  (-3.492, -2.786)         -17.55    0.000  1.12
SPOUSE               -0.290     0.151  (-0.588, 0.008)           -1.92    0.057  1.72
CABLE                 0.652     0.121  (0.413, 0.890)             5.39    0.000  1.07
AGE                 0.01050   0.00608  (-0.00150, 0.02251)        1.73    0.086  1.11
EDUCATION           -0.1156    0.0872  (-0.2878, 0.0565)         -1.33    0.187  1.26
CHILDREN             0.2242    0.0715  (0.0831, 0.3653)           3.14    0.002  1.67
INCOME            -0.000013  0.000004  (-0.000020, -0.000005)    -3.33    0.001  1.24
LEISURE             -0.0485    0.0262  (-0.1001, 0.0032)         -1.85    0.066  1.11

Regression Equation

HOURS = 5.327 + 0.097 GENDER + 0.086 ASIAN + 0.054 CAUCASIAN + 0.089 AFRICAN AMERICAN - 3.139 EMPLOYED - 0.290 SPOUSE + 0.652 CABLE + 0.01050 AGE - 0.1156 EDUCATION + 0.2242 CHILDREN - 0.000013 INCOME - 0.0485 LEISURE

Fits and Diagnostics for Unusual Observations

Obs  HOURS    Fit  SE Fit  95% CI            Resid  Std Resid  Del Resid        HI  Cook's D     DFITS
  9  5.000  2.857   0.188  (2.485, 3.229)    2.143       2.81       2.86  0.057396      0.04   0.70691  R
 10  7.000  5.112   0.235  (4.648, 5.575)    1.888       2.52       2.56  0.089301      0.05   0.80053  R
 30  4.000  1.755   0.231  (1.300, 2.211)    2.245       2.99       3.06  0.086113      0.06   0.93842  R
 49  8.000  5.776   0.225  (5.331, 6.221)    2.224       2.95       3.02  0.082241      0.06   0.90440  R
 75  1.500  3.138   0.206  (2.732, 3.544)   -1.638      -2.16      -2.18  0.068314      0.03  -0.59065  R
 88  4.000  5.723   0.215  (5.298, 6.148)   -1.723      -2.28      -2.31  0.075053      0.03  -0.65697  R
 99  8.000  6.123   0.254  (5.622, 6.625)    1.877       2.52       2.56  0.104455      0.06   0.87501  R
115  3.000  4.785   0.250  (4.291, 5.278)   -1.785      -2.39      -2.43  0.101250      0.05  -0.81480  R
130  7.000  5.450   0.220  (5.016, 5.885)    1.550       2.05       2.07  0.078376      0.03   0.60432  R
131  2.500  4.248   0.271  (3.714, 4.782)   -1.748      -2.37      -2.40  0.118514      0.06  -0.88007  R
149  5.000  2.998   0.220  (2.564, 3.432)    2.002       2.65       2.70  0.078220      0.05   0.78636  R
183  2.000  4.273   0.253  (3.774, 4.772)   -2.273      -3.05      -3.13  0.103446      0.08  -1.06315  R

R  Large residual

Durbin-Watson Statistic

Durbin-Watson Statistic = 1.94462

Regression after outlier removal-2:

Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN, AFRICAN AMER, EMPLOYED, ...

Analysis of Variance

Source              DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression          12  185.856        76.04%  185.856   15.488    42.57    0.000
  GENDER             1    0.728         0.30%    0.009    0.009     0.03    0.873
  ASIAN              1    0.036         0.01%    0.341    0.341     0.94    0.335
  CAUCASIAN          1    9.831         4.02%    0.000    0.000     0.00    0.983
  AFRICAN AMERICAN   1    0.027         0.01%    0.072    0.072     0.20    0.658
  EMPLOYED           1  152.910        62.56%  137.866  137.866   378.98    0.000
  SPOUSE             1    0.217         0.09%    1.339    1.339     3.68    0.057
  CABLE              1   10.858         4.44%   11.504   11.504    31.62    0.000
  AGE                1    0.195         0.08%    0.193    0.193     0.53    0.467
  EDUCATION          1    1.706         0.70%    0.187    0.187     0.52    0.474
  CHILDREN           1    3.180         1.30%    4.262    4.262    11.72    0.001
  INCOME             1    4.431         1.81%    3.477    3.477     9.56    0.002
  LEISURE            1    1.736         0.71%    1.736    1.736     4.77    0.030
Error              161   58.569        23.96%   58.569    0.364
  Lack-of-Fit      160   58.069        23.76%   58.069    0.363     0.73    0.758
  Pure Error         1    0.500         0.20%    0.500    0.500
Total              173  244.425       100.00%

Model Summary

S         R-sq    R-sq(adj)  PRESS    R-sq(pred)
0.603145  76.04%  74.25%     69.9706  71.37%

Coefficients

Term                   Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant              5.518     0.276  (4.973, 6.063)            19.99    0.000
GENDER               0.0151    0.0943  (-0.1712, 0.2014)          0.16    0.873  1.05
ASIAN                 0.146     0.151  (-0.152, 0.444)            0.97    0.335  1.75
CAUCASIAN            -0.003     0.143  (-0.285, 0.279)           -0.02    0.983  2.00
AFRICAN AMERICAN     -0.061     0.138  (-0.334, 0.211)           -0.44    0.658  2.01
EMPLOYED             -3.232     0.166  (-3.560, -2.904)         -19.47    0.000  1.16
SPOUSE               -0.230     0.120  (-0.467, 0.007)           -1.92    0.057  1.72
CABLE                0.5409    0.0962  (0.3509, 0.7308)           5.62    0.000  1.08
AGE                 0.00354   0.00486  (-0.00605, 0.01313)        0.73    0.467  1.13
EDUCATION           -0.0494    0.0688  (-0.1853, 0.0865)         -0.72    0.474  1.29
CHILDREN             0.1977    0.0577  (0.0836, 0.3117)           3.42    0.001  1.70
INCOME            -0.000010  0.000003  (-0.000016, -0.000004)    -3.09    0.002  1.29
LEISURE             -0.0453    0.0208  (-0.0863, -0.0044)        -2.18    0.030  1.10

Regression Equation

HOURS = 5.518 + 0.0151 GENDER + 0.146 ASIAN - 0.003 CAUCASIAN - 0.061 AFRICAN AMERICAN - 3.232 EMPLOYED - 0.230 SPOUSE + 0.5409 CABLE + 0.00354 AGE - 0.0494 EDUCATION + 0.1977 CHILDREN - 0.000010 INCOME - 0.0453 LEISURE

Fits and Diagnostics for Unusual Observations

Obs  HOURS    Fit  SE Fit  95% CI            Resid  Std Resid  Del Resid        HI  Cook's D     DFITS
  1  0.500  1.914   0.148  (1.622, 2.205)   -1.414      -2.42      -2.45  0.059821      0.03  -0.61923  R
  8  0.500  1.965   0.133  (1.703, 2.226)   -1.465      -2.49      -2.53  0.048293      0.02  -0.57008  R
 16  0.000  1.319   0.193  (0.938, 1.701)   -1.319      -2.31      -2.34  0.102496      0.05  -0.79113  R
 21  7.000  5.720   0.205  (5.315, 6.124)    1.280       2.26       2.29  0.115574      0.05   0.82669  R
 23  3.500  2.303   0.127  (2.053, 2.554)    1.197       2.03       2.05  0.044229      0.01   0.44095  R
 29  0.000  1.392   0.159  (1.079, 1.706)   -1.392      -2.39      -2.43  0.069151      0.03  -0.66204  R
 33  6.000  4.838   0.193  (4.457, 5.219)    1.162       2.03       2.05  0.102488      0.04   0.69405  R
 47  0.000  1.365   0.159  (1.052, 1.679)   -1.365      -2.35      -2.38  0.069404      0.03  -0.65010  R
 58  7.500  5.869   0.182  (5.509, 6.229)    1.631       2.84       2.90  0.091226      0.06   0.91922  R
 73  4.000  5.479   0.232  (5.021, 5.936)   -1.479      -2.66      -2.71  0.147459      0.09  -1.12589  R
 80  3.500  4.719   0.194  (4.336, 5.103)   -1.219      -2.14      -2.16  0.103423      0.04  -0.73343  R
 87  4.500  3.152   0.223  (2.711, 3.593)    1.348       2.41       2.44  0.137104      0.07   0.97371  R
 94  1.500  2.789   0.193  (2.409, 3.170)   -1.289      -2.26      -2.29  0.101942      0.04  -0.76986  R
106  0.000  1.337   0.149  (1.042, 1.631)   -1.337      -2.29      -2.32  0.061106      0.03  -0.59129  R
115  3.500  2.116   0.135  (1.848, 2.383)    1.384       2.36       2.39  0.050391      0.02   0.55046  R

R  Large residual

Durbin-Watson Statistic

Durbin-Watson Statistic = 2.11168