Multiple Linear Regression Project
Japan Shah & Vishrut Mehta
Guided by: Dr. Chen
Applied Linear Regression - IE 5318, The University of Texas at Arlington
5/5/16


Project Proposal

Problem statement: In this multiple linear regression project, we try to determine the relationship between one response variable (Y) and four predictor variables (X1, X2, X3, X4): how the values of the predictor variables affect the value of the response, and which predictor shows the strongest linear relationship with the response.

The variables: The variables are as follows.

Response variable (Y): number of followers a person has on Twitter (in millions)
Predictor variable (X1): number of tweets posted by the person
Predictor variable (X2): number of years since the person joined Twitter
Predictor variable (X3): number of photos and videos posted
Predictor variable (X4): number of people the person follows back

The data collection process: We used the official Forbes site to find the 100 most-followed people on Twitter (http://www.forbes.com/sites/maddieberg/2015/06/29/twitters-most-followed-celebrities-retweets-dont-always-mean-dollars/#35671f137ef3), and we used Twitter itself (www.twitter.com) to collect the data for the response variable and each predictor variable. We searched Twitter for these people's accounts and confirmed each account was genuine by looking for the verified-account symbol that marks official accounts. We use the 40 most-followed people among them.

Why modeling this data set would be meaningful: We all know how much social media affects people's lives in today's world, and it is constantly evolving with new sites; it has revolutionized how we look at things. While some people have made their way to success through hard work and years of experience, others have made it through social media, which involves the same amount of hard work. Today Twitter is the second most popular social media site according to ebizmba.com (http://www.ebizmba.com/articles/social-networking-websites). It is therefore worth understanding which factors drive a person's popularity; we will determine which of these four predictor variables affects the popularity of a person the most.


I. Matrix Scatterplot and Pairwise Correlation Coefficients


The scatterplot matrix helps us understand the relationship between the response variable, number of followers (Y), and the predictor variables: number of tweets (X1), years since joining (X2), number of photos uploaded (X3), and following back (X4). It also shows the scatterplots between pairs of predictors: X1 vs X2, X1 vs X3, X1 vs X4, X2 vs X3, X2 vs X4, and X3 vs X4.

The correlation coefficient matrix gives the pairwise correlations between the response (Y) and each predictor (X), and between pairs of predictors; a correlation coefficient always lies in -1 <= r <= +1. For response-predictor correlations, |r| greater than 0.7 is considered high, around 0.5 moderate, and below 0.3 low; a high correlation (|r| >= 0.7) between the response and a predictor is desirable. A large correlation between two predictors, by contrast, indicates multicollinearity. If the correlation between two predictors is greater than zero, high values of one predictor occur with high values of the other and low values with low values; if it is less than zero, high values of one predictor occur with low values of the other. A high correlation between two predictors is bad because it indicates severe multicollinearity, which inflates the variances of the coefficient estimates and makes the estimates sensitive to changes in the model; we therefore want predictor-predictor correlations near zero.

Discussion of the scatterplot and correlation matrix:

Y vs X plots
Number of followers vs number of tweets (Y vs X1): positive upward trend; r = 0.14782, a low correlation.
Number of followers vs years since they joined (Y vs X2): positive upward trend; r = 0.51656, a moderate correlation.
Number of followers vs number of photos uploaded (Y vs X3): positive trend; r = 0.32294, a low correlation.
Number of followers vs following back (Y vs X4): positive upward trend; r = 0.47663, a moderate correlation.

X vs X plots
Number of tweets vs years since they joined (X1 vs X2): positive upward trend; r = 0.14618, near zero, so these predictors are only weakly correlated, which is good.
Number of tweets vs number of photos uploaded (X1 vs X3): strong positive upward trend; r = 0.77547, so these predictors are highly correlated, which is bad.
Number of tweets vs following back (X1 vs X4): positive upward trend; r = 0.18112, near zero, which is good.
Years since they joined vs number of photos uploaded (X2 vs X3): positive upward trend; r = 0.20020, near zero, which is good.
Years since they joined vs following back (X2 vs X4): positive upward trend; r = 0.57997, so these predictors are moderately correlated, which is bad.
Number of photos vs following back (X3 vs X4): positive upward trend; r = 0.23780, near zero, which is good.

Overall, we can say there is a multicollinearity problem.

Potential complication: there is severe multicollinearity between X1 and X3 (r = 0.77547).
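A minimal sketch of how the correlation matrix and scatterplot matrix above could be produced; the file name twitter.csv and the column names (followers, tweets, years, photos, following) are placeholders, not names taken from the report.

```python
# Sketch: pairwise correlations and a scatterplot matrix for Y, X1..X4.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("twitter.csv")  # hypothetical file name
cols = ["followers", "tweets", "years", "photos", "following"]  # Y, X1..X4

# One table with the Y-vs-X and X-vs-X Pearson correlations
print(df[cols].corr().round(5))

# Scatterplot matrix (histograms on the diagonal)
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10), diagonal="hist")
plt.show()
```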

II. PRELIMINARY MULTIPLE LINEAR REGRESSION MODEL


The general linear regression model is

Yi = β0 + β1Xi1 + β2Xi2 + … + βp−1Xi,p−1 + εi

where β0, β1, …, βp−1 are parameters, Xi1, …, Xi,p−1 are known constants, and the εi are independent N(0, σ²), i = 1, …, n.

The fitted model is

Y (No of followers) = −25.40410 − 0.00044178(No of tweets) + 8.15496(Years since they joined) + 0.00877(No of photos uploaded) + 0.00003188(Following back)

Model assumptions. For model adequacy, we need to satisfy the following assumptions:
a) the current MLR model form is reasonable;
b) the residuals have constant variance;
c) the residuals are normally distributed (not required, but desired);
d) the residuals are uncorrelated;
e) there are no outliers;
f) the predictors are not highly correlated with each other.

a) The current MLR model form is reasonable

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi

where Yi is the number of followers; β0, β1, β2, β3, β4 are the parameters; Xi1 is the number of tweets, Xi2 the years since they joined, Xi3 the number of photos uploaded, and Xi4 the following back.
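A minimal sketch of how this preliminary fit could be reproduced, under the same hypothetical file and column names as the earlier sketch; `model.resid` and `model.fittedvalues` feed the residual checks that follow.

```python
# Sketch: preliminary MLR fit of followers on the four predictors.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("twitter.csv")  # hypothetical file name
model = smf.ols("followers ~ tweets + years + photos + following", data=df).fit()
print(model.summary())        # coefficients, t tests, overall F test, R-squared

resid = model.resid           # residuals, for the assumption checks below
fitted = model.fittedvalues   # fitted values, for the residual-vs-fitted plot
```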


From the plots of residuals versus each predictor (number of tweets, years since they joined, number of photos uploaded, following back), there is no curvature in any of the graphs, which shows that the MLR model form is reasonable.

b) The residuals have constant variance

Residuals vs Ŷ: this plot helps us see whether the residuals have constant variance.

The plot shows a funnel shape, so the residuals have non-constant variance. Modified Levene test: a test for constancy of the error variance.
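The modified Levene (Brown-Forsythe) test splits the residuals into a low group and a high group by fitted value and compares the absolute deviations from each group's median; a sketch, reusing `resid` and `fitted` from the fit above:

```python
# Sketch: Brown-Forsythe (modified Levene) test for constant error variance.
import numpy as np
from scipy import stats

med = np.median(fitted)
low = resid[fitted <= med]    # residuals with small fitted values
high = resid[fitted > med]    # residuals with large fitted values

stat, p = stats.levene(low, high, center="median")  # Brown-Forsythe variant
print(f"test statistic = {stat:.4f}, p-value = {p:.4f}")
# p < 0.05 -> reject H0 of constant variance
```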


F test. H0: the variance is constant; H1: the variance is not constant. Decision rule: if p < α, reject H0, with α = 0.05. From the table, p = 0.0006 < 0.05, so we reject H0; this is a strong conclusion. We conclude that the error variance is not constant.

Two-sample t test. Since the variance is not constant, we read the unequal-variance row. H0: the variance is constant; H1: the variance is not constant. Decision rule: if p < α, reject H0, with α = 0.05. From the table, p = 0.0482 < 0.05, so we reject H0 and conclude that the error variance is not constant.

c) The residuals are normally distributed

Normal probability plot: this plot of the residuals versus their expected normal scores helps us see whether the residuals are normally distributed.

From the graph, the residuals are right-skewed, so they are not normal. Test for normality:


H0: normality is okay; H1: normality is violated. Decision rule: if ρ < c(α, n), reject H0, with α = 0.10. From the critical values table, c(α, n) = 0.977, and from the output ρ = 0.92338 < 0.977, so we reject H0; this is a strong conclusion. We conclude that normality is violated.

d) The residuals are uncorrelated: the data were not collected in time order, so a time plot is not relevant.

e) Outliers: there are no outliers.

f) The predictors are not highly correlated with each other.

Variance Inflation Factor: VIF is a method of detecting multicollinearity between the predictors. It measures how much the variances of the estimated regression coefficients are inflated compared to when the predictors are not linearly related. It is found by regressing Xk on the other p − 2 predictors: (VIF)k = 1 / (1 − Rk²), where Rk² is the resulting coefficient of multiple determination.
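A sketch of how the VIFs could be computed with statsmodels, under the same assumed column names:

```python
# Sketch: variance inflation factors for the four predictors.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["tweets", "years", "photos", "following"]])
for i, name in enumerate(X.columns):
    if name != "const":  # skip the intercept column
        print(name, round(variance_inflation_factor(X.values, i), 5))
```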

Guideline: if the mean VIF is much greater than 1 or any (VIF)k ≥ 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. Here (VIF)1 = 2.50905, (VIF)2 = 1.51656, (VIF)3 = 2.58140, (VIF)4 = 1.54274: the coefficient variances for number of tweets, years since they joined, number of photos uploaded, and following back are inflated 2.51, 1.52, 2.58, and 1.54 times compared to when the predictors are not linearly related. No predictor has VIF > 5, and the mean VIF of 2.037 is not much bigger than 1, so serious multicollinearity is not a problem.

Transformation. The residuals showed non-constant variance and non-normality, so to satisfy the model assumptions we performed a transformation. The main goal of the transformation is to stabilize the variance, since no transformation is guaranteed to fix normality. We tried variance-stabilizing transformations from weakest to strongest: the weakest is the square root, Y' = sqrt(Y) (λ = 0.5), and the strongest is the reciprocal, Y' = −1/Y (λ = −1). The square-root transformation did not satisfy the model assumptions, so we moved to the log transformation (λ = 0).

a) The current MLR model form is reasonable


log10(Yi) = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi

where log10(Yi) is the log of the number of followers; β0, β1, β2, β3, β4 are the parameters; Xi1 is the number of tweets, Xi2 the years since they joined, Xi3 the number of photos uploaded, and Xi4 the following back.
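A sketch of the log10 refit, under the same assumed column names; `log_model` is reused in the later diagnostic sketches.

```python
# Sketch: refit after the variance-stabilizing log10 transformation.
import numpy as np
import statsmodels.formula.api as smf

df["logY"] = np.log10(df["followers"])  # log10 of the response
log_model = smf.ols("logY ~ tweets + years + photos + following", data=df).fit()
print(log_model.summary())

resid = log_model.resid            # residuals of the transformed model
fitted = log_model.fittedvalues    # fitted log10 values
```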

From the plots of residuals versus each predictor (number of tweets, years since they joined, number of photos uploaded, following back), there is no curvature in any of the graphs, so the MLR model form is reasonable.

b) The residuals have constant variance

Residuals vs log10Y:

From the graph, there is no funnel shape, so the variance is constant.


Modified Levene test

F test: H0: the variance is constant; H1: the variance is not constant. Decision rule: if p < α, reject H0, with α = 0.05. From the table, p = 0.1910 > 0.05, so we fail to reject H0 and conclude that the variance is constant. Since the variance is constant, we read the equal-variance row of the t test.

Two-sample t test: H0: the variance is constant; H1: the variance is not constant. Decision rule: if p < α, reject H0, with α = 0.05. From the table, p = 0.0936 > 0.05, so we fail to reject H0 and conclude that the variance is constant.

c) Normality plot and normality test

Normality test

α = 0.05. H0: normality is okay; H1: normality is violated.


Decision rule: if ρ < c(α, n), reject H0. From the critical values table, c(α, n) = 0.972, and from the output ρ = 0.974 > 0.972, so we fail to reject H0; this is a weak conclusion. We conclude that normality is okay. At α = 0.10 the model would fail the normality test, but in multiple linear regression the normality assumption is desired rather than required, so we can move ahead with this result. The important assumption is constant error variance, which the log transformation achieved, so we stopped the transformations at the log.

d) The residuals are uncorrelated: the data were not collected in time order, so a time plot is not relevant.

e) Diagnostics

Outlying X observations. The hat matrix helps identify X outliers: its diagonal elements hii lie between 0 and 1 and sum to p, the number of parameters. hii measures the distance between the X values of the ith case and the mean of the X values of all n cases, and in this context it is called the leverage of observation i. If hii is large, observation i is considered an X outlier with high leverage; a leverage value is considered large when hii > 2p/n. Here 2p/n = (2×5)/39 = 0.2564. Comparing, observations 3, 9, 11, 15, 31, and 32 exceed 0.2564, so they are X outliers.

Outlying Y observations. We consider Y observations whose studentized deleted residuals are large in absolute value to be Y outliers; the Rstudent column gives the studentized deleted residuals. To judge the largest absolute value |ti|, we perform the Bonferroni outlier test.


Bonferroni outlier test: α = 0.10, n = 39, p = 5. The critical value is t(1 − α/(2n); n − p − 1) = t(1 − 0.10/78; 33) = t(0.9987; 33) = 3.25817. No observation exceeds 3.25817 in absolute value, so there are no Y outliers. From the two results above, observations 3, 9, 11, 15, 31, and 32 are X outliers.

Influence. After finding the outliers with respect to the X values, the next step is to check whether these X outliers are influential, using three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.

1) DFFITS measures the influence of an X outlier on its fitted value Ŷ (number of followers). Guideline: |DFFITS| > 2·sqrt(p/n) = 2·sqrt(5/39) = 0.71611 is considered influential.

Observation   Type of outlier   DFFITS
3             X                 -1.3807
9             X                 -0.3078
11            X                 -0.6020
15            X                 -0.0438
31            X                 -0.6945
32            X                  0.0242

Only observation 3 exceeds 0.71611 in absolute value, so it influences the fitted number of followers.

2) DFBETAS measures the influence of the X outliers on the regression coefficients of the intercept, number of tweets, years since they joined, number of photos uploaded, and following back; a large |DFBETAS| indicates influence. Guideline: |DFBETAS| > 2/sqrt(n) = 2/sqrt(39) = 0.3202 is considered influential.

Observation   Type   Intercept   X1        X2        X3        X4
3             X       0.0599     0.1398   -0.0490   -0.0977   -1.0565
9             X      -0.0173     0.0315    0.0061    0.0439   -0.2427
11            X       0.0519     0.2885   -0.0207   -0.5411    0.1529
15            X       0.0005    -0.0241    0.0021   -0.0033    0.0029
31            X       0.0824    -0.5061   -0.0539    0.1173    0.1629
32            X       0.0802    -0.0093   -0.0203    0.0124    0.0083

Outliers 3 and 31 influence the following back and number of tweets coefficients, respectively; observation 11 also exceeds the cutoff for the number of photos uploaded coefficient.


3) Cook's distance measures the aggregate influence of the ith case on all n fitted values, denoted Di. Guideline: observation i is influential if Di > F(0.50; p, n − p) = F(0.50; 5, 34) = 0.8878.

Observation   Type   Cook's D
3             X      0.21800
9             X      0.06657
11            X      0.08801
15            X      0.00886
31            X      0.08420
32            X      0.00127

None of the observations exceeds the F value, so none is influential on all n fitted values of the number of followers.
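All of the diagnostics in this section (leverage, studentized deleted residuals, DFFITS, DFBETAS, Cook's distance) are available from a single statsmodels call; a sketch using `log_model` from the earlier sketch (the indices it prints are 0-based, while the tables above number observations from 1):

```python
# Sketch: outlier and influence diagnostics for the log-transformed model.
import numpy as np
from scipy import stats

infl = log_model.get_influence()
n = int(log_model.nobs)          # 39 observations
p = int(log_model.df_model) + 1  # 5 parameters

leverage = infl.hat_matrix_diag            # h_ii: flag h_ii > 2p/n
rstudent = infl.resid_studentized_external # studentized deleted residuals
dffits = infl.dffits[0]                    # flag |DFFITS| > 2*sqrt(p/n)
dfbetas = infl.dfbetas                     # flag |DFBETAS| > 2/sqrt(n)
cooks_d = infl.cooks_distance[0]           # flag D_i > F(0.50; p, n-p)

bonf_t = stats.t.ppf(1 - 0.10 / (2 * n), n - p - 1)  # Bonferroni cutoff
print("X outliers (0-based):", np.where(leverage > 2 * p / n)[0])
print("Y outliers (0-based):", np.where(np.abs(rstudent) > bonf_t)[0])
```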

f) Variance Inflation Factor

Guideline: if the mean VIF is much greater than 1 or any (VIF)k ≥ 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. Here (VIF)1 = 2.50905, (VIF)2 = 1.51656, (VIF)3 = 2.58140, (VIF)4 = 1.54274: no predictor has VIF > 5, and the mean VIF of 2.037 is not much bigger than 1, so serious multicollinearity is not a problem.

The preliminary model is


log10(No of followers) = 0.87514 − 0.00000570(No of tweets) + 0.08402(Years since they joined) + 0.00011227(No of photos uploaded) + 3.081619E-7(Following back)

The parameter estimates are b0 = 0.87514, b1 = −0.00000570, b2 = 0.08402, b3 = 0.00011227, b4 = 3.081619E-7. All effects are on the log10 scale: log10(number of followers) decreases by 0.00000570 when the number of tweets increases by one unit, holding the other predictors constant; increases by 0.08402 when years since they joined increases by one year; increases by 0.00011227 when the number of photos uploaded increases by one; and increases by 3.08E-7 when following back increases by one, each time holding the other predictors constant.

There are 39 observations, so the degrees of freedom for the corrected total are n − 1 = 38. We have 4 predictors, so the model has 4 degrees of freedom, and the error has n − 4 − 1 = 34.

Standard errors: the standard errors of the regression coefficients, which help us construct confidence intervals.

Sums of squares: SST (total sum of squares) = SSM (model sum of squares) + SSE (error sum of squares). SST shows the total variation in the response; SSE shows the unexplained variation in the number of followers (yi − ŷi), the variation due to deviation from the model; SSM shows the explained variation (ŷi − ȳ), the variation due to the model.

Mean square: the ratio of a sum of squares to its corresponding degrees of freedom. The mean square error estimates the variance σ² of our model; MSE = 0.02251. Root MSE = 0.15004 is s, the estimate of the parameter σ. Dependent mean = 1.51431, the mean of log10(Y).

F value: the test statistic for whether the regression is significant. H0: β1 = β2 = β3 = β4 = 0; H1: not all of β1, β2, β3, β4 are 0. Decision rule: reject H0 if F* > F(1 − α; p − 1, n − p), with α = 0.05. F* = 5.41 and F(0.95; 4, 34) ≈ 2.65; since 5.41 > 2.65, we reject H0 and conclude that not all of β1, β2, β3, β4 are zero. In other words, the regression is significant.

Coefficient of multiple determination (R²): from the table, R² = 0.3891, the fraction of the variability in number of followers explained by number of tweets, years since they joined, number of photos uploaded, and following back.

Significance of the predictors. The t value is the ratio of a parameter estimate to its standard error; the null hypothesis is that the regression coefficient is zero, in which case the predictor does not contribute significantly to the model. The predictor with the smallest |t*| is a candidate to drop.


The Pr > |t| values are two-sided. Examining the t statistics and their p values shows significance: the p values for number of tweets (X1) and following back (X4) are greater than α = 0.05, so these predictors are not significant for predicting the number of followers. The p value for years since they joined (X2) is almost exactly 0.05, so this predictor is significant, and the p value for number of photos uploaded (X3) is less than α = 0.05, so it is significant as well.

III. Exploration of Interaction Terms Using Partial Regression Plots

A partial regression plot is also called an added-variable plot. It shows the marginal role of an interaction term given that the other predictor variables are already in the model, so it tells us which interactions would help predict the number of followers (Y). For each plot, we look for a trend, positive or negative; if there is no trend, we should not add the interaction. The possible interaction terms are listed below (see the sketch after this list for how one such plot is built):

1) X1X2: number of tweets × years since they joined
2) X1X3: number of tweets × number of photos uploaded
3) X1X4: number of tweets × following back
4) X2X3: years since they joined × number of photos uploaded
5) X2X4: years since they joined × following back
6) X3X4: number of photos uploaded × following back
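A sketch of how one added-variable plot for an interaction term, say X2X4, could be constructed (`logY` is the column created in the earlier log-transformation sketch):

```python
# Sketch: added-variable (partial regression) plot for the X2*X4 interaction.
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df["x2x4"] = df["years"] * df["following"]  # candidate interaction term

# Residuals of log10(Y) and of X2X4, each given X1..X4
e_y = smf.ols("logY ~ tweets + years + photos + following", data=df).fit().resid
e_i = smf.ols("x2x4 ~ tweets + years + photos + following", data=df).fit().resid

plt.scatter(e_i, e_y)
plt.xlabel("e(X2X4 | X1, X2, X3, X4)")
plt.ylabel("e(log10 Y | X1, X2, X3, X4)")
plt.show()  # a visible trend suggests the interaction adds information
```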

Here we regress the residuals of log10Y given X1, X2, X3, X4 against the residuals of X1X2 given X1, X2, X3, X4. The plot shows a slight negative trend, but the points form an essentially horizontal band, so this interaction does not contain additional information useful for predicting the number of followers (Y).


Here we regress the residuals of log10Y given X1, X2, X3, X4 against the residuals of X1X3 given X1, X2, X3, X4. The plot shows a slight negative trend, but the points again form an essentially horizontal band, so this interaction does not contain additional information useful for predicting the number of followers (Y).

Here we regress the residuals of log10Y given X1, X2, X3, X4 against the residuals of X1X4 given X1, X2, X3, X4. The trend is positive, so this interaction may be helpful in predicting the number of followers.


Here we regress the residuals of log10Y given X1, X2, X3, X4 against the residuals of X2X3 given X1, X2, X3, X4. There is no trend, so this interaction will not provide additional information for predicting the number of followers.

Here we regress the residuals of log10Y given X1, X2, X3, X4 against the residuals of X2X4 given X1, X2, X3, X4. The trend is negative, so this interaction will be helpful in predicting the number of followers.


Here we regress the residuals of log10Y given X1, X2, X3, X4 against the residuals of X3X4 given X1, X2, X3, X4. The trend is positive, so this interaction may be useful in predicting the number of followers.

From the interaction plots above: the X1X2 and X1X3 plots have only a slight negative trend, so those terms won't be helpful, and the X2X3 plot has no trend, so there is no need to add it to the model. Among X1X4, X2X4, and X3X4, the X2X4 plot has more scatter around the regression line, so X2X4 is the one interaction term we add.

Correlations involving the added interaction term before and after standardization. Standardization is important for models with interaction and polynomial terms: standardizing the predictors helps reduce multicollinearity. A standardized variable is calculated by centering the mean to zero and scaling the variance to 1; centering the predictors matters especially for interaction terms.

Before Standardization

After Standardization


The results above show the effect of standardization: it helps reduce multicollinearity. In the after-standardization correlation matrix, the correlations between number of tweets, years since they joined, number of photos, following back, and the years since they joined × following back interaction have decreased.
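A sketch of the standardization step, under the same assumed column names: center each predictor at its mean, scale to unit variance, and only then form the interaction.

```python
# Sketch: standardize X2 and X4 before forming the X2*X4 interaction.
cols = ["years", "following"]
z = (df[cols] - df[cols].mean()) / df[cols].std()  # mean 0, variance 1

df["z_x2x4"] = z["years"] * z["following"]         # standardized interaction
print(df[cols + ["z_x2x4"]].corr().round(3))       # correlations afterwards
```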

IV. Model Search

With 4 predictor variables the number of parameters is p = 5, and the total number of possible models is 2^(p−1) = 2^4 = 16. Assessing every model is difficult, so to deal with this complexity we use three model search procedures: a) best subsets, b) backward deletion, c) stepwise.

a) Best subsets. For selecting a potentially best model there are three criteria to check: 1) Cp, 2) AIC (Akaike's information criterion), 3) SBC (Schwarz Bayesian criterion). We first look at Ra² (adjusted R²): at some stage its value levels off, and we discard any model whose Ra² decreased from the previous step. Then we apply the criteria above; when Cp = p (the number of parameters), there is no bias.
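A sketch of an exhaustive best-subsets search; note that statsmodels' AIC/BIC differ from SAS's AIC/SBC by additive constants, so the rankings, not the printed values, are what match the table below.

```python
# Sketch: exhaustive best-subsets search over the four predictors.
from itertools import combinations
import statsmodels.formula.api as smf

predictors = ["tweets", "years", "photos", "following"]
results = []
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = smf.ols("logY ~ " + " + ".join(subset), data=df).fit()
        results.append((subset, fit.rsquared_adj, fit.aic, fit.bic))

# Show the candidates with the smallest AIC (BIC plays the role of SBC)
for subset, r2a, aic, bic in sorted(results, key=lambda r: r[2])[:5]:
    print(subset, f"adj-R2 = {r2a:.4f}, AIC = {aic:.2f}, BIC = {bic:.2f}")
```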



Number in model   Ra²      R²       Cp       AIC         SBC
1                 0.2356   0.2557   8.7531   -141.5964   -138.26926
2                 0.2863   0.3238   6.7459   -143.3425   -138.35179
3                 0.3053   0.3602   6.6100   -143.4968   -136.84251
4                 0.3173   0.3891   6.9077   -143.3032   -134.09534
5                 0.3535   0.4386   6.0000   -144.5965   -134.61512

By these criteria, Model 5 has the minimum Cp and minimum AIC, but we do not consider it because it has a severe multicollinearity problem. Among the remaining models, Model 3 has the minimum Cp and AIC, and Model 2 has the minimum SBC. From this method we selected the following models:

Model 1: log10Y (No of followers) = β0 + β1(Years since they joined) + β2(No of photos uploaded)
Model 2: log10Y (No of followers) = β0 + β1(No of tweets) + β2(Years since they joined) + β3(No of photos uploaded)

b) Backward deletion



From backward deletion, we selected the following model: Model 1: log10Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded).

c) Stepwise



From stepwise selection, we again obtained Model 1: log10Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded). From the model search procedures above, we selected the following two models as our best candidates:

Model I: log10Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)
Model II: log10Y (No of followers) = β0 + β1(No of tweets) + β2(Years since they joined) + β3(No of photos uploaded)



V. Model Selection

Model I

log10Y (No of followers) = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)

Checking the model assumptions:

a) The current MLR model is reasonable

Yi = β0 + β1Xi1 + β2Xi2 + εi, where Yi is the number of followers, Xi1 the years since they joined, and Xi2 the number of photos uploaded.

From the residual-versus-predictor plots, there is no curvature in either graph, so the MLR model form is reasonable.

b) The residuals have constant variance

Residuals vs log10Y:


From the graph, the residuals have constant variance: there is no funnel shape.

c) The residuals are normally distributed

Normal probability plot:

From the plot, the residuals look normal, so normality is satisfied.

d) The residuals are uncorrelated: the data were not collected in time order.

e) Diagnostics


Outlying X observations. If hii is large, observation i is considered an X outlier with high leverage; a leverage value is considered large when hii > 2p/n. Here 2p/n = (2×3)/39 = 0.1538. Comparing, observations 3, 11, 15, and 32 exceed 0.1538, so they are X outliers.

Outlying Y observations. We consider Y observations whose studentized deleted residuals (the Rstudent column) are large in absolute value to be Y outliers. Bonferroni outlier test: α = 0.10, n = 39, p = 3; the critical value is t(1 − α/(2n); n − p − 1) = t(0.9987; 35) = 3.2431. No observation exceeds 3.2431 in absolute value, so there are no Y outliers.

From the two results above, there are four X outliers in total (3, 11, 15, 32).

Influence


After finding the X outliers, the next step is to check whether they are influential, using three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.

1) DFFITS measures the influence of an outlier on its fitted value Ŷ (number of followers). Guideline: |DFFITS| > 2·sqrt(p/n) = 2·sqrt(3/39) = 0.5547 is considered influential.

Observation   Type of outlier   DFFITS
3             X                 0.1763
11            X                 0.2651
15            X                 0.2602
32            X                 0.4234

None of the observations exceeds 0.5547, so none influences its fitted value.

2) DFBETAS measures the influence of the outliers on the regression coefficients of the intercept, years since they joined, and number of photos uploaded; a large |DFBETAS| indicates influence. Guideline: |DFBETAS| > 2/sqrt(n) = 2/sqrt(39) = 0.3202 is considered influential.

Observation   Type   Intercept   X2       X3
3             X      0.1554      0.1554   0.0297
11            X      0.0162      0.0388   0.2518
15            X      0.0147      0.0345   0.2410
32            X      0.3771      0.3836   0.1965

Observations 3, 11, and 15 are not influential, while observation 32 is slightly influential on the intercept and on the coefficient of years since they joined.

3) Cook's distance measures the aggregate influence of the ith case on all n fitted values, denoted Di. Guideline: observation i is influential if Di > F(0.50; p, n − p) = F(0.50; 3, 36) = 0.80381.

Observation   Type   Cook's D
3             X      0.07852
11            X      0.02396
15            X      0.02301
32            X      0.06062

None of the observations exceeds the F value, so none is influential on the fitted values.

From the three influence measures above, none of the outliers is influential.

f) The predictors are not highly correlated with each other: Variance Inflation Factor


Guideline: if the mean VIF is much greater than 1 or any (VIF)k ≥ 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 1.04175 and (VIF)2 = 1.04175: neither predictor has VIF > 5, and the mean VIF of 1.04175 is not much bigger than 1, so serious multicollinearity is not a problem.

Model II

log10Y (No of followers) = 0.67727 − 0.00000568(No of tweets) + 0.11367(Years since they joined) + 0.000118437(No of photos uploaded)

Verifying the model assumptions:

a) The current MLR model is reasonable

Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi, where Yi is the number of followers, Xi1 the number of tweets, Xi2 the years since they joined, and Xi3 the number of photos uploaded.


From the plots of residuals versus each predictor (number of tweets, years since they joined, number of photos uploaded), there is no curvature in any graph, so the MLR model form is reasonable.

b) The residuals have constant variance

Residuals vs log10Y:

There is no funnel shape, so the variance is constant.

c) The residuals are normally distributed

Normal probability plot:


The plot is not perfectly straight, but normality is okay.

d) The residuals are uncorrelated: the data were not collected in time order.

e) Diagnostics

Outlying X observations. A leverage value hii is considered large when hii > 2p/n; here 2p/n = (2×4)/39 = 0.2051. Comparing, observations 3, 11, 15, 31, and 32 exceed 0.2051, so they are X outliers.


Outlying Y observations. We consider Y observations whose studentized deleted residuals (the Rstudent column) are large in absolute value to be Y outliers. Bonferroni outlier test: α = 0.10, n = 39, p = 4; the critical value is t(1 − α/(2n); n − p − 1) = t(0.9987; 34) = 3.2504. No observation exceeds 3.2504 in absolute value, so there are no Y outliers. From the two results above, there are five X outliers in total.

Influence. After finding the outliers with respect to the X values, the next step is to check whether these X outliers are influential, using the three influence measures: 1) DFFITS, 2) DFBETAS, 3) Cook's distance.

1) DFFITS measures the influence of the X outliers on their fitted values Ŷ (number of followers). Guideline: |DFFITS| > 2·sqrt(p/n) = 2·sqrt(4/39) = 0.6405 is considered influential.

Observation   Type of outlier   DFFITS
3             X                 0.1054
11            X                 0.7406
15            X                 0.0734
31            X                 0.7877
32            X                 0.2364

Observations 11 and 31 exceed 0.6405, so these two observations influence their fitted values.

2) DFBETAS measures the influence of the X outliers on the regression coefficients of the intercept, number of tweets, years since they joined, and number of photos uploaded. Guideline: |DFBETAS| > 2/sqrt(n) = 2/sqrt(39) = 0.3202 is considered influential.

Observation   Type   Intercept   X1       X2       X3
3             X      0.0913      0.0612   0.0916   0.0236
11            X      0.0494      0.3676   0.0994   0.6727
15            X      0.0023      0.0404   0.0075   0.0051
31            X      0.0104      0.5898   0.0523   0.1557
32            X      0.1948      0.0965   0.1969   0.1382

Observation 11 influences the coefficients of number of tweets and number of photos uploaded, while observation 31 influences the coefficient of number of tweets.


3) Cook's distance measures the aggregate influence of the ith case on all n fitted values. Guideline: observation i is influential if Di > F(0.50; p, n − p) = F(0.50; 4, 35) = 0.8556.

Observation   Type   Cook's D
3             X      0.00286
11            X      0.13717
15            X      0.00139
31            X      0.15318
32            X      0.01433

None of the observations exceeds the F value, so none is influential on all n fitted values.

From the three influence measures above, the X outliers are not especially influential.

f) The predictors are not highly correlated with each other: Variance Inflation Factor

Guideline: if the mean VIF is much greater than 1 or any (VIF)k ≥ 10, there is serious multicollinearity; we should avoid models with any (VIF)k > 5. From the table, (VIF)1 = 2.50902, (VIF)2 = 1.04198, (VIF)3 = 2.55792: no predictor has VIF > 5, and the mean VIF of 2.0363 is not much bigger than 1, so serious multicollinearity is not a problem.

Selection of the best model. Both models are significant. R² is 0.3238 for Model I and 0.3602 for Model II, so adding number of tweets (X1) in Model II does not increase R² much. The residuals-versus-log10Y plots for both models show constant variance, but Model I's variance is more clearly constant, and its normal probability plot is straighter than Model II's. The X outliers are more influential in Model II than in Model I. In the ANOVA for Model I all predictors have p < α, so all are significant, while in Model II number of tweets (X1) is not significant (p > α, and its t* is the smallest among the predictors), so it does not give additional information. From this analysis, we selected Model I as the best model; a partial F test comparing the two nested models (sketched below) makes the same point.
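A sketch of that partial F test, fitting the two nested models and comparing them; `model_1` is reused in the interval sketches below.

```python
# Sketch: does No of tweets (X1) add anything beyond Model I?
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

model_1 = smf.ols("logY ~ years + photos", data=df).fit()           # Model I
model_2 = smf.ols("logY ~ tweets + years + photos", data=df).fit()  # Model II

print(anova_lm(model_1, model_2))  # a large p-value says X1 is not worth adding
```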

FINAL MULTIPLE LINEAR REGRESSION MODEL

After verifying the model assumptions and performing diagnostics, we take the model below as our final model. We can predict the number of followers from the years since joining and the number of photos uploaded: a person's follower count depends mainly on how long the person has been on Twitter, and secondarily on the number of photos he or she has uploaded.


Fit of the model: log10Y = 0.66796 + 0.11439(Years since they joined) + 0.00006297(No of photos uploaded)

log10(number of followers) increases by 0.11439 when years since they joined increases by one year, holding the number of photos uploaded constant, and by 0.00006297 when the number of photos uploaded increases by one, holding years since they joined constant.

The VIF values for years since they joined and number of photos uploaded are less than 5, and the mean VIF of 1.04175 is only slightly bigger than 1, so there is no severe multicollinearity problem.

From the table, the p value for years since they joined is less than 0.05, so this predictor is significant. The p value for number of photos uploaded is slightly greater than 0.05, so it is only marginally significant; at α = 0.10 it becomes significant.

F test (is the regression significant?): H0: β1 = β2 = 0; H1: not both β1, β2 are 0. Decision rule: reject H0 if F* > F(1 − α; p − 1, n − p), with α = 0.05. F* = 8.62 and F(0.95; 2, 36) = 3.2594; since 8.62 > 3.2594, we reject H0 and conclude that not both β1, β2 are zero. In other words, the regression is significant.

Coefficient of multiple determination: R² = 0.3238, the fraction of the variability in number of followers explained by the model with years since they joined and number of photos uploaded.

Joint confidence intervals for the parameters. The Bonferroni joint confidence intervals estimate g regression coefficients simultaneously, where g ≤ p; the confidence limits are bk ± B·s{bk}, where B = t(1 − α/(2g); n − p).
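A sketch of the Bonferroni joint intervals, reusing `model_1` from the previous sketch:

```python
# Sketch: Bonferroni joint CIs b_k +/- B*s{b_k}, B = t(1 - alpha/(2g); n - p).
from scipy import stats

g, alpha = 3, 0.05                                           # 3 coefficients
B = stats.t.ppf(1 - alpha / (2 * g), int(model_1.df_resid))  # df_resid = 36

for name, b, se in zip(model_1.params.index, model_1.params, model_1.bse):
    print(f"{name}: ({b - B * se:.6f}, {b + B * se:.6f})")
```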


Now b0 = 0.66796, b1 = 0.11439, b2 = 0.00006297, with s{b0} = sqrt(0.0581118) = 0.241063, s{b1} = sqrt(0.0012516) = 0.03537, s{b2} = sqrt(1.0925E-9) = 3.30454E-5, and B = 2.47887. The joint confidence intervals are β0: (0.0704, 1.2655), β1: (0.026712, 0.202068), β2: (−1.89E-5, 1.45E-4). We are 95% confident that β0, β1, and β2 lie in these three intervals simultaneously.

C.I., C.B., and P.I. at an xh of interest

xhᵀ = (1, 6.7, 910)

The value of h(new,new) is smaller than the largest hii, so there is no hidden extrapolation and we can continue with these xh values.

Confidence interval at xh: (1.4398893, 1.5434771). We are 95% confident that the mean log10(number of followers) when years since they joined is 6.7 and the number of photos uploaded is 910 lies in this interval, i.e., roughly 27.5 to 35.0 million followers after back-transforming.

Prediction interval at xh: (1.176267, 1.8070995). We are 95% confident that a person with 6.7 years since joining and 910 photos uploaded will have log10(number of followers) in this interval, i.e., roughly 15.0 to 64.1 million followers.

Confidence band at xh: (1.4167958, 1.5665706). We are 95% confident that the band contains the entire regression surface over all combinations of values of the X variables.
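A sketch of how these intervals could be computed and back-transformed, again with `model_1`:

```python
# Sketch: CI and PI at x_h = (1, 6.7, 910) on the log10 scale.
import pandas as pd

xh = pd.DataFrame({"years": [6.7], "photos": [910]})
frame = model_1.get_prediction(xh).summary_frame(alpha=0.05)
print(frame)  # mean_ci_* columns = CI for the mean, obs_ci_* = PI

# Back-transform the CI from the log10 scale to millions of followers
print("CI (millions):", 10 ** frame[["mean_ci_lower", "mean_ci_upper"]].values)
```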



Final discussion

Here we performed a multiple linear regression analysis to see which predictors are helpful in predicting the number of followers. We started with a preliminary model with four predictors (number of tweets, years since they joined, number of photos uploaded, following back) and checked the model assumptions. The preliminary model had non-constant variance, which we removed with a log transformation, and we also checked the multicollinearity between the predictors. Once the transformed model satisfied the assumptions, we checked whether any interaction terms should be added: among the candidates, the years since they joined × following back (X2X4) interaction looked the most helpful for predicting the number of followers, so we added it in standardized form. We then used model search procedures to find the best model. Our best model predicts log10(number of followers) from years since they joined and the number of photos and videos posted. This model has constant variance, acceptable normality, and no serious multicollinearity problem.