is4240 - ay1314s2 - assignment - dm1

IS4240 Business Intelligence Systems

AY 2013/14 Semester 2 Assignment Data Mining 1

Regression (10 marks)

This question is based on the Bike Sharing dataset taken from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset (data originally from http://capitalbikeshare.com/system-data). The original source of the dataset is attributed to:

Fanaee-T, H., Gama, J., Event Labeling Combining Ensemble Detectors and Background Knowledge, Progress in Artificial Intelligence, 2013, pp. 1-15, Springer Berlin Heidelberg.

The dataset concerns bike sharing systems. Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today there is great interest in these systems because of their important role in traffic, environmental and health issues. Apart from their interesting real-world applications, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, these systems explicitly record the duration of travel and the departure and arrival positions. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.

The dataset comes in two versions. In the first version, the bike rental records are organized by day. In the second version, they are organized by hour of day. In general, you may think of one record in the first version, for a particular day, as being divided into 24 records in the second version, i.e., one for each hour of the day. However, if no bike was rented out during a particular hour, that hour is excluded from the dataset. In other words, the first version of the dataset contains 731 observations, but the second version contains fewer than 731 x 24 = 17,544 observations; in fact, it has only 17,379 observations. The dataset is deemed to have no missing data. This assignment is based on the second version of the dataset.
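The record counts above can be checked with a few lines of arithmetic. The following is only an illustrative Python sketch (the assignment itself is to be done in SAS); the constant and function names are invented for this example, and the only inputs are the figures quoted in the text.

```python
# Figures quoted in the dataset description above.
DAYS = 731             # observations in the first (daily) version
HOURS_PER_DAY = 24
HOURLY_ROWS = 17_379   # observations in the second (hourly) version


def missing_hours(days: int, hourly_rows: int) -> int:
    """Number of day-hour slots that have no record in the hourly version."""
    return days * HOURS_PER_DAY - hourly_rows


full_grid = DAYS * HOURS_PER_DAY          # every hour of every day
gap = missing_hours(DAYS, HOURLY_ROWS)    # slots absent from the hourly data
print(full_grid, gap)                     # prints: 17544 165
```

According to the description, these absent slots are hours in which no bike was rented, which is why they are not treated as missing data.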

The second version of the dataset consists of 17 variables:

1. instant: record index
2. dteday: date
3. season: season (1: spring, 2: summer, 3: fall, 4: winter)
4. yr: year (0: 2011, 1: 2012)
5. mnth: month (1 to 12)
6. hr: hour (0 to 23)
7. holiday: whether the day is a holiday or not
8. weekday: day of the week
9. workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
10. weathersit: weather situation, coded as
    1: Clear, Few clouds, Partly cloudy
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
11. temp: normalized temperature in Celsius; the values are divided by 41 (max)
12. atemp: normalized feeling temperature in Celsius; the values are divided by 50 (max)
13. hum: normalized humidity; the values are divided by 100 (max)
14. windspeed: normalized wind speed; the values are divided by 67 (max)
15. casual: count of casual users
16. registered: count of registered users
17. cnt: count of total rental bikes, including both casual and registered
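To make the normalization of variables 11 to 14 concrete, the sketch below rescales the normalized values back to their original units. It takes the description at face value, i.e. it assumes normalized = raw / max with the stated maxima. It is an illustrative Python sketch rather than part of the SAS workflow; the helper name `denormalize` and the example record are hypothetical.

```python
# Divisors quoted in the variable descriptions (the stated "max" values).
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(record: dict) -> dict:
    """Return a copy of `record` with the normalized fields rescaled.

    Assumes normalized = raw / max, so raw = normalized * max.
    Fields not listed in SCALE are copied unchanged.
    """
    out = dict(record)
    for name, divisor in SCALE.items():
        if name in out:
            out[name] = out[name] * divisor
    return out


# Hypothetical record invented for illustration (not a row from the dataset).
example = {"hr": 0, "temp": 0.5, "atemp": 0.5, "hum": 0.75, "windspeed": 0.0}
restored = denormalize(example)
print(restored)
```

Under this assumption, a regression coefficient on temp describes the change in cnt per 41 degrees Celsius, which is worth keeping in mind when interpreting coefficients.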

For the purpose of this assignment, the target variable is cnt (i.e., variable 17). There are 12 explanatory variables, from season to windspeed (i.e., variables 3 to 14). Perform the following tasks and answer the respective questions:

1) Using SAS, perform a bivariate correlation analysis on the appropriate explanatory variables. List the variables that you have included in the analysis and explain any potential problem(s) that you detect. (1 mark)

2) Using SAS, build a multiple linear regression model to predict the value of cnt using all 12 explanatory variables. Request SAS to calculate the Variance Inflation Factor (VIF). Do NOT recode any explanatory variables. Report the Model Sum of Squares (MSS), Error Sum of Squares (ESS), F-Value, p-Value of the F-Value, R2 and Adjusted R2. Explain your observation and identify one major problem with this model. (2 marks)

3) Correct the problem identified in (2) by removing one explanatory variable from the model in (2). Which explanatory variable did you remove and why? Fit the new model and report the MSS, ESS, F-Value, p-Value of the F-Value, R2 and Adjusted R2. Comment on the validity of the model and attempt to interpret those regression coefficients that are statistically significant. Is there any problem with this model? (1 mark)

4) Fit a new model with stepwise model selection using the 11 explanatory variables from (3). Report the MSS, ESS, F-Value, p-Value of the F-Value, R2 and Adjusted R2. Is the stepwise model better than the model in (3)? (1 mark)

5) The preceding multiple linear regression models appear to suffer from a common problem. To resolve this problem, examine each of the 11 explanatory variables from (3) carefully and attempt to recode them as appropriate. Explain the measures that you have taken.

Fit a new model using the recoded explanatory variables. Did you notice any major problem when fitting this model? State the problem that you have encountered and explain how you have attempted to resolve it. After resolving the problem, fit a final model and report the MSS, ESS, F-Value, p-Value of the F-Value, R2 and Adjusted R2. Comment on the validity of the final model and attempt to interpret those regression coefficients that are statistically significant. Is the final model better than the preceding models? (5 marks)
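The statistics requested in the tasks above are linked by standard identities, which can be used as a sanity check on reported output. The following is a minimal Python sketch, not a substitute for the SAS analysis. It assumes a linear model with an intercept, n observations and p explanatory variables. The function names and all numbers are hypothetical, the VIF helper covers only the special case of exactly two predictors, and the p-value of the F-Value is omitted because it needs an F-distribution routine.

```python
import math


def fit_summary(mss: float, ess: float, n: int, p: int) -> dict:
    """R2, Adjusted R2 and F-Value from the Model and Error Sums of Squares.

    Assumes a model with an intercept, n observations, p explanatory variables.
    """
    df_model = p
    df_error = n - p - 1
    r2 = mss / (mss + ess)                       # share of total SS explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_error   # penalized for model size
    f_value = (mss / df_model) / (ess / df_error)
    return {"r2": r2, "adj_r2": adj_r2, "f_value": f_value}


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1: list, x2: list) -> float:
    """VIF = 1 / (1 - R_j^2); with only two predictors, R_j^2 equals r^2."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r * r)


# Hypothetical numbers, invented for illustration only.
summary = fit_summary(mss=60.0, ess=40.0, n=21, p=4)
vif = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(summary, vif)
```

With more than two predictors, R_j^2 comes from regressing predictor j on all the other predictors, so the pairwise shortcut no longer applies. A VIF that grows large as R_j^2 approaches 1 signals that the predictor is nearly a linear combination of the others.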
