Predicting Diamond Price Using a Linear Model
Sarajit Poddar
26 July 2015

Description: This article is part of a series exploring ways to develop an algorithm that predicts the price of a diamond from its characteristics. In subsequent articles, machine learning algorithms will be explored and the best algorithm for determining the diamond price will be identified. The diamonds dataset is taken from the ggplot2 library.
Contents
1 Executive Summary
1.1 Objective
1.2 About the data
2 Exploratory Analysis
2.1 Loading relevant libraries
2.2 Subsetting the dataset
2.3 Plotting the characteristics of dataset
3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price
3.2 Exploring the predictors using box plot
3.3 Generating the Model
3.4 Analysing the variance between multiple models
3.5 Analysing the Residuals
3.6 Predicting using the fitted model
3.7 Plotting the predicted data with actual data
4 Final conclusion
1 Executive Summary
1.1 Objective
The objective is to use the diamonds dataset from the ggplot2 library to determine the predictors of diamond prices, to predict the prices with a linear model, and to compare the predicted prices with the actual prices and observe the residuals. The accuracy of the fit will later be checked against other predictive models built with machine learning algorithms, which will be explored in subsequent articles.
1.2 About the data
1.2.1 Description
Prices of 50,000 round cut diamonds
Description: A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
1.2.2 Details
price. price in US dollars ($326-$18,823)
carat. weight of the diamond (0.2-5.01)
cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color. diamond colour, from J (worst) to D (best)
clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x. length in mm (0-10.74)
y. width in mm (0-58.9)
z. depth in mm (0-31.8)
depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table. width of top of diamond relative to widest point (43-95)
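The depth definition above yields a fraction, while the recorded values (43-79) are that fraction expressed as a percentage. A minimal sketch of the arithmetic in R follows; the stone dimensions are hypothetical values chosen for illustration and are not taken from the report.

```r
# Hypothetical stone dimensions in mm (illustrative values, not from the report)
x <- 3.95  # length
y <- 3.98  # width
z <- 2.43  # depth

# Total depth percentage: z divided by the mean of x and y, times 100
depth_pct <- 100 * z / mean(c(x, y))

# Equivalent form given in the variable description
depth_pct_alt <- 100 * 2 * z / (x + y)

print(round(c(depth_pct, depth_pct_alt), 2))
```

Note that in R the mean of two numbers has to be written `mean(c(x, y))`; a literal `mean(x, y)` would pass `y` as the `trim` argument.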
2 Exploratory Analysis
2.1 Loading relevant libraries
# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)
2.2 Subsetting the dataset
The dataset is subset to a smaller size because the full dataset is large.
# Load the diamonds dataset
data(diamonds)
# Convert continuous variables to factors
# Cut by interval of 1000
diamonds$price2
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##
##        y                z              price2           carat2
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
##  Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
##  Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
##  Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
##
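The binning code above is cut off in this copy, so the exact call is unknown. As a rough sketch, assuming `cut()` with breaks every 1000 dollars followed by conversion to integer codes (an assumption, not the author's confirmed code), the price quartiles from the summary can be mapped onto bin numbers as follows.

```r
# Price summary statistics copied from the summary() output above
price <- c(min = 326, q1 = 950, median = 2401, q3 = 5324, max = 18823)

# Assumed binning: right-closed intervals of width 1000,
# i.e. (0,1000], (1000,2000], ..., (18000,19000]
breaks <- seq(0, 19000, by = 1000)
price2 <- as.numeric(cut(price, breaks = breaks))
names(price2) <- names(price)

print(price2)
```

The resulting codes (1, 1, 3, 6, 19) agree with the Min, 1st Qu., Median, 3rd Qu. and Max reported for price2, which is consistent with this binning but does not prove that it is the one the report used.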
# Structure of the diamond dataset
str(diamonds)
## 'data.frame': 53940 obs. of 12 variables:
##  $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut : Ord.factor w/ 5 levels "Fair"
2.3 Plotting the characteristics of dataset
2.3.1 Plotting using base graphics
#--------------------------------
# Plotting with Base graphics
#--------------------------------
x
[Figure: Frequency Distribution of Diamond Price. x-axis: Diamond Price; y-axis: Frequency; overlays: Mean, Density Curve, Normal Curve.]
2.3.2 Plotting using ggplot
g
[Figure: Frequency Distribution of Diamond Price. x-axis: Diamond price; y-axis: Frequency.]
2.3.3 Diamond price distribution with regards to Cut
g
[Figure: Prices of Sampled Diamonds. x-axis: Price of Diamonds; y-axis: Number of Diamonds; legend: cut (Fair, Good, Very Good, Premium, Ideal).]
2.3.4 Regression line showing the impact of Carat on the price (Using lm)
g
[Figure: price against carat with lm regression lines. x-axis: carat; y-axis: price; legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]
2.3.5 Regression line showing the impact of Carat on the price (Using Loess)
g
[Figure: price against carat with loess regression lines. x-axis: carat; y-axis: price; legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]
2.3.6 Regression line faceted by Colour and Cut
g
[Figure: price against carat with regression lines, faceted by cut (columns: Fair, Good, Very Good, Premium, Ideal) and colour (rows: D to J). x-axis: carat; y-axis: price; legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]
2.3.7 Correlation plot between all variables
library(corrplot)
# Convert Diamonds dataset all fields to numeric
diamonds.num
[Figure: Correlation plot of the numeric-converted variables. Recoverable labels: cut, color, clarity, depth, table, x, y, z, price2, carat2. The individual coefficients are not recoverable from this copy.]
From the correlation matrix, the variables that are highly correlated with price are carat, x, y and z. The other variables show little correlation with price. However, x, y and z are also highly correlated with carat. This can also mean that taking carat2 as the
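The correlations discussed here can be inspected directly. The sketch below is an illustration rather than the report's own code: it assumes the ggplot2 package is installed, uses the full diamonds dataset instead of the report's subset, and restricts the matrix to price and the size-related variables. The names `size.vars` and `cor.size` are introduced only for this example.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlations between price and the size-related variables,
# computed on the full dataset (the report works with a subset)
size.vars <- c("price", "carat", "x", "y", "z")
cor.size <- cor(as.data.frame(diamonds)[, size.vars])

print(round(cor.size, 2))
```

Reading along the carat row shows how closely x, y and z track carat, which is the collinearity the paragraph above refers to.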
2.3.8 Exploratory plot
# Loading required libraries
library(ggplot2)
library(GGally)
library(scales)
# Sampling the data for the plot generation
diasamp
3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price
reduced.model
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
## carat        5946.169    114.468  51.946  < 2e-16 ***
## cut.L         289.409     18.773  15.416  < 2e-16 ***
## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
## cut.C         122.496     13.433   9.119  < 2e-16 ***
## cut^4          27.127     11.409   2.378   0.0175 *
## color.L      -889.472     16.658 -53.397  < 2e-16 ***
## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
## color.C       -54.810     14.049  -3.901 9.69e-05 ***
## color^4        31.203     13.128   2.377   0.0175 *
## color^5        54.780     12.283   4.460 8.38e-06 ***
## color^6        43.795     11.144   3.930 8.62e-05 ***
## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
## clarity.C     451.869     20.706  21.823  < 2e-16 ***
## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
## clarity^5      77.947     14.643   5.323 1.06e-07 ***
## clarity^6     -17.521     13.226  -1.325   0.1853
## clarity^7      18.544     11.973   1.549   0.1215
## depth         -10.051      4.490  -2.239   0.0252 *
## table          -4.957      2.624  -1.889   0.0589 .
## x              91.671     49.098   1.867   0.0619 .
## z              97.390     44.027   2.212   0.0270 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.1 on 4977 degrees of freedom
## Multiple R-squared:  0.927, Adjusted R-squared:  0.9267
## F-statistic:  2873 on 22 and 4977 DF,  p-value: < 2.2e-16
We observe that the variables which have a high impact on the diamond price are cut, color, clarity, table, y, z and carat.
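To make the regression table easier to read, the following sketch recomputes two of its entries from the printed figures: the t value for carat (estimate divided by standard error) and the adjusted R-squared. All numbers are copied from the output above; the variable names are illustrative only.

```r
# Values copied from the reduced.model summary above
estimate  <- 5946.169   # carat coefficient
std.error <- 114.468    # its standard error
r.squared <- 0.927      # multiple R-squared, rounded as printed
n         <- 5000       # observations: 4977 residual df + 23 coefficients
p         <- 22         # model terms excluding the intercept

# t value = estimate / standard error
t.value <- estimate / std.error

# Two-sided p-value from the t distribution with the residual df
p.value <- 2 * pt(-abs(t.value), df = n - p - 1)

# Adjusted R-squared penalises the number of model terms
adj.r.squared <- 1 - (1 - r.squared) * (n - 1) / (n - p - 1)

print(c(t.value = t.value, adj.r.squared = adj.r.squared))
```

Because the inputs are rounded as printed, the results match the table only to the displayed precision.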
3.2 Exploring the predictors using box plot
#-------------------------------#
# Exploring the predictors using box plot
#-------------------------------
# Exploring association of Cut with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = cut)) + xlab("Carat") +
  theme(legend.position="bottom")
[Figure: Box plots of price by carat bin (3 to 16), filled by cut (Fair, Good, Very Good, Premium, Ideal). x-axis: Carat; y-axis: price.]
# Exploring association of Clarity with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = clarity)) + xlab("Carat") +
  theme(legend.position="bottom")
[Figure: Box plots of price by carat bin (3 to 16), filled by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). x-axis: Carat; y-axis: price.]
# Exploring association of Color with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = color)) + xlab("Carat") +
  theme(legend.position="bottom")
[Figure: Box plots of price by carat bin (3 to 16), filled by color (D, E, F, G, H, I, J). x-axis: Carat; y-axis: price.]
3.3 Generating the Model
# The Starting and Suggested Model
simple.model
## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
## clarity.C     454.915     20.721  21.954  < 2e-16 ***
## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
## clarity^5      79.403     14.658   5.417 6.34e-08 ***
## clarity^6     -18.140     13.241  -1.370  0.17075
## clarity^7      19.361     11.993   1.614  0.10653
## color.L      -890.439     16.679 -53.387  < 2e-16 ***
## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
## color.C       -54.116     14.056  -3.850  0.00012 ***
## color^4        32.902     13.141   2.504  0.01232 *
## color^5        54.910     12.298   4.465 8.18e-06 ***
## color^6        44.704     11.158   4.007 6.25e-05 ***
## table          -1.865      2.462  -0.757  0.44892
## y              30.742     12.057   2.550  0.01081 *
## z              64.395     38.046   1.693  0.09060 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.6 on 4978 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9265
## F-statistic:  3001 on 21 and 4978 DF,  p-value: < 2.2e-16
3.4 Analysing the variance between multiple models
# Conduct Analysis of Variance between the simple model and the best fitted model
anova(simple.model, fitted.model)
## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + clarity + color + table + y + z
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1   4998 1399334980
## 2   4978  524428896 20 874906084 415.24 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
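The F statistic in the table above compares the drop in residual sum of squares per added parameter with the residual variance of the larger model. The sketch below reproduces that arithmetic from the printed values; the variable names are illustrative, and the numbers are copied from the table.

```r
# Values copied from the ANOVA table above
rss.simple <- 1399334980  # Model 1: price ~ carat
rss.full   <- 524428896   # Model 2: carat + cut + clarity + color + table + y + z
df.simple  <- 4998        # residual degrees of freedom, Model 1
df.full    <- 4978        # residual degrees of freedom, Model 2

# F = (drop in RSS per extra parameter) / (residual variance of larger model)
df.diff <- df.simple - df.full
f.stat  <- ((rss.simple - rss.full) / df.diff) / (rss.full / df.full)

# Upper-tail probability of the F distribution
p.value <- pf(f.stat, df.diff, df.full, lower.tail = FALSE)

# Residual standard error of the larger model
rse.full <- sqrt(rss.full / df.full)

print(round(c(F = f.stat, RSE = rse.full), 2))
```

The same residual sum of squares also gives the residual standard error of about 324.6 reported in the summary of the larger model, which ties the two outputs together.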
3.5 Analysing the Residuals
3.5.1 Checking unidentified patterns in the Residuals
The graph shows that the spread of the residuals (the difference between actual and predicted price) grows as the price of the diamond increases. It is possible that a factor which raises prices in the higher price range is not captured by the model. Hence the variation in price cannot be adequately explained by the model with the available predictors.
x
ylab("Residual") +
  geom_smooth(method="loess", colour="red", lwd=1)
[Figure: Residuals against fitted values with a loess smoother. x-axis: Fitted value; y-axis: Residual.]
3.5.2 Density plot of residuals to check Normal Distribution
The graph shows that the residuals follow an approximately normal distribution.
x
yfit
model.rmse
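The `yfit` and `model.rmse` lines are cut off in this copy, so the report's own calculation cannot be recovered. A common way to compute a root-mean-square error is sketched below; the helper name `rmse` and the toy prices are hypothetical and are not from the report.

```r
# Root-mean-square error between actual and predicted values
# (hypothetical helper; the report's own model.rmse code is not recoverable)
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example with made-up prices
actual    <- c(1000, 2000, 3000, 4000)
predicted <- c(1100, 1900, 3200, 3800)
toy.rmse  <- rmse(actual, predicted)

# In-sample RMSE implied by the residual sum of squares reported in the
# ANOVA table (RSS = 524428896 over 5000 observations)
insample.rmse <- sqrt(524428896 / 5000)

print(c(toy = toy.rmse, insample = insample.rmse))
```

With a fitted `lm` object, the predicted values would typically come from `predict()` applied to the data of interest; whether the report evaluated the error on the training sample or on held-out data is not stated in the surviving text.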
par(mfrow=c(2, 2))
plot(fitted.model)
[Figure: Diagnostic plots for fitted.model in a 2 x 2 layout. Panels: Residuals vs Fitted; Normal Q-Q (Theoretical Quantiles against Standardized residuals); Scale-Location; Residuals vs Leverage (with Cook's distance contours).]
The points in the Q-Q plot lie more or less on the line, indicating that the residuals are approximately normally distributed.
4 Final conclusion
We have seen that a linear model can yield a good predictive model, provided that the variables (predictors) which significantly affect the outcome (price in this case) are accurately identified.
We also observe that the predictions may hold only within certain boundary conditions. If those boundary conditions are accurately identified, different models can be built to predict data outside the range covered by the fitted model.