
Predicting Diamond Price using Linear Model
Sarajit Poddar
26 July 2015

Contents

1 Executive Summary
1.1 Objective
1.2 About the data

2 Exploratory Analysis
2.1 Loading relevant libraries
2.2 Subsetting the dataset
2.3 Plotting the characteristics of dataset

3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price
3.2 Exploring the predictors using box plot
3.3 Generating the Model
3.4 Analysing the variance between multiple models
3.5 Analysing the Residuals
3.6 Predicting using the fitted model
3.7 Plotting the predicted data with actual data

4 Final conclusion

1 Executive Summary

1.1 Objective

Using the diamonds dataset from the ggplot2 library, determine the predictors of diamond prices. Use a linear model to predict the prices, compare the predicted prices with the actual prices, and observe the residuals. Check the accuracy of the fit vis-a-vis other predictive models that use machine learning algorithms. The machine learning algorithms will be explored in subsequent articles.

1.2 About the data

1.2.1 Description

Prices of 50,000 round cut diamonds

Description: A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:


1.2.2 Details

price. price in US dollars ($326-$18,823)

carat. weight of the diamond (0.2-5.01)

cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color. diamond colour, from J (worst) to D (best)

clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x. length in mm (0-10.74)

y. width in mm (0-58.9)

z. depth in mm (0-31.8)

depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)

table. width of top of diamond relative to widest point (43-95)
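The depth formula in the list above can be checked with a few lines of R. This is only a minimal sketch: `depth_pct` is a hypothetical helper name, the measurements are made-up values rather than rows from the dataset, and the factor of 100 is an assumption implied by the stated range of 43 to 79.

```r
# Hypothetical helper: total depth percentage from the three dimensions (mm).
# The factor 100 is assumed, since the documented range (43-79) reads as percent.
depth_pct <- function(x, y, z) {
  100 * 2 * z / (x + y)   # equivalent to 100 * z / mean(c(x, y))
}

# Made-up measurements for illustration (not taken from the dataset)
example_depth <- depth_pct(x = 4.00, y = 4.00, z = 2.48)
print(example_depth)      # roughly 62
```

Since the listed ranges for x and y start at 0, rows with x + y = 0 would give NaN or Inf under this formula and may need to be filtered before use.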

2 Exploratory Analysis

2.1 Loading relevant libraries

# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)

2.2 Subsetting the dataset

The dataset is subset to a smaller size because the full dataset is large.

# Load the diamonds dataset
data(diamonds)

# Convert continuous variables to factors
# Cut by interval of 1000
diamonds$price2

## Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
## Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
## 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
## Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                    J: 2808   (Other): 2531
##     depth           table           price             x
## Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
## 1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
## Median :61.80   Median :57.00   Median : 2401   Median : 5.700
## Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
## 3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
## Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##
##       y                z               price2           carat2
## Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
## 1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
## Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
## Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
## 3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
## Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
##
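The code that creates `price2` and `carat2` is cut off in the text above. The summary values are consistent with binning price into intervals of 1,000 and carat into intervals of 0.1, so the sketch below shows one way such bins could be produced. It is an assumption rather than the original code, and it uses a few values reported in this document (price quartiles, the carat minimum and carat quartiles) as stand-in inputs.

```r
# Stand-in inputs: values reported in this document, not the full dataset
price <- c(326, 950, 2401, 5324, 18823)
carat <- c(0.20, 0.70, 1.04, 5.01)

# Bin price into right-closed intervals of width 1000:
# (0,1000] -> 1, (1000,2000] -> 2, and so on
price2 <- cut(price, breaks = seq(0, 19000, by = 1000), labels = FALSE)

# Bin carat into intervals of width 0.1. Rounding to whole hundredths first
# avoids floating-point surprises such as 0.7 * 10 landing slightly above 7.
carat2 <- ceiling(round(carat * 100) / 10)

print(price2)  # 1 1 3 6 19
print(carat2)  # 2 7 11 51
```

These bin numbers match the minimum, quartile and maximum values shown for `price2` and `carat2` in the summary, which is why this binning is a plausible reading of the truncated code.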

# Structure of the diamond dataset
str(diamonds)

## 'data.frame': 53940 obs. of 12 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"

2.3 Plotting the characteristics of dataset

2.3.1 Plotting using base graphics

#--------------------------------
# Plotting with Base graphics
#--------------------------------
x

[Figure: Frequency Distribution of Diamond Price. x-axis: Diamond Price; y-axis: Frequency; legend: Mean, Density Curve, Normal Curve.]
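The base-graphics code for this histogram is cut off in the text above. As a rough sketch of how a histogram with a mean line, a density curve and a normal curve could be drawn, the block below uses a random sample of 5,000 rows. The sampling step, the seed, the number of bins and the colours are all assumptions, and the original sample may have been restricted in other ways.

```r
# Assumes the ggplot2 package is installed (it ships the diamonds dataset)
data(diamonds, package = "ggplot2")

# Hypothetical sampling step: 5000 random rows, seed chosen for reproducibility
set.seed(123)
price <- diamonds$price[sample(nrow(diamonds), 5000)]

# Histogram of counts
h <- hist(price, breaks = 40, col = "grey80",
          main = "Frequency Distribution of Diamond Price",
          xlab = "Diamond Price", ylab = "Frequency")

# Factor that converts a density to the histogram's count scale
bin_width <- diff(h$breaks)[1]
to_counts <- length(price) * bin_width

# Mean line, kernel density curve, and normal curve with the same mean and sd
abline(v = mean(price), col = "red", lwd = 2)
d <- density(price)
lines(d$x, d$y * to_counts, col = "blue", lwd = 2)
curve(dnorm(x, mean = mean(price), sd = sd(price)) * to_counts,
      add = TRUE, col = "darkgreen", lwd = 2)
legend("topright", legend = c("Mean", "Density Curve", "Normal Curve"),
       col = c("red", "blue", "darkgreen"), lwd = 2)
```

Multiplying the densities by the sample size times the bin width puts both curves on the same scale as the frequency axis.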

2.3.2 Plotting using ggplot

g

[Figure: Frequency Distribution of Diamond Price. x-axis: Diamond price; y-axis: Frequency.]

2.3.3 Diamond price distribution with regards to Cut

g

[Figure: Prices of Sampled Diamonds. x-axis: Price of Diamonds; y-axis: Number of Diamonds; legend (cut): Fair, Good, Very Good, Premium, Ideal.]

2.3.4 Regression line showing the impact of Carat on the price (Using lm)

g

[Figure: price against carat with linear (lm) regression lines. x-axis: carat; y-axis: price; legend (clarity): I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF.]
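The plotting code for this figure is truncated to `g` in the text above. A sketch of how a comparable plot could be built with `geom_smooth(method = "lm")` follows. The sampling step, the seed and the name `diamonds_sample` are assumptions, not the original code.

```r
library(ggplot2)

# Hypothetical sampling step; the original code for the sample is not shown
set.seed(123)
diamonds_sample <- diamonds[sample(nrow(diamonds), 5000), ]

# Scatter plot of price against carat with one straight-line fit per clarity grade
g <- ggplot(diamonds_sample, aes(x = carat, y = price, colour = clarity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme(legend.position = "bottom")
print(g)
```

Swapping `method = "lm"` for `method = "loess"` gives a locally weighted fit instead of a straight line.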

2.3.5 Regression line showing the impact of Carat on the price (Using Loess)

g

[Figure: price against carat with a loess regression line. x-axis: carat; y-axis: price.]


2.3.6 Regression line faceted by Colour and Cut

g

[Figure: price against carat with regression lines, faceted by cut (columns: Fair, Good, Very Good, Premium, Ideal) and colour (rows: D, E, F, G, H, I, J). x-axis: carat; y-axis: price.]


2.3.7 Correlation plot between all variables

library(corrplot)
# Convert Diamonds dataset all fields to numeric
diamonds.num

[Figure: correlation plot (corrplot) of the dataset variables, including cut, color, clarity, depth, table, x, y, z, price2 and carat2.]

From the correlation matrix, the variables most highly correlated with price are carat, x, y and z. The other variables show little correlation with price. However, x, y and z are also highly correlated with carat. This can also mean that taking Carat 2 as the
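The correlations discussed here can be inspected directly with `cor()`. The block below is a sketch on the full diamonds dataset, restricted to the numeric columns named in the paragraph. It does not reproduce the original `corrplot` call, whose code is cut off.

```r
# Assumes the ggplot2 package is installed (it ships the diamonds dataset)
data(diamonds, package = "ggplot2")

# Pearson correlations among price, carat and the three dimensions
vars <- c("price", "carat", "x", "y", "z")
cor_mat <- cor(as.data.frame(diamonds[, vars]))
print(round(cor_mat, 2))

# Correlation of each dimension with carat
print(round(cor_mat["carat", c("x", "y", "z")], 2))
```

When predictors are this strongly correlated with each other, their individual coefficients in a linear model tend to be unstable, which is one reason to consider keeping only a subset of them.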

2.3.8 Exploratory plot

# Loading required libraries
library(ggplot2)
library(GGally)
library(scales)

# Sampling the data for the plot generation
diasamp

reduced.model
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
## carat        5946.169    114.468  51.946  < 2e-16 ***
## cut.L         289.409     18.773  15.416  < 2e-16 ***
## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
## cut.C         122.496     13.433   9.119  < 2e-16 ***
## cut^4          27.127     11.409   2.378   0.0175 *
## color.L      -889.472     16.658 -53.397  < 2e-16 ***
## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
## color.C       -54.810     14.049  -3.901 9.69e-05 ***
## color^4        31.203     13.128   2.377   0.0175 *
## color^5        54.780     12.283   4.460 8.38e-06 ***
## color^6        43.795     11.144   3.930 8.62e-05 ***
## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
## clarity.C     451.869     20.706  21.823  < 2e-16 ***
## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
## clarity^5      77.947     14.643   5.323 1.06e-07 ***
## clarity^6     -17.521     13.226  -1.325   0.1853
## clarity^7      18.544     11.973   1.549   0.1215
## depth         -10.051      4.490  -2.239   0.0252 *
## table          -4.957      2.624  -1.889   0.0589 .
## x              91.671     49.098   1.867   0.0619 .
## z              97.390     44.027   2.212   0.0270 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.1 on 4977 degrees of freedom
## Multiple R-squared: 0.927, Adjusted R-squared: 0.9267
## F-statistic: 2873 on 22 and 4977 DF, p-value: < 2.2e-16

We observe that the variables which have a high impact on the diamond price are cut, color, clarity, table, y, z and carat.
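The coefficient names in the summary output, such as `cut.L`, `cut.Q`, `cut.C` and `cut^4`, arise because cut, color and clarity are ordered factors. R's default coding for ordered factors is orthogonal polynomial contrasts (linear, quadratic, cubic and higher-order terms). The sketch below shows the contrast matrix for a five-level ordered factor; `cut_example` is an illustrative name, not an object from the original analysis.

```r
# An ordered factor with the five cut grades, as in the diamonds dataset
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_example <- factor(cut_levels, levels = cut_levels, ordered = TRUE)

# R's default contrasts for ordered factors are orthogonal polynomials
contrast_matrix <- contrasts(cut_example)
print(round(contrast_matrix, 3))

# Column names match the suffixes seen in the lm() summary
print(colnames(contrast_matrix))
```

Each coefficient therefore describes a trend across the ordered grades rather than the effect of a single grade, so a non-significant higher-order term (for example `clarity^6`) does not by itself mean that the factor is unimportant.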

3.2 Exploring the predictors using box plot

#-------------------------------
# Exploring the predictors using box plot
#-------------------------------
# Exploring association of Cut with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = cut)) + xlab("Carat") +
  theme(legend.position="bottom")

[Figure: box plots of price by carat bin (carat2, 3 to 16), filled by cut. x-axis: Carat; y-axis: price; legend (cut): Fair, Good, Very Good, Premium, Ideal.]

# Exploring association of Clarity with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = clarity)) + xlab("Carat") +
  theme(legend.position="bottom")

[Figure: box plots of price by carat bin (carat2, 3 to 16), filled by clarity. x-axis: Carat; y-axis: price.]


# Exploring association of Color with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = color)) + xlab("Carat") +
  theme(legend.position="bottom")

[Figure: box plots of price by carat bin (carat2, 3 to 16), filled by color. x-axis: Carat; y-axis: price; legend (color): D, E, F, G, H, I, J.]

3.3 Generating the Model

# The Starting and Suggested Model
simple.model

## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
## clarity.C     454.915     20.721  21.954  < 2e-16 ***
## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
## clarity^5      79.403     14.658   5.417 6.34e-08 ***
## clarity^6     -18.140     13.241  -1.370  0.17075
## clarity^7      19.361     11.993   1.614  0.10653
## color.L      -890.439     16.679 -53.387  < 2e-16 ***
## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
## color.C       -54.116     14.056  -3.850  0.00012 ***
## color^4        32.902     13.141   2.504  0.01232 *
## color^5        54.910     12.298   4.465 8.18e-06 ***
## color^6        44.704     11.158   4.007 6.25e-05 ***
## table          -1.865      2.462  -0.757  0.44892
## y              30.742     12.057   2.550  0.01081 *
## z              64.395     38.046   1.693  0.09060 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.6 on 4978 degrees of freedom
## Multiple R-squared: 0.9268, Adjusted R-squared: 0.9265
## F-statistic: 3001 on 21 and 4978 DF, p-value: < 2.2e-16
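The model-fitting code is cut off after `simple.model`. A minimal sketch of how the two models might be fitted is given here, assuming a random sample of 5,000 rows (consistent with the reported residual degrees of freedom) and the formula printed in the analysis-of-variance output. The seed and the sampling step are assumptions, so the coefficients will not reproduce the printed output exactly.

```r
# Assumes the ggplot2 package is installed (it ships the diamonds dataset)
data(diamonds, package = "ggplot2")

# Hypothetical sampling step: 5000 rows; the original subsetting code is not shown
set.seed(123)
data.sample <- as.data.frame(diamonds[sample(nrow(diamonds), 5000), ])

# Starting model: carat only
simple.model <- lm(price ~ carat, data = data.sample)

# Fuller model, using the formula printed in the analysis-of-variance table
fitted.model <- lm(price ~ carat + cut + clarity + color + table + y + z,
                   data = data.sample)

# Compare the share of variance explained by each model
print(summary(simple.model)$r.squared)
print(summary(fitted.model)$r.squared)
```

Because the simple model is nested in the fuller one, the fuller model's R-squared can never be lower. Whether the improvement is statistically meaningful is what the analysis of variance tests.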

3.4 Analysing the variance between multiple models

# Conduct Analysis of Variance between the simple model and the best fitted model
anova(simple.model, fitted.model)

## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + clarity + color + table + y + z
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1   4998 1399334980
## 2   4978  524428896 20 874906084 415.24 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
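The F statistic in this table can be recomputed by hand from the two residual sums of squares, which makes it clearer what the comparison measures. The sketch below uses only the numbers printed in the table; the variable names are illustrative.

```r
# Values copied from the analysis-of-variance table
rss_simple <- 1399334980   # residual sum of squares, price ~ carat
rss_full   <- 524428896    # residual sum of squares, fuller model
df_simple  <- 4998
df_full    <- 4978

# F = (drop in RSS per extra parameter) / (residual variance of the fuller model)
extra_params <- df_simple - df_full                      # 20
f_stat <- ((rss_simple - rss_full) / extra_params) / (rss_full / df_full)
print(round(f_stat, 2))                                  # 415.24

# Residual standard error of the fuller model
rse_full <- sqrt(rss_full / df_full)
print(round(rse_full, 1))                                # 324.6

# p-value from the F distribution with (20, 4978) degrees of freedom
p_value <- pf(f_stat, extra_params, df_full, lower.tail = FALSE)
```

The recomputed residual standard error agrees with the 324.6 reported in the model summary, which confirms that Model 2 in the table is the fitted model summarised earlier.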

3.5 Analysing the Residuals

3.5.1 Checking unidentified patterns in the Residuals

The graph shows that the residuals (the differences between the actual and predicted prices) spread more widely as the price of the diamond increases. It is possible that a factor which raises prices in the higher price range is not captured in the model. Hence the variation in price cannot be adequately explained by the model using the available predictors alone.

x

ylab("Residual") +
geom_smooth(method="loess", colour="red", lwd=1)

[Figure: residuals against fitted values with a loess smoother. x-axis: Fitted value; y-axis: Residual.]

3.5.2 Density plot of residuals to check Normal Distribution

The graph shows that the residuals follow an approximately normal distribution.

x

model.rmse
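The line that computes `model.rmse` is cut off in the text above. Assuming it refers to the root mean squared error of the fitted model, the standard definition can be sketched as follows. `rmse` is a hypothetical helper, and the prices are made-up values for illustration only.

```r
# Hypothetical helper: root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up prices for illustration (not taken from the dataset)
actual    <- c(1000, 2000, 3000)
predicted <- c(1100, 1900, 3000)
example_rmse <- rmse(actual, predicted)
print(example_rmse)

# For a fitted lm object `fit`, the same quantity would be
# sqrt(mean(residuals(fit)^2))
```

RMSE is expressed in the same units as price, so it can be read as a typical size of the prediction error in US dollars.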

par(mfrow=c(2, 2))
plot(fitted.model)

[Figure: the four lm diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours).]

The points in the Q-Q plot are more or less on the line, indicating that the residuals are approximately normally distributed.

4 Final conclusion

We have seen that a good predictive model can be developed using a linear model, provided that the variables (predictors) which significantly impact the outcome (price in this case) can be accurately identified.

We also observe that the prediction may work only within some boundary conditions. If the boundary conditions are accurately identified, then different models can be built for predicting data outside the range covered by the fitted model.
