Predicting Diamond Price using Linear Model


Upload: sarajit-poddar

Post on 05-Sep-2015


DESCRIPTION

Predicting Diamond Price Using Linear Model. This is part of an exploration of approaches to predicting the price of a diamond from its characteristics. In subsequent articles, machine learning algorithms will be explored and the best algorithm for determining the diamond price will be identified. The diamonds dataset is taken from the ggplot2 library.

TRANSCRIPT

  • Predicting Diamond Price using Linear Model

    Sarajit Poddar

    26 July 2015

    Contents

    1 Executive Summary
      1.1 Objective
      1.2 About the data

    2 Exploratory Analysis
      2.1 Loading relevant libraries
      2.2 Subsetting the dataset
      2.3 Plotting the characteristics of dataset

    3 Predicting the diamond price
      3.1 Determining the Significant Predictors of Diamond price
      3.2 Exploring the predictors using box plot
      3.3 Generating the Model
      3.4 Analysing the variance between multiple models
      3.5 Analysing the Residuals
      3.6 Predicting using the fitted model
      3.7 Plotting the predicted data with actual data

    4 Final conclusion

    1 Executive Summary

    1.1 Objective

    Using the diamonds dataset from the ggplot2 library, determine the predictors of diamond prices. Use a linear model to predict the prices, compare the predicted prices with the actual ones and observe the residuals. Check the accuracy of the fit against other predictive models built with machine learning algorithms. The machine learning algorithms will be explored in subsequent articles.

    1.2 About the data

    1.2.1 Description

    Prices of 50,000 round cut diamonds

    Description: A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

  • 1.2.2 Details

    price. price in US dollars ($326-$18,823)

    carat. weight of the diamond (0.2-5.01)

    cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)

    color. diamond colour, from J (worst) to D (best)

    clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

    x. length in mm (0-10.74)

    y. width in mm (0-58.9)

    z. depth in mm (0-31.8)

    depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)

    table. width of top of diamond relative to widest point (43-95)
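
The depth formula above can be checked with a small worked example. This is only a sketch: the dimensions are hypothetical values, not rows from the dataset, and the factor of 100 is an assumption made so that the result lands on the same scale as the documented 43-79 range.

```r
# Hypothetical dimensions in mm (not taken from the dataset)
x <- 4.00  # length
y <- 4.10  # width
z <- 2.50  # depth

# Total depth percentage: z relative to the mean of x and y, times 100
depth_pct <- 100 * 2 * z / (x + y)
print(round(depth_pct, 2))

# Equivalent form; note that in R the mean of two numbers must be
# written mean(c(x, y)), because mean(x, y) treats y as the trim argument
depth_alt <- 100 * z / mean(c(x, y))
```

Both expressions give the same value, which is why the description writes the formula in two ways.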

    2 Exploratory Analysis

    2.1 Loading relevant libraries

    # Load required libraries
    library(dplyr)
    library(tidyr)
    library(ggplot2)

    2.2 Subsetting the dataset

    The dataset is subset to a smaller size because the full dataset is huge.

    # Load the diamonds dataset
    data(diamonds)

    # Convert continuous variables to factors
    # Cut by interval of 1000
    diamonds$price2
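
The statement that creates `price2` is cut off in this transcript. The summary output that follows reports `price2` running from 1 to 19, which is consistent with binning price into intervals of 1000. A minimal sketch of that idea is given here; the use of `cut()`, the break points and the sample prices are assumptions, not the author's code.

```r
# Hypothetical prices spanning the range reported for the dataset
price <- c(326, 950, 2401, 5324, 18823)

# Bin into intervals of 1000: (0,1000], (1000,2000], ...
price_bins <- cut(price, breaks = seq(0, 19000, by = 1000))

# Integer bin index, comparable to the 1..19 range of price2
price2 <- as.integer(price_bins)
print(price2)
```

The `carat2` column, which runs from 2 to 51, could presumably be built the same way with intervals of 0.1.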

  • ##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
    ##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
    ##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
    ##  Max.   :5.0100                     I: 5422   VVS1   : 3655
    ##                                     J: 2808   (Other): 2531
    ##      depth           table           price             x
    ##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
    ##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
    ##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
    ##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
    ##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
    ##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
    ##
    ##        y                z              price2           carat2
    ##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
    ##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
    ##  Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
    ##  Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
    ##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
    ##  Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
    ##

    # Structure of the diamond dataset
    str(diamonds)

    ## 'data.frame': 53940 obs. of 12 variables:
    ##  $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
    ##  $ cut   : Ord.factor w/ 5 levels "Fair"

  • 2.3 Plotting the characteristics of dataset

    2.3.1 Plotting using base graphics

    #--------------------------------
    # Plotting with Base graphics
    #--------------------------------
    x

  • [Figure: Frequency Distribution of Diamond Price. x-axis: Diamond Price; y-axis: Frequency; overlays: Mean, Density Curve, Normal Curve.]
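
The base-graphics code for this figure did not survive in the transcript. A rough sketch of a histogram with a mean line, a kernel density curve and a normal curve is given here. The prices are simulated, the colours and bin count are arbitrary choices, and the histogram is drawn on the density scale (rather than counts, as in the original figure) so that the two curves share its axis.

```r
set.seed(1)
# Simulated right-skewed prices standing in for the sampled diamonds
price <- rlnorm(500, meanlog = 7.5, sdlog = 0.5)

# Histogram on the density scale so curves can be overlaid
h <- hist(price, breaks = 20, freq = FALSE,
          main = "Frequency Distribution of Diamond Price",
          xlab = "Diamond Price")

# Vertical line at the mean
abline(v = mean(price), col = "blue", lwd = 2)

# Kernel density estimate
lines(density(price), col = "red", lwd = 2)

# Normal curve with the sample mean and standard deviation
curve(dnorm(x, mean = mean(price), sd = sd(price)),
      add = TRUE, col = "darkgreen", lwd = 2)

legend("topright", legend = c("Mean", "Density Curve", "Normal Curve"),
       col = c("blue", "red", "darkgreen"), lwd = 2)
```

Comparing the red and green curves gives a quick visual impression of how far the price distribution departs from normality.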

    2.3.2 Plotting using ggplot

    g

  • [Figure: Frequency Distribution of Diamond Price. x-axis: Diamond price; y-axis: Frequency.]

    2.3.3 Diamond price distribution with regards to Cut

    g

  • [Figure: Prices of Sampled Diamonds. x-axis: Price of Diamonds; y-axis: Number of Diamonds; legend: cut (Fair, Good, Very Good, Premium, Ideal).]

    2.3.4 Regression line showing the impact of Carat on the price (Using lm)

    g

  • [Figure: price against carat with linear (lm) regression lines. x-axis: carat; y-axis: price; legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]

    2.3.5 Regression line showing the impact of Carat on the price (Using Loess)

    g

  • [Figure: price against carat with loess regression lines. x-axis: carat; y-axis: price; legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]

    2.3.6 Regression line faceted by Colour and Cut

    g

  • [Figure: price against carat with regression lines, faceted by cut (columns: Fair, Good, Very Good, Premium, Ideal) and colour (rows: D to J). x-axis: carat; y-axis: price; legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]

    2.3.7 Correlation plot between all variables

    library(corrplot)
    # Convert Diamonds dataset all fields to numeric
    diamonds.num

  • [Figure: correlation plot of the numeric-coded variables; the surviving axis labels are cut, color, clarity, depth, table, x, y, z, price2 and carat2. The coefficient labels (between 0 and 0.97) cannot be matched to variable pairs.]

    From the correlation matrix, the variables that correlate strongly with Price are Carat, X, Y and Z. The other variables do not show much correlation with Price. However, X, Y and Z are also highly correlated with Carat. This can also mean that taking Carat 2 as the
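
The step of converting every field to numeric and computing correlations can be sketched as follows. The data frame here is simulated (the column names mirror the dataset, the values do not), and plain `cor()` is used in place of the plot; factors are replaced by their integer level codes, which is one way to read "convert all fields to numeric".

```r
set.seed(42)
n <- 200
# Simulated stand-in for the diamonds sample (hypothetical values)
carat <- runif(n, 0.2, 1.5)
x <- 6.5 * carat^(1/3) + rnorm(n, sd = 0.05)   # length grows with weight
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut <- factor(sample(cut_levels, n, replace = TRUE),
              levels = cut_levels, ordered = TRUE)
price <- 4000 * carat + rnorm(n, sd = 300)
d <- data.frame(carat, x, cut, price)

# Convert every column to numeric; factors become their integer level codes
d.num <- data.frame(lapply(d, as.numeric))

# Pearson correlation matrix
cm <- cor(d.num)
print(round(cm, 2))
```

In this simulation `x` is almost perfectly correlated with `carat`, which illustrates why strongly related size variables add little independent information to a model that already contains carat.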

    2.3.8 Exploratory plot

    # Loading required libraries
    library(ggplot2)
    library(GGally)
    library(scales)

    # Sampling the data for the plot generation
    diasamp

  • 3 Predicting the diamond price

    3.1 Determining the Significant Predictors of Diamond price

    reduced.model

    ##              Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
    ## carat        5946.169    114.468  51.946  < 2e-16 ***
    ## cut.L         289.409     18.773  15.416  < 2e-16 ***
    ## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
    ## cut.C         122.496     13.433   9.119  < 2e-16 ***
    ## cut^4          27.127     11.409   2.378   0.0175 *
    ## color.L      -889.472     16.658 -53.397  < 2e-16 ***
    ## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
    ## color.C       -54.810     14.049  -3.901 9.69e-05 ***
    ## color^4        31.203     13.128   2.377   0.0175 *
    ## color^5        54.780     12.283   4.460 8.38e-06 ***
    ## color^6        43.795     11.144   3.930 8.62e-05 ***
    ## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
    ## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
    ## clarity.C     451.869     20.706  21.823  < 2e-16 ***
    ## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
    ## clarity^5      77.947     14.643   5.323 1.06e-07 ***
    ## clarity^6     -17.521     13.226  -1.325   0.1853
    ## clarity^7      18.544     11.973   1.549   0.1215
    ## depth         -10.051      4.490  -2.239   0.0252 *
    ## table          -4.957      2.624  -1.889   0.0589 .
    ## x              91.671     49.098   1.867   0.0619 .
    ## z              97.390     44.027   2.212   0.0270 *
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 324.1 on 4977 degrees of freedom
    ## Multiple R-squared: 0.927, Adjusted R-squared: 0.9267
    ## F-statistic: 2873 on 22 and 4977 DF, p-value: < 2.2e-16

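    We observe that the variables carried forward as predictors of the Diamond Price are cut, color, clarity, table, y, z and carat. In the summary above, carat, cut, color and clarity have by far the strongest effects (p < 2e-16 for their leading terms), while depth, table, x and z are only marginally significant.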
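
Coefficient names such as cut.L, cut.Q, cut.C and cut^4 in the summary arise because cut, color and clarity are ordered factors, and R by default encodes ordered factors with orthogonal polynomial contrasts (linear, quadratic, cubic and higher-degree terms) rather than one dummy per level. The short sketch below shows this on a stand-alone ordered factor; it does not use the dataset itself.

```r
# An ordered factor with the five cut levels
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut <- factor(cut_levels, levels = cut_levels, ordered = TRUE)

# Default contrasts for ordered factors are orthogonal polynomials
cp <- contrasts(cut)
print(colnames(cp))   # linear, quadratic, cubic and 4th-degree terms
print(round(cp, 3))
```

A consequence is that an individual coefficient such as cut^4 describes a trend component across the ordered levels, not the price difference for a single level.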

    3.2 Exploring the predictors using box plot

    #-------------------------------
    # Exploring the predictors using box plot
    #-------------------------------
    # Exploring association of Cut with Carat and Price
    ggplot(data.sample, aes(factor(carat2), price)) +
      geom_boxplot(aes(fill = cut)) + xlab("Carat") +
      theme(legend.position="bottom")

    [Figure: box plots of price by carat bin (3 to 16), filled by cut. x-axis: Carat; y-axis: price; legend: cut (Fair, Good, Very Good, Premium, Ideal).]

    # Exploring association of Clarity with Carat and Price
    ggplot(data.sample, aes(factor(carat2), price)) +
      geom_boxplot(aes(fill = clarity)) + xlab("Carat") +
      theme(legend.position="bottom")

    [Figure: box plots of price by carat bin (3 to 16), filled by clarity. x-axis: Carat; y-axis: price; legend: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]

    # Exploring association of Color with Carat and Price
    ggplot(data.sample, aes(factor(carat2), price)) +
      geom_boxplot(aes(fill = color)) + xlab("Carat") +
      theme(legend.position="bottom")

    [Figure: box plots of price by carat bin (3 to 16), filled by color. x-axis: Carat; y-axis: price; legend: color (D, E, F, G, H, I, J).]

  • 3.3 Generating the Model

    # The Starting and Suggested Model
    simple.model

  • ## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
    ## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
    ## clarity.C     454.915     20.721  21.954  < 2e-16 ***
    ## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
    ## clarity^5      79.403     14.658   5.417 6.34e-08 ***
    ## clarity^6     -18.140     13.241  -1.370  0.17075
    ## clarity^7      19.361     11.993   1.614  0.10653
    ## color.L      -890.439     16.679 -53.387  < 2e-16 ***
    ## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
    ## color.C       -54.116     14.056  -3.850  0.00012 ***
    ## color^4        32.902     13.141   2.504  0.01232 *
    ## color^5        54.910     12.298   4.465 8.18e-06 ***
    ## color^6        44.704     11.158   4.007 6.25e-05 ***
    ## table          -1.865      2.462  -0.757  0.44892
    ## y              30.742     12.057   2.550  0.01081 *
    ## z              64.395     38.046   1.693  0.09060 .
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 324.6 on 4978 degrees of freedom
    ## Multiple R-squared: 0.9268, Adjusted R-squared: 0.9265
    ## F-statistic: 3001 on 21 and 4978 DF, p-value: < 2.2e-16

    # Conduct Analysis of Variance between the simple model and the best fitted model
    anova(simple.model, fitted.model)

    ## Analysis of Variance Table
    ##
    ## Model 1: price ~ carat
    ## Model 2: price ~ carat + cut + clarity + color + table + y + z
    ##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
    ## 1   4998 1399334980
    ## 2   4978  524428896 20 874906084 415.24 < 2.2e-16 ***
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
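
The F statistic in the table can be reproduced by hand from the two residual sums of squares, which makes clear what the comparison measures: the drop in unexplained variation per added parameter, relative to the residual mean square of the larger model. The numbers below are copied from the output above; the variable names are only illustrative.

```r
# Values copied from the ANOVA table above
rss_simple <- 1399334980   # price ~ carat
rss_full   <- 524428896    # price ~ carat + cut + clarity + color + table + y + z
df_simple  <- 4998         # residual degrees of freedom
df_full    <- 4978

# F = (drop in RSS per extra parameter) / (residual mean square of full model)
df_diff <- df_simple - df_full
f_stat  <- ((rss_simple - rss_full) / df_diff) / (rss_full / df_full)
print(round(f_stat, 2))

# Residual standard error of the full model
rse <- sqrt(rss_full / df_full)
print(round(rse, 1))

# Upper-tail p-value of the F statistic
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)
```

The same residual sum of squares also yields the residual standard error reported in the model summary, so the two outputs are consistent with each other.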

    3.5 Analysing the Residuals

    3.5.1 Checking unidentified patterns in the Residuals

    The graph shows that the residuals (the differences between actual and predicted prices) spread out more as the price of the diamond increases. A factor that raises the price in the higher price range may not be captured in the model, in which case the variation in price cannot be adequately explained by the available predictors.

    x

  • ylab("Residual") +
      geom_smooth(method="loess", colour="red", lwd=1)

    [Figure: residuals plotted against fitted values with a loess smoother. x-axis: Fitted value; y-axis: Residual.]
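
The pattern described in this section, residuals fanning out at higher fitted prices, can also be checked numerically. The sketch below uses simulated data in which the noise is deliberately made to grow with carat; it is not the author's model or data, and splitting at the median fitted value is just one simple way to compare the spread.

```r
set.seed(7)
n <- 1000
carat <- runif(n, 0.2, 1.2)
# Hypothetical prices whose noise grows with carat (not the real data)
price <- 500 + 4000 * carat + rnorm(n, sd = 400 * carat)

fit <- lm(price ~ carat)
res <- resid(fit)
fv  <- fitted(fit)

# Compare residual spread below and above the median fitted value
low     <- fv <= median(fv)
sd_low  <- sd(res[low])
sd_high <- sd(res[!low])
print(c(low = sd_low, high = sd_high))
```

A clearly larger standard deviation in the upper half is the numerical counterpart of the widening band seen in a residuals-versus-fitted plot.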

    3.5.2 Density plot of residuals to check Normal Distribution

    The graph shows that the residuals follow an approximately normal distribution.

    x

  • yfit
  • model.rmse
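
The `model.rmse` fragment above suggests that the fit was scored with a root-mean-square error, although the statement itself is cut off. A minimal sketch of computing an RMSE from predictions is given here. The data are simulated, the single-predictor model and the 80/20 hold-out split are assumptions made for illustration, and nothing is known about how the author chose the prediction set.

```r
set.seed(3)
n <- 500
carat <- runif(n, 0.2, 1.2)
price <- 500 + 4000 * carat + rnorm(n, sd = 300)   # hypothetical data
d <- data.frame(carat, price)

# Hold out 20% of rows for testing
test_idx <- sample(n, size = 100)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

fit  <- lm(price ~ carat, data = train)
pred <- predict(fit, newdata = test)

# Root-mean-square error between actual and predicted prices
rmse <- sqrt(mean((test$price - pred)^2))
print(rmse)
```

Because the RMSE is in the same units as price, it can be read directly against the residual standard error reported in the model summary.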
  • par(mfrow=c(2, 2))
    plot(fitted.model)

    [Figure: the four standard diagnostic plots for fitted.model. Panels: Residuals vs Fitted (Fitted values against Residuals), Normal Q-Q (Theoretical Quantiles against Standardized residuals), Scale-Location (Fitted values against Standardized residuals), Residuals vs Leverage (Leverage against Standardized residuals, with Cook's distance contours).]

    The points in the Q-Q plot lie more or less on the line, indicating that the residuals are approximately normally distributed.
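
How closely Q-Q points follow a straight line can be summarised with the correlation between the sample and theoretical quantiles. The sketch below does this for simulated residuals; the values are hypothetical and chosen only to have a spread similar to the residual standard error reported earlier.

```r
set.seed(11)
res <- rnorm(1000, sd = 325)   # simulated residuals, not the model's

# Theoretical (x) and sample (y) quantiles without drawing the plot
qq <- qqnorm(res, plot.it = FALSE)

# A correlation close to 1 means the points lie near a straight line
qq_cor <- cor(qq$x, qq$y)
print(round(qq_cor, 4))
```

For the real model the same call would be made on `resid(fitted.model)`; heavy tails would pull the correlation down and show up as curvature at the ends of the plot.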

    4 Final conclusion

    We have seen that a good predictive model can be developed using a linear model, provided that the variables (predictors) which significantly impact the outcome (price in this case) can be accurately identified.

    We also observe that the prediction may work only within some boundary conditions. If the boundary conditions are accurately identified, then different models can be built for predicting data outside the range covered by the fitted model.
