
Using Data Mining to Determine Car Dealer Auction Behavior

Edward Egros

Candidate for Departmental Distinction in Economics

Advisor: Prof. Tom Fomby

Data Courtesy of DaimlerChrysler Financial Services

Presented May 11, 2006

Table of Contents

Section 1) Introduction
Section 2) Multiple Linear Regression
Section 3) Regression Tree
Section 4) Neural Network
Section 5) Ensemble Method
Section 6) Conclusions
Section 7) Appendices
Section 8) Works Cited

Abstract: DaimlerChrysler Financial Services hopes to better predict how dealers will bid for its used vehicles at auctions. These predictions will help company executives gain a better sense of the true value of their auctioned vehicles and how much revenue they stand to earn. Because vehicle gross proceeds can take on virtually any dollar amount, data mining offers a number of continuous dependent variable models to help forecast bidding. This paper explores four of these models: a multiple linear regression, a regression tree, an artificial neural network and an Ensemble Method that combines the three previous models. The data for this study include a number of explanatory characteristics for each vehicle, including the mileage on the vehicle at the time of auction, the original manufacturer's suggested retail price and the time the vehicle was sold. The models use these characteristics to predict the gross proceeds for the vehicles. This study concludes that the Ensemble Method best forecasts gross proceeds for a subset of DaimlerChrysler used vehicles.

1) Introduction


DaimlerChrysler Financial Services uses many outlets to sell its used vehicles. One of the outlets that requires the most research is the wholesale dealer auction. Any vehicle that is either repossessed or whose customer lease has expired goes into a pool of vehicles. The company then assigns these vehicles to auction sites all over the country, where auction representatives take them into their lots. There, at the auction representative's discretion, vehicles are detailed, meaning any blemishes or malfunctions are repaired to increase the vehicle's value. On sale day, these vehicles are placed on the auction line and dealers attend the auction to bid on them. The dealer with the highest bid then buys the vehicle. However, if a bid is not high enough in the auction representative's judgment, the representative can call a "no sale" and refuse to sell the vehicle to the dealer with the highest bid. The vehicle can then be carried over to the next sale day or released from the auction site in some capacity, such as to a salvage yard or to a third party.

Auction representatives want to make sure that the vehicles they sell receive a fair value. The most efficient way of assuring fair value is to forecast what a dealer will pay for a vehicle with given characteristics. Price forecasting aids auction representatives and the company in several ways. First, it protects both the company and the auction from underselling a vehicle and absorbing a significant loss. Second, it prevents over-expectations of fair market value, where an auction representative forces too many "no sales" and never accepts the highest value the vehicle will receive over time. Finally, it acts as a guide for auction representatives in setting a floor value that best incites dealers to bid the highest dollar figure. Knowing the wholesale value also helps the auction representative make a number of profitable decisions, including the extent to which vehicles are repaired, how to run the auction to maximize value and whether to sell the vehicle at auction in the first place. This paper specifically looks at a subset of vehicles that DaimlerChrysler Financial Services sells on a regular basis.

To accomplish this task, this paper proposes a number of different methods to model dealer behavior. All of these methods employ data mining. One well-accepted definition of data mining is "the science of extracting useful information from large data sets or databases."1 More specifically, data mining is used to find patterns of behavior; the objective here is to find patterns of car dealer behavior. Knowing these patterns will produce more valid forecasts when a specific vehicle with given characteristics comes to auction. This paper proposes four methods to model these behaviors: a multiple linear regression, a regression tree, an artificial neural network and an Ensemble Method that employs portions of the three other methods simultaneously.

1 Hand, D., Mannila, H., Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.

The vehicles comprising the data used in this paper, 9,573 in all, were sold at auctions across the country held between February 25, 2004 and July 29, 2005, inclusive. Included in the data are a number of characteristics of each vehicle sold at auction, with gross proceeds (gross_proc) as the dependent variable and the remaining variables as explanatory variables. Table 1 lists all variables used in this analysis:

Table 1: Variable Definitions

This paper implements XLMiner software and utilizes the same explanatory variables for all four methods.2,3

2 Due to the 30-variable limitation of XLMiner, not all of the variables available could be used in this analysis. A two-step process determined which variables to use. First, a backward selection stepwise regression at the 90% significance level was run on the logged form of gross_proc using Intercooled STATA 7.0, starting from all explanatory variables, iteratively dropping the variable with the largest p-value and retaining anything with a p-value below 0.1. Second, any variable with a t-statistic greater than 5.3 in absolute value (p-value < .001) was included in this data mining analysis.

3 The variable "cpiusedseason" is seasonally adjusted monthly CPI data for new and used motor vehicles. U.S. Department of Labor, Bureau of Labor Statistics, Consumer Price Indices. 2006. Division of Consumer Prices and Price Indices. 17 May 2006. <http://www.bls.gov/cpi/home.htm#data>.
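Footnote 2's selection procedure can be sketched in code. The following is a rough Python analogue of the backward selection actually run in Intercooled STATA 7.0, not the paper's implementation; the DataFrame auctions and the list candidate_vars are hypothetical stand-ins for the study's data:

    # A sketch of backward selection, assuming `auctions` holds the data and
    # `candidate_vars` the full variable list; the paper used STATA, not Python.
    import numpy as np
    import statsmodels.api as sm

    def backward_select(X, y, threshold=0.1):
        """Refit OLS, dropping the least significant variable, until every
        remaining p-value falls below `threshold` (0.1, as in footnote 2)."""
        cols = list(X.columns)
        while cols:
            fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
            pvals = fit.pvalues.drop("const")     # never drop the intercept
            worst = pvals.idxmax()
            if pvals[worst] < threshold:
                return fit, cols                  # all survivors are significant
            cols.remove(worst)
        raise ValueError("no variable survived the selection")

    fit, kept = backward_select(auctions[candidate_vars],
                                np.log(auctions["gross_proc"]))
    # Footnote 2's second step then keeps only variables with |t| > 5.3.
    kept = [v for v in kept if abs(fit.tvalues[v]) > 5.3]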

There is a five-step process for creating these methods. The data mining software SAS Enterprise Miner uses the acronym SEMMA to illustrate this five-step process to new users. SEMMA stands for:

Figure 1: Acronym for Data Mining Protocol

Sample, Explore, Modify, Model, Assess4

4 SAS Technologies/Analytics. 2006. SAS Institute Inc. 17 May 2006. <http://www.sas.com/technologies/analytics/datamining/miner/semma.html>.

First, sampling the data means taking small portions of the data out of the pool and randomly partitioning them into three parts: a training set, a validation set and a test set. Figure 2 diagrams how the data are partitioned:

Figure 2: Data Partitioning

Training Set (50%), Validation Set (30%), Test Set (20%)

The training and validation sets will be used for the first three methods, and the test set will primarily be used for evaluating the Ensemble Method vis-à-vis the three methods that make up the ensemble. The original 9,573 observations are randomly placed into these three partitions, and all available variables are included in the partitioning. 50% of the data are in the training set, 30% are in the validation set and 20% are in the test set.
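As a concrete illustration, here is a minimal Python sketch of the 50/30/20 partition; XLMiner performs the equivalent step internally, and the DataFrame auctions is a stand-in for the study's data:

    # A minimal sketch of the 50/30/20 partition, assuming the 9,573
    # observations sit in a pandas DataFrame named `auctions`.
    shuffled = auctions.sample(frac=1, random_state=0)  # shuffle reproducibly
    n = len(shuffled)
    train = shuffled.iloc[: int(0.5 * n)]               # 50% training set
    valid = shuffled.iloc[int(0.5 * n): int(0.8 * n)]   # 30% validation set
    test = shuffled.iloc[int(0.8 * n):]                 # 20% test set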

The second step is to explore the data to look for obvious trends, outliers, missing information, etc. After running outlier analysis and applying some heuristics, there are a few things worth noting; these factors help determine whether a model can be used as an accurate predictor. For example, mileage has a negative effect on a vehicle's value, and no vehicle can realistically have more than 250,000 miles. There were no notable outliers, making the exploration step a quick one. Third, the data must be modified to reduce outliers; these data did not include any obvious outliers or observations with missing information. Modification also includes transforming variables. For example, variables such as quality category (inventory_) must be transformed into binary variables, or "dummies," for the four possible categories.5

5 All of these vehicles fall under six categories numbered 0-5. A five is the best quality of vehicle, a zero is the worst quality and a three is average. For the sake of this analysis, observations with categories zero and one are dropped. Also, the variable representing category three vehicles is not included in any models, to avoid the dummy variable trap and to verify the accuracy of the model (the coefficient for a category two vehicle should be negative and the coefficients for fours and fives should be positive).
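The dummy transformation can be sketched in Python. The column name inventory_ follows Table 1; everything else here is illustrative:

    import pandas as pd

    # A minimal sketch of the dummy transformation: one binary column per
    # quality category, with category three dropped as the base case to
    # avoid the dummy variable trap (see note 5).
    dummies = pd.get_dummies(auctions["inventory_"], prefix="inventory")
    dummies = dummies.drop(columns=["inventory_3"])     # category 3 = base case
    auctions = pd.concat([auctions, dummies], axis=1)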

After these tasks are completed, it is time to model the data using the four methods discussed. Figure 3 diagrams these four models:

Figure 3: Methods Used

[The four methods: the multiple linear regression, $y_t = \beta_0 + \beta_1 x_{1t} + \cdots + \beta_p x_{pt} + e_t$; the regression tree; the artificial neural network; and the Ensemble Method, $y = \beta_0 + \beta_1(MLR) + \beta_2(RT) + \beta_3(NN)$.]

The training set will be used to construct these initial models. The constructed models will then be scored by means of the validation set and the test set.

Finally, it is time to assess how well these models perform with new data. For this paper, the main determinant of success is RMSE, or root mean squared error. This statistic comes from taking the average of the squared errors of each observation's prediction and then taking the square root of that mean; the lower the RMSE, the better the model. For the first three methods, the model that best predicts the validation set will be considered the best model and will be recommended for future auctions and for future vehicle price forecasting. For the Ensemble Method, the test set will be used to determine success. The purposes for using three different datasets will be explained later in the study.
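In symbols, for $n$ scored observations with actual gross proceeds $y_i$ and predicted gross proceeds $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$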

2) Multiple Linear Regression

The first model we explore is the multiple linear regression. This is the simplest model econometrically and also one of the most popular models used in data mining exercises. The idea behind this model starts with the algebraic equation for a straight line. One way to write this equation is $y = mx + b$, where $y$ is the dependent variable being modeled, $m$ is a coefficient that describes the slope, $x$ is some independent variable and $b$ is a $y$-intercept that gives the value of $y$ when either $m$ or $x$ equals zero. The term $mx$ does not have to be restricted to one independent variable; instead there can be a number of factors, each with its own coefficient. This describes the construction of a "multiple" linear regression model. Generally, the form of this model is:

Equation 1: Formula for the Multiple Linear Regression Model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$;

where $\beta_0$ is the $y$-intercept constant, $\beta_1, \ldots, \beta_k$ represent the coefficients of the independent factors and $e$ is an error term that captures the unexplained parts of the regression. Ordinary least squares (OLS), a set of matrix algebra equations that minimizes the squared differences between the true values of $y$ and the fitted values $\hat{y}$ that come from the regression, determines what these coefficients should be. For this regression to yield legitimate results, there are some properties the regression must fulfill. First, the regression must have a zero conditional mean: the expected value of the error given all of the explanatory variables equals zero. Second, there cannot be perfect collinearity; in other words, the independent variables cannot have perfect linear relationships with each other. Finally, the regression must be homoskedastic, that is, have a constant error variance.6 If these properties hold, then the regression can accurately reflect the relationships between the different explanatory factors and the dependent variable.

6 Wooldridge, Jeffrey. Introductory Econometrics: A Modern Approach. Australia: Thomson South-Western, 2003. pp. 85-95.
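A minimal sketch of the estimation and scoring steps, reusing the train/valid partitions and the selected-variable list kept from the earlier sketches; XLMiner's regression tool computes the same OLS estimates:

    import numpy as np
    import statsmodels.api as sm

    # Fit OLS on the training set, then score the validation set.
    ols = sm.OLS(train["gross_proc"], sm.add_constant(train[kept])).fit()
    mlr_pred_valid = ols.predict(sm.add_constant(valid[kept]))

    resid = valid["gross_proc"] - mlr_pred_valid       # actual minus predicted
    rmse = float(np.sqrt((resid ** 2).mean()))         # root mean squared error
    avg_error = float(resid.mean())                    # average error (bias)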

In this model, we use the vehicle characteristics previously mentioned as the explanatory variables and the gross proceeds as the dependent variable; the most significant characteristics are included in the regression. Estimating the model by OLS on the training set, XLMiner calculated the model seen in Appendix 1. Using these estimates, here is a subset of the scores for the validation dataset:

Table 2: Subset of Scores for Multiple Linear Regression over the Validation Dataset

Row Id.   Predicted Value   Actual Value   Residual
3         18582.33558       18800          217.6644222
11        17951.96275       18800          848.0372512
14        19889.9885        18800          -1089.9885
15        18774.51404       18800          25.48596453
16        17310.62169       18800          1489.378313
19        18705.61518       18800          94.38481712
20        20245.20896       18800          -1445.20896
23        17896.6315        18800          903.3685009
27        18158.52965       18800          641.4703508
31        19419.04455       18800          -619.044547
32        17604.87104       18800          1195.128962
34        18953.36258       18800          -153.362582
42        19440.73648       18800          -640.736477
47        19260.89011       18800          -460.890108
50        19222.29166       18800          -422.291664
53        19013.7063        18800          -213.706304
57        19593.69709       18750          -843.697095

The RMSE for the training set is $1,499.39 with an average error close to zero; for the validation set the RMSE is $1,552.76 with an average error of $84.74. Because the RMSEs for the two datasets are within $53.37 of each other, the regression does seem to forecast accurately relative to the original data. The R-squared statistic, which measures how well a regression fits the data, equals 0.725 for this model, which is acceptable given the conditions of the model. The average error on the training set is near zero by construction, since OLS is designed to avoid any bias in-sample; a zero average error is next to impossible when introducing new data, so a low average error such as this one is acceptable. Now that the multiple linear regression seems sound enough for implementation, it is time to create the other three models and compare their respective results with the multiple linear regression to determine which model best predicts dealer behavior.

3) Regression Tree

The next model to explore is the regression tree. This approach can also be appealing, especially to engineers, because it is visually the easiest to understand despite the algorithmic difficulties of its construction. It is the simplest model that can capture non-linear relationships between the explanatory variables and the dependent variable, unlike the multiple linear regression, where only the $y$-intercept and linear coefficients model the relationship. This process does, however, require more time to construct, a factor that should be considered when determining whether it is the most accurate and efficient method.

An example of a regression tree looks like this:7

Figure 4: Example of a Regression Tree

7 Bruce, Peter C., Patel, Nitin R., Shmueli, Galit. Data Mining in Excel: Lecture Notes and Cases. Arlington, VA: Resampling Stats, Inc., 2005. p. 124.

Diagramming a tree begins with some explanatory variable paired with a number; this pair is referred to as a decision node. All records, hereafter referred to as observations, start at this node and eventually fall down the tree according to the values of their respective explanatory variables. From this node there are two extending branches, each representing a range of values. If an observation's value of the explanatory variable is less than or equal to the number, the observation goes down to the left; if its value is greater than the value listed, the observation falls to the right. Everything underneath this variable hinges on whether the value from the observation is greater than, or less than or equal to, the value listed.8 When reading a regression tree, there will be numbers next to the branches; those numbers represent the number of observations that followed that particular branch. After following the branch, the observation comes either to another decision node or to a terminal node. If it is another decision node, the process starts over and the observation falls down one of two branches. If it is a terminal node, the tree ends and a predicted value for the dependent variable is reached. This one value is the same predicted value for all observations that share the same characteristics; predictions come from taking the average of the dependent values of the subset of training observations making up the terminal node. In XLMiner, decision nodes are designated with circles, and terminal nodes are designated with rectangles and a blue font for the predicted value.9

8 One of the reasons engineers like this model is how it controls for outliers. An observation that is an outlier in some explanatory variable cannot change the entire model very much, because each explanatory variable only sorts observations into one of two branches, unlike a multiple linear regression, where an outlier has much more clout.

9 Regression tree diagrams may also include rectangles with red text reading "Sub Tree beneath." XLMiner's diagramming rules allow a spreadsheet to contain only so many levels of the tree; if the tree has more levels than can be displayed, the "Sub Tree" rectangle appears. When this happens, check the tabular results to determine how an observation fits within that particular tree. Some trees in this analysis include such sub trees.

When diagramming a regression tree, there are many different options for defining the branches. As with any data mining model, testing a variety of options, or "trial and error," is the most reliable way to determine which specification performs best. There are three main types of trees: a full tree, a pruned tree and a minimum error tree. In a full tree, the objective is to fit virtually all of the observations onto the tree, meaning the tree accurately captures all of the observations in the training set. However, this method is subject to the malady of overfitting: devising a model so tailored to one dataset that it becomes worthless when it tries to forecast a completely different set of data. The solution to overfitting is to create a pruned tree. A pruned tree starts with a full tree and then removes, or "prunes," branches that do not significantly reduce the error rate. To determine this significance, a chi-square statistic for independence is used: if the node split with the strongest association to the dependent variable has a significant p-value according to the chi-square statistic, the split is included; otherwise it is pruned. The minimum error tree differs in that, while pruned trees use the training set, a minimum error tree uses the validation set to prune the full tree.
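A rough scikit-learn analogue of this tuning process follows, with the caveat that XLMiner's controls (cases minimum in a terminal node, maximum splits per variable, chi-square pruning) do not map one-to-one onto scikit-learn's parameters:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # min_samples_leaf stands in for the terminal-node minimum; ccp_alpha
    # (cost-complexity pruning) stands in for the pruning step, with the
    # validation set picking the winner, as with the minimum error tree.
    best_rmse, best_tree = float("inf"), None
    for leaf_min in (479, 100):              # terminal-node minimums tried in the paper
        for alpha in (0.0, 1.0, 10.0):       # illustrative pruning strengths
            tree = DecisionTreeRegressor(min_samples_leaf=leaf_min,
                                         ccp_alpha=alpha, random_state=0)
            tree.fit(train[kept], train["gross_proc"])
            pred = tree.predict(valid[kept])  # each leaf predicts its training mean
            rmse = float(np.sqrt(((valid["gross_proc"] - pred) ** 2).mean()))
            if rmse < best_rmse:
                best_rmse, best_tree = rmse, tree
    rt_pred_valid = best_tree.predict(valid[kept])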

To apply regression trees to this auction data, all terminal nodes will hold dollar-value predictions for a vehicle with the characteristics given in the decision nodes above them. For this problem, four trees with different specifications were developed, and the tree with the lowest RMSE on the validation set will be considered the best tree for prediction. The table below provides these specifications and their respective results:

Table 3: Testing for Best Predictive Regression Tree

Tree   Specifications                                                         RMSE of Validation Set (Average Error)
1      479 cases minimum in terminal node; 100 maximum splits per variable    2124.81 (62.41)
2      100 cases minimum in terminal node; 100 maximum splits per variable    1824.93 (60.19)
3      100 cases minimum in terminal node; 50 maximum splits per variable     1815.84 (68.70)
4      479 cases minimum in terminal node; 200 maximum splits per variable    2070.84 (64.62)

From these four trees, it is clear that there is an inconsistent tradeoff between the RMSE and the average error: the tree with the average error closest to zero (the second) does not have the lowest RMSE, and the tree with the lowest RMSE (the third) does not have the smallest average error.

From previous analysis, auction representatives as well as those representing DaimlerChrysler Financial Services believe that a slightly positive bias, and thus a slightly positive average error, benefits the company more than a negative or even a zero error. The basis for this logic is the power auctions have over prices in the market. There are potentially drastic consequences to under-forecasting vehicles: an under-forecasted vehicle could influence how dealers value similar types of vehicles, and this devaluation costs the company and the auctions a great deal of money. To be risk averse, the company prefers to slightly over-forecast vehicles to avoid devaluation. What makes over-forecasting possible is that minor upward tweaks to prices may go unnoticed by dealers, so the company both avoids devaluation and retains firm control over the market price.

Returning to the model, this behavior means that average error is not the most important statistic for determining the best regression tree; therefore the tree with the lowest RMSE, as long as its average error is not dramatically larger than the others', will be recommended as the best. According to the results, the third tree, with the low 100-case minimum in each terminal node and the fewest allowed splits per explanatory variable of all the trees, best predicts the validation data. Appendix 2 includes all split values and terminal node levels of this tree. Here is a table of how the tree scored some of the validation dataset:

Table 4: Subset of Scores for Regression Tree over the Validation Dataset

Row Id.   Predicted Value   Actual Value   Residual
3         19167.85714       18800          -367.857143
11        18518.58407       18800          281.415929
14        18218.93204       18800          581.067961
15        19167.85714       18800          -367.857143
16        17578.42466       18800          1221.575342
19        19167.85714       18800          -367.857143
20        19869.5           18800          -1069.5
23        17578.42466       18800          1221.575342
27        18847.17742       18800          -47.177419
31        19571.47887       18800          -771.478873
32        17578.42466       18800          1221.575342
34        19771.07438       18800          -971.07438
42        19771.07438       18800          -971.07438
47        19113.11475       18800          -313.114754
50        17981.51261       18800          818.487395
53        19167.85714       18800          -367.857143
57        20156.34921       18750          -1406.34921

Also, it is important to note that these trees do not use all of the explanatory variables given. After completing the analysis with regression trees, it is time to look into how neural networks perform with the data.

4) Neural Network

Of all the data mining methods discussed in this paper, the artificial neural network is the most difficult to construct and the most difficult to understand conceptually. Engineering one of these models requires numerous calculations, and of the four models it takes a statistical software program the longest amount of time to complete. The major advantage of the neural network is that it best captures complicated non-linear relationships, between the explanatory variables and the output and among the explanatory variables themselves, that a regression tree is inferior at describing.

The structure of a typical neural network consists of three layers: an input layer, a group of hidden layers10 and an output layer. The input layer is made up of the explanatory variables, with one node for each variable. The output layer consists of one node that gives the prediction implied by the explanatory variables. Any further calculations between these two layers are referred to as hidden layers. Here is an example diagram of a neural network:11

Figure 5: Example of Artificial Neural Network

10 In XLMiner, the user can specify anywhere from 1 to 4 of these hidden layers.

11 Bruce, Peter C., Patel, Nitin R., Shmueli, Galit. Data Mining in Excel: Lecture Notes and Cases. Arlington, VA: Resampling Stats, Inc., 2005. p. 160.

This diagram has one input layer with four input nodes on the left, two hidden layers in between the input and output layers, and one output layer with one node on the right side. Not shown are the numbers written next to the arrows (weights) and the numbers listed just before a node (biases) and just after a node (outputs). Weights help refine a prediction, since an output consists of a weighted sum of explanatory variables. Biases are added to, or subtracted from, the output of a node, whether it is the final output or the output feeding another hidden layer. Both of these terms usually take on values with absolute value between 0.00 and 0.05. In a full calculation, the weights and explanatory variable values go into some function and are calculated; the bias is then added or subtracted, and the result is the output for either a hidden layer or the final output. Weights and biases are initially guessed and then corrected through "back propagation," in which the potential error of a prediction is calculated and the original estimates are adjusted. This system updates the weights after each observation.
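A hedged scikit-learn sketch of such a network follows; XLMiner's training routine differs, so the settings described below (300 epochs, weight change momentum of 0.4, error tolerance of 0.01) are only approximated by MLPRegressor's stochastic gradient descent options:

    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    # Networks are sensitive to input scale, so standardize the inputs first.
    scaler = StandardScaler().fit(train[kept])
    net = MLPRegressor(hidden_layer_sizes=(25, 25),  # e.g., 2 hidden layers of 25 nodes
                       solver="sgd",                 # weights updated by backpropagation
                       momentum=0.4,                 # weight change momentum
                       max_iter=300,                 # 300 epochs
                       tol=0.01)                     # error tolerance
    net.fit(scaler.transform(train[kept]), train["gross_proc"])
    nn_pred_valid = net.predict(scaler.transform(valid[kept]))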

The input layer for this particular problem consists of the same vehicle characteristics used for the previous two models; the output is the estimated price for the vehicle. Four combinations of hidden layers and hidden layer nodes will be implemented. The combination with the lowest RMSE, or a reasonably low RMSE together with the lowest average error, on the validation dataset will be the network recommended. The settings for all four neural networks include 300 epochs12, a weight change momentum13 of 0.4 and an error tolerance of 0.01. The table below includes the results of this study:

Table 5: Testing for Best Neural Network

Network   Specifications                                    RMSE of Validation Set (Average Error)
1         1 hidden layer, 10 nodes within each layer        3896.28 (-3407.20)
2         1 hidden layer, 25 nodes within each layer        4779.92 (-4295.93)
3         2 hidden layers, 25 nodes within each layer       2887.19 (-2266.37)
4         4 hidden layers, 25 nodes within each layer14     3079.57 (-2027.18)

12 An epoch is a run through the network in which the weights are updated using new observations.

13 Weight change momentum retains some portion of the earlier weights' influence, so that an outlier introduced to the network does not seriously affect the results.

14 These are the maximum specifications in XLMiner for a neural network.

From these four neural networks, the one with the lowest RMSE is the third network, and the one with the average error closest to zero is the fourth. Once again, just as with the regression trees, the RMSE and the average error disagree about the best performing model. But unlike with the regression trees, where the average error was discounted in favor of a slightly positive bias, all of the average errors in the neural networks are negative, so it is most beneficial to the company to have a bias as close to zero as possible given only negative alternatives; any significantly negative bias will only hurt the company's status in the used vehicle market. The differences in RMSE and in average error between the two finalist models are approximately the same size; therefore the fourth model will be considered the best neural network. It is important to note that building the fourth model, with the maximum number of hidden layers, takes much more time than the third.15 If efficiency matters more than minimizing the average error magnitude, then the third neural network should be considered the best predictive model; it is up to the developer of the data mining models to determine the opportunity costs of building a more sophisticated model. For the sake of this paper, the time used to develop different models is not a factor in predictive quality, and so the fourth model is deemed the best artificial neural network. Appendix 3 provides the estimates for its weights and nodes. How this neural network scores the validation dataset can be seen in the following table:

15 In XLMiner, the third neural network took 222.0 seconds to build. The fourth model took 583.0 seconds.

Table 6: Some Scores for Neural Network Models over the Validation Dataset

Row Id.   Predicted Value   Actual Value   Residual
3         21726.92899       18800          -2926.92899
11        20296.6756        18800          -1496.6756
14        22544.30552       18800          -3744.30552
15        20910.90054       18800          -2110.90054
16        19725.38579       18800          -925.385792
19        21690.66788       18800          -2890.66788
20        23146.59824       18800          -4346.59824
23        20691.12712       18800          -1891.12712
27        21441.52709       18800          -2641.52709
31        23226.48269       18800          -4426.48269
32        20349.19137       18800          -1549.19137
34        20265.57072       18800          -1465.57072
42        20780.53776       18800          -1980.53776
47        21272.99611       18800          -2472.99611
50        21074.60784       18800          -2274.60784
53        22036.25948       18800          -3236.25948
57        21307.45563       18750          -2557.45563

5) Ensemble Method

This paper has compiled the most acceptable linear regression, regression tree and artificial neural network. Although one model may perform better than the other two, this does not mean that the other two should be discounted; in fact, they may still hold significance and thus predictive value. This is possible theoretically thanks to the results published by C. Granger and R. Ramanathan, who devised a method, here called the Ensemble Method, devoted to implementing and synthesizing a number of predictive models, including all three methods mentioned earlier.16 The process works like that of a multiple linear regression.17 In this problem, there are three explanatory variables: the first is the set of price predictions for the validation data generated by the multiple linear regression; the second is the set of predictions for the same data generated by the regression tree; and the third comes from the neural network. The true vehicle gross proceeds are then placed in a column next to these three sets of model estimates. After compiling these data, the next step is to retrieve the same three models' estimates generated from the test data; these will serve the purpose of the validation data, since the actual validation data is being used as the training dataset for the Ensemble Method. In XLMiner, the only way to set up the Ensemble Method is to copy and paste the respective columns into a blank worksheet and label each column with the type of variable it contains. After building the regression, the next step is to generate new variables that represent the errors over both the validation and test datasets. An average error and a root mean squared error are then computed from both datasets and will determine whether the Ensemble Method or one of the original three models is the best predictor of these vehicles.

16 Because XLMiner does not include an Ensemble Method in its software, this method is generated manually, as detailed later in the paper.

17 Granger, C. and Ramanathan, R. "Improved Methods of Combining Forecasts." Journal of Forecasting 3 (1984): 197-204.
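A minimal sketch of this combining regression, reusing the validation-set predictions from the earlier sketches (the *_pred_test arrays are assumed to be produced the same way as their validation counterparts):

    import pandas as pd
    import statsmodels.api as sm

    # Granger-Ramanathan combination: regress actual gross proceeds on the
    # three models' validation-set predictions, then apply the fitted
    # weights to the test-set predictions.
    stack_valid = pd.DataFrame({"MLR": mlr_pred_valid, "RT": rt_pred_valid,
                                "NN": nn_pred_valid}, index=valid.index)
    combiner = sm.OLS(valid["gross_proc"], sm.add_constant(stack_valid)).fit()

    stack_test = pd.DataFrame({"MLR": mlr_pred_test, "RT": rt_pred_test,
                               "NN": nn_pred_test}, index=test.index)
    ensemble_pred = combiner.predict(sm.add_constant(stack_test))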

It is now time to generate one multiple linear regression with the estimates from the previous models. The regression generated looks like this:

Equation 2: Estimates of Ensemble Method Based on Validation Dataset Predictions

$\hat{y} = 1054.43 + 0.39\,MLR + 0.09\,RT + 0.42\,NN$;
(standard errors: 255.61, 0.04, 0.03, 0.02)

where MLR, RT and NN are the validation dataset predictions of the multiple linear regression, the regression tree and the artificial neural network, respectively, and the values in parentheses are standard errors. All four p-values are less than 0.01, meaning these estimates are significant at the 99% confidence level. Applying these estimates to the test data, the average error comes to $95.69 and the RMSE equals $1,081.47. The R-squared reported for this regression over the validation data is 0.76, which makes the regression acceptable as far as fitting the predictions of the three models. The scores over the test dataset are in the following table:

Table 7: Scores of the Three Original Methods and the Ensemble Method over the Test Dataset (residuals here are predicted value minus actual value)

Row Id.  MLR Predicted  MLR Residual  RT Predicted  RT Residual  NN Predicted  NN Residual  Ensemble Predicted  Actual  Ensemble Residual
2        20169.72038    1369.720383   19484.2246    684.224599   22774.95389   3974.953892  20182.43578         18800   1382.435784
7        17737.09486    -1062.90514   18392.85714   -407.142857  20541.14421   1741.144213  18198.55266         18800   -601.4473385
8        17887.12771    -912.872287   19113.11475   313.114754   20139.96794   1339.967936  18151.15752         18800   -648.8424787
13       18419.56646    -380.433537   18518.58407   -281.415929  21931.95774   3131.95774   19059.82972         18800   259.8297201
22       19875.68535    1075.685348   18218.93204   -581.067961  21789.53167   2989.531669  19545.48193         18800   745.4819292
24       18795.21086    -4.78914423   19167.85714   367.857143   22002.1839    3202.1839    19292.35938         18800   492.3593818
25       19154.69982    354.6998237   19113.11475   313.114754   22124.21957   3324.219566  19479.78391         18800   679.7839096
28       19794.2241     994.2241033   21049.2       2249.2       23108.63182   4308.631818  20309.43821         18800   1509.438205
33       18465.05765    -334.942354   19167.85714   367.857143   20884.03034   2084.030341  18694.31686         18800   -105.6831369
40       19811.59787    1011.597873   21861.84971   3061.849711  23868.71268   5068.712676  20704.57403         18800   1904.574029
46       17066.67275    -1733.32725   17860.28037   -939.719626  18809.16919   9.169189     17164.08734         18800   -1635.912664
59       17416.50953    -1333.49047   18392.85714   -357.142857  19189.32546   439.325457   17506.34582         18750   -1243.654179

All four methods have now been implemented on the test dataset to determine which model best predicts car dealer behavior at auctions. Three of the four methods were tested over different specifications; using the best-performing specification of each method, here are the results:

Table 8: Summary Results of Four Competing Methods over Test Dataset

Model               RMSE        Average Error
Linear Regression   $1,511.21   $10.17
Regression Tree     $2,046.73   -$34.06
Neural Network      $2,961.63   -$2,373.42
Ensemble Method     $1,081.47   $95.69

6) Conclusions

The first glaring observation from the test dataset results is that the artificial neural network is the worst model by both average error and RMSE. The neural network sometimes attempts to model relationships between the explanatory variables and the dependent variable in a way that is far more complicated than reality. Nearly all of the predictions this model generated were significantly lower than the true gross proceeds, meaning the model is negatively biased. One possibility for this bias is that the initial guesses for the weights were far lower than they should have been. Despite the unacceptable results, the estimates generated by the neural network are still important, because they are critical in calculating estimates for the Ensemble Method. Although there are numerous combinations of the Ensemble Method that would exclude the neural network predictions, the fact that the combining regression produced a significant estimate for them shows their importance. It is also worth noting that the coefficient on the neural network explanatory variable, 0.42, is the largest of the three.

As for the other three models, once again there is an almost linear tradeoff between RMSE and average error. If average error were deemed the most important factor, then the linear regression, whose average error is closest to zero, would be considered the best model; however, its RMSE is $429.74 worse than that of the model with the lowest RMSE, the Ensemble Method. When situations like this occur in data mining, it is best to consult the objectives of the experiment to determine which model to use. In this case, there is a monetary advantage to having a slightly positive bias.18 There are severe consequences to devaluing vehicles, and the company wishes to avoid them. Ceteris paribus, there is a precise dollar amount by which the company and the auctions can raise the prices of this subset of used vehicles without considerable dealer backlash and a loss of faith in the company's price forecasting altogether. Unfortunately, determining this precise amount is difficult: because of market sensitivities, an experiment that slightly increases the bias each forecasting period until the auction notices dealer backlash is impossible to conduct practically. Not only is a dealer's lost faith difficult to regain, but there is also a last-period, time-sensitive correlation in price forecasts. More studies and other creative means need to be devised to determine exactly what bias the market can sustain. For the time being, it is assumed that the market can sustain a positive bias of $95.69.

For the sake of this study, it is also assumed that the amount of time spent developing a forecasting model is not a factor in determining the best method; therefore the Ensemble Method is the best predictor of these used vehicles at auction. In XLMiner, the multiple linear regression model took a fraction of the time to build compared with the neural network. Also, because XLMiner is the focus of this research and does not include an Ensemble Method, the time it takes to build the final model is the sum of running all three models plus the time to combine them into the Ensemble Method.

18 See the discussion in Section 3 about the dangers of under-forecasting vehicles and the company's mission to be as risk averse as possible.

On the whole, used vehicle auction data can be quite volatile. Bids at an auction sometimes move in $250 increments, and a number of things can affect whether any dealer makes one additional bid. These unforeseen and immeasurable factors include heightened dealer competition, where two or more dealers vie for a limited inventory; bad weather, which makes it more difficult for dealers to show up to an auction; and the way a particular auction is run. For instance, some auctioneers and ring men19 may be more or less influential in assuring that a dealer makes another bid. Currently, no forecasting model captures these factors perfectly, since it is next to impossible to come up with variables that explain this behavior.

19 Ring men are a group of auction assistants who look at each dealer and, in a hurried fashion, attempt to get more bids out of dealers.

Despite this white noise, new and innovative ways of price forecasting implemented with data mining are better capturing the relationships between a gamut of vehicle and auction factors and the true value of used vehicles. This study explores how three of the most popular continuous-variable models perform with actual auction data. The Ensemble Method performs best because it takes into account both the simple and the complicated relationships between vehicle characteristics and gross proceeds, whereas each of the other three models can capture only a linear relationship or a non-linear, non-parametric one. What implementing the Ensemble Method also suggests is that the most sophisticated models for predicting auction bids usually take the most time to develop. It is up to executives to determine how much time can be devoted to a project: whether it is more beneficial to spend more time developing a better model or whether an inferior model better conserves opportunity costs. Thankfully, data mining offers econometricians and engineers alike these options. Showcasing data mining in new and different ways will also help these modeling techniques become more widely adopted across numerous academic fields.

7) Appendices

Appendix 1: Multiple Linear Regression Coefficients and Estimates

Input variables   Coefficient    Std. Error    p-value      SS
Constant term     82981.14844    3421.535645   0            1.72931E+12
msrpamount        0.28138652     0.00771736    0            10726190000
sold_mileage      -0.09676149    0.00166331    0            12501070000
inventory_1       -738.397461    92.78156281   0            366366200
inventory_3       610.1702881    50.7352829    0            78098100
inventory_4       1034.455811    80.36746216   0            490844400
auction_loc1      -212.905685    67.29554749   0.00165312   1227785
auction_loc3      -651.486572    100.7588577   0            21980950
auction_loc5      362.5553589    73.17302704   0.000001     151648000
auction_loc8      -604.85968     95.44612122   0            163980000
auction_loc9      -496.780975    88.9276123    0.00000004   66948200
auction_loc10     -381.332794    89.72070313   0.00002552   134050700
c_model_key19     -2050.70532    129.8644867   0            894399100
c_model_key25     -1170.71643    185.2324677   0            5389190
c_model_key27     -2200.5874     95.7226181    0            1070617000
c_model_key28     -25.6804009    675.5402832   0.96969134   157382.5469
carmile3          -0.01260703    0.00142604    0            63611690
cpiusedseason     -737.568237    35.53587341   0            1050569000
week_sold1        -570.329224    117.7169495   0.0000017    30346850
week_sold3        -627.526917    113.0041199   0.00000005   51884980
week_sold10       179.948288     98.53777313   0.06842326   21606660
week_sold11       518.2744141    94.07507324   0.00000006   113547500
week_sold14       -337.202576    98.73433685   0.00068951   8397428
week_sold21       -918.109131    145.2453919   0            70136130
week_sold22       -643.478455    117.8486633   0.00000008   52729520
week_sold23       -910.053955    127.2749252   0            106006300
week_sold24       -1028.31201    132.9921722   0            135160200

Appendix 2: Regression Tree Split Values and Terminal Node Levels

Prune Tree Rules

Level  NodeID  ParentID  SplitVar       SplitValue  Cases  LeftChild  RightChild  PredVal   NodeType  Operator
0      0       N/A       c_model_key27  0.5         2872   1          2           19008.59  Decision  <=
1      1       0         sold_mileage   40848.5     2541   3          4           19488.91  Decision  <=
1      2       0         sold_mileage   34870       331    21         22          15361.76  Decision  <=
2      3       1         msrpamount     35421.5     1480   5          6           20636.49  Decision  <=
2      4       1         sold_mileage   54370       1061   7          8           17931.3   Decision  <=
3      5       3         cpiusedseason  94.8        837    11         12          19923.85  Decision  <=
3      6       3         msrpamount     42255.18    643    9          10          21593.98  Decision  <=
3      7       4         msrpamount     38804.4     708    13         14          18604.52  Decision  <=
3      8       4         sold_mileage   68966.5     353    27         28          16631.08  Decision  <=
4      9       6         cpiusedseason  94.5        570    15         16          21280.4   Decision  <=
4      10      6         N/A            N/A         73     N/A        N/A         24538.5   Terminal
4      11      5         carmile3       30268.5     253    23         24          21015.63  Decision  <=
4      12      5         sold_mileage   29244.5     584    19         20          19455.63  Decision  <=
4      13      7         cpiusedseason  94.5        430    25         26          18013.64  Decision  <=
4      14      7         cpiusedseason  94.5        278    29         30          19462.63  Decision  <=
5      15      9         sold_mileage   25823.5     177    47         48          22420.7   Decision  <=
5      16      9         sold_mileage   25218.5     393    17         18          20739.8   Decision  <=
6      17      16        N/A            N/A         104    N/A        N/A         22063.81  Terminal
6      18      16        msrpamount     40846       289    31         32          20214.25  Decision  <=
5      19      12        sold_mileage   19440.5     252    45         46          20182.03  Decision  <=
5      20      12        carmile3       30574.5     332    35         36          18932.57  Decision  <=
2      21      2         sold_mileage   24627       166    55         56          16128.79  Decision  <=
2      22      2         sold_mileage   46820.5     165    41         42          14704.67  Decision  <=
5      23      11        sold_mileage   26769       184    53         54          21520.97  Decision  <=
5      24      11        N/A            N/A         69     N/A        N/A         19771.07  Terminal
5      25      13        N/A            N/A         91     N/A        N/A         19167.86  Terminal
5      26      13        carmile3       41141.5     339    33         34          17725.09  Decision  <=
4      27      8         cpiusedseason  94.8        279    37         38          16950.6   Decision  <=
4      28      8         N/A            N/A         74     N/A        N/A         15502.11  Terminal
5      29      14        N/A            N/A         63     N/A        N/A         20477.38  Terminal
5      30      14        sold_mileage   49138       215    39         40          19103.48  Decision  <=
7      31      18        sold_mileage   34212.5     178    49         50          20721.75  Decision  <=
7      32      18        N/A            N/A         111    N/A        N/A         19484.22  Terminal
6      33      26        sold_mileage   46833.5     215    51         52          18105.22  Decision  <=
6      34      26        N/A            N/A         124    N/A        N/A         17019.13  Terminal
6      35      20        sold_mileage   37906       265    61         62          19184.63  Decision  <=
6      36      20        N/A            N/A         67     N/A        N/A         17981.51  Terminal
5      37      27        N/A            N/A         61     N/A        N/A         17860.28  Terminal
5      38      27        carmile3       27261.5     218    43         44          16687.53  Decision  <=
6      39      30        inventory_3    0.5         147    57         58          19463.6   Decision  <=
6      40      30        N/A            N/A         68     N/A        N/A         18218.93  Terminal
3      41      22        N/A            N/A         99     N/A        N/A         15089.7   Terminal
3      42      22        N/A            N/A         66     N/A        N/A         13946.04  Terminal
6      43      38        sold_mileage   59917       150    59         60          17015.75  Decision  <=
6      44      38        N/A            N/A         68     N/A        N/A         16036.37  Terminal
6      45      19        N/A            N/A         62     N/A        N/A         20838.1   Terminal
6      46      19        sold_mileage   25389       190    65         66          19889.93  Decision  <=
6      47      15        N/A            N/A         66     N/A        N/A         23103.74  Terminal
6      48      15        N/A            N/A         111    N/A        N/A         22045.9   Terminal
8      49      31        N/A            N/A         93     N/A        N/A         21219.93  Terminal
8      50      31        N/A            N/A         85     N/A        N/A         20156.35  Terminal
7      51      33        N/A            N/A         110    N/A        N/A         18458.03  Terminal
7      52      33        N/A            N/A         105    N/A        N/A         17578.42  Terminal
6      53      23        N/A            N/A         102    N/A        N/A         21861.85  Terminal
6      54      23        N/A            N/A         82     N/A        N/A         21049.2   Terminal
3      55      21        N/A            N/A         60     N/A        N/A         16574.29  Terminal
3      56      21        N/A            N/A         106    N/A        N/A         15821.05  Terminal
7      57      39        N/A            N/A         83     N/A        N/A         19198.3   Terminal
7      58      39        N/A            N/A         64     N/A        N/A         19869.5   Terminal
7      59      43        N/A            N/A         80     N/A        N/A         17264.08  Terminal
7      60      43        N/A            N/A         70     N/A        N/A         16676.68  Terminal
7      61      35        inventory_3    0.5         191    63         64          19313.38  Decision  <=
7      62      35        N/A            N/A         74     N/A        N/A         18847.18  Terminal
8      63      61        N/A            N/A         100    N/A        N/A         19113.11  Terminal
8      64      61        N/A            N/A         91     N/A        N/A         19571.48  Terminal
7      65      46        N/A            N/A         94     N/A        N/A         20094.23  Terminal
7      66      46        N/A            N/A         96     N/A        N/A         19638.98  Terminal

Appendix 3: Example of Artificial Neural Network Weights and Node Values

Hidden Layer # 1 (one row per hidden node; columns msrpamount, sold_mileage, inventory_1, inventory_3, inventory_4, auction_loc1, auction_loc3, auction_loc5)

Node # 1   -0.9924   -0.92508  -0.10909  -0.0869   0.407445  0.76707   -0.30193  -0.03286
Node # 2   -1.17377  -0.14221  0.324732  0.077698  0.732684  0.889584  -0.98856  -0.7258
Node # 3   1.681852  -1.30234  0.163891  -0.54717  0.596856  -0.03677  0.358691  -0.29186
Node # 4   -1.17747  0.028954  0.014453  1.440397  1.636011  -0.16739  -0.32554  0.075728
Node # 5   -0.55085  0.875159  0.23121   -0.08847  1.125314  -0.7746   0.251863  0.207554
Node # 6   0.136688  -0.06929  0.170625  -0.91841  0.241333  1.168893  0.112296  0.388781
Node # 7   -0.88426  0.079155  -0.33953  0.412718  -0.52903  0.672348  -0.75731  -0.96897
Node # 8   -1.48219  -2.02359  -0.20219  0.05322   -1.46616  0.148482  0.523352  -0.22174
Node # 9   -2.21439  -1.62425  -1.30926  -1.16685  -0.16121  0.99316   -0.46173  0.540713
Node # 10  0.599278  -0.53332  0.704159  -0.64314  -0.68084  -0.05174  0.871066  -0.86043
Node # 11  -0.89016  -0.6622   -0.03572  -0.17735  0.411346  -0.05405  -0.32155  0.445525
Node # 12  -0.66279  0.820028  -0.24676  1.047421  0.816548  0.99647   0.897403  -0.26335
Node # 13  0.437691  -1.13075  0.165526  0.975609  0.49331   -0.57517  -0.42918  -0.08316
Node # 14  -1.45984  2.243076  0.118265  -0.18112  0.219713  0.096708  0.295095  -0.98391
Node # 15  -0.15003  0.555863  -0.79     0.08103   -0.47099  -0.80839  1.021082  -0.43369
Node # 16  2.631468  -1.22108  -0.49396  0.265158  0.481147  -0.27958  -0.32635  -0.60925
Node # 17  0.842513  -1.26125  0.054213  1.31948   0.070621  0.370988  0.660617  0.639514
Node # 18  -1.1525   -1.37519  -0.12184  0.41435   1.376342  -0.5832   1.191347  0.247147
Node # 19  0.701494  0.533896  -0.05387  0.252024  -0.24887  0.371217  -0.10109  0.100897
Node # 20  -1.45107  -1.62176  0.240022  -0.07923  -0.05257  -0.15995  0.693413  -0.14446
Node # 21  0.9122    -2.06603  -0.10715  -0.23796  -0.80318  -0.96537  -0.79056  0.624025
Node # 22  2.33234   -1.53131  0.099541  0.52869   0.943848  -0.00983  1.240352  -0.10791
Node # 23  1.482287  -0.98135  0.487451  0.133631  0.572966  0.462336  -0.59029  0.171184
Node # 24  0.705824  -1.04552  -1.13635  -0.07596  0.169277  -0.86072  -0.03822  -1.39362
Node # 25  -0.70268  -1.40022  -0.15968  0.615309  0.270217  -0.84642  -0.41743  -0.13798

(continued: columns auction_loc8, auction_loc9, auction_loc10, c_model_key19, c_model_key25, c_model_key27, c_model_key28, carmsrp3)

Node # 1   -0.68719  -0.1695   -0.64015  0.409003  0.027659  0.436983  -0.32028  -0.02445
Node # 2   -0.01476  -0.31733  -0.30253  -0.97444  -0.83175  -1.61418  -0.79489  -1.24692
Node # 3   0.512693  0.732171  -0.20315  -0.25761  0.459377  -0.43849  0.833435  -0.7809
Node # 4   -1.03672  0.066675  -0.37611  0.561342  0.837288  1.831136  0.253165  -1.0448
Node # 5   0.192486  -0.70182  0.833364  0.206985  -0.16358  -0.08897  0.479617  0.04151
Node # 6   0.700279  -0.58293  0.43277   0.425912  -1.1256   -0.33346  0.019433  -0.10582
Node # 7   -0.53146  -0.32691  -0.55511  0.916883  -0.40672  -1.08508  0.748814  -0.48829
Node # 8   -0.36053  -0.16751  0.389014  0.72763   -0.05283  1.561751  -0.42445  -1.00985
Node # 9   -0.55497  -0.44734  -0.97941  0.394627  -0.17005  -0.26687  -0.22782  -0.28239
Node # 10  -0.88747  -0.58158  -0.28296  -0.98666  -0.36893  -0.24161  -0.88248  -0.54063
Node # 11  0.627369  0.237745  0.899651  -0.97675  0.292983  0.631681  -0.60964  -0.30602
Node # 12  -0.30476  -0.61264  -1.05206  -0.20284  0.198626  -0.31322  0.428692  -0.68373
Node # 13  0.863118  -0.54035  0.167506  0.192717  0.795492  -0.22119  -0.61908  -0.60201
Node # 14  0.336286  -0.19178  -0.63855  -0.03573  -0.54581  1.329112  0.339764  -0.6277
Node # 15  -0.80173  0.930607  -0.03662  1.528065  0.32384   -1.03271  -0.79569  -0.87627
Node # 16  0.61515   0.610025  0.585676  -0.87385  0.075709  -0.58927  -0.82107  -0.28899
Node # 17  0.08563   0.745317  0.860211  -0.21497  -0.76958  -0.40456  0.250641  0.999438
Node # 18  0.024398  -0.01478  1.220478  -0.19288  -0.25458  -0.80071  0.061673  -1.23444
Node # 19  0.09506   0.623617  -0.53474  -0.22707  -0.52724  1.032895  -0.89921  -0.08121
Node # 20  0.434759  0.679767  0.288873  -0.42937  -0.2202   0.330814  -0.97329  1.428648
Node # 21  -1.12235  -0.48783  -0.4254   0.131853  -0.88971  0.017354  -0.89144  0.213313
Node # 22  -0.40495  -0.67428  -0.21395  -0.30772  -0.26346  0.116944  0.332774  0.069648
Node # 23  0.160148  -0.66679  -0.4654   0.498713  -0.55199  -0.81764  0.224658  -0.56217
Node # 24  -1.12072  0.451109  -0.90296  -0.32911  0.477865  -0.73074  0.188448  -0.055
Node # 25  0.241846  -0.21691  -0.79459  -0.07977  -1.23787  -0.74746  -0.19797  -1.05825

(continued: columns cpiusedseason, week_sold1, week_sold3, week_sold10, week_sold11, week_sold14, week_sold21, week_sold22, week_sold23, week_sold24, Bias Node)

Node # 1   -1.06856  0.494343  -0.34477  0.536594  0.813539  -0.96874  -0.4093   -0.27045  -0.164    0.593274  -0.67055
Node # 2   -0.79364  0.310549  0.217381  -0.01888  -0.90315  -0.21823  0.697298  0.260858  -0.30171  -0.89301  0.190636
Node # 3   -0.19797  -0.73587  -1.09392  -0.21148  0.097434  -0.60608  0.218231  -0.57452  -0.41566  0.779051  -0.69324
Node # 4   -1.31937  -0.18955  -0.21545  0.509938  0.252948  -0.60545  -0.18481  -0.37123  -0.31644  0.339214  0.881749
Node # 5   -1.23701  -0.84133  1.092705  -0.13309  -0.16496  0.739499  -0.33448  0.934445  0.706528  -0.80172  1.048422
Node # 6   0.012575  0.927546  -0.38402  0.565068  1.127216  0.643576  0.036541  0.134394  0.839714  -0.47786  -0.52905
Node # 7   -0.1259   -0.73849  0.680244  0.516041  0.795527  0.566929  0.574377  0.388258  -0.52722  -0.1002   -1.14265
Node # 8   -0.09004  -0.38251  0.064623  -0.68686  0.987403  -0.36321  0.709633  0.293998  -0.08194  -0.29232  0.723154
Node # 9   -0.16078  0.594917  -0.34634  0.100551  0.862145  -0.30645  -0.72148  0.069513  -1.21487  0.621487  1.229463
Node # 10  0.35686   -0.45992  0.574555  0.15774   0.78205   0.029443  -0.37114  -0.54483  -0.82437  -0.45879  -0.55105
Node # 11  -0.05049  -0.65733  -0.23109  0.706275  0.0179    0.751328  -0.64764  -0.76806  0.283248  -0.20831  -0.91705
Node # 12  0.991489  0.649908  0.72425   -0.17619  -1.07845  0.053003  -0.62702  -0.52565  0.829766  -0.74425  -0.65573
Node # 13  -1.33509  0.159883  -0.40129  0.913664  -0.50971  -0.73716  0.631569  -0.80927  -0.23147  -0.02393  0.240345
Node # 14  -0.80704  0.753918  0.516593  0.213614  0.75346   -0.05771  -0.52011  0.775751  -0.05586  0.161895  -1.35774
Node # 15  -1.39803  -0.2068   -1.18577  0.121322  0.712316  -0.93368  -0.65725  0.406796  0.144823  0.443699  0.475491
Node # 16  -0.08599  0.86134   0.191324  -0.61797  0.19939   -0.29108  0.109975  -0.23358  1.081557  -0.54301  -0.81373
Node # 17  -1.13369  -0.18077  -0.67046  0.218585  -0.13868  0.034448  -0.44913  0.066341  0.2013    -0.94501  -0.25173
Node # 18  0.28419   0.141433  0.421484  0.159309  0.474259  1.039231  -1.07307  0.080598  0.556284  -0.60161  -0.32247
Node # 19  -1.28863  0.469951  0.259576  0.308309  -0.78201  -0.04043  0.341024  0.566419  0.067478  0.590179  0.365647
Node # 20  -0.05973  -0.6187   -0.81371  0.170875  -0.04853  0.06619   -0.06075  -0.39958  -0.25524  0.757934  0.747168
Node # 21  -0.28903  -1.18107  0.801829  -0.80847  -0.60603  0.195567  0.11167   1.007777  -0.96406  -0.55747  0.568838
Node # 22  -1.57976  -0.32695  0.316019  0.004535  1.074185  0.34285   -0.45953  0.173553  0.782916  0.284625  -0.99421
Node # 23  0.72044   -0.36463  -0.24892  0.422995  -0.01425  -0.37706  -0.73148  0.698259  0.602804  0.091654  -1.28872
Node # 24  -0.55409  0.790707  0.662302  0.656484  0.765029  0.375535  -0.50792  -0.68322  -0.32642  0.592788  -0.87042
Node # 25  -0.14981  0.315957  -0.08987  -0.27936  -0.91763  -0.83759  -0.96482  -0.49006  0.294802  -0.40715  -0.34972

Hidden Layer # 2 (one row per node; columns are hidden layer 1 nodes # 1-14)

Node # 1   0.641476  0.881   0.137   1.369   0.143   0.343   0.233   0.1213  -0.28   -0.77   0.108   0.072   0.133   0.662
Node # 2   0.326106  -0.22   0.262   -0.06   -0.4    0.172   -0.46   0.2271  -1.46   0.023   -0.51   0.209   -0.13   1.603
Node # 3   -0.44992  -0.92   1.99    -2.2    -2.14   -0.352  -1.19   -2.647  -2.88   1.084   -0.73   -0.35   0.386   -2.154
Node # 4   -1.09216  -1      -0.98   1.147   0.356   0.186   -0.92   0.5818  -1.22   -0.08   -0.31   -0.5    -0.87   0.092
Node # 5   -1.03697  -1.22   -0.17   0.337   0.624   0.059   0.543   0.3743  0.62    -1      0.656   -0.15   -0.32   1.384
Node # 6   -0.80484  -0.88   -0.89   0.267   -0.69   0.198   0.211   -0.958  0.174   0.553   -0.14   -1.38   -0.15   -0.908
Node # 7   -0.06145  0.394   -0.58   1.065   0.24    -0.401  0.517   0.927   -0.55   0.051   0.706   -0.42   0.507   0.297
Node # 8   0.676753  -0.55   0.64    0.814   0.612   0.313   0.025   1.0707  -0.01   0.469   -0.08   0.635   0.21    -0.554
Node # 9   -1.18214  -0.57   -0.16   -0.86   0.31    -0.057  0.345   -2.003  -1.39   -1.56   -0.77   -1.97   -1.39   -1.762
Node # 10  -0.91463  -1.51   0.084   -1.21   -0.26   0.683   -0.35   0.5599  0.637   0.1     0.403   -0.15   -1.73   0.183
Node # 11  0.163182  -1.23   -0.9    0.39    1.122   -0.66   -0.62   -0.774  -0.85   -0.01   0.542   0.054   0.433   0.185
Node # 12  -0.51169  -0.26   0.895   -1.43   -1.19   0.296   -0.61   -1.216  -0.67   -0.03   -0.33   -1.06   0.097   -0.526
Node # 13  -0.87908  -1.76   0.533   -1.07   0.26    -0.355  0.07    -1.297  0.041   -0.15   -0.22   -1.47   -0.12   0.05
Node # 14  0.216068  -0.61   -0.05   -1.96   -0.96   0.449   -0.13   -1.141  0.17    0.104   -0.78   -1.25   -0.91   -0.821
Node # 15  -0.86761  0.6     0.645   -1.44   -0.41   0.212   0.586   -0.287  0.05    -0.34   -0.84   0.244   0.638   -0.349
Node # 16  0.296629  -0.75   -0.3    -1.09   -0.51   0.027   0.116   -1.735  -0.61   -0.77   0.189   -0.86   -0.66   -1.777
Node # 17  -1.12164  -1.16   0.049   -0.41   0.584   -0.585  -0.11   0.2556  -0.31   -1.35   0.07    -0.15   -0.28   1.271
Node # 18  0.273078  -0.11   -0.51   0.383   0.135   0.568   -0.91   0.0316  -1.27   0.264   -0.79   1.166   -0.54   0.784
Node # 19  0.711892  -0.1    -0.68   -0.83   0.211   0.721   -0.43   -1.53   -0.78   0.248   -0.94   -0.71   -0.89   -1.593
Node # 20  -0.98573  -0.17   0.601   -0.06   -0.84   0.351   0.808   -0.932  -0.1    -0.96   -1.33   -1.25   0.685   0.028
Node # 21  -0.35658  -0.81   0.123   0.504   -0.33   1.672   0.563   0.1947  -0.95   -1.28   -0.05   -0.72   -0.6    1.674
Node # 22  0.499712  -0.26   -1.43   -0.08   0.294   0.051   0.315   -0.451  -0.34   0.712   -0.33   -0.01   -0.95   0.383
Node # 23  -0.47264  -0.24   -0.47   1.056   -0.5    0.254   -0.15   -1.144  -1.15   0.29    -0.29   0.6     0.291   -0.297
Node # 24  -0.01629  -1.23   -0.06   0.428   0.715   0.007   -0.2    -3.072  -0.82   -0.02   -0.83   1.347   -0.67   1.297
Node # 25  -0.57772  0.492   -0.59   0.174   0.697   -0.34   -0.39   -0.881  -1.35   -1.07   0.184   0.3     -0.52   0.883

(continued: columns are hidden layer 1 nodes # 15-25 and the bias node)

Node # 1   1.244763  0.02912   -0.1031   1.157162  0.450834  0.194279  -0.61398  -0.57911  -0.53875  -0.83773  0.837261  1.696316
Node # 2   0.273078  0.10875   -1.266    -0.99166  0.808083  0.584741  0.190806  -1.03974  0.701266  -0.33424  -0.92276  -0.6153
Node # 3   -1.07216  2.565761  0.201533  -2.22554  1.229969  -1.86793  0.688202  2.477933  1.516446  1.120388  -1.79749  -1.18336
Node # 4   -0.35629  0.675587  -0.60258  0.090882  0.886545  0.012681  -0.29487  -1.2136   -0.72036  -0.30958  -0.78495  -0.46509
Node # 5   -0.89842  0.508858  -0.85685  0.323749  -0.22631  -0.61501  -0.79171  -0.08494  -0.7582   -0.46453  -1.14866  -0.60771
Node # 6   0.572054  -0.95611  -0.3928   0.587011  -1.26224  -0.60551  0.224318  -0.56344  -0.41754  -0.38871  0.479654  0.419889
Node # 7   -0.47751  0.015747  -0.56429  -1.23699  0.400402  0.159208  -0.38083  -0.65296  -0.94126  -0.66863  -1.26128  1.107952
Node # 8   1.349779  -0.84557  0.48411   -1.6511   0.041749  -0.3498   -0.20354  -0.70874  -0.77405  -0.95613  -0.95117  0.815853
Node # 9   -0.32766  0.844564  1.710168  -0.10044  -0.27813  -1.20942  0.079218  1.68803   1.307386  -0.62585  -0.32097  -0.61407
Node # 10  -0.19574  -0.63468  0.10598   -0.11142  1.104343  1.098824  -0.14116  -0.41047  -0.10022  -0.16065  -0.37257  -0.75687
Node # 11  -0.36492  0.342522  -0.45905  -0.4999   -0.94589  0.189367  -1.21236  -1.21605  -0.24188  -0.18036  -0.4876   0.100611
Node # 12  -0.33247  1.16239   -0.20059  -0.99515  -0.72356  -0.17477  0.063635  0.090606  1.083367  -0.33553  0.486218  -0.85417
Node # 13  -1.02061  0.446933  0.67339   -0.51139  0.231413  0.117244  -0.0942   0.698364  0.725336  0.319327  -0.25956  -2.22817
Node # 14  -0.90352  -1.12433  0.614415  -0.4327   -1.62498  0.178787  0.35591   0.220546  0.433427  -0.19372  0.123669  -0.96407
Node # 15  -1.22952  -0.02288  -1.26346  -0.07106  -0.73255  -0.62538  -0.5453   -0.0214   0.109355  0.558338  0.974134  -0.57895
Node # 16  -1.28699  0.401999  0.683317  0.193033  0.423317  -1.41028  0.214163  1.475142  0.645822  0.560228  0.308556  -1.36346
Node # 17  -0.03543  0.57141   -0.62931  -0.73949  0.088486  -0.85081  -1.20375  0.408516  -0.46815  -0.29167  -0.37848  0.072981
Node # 18  0.956775  -0.31538  -0.6183   -0.78966  1.336611  0.666733  -0.13133  -0.77713  -0.10526  -0.14157  -0.94197  -0.57089
Node # 19  -0.69907  -0.22111  0.108956  -0.09149  -0.9328   1.29771   -0.10924  0.211597  -0.09594  0.733091  -0.17564  -0.40802
Node # 20  1.008849  1.0713    0.674911  0.146671  -0.52633  0.320565  0.477679  0.18593   0.323665  0.752229  0.151594  -0.75267
Node # 21  0.061606  0.833173  -1.71056  -1.16896  -0.96654  -1.70131  -0.91769  -0.57042  1.420147  -0.22581  -0.21925  -0.89876
Node # 22  0.495311  -0.83239  -0.08449  -0.45795  -0.49285  -0.75871  -0.71121  -0.16968  0.165758  -0.58555  0.025711  -0.47283
Node # 23  0.970387  -0.14237  0.347591  -0.91277  -0.32448  -0.87333  -0.81388  0.155109  0.304517  -1.33     -0.50764  -0.52286
Node # 24  0.368653  -0.5301   -0.00858  -0.14667  0.779272  -0.55311  -1.35974  -0.20032  -0.46016  -0.86184  -0.17088  -0.47223
Node # 25  0.724866  -0.27228  -1.17552  -1.03725  -0.33357  -1.72588  -1.37698  0.868067  0.155631  -0.81155  -0.19297  -0.04056

Output Layer (columns are hidden layer 2 nodes # 1-14)

Output Node  0.74122  -1.133  2.579  -1.087  -1.057  -0.03  -0.321  -0.547  1.277  -1.11  -0.71  0.3171  0.061  0.2974

(continued: columns are hidden layer 2 nodes # 15-25 and the bias node)

Output Node  0.441386  0.78956  -0.73704  -0.41418  -0.70685  0.810797  -1.04499  -0.81083  0.134049  -1.04509  -0.47947  0.76785

8) Works Cited

Bruce, Peter C., Patel, Nitin R., Shmueli, Galit. Data Mining in Excel: Lecture Notes and Cases. Arlington, VA: Resampling Stats, Inc., 2005.

Granger, C. and Ramanathan, R. "Improved Methods of Combining Forecasts." Journal of Forecasting 3 (1984): 197-204.

Hand, D., Mannila, H., Smyth, P. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.

SAS Technologies/Analytics. 2006. SAS Institute Inc. 17 May 2006. <http://www.sas.com/technologies/analytics/datamining/miner/semma.html>.

U.S. Department of Labor, Bureau of Labor Statistics, Consumer Price Indices. 2006. Division of Consumer Prices and Price Indices. 17 May 2006. <http://www.bls.gov/cpi/home.htm#data>.

Wooldridge, Jeffrey. Introductory Econometrics: A Modern Approach. Australia: Thomson South-Western, 2003.